Object Storage (Amazon S3)
Store exabytes of objects with 11 nines of durability and high availability.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- put/get/delete objects
- Buckets & keys
- Versioning
- Lifecycle/tiering
Non-functional
- 11 nines durability
- High availability
- Cheap at scale
Scale
Exabytes; trillions of objects
The approach
Split objects into chunks, erasure-code them across many disks/AZs (survive failures with less overhead than full replication), track placement in a metadata service; background scrubbing repairs bit-rot; tiered storage for cold data.
Key components
API → metadata service → storage nodes (erasure-coded) · repair/scrubber
Numbers that matter
- S3 stores over 350 trillion objects (as of 2023 public disclosures) and serves over 100 million requests per second at peak across all buckets globally.
- Erasure coding at S3's scale uses schemes like Reed-Solomon (14+6) — splitting an object into 14 data shards + 6 parity shards, tolerating any 6 simultaneous failures while using ~1.43× storage overhead vs 3× for full replication.
- S3 single-object PUT limit is 5 GB; above that, multipart upload is mandatory — the minimum part size is 5 MB (except the last part), and a single object can be up to 5 TB via up to 10,000 parts.
- S3 prefix-based request rate scales to 3,500 PUT/COPY/POST/DELETE/s and 5,500 GET/HEAD/s per prefix — before 2018's key-name randomization guidance was moot because of hash-based prefix sharding.
Senior deep-dive
Durability is an erasure coding problem, not a replication problem: 11 nines (99.999999999%) is achieved by striping objects across many drives/AZs with erasure codes — far more storage-efficient than 3× replication.
The metadata tier is the hard part: billions of objects means the index of where each object's chunks live must itself be distributed, replicated, and fast — S3's internal namespace service is a massive distributed B-tree under the hood.
Eventual consistency in S3 was real until 2020: strong read-after-write consistency was only added then — before that, a PUT followed immediately by a GET could return 404, a footgun that broke countless distributed workflows.
Erasure coding beats replication past a crossover point
3× replication gives 200% storage overhead but simple recovery (copy from a live replica). Erasure coding (e.g., 14+6) gives ~43% overhead and survives more simultaneous failures, but reconstruction requires reading k shards and XOR-combining them — read amplification on recovery. S3 uses replication for small/hot objects (fast recovery matters more) and erasure coding for large cold objects (storage cost dominates). The crossover decision is roughly: replicate when object is small or frequently accessed; EC when object is large or cold.
Namespace service: the hidden bottleneck
Every S3 GET/PUT/DELETE requires a metadata lookup: given a bucket+key, find the chunk locations. This lookup must be low-latency (~milliseconds) and highly available. AWS partitions the keyspace (bucket+prefix hash) across a fleet of metadata nodes. Before 2018, sequential key names (timestamps, sequential IDs) caused all writes to land on the same metadata shard — the reason AWS recommended random prefixes. Today the sharding is auto-managed, but a single bucket with a trillion objects still strains the metadata layer and can cause request-rate throttling per prefix.
Multipart upload: durability and throughput together
For objects over 100 MB in practice (required over 5 GB), multipart upload lets you upload parts in parallel from multiple threads/machines, each independently checksummed. A failed part retries without restarting the whole upload. The key operational gotcha: incomplete multipart uploads accumulate and incur storage charges — every production S3 bucket needs a lifecycle rule to abort incomplete MPUs after N days. The CompleteMultipartUpload call is atomic from the client's perspective but internally assembles an object manifest linking the parts.
Strong consistency: what changed in 2020
Pre-2020, S3 offered read-after-write consistency for new PUTs but only eventual consistency for overwrites and DELETEs — a GET immediately after a DELETE could return the old object. This broke distributed systems that used S3 as a coordination primitive. AWS retrofitted strong consistency by making the metadata lookup go through a strongly-consistent metadata service (not just a cached layer). The lesson: don't assume object stores are eventually consistent today, but verify per-provider — MinIO and older deployments may still have weaker guarantees.
Durability vs availability are different axes
S3 Standard: 11 nines durability, 99.99% availability. Durability means data is not lost; availability means it's accessible. S3 One Zone-IA drops to ~99.5% availability and loses the cross-AZ durability guarantee — one AZ outage = your data is gone. S3 Glacier trades availability (hours to restore) for cost. Designing around S3 means choosing on these axes deliberately: a media asset CDN origin can tolerate minutes of S3 unavailability (CDN serves stale); a primary database backup cannot tolerate even ~0.01% annual durability loss.
What breaks at scale
Request rate throttling (HTTP 503 SlowDown) is the most common scale failure — S3 rate-limits per prefix, so a single hot prefix (e.g., all writes to `logs/2024-01-01/`) hits the ceiling fast. The fix is prefix randomization or spreading writes across prefixes. Large-object cold reads trigger erasure code reconstruction across many nodes — at exabyte scale, a single batch job doing full-table scans can saturate reconstruction capacity. Finally, cross-region replication lag is real: S3 CRR has a P99 replication time of minutes, not milliseconds — systems expecting immediate cross-region consistency after a PUT will read stale data.
In production
Amazon S3 pioneered object storage and every cloud-native architecture uses it or its clones (GCS, Azure Blob, MinIO). The real engineering challenge is the metadata layer: at exabyte scale, the mapping from object key → chunk locations cannot live in a single database — AWS runs a fleet of partitioned metadata services (sometimes described as a massive distributed hash table layered over hierarchical namespaces) with aggressive caching. Multipart upload is how large object durability is guaranteed under unreliable networks: each part is checksummed independently, uploaded in parallel (dramatically improving throughput for large objects), and the final CompleteMultipartUpload call atomically commits the assembled object.
Common mistakes
- Full 3× replication (too costly at exabyte scale)
- No background integrity scrubbing
- Single metadata DB (must be sharded)