System Design Library

Object Storage (Amazon S3)

Store exabytes of objects with 11 nines of durability and high availability.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • put/get/delete objects
  • Buckets & keys
  • Versioning
  • Lifecycle/tiering

Non-functional

  • 11 nines durability
  • High availability
  • Cheap at scale

Scale

Exabytes; trillions of objects

The approach

Split objects into chunks, erasure-code them across many disks/AZs (survive failures with less overhead than full replication), track placement in a metadata service; background scrubbing repairs bit-rot; tiered storage for cold data.

Key components

API → metadata service → storage nodes (erasure-coded) · repair/scrubber

Numbers that matter

Senior deep-dive

Durability is an erasure coding problem, not a replication problem: 11 nines (99.999999999%) is achieved by striping objects across many drives/AZs with erasure codes — far more storage-efficient than 3× replication.

The metadata tier is the hard part: billions of objects means the index of where each object's chunks live must itself be distributed, replicated, and fast — S3's internal namespace service is a massive distributed B-tree under the hood.

Eventual consistency in S3 was real until 2020: strong read-after-write consistency was only added then — before that, a PUT followed immediately by a GET could return 404, a footgun that broke countless distributed workflows.

Erasure coding beats replication past a crossover point

3× replication gives 200% storage overhead but simple recovery (copy from a live replica). Erasure coding (e.g., 14+6) gives ~43% overhead and survives more simultaneous failures, but reconstruction requires reading k shards and XOR-combining them — read amplification on recovery. S3 uses replication for small/hot objects (fast recovery matters more) and erasure coding for large cold objects (storage cost dominates). The crossover decision is roughly: replicate when object is small or frequently accessed; EC when object is large or cold.

Namespace service: the hidden bottleneck

Every S3 GET/PUT/DELETE requires a metadata lookup: given a bucket+key, find the chunk locations. This lookup must be low-latency (~milliseconds) and highly available. AWS partitions the keyspace (bucket+prefix hash) across a fleet of metadata nodes. Before 2018, sequential key names (timestamps, sequential IDs) caused all writes to land on the same metadata shard — the reason AWS recommended random prefixes. Today the sharding is auto-managed, but a single bucket with a trillion objects still strains the metadata layer and can cause request-rate throttling per prefix.

Multipart upload: durability and throughput together

For objects over 100 MB in practice (required over 5 GB), multipart upload lets you upload parts in parallel from multiple threads/machines, each independently checksummed. A failed part retries without restarting the whole upload. The key operational gotcha: incomplete multipart uploads accumulate and incur storage charges — every production S3 bucket needs a lifecycle rule to abort incomplete MPUs after N days. The CompleteMultipartUpload call is atomic from the client's perspective but internally assembles an object manifest linking the parts.

Strong consistency: what changed in 2020

Pre-2020, S3 offered read-after-write consistency for new PUTs but only eventual consistency for overwrites and DELETEs — a GET immediately after a DELETE could return the old object. This broke distributed systems that used S3 as a coordination primitive. AWS retrofitted strong consistency by making the metadata lookup go through a strongly-consistent metadata service (not just a cached layer). The lesson: don't assume object stores are eventually consistent today, but verify per-provider — MinIO and older deployments may still have weaker guarantees.

Durability vs availability are different axes

S3 Standard: 11 nines durability, 99.99% availability. Durability means data is not lost; availability means it's accessible. S3 One Zone-IA drops to ~99.5% availability and loses the cross-AZ durability guarantee — one AZ outage = your data is gone. S3 Glacier trades availability (hours to restore) for cost. Designing around S3 means choosing on these axes deliberately: a media asset CDN origin can tolerate minutes of S3 unavailability (CDN serves stale); a primary database backup cannot tolerate even ~0.01% annual durability loss.

What breaks at scale

Request rate throttling (HTTP 503 SlowDown) is the most common scale failure — S3 rate-limits per prefix, so a single hot prefix (e.g., all writes to `logs/2024-01-01/`) hits the ceiling fast. The fix is prefix randomization or spreading writes across prefixes. Large-object cold reads trigger erasure code reconstruction across many nodes — at exabyte scale, a single batch job doing full-table scans can saturate reconstruction capacity. Finally, cross-region replication lag is real: S3 CRR has a P99 replication time of minutes, not milliseconds — systems expecting immediate cross-region consistency after a PUT will read stale data.

In production

Amazon S3 pioneered object storage and every cloud-native architecture uses it or its clones (GCS, Azure Blob, MinIO). The real engineering challenge is the metadata layer: at exabyte scale, the mapping from object key → chunk locations cannot live in a single database — AWS runs a fleet of partitioned metadata services (sometimes described as a massive distributed hash table layered over hierarchical namespaces) with aggressive caching. Multipart upload is how large object durability is guaranteed under unreliable networks: each part is checksummed independently, uploaded in parallel (dramatically improving throughput for large objects), and the final CompleteMultipartUpload call atomically commits the assembled object.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →