Pastebin
Store and serve text/code snippets by short URL, some large, some private.
Requirements
Functional
- Create paste (text, expiry, visibility)
- Retrieve by key
- Syntax/raw view
Non-functional
- Durable storage
- Fast reads
- Large pastes supported
Scale
Read-heavy; pastes up to a few MB
The approach
Metadata in a DB (key, owner, expiry, size); paste body in object storage (S3) once it exceeds an inline threshold; cache hot pastes; CDN for popular raw content.
Key components
App → Metadata DB + Object store · Cache · CDN
Numbers that matter
- ~10KB is the typical inline/offload threshold — pastes below stay in the DB row, above go to object storage to avoid row bloat
- A single S3 GET for a cached paste URL completes in ~20-30ms P50 from the same region; CDN hit is <5ms for popular public pastes
- Object storage costs ~$0.023/GB/month vs. SSD-backed DB storage at ~$0.10-0.25/GB/month, a roughly 4-10× cost gap that compounds fast at millions of large pastes
- Expiry sweepers that DELETE by timestamp should batch no more than ~1K rows per tick at ~1-second intervals to avoid locking hotspots on the expiry index
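The cost gap is easier to judge with a worked number. The sketch below uses only the per-GB prices quoted in the list above; the workload (10 million pastes averaging 500 KB) and the function name are hypothetical, chosen to make the arithmetic visible.

```python
# Illustrative prices in USD per GB-month, copied from the figures quoted above.
OBJECT_STORE_PER_GB = 0.023
DB_SSD_PER_GB_LOW = 0.10
DB_SSD_PER_GB_HIGH = 0.25


def monthly_cost(num_pastes: int, avg_bytes: int, price_per_gb: float) -> float:
    """Monthly storage cost in USD for num_pastes bodies of avg_bytes each."""
    total_gb = num_pastes * avg_bytes / 1e9
    return total_gb * price_per_gb


# Hypothetical workload: 10 million pastes averaging 500 KB each (5,000 GB total).
NUM_PASTES = 10_000_000
AVG_BYTES = 500_000

object_store_cost = monthly_cost(NUM_PASTES, AVG_BYTES, OBJECT_STORE_PER_GB)
db_cost_low = monthly_cost(NUM_PASTES, AVG_BYTES, DB_SSD_PER_GB_LOW)
db_cost_high = monthly_cost(NUM_PASTES, AVG_BYTES, DB_SSD_PER_GB_HIGH)

# Ratio of DB cost to object-store cost at each end of the quoted price range.
ratio_low = db_cost_low / object_store_cost
ratio_high = db_cost_high / object_store_cost
```

Under these assumptions the bodies cost about $115/month in object storage against about $500 to $1,250/month in the database, a ratio of roughly 4× to 11×.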
Senior deep-dive
Object storage is the only sane choice for large pastes: inlining blobs in a relational row overflows DB pages and degrades index and scan performance once bodies exceed a few KB.
Tiered storage (inline ≤ 10KB → DB, > 10KB → S3) cuts the storage cost of large bodies several-fold (roughly 4-10× at the per-GB prices above) but adds a second hop to the read path for offloaded pastes; cache hot paste IDs → S3 URLs to keep p99 reads fast.
Expiry is the sleeper problem — a naive DELETE-by-expiry cron hammers the DB at scale; use a TTL index (DynamoDB) or background sweeper with a cursor so deletion is streaming, not a thundering herd.
Inline vs. offload: pick a threshold and commit
The 10KB inline threshold is not magic — it's where a DB page starts losing most of its capacity to a single blob. Below the threshold, keeping the content in the metadata row eliminates an extra network hop on every read. Above it, you pay that hop once in exchange for avoiding row-size explosions that degrade index scan performance across all pastes, not just the big ones. Tune the threshold based on your median paste size distribution.
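A minimal sketch of the inline/offload decision, assuming a 10 KB cutoff. The `PasteRow` shape, the `pastes/<key>` object naming, and the `fetch_object` callback are illustrative choices, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed cutoff; tune it against your own paste size distribution.
INLINE_THRESHOLD_BYTES = 10 * 1024


@dataclass
class PasteRow:
    key: str
    size: int
    inline_body: Optional[bytes]  # set when the body lives in the DB row
    object_key: Optional[str]     # set when the body lives in object storage


def plan_storage(key: str, body: bytes) -> PasteRow:
    """Decide where a new paste body goes and build its metadata row."""
    if len(body) <= INLINE_THRESHOLD_BYTES:
        return PasteRow(key, len(body), inline_body=body, object_key=None)
    # Hypothetical object naming scheme; the caller uploads the body under it.
    return PasteRow(key, len(body), inline_body=None, object_key=f"pastes/{key}")


def read_body(row: PasteRow, fetch_object: Callable[[str], bytes]) -> bytes:
    """Return the body, paying the extra hop only for offloaded pastes."""
    if row.inline_body is not None:
        return row.inline_body
    return fetch_object(row.object_key)
```

Small pastes are served straight from the metadata row; only pastes above the threshold trigger the `fetch_object` call, which is the hop a cache in front of object storage would absorb.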
Short-URL generation mirrors TinyURL exactly
You can hash the content (SHA-256, base62-encoded and truncated to 8 characters) or generate a random or sequential key. Content hashing gives free deduplication (identical pastes share a URL), but collision handling adds a read-before-write. Random keys avoid that but give up the dedup opportunity. Random 8-char base62 gives 62^8 ≈ 218 trillion slots, so any single new key almost never collides. By the birthday bound, though, some collision somewhere becomes likely once you hold on the order of 62^4 ≈ 15 million keys, so keep a unique constraint on the key and retry on conflict.
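A sketch of the random-key option, assuming 8-character base62 keys. The in-memory `existing` set stands in for a unique constraint in the metadata DB, and `expected_collisions` uses the standard birthday approximation to show why a retry path is still worth having.

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62 symbols: 0-9, a-z, A-Z
KEY_LENGTH = 8


def random_key(length: int = KEY_LENGTH) -> str:
    """Generate a random base62 key from a cryptographically strong source."""
    return "".join(secrets.choice(BASE62) for _ in range(length))


def allocate_key(existing: set, max_attempts: int = 5) -> str:
    """Pick an unused key, retrying on the rare collision.

    `existing` is a stand-in for a unique index; in a real system the
    insert itself would fail on conflict and trigger the retry.
    """
    for _ in range(max_attempts):
        key = random_key()
        if key not in existing:
            existing.add(key)
            return key
    raise RuntimeError("could not allocate a unique key")


def expected_collisions(n: int, slots: int = 62 ** KEY_LENGTH) -> float:
    """Birthday approximation: expected number of colliding pairs among n keys."""
    return n * (n - 1) / (2 * slots)
```

With one million keys the expected number of colliding pairs is far below one; with one hundred million it is in the tens, which is why the uniqueness check stays even though the key space looks enormous.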
Access control without a session per request
Private pastes need a secret token embedded in the URL (e.g. `/p/<id>/<secret>`), not a login wall. The server checks that the secret is an HMAC of the paste ID computed with a server-side key, which needs no extra DB lookup. Never store the secret separately; derive it on the fly. Unguessable share links in products such as Google Docs rely on the same capability-URL idea, and the scheme resists enumeration because a valid secret cannot be forged without the server key, even if paste IDs are guessed.
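The derive-and-verify step can be sketched with Python's standard `hmac` module. The key value, the 32-hex-character truncation, and the function names are assumptions for illustration; a real deployment would load the key from a secret manager.

```python
import hashlib
import hmac

# Hypothetical placeholder; never hard-code a real key.
SERVER_KEY = b"replace-with-a-real-secret"


def derive_secret(paste_id: str, key: bytes = SERVER_KEY) -> str:
    """Derive the URL secret for a paste as HMAC-SHA256(key, paste_id)."""
    digest = hmac.new(key, paste_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:32]  # keep 128 bits of the tag; the length is an assumption


def verify_secret(paste_id: str, presented: str, key: bytes = SERVER_KEY) -> bool:
    """Check a presented secret without any DB lookup, in constant time."""
    expected = derive_secret(paste_id, key)
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))
```

One trade-off of deriving rather than storing: a single leaked link cannot be revoked on its own. Rotating the server key invalidates every link, so per-paste revocation would need extra state such as a version number mixed into the HMAC input.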
CDN invalidation vs. cache-by-TTL
Public pastes are read-heavy and immutable after creation, making them perfect CDN candidates. Set a long `Cache-Control: max-age` (hours to days) on raw content. The trap: after a paste is deleted or expires, CDN nodes keep serving the cached copy until its TTL runs out. Use versioned URLs (`/raw/<id>/v1`) or a short CDN TTL (5–15 min) for pastes with any mutability, and explicit purge calls for deletions.
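One way to keep a cached copy from outliving its paste is to cap `max-age` at the paste's remaining lifetime. The sketch below is an illustrative policy, not a standard: the one-day default, the function name, and the choice of `no-store` for private pastes are all assumptions.

```python
from typing import Dict, Optional

DEFAULT_MAX_AGE = 86_400  # one day in seconds; an assumed default


def cache_headers(visibility: str, seconds_until_expiry: Optional[int]) -> Dict[str, str]:
    """Choose a Cache-Control header for a raw paste response."""
    if visibility != "public":
        # Private pastes should not sit in shared caches at all.
        return {"Cache-Control": "private, no-store"}
    if seconds_until_expiry is None:
        # Never-expiring public paste: cache for the full default window.
        return {"Cache-Control": f"public, max-age={DEFAULT_MAX_AGE}, immutable"}
    if seconds_until_expiry <= 0:
        # Already expired: do not let any cache keep a copy.
        return {"Cache-Control": "no-store"}
    # Cap the cache lifetime so the CDN copy dies no later than the paste.
    max_age = min(DEFAULT_MAX_AGE, seconds_until_expiry)
    return {"Cache-Control": f"public, max-age={max_age}"}
```

This only covers scheduled expiry. A paste deleted early by its owner still needs an explicit purge call, since no header set at creation time can anticipate it.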
Syntax highlighting is a client-side problem
Server-side syntax highlighting with Pygments or Rouge is tempting but burns CPU on every uncached request. Push highlighting to the browser using highlight.js or Prism — ship the raw text and let the client do the work. Reserve server-side rendering for SEO-critical pastes (public, no auth) where you need the highlighted HTML in the initial response for crawlers. Cache the rendered HTML in Redis with a content-hash key.
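A sketch of caching rendered HTML under a content-hash key. A plain dict stands in for Redis, and `escape_only` is a placeholder renderer (it only HTML-escapes the text) where a real highlighter would go; the `hl:` key prefix is an arbitrary naming choice.

```python
import hashlib
import html
from typing import Callable, Dict


def render_key(body: str, language: str) -> str:
    """Cache key derived from the content, so identical pastes share one entry."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"hl:{language}:{digest}"


def get_highlighted(
    body: str,
    language: str,
    cache: Dict[str, str],
    render: Callable[[str, str], str],
) -> str:
    """Return rendered HTML, invoking the renderer only on a cache miss."""
    key = render_key(body, language)
    cached = cache.get(key)
    if cached is None:
        cached = render(body, language)
        cache[key] = cached
    return cached


def escape_only(body: str, language: str) -> str:
    """Placeholder renderer: escapes the text without real highlighting."""
    return f'<pre class="lang-{language}">{html.escape(body)}</pre>'
```

Because the key depends on the content and language rather than the paste ID, the entry never goes stale and two pastes with the same text share a single render.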
What breaks at scale
Hot pastes go viral: a single popular paste can spike to millions of reads/hour, overwhelming a DB-backed read path. The fix is a two-layer cache (in-process LRU for the top 100 pastes + Redis for the top 10K) with the CDN as L0. Expiry at high write volume is the other cliff: millions of pastes expiring simultaneously create a DELETE storm, so bucket expiry times to the nearest hour and spread sweepers across the bucket range. Finally, syntax detection on upload is surprisingly expensive for large files; cap file size at 10MB and run detection async.
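The two-layer cache can be sketched as a small in-process LRU in front of a shared cache. Here a dict stands in for Redis and `load` stands in for the DB or object-store read; the class and parameter names are hypothetical, and eviction from the shared tier is left out.

```python
from collections import OrderedDict
from typing import Callable, MutableMapping


class TwoTierCache:
    """In-process LRU (L1) in front of a shared cache (L2), then the origin."""

    def __init__(
        self,
        l1_capacity: int,
        l2: MutableMapping[str, str],
        load: Callable[[str], str],
    ) -> None:
        self.l1: "OrderedDict[str, str]" = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2 = l2      # stand-in for Redis
        self.load = load  # stand-in for the DB / object-store read

    def get(self, key: str) -> str:
        if key in self.l1:
            self.l1.move_to_end(key)  # mark as most recently used
            return self.l1[key]
        value = self.l2.get(key)
        if value is None:
            value = self.load(key)    # miss in both tiers: hit the origin once
            self.l2[key] = value
        self._remember(key, value)
        return value

    def _remember(self, key: str, value: str) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict the least recently used entry
```

A key evicted from the small L1 is still served from L2, so the origin sees each hot paste roughly once per L2 lifetime rather than once per app-server eviction.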
In production
GitHub Gist stores each gist as a Git repository (giving you full revision history for free), while Pastebin.com uses a simple key→blob model with MySQL + file storage. The real engineering challenge is expiry at scale: a naive `DELETE WHERE expires_at < NOW()` on a 500M-row table with a non-selective index turns into a full scan. Systems at this scale often move expiry logic to a Kafka-backed cleanup pipeline that processes expired IDs in order, decoupling deletion throughput from read latency.
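Short of a full pipeline, the same decoupling idea can be shown with a batched sweeper. This sketch uses SQLite so it runs anywhere; the `pastes(key, expires_at)` schema, the batch size, and the pause are assumptions. Selecting the oldest expired rows first acts as an implicit cursor, since deleted rows drop out of the next query.

```python
import sqlite3
import time

BATCH_SIZE = 1000  # assumed batch size per tick

# Hypothetical schema: expires_at is a Unix timestamp, NULL means "never".
SCHEMA = """
CREATE TABLE IF NOT EXISTS pastes (key TEXT PRIMARY KEY, expires_at INTEGER);
CREATE INDEX IF NOT EXISTS idx_pastes_expires_at ON pastes (expires_at);
"""


def sweep_once(conn: sqlite3.Connection, now: int, batch_size: int = BATCH_SIZE) -> int:
    """Delete up to batch_size expired rows, oldest first; return the count."""
    rows = conn.execute(
        "SELECT key FROM pastes "
        "WHERE expires_at IS NOT NULL AND expires_at < ? "
        "ORDER BY expires_at LIMIT ?",
        (now, batch_size),
    ).fetchall()
    if not rows:
        return 0
    # Delete by primary key so each batch touches a bounded set of rows.
    conn.executemany("DELETE FROM pastes WHERE key = ?", rows)
    conn.commit()
    return len(rows)


def sweep_all(
    conn: sqlite3.Connection,
    now: int,
    pause_seconds: float = 1.0,
    batch_size: int = BATCH_SIZE,
) -> int:
    """Drain expired rows in batches, pausing between ticks to spread the load."""
    total = 0
    while True:
        deleted = sweep_once(conn, now, batch_size)
        total += deleted
        if deleted < batch_size:
            return total
        time.sleep(pause_seconds)
```

Each tick holds locks only for one bounded batch, so deletion throughput can be tuned through `batch_size` and `pause_seconds` without changing the read path.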
Common mistakes
- Storing large bodies in the DB
- No expiry/garbage collection
- Treating private pastes as just "unlisted" (need real auth)