System Design Library

Link Preview / Unfurl

Generate rich previews (title/image) for pasted URLs, safely and fast.

Requirements

Functional

  • Fetch & parse Open Graph
  • Cache previews
  • Handle slow/dead URLs
  • SSRF protection

Non-functional

  • Fast (cached)
  • Safe fetching

Scale

Many links per second

The approach

The first time a URL is seen, fetch it server-side, parse its OG/meta tags, and store the preview; every later request is served from cache. Fetches run asynchronously with timeouts, and internal IPs are blocked to prevent SSRF.

Key components

App → preview cache → (miss) async fetcher → parser

Senior deep-dive

SSRF is the primary security threat: your unfurl service fetches arbitrary URLs supplied by users, and without destination-IP filtering, attackers can use it to probe internal services (the AWS metadata endpoint at 169.254.169.254, internal APIs).

Cache aggressively on the first fetch — the same URL is pasted repeatedly; without a cache you re-fetch the same page thousands of times and the target server blocks your IP as a scraper.

Async fetch, sync serve: trigger the fetch on URL paste, store the preview, and serve from cache — making the user wait for a live fetch on every paste is unacceptable latency.
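As a rough illustration of the "async fetch, sync serve" flow, the sketch below keeps previews in an in-memory dict and queues a background fetch on a miss. The class name `PreviewCache`, the `pending` queue, the `NO_PREVIEW` sentinel, and the one-hour negative TTL are all assumptions made for this example; a real deployment would more likely use a shared cache and a job queue.

```python
import time
from collections import deque

PREVIEW_TTL_S = 24 * 3600   # 24 hours, the default TTL suggested in this article
NEGATIVE_TTL_S = 3600       # assumption: retry "no preview" URLs after one hour
NO_PREVIEW = object()       # sentinel stored when a URL yields no usable preview


class PreviewCache:
    """Hypothetical in-memory cache: serve hits synchronously, queue misses."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}        # url -> (preview, expires_at)
        self._in_flight = set()   # urls already queued, to avoid duplicate fetches
        self.pending = deque()    # urls waiting for a background worker

    def lookup(self, url):
        """Return the cached preview (or NO_PREVIEW); on a miss, queue a fetch
        and return None so the caller can render the message without a card."""
        entry = self._entries.get(url)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        if url not in self._in_flight:
            self._in_flight.add(url)
            self.pending.append(url)
        return None

    def store(self, url, preview):
        """Called by the background worker once the fetch has finished."""
        ttl = NEGATIVE_TTL_S if preview is NO_PREVIEW else PREVIEW_TTL_S
        self._entries[url] = (preview, self._clock() + ttl)
        self._in_flight.discard(url)
```

A worker would pop URLs from `pending`, fetch and parse them, and call `store`; the request path only ever calls `lookup`, so it never waits on a remote server.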

SSRF prevention: the mandatory first design decision

Before any fetch, resolve the URL's hostname and validate every resolved IP against a blocklist: private RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8), link-local (169.254.0.0/16, which contains the AWS metadata endpoint), and their IPv6 equivalents. Connect to the IP you validated rather than resolving the hostname a second time; otherwise a DNS record that changes between the check and the fetch (DNS rebinding) bypasses the filter. Redirects need re-validation at each hop, because an initial URL can resolve to a public IP and then redirect to an internal one. As defense in depth, run unfurl workers in a network-isolated VPC with no route to internal services.
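The blocklist check can be sketched with Python's standard `ipaddress` module. The function names `is_public_ip` and `resolve_public_ips` are hypothetical, and the policy shown (reject a hostname if any of its resolved addresses is non-public) is one reasonable choice under these assumptions, not the only one.

```python
import ipaddress
import socket


def is_public_ip(addr: str) -> bool:
    """Return True only for addresses outside the blocked ranges."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    blocked = (
        ip.is_private         # RFC 1918 ranges and IPv6 unique-local
        or ip.is_loopback     # 127.0.0.0/8, ::1
        or ip.is_link_local   # 169.254.0.0/16 (cloud metadata), fe80::/10
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified  # 0.0.0.0, ::
    )
    return not blocked


def resolve_public_ips(hostname: str, port: int = 443) -> list:
    """Resolve a hostname and reject it if any resolved address is blocked.

    The caller should connect to one of the returned IPs directly instead of
    resolving the hostname again.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise ValueError(f"refusing to fetch {hostname!r}: non-public address")
    return ips
```

The same check has to run again on the target of every redirect, not just on the URL the user pasted.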

Fetch pipeline: timeouts, byte limits, and content-type guards

Set a 3-second connection timeout and a 5-second total request timeout. Limit response body reads to 1–2 MB: stop reading after that and parse what you have (OG tags live in `<head>`, so on well-formed pages they appear early in the document). Check Content-Type before reading the body, either from the GET response headers or with a HEAD request first, to skip binary files, audio, and video; never try to parse a 500 MB video as HTML. For HTTPS, validate the certificate; a configurable option to skip validation is acceptable for internal testing but must never be enabled in production.
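The byte limit and content-type guard can be sketched independently of any HTTP client. In the example below, `read_capped` and `is_html` are hypothetical helper names, and `stream` stands for any file-like response body with a `read(n)` method; the constants mirror the limits suggested above.

```python
import io
import time

CONNECT_TIMEOUT_S = 3.0            # connection timeout, passed to the HTTP client
TOTAL_TIMEOUT_S = 5.0              # overall budget for one fetch
MAX_BODY_BYTES = 2 * 1024 * 1024   # upper end of the 1-2 MB read limit


def is_html(content_type) -> bool:
    """Accept only HTML media types; reject missing or binary types."""
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def read_capped(stream, max_bytes=MAX_BODY_BYTES, deadline=None, chunk_size=16384):
    """Read at most max_bytes from a file-like stream, stopping early at EOF
    or once the monotonic-clock deadline has passed."""
    buf = bytearray()
    while len(buf) < max_bytes:
        if deadline is not None and time.monotonic() > deadline:
            break
        chunk = stream.read(min(chunk_size, max_bytes - len(buf)))
        if not chunk:
            break
        buf.extend(chunk)
    return bytes(buf)


# Usage with an in-memory stream standing in for a response body:
body = read_capped(io.BytesIO(b"<html>" + b"x" * 100), max_bytes=16)
```

Reading in chunks against a deadline also bounds pages that stream HTML forever, which a per-read socket timeout alone may not catch as long as the server keeps sending data.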

OG tag parsing: what to extract and fallbacks

Parse in priority order: Open Graph tags (og:title, og:description, og:image, og:url), then Twitter Card tags, then vanilla `<title>` and `<meta name='description'>`. Many sites set og:image to a relative URL — resolve relative URLs against the page's base URL before storing. Images should be proxied through your own CDN rather than linked directly: target servers change images, and direct-linking leaks your users' IPs to third parties.
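The priority order above can be sketched with the standard library's `html.parser`. The names `extract_preview` and `_MetaCollector` are hypothetical, and the sketch assumes the page URL is the base for relative links (it ignores any `<base href>` element).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _MetaCollector(HTMLParser):
    """Collect <meta> property/name values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content")
            # Keep the first occurrence of each key.
            if key and content and key not in self.meta:
                self.meta[key] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_preview(html: str, page_url: str) -> dict:
    """Open Graph first, then Twitter Card, then plain <title>/<meta> tags."""
    parser = _MetaCollector()
    parser.feed(html)

    def first(*keys):
        for key in keys:
            if parser.meta.get(key):
                return parser.meta[key]
        return None

    title = first("og:title", "twitter:title") or "".join(parser.title_parts).strip() or None
    image = first("og:image", "twitter:image")
    return {
        "title": title,
        "description": first("og:description", "twitter:description", "description"),
        # Resolve relative image paths against the page URL before storing.
        "image": urljoin(page_url, image) if image else None,
        "url": first("og:url") or page_url,
    }
```

`HTMLParser` tolerates truncated input, which matters when the body was cut off at a byte limit.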

Caching strategy: URL normalization is critical

Cache key must be the canonical URL — strip UTM parameters, normalize trailing slashes, lowercase hostname. Without normalization, `example.com/page?utm_source=twitter` and `example.com/page?utm_source=email` fetch the same page twice and store two cache entries. Cache TTL of 24 hours is appropriate for most content; news sites may need shorter TTLs (1–2 hours). Use a negative cache (store a 'no preview available' sentinel) for URLs that 404 or return no OG tags, to prevent repeated futile fetches.
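A minimal normalization sketch using `urllib.parse` follows. The function name `canonical_url` is hypothetical. Dropping the fragment, dropping default ports, and sorting the remaining query parameters go beyond what the text lists and are assumptions of this example; sorting in particular would be wrong for sites where parameter order matters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url: str) -> str:
    """Build a cache key: lowercase scheme/host, strip utm_* parameters,
    normalize the trailing slash, and drop the fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:                      # IPv6 literal needs its brackets back
        host = f"[{host}]"
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    ]
    query.sort()                         # assumption: parameter order is irrelevant
    return urlunsplit((scheme, host, path, urlencode(query), ""))
```

With this, `example.com/page?utm_source=twitter` and `example.com/page/?utm_source=email` (given a scheme) map to the same key, so the page is fetched and stored once.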

Image proxying: the necessary complexity

Serving the og:image URL directly has three problems: mixed content (an HTTP image on an HTTPS page), hot-linking (the target server may block or rate-limit image requests that originate from your pages), and privacy (the target server logs your users' IPs). The fix is to proxy and cache the image through your CDN: fetch and store the image at unfurl time, then serve your CDN URL. Add a maximum image size (5 MB) and dimension constraints so an enormous image is never served in a small preview card.
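The size and dimension limits can be sketched as two small checks. The helper names and the 800×418 card size are assumptions for illustration; only the 5 MB cap comes from the text. Decoding the image to learn its width and height is left to whatever imaging library the proxy already uses.

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024   # 5 MB cap from the text
MAX_CARD_WIDTH = 800                # assumption: preview card bounds in pixels
MAX_CARD_HEIGHT = 418


def within_size_limit(content_length) -> bool:
    """Early reject based on a declared Content-Length. The header can be
    missing or wrong, so the download itself must also be capped."""
    return content_length is not None and 0 < content_length <= MAX_IMAGE_BYTES


def fit_within(width: int, height: int,
               max_w: int = MAX_CARD_WIDTH, max_h: int = MAX_CARD_HEIGHT):
    """Scale (width, height) down to fit the card, preserving aspect ratio.
    Images already small enough are never upscaled."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000×2000 source image would be stored as an 800×400 thumbnail, while a 100×50 image is kept at its original size.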

What breaks at scale

  • Link bombs in chat: a user posts 1,000 URLs in a single message, triggering 1,000 simultaneous fetch workers. Rate-limit unfurl fetches per user and per target domain (for example, at most 5 concurrent fetches to the same domain).
  • Infinite redirect loops: some sites redirect A→B→A. Cap redirect follows at 5 and track visited URLs within a chain.
  • JavaScript-rendered OG tags: single-page apps often set OG tags via JS after DOMContentLoaded, so a static HTML fetch returns empty tags. A headless browser (Puppeteer) fixes this but is 10–100x more expensive per fetch; reserve it as a low-frequency fallback, not the default path.
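The redirect cap and loop detection can be sketched as a small driver that follows redirects manually. Everything named here is hypothetical: `get_location` stands for a caller-supplied function that issues one request without auto-following and returns the `Location` header (or `None` for a final response), and `is_allowed` stands for the caller's SSRF check, applied at every hop.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5   # cap suggested in the text


class RedirectError(Exception):
    """Raised when a redirect chain loops, is too long, or hits a blocked URL."""


def resolve_redirects(start_url, get_location, is_allowed,
                      max_redirects=MAX_REDIRECTS):
    """Follow redirects by hand and return the final URL."""
    visited = set()
    url = start_url
    for _ in range(max_redirects + 1):
        if url in visited:
            raise RedirectError(f"redirect loop at {url}")
        if not is_allowed(url):
            raise RedirectError(f"blocked destination {url}")
        visited.add(url)
        location = get_location(url)
        if location is None:
            return url                    # not a redirect: final destination
        url = urljoin(url, location)      # Location may be a relative reference
    raise RedirectError(f"more than {max_redirects} redirects")


# Usage with a dict standing in for real HTTP responses:
hops = {"https://a.example/": "/final"}
final = resolve_redirects("https://a.example/", hops.get, lambda u: True)
```

Keeping the HTTP call behind `get_location` means the HTTP client's own automatic redirect handling must be disabled, otherwise intermediate hops are never validated.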

In production

Slack's unfurl pipeline uses a per-URL fetch-and-cache flow: the first paste triggers a background worker, and subsequent shares within the TTL serve the cached preview. iMessage and WhatsApp perform unfurls client-side in some modes or via an Apple/Meta proxy; the proxy approach hides the recipient's IP from the target server but centralizes fetch traffic. Telegram uses a server-side approach similar to Slack. The real engineering challenge is dealing with hostile or slow target servers: pages that stream HTML infinitely, pages behind login walls that return 200 with no OG tags, and pages with 10 MB of HTML before the `<meta>` tags. All of these require defensive parsing with byte limits and timeouts.
