System Design Library

Link Preview / Unfurl

Generate rich previews (title/image) for pasted URLs, safely and fast.


Requirements

Functional

Fetch & parse Open Graph
Cache previews
Handle slow/dead URLs
SSRF protection

Non-functional

Fast (cached)
Safe fetching

Scale

Many links per second

The approach

The first time a URL is seen, fetch it server-side, parse the OG/meta tags, and store the preview; serve every later request from cache. Fetches run asynchronously with timeouts, and internal IPs are blocked to prevent SSRF.

Key components

App → preview cache → (miss) async fetcher → parser

Numbers that matter

OG tag parsing takes <10 ms once the HTML is fetched; the fetch itself (network round-trip + TTFB) dominates at 200–2000 ms for a cold URL.
Slack fetches and caches link previews with a TTL of ~30 minutes — popular links shared across workspaces are fetched once per region, not per message.
A 3–5 second timeout for the unfurl HTTP request is a common production choice. Beyond that, show a degraded preview (a title derived from the URL alone) rather than blocking the UX.
AWS EC2 instance metadata at 169.254.169.254 responds in <1 ms — without IP filtering, an SSRF via unfurl can leak IAM credentials in a single fast request.

Senior deep-dive

SSRF is the primary security threat: your unfurl service fetches arbitrary URLs supplied by users, and without IP blocklist filtering attackers can use it to probe internal services (AWS metadata at 169.254.169.254, internal APIs).

Cache aggressively on the first fetch — the same URL is pasted repeatedly; without a cache you re-fetch the same page thousands of times and the target server blocks your IP as a scraper.

Async fetch, sync serve: trigger the fetch on URL paste, store the preview, and serve from cache. Making the user wait for a live fetch on every paste adds unacceptable latency.
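The async-fetch, sync-serve split can be sketched as follows. This is a minimal in-process illustration, not a production design: `PreviewService`, `PENDING`, and the `fetch` callable are hypothetical names, and a real system would likely use a shared cache and a durable job queue instead of a dict and `queue.Queue`.

```python
import queue

# Placeholder returned while a fetch is still in flight (hypothetical shape).
PENDING = {"status": "pending"}


class PreviewService:
    def __init__(self):
        self.cache = {}            # stand-in for the shared preview cache
        self.jobs = queue.Queue()  # stand-in for the async fetch queue

    def get_preview(self, url):
        """Hot path: never fetches; reads the cache or enqueues a job."""
        cached = self.cache.get(url)
        if cached is not None:
            return cached
        # Mark the URL as pending so repeated pastes do not enqueue duplicates.
        self.cache[url] = PENDING
        self.jobs.put(url)
        return PENDING

    def run_one_job(self, fetch):
        """Worker side: take one URL off the queue, fetch it, store the result."""
        url = self.jobs.get()
        try:
            self.cache[url] = fetch(url)
        except Exception:
            # Drop the marker so a later paste can trigger a retry.
            self.cache.pop(url, None)
            raise
        finally:
            self.jobs.task_done()
```

The caller renders the placeholder immediately and swaps in the real preview once the worker has stored it.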

SSRF prevention: the mandatory first design decision

Before any fetch, resolve the URL's hostname to an IP and validate it against a blocklist: block private RFC 1918 ranges (10.x, 172.16–31.x, 192.168.x), loopback (127.x), link-local (169.254.x, which includes the AWS metadata endpoint), and their IPv6 equivalents. Connect to the IP you validated rather than resolving the hostname a second time; otherwise a DNS rebinding attack can return a public IP for the check and an internal IP for the fetch. Redirect following requires re-validation at each hop: an initial URL can resolve to a public IP but redirect to an internal one. Run unfurl workers in a network-isolated VPC with no route to internal services as defense-in-depth.
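A minimal sketch of the resolve-then-validate step, using only the Python standard library. The function names `is_public_ip` and `resolve_and_validate` are illustrative, and the exact set of blocked categories is an assumption that a real deployment should review against its own network layout.

```python
import ipaddress
import socket


def is_public_ip(ip_text: str) -> bool:
    """Return True only for addresses outside the blocked ranges."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:10.0.0.1) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private        # RFC 1918 ranges and IPv6 unique-local addresses
        or ip.is_loopback    # 127.0.0.0/8 and ::1
        or ip.is_link_local  # 169.254.0.0/16 (cloud metadata) and fe80::/10
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_and_validate(hostname: str, port: int = 443) -> list:
    """Resolve a hostname; raise if any resolved address is internal.

    Returns the validated addresses so a caller could connect to one of
    them directly instead of resolving the name again.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [a for a in addresses if not is_public_ip(a)]
    if blocked:
        raise ValueError(f"refusing to fetch {hostname}: resolves to {blocked}")
    return addresses
```

Rejecting the hostname when any of its addresses is internal is a deliberately strict choice: it avoids relying on which address the HTTP client happens to pick.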

Fetch pipeline: timeouts, byte limits, and content-type guards

Set a 3-second connection timeout and a 5-second total request timeout. Limit response body reads to 1–2 MB: stop reading after that and parse what you have (in well-formed HTML the OG tags sit in the `<head>`, near the top). Check Content-Type before reading the body, either with a HEAD request before the GET or by inspecting the GET response headers (some servers handle HEAD poorly), to skip binary files, audio, and video; never try to parse a 500 MB video as HTML. For HTTPS, validate the certificate but have a configurable option to skip validation for internal testing (never skip in production).
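The content-type guard and the byte cap can be sketched independently of any HTTP client. In this hedged example, `is_html` and `read_capped` are hypothetical helper names, the 2 MB cap is the upper end of the range above, and `chunks` stands for whatever streaming iterator the chosen client exposes.

```python
import time
from typing import Iterable, Optional

MAX_BODY_BYTES = 2 * 1024 * 1024  # upper end of the 1-2 MB range


def is_html(content_type: Optional[str]) -> bool:
    """Accept only HTML media types; parameters such as charset are ignored."""
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def read_capped(chunks: Iterable[bytes],
                max_bytes: int = MAX_BODY_BYTES,
                deadline: Optional[float] = None) -> bytes:
    """Collect chunks until max_bytes is reached or the deadline passes.

    `deadline` is an absolute time.monotonic() value; it is only checked
    between chunks, so the client's own socket timeout is still needed.
    """
    buf = bytearray()
    for chunk in chunks:
        remaining = max_bytes - len(buf)
        buf += chunk[:remaining]  # never store more than the cap
        if len(buf) >= max_bytes:
            break
        if deadline is not None and time.monotonic() > deadline:
            break
    return bytes(buf)
```

With `urllib.request`, for instance, the chunk iterator could be `iter(lambda: response.read(8192), b"")`, and the deadline `time.monotonic() + 5.0`.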

OG tag parsing: what to extract and fallbacks

Parse in priority order: Open Graph tags (og:title, og:description, og:image, og:url), then Twitter Card tags, then vanilla `<title>` and `<meta name='description'>`. Many sites set og:image to a relative URL — resolve relative URLs against the page's base URL before storing. Images should be proxied through your own CDN rather than linked directly: target servers change images, and direct-linking leaks your users' IPs to third parties.
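The priority order above can be sketched with the standard library's `html.parser`. This is an illustrative sketch: `PreviewParser` and `extract_preview` are hypothetical names, relative image URLs are resolved against the page URL only (a `<base href>` element is ignored), and when a tag is repeated the first occurrence wins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PreviewParser(HTMLParser):
    """Collects <meta> property/name pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_preview(html: str, page_url: str) -> dict:
    parser = PreviewParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta

    def first(*keys):
        # Return the value of the first key that is present, else None.
        return next((meta[k] for k in keys if k in meta), None)

    image = first("og:image", "twitter:image")
    return {
        "title": first("og:title", "twitter:title")
        or parser.title_text.strip() or None,
        "description": first("og:description", "twitter:description",
                             "description"),
        # Resolve relative image URLs against the page URL before storing.
        "image": urljoin(page_url, image) if image else None,
        "url": first("og:url") or page_url,
    }
```

Because the parser is tolerant of truncated input, it can be fed a body that was cut off at the byte limit.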

Caching strategy: URL normalization is critical

Cache key must be the canonical URL — strip UTM parameters, normalize trailing slashes, lowercase hostname. Without normalization, `example.com/page?utm_source=twitter` and `example.com/page?utm_source=email` fetch the same page twice and store two cache entries. Cache TTL of 24 hours is appropriate for most content; news sites may need shorter TTLs (1–2 hours). Use a negative cache (store a 'no preview available' sentinel) for URLs that 404 or return no OG tags, to prevent repeated futile fetches.
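A sketch of the normalization rules and the negative cache, under stated assumptions: `canonicalize`, `PreviewCache`, and `NO_PREVIEW` are hypothetical names, only `utm_*` parameters are treated as tracking noise, fragments and userinfo are dropped, and the 1-hour negative TTL is an illustrative value that the text above does not specify.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TTL_S = 24 * 3600
NEGATIVE_TTL_S = 3600  # assumption: retry failed URLs sooner than good ones
NO_PREVIEW = object()  # sentinel stored for URLs with no usable preview


def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if ":" in host:              # restore brackets for IPv6 literals
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Drop utm_* parameters; keep everything else in its original order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith("utm_")]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path,
                       urlencode(query), ""))


class PreviewCache:
    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock

    def get(self, url):
        """Return a preview, NO_PREVIEW, or None on a miss or expired entry."""
        key = canonicalize(url)
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def put(self, url, preview):
        """Store a preview, or the sentinel when preview is None."""
        if preview is None:
            value, ttl = NO_PREVIEW, NEGATIVE_TTL_S
        else:
            value, ttl = preview, TTL_S
        self._entries[canonicalize(url)] = (value, self._clock() + ttl)
```

Callers distinguish the three outcomes with identity checks: `None` means fetch, `NO_PREVIEW` means render nothing, anything else is a preview.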

Image proxying: the necessary complexity

Serving the og:image URL directly has three problems: mixed content (an HTTP image on an HTTPS page), hot-linking (the target server may block or rate-limit image requests that come from your pages), and privacy (the target server logs your users' IPs). The fix is to proxy and cache the image through your CDN: fetch and store the image at unfurl time, then serve your CDN URL. Add a max image size limit (5 MB) and dimension constraints to prevent serving enormous images in a small preview card.
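The size and dimension limits might be applied with two small helpers like the ones below. The names `accept_image` and `fit_within` are hypothetical, and the 800x418 card bounds are illustrative assumptions rather than values from this guide.

```python
from typing import Optional, Tuple

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # the 5 MB limit mentioned above
MAX_CARD_WIDTH = 800               # assumption: illustrative card bounds
MAX_CARD_HEIGHT = 418


def accept_image(content_type: Optional[str],
                 content_length: Optional[int]) -> bool:
    """Decide from response headers whether an image is worth downloading.

    A missing Content-Length is allowed here, so the download itself
    still needs its own byte cap.
    """
    if not content_type or not content_type.lower().startswith("image/"):
        return False
    return content_length is None or content_length <= MAX_IMAGE_BYTES


def fit_within(width: int, height: int,
               max_w: int = MAX_CARD_WIDTH,
               max_h: int = MAX_CARD_HEIGHT) -> Tuple[int, int]:
    """Scale dimensions down to fit the box, preserving aspect ratio.

    Images that already fit are returned unchanged (never upscaled).
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resized dimensions would be passed to whatever image library the proxy uses when it stores the thumbnail.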

What breaks at scale

Link bombs in chat: a user posts 1,000 URLs in a single message, triggering 1,000 simultaneous fetch workers. Rate-limit unfurl fetches per user and per target domain (for example, max 5 concurrent fetches to the same domain).

Infinite redirect loops: some sites redirect A→B→A. Cap redirect follows at 5 and track visited URLs within a chain.

JavaScript-rendered OG tags: single-page apps often set OG tags via JS after DOMContentLoaded, so a static HTML fetch returns empty tags. A headless browser (Puppeteer) fixes this but is 10–100x more expensive per fetch; reserve it as a low-frequency fallback, not the default path.
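The redirect cap and visited-URL tracking can be sketched without committing to an HTTP client. Here `follow_redirects`, `RedirectError`, and the `get_location` and `validate` callables are hypothetical: `get_location(url)` is assumed to return the `Location` header value or `None` for a final response, and `validate(url)` is assumed to raise if the URL fails the SSRF check.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5


class RedirectError(Exception):
    """Raised for redirect loops or chains longer than the cap."""


def follow_redirects(start_url, get_location, validate=lambda url: None,
                     max_redirects=MAX_REDIRECTS):
    """Follow a redirect chain manually and return the final URL."""
    visited = {start_url}
    url = start_url
    for hop in range(max_redirects + 1):
        validate(url)  # re-run the SSRF check on every hop
        location = get_location(url)
        if location is None:
            return url  # final, non-redirect response
        if hop == max_redirects:
            break       # the cap is reached and the server still redirects
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in visited:
            raise RedirectError(f"redirect loop at {next_url}")
        visited.add(next_url)
        url = next_url
    raise RedirectError(f"more than {max_redirects} redirects")
```

Following redirects manually, with the client's automatic redirect handling turned off, is what makes the per-hop validation possible.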

In production

Slack's unfurl pipeline uses a per-URL fetch-and-cache flow: first paste triggers a background worker, subsequent shares within the TTL serve the cached preview. iMessage and WhatsApp perform unfurls client-side in some modes or via an Apple/Meta proxy. The proxy approach hides the recipient's IP from the target server but centralizes fetch traffic. Telegram uses a server-side approach similar to Slack. The real engineering challenge is dealing with hostile or slow target servers: pages that stream HTML infinitely, pages behind login walls that return 200 with no OG tags, and pages with 10 MB of HTML before the `<meta>` tags. All of these require defensive parsing with byte limits and timeouts.

Common mistakes

Fetching without SSRF protection
Synchronous fetch on the hot path
No cache (refetch every render)
