Link Preview / Unfurl
Generate rich previews (title/image) for pasted URLs, safely and fast.
Requirements
Functional
- Fetch & parse Open Graph
- Cache previews
- Handle slow/dead URLs
- SSRF protection
Non-functional
- Fast (cached)
- Safe fetching
Scale
Many links per second
The approach
The first time a URL is seen, fetch it server-side, parse the OG/meta tags, and store the preview; serve every later request from cache. Fetches run asynchronously with timeouts, and internal IPs are blocked to prevent SSRF.
Key components
App → preview cache → (miss) async fetcher → parser
Numbers that matter
- OG tag parsing takes <10 ms once the HTML is fetched; the fetch itself (network round-trip + TTFB) dominates at 200–2000 ms for a cold URL.
- Slack fetches and caches link previews with a TTL of ~30 minutes — popular links shared across workspaces are fetched once per region, not per message.
- A timeout of 3–5 seconds for the unfurl HTTP request is the production standard — beyond that, show a degraded preview (title only from URL) rather than blocking the UX.
- AWS EC2 instance metadata at 169.254.169.254 responds in <1 ms — without IP filtering, an SSRF via unfurl can leak IAM credentials in a single fast request.
Senior deep-dive
SSRF is the primary security threat: your unfurl service fetches arbitrary URLs supplied by users, and without IP blocklist filtering attackers can use it to probe internal services (AWS metadata at 169.254.169.254, internal APIs).
Cache aggressively on the first fetch — the same URL is pasted repeatedly; without a cache you re-fetch the same page thousands of times and the target server blocks your IP as a scraper.
Async fetch, sync serve: trigger the fetch on URL paste, store the preview, and serve from cache — making the user wait for a live fetch on every paste is unacceptable latency.
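A minimal sketch of the async-fetch, sync-serve flow, assuming an in-memory dict as the cache and a plain deque standing in for a real job queue. `PreviewService`, `get_preview`, and `complete_fetch` are hypothetical names for illustration, not an existing API.

```python
import time
from collections import deque


class PreviewService:
    """Serve previews from cache; enqueue a background fetch on a miss."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}          # url -> (expires_at, preview dict)
        self.in_flight = set()   # urls already queued, to avoid duplicate fetches
        self.queue = deque()     # stand-in for a real job queue (assumption)

    def get_preview(self, url):
        """Return the cached preview, or None after scheduling a fetch."""
        entry = self.cache.get(url)
        if entry and entry[0] > self.clock():
            return entry[1]
        if url not in self.in_flight:
            self.in_flight.add(url)
            self.queue.append(url)
        return None  # caller renders a plain link until the preview arrives

    def complete_fetch(self, url, preview):
        """Called by a worker once the fetch and parse have finished."""
        self.cache[url] = (self.clock() + self.ttl, preview)
        self.in_flight.discard(url)


svc = PreviewService()
first = svc.get_preview("https://example.com/a")   # miss: queued
again = svc.get_preview("https://example.com/a")   # still a miss, not re-queued
svc.complete_fetch("https://example.com/a", {"title": "Example"})
hit = svc.get_preview("https://example.com/a")     # served from cache
```

The `in_flight` set is what keeps a burst of pastes of the same URL from fanning out into many identical fetches.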
SSRF prevention: the mandatory first design decision
Before any fetch, resolve the URL's hostname to an IP and validate it against a blocklist: block private RFC1918 ranges (10.x, 172.16–31.x, 192.168.x), loopback (127.x), link-local (169.254.x, which includes the AWS metadata endpoint), and their IPv6 equivalents. Connect to the IP you validated rather than resolving the hostname a second time; otherwise a DNS rebinding attack can swap in an internal address between the check and the fetch. Redirect following requires re-validation at each hop, because an initial URL can resolve to a public IP but redirect to an internal one. Run unfurl workers in a network-isolated VPC with no route to internal services as defense-in-depth.
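The blocklist check can be sketched with Python's standard `ipaddress` module. The function names `is_public_ip` and `resolve_and_validate` are hypothetical, and the sketch only covers validation: it does not pin the outgoing connection to the validated address, which a real fetcher would also need to do.

```python
import ipaddress
import socket


def is_public_ip(ip_text):
    """True only for addresses that look safe to fetch from the public internet."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private        # 10/8, 172.16/12, 192.168/16, fc00::/7, ...
        or ip.is_loopback    # 127/8, ::1
        or ip.is_link_local  # 169.254/16 (cloud metadata), fe80::/10
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_and_validate(hostname, port=443):
    """Resolve hostname; raise ValueError if any resolved address is internal."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    for addr in addresses:
        if not is_public_ip(addr):
            raise ValueError(f"blocked address {addr} for host {hostname}")
    return addresses
```

Rejecting the host when any resolved address is internal (rather than picking a public one) is a deliberately strict choice in this sketch.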
Fetch pipeline: timeouts, byte limits, and content-type guards
Set a 3-second connection timeout and a 5-second total request timeout. Limit response body reads to 1–2 MB: stop reading after that and parse what you have (OG tags normally sit in the `<head>`, near the top of well-formed HTML). Check Content-Type before reading the body (a HEAD request before the GET, or the response headers of the GET itself) to skip binary files, audio, and video; never try to parse a 500 MB video as HTML. For HTTPS, validate the certificate, with a configurable option to skip validation for internal testing only (never skip in production).
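The byte limit and content-type guard can be sketched independently of any HTTP client. This example assumes the response body is exposed as a file-like object with `read()`; `is_html` and `read_capped` are hypothetical helper names, and the limits are the ones quoted above.

```python
import io
import time

MAX_BODY_BYTES = 2 * 1024 * 1024   # upper end of the 1-2 MB cap
TOTAL_TIMEOUT_S = 5                # total request budget


def is_html(content_type):
    """Accept only HTML media types; ignore parameters such as charset."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def read_capped(stream, max_bytes, deadline, clock=time.monotonic,
                chunk_size=64 * 1024):
    """Read at most max_bytes from stream; raise TimeoutError past deadline."""
    chunks, total = [], 0
    while total < max_bytes:
        if clock() > deadline:
            raise TimeoutError("total request timeout exceeded")
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:
            break  # end of body
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks)


# Usage with an in-memory body standing in for a network response.
body = io.BytesIO(b"<html>" + b"x" * 100)
head = read_capped(body, max_bytes=16,
                   deadline=time.monotonic() + TOTAL_TIMEOUT_S)
```

Checking the deadline between chunks is what stops a server that streams HTML forever; a per-read socket timeout alone would not, because each individual read keeps succeeding.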
OG tag parsing: what to extract and fallbacks
Parse in priority order: Open Graph tags (og:title, og:description, og:image, og:url), then Twitter Card tags, then vanilla `<title>` and `<meta name='description'>`. Many sites set og:image to a relative URL — resolve relative URLs against the page's base URL before storing. Images should be proxied through your own CDN rather than linked directly: target servers change images, and direct-linking leaks your users' IPs to third parties.
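A minimal sketch of the priority order using the standard-library `html.parser`. `PreviewParser` and `extract_preview` are hypothetical names; the sketch resolves relative images against the page URL only and ignores any `<base href>` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PreviewParser(HTMLParser):
    """Collect <meta> property/name pairs and the <title> text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta = {}            # key -> content, first occurrence wins
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            content = attrs.get("content")
            if key and content:
                self.meta.setdefault(key, content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def first_of(meta, *keys):
    """Return the first non-empty value among keys, in priority order."""
    for key in keys:
        if meta.get(key):
            return meta[key]
    return None


def extract_preview(html, page_url):
    parser = PreviewParser()
    parser.feed(html)
    meta = parser.meta
    title_tag = "".join(parser.title_parts).strip() or None
    image = first_of(meta, "og:image", "twitter:image")
    return {
        "title": first_of(meta, "og:title", "twitter:title") or title_tag,
        "description": first_of(meta, "og:description",
                                "twitter:description", "description"),
        # Relative og:image values are resolved against the page URL.
        "image": urljoin(page_url, image) if image else None,
        "url": first_of(meta, "og:url") or page_url,
    }
```

For example, a page at `https://example.com/post/1` with `og:image` set to `/img/a.png` yields the absolute image URL `https://example.com/img/a.png`.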
Caching strategy: URL normalization is critical
Cache key must be the canonical URL — strip UTM parameters, normalize trailing slashes, lowercase hostname. Without normalization, `example.com/page?utm_source=twitter` and `example.com/page?utm_source=email` fetch the same page twice and store two cache entries. Cache TTL of 24 hours is appropriate for most content; news sites may need shorter TTLs (1–2 hours). Use a negative cache (store a 'no preview available' sentinel) for URLs that 404 or return no OG tags, to prevent repeated futile fetches.
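One way to sketch the normalization with `urllib.parse`. The function name `canonical_url` and the extra tracking parameters in `TRACKING_PARAMS` are assumptions for illustration; dropping the fragment, userinfo, and default ports goes slightly beyond the rules listed above and is also an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed extras beyond utm_*; adjust to the trackers you actually see.
TRACKING_PARAMS = {"fbclid", "gclid"}
DEFAULT_PORTS = {("http", 80), ("https", 443)}


def canonical_url(url):
    """Build a cache key: lowercase host, no tracking params, no trailing slash."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname drops userinfo and port
    if ":" in host:                          # IPv6 literal needs its brackets back
        host = f"[{host}]"
    netloc = host
    if parts.port and (scheme, parts.port) not in DEFAULT_PORTS:
        netloc = f"{host}:{parts.port}"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
        and key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    # The fragment is never sent to the server, so it is dropped from the key.
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))


a = canonical_url("https://Example.com/page/?utm_source=twitter")
b = canonical_url("https://example.com/page?utm_source=email")
# a and b are both "https://example.com/page", so they share one cache entry.
```

Query parameter order is preserved rather than sorted, since some sites treat reordered parameters as a different page.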
Image proxying: the necessary complexity
Serving the og:image URL directly has three problems: mixed content (HTTP image on HTTPS page), hot-linking (the target server may block or rate-limit image requests that originate from your pages), and privacy (target server logs your users' IPs). The fix is to proxy and cache the image through your CDN: fetch and store the image at unfurl time, then serve your CDN URL. Add a max image size limit (5 MB) and dimension constraints to prevent serving enormous images in a small preview card.
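The size and dimension guards might look like the sketch below. The CDN base URL, the 1200x630 card bounds, and all function names are assumptions for illustration; only the 5 MB limit comes from the text above.

```python
import hashlib

MAX_IMAGE_BYTES = 5 * 1024 * 1024    # 5 MB limit
MAX_WIDTH, MAX_HEIGHT = 1200, 630    # assumed preview-card bounds


def proxied_image_url(image_url,
                      cdn_base="https://cdn.example.invalid/unfurl"):
    """Map a third-party image URL to a stable key on our own CDN."""
    digest = hashlib.sha256(image_url.encode("utf-8")).hexdigest()
    return f"{cdn_base}/{digest}"


def accept_image(size_bytes):
    """Reject empty and oversized images before storing them."""
    return 0 < size_bytes <= MAX_IMAGE_BYTES


def fit_within(width, height, max_w=MAX_WIDTH, max_h=MAX_HEIGHT):
    """Scale down to fit the card, preserving aspect ratio; never upscale."""
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


card_size = fit_within(2400, 1260)   # (1200, 630)
small = fit_within(600, 300)         # unchanged: (600, 300)
```

Keying the stored image by a hash of its source URL means the same og:image shared by many pages is fetched and stored once.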
What breaks at scale
Link bombs in chat: a user posts 1,000 URLs in a single message, triggering 1,000 simultaneous fetch workers. Rate-limit unfurl fetches per user and per target domain (max 5 concurrent fetches to the same domain). Infinite redirect loops: some sites redirect A→B→A; cap redirect follows at 5 and track visited URLs within a chain. JavaScript-rendered OG tags: single-page apps often set OG tags via JS after DOMContentLoaded, so a static HTML fetch returns empty tags. A headless browser (Puppeteer) fixes this but is 10–100x more expensive per fetch; reserve it for a low-frequency fallback, not the default path.
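The redirect cap and loop detection can be sketched by following redirects manually, which also gives a place to re-run the SSRF check on every hop. `resolve_redirects`, `fetch_once`, and `is_allowed` are hypothetical: `fetch_once(url)` is assumed to return `(status_code, location_or_None)` without following redirects itself.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5
REDIRECT_CODES = {301, 302, 303, 307, 308}


class UnfurlError(Exception):
    """Raised when a redirect chain is unsafe or does not terminate."""


def resolve_redirects(url, fetch_once, is_allowed,
                      max_redirects=MAX_REDIRECTS):
    """Return the final URL of a redirect chain, validating every hop."""
    visited = set()
    for _ in range(max_redirects + 1):   # the initial URL plus N redirects
        if url in visited:
            raise UnfurlError(f"redirect loop at {url}")
        if not is_allowed(url):          # e.g. the SSRF blocklist check
            raise UnfurlError(f"blocked hop {url}")
        visited.add(url)
        status, location = fetch_once(url)
        if status not in REDIRECT_CODES or not location:
            return url
        url = urljoin(url, location)     # Location may be relative
    raise UnfurlError("too many redirects")


# Usage with a fake fetcher standing in for real HTTP.
hops = {
    "https://a.test/": (302, "/b"),
    "https://a.test/b": (200, None),
}
final = resolve_redirects("https://a.test/", hops.__getitem__,
                          lambda u: True)
```

Here `final` is `https://a.test/b`; an A→B→A chain raises `UnfurlError` on the second visit to A instead of spinning until the cap.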
In production
Slack's unfurl pipeline uses a per-URL fetch-and-cache flow: first paste triggers a background worker, subsequent shares within the TTL serve the cached preview. iMessage and WhatsApp perform unfurls client-side in some modes or via an Apple/Meta proxy — the proxy approach hides the recipient's IP from the target server but centralizes fetch traffic. Telegram uses a server-side approach similar to Slack. The real engineering challenge is dealing with hostile or slow target servers: pages that stream HTML infinitely, pages behind login walls that return 200 with no OG tags, and pages with 10MB of HTML before the `<meta>` tags — all require defensive parsing with byte limits and timeouts.
Common mistakes
- Fetching without SSRF protection
- Synchronous fetch on the hot path
- No cache (refetch every render)