System Design Library

LinkedIn (connections / PYMK)

Model a professional graph and suggest "People You May Know" at scale.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Connections (graph)
  • Degrees of separation
  • PYMK suggestions
  • Feed

Non-functional

  • Fast graph queries
  • Relevant suggestions

Scale

Billions of edges

The approach

Graph stored sharded (adjacency lists); PYMK precomputed offline via friends-of-friends + signals (shared employers/schools); served from a candidate store; 2nd-degree queries bounded.

Key components

Graph store (sharded) · offline PYMK pipeline → suggestion cache

Numbers that matter

Senior deep-dive

PYMK is a graph problem, not a query — computing friends-of-friends live for every request is O(degree²) and unviable; you precompute candidates offline and serve from a fast candidate store.

The professional graph has very different properties than Facebook's social graph — edges are bidirectional, high-signal, and sparse (most professionals have 200-500 connections vs. thousands of followers); path-based inference (shared employer, school, industry) dominates over pure topology.

2nd-degree queries must be bounded — a well-connected LinkedIn power user has 500 connections, each with 500 connections = 250K 2nd-degree candidates; without degree-capping or sampling, graph traversal blows up.

Precompute offline: PYMK cannot be a live graph query

A naive PYMK: for user A, fetch A's connections, then for each connection fetch their connections, dedupe, score. With 500 connections × 500 connections = 250K candidates and a latency budget of 50ms, you cannot traverse the graph in real time. The production pattern: an offline Spark/Hadoop job runs nightly over the full adjacency list, computes 2nd-degree candidates for every user, scores and ranks them, and writes the top-K to a fast KV store. The API tier just does a key lookup.

Graph storage: adjacency lists, not a graph DB

Despite PYMK being a graph problem, LinkedIn does not store the social graph in a graph database like Neo4j. The professional graph is stored as adjacency lists in a distributed KV store (Voldemort/Espresso): `userId → [connectionId, connectionId, ...]`. Graph databases add operational complexity and don't outperform a well-sharded adjacency list for the BFS-style 2nd-degree traversal that PYMK requires. Neo4j is fine for small graphs; at LinkedIn scale, the adjacency list is king.

Ranking signals: topology + profile attributes

Raw 2nd-degree count is a strong baseline but misses context. LinkedIn's PYMK ranking combines: shared connections count (strongest signal), shared employer (worked at Company X together), shared school, shared industry, geographic proximity, and profile view history (has A visited B's profile?). These signals are precomputed offline and stored as feature vectors alongside the candidate list. A lightweight ranking model (logistic regression or GBDT) combines them at serve time — inference on a 500-candidate list takes ~5ms.

Incremental updates: the nightly batch problem

If A connects with B tonight at 11pm, and the batch job ran at midnight last night, the PYMK recommendations won't reflect this new edge for up to 24 hours. LinkedIn layered a streaming update path on top of batch: connection events flow to Kafka, a stream processor identifies which PYMK candidate lists are affected (A's list, B's list, all of A's connections whose 2nd-degree set now includes B's connections), and updates them in near-real-time. This is the lambda architecture pattern: batch for the baseline, streaming for the delta.

Privacy and the 'don't know each other' surface

PYMK can surface connections between people who deliberately don't want to be connected — a domestic violence survivor and their abuser who share a mutual contact, or an employee and the manager they blocked. LinkedIn must respect block lists (never suggest a blocked user), opt-out preferences, and handle the GDPR-sensitive signal that 'you two may know each other' reveals connection graph information. The candidate generation step must filter against these lists before writing to the candidate store, not at serve time, to prevent timing attacks.

What breaks at scale

Super-connectors (LinkedIn influencers with 30,000+ connections) explode the 2nd-degree candidate count — 30K × 500 average = 15M 2nd-degree candidates, which the offline job must cap via sampling or degree-limiting (only sample N random connections of each connection). Graph partitioning: if the graph is sharded and user A's connections span 10 shards, the offline job must do cross-shard joins, which are expensive. LinkedIn co-locates the graph so that connected users tend to land on the same shard. Staleness during batch runs: the nightly batch job takes 4-6 hours to run; during that window, newly joined users have no PYMK candidates at all — bootstrap them with attribute-based suggestions (people in your city, industry) until the first batch run.

In production

LinkedIn built Voldemort (distributed KV store) to serve precomputed PYMK candidates and Galene (their search infrastructure) for people search. Their engineering blog describes running offline Hadoop/Spark jobs nightly to compute the connection graph, extract 2nd-degree candidates, score them by shared connections + shared employer/school, and write ranked candidate lists to Voldemort. The real engineering challenge is incremental graph updates: when user A connects with user B, A's PYMK candidates change (B's connections become A's 2nd-degree), but so do hundreds of other users' PYMK lists. Fully recomputing offline nightly means a 24-hour lag; LinkedIn layered a real-time streaming update (via Kafka + Samza) that handles recent connections, on top of the batch baseline.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →