LinkedIn (connections / PYMK)
Model a professional graph and suggest "People You May Know" at scale.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Connections (graph)
- Degrees of separation
- PYMK suggestions
- Feed
Non-functional
- Fast graph queries
- Relevant suggestions
Scale
Billions of edges
The approach
Graph stored sharded (adjacency lists); PYMK precomputed offline via friends-of-friends + signals (shared employers/schools); served from a candidate store; 2nd-degree queries bounded.
Key components
Graph store (sharded) · offline PYMK pipeline → suggestion cache
Numbers that matter
- ~900M members on LinkedIn with an average ~300 connections each — the professional graph has ~270B potential edges to traverse for global PYMK.
- ~50ms target latency for PYMK API responses, served from a pre-materialized candidate store — live graph traversal at this latency is impossible.
- ~10-20 PYMK candidates shown per user per session; the offline pipeline generates ~500-2,000 raw candidates that are then ranked.
- 2nd-degree overlap score (number of shared connections) is the single strongest PYMK signal — two people with 20+ mutual connections convert at ~5× the baseline rate.
Senior deep-dive
PYMK is a graph problem, not a query — computing friends-of-friends live for every request is O(degree²) and unviable; you precompute candidates offline and serve from a fast candidate store.
The professional graph has very different properties than Facebook's social graph — edges are bidirectional, high-signal, and sparse (most professionals have 200-500 connections vs. thousands of followers); path-based inference (shared employer, school, industry) dominates over pure topology.
2nd-degree queries must be bounded — a well-connected LinkedIn power user has 500 connections, each with 500 connections = 250K 2nd-degree candidates; without degree-capping or sampling, graph traversal blows up.
Precompute offline: PYMK cannot be a live graph query
A naive PYMK: for user A, fetch A's connections, then for each connection fetch their connections, dedupe, score. With 500 connections × 500 connections = 250K candidates and a latency budget of 50ms, you cannot traverse the graph in real time. The production pattern: an offline Spark/Hadoop job runs nightly over the full adjacency list, computes 2nd-degree candidates for every user, scores and ranks them, and writes the top-K to a fast KV store. The API tier just does a key lookup.
Graph storage: adjacency lists, not a graph DB
Despite PYMK being a graph problem, LinkedIn does not store the social graph in a graph database like Neo4j. The professional graph is stored as adjacency lists in a distributed KV store (Voldemort/Espresso): `userId → [connectionId, connectionId, ...]`. Graph databases add operational complexity and don't outperform a well-sharded adjacency list for the BFS-style 2nd-degree traversal that PYMK requires. Neo4j is fine for small graphs; at LinkedIn scale, the adjacency list is king.
Ranking signals: topology + profile attributes
Raw 2nd-degree count is a strong baseline but misses context. LinkedIn's PYMK ranking combines: shared connections count (strongest signal), shared employer (worked at Company X together), shared school, shared industry, geographic proximity, and profile view history (has A visited B's profile?). These signals are precomputed offline and stored as feature vectors alongside the candidate list. A lightweight ranking model (logistic regression or GBDT) combines them at serve time — inference on a 500-candidate list takes ~5ms.
Incremental updates: the nightly batch problem
If A connects with B tonight at 11pm, and the batch job ran at midnight last night, the PYMK recommendations won't reflect this new edge for up to 24 hours. LinkedIn layered a streaming update path on top of batch: connection events flow to Kafka, a stream processor identifies which PYMK candidate lists are affected (A's list, B's list, all of A's connections whose 2nd-degree set now includes B's connections), and updates them in near-real-time. This is the lambda architecture pattern: batch for the baseline, streaming for the delta.
Privacy and the 'don't know each other' surface
PYMK can surface connections between people who deliberately don't want to be connected — a domestic violence survivor and their abuser who share a mutual contact, or an employee and the manager they blocked. LinkedIn must respect block lists (never suggest a blocked user), opt-out preferences, and handle the GDPR-sensitive signal that 'you two may know each other' reveals connection graph information. The candidate generation step must filter against these lists before writing to the candidate store, not at serve time, to prevent timing attacks.
What breaks at scale
Super-connectors (LinkedIn influencers with 30,000+ connections) explode the 2nd-degree candidate count — 30K × 500 average = 15M 2nd-degree candidates, which the offline job must cap via sampling or degree-limiting (only sample N random connections of each connection). Graph partitioning: if the graph is sharded and user A's connections span 10 shards, the offline job must do cross-shard joins, which are expensive. LinkedIn co-locates the graph so that connected users tend to land on the same shard. Staleness during batch runs: the nightly batch job takes 4-6 hours to run; during that window, newly joined users have no PYMK candidates at all — bootstrap them with attribute-based suggestions (people in your city, industry) until the first batch run.
In production
LinkedIn built Voldemort (distributed KV store) to serve precomputed PYMK candidates and Galene (their search infrastructure) for people search. Their engineering blog describes running offline Hadoop/Spark jobs nightly to compute the connection graph, extract 2nd-degree candidates, score them by shared connections + shared employer/school, and write ranked candidate lists to Voldemort. The real engineering challenge is incremental graph updates: when user A connects with user B, A's PYMK candidates change (B's connections become A's 2nd-degree), but so do hundreds of other users' PYMK lists. Fully recomputing offline nightly means a 24-hour lag; LinkedIn layered a real-time streaming update (via Kafka + Samza) that handles recent connections, on top of the batch baseline.
Common mistakes
- Online multi-hop graph traversal
- No traversal depth bound
- Ignoring mutual-context signals