System Design Library

Food Delivery (DoorDash)

Connect customers, restaurants and couriers with live order tracking.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Browse/order
  • Assign courier
  • Live tracking
  • Payments

Non-functional

  • Realtime tracking
  • Reliable assignment

Scale

Millions of orders

The approach

Geo-indexed couriers (like Uber); order state machine (placed→preparing→picked→delivered); matching assigns nearest available courier via workers; live location over WS.

Key components

Order service (state machine) · geo courier index · matching workers · tracking (WS)

Numbers that matter

Senior deep-dive

Courier geo-indexing is the hot path — every order dispatch query must find available couriers within radius in real time; a stale or slow geo-index means bad matches and late deliveries.

The order state machine is the consistency core — transitions (placed → accepted → preparing → picked up → delivered) must be durable and exactly-once; double-state-transitions cause duplicate payouts or ghost orders.

Matching is an optimization problem, not a lookup — assigning courier to order considers ETA to restaurant, batch opportunities (pick up two orders in one run), and courier utilization; naive nearest-courier assignment leaves significant efficiency on the table.

Geo-index: the write-heavy hot path

Every courier position update must be indexed so dispatch queries can find nearby available couriers in milliseconds. A Redis GEOADD with S2-cell-bucketed keys gives O(log n) proximity queries and O(1) updates. The critical detail: courier availability state (online, on delivery, offline) is combined with geo — a query for "available couriers within 3km" must filter on both dimensions atomically to avoid dispatching to an already-assigned courier.

Order state machine: durable, exactly-once transitions

Order state (placed → restaurant-accepted → preparing → ready → picked-up → delivered) lives in a persistent store with optimistic locking on state version. Each transition is idempotent — retrying a "picked-up" transition is a no-op if already in that state. The failure mode to prevent is split-brain: a courier marking an order picked up while a timeout job simultaneously cancels it. The timeout job must check-and-cancel atomically, not blind-set the state.

Dispatch matching: optimization, not lookup

Naive nearest-courier dispatch ignores batch opportunities (two orders from nearby restaurants) and misestimates ETA by ignoring real-time traffic. DoorDash runs a dispatch optimization model that considers: courier ETA to restaurant (routing API call), restaurant prep time estimate (ML), and batching potential (is there a second nearby order the courier could pick up?). The model runs in a tight loop and must produce assignments in seconds, not minutes.

Restaurant prep time: the hidden variable

Dispatching a courier too early means they wait at the restaurant (idle capacity, frustrated courier); too late means cold food. DoorDash estimates prep time using a per-restaurant ML model incorporating historical times, current queue depth (how many active orders at the restaurant), and time of day. Real-time queue depth requires restaurants to either use a tablet to confirm orders and update status, or fall back to historical averages — tablet adoption is the business lever.

Live tracking: geo pipeline to customer UI

Courier GPS pings flow through Flink → state store → WebSocket push to customer. The ETA on the tracking screen is recomputed on each ping using live routing, not interpolated — this matters when a courier takes a detour. The tricky part is "approaching" detection: sending a push notification "your courier is 2 minutes away" requires detecting when ETA crosses a threshold, which must be exactly-once (no double notifications) despite multiple workers processing the same courier's ping stream.

What breaks at scale

Restaurant-area courier concentration: at peak dinner rush in a dense neighborhood, hundreds of couriers cluster near a few popular restaurants. Geo-index queries return massive candidate sets; the dispatch optimizer must handle ties gracefully without thundering-herd re-queries. Restaurant tablet failures are silent — the restaurant stops confirming orders but DoorDash can't distinguish this from an empty queue, causing systematic late dispatch. Geo-index staleness during a Redis failover means dispatch falls back to stale positions, degrading match quality until the replica catches up.

In production

DoorDash uses Apache Flink for real-time courier location processing and a geospatial index backed by Redis with S2 cells for proximity queries. The actual matching algorithm is a dispatch optimization service that runs an assignment model (not simple nearest-courier) considering delivery time windows, restaurant prep time, and batch opportunities. The hardest problem is restaurant prep time estimation — if the courier arrives before the food is ready, they wait, wasting capacity; if they arrive late, the food sits and quality drops. DoorDash uses ML models trained on per-restaurant historical prep times to time dispatch.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →