System Design Library

Ad Click Aggregator

Count billions of ad clicks in near-real-time for dashboards & billing, accurately.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Ingest click events
  • Aggregate by ad/time
  • Near-real-time dashboards
  • Accurate billing

Non-functional

  • High write throughput
  • Idempotent/dedup
  • Late & duplicate events

Scale

Billions of events/day

The approach

Stream ingestion (Kafka) → stream processor (Flink) windowed aggregation → OLAP store for queries; dedup via event IDs; a batch layer reconciles for billing-grade accuracy (lambda/kappa).

Key components

Producers → Kafka → stream processor → OLAP store · batch reconciliation

Numbers that matter

Senior deep-dive

Deduplication at ingest is the billing-grade correctness requirement — a single click can arrive 2–5 times (browser retries, proxy retries, bot replays) and each duplicate is a direct revenue error.

The lambda architecture exists because stream and batch disagree: your Flink job gives you near-real-time approximations for dashboards, but a nightly Spark job over the raw log is the only source of truth for billing — never invoice from the stream.

Attribution windows make this harder than a simple counter: a click today may convert 7 days later, so your aggregation schema must support retroactive updates without reprocessing everything.

Dedup at ingest: the only place it's cheap

Deduplication downstream (at the aggregator) is expensive and lossy. Do it at ingest: assign each click a deterministic event ID (hash of user ID + ad ID + timestamp rounded to 1s + user agent) at the browser/CDN, and use a Redis SET with a sliding TTL to reject duplicates at the Kafka producer or consumer level. The bloom filter variant is 10–50× more memory-efficient but admits false positives — acceptable for dashboards, unacceptable for billing. Use exact dedup (Redis SET) for the billing path.

Lambda architecture: stream for speed, batch for truth

Flink over Kafka gives you ~5-second latency dashboards — close enough to real-time for campaign managers to react. But stream processors can lose events, reprocess duplicates on restart, or miss late arrivals outside their watermark. A nightly batch job over the raw immutable Kafka topic archive (or S3) is the billing source of truth — it reprocesses everything with perfect dedup. The reconciliation job compares batch totals to stream totals and flags discrepancies before invoices go out. Never bill from the stream.

Attribution windows: the hidden aggregation complexity

A click at T=0 may produce a conversion at T=7 days. Your aggregation schema must attach the conversion to the original click's campaign and timestamp, not the conversion timestamp. This requires either event-time windowing with very long watermarks (expensive) or a separate attribution join service that matches conversion events to click events via a lookup store. Late-arriving attribution events require retroactive counter updates — most stream processors handle this poorly; batch is more natural.

OLAP store selection: write rate vs query speed

Druid is optimized for pre-aggregated rollup writes (good for known dimensions) and fast time-range queries. ClickHouse accepts raw events and aggregates at query time — more flexible, slightly slower for cardinality explosion. Pinot (LinkedIn/Uber) handles real-time + historical joins well. The key decision: how many dimensions do you need to filter by at query time? High-cardinality filtering (by campaign × region × device × hour) favors columnar scan engines over pre-aggregated cubes.

Backpressure and partition skew in Kafka

Ad click topics are skewed by popular ads: a single viral campaign can saturate one Kafka partition while others are idle. Repartition by a hash of (campaign_id + time_bucket) before the aggregation step to distribute load. Backpressure from a slow Flink job causes consumer lag to grow — monitor consumer group lag as the primary health signal, not throughput. A lag spike means your watermark is advancing slower than real time, causing windowed aggregations to delay closing.

What breaks at scale

State explosion in the stream processor is the first failure: Flink maintains per-window state for every (campaign, ad, user) tuple; at billions of unique tuples per day, state backends (RocksDB) can exhaust disk or slow down checkpointing, causing recovery times of 10–30 minutes after a crash. Pre-aggregate to coarser granularity in the stream (per-minute campaign totals) and push fine-grained attribution to the batch layer. The second failure: Kafka topic retention too short — if the batch reconciliation job runs 24h later but retention is 24h, you've lost your source of truth. Retention must be ≥ billing cycle + processing SLA.

In production

Google's ad click pipeline uses a Flume/Dataflow streaming layer for near-real-time campaign dashboards and a separate MapReduce/BigQuery batch layer for billing reconciliation — they explicitly don't trust the stream for money. Meta's pipeline similarly uses Scribe → Hive (batch truth) + a fast path via a custom stream processor. The real engineering challenge is invalid click detection: bots, click farms, and accidental double-clicks must be filtered before aggregation, requiring a streaming ML scorer that runs in the dedup window — flagging invalid clicks after billing settlement is an ops nightmare.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →