System Design Library

Realtime Analytics Dashboard

Live dashboards over high-volume event streams (clicks, metrics).

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Ingest events
  • Windowed aggregations
  • Live dashboards
  • Drill-down

Non-functional

  • Seconds-fresh
  • High throughput

Scale

Millions of events/sec

The approach

Events → Kafka → stream processor (Flink) computes windowed aggregates → OLAP/columnar store (Druid/ClickHouse) for fast slice-and-dice queries; dashboards query the OLAP store.

Key components

Producers → Kafka → stream processor → OLAP store → dashboard

Numbers that matter

Senior deep-dive

The query-latency contract forces you to choose your storage layer before anything else — raw Kafka offsets for seconds-fresh but hard to query, or a pre-aggregated OLAP store for interactive dashboards but with seconds-to-minutes of lag.

Flink windowed aggregations are the bridge — they consume the stream and materialize results into ClickHouse/Druid so dashboards don't hit raw logs.

Cardinality is the silent cost driver: high-cardinality dimensions (user_id in group-bys) explode storage and slow queries — pre-aggregate at ingest and cap raw retention.

Pipeline architecture: Kafka → Flink → OLAP

Kafka is the durable buffer that decouples ingest rate from processing rate and enables consumer replay. Flink reads partitions, applies tumbling or sliding windows (e.g., 1-minute clickthrough rates), and writes aggregated rows to ClickHouse. The key design choice is aggregation granularity at write time — finer granularity preserves query flexibility but inflates storage and query fan-out.

Watermarks and late data

Events arrive out of order — network jitter, mobile clients, CDN edge buffering. Event-time processing requires a watermark that advances as the stream progresses, signaling "all events before T have arrived." Flink's allowed lateness parameter holds windows open for a configurable extra period; events after that are either dropped or routed to a side output for reconciliation. Setting this too tight misses legitimate late events; too loose delays dashboard refresh.

Exactly-once semantics and idempotent sinks

Flink checkpoints operator state to durable storage (S3/HDFS) at intervals; on restart it replays from the last checkpoint. The sink must be idempotent (ClickHouse deduplication by insert ID, or upsert by primary key) or support two-phase commit so retried writes don't double-count. This is the most under-tested failure scenario — test by killing Flink mid-window and verifying counts match a batch recount.

OLAP layer: ClickHouse vs Druid vs Pinot

ClickHouse wins for ad-hoc SQL and simple deployments — great compression, vectorized execution, but upserts are expensive (ReplacingMergeTree is eventually consistent). Druid shines for time-series roll-ups with pre-aggregated datasources but adds operational complexity (6+ node types). Pinot handles upserts + real-time ingestion better but requires more schema discipline upfront. Choose based on whether you need exact counts or approximate are acceptable — HLL in Druid/Pinot can answer "unique users" in milliseconds at the cost of ~2% error.

Dashboards and query patterns

Dashboards are high-fan-out readers: a single page load might fire 10–20 queries simultaneously. Cache aggressively at the dashboard layer (Grafana query cache, materialized views) and ensure the OLAP store has columnar pruning — queries that filter by time range should skip entire partitions. Pre-materializing the most common groupings at ingest time (country + device + hour) dramatically cuts per-query compute but requires knowing the query patterns in advance.

What breaks at scale

Cardinality explosions are the silent killer: adding user_id as a dimension to a Druid datasource can 100x storage and 10x query latency overnight. The fix is strict schema governance — dimensions must be low-cardinality; metrics carry the numbers; high-cardinality analysis belongs in a separate raw log store. The second failure mode is Kafka consumer lag drift: if Flink falls behind, dashboards silently show stale data with no obvious error — instrument `consumer_lag` as a first-class alert.

In production

LinkedIn's Pinot and Uber's AresDB were both built because Druid wasn't fast enough at their specific query patterns (Pinot: real-time OLAP with upserts; AresDB: GPU-accelerated time-series joins). Cloudflare runs ClickHouse for its analytics dashboard and publishes that it handles ~4 million inserts/second at peak across a cluster. The real engineering challenge is late-arriving events: a mobile client with intermittent connectivity may send events 5 minutes late — Flink's watermarking + allowed lateness handles this, but you must decide how long to hold a window open before emitting, trading freshness against completeness. Dashboards that look live but are secretly stale by 10+ minutes due to a pipeline backlog are the most common production failure mode.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →