System Design Library

Data Warehouse (Snowflake)

Run analytical queries over petabytes with elastic compute.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Load/transform (ETL/ELT)
SQL analytics
Concurrent workloads

Non-functional

Fast aggregations over PB
Elastic compute
Cost-efficient

Scale

Petabytes

The approach

Columnar storage (read only needed columns) on cheap object storage; separate compute (virtual warehouses) from storage so each scales independently; massively parallel query execution; result + metadata caching.

Key components

Object storage (columnar) ↔ elastic compute clusters · query engine

Numbers that matter

Columnar compression ratios of 5–20× are typical for analytical data (high-cardinality strings compress less; timestamps and integers compress extremely well with delta/RLE encoding), reducing both storage cost and I/O
Snowflake's result cache serves repeated queries in <100ms regardless of data size — the most impactful optimization for dashboard workloads where the same query runs every 60 seconds
MPP (massively parallel processing) query execution scales near-linearly: a query taking 60 seconds on a 2-node warehouse takes ~15 seconds on an 8-node warehouse for I/O-bound workloads — doubling nodes roughly halves runtime until CPU becomes the bottleneck
Cross-region data transfer costs (~$0.09/GB on AWS) are the hidden cost of a 'separate storage from compute' architecture when compute and storage regions differ — keeping them co-located in the same cloud region is a non-obvious operational requirement

Senior deep-dive

Separating storage from compute is the architectural innovation that made cloud data warehouses viable — object storage (S3) is orders of magnitude cheaper than SSD-attached compute, and scaling compute independently means you pay only for query time, not idle disk.

Columnar storage with vectorized execution is why analytical queries are 10–100× faster than row-oriented databases for the same data — reading only the columns a query touches reduces I/O by 90%+ on wide tables.

Query compilation, not data movement, is the latency bottleneck at the start of a Snowflake query — cold virtual warehouse startup (spinning up compute nodes) takes 5–30 seconds, dominating query time for short-running queries on infrequently-used warehouses.

Columnar storage: read only what the query needs

In a row-oriented store (PostgreSQL, MySQL), a query `SELECT revenue FROM orders WHERE date > '2024-01-01'` reads every column of every matching row, even though only `revenue` and `date` are needed. A columnar store keeps each column in its own file segment — the same query reads only 2 column files instead of all 20+. On a 1TB table with 20 columns, this reduces I/O from 1TB to ~100GB. Columnar encoding (dictionary encoding for strings, delta encoding for timestamps, RLE for repeated values) further compresses each column by 5–20×. Zone maps (min/max metadata per file segment) enable predicate pushdown — skipping entire segments when the filter value falls outside the min/max range without reading a byte.

Virtual warehouse: ephemeral compute against durable storage

A virtual warehouse is a fleet of EC2 instances (in Snowflake's case) that cache data locally (the local SSD is a read-through cache of S3 data) and execute queries in parallel. Multiple warehouses can query the same S3 data simultaneously with no resource contention — isolation is compute-side only. Auto-suspend after N minutes of inactivity drops the compute fleet; auto-resume on the next query restarts it. Cold start (EBS/SSD cache is empty) means the first query reads all data from S3 (~2–5× slower); warm caches dramatically reduce subsequent query times. Cache warming (pre-running a representative query set before business hours) is a common operational practice for latency-sensitive dashboard workloads.

Query optimization: the planner does more work than the executor

The query optimizer in a columnar data warehouse (Snowflake, BigQuery, Redshift) performs: statistics-based cardinality estimation (how many rows will survive each filter), join reordering (smaller table on the build side of a hash join), predicate pushdown (filter before joining), and projection pruning (read only referenced columns). These decisions are made before a single byte is read. A poorly written query that joins before filtering can be 100× slower than an equivalent well-written query — the optimizer doesn't always fix this. Statistics staleness (column stats not refreshed after a large load) causes the optimizer to make bad cardinality estimates, leading to suboptimal join plans that spill to disk.

Partitioning and clustering control scan cost

Without partitioning, every query scans all data. Partition pruning uses a partition key (e.g. `date`) to skip partitions that cannot match the query's filter — a query for January data skips 11/12 of a year-partitioned table. In Snowflake, this is called micro-partition pruning (automatic 50–500MB file segments with zone maps). In BigQuery, partitioned tables require an explicit `PARTITION BY` clause; clustered tables additionally sort data within partitions by cluster key columns, enabling zone map pruning on non-partition columns. The combination of partitioning + clustering reduces bytes scanned by 90%+ on well-designed schemas — and in pay-per-scan pricing models like BigQuery, this directly maps to cost reduction.

Concurrency: multiple warehouses avoid query queuing

A single warehouse has a fixed concurrency limit (~8–32 simultaneous queries depending on size). When a long-running ETL query occupies a large warehouse, short dashboard queries queue behind it — a classic head-of-line blocking problem. The solution is workload isolation via separate warehouses: one large warehouse for ETL/heavy transforms, one medium warehouse for analysts, one small auto-scaling warehouse for dashboards. Auto-scaling (Snowflake's multi-cluster warehouse) adds nodes automatically when queries queue, up to a max cluster count. Query tagging and resource monitors enforce per-warehouse credit budgets and kill runaway queries before they cause cost incidents.

What breaks at scale

The silent production failure is data skew in distributed joins: if one partition of a large fact table has 10× more rows than others (e.g. a single customer who places 40% of all orders), the worker node handling that partition becomes the bottleneck and the query takes 10× longer than expected. MPP warehouses report this as 'skewed execution' in query profiles. The fix: salting the join key (appending a random suffix to the skewed key, replicating the dimension table) distributes the load. The second failure is schema-on-read gone wrong: a change to upstream data format (a new JSON field name) silently produces NULLs in a derived column, corrupting downstream dashboards for days before anyone notices. Data quality assertions (dbt tests, Great Expectations) in the transformation pipeline are the only reliable catch.

In production

Snowflake pioneered the shared-disk, separate-compute model: all virtual warehouses read from the same S3-backed storage, enabling zero-copy data sharing between accounts. Google BigQuery uses a serverless model (no warehouse sizing) with Dremel's columnar execution engine and Capacitor file format; you pay per TB scanned, which forces users to partition and cluster tables to control costs. Amazon Redshift offers a hybrid: columnar storage with optional concurrency scaling and Spectrum for querying S3 directly. The real operational challenge is cost governance — a runaway query on a large warehouse can scan petabytes and generate a $10,000 bill in 10 minutes; per-query cost limits and resource monitors are production necessities, not nice-to-haves.

Common mistakes

Row-store for analytics
Coupling compute to storage (no elasticity)
SELECT * over wide columnar tables

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →