System Design Library

Distributed Job Scheduler (cron)

Run scheduled & one-off jobs reliably across a cluster, exactly-once-ish, at scale.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Schedule recurring & delayed jobs
Distribute to workers
Retries
At-least-once + idempotent

Non-functional

No missed/duplicate runs
Survive worker death
Scales to millions of jobs

Scale

Millions of scheduled jobs

The approach

Durable job store + a time-index (timer wheel / sorted set by next-run); a scheduler (leader-elected) enqueues due jobs to a queue; workers pull, run, ack; visibility timeouts + retries handle crashes; idempotency keys dedupe.

Key components

Job store + time index → scheduler (leader) → queue → workers

Numbers that matter

A timer wheel with 1-second resolution across a 24-hour horizon needs only 86,400 buckets — trivial memory, O(1) insert and tick.
Leader lease TTL of 5–15 seconds is the standard for scheduler leader election — short enough to recover quickly from leader death, long enough to avoid false failovers on GC pauses.
At 1M scheduled jobs, a Redis sorted set (ZSET by next-run timestamp) handles range queries for due jobs in O(log N + K) where K is jobs due — sub-millisecond at this scale.
Visibility timeout of 30 seconds to 5 minutes is the typical SQS / Celery default — tuned to the 99th-percentile job duration to avoid false requeues while still recovering from crashes quickly.

Senior deep-dive

Exactly-once execution is impossible; at-least-once with idempotent jobs is the correct target — a scheduler can guarantee a job is enqueued at least once, but only the job itself can make re-execution safe.

The time-index (sorted set or timer wheel) is trivial; leader election for the poller is the hard part: two scheduler instances both firing the same job is a split-brain scenario that idempotent jobs paper over, but it's still double work — a single elected leader polls the time-index, backed by a lease that expires if the leader crashes.

Visibility timeout is the correctness mechanism for workers: a worker that crashes mid-job must have its in-flight task requeued automatically after a deadline — without this, jobs disappear silently.

Time-index design: sorted set is the right primitive

Jobs stored in a Redis ZSET scored by next-run epoch timestamp give O(log N) insert and O(log N + K) range query for due jobs via ZRANGEBYSCORE 0 <now>. The poller runs ZRANGEBYSCORE + ZREM in a Lua script (atomic) to claim jobs without race conditions. A DB table with an index on next_run works at smaller scale and survives Redis restarts, but the sorted set pattern is simpler and faster. Timer wheels are better for in-process scheduling (Netty, Kafka delayed messages) but don't survive process restarts.

Leader election: who pulls the trigger

Running multiple scheduler instances for HA is correct; having multiple instances all fire the same job is not. Leader election via a distributed lock (Redis SETNX with TTL, or etcd lease) designates one instance as the active poller. The leader renews its lease every T/2 seconds; if it dies, the lease expires and another instance wins. The lease TTL is the MTTR for a leader crash — 10 seconds is typical. Critically, the leader must check its lease is still valid before firing each batch — a GC pause longer than the TTL can cause a zombie leader to fire jobs after it's been superseded.

At-least-once delivery: idempotency is the worker's job

The scheduler guarantees it will enqueue a job at least once — visibility timeouts cause redelivery, leader failover can cause double-enqueue. Workers must be idempotent: the same job payload executed twice must produce the same outcome as executing it once. Patterns: check-then-act with a unique job-execution ID in the DB (if it's already marked complete, skip); upsert instead of insert; idempotency keys on external side effects (emails, charges). The scheduler should embed a unique execution ID in each enqueue so workers can deduplicate.

Recurring jobs: RRULE computation vs. pre-expansion

Pre-expanding a cron expression into 1,000 future rows at creation time is simple but explodes storage for high-frequency jobs. Computing the next-run time at dispatch is better: after firing a job, compute next_run = cron_next(now, expression) and reinsert into the ZSET. Daylight saving time is a bug source: 'every day at 2:30am' can fire twice or not at all during DST transitions — always compute next_run in the job's specified time zone, not UTC, and use a proper cron library (e.g., croniter, Quartz CronExpression) that handles DST correctly.

Visibility timeout: the correctness mechanism for worker crashes

When a worker pulls a job, the job becomes invisible to other workers for the visibility timeout period. If the worker completes and deletes the job, it's done. If the worker crashes, the timeout expires and the job reappears in the queue for redelivery. Set the timeout to P95 job duration + a safety margin — too short causes false redeliveries while the job is still running; too long causes jobs to pile up when workers crash. Workers doing long-running jobs must heartbeat to extend the visibility timeout or risk being requeued mid-execution.

What breaks at scale

Job thundering herd at the top of the minute: if 100,000 cron jobs are all scheduled for ':00' (e.g., '0 '), the ZRANGEBYSCORE query at T=00:00 returns all 100k, the poller enqueues them in a burst, and the worker pool saturates. Spread jobs with jitter (randomize the second offset at registration time) to distribute load. The second failure: the poller falls behind — next_run times pile up in the past, and the ZRANGEBYSCORE query returns millions of overdue jobs, causing the poller to loop trying to drain the backlog while new jobs also pile up. Circuit-break the poller if lag exceeds a threshold and alert — don't let it try to catch up indefinitely.

In production

AWS EventBridge Scheduler and Celery Beat use a leader-elected poller against a persistent store (Redis ZSET or a DB table with a next_run index); the poller enqueues due jobs to SQS/RabbitMQ/Kafka for workers to pull. Quartz Scheduler (Java) uses database row locking as its distributed coordination mechanism — the scheduler that acquires the lock on the job fires it. The real engineering challenge is clock skew: if two scheduler nodes disagree on the current time by 500ms, a job due at T may fire twice (once by each node thinking it's 'due now') — use the DB server's clock, not the application server's clock, as the reference time for due-date comparisons.

Common mistakes

Polling the whole job table for due jobs
No idempotency (double execution)
Single scheduler SPOF (no leader election/failover)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →