System Design Library

OTP / 2FA Service

Generate and verify one-time codes for login/2FA securely.

Requirements

Functional

Generate code
Deliver (SMS/email/app)
Verify with expiry
Rate limit attempts

Non-functional

Secure
Short-lived
Abuse-resistant

Scale

Millions of logins

The approach

Generate a short code (or TOTP from a shared secret), store hashed with a short TTL, deliver via the notification system, verify with attempt limits + lockout; rate-limit generation to prevent SMS-pumping fraud.

Key components

Auth → OTP store (TTL) → notification channel · rate limiter

Numbers that matter

TOTP uses 30-second time windows. Servers typically accept codes from ±1 adjacent window, so client and server clocks must agree to within roughly 30 seconds or verification fails; NTP sync on the servers is mandatory.
SMS OTP delivery costs ~$0.005–$0.05 per message (Twilio rates); at 10M logins/day with one SMS each, that's $50,000–$500,000 per day, or roughly $1.5M–$15M per month. TOTP apps have zero marginal cost.
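A 6-digit OTP has 10^6 possibilities; with a 3-attempt limit before lockout, the chance of guessing a single valid code is 3/1,000,000 = 0.0003%. If the server also accepts the previous and next TOTP windows, up to three codes are valid at once and the chance rises to roughly 9/1,000,000.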
App-based TOTP (Google Authenticator, Authy) has no delivery step, so codes are available ~99.9% of the time; SMS OTP delivery averages 94–98% due to carrier delays and number portability issues.
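
The cost and brute-force figures above can be re-derived with a few lines of arithmetic. This is a rough sketch, assuming one SMS per login and a 30-day month; the function names are illustrative, and the multi-code case uses a simple upper-bound approximation rather than an exact combinatorial result.

```python
# Back-of-the-envelope check of the SMS cost and guessing-probability figures.
LOGINS_PER_DAY = 10_000_000
SMS_COST_LOW, SMS_COST_HIGH = 0.005, 0.05  # USD per message, from the text
DAYS_PER_MONTH = 30                         # assumption for the monthly estimate


def sms_cost(logins_per_day: int, cost_per_sms: float, days: int = 1) -> float:
    """Cost in USD, assuming every login sends exactly one SMS."""
    return logins_per_day * cost_per_sms * days


def guess_success_probability(attempts: int, digits: int = 6, valid_codes: int = 1) -> float:
    """Approximate chance that `attempts` distinct guesses hit a valid code.

    Exact when valid_codes == 1; a slight overestimate (union bound) otherwise.
    """
    return min(1.0, attempts * valid_codes / 10 ** digits)


if __name__ == "__main__":
    print(f"per day:   ${sms_cost(LOGINS_PER_DAY, SMS_COST_LOW):,.0f} to "
          f"${sms_cost(LOGINS_PER_DAY, SMS_COST_HIGH):,.0f}")
    print(f"per month: ${sms_cost(LOGINS_PER_DAY, SMS_COST_LOW, DAYS_PER_MONTH):,.0f} to "
          f"${sms_cost(LOGINS_PER_DAY, SMS_COST_HIGH, DAYS_PER_MONTH):,.0f}")
    print(f"3 guesses, 1 valid code:  {guess_success_probability(3):.6%}")
    print(f"3 guesses, 3 valid codes: {guess_success_probability(3, valid_codes=3):.6%}")
```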

Senior deep-dive

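TOTP (RFC 6238) needs no per-code storage on the server: the server derives the same code from the shared secret + current time window, so verification scales horizontally. The only state is the per-user secret and, for replay protection, a short-lived record of recently used windows.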

The shared secret is the crown jewel: it must be stored encrypted at rest. A leaked TOTP secret is as bad as a leaked password because it lets an attacker generate valid codes offline until the secret is rotated.

Rate-limiting and lockout are the real defense: a 6-digit code has only 1,000,000 possibilities; without attempt throttling, brute-forcing a single code window takes seconds.

TOTP mechanics: time-based HMAC, no storage needed

TOTP computes `HMAC-SHA1(secret, floor(unix_time / 30))` and truncates to 6 digits. The server holds the secret, not the code — it recomputes the expected code at verification time. To tolerate clock skew, servers typically accept codes from the previous and next windows (±30s). This means the effective validity window is 90 seconds, which is a known tradeoff — narrowing to ±0 windows tightens security but causes friction for users with drifting clocks.
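
The computation described above can be sketched with the Python standard library alone. This is a minimal illustration of RFC 6238 with HMAC-SHA1, not a hardened implementation; the function names and the `skew_windows` parameter are choices made for this sketch.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, unix_time: float, step: int = 30, digits: int = 6) -> str:
    """Compute the TOTP code for the time window containing unix_time."""
    counter = int(unix_time // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low 4 bits of the last byte pick an offset.
    offset = mac[-1] & 0x0F
    binary = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(binary % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, unix_time: Optional[float] = None,
                step: int = 30, digits: int = 6, skew_windows: int = 1) -> Optional[int]:
    """Return the matched time step, or None if no accepted window matches."""
    if len(submitted) != digits or not (submitted.isascii() and submitted.isdigit()):
        return None
    now = time.time() if unix_time is None else unix_time
    current_step = int(now // step)
    for delta in range(-skew_windows, skew_windows + 1):
        candidate = totp(secret, (current_step + delta) * step, step, digits)
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(candidate, submitted):
            return current_step + delta
    return None
```

Returning the matched time step, rather than a plain boolean, lets the caller record which window was consumed, which is useful for replay prevention.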

SMS OTP: delivery pipeline and failure modes

SMS OTP requires a delivery pipeline: generate code → hash + store with TTL → dispatch via Twilio/SNS → carrier routes to handset → user submits. Each hop can fail: SIM-swapping attacks let an attacker receive the SMS; carrier delays (up to minutes) cause users to request resends, consuming your SMS budget. Use SMS as a fallback, not a primary 2FA channel — TOTP or push-based 2FA (Duo) is more reliable and less expensive.

Storage: what to store and how

For SMS OTP: store `{user_id, code_hash, expiry, attempt_count}`, never the plaintext code, because a DB read shouldn't reveal a valid credential. Since a 6-digit code has only 10^6 values, an unkeyed hash can be reversed by enumeration, so use a keyed hash (HMAC with a server-side key held outside the database). For TOTP: store only the encrypted TOTP secret (AES-256, key in KMS); the current code is derived at verification time and never persisted. Recovery codes should be bcrypt-hashed (a slow hash is appropriate here since volume is low and offline cracking risk is real).
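
A minimal sketch of the SMS OTP record described above, assuming an in-memory dict stands in for the TTL store and a random `SERVER_KEY` stands in for a key fetched from a KMS. The names `OtpRecord`, `issue_code`, `check_code`, and the 5-minute TTL are hypothetical choices for illustration.

```python
import hashlib
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional

SERVER_KEY = secrets.token_bytes(32)  # hypothetical; in practice loaded from a KMS


@dataclass
class OtpRecord:
    user_id: str
    code_hash: bytes
    expiry: float
    attempt_count: int = 0


def hash_code(user_id: str, code: str) -> bytes:
    """Keyed hash, so a database dump alone cannot be enumerated back to codes."""
    return hmac.new(SERVER_KEY, f"{user_id}:{code}".encode(), hashlib.sha256).digest()


def issue_code(store: Dict[str, OtpRecord], user_id: str,
               ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10 ** 6):06d}"  # uniform 6-digit code from a CSPRNG
    store[user_id] = OtpRecord(user_id, hash_code(user_id, code), now + ttl_seconds)
    return code  # handed to the delivery channel, never persisted in plaintext


def check_code(store: Dict[str, OtpRecord], user_id: str, submitted: str,
               max_attempts: int = 3, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    record = store.get(user_id)
    if record is None or now >= record.expiry or record.attempt_count >= max_attempts:
        store.pop(user_id, None)  # expired or exhausted: a fresh code is required
        return False
    record.attempt_count += 1
    if hmac.compare_digest(record.code_hash, hash_code(user_id, submitted)):
        del store[user_id]  # single use: a verified code cannot be replayed
        return True
    return False
```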

Rate limiting and lockout: the non-negotiable controls

3–5 attempts per code window before lockout is the standard; lockout should be per-user not per-IP (shared IPs shouldn't block all users). Resend rate limits prevent SMS pumping fraud: limit to 3 sends per hour per phone number and 10 per day. Flag anomalies — a user requesting OTPs from 5 different countries in an hour is a takeover signal. Exponential backoff on lockout (1 min, 5 min, 30 min) is better than flat lockout for UX.
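
The two controls above, resend limits per phone number and per-user lockout with escalating delays, can be sketched as follows. This is an in-memory illustration under the limits quoted in the text (3 sends per hour, 10 per day, lockouts of 1, 5, and 30 minutes); the class names are hypothetical, and a production system would keep this state in a shared store.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

HOUR, DAY = 3600, 86400
LOCKOUT_STEPS = [60, 300, 1800]  # 1 min, 5 min, 30 min


class SendLimiter:
    """Sliding-window limit on OTP sends per phone number."""

    def __init__(self, per_hour: int = 3, per_day: int = 10) -> None:
        self.per_hour = per_hour
        self.per_day = per_day
        self._sends: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, phone: str, now: float) -> bool:
        sends = self._sends[phone]
        while sends and now - sends[0] >= DAY:  # drop sends older than a day
            sends.popleft()
        in_last_hour = sum(1 for t in sends if now - t < HOUR)
        if in_last_hour >= self.per_hour or len(sends) >= self.per_day:
            return False
        sends.append(now)
        return True


class AttemptLockout:
    """Per-user lockout that escalates on each repeated lockout."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._failures: Dict[str, int] = defaultdict(int)
        self._lockouts: Dict[str, int] = defaultdict(int)
        self._locked_until: Dict[str, float] = {}

    def is_locked(self, user_id: str, now: float) -> bool:
        return now < self._locked_until.get(user_id, 0.0)

    def record_failure(self, user_id: str, now: float) -> None:
        self._failures[user_id] += 1
        if self._failures[user_id] >= self.max_attempts:
            level = min(self._lockouts[user_id], len(LOCKOUT_STEPS) - 1)
            self._locked_until[user_id] = now + LOCKOUT_STEPS[level]
            self._lockouts[user_id] += 1
            self._failures[user_id] = 0

    def record_success(self, user_id: str) -> None:
        self._failures.pop(user_id, None)
        self._lockouts.pop(user_id, None)
        self._locked_until.pop(user_id, None)
```

Keying the lockout on the user rather than the IP matches the point above about shared IPs; the send limiter is keyed on the phone number because that is what the fraud targets.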

Replay prevention: the used-code window

TOTP codes are valid for up to 90 seconds (with skew tolerance), so an intercepted code can be replayed during that window. Prevention: after a successful verification, mark that user's time-window slot as used in a small cache (for example a Redis key whose TTL covers the full 90-second acceptance window; a 30-second TTL could expire while the code is still accepted). Any reuse is rejected. The cache is small and bounded (one entry per user who verified recently), and RFC 6238 itself requires that a verifier not accept the same OTP a second time.
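
The used-code cache can be sketched as below. This in-memory version is only an illustration; with Redis, the same check-and-mark step is typically a single atomic `SET key value NX EX 90`. The class and method names are hypothetical, and the 90-second TTL assumes 30-second steps with ±1 window of skew.

```python
from typing import Dict, Tuple


class UsedStepCache:
    """Remembers (user, time step) pairs that were already accepted."""

    def __init__(self, ttl_seconds: int = 90) -> None:
        self.ttl = ttl_seconds
        self._expires: Dict[Tuple[str, int], float] = {}

    def mark_if_unused(self, user_id: str, time_step: int, now: float) -> bool:
        """Return True and mark the slot if unused; False if it is a replay."""
        # Drop expired entries so the cache stays bounded.
        self._expires = {k: exp for k, exp in self._expires.items() if exp > now}
        key = (user_id, time_step)
        if key in self._expires:
            return False
        self._expires[key] = now + self.ttl
        return True
```

A caller would first verify the code, then call `mark_if_unused` with the matched time step and treat a `False` result as a failed login.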

What breaks at scale

SMS pumping fraud is the top production incident: attackers trigger your OTP SMS endpoint at scale toward phone numbers they control (often premium-rate ranges) and collect a share of the carrier revenue, so you pay for every message. Mitigate with velocity checks on the phone number, CAPTCHA before dispatch, and carrier fraud networks (Twilio's Verify product includes this). Clock drift between microservice instances causes TOTP verification to fail intermittently if NTP is unreliable in your infra; monitor clock offset and alert at >10s drift. Shared TOTP secret exposure through a backup/export vulnerability is a silent catastrophe: audit who can read the secrets table and rotate secrets on any suspected breach.

In production

Google, GitHub, and AWS all support TOTP (RFC 6238) as a 2FA method, which typically means storing an encrypted TOTP secret per user and verifying with server-side HMAC-SHA1. Twilio Verify and AWS SNS back SMS OTP delivery at scale; both abstract carrier routing, delivery receipts, and international number handling. Okta and Duo add device trust and adaptive authentication (skip 2FA for trusted devices) on top of basic TOTP. The real operational challenge is account recovery: users who lose their TOTP device lock themselves out, so recovery codes (pre-generated one-time-use codes issued at 2FA enrollment) must be offered and stored securely hashed.
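
Recovery codes can be sketched with the standard library as below. Note the substitution: the text recommends bcrypt, which is a third-party package in Python, so this sketch uses `hashlib.scrypt` (another deliberately slow hash, available when Python is built against OpenSSL 1.1 or later). The function names, code length, and scrypt parameters are assumptions for illustration.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional, Tuple


def generate_recovery_codes(count: int = 10, nbytes: int = 8) -> List[str]:
    """Random one-time codes, shown to the user once at enrollment."""
    return [secrets.token_hex(nbytes) for _ in range(count)]


def hash_recovery_code(code: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using scrypt as the slow hash."""
    salt = secrets.token_bytes(16) if salt is None else salt
    digest = hashlib.scrypt(code.encode(), salt=salt, n=2 ** 14, r=8, p=1, dklen=32)
    return salt, digest


def redeem(stored: List[Tuple[bytes, bytes]], submitted: str) -> bool:
    """Consume a matching recovery code; each code works only once."""
    for i, (salt, digest) in enumerate(stored):
        _, candidate = hash_recovery_code(submitted, salt)
        if hmac.compare_digest(candidate, digest):
            del stored[i]  # one-time use
            return True
    return False
```

Only the `(salt, digest)` pairs would be persisted; the plaintext codes are displayed to the user and then discarded.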

Common mistakes

Long-lived or reusable codes
No attempt limit (brute force)
Unthrottled SMS generation (fraud)
