System Design

System design fundamentals covering scalability, caching, databases, load balancing, CAP theorem, common patterns, and how to approach design interviews.

10 sections23 cards

The framework

System design interviews are open-ended. There is no single correct answer. What matters is structured thinking, asking the right questions, and making reasonable trade-offs.

Step 1 — Clarify requirements (5 min)

Functional: what does the system do? Non-functional: scale, latency, availability, consistency requirements. Constraints: read-heavy or write-heavy? DAU? Data size?

Step 2 — Estimate scale (3 min)

DAU → requests/second. Storage needed per day/year. Bandwidth. Back-of-envelope math — don't need to be exact.

Step 3 — High-level design (10 min)

Draw the main components: clients, API servers, databases, caches, queues. Don't over-detail yet.

Step 4 — Deep dive (15 min)

Pick the most critical or interesting parts and go deep. The interviewer often guides this.

Step 5 — Wrap up

Discuss bottlenecks, single points of failure, monitoring, future scaling.

Vertical vs Horizontal scaling

Vertical scaling (scale up) — add more CPU/RAM/disk to existing server. Simple — no code changes. Has limits — biggest machine only goes so far. Single point of failure.

Horizontal scaling (scale out) — add more servers. No theoretical limit. Requires load balancing and stateless design. More complex.

Stateless servers — servers don't store session data. Any server can handle any request. Enables horizontal scaling. Store state in shared store (Redis, DB) instead.

Load balancing

Distributes incoming traffic across multiple servers. Removes single point of failure. Enables horizontal scaling.

Algorithms:

Round robin — rotate through servers equally. Simple.

Least connections — send to server with fewest active connections. Better for varying request durations.

IP hash — same client always goes to same server. Useful for session stickiness.

Weighted — servers get traffic proportional to their weight. Useful when servers have different capacity.

Layer 4 (transport) — routes based on IP/TCP. Fast, no content inspection. Layer 7 (application) — routes based on URL, headers, cookies. Smarter, can do SSL termination, content-based routing.

CDN

Content Delivery Network — geographically distributed servers that cache and serve static content close to users.

Push CDN — you push content to CDN when it changes. Predictable, but you manage it. Good for content that changes infrequently.

Pull CDN — CDN fetches from origin on first request, then caches. Simpler to manage. Traffic to origin only on cache miss.

Use for: static files (JS, CSS, images), video streaming, large file downloads. Reduces origin server load and latency for global users.

Cache strategies

Cache-aside (lazy loading) — app checks cache first. On miss, read from DB and write to cache. Most common. Cache only what's requested.

Write-through — write to cache and DB together. Cache always consistent. Slower writes. Good for data that's read soon after write.

Write-behind (write-back) — write to cache immediately, write to DB asynchronously. Fast writes. Risk of data loss if cache fails before DB write.

Read-through — cache handles DB reads. On miss, cache fetches from DB. App only talks to cache.

Cache invalidation strategies: TTL (expire after time), event-based (invalidate on update), versioning (new key for new version).

Redis essentials

In-memory data store. Sub-millisecond reads/writes. Used for caching, sessions, pub/sub, rate limiting, leaderboards, job queues.

Key data structures: String (cache values), Hash (objects), List (queues, feeds), Set (unique collections, tags), Sorted Set (leaderboards, priority queues), HyperLogLog (approximate unique counts).

SET key value EX 3600 — set with TTL in seconds.

INCR counter — atomic increment. Used for rate limiting, counters.

Redis is single-threaded for commands — no race conditions on individual operations. Use Lua scripts or transactions for multi-command atomicity.

Persistence: RDB (snapshots) for backups, AOF (append-only log) for durability. Redis Cluster for horizontal scaling.

Cache problems

Cache stampede (thundering herd) — cache expires, many requests hit DB simultaneously. Solutions: mutex lock on cache miss, probabilistic early expiration, background refresh.

Cache penetration — requests for data that doesn't exist in DB or cache. Attacker can overwhelm DB. Solutions: cache null results with short TTL, bloom filter to check existence before DB query.

Cache avalanche — many cache keys expire at same time, overwhelming DB. Solution: randomize TTLs, circuit breaker, gradual cache warm-up.

Replication

Primary-replica (master-slave) — one primary handles writes. Replicas handle reads. Scales read-heavy workloads. Replication lag means replicas may serve stale data.

Read replicas — great for analytics queries that would slow down the primary.

Multi-primary — multiple primaries accept writes. Handles write scaling and availability. Conflict resolution is complex.

Synchronous replication — primary waits for replica to confirm write. Strong consistency, higher latency. Asynchronous — primary doesn't wait. Better performance, possible data loss on failure.

Sharding

Horizontal partitioning — split data across multiple DB instances (shards). Each shard has a subset of the data.

Sharding strategies:

Hash sharding — hash(key) % num_shards. Even distribution. Can't do range queries across shards.

Range sharding — shard by value range (A-M on shard 1, N-Z on shard 2). Supports range queries. Risk of hotspots.

Directory sharding — lookup table maps keys to shards. Flexible. Lookup table becomes bottleneck.

Problems with sharding: joins across shards are expensive, resharding is painful, transactions across shards are complex. Add sharding only when you actually need it.

SQL vs NoSQL

Use SQL when: data is structured and relational, ACID transactions are needed, complex queries (joins, aggregations), data integrity is critical.

Use NoSQL when: flexible/evolving schema, horizontal scale from the start, simple access patterns, very high write throughput.

NoSQL types: Document (MongoDB) — JSON-like documents. Key-value (Redis, DynamoDB) — simple lookups. Wide-column (Cassandra) — high write throughput, time-series. Graph (Neo4j) — relationship-heavy data.

Most companies use both — SQL for transactional data, NoSQL for specific use cases. Don't pick one as a religion.

CAP explained

In a distributed system, during a network partition you can only guarantee two of three properties:

Consistency (C) — every read receives the most recent write or an error. All nodes see the same data at the same time.

Availability (A) — every request receives a response (not necessarily the latest data).

Partition Tolerance (P) — system continues operating despite network partitions (nodes can't communicate).

Network partitions are unavoidable in distributed systems — so you're always choosing between C and A during a partition.

CP systems (e.g. MongoDB, HBase, Zookeeper) — return error on partition rather than stale data.

AP systems (e.g. Cassandra, DynamoDB, CouchDB) — return possibly stale data rather than error.

CAP is about behavior during partitions only. In normal operation, good systems provide both consistency and availability.

PACELC

CAP only covers behavior during partitions. PACELC extends it to normal operation:

If Partition: choose between Availability and Consistency. Else (normal): choose between Latency and Consistency.

More realistic than CAP — most of the time there's no partition, and the real tradeoff is latency vs consistency. Strong consistency requires coordination between nodes which adds latency.

Example: DynamoDB — PA/EL. Returns available (possibly stale) data during partition, and favors low latency over consistency in normal operation.

Message queues

Decouple producers and consumers. Producer sends messages to queue. Consumer processes at its own pace. Enables async processing, handles traffic spikes, retries on failure.

Use cases: email sending, notifications, image processing, order processing, any task that doesn't need an immediate response.

RabbitMQ — traditional message broker. Complex routing, acknowledgements, dead-letter queues. AMQP protocol.

Kafka — distributed log. High throughput, durable, replay-able. Consumers track their own offset. Ideal for event streaming, analytics, audit logs.

Redis (BullMQ) — lightweight queue on Redis. Good for job queues in smaller apps.

At-most-once: fire and forget. At-least-once: retry until acknowledged (may process duplicates). Exactly-once: hardest, requires idempotent consumers or distributed transactions.

Event-driven architecture

Services communicate via events rather than direct calls. Publisher emits events. Subscribers react independently. Services are decoupled — publisher doesn't know about subscribers.

Event sourcing — store state as a sequence of events rather than current state. Full audit log. Can replay events to rebuild state. Complex to implement.

CQRS (Command Query Responsibility Segregation) — separate read and write models. Writes go to command model (normalized). Reads come from query model (denormalized, optimized). Eventually consistent.

Good for: audit requirements, complex business logic, high read/write ratio imbalance.

REST conventions

Resources are nouns, not verbs. GET /users not GET /getUsers.

HTTP methods: GET (read), POST (create), PUT (replace), PATCH (partial update), DELETE (remove).

Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable, 429 Too Many Requests, 500 Internal Error.

Versioning: URL path /api/v1/ (most common), header Accept: application/vnd.api.v1+json, query param ?version=1.

Pagination: cursor-based (stable, scalable) vs offset-based (simple but slow at scale).

Use plural nouns: /users not /user. Nested resources: /users/:id/orders.

Rate limiting

Protect APIs from abuse and ensure fair usage. Implemented at API gateway or middleware level.

Algorithms:

Token bucket — bucket holds tokens. Each request consumes a token. Tokens refill at a rate. Allows bursting up to bucket size.

Leaky bucket — requests drain at a fixed rate regardless of input rate. Smooths bursts. Implemented as a queue.

Fixed window — count requests per window (e.g. 100/minute). Simple. Edge case: burst at window boundary allows 2x rate.

Sliding window — rolling time window. More accurate than fixed. Slightly more complex to implement.

Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.

Return 429 Too Many Requests when limit exceeded.

Key concepts

Availability — percentage of time the system is operational. 99.9% = 8.7h downtime/year. 99.99% = 52min/year. 99.999% (five nines) = 5min/year.

SLA — Service Level Agreement. Contract with users. SLO — Service Level Objective. Internal target. SLI — Service Level Indicator. The actual measured metric.

Single point of failure (SPOF) — any component whose failure brings down the whole system. Eliminate by adding redundancy.

Redundancy — active-active (both handle traffic) vs active-passive (failover). Active-active gives better utilization. Active-passive gives simpler consistency.

Resilience patterns

Circuit breaker — after N consecutive failures to a dependency, stop calling it and return a fallback immediately. After a timeout, try again. Prevents cascade failures.

Retry with exponential backoff — retry failed requests but wait longer each time. Add jitter (random offset) to prevent all clients retrying simultaneously.

Bulkhead — isolate failures. Separate thread pools or resource limits per dependency. One slow service doesn't exhaust all resources.

Timeout — always set timeouts on outbound calls. A slow dependency should fail fast, not hang forever.

Graceful degradation — when a non-critical feature fails, serve a reduced experience rather than a full error. Show cached data, hide optional features.

Health checks — expose /health endpoint. Load balancers route around unhealthy instances. Liveness (is it running?) vs readiness (is it ready for traffic?).

URL shortener

Requirements: given a long URL, generate a short code. Redirect short URL to long URL.

Core design: generate a unique short code (base62 encoding of an auto-incremented ID, or random 6-8 char string). Store mapping in DB: { short_code, long_url, created_at, user_id }. On redirect: look up code, return 301 (permanent) or 302 (temporary) redirect.

Scale: cache popular redirects in Redis (read-heavy, small data per entry). Use CDN for extremely hot links. DB: any SQL or KV store works.

Considerations: custom short codes, expiry, analytics (click counts, geo), abuse prevention.

Rate limiter

Core: use Redis to track request counts per user/IP per window. INCR user:123:minute:1700000060 with TTL of 60 seconds.

Distributed: Redis is the shared counter store across all API servers. A single Redis instance handles ~100k ops/sec — usually sufficient.

Where to implement: API gateway (centralized, language-agnostic), middleware (per service), or both. Gateway rate limiting for coarse limits, service-level for fine-grained.

Notification system

Components: API server receives notification requests → message queue (Kafka/RabbitMQ) → notification workers → delivery services (APNs for iOS, FCM for Android, SMTP for email, SMS gateways).

Challenges: fan-out (one event → millions of notifications), delivery guarantees, user preferences (opt-out), deduplication (don't send same notification twice), rate limiting per user.

Queue per channel type — separate workers for email vs push vs SMS. Different delivery speeds and failure modes.

Store notifications in DB for inbox feature. Mark as read. Pagination of notification history.

Latency numbers

L1 cache: ~1ns. L2 cache: ~10ns. RAM: ~100ns. SSD random read: ~100μs. Network round trip same DC: ~500μs. HDD seek: ~10ms. Network round trip cross-continent: ~150ms.

Rule of thumb: memory is ~1000x faster than SSD. SSD is ~100x faster than network. Network is ~100x faster than disk. Cache everything you can.

Back of envelope

1 million DAU, 10 requests/day each = 10M requests/day = ~116 req/sec average. Peak is usually 2-3x average = ~300 req/sec.

1 tweet = ~280 chars = ~280 bytes. 100M tweets/day = 28GB/day = ~10TB/year.

1 photo = ~1MB. 1M photos/day = 1TB/day.

Powers of 2 to remember: 2^10 = 1K, 2^20 = 1M, 2^30 = 1B, 2^40 = 1T.

A single server can handle ~1000 concurrent connections. A single Redis can handle ~100k ops/sec. A single Postgres can handle ~10k simple queries/sec.