
System Design Interview Questions

The system design prompts senior interviewers actually ask — design a URL shortener, a rate limiter, a news feed, a chat app — with the framework, capacity math, and trade-offs that separate a senior signal from a junior one.

System design rounds at the senior+ level test the same dozen patterns over and over: a write-heavy timeline, a read-heavy feed, a rate limiter, a coordination service, a cache. The interviewer doesn't want a CS-paper architecture; they want to see whether you ask about scale before drawing boxes, whether you have intuition for the right capacity numbers, and whether you can name the trade-offs of every component you reach for.

Below are 10 prompts that cover most of the senior system-design surface area. Each section gives the framework, the back-of-envelope numbers, the design, and the follow-ups senior interviewers actually push on.

If you'd rather work through these out loud against an AI interviewer that asks for capacity estimates and pushes on your trade-offs, run a system design mock.

1. Design a URL shortener

Clarify before drawing. Read:write ratio? (Heavy reads — at least 100:1 for a public service.) Custom slugs? Analytics on each click? Latency budget? (~50ms p99 for redirects.) Daily new URLs? (Assume 100M new URLs/day at scale.)

Capacity. 100M URLs/day × 365 days × 5 years ≈ 180B URLs. At 7 bytes per encoded slug (base62, 62^7 ≈ 3.5T combinations) that's about 1.3TB of slugs alone, plus the actual URLs. Writes average ~1.2K QPS; at 100:1, reads run around 120K QPS and higher at peak. The redirect path is read-only and cacheable.

Design.

  • Write path. A coordinator service calls a slug-generation service (either base62 of a 64-bit ID from a centralized counter / Snowflake, or a hash of the URL with collision retry); the base62 option is sketched after this list. Persist (slug, long_url, created_at, owner_id) to a primary key-value store (DynamoDB / Cassandra / Spanner — sharded by slug).
  • Read path. CDN edge cache → application cache (Redis) → DB. The redirect itself is a 301 (cacheable forever) or a 302 (not cached by clients, so the mapping can change later and every click still reaches your servers). 301 is cheaper to serve but harder to update; pick based on the use case.
  • Custom slugs. A second write path with conflict detection — if INSERT IF NOT EXISTS fails, return error.
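
A minimal sketch of the base62 step in the write path above — turning a 64-bit ID from the counter / Snowflake into a slug. The function name and alphabet ordering are illustrative assumptions, not a specific library.

```python
# Illustrative base62 encoder for a 64-bit ID; 62^7 ≈ 3.5T, so 7 characters
# comfortably cover ~180B URLs.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

# e.g. encode_base62(180_000_000_000) yields a 7-character slug, unique per ID.
```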

Follow-ups. How would you do analytics on each click? (Async write to a clickstream topic, batch-aggregate.) How would you handle a celebrity URL getting hot? (Cache the redirect aggressively — Redis plus long TTLs at the edge; the common interview trap is forgetting the edge tier.) How would you delete URLs? (Tombstone in the KV store; let a background sweeper expire them.)

2. Design Twitter / a public timeline

Clarify before drawing. Mostly read or write? (Read — about 100:1 for a public-feed product.) Per-user timeline? Global feed? Real-time delivery? Search?

Capacity. ~500M users, 100M DAU writing a few tweets each → ~600M tweets/day, peak ~10K writes/sec. Reads dominate: ~5M reads/sec at peak.

Design. The core trade-off is fan-out on write vs. fan-out on read.

  • Fan-out on write (push). When user X tweets, push the tweet ID into the inbox of every follower. Reads are O(1) per user (just read your inbox). Writes are O(followers). Catastrophic for celebrities with 50M followers.
  • Fan-out on read (pull). When user Y opens the app, read the latest tweets of everyone Y follows and merge. Reads are expensive but writes are cheap.
  • Hybrid (the answer interviewers want). Default to push for the median user (small fanout), but pull-on-demand for celebrity tweets. Store celebrities' tweets separately and merge them into the inbox at read time.
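
A sketch of the hybrid read path, assuming illustrative inbox_store / tweet_store / graph interfaces (not a real API): the precomputed inbox from fan-out on write is merged with celebrity tweets pulled at read time.

```python
import heapq
import itertools

def read_timeline(user_id, inbox_store, tweet_store, graph, limit=50):
    # Precomputed inbox from fan-out on write: [(timestamp, tweet_id), ...], newest first.
    inbox = inbox_store.get(user_id)
    # Celebrities are excluded from fan-out; pull their recent tweets on demand.
    celeb_streams = [
        tweet_store.recent_by_author(author_id, limit)   # same (timestamp, tweet_id) shape
        for author_id in graph.followed_celebrities(user_id)
    ]
    # K-way merge on timestamp, newest first; inputs are already sorted that way.
    merged = heapq.merge(inbox, *celeb_streams, key=lambda t: t[0], reverse=True)
    return list(itertools.islice(merged, limit))
```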

Other components. Tweet store (sharded by tweet ID), user graph (sharded by user ID), inbox cache (Redis), search index (Elasticsearch), media store (S3 + CDN). Real-time delivery uses websockets backed by a fan-out service that subscribes to the tweet topic.

Follow-ups. What happens when a celebrity tweets? (Don't fan-out; pull at read time.) How do you handle tweet edits? (Versioned record; index includes version.) How do you guarantee chronological order across shards? (You don't — clients sort the merged result. Interviewers expect this answer.)

3. Design a rate limiter

Clarify before drawing. Per-user, per-IP, or per-API-key? Distributed across many app servers, or single-process? Burst tolerance, or strict average? What's the action on limit exceeded — drop, queue, or 429?

Capacity. The rate limiter is often the one component that sits in the path of every API request. At 100K RPS, it has to add <1ms.

Design. Three common algorithms:

  • Fixed window. Count requests per user per minute; reject if count > limit. Simple but allows 2× the limit at a window boundary (a user can send limit at 12:00:59 and limit again at 12:01:00).
  • Sliding window log. Per user, store a log of timestamps. On each request, drop expired timestamps and compare count to limit. Exact but expensive — O(limit) memory per user.
  • Token bucket (the answer interviewers want most often). Each user has a bucket that refills at a constant rate up to a max. Each request consumes a token. Smooth burst behavior, O(1) state per user (just tokens and last_refill).

For distributed rate limiting, the bucket lives in Redis with INCRBY + EXPIRE for atomicity, or use a script (Lua) to make refill+consume atomic.
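
A minimal single-process token-bucket sketch, assuming one in-memory bucket per key; in the distributed case, the same refill-and-consume step is exactly what has to run atomically in Redis (e.g. inside a Lua script).

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec          # refill rate
        self.capacity = capacity          # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```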

Follow-ups. What happens on a Redis outage — fail open or closed? (Fail open is the textbook answer — never block legitimate traffic on infra failure unless the workload is security-critical.) How do you avoid the rate-limiter becoming the SPOF? (Local-first with sync — each app server holds a small bucket and syncs to Redis async; trade exactness for availability.) What about IPv6 with effectively unlimited address space? (Group by /64 for IPv6, /24 for IPv4, with optional finer-grained limits per authenticated user.)

4. Design a news feed (Facebook / Instagram)

Clarify before drawing. Personalized ranking or chronological? Read latency budget? Fan-out scale? Dwell-time signals?

Capacity. 1B DAU × 3 sessions × 20 cards = 60B feed reads/day — roughly 700K QPS on average, and higher at peak.

Design. Three layers:

  • Candidate generation. For user Y, gather candidate posts from people they follow, groups they're in, and a few global sources. Hundreds to a few thousand candidates per request.
  • Ranking. Each candidate gets a score from a model (engagement probability, recency, relationship strength). Top ~100 returned to the client.
  • Diversification + business rules. Drop near-duplicates, rate-cap any one source, inject ads.
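
A compressed sketch of the three layers above; the sources and ranker interfaces and the per-author cap are illustrative assumptions, not a production ranking stack.

```python
def build_feed(user_id, sources, ranker, limit=100, per_author_cap=3):
    # 1. Candidate generation: hundreds to a few thousand posts from follows, groups, global pools.
    candidates = [post for source in sources for post in source.candidates(user_id)]
    # 2. Ranking: model score per candidate (engagement probability, recency, affinity).
    ranked = sorted(candidates, key=lambda p: ranker.score(user_id, p), reverse=True)
    # 3. Diversification + business rules: cap any one author, stop at the page size.
    feed, per_author = [], {}
    for post in ranked:
        if per_author.get(post.author_id, 0) >= per_author_cap:
            continue
        per_author[post.author_id] = per_author.get(post.author_id, 0) + 1
        feed.append(post)
        if len(feed) == limit:
            break
    return feed
```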

The first layer is the system-design problem; the second is the ML problem. For candidate generation, follow the same fan-out-on-write vs. fan-out-on-read trade-off as Twitter.

Critical infra. Hot users' feeds are precomputed and cached; cold users' feeds are computed on read. The feed cache TTL is short (minutes) so new content can surface quickly.

Follow-ups. How do you handle a rapidly growing graph? (Async re-fanout on follow change.) How do you ensure the cache doesn't go stale on viral content? (Push invalidations from the post-creation pipeline.) How do you A/B-test ranking changes? (Stable user-bucket hashing into experiments at the candidate-generation layer.)

5. Design a chat application (WhatsApp / iMessage)

Clarify before drawing. 1:1 or group chats? Group size cap? Online/offline delivery? End-to-end encryption? Message ordering across clients?

Capacity. 500M DAU, 50 messages/user/day → 25B messages/day, peak ~1M writes/sec.

Design.

  • Connection layer. Long-lived WebSocket per online user, terminated on a fleet of edge servers. Each edge server holds a stateful registry of user_id → connection_id. Distributed presence service tracks which edge server holds which user.
  • Message router. When user A sends to user B, the message hits A's edge server. Router looks up B's edge server, forwards via internal pubsub, B's edge server pushes the message over the WebSocket. Round-trip <50ms.
  • Persistence. Write the message to a sharded message store keyed by conversation ID, with a per-conversation monotonic sequence number for ordering. The sender writes to persistence first (durability), then routes — so an edge crash doesn't lose data.
  • Group chats. Same pattern, but the router fans out to N recipients (N is typically capped at 256 or 1024).
  • Offline delivery. When recipient is offline, message is stored in their inbox; on reconnect, edge server pulls and pushes the backlog.
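
A sketch of the send path described above; store, presence, and pubsub are assumed interfaces, and the persist-then-route ordering is the point.

```python
def send_message(sender_id, conversation_id, body, store, presence, pubsub):
    # 1. Persist first so an edge-server crash cannot lose the message.
    seq = store.next_sequence(conversation_id)        # per-conversation monotonic counter
    msg = store.append(conversation_id, seq, sender_id, body)
    # 2. Route: look up each recipient's edge server and push over internal pubsub.
    for recipient in store.participants(conversation_id):
        if recipient == sender_id:
            continue
        edge = presence.edge_server_for(recipient)
        if edge is not None:                          # online: deliver over the WebSocket
            pubsub.publish(edge, recipient, msg)
        else:                                         # offline: queue in the recipient's inbox
            store.enqueue_offline(recipient, msg)
    return msg
```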

Follow-ups. End-to-end encryption (Signal protocol — every conversation has a forward-secret double ratchet; the server stores ciphertext only). Read receipts (separate event channel; same routing). Multi-device (each device has its own connection; per-device sequence numbers + de-dupe at the client).

6. Design a distributed cache (Redis / Memcached at scale)

Clarify before drawing. Read-heavy or write-heavy? Eviction policy? Persistence? Consistency requirements? Multi-region?

Capacity. 10TB of cached data, 1M GET/sec, 100K SET/sec at peak.

Design.

  • Sharding. Consistent hashing of key → node, with ~200 virtual nodes per physical node so load stays even and a departing node's keys are spread across all remaining nodes instead of dumped on one neighbor (sketched after this list).
  • Replication. Each shard has 1 primary + 2 replicas. Writes go to primary, replicate async; reads can hit any replica with a stale-read tolerance.
  • Eviction. LRU is the default. Approximated LRU (random sample, evict the oldest of the sample) is what Redis actually does in production — it's O(1) and gets within ~5% of true LRU.
  • Failure handling. Promote a replica to primary on primary failure. Coordination via a small consensus group (etcd / Zookeeper / built-in Sentinel for Redis).
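
A minimal consistent-hash ring with virtual nodes for the sharding bullet above; the hash function and vnode count are illustrative choices.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=200):
        self.ring = []                                # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's hash; wrap around at the end of the ring.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```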

Follow-ups. How do you handle a hot key (one key getting 30% of all traffic)? (Replicate the key across multiple shards, route reads with a coin flip; or keep it client-side if read-only.) Cache stampedes when a popular key expires? (Probabilistic early refresh — on each read, rebuild early with a probability that rises as the remaining TTL shrinks, so one reader refreshes the value before it expires.) How do you keep cross-region caches consistent? (Don't, at the cache layer. Fix it at the source — write-through to the DB and invalidate caches.)

7. Design a job scheduler / cron-at-scale

Clarify before drawing. Recurring or one-shot jobs? Time precision (minute, second, millisecond)? Job count and concurrency? At-least-once or at-most-once delivery?

Capacity. 10M scheduled jobs, peak 50K jobs/min triggering simultaneously.

Design.

  • Job storage. Sharded by next_run_time modulo N to spread load. Each row carries job_id, next_run_time, cron_expr, state.
  • Scheduler. A pool of workers, each owning a slice of the time-space. Worker polls "all jobs with next_run_time ≤ now+lookahead" and locks each one for execution. Atomic lock via DB row update with a version check — sketched after this list.
  • Execution. Worker dispatches the job onto a task queue (Kafka / SQS) with the actual work as the payload. Recompute next_run_time and write back.
  • Retries. If the job fails, exponential backoff on next_run_time. Cap retries; dead-letter the rest.
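
A sketch of the scheduler's poll-and-claim step, assuming a generic db handle with query/execute and the job columns named above; the optimistic version check is what keeps two workers from running the same job.

```python
import time

LOOKAHEAD_SEC = 60

def claim_due_jobs(db, worker_id, limit=100):
    now = time.time()
    due = db.query(
        "SELECT job_id, version FROM jobs "
        "WHERE state = 'scheduled' AND next_run_time <= %s "
        "ORDER BY next_run_time LIMIT %s",
        (now + LOOKAHEAD_SEC, limit),
    )
    claimed = []
    for job_id, version in due:
        # The UPDATE only succeeds if nobody else claimed the row first.
        rows = db.execute(
            "UPDATE jobs SET state = 'running', owner = %s, version = version + 1 "
            "WHERE job_id = %s AND version = %s",
            (worker_id, job_id, version),
        )
        if rows == 1:
            claimed.append(job_id)
    return claimed
```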

Follow-ups. What if a worker dies mid-execution? (Lock has TTL; another worker picks up after expiry. Trade-off: too short and you double-execute; too long and recovery is slow.) How do you handle "every minute" precision at 10M jobs? (Move scheduling out of the DB into a sorted in-memory data structure on the worker, sync periodically.) How do you support "as soon as possible"? (Special priority lane with no next_run_time math, just FIFO.)

8. Design a leaderboard

Clarify before drawing. Global or per-region? Per-day or all-time? Top N or full ranking? Score updates per second?

Capacity. 10M users, 100K score updates/sec at peak, query top-100 ~10K QPS.

Design.

  • Sorted set in Redis. ZADD for updates, ZRANGE for top-N, ZRANK for "where am I?" — all O(log N), sketched after this list. One sorted set per leaderboard scope (global, per-day, per-region).
  • Sharding for very large boards. If 10M scores in one sorted set is too big, shard by user ID modulo K and merge top-N from each shard at query time (K-way merge of K sorted streams).
  • Persistence. Sorted set in Redis is the read path; an append-only log of score updates is the write path for durability.
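
A minimal redis-py sketch of the sorted-set operations above; the key naming scheme is an assumption.

```python
import redis

r = redis.Redis()

def record_score(board: str, user_id: str, score: float) -> None:
    r.zadd(f"lb:{board}", {user_id: score})

def top_n(board: str, n: int = 100):
    # Highest scores first, with scores attached.
    return r.zrevrange(f"lb:{board}", 0, n - 1, withscores=True)

def rank_of(board: str, user_id: str):
    rank = r.zrevrank(f"lb:{board}", user_id)   # 0-based, highest score = 0
    return None if rank is None else rank + 1
```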

Follow-ups. How do you handle ties? (Tie-break on user ID for stable ordering.) How do you do "your rank" when there are 10M users? (On a single sorted set, ZRANK is exact and O(log N); across shards, approximate from per-shard score histograms or a quantile sketch and refine only if the UI needs it.) Time-windowed leaderboards (hourly/daily)? (One sorted set per window; expire old windows.) Anti-cheating? (Score updates go through a validation service before reaching Redis — score deltas above a threshold get flagged.)

9. Design a search autocomplete

Clarify before drawing. What's "trending"? How fast does the prefix tree need to update? Cold start for new queries? Personalized vs. global?

Capacity. 100M unique queries seen, ~10K query-prefix QPS at peak, latency budget <50ms p99.

Design.

  • Aggregation pipeline. Stream of search queries → per-prefix counters in a fast store (Redis ZSET or RocksDB). Aggregate counts hourly and persist a top-K-per-prefix snapshot.
  • Serving. A trie loaded into RAM on each autocomplete server, with each node carrying its top-K completions sorted by frequency (or by a recency-weighted score); a sketch follows this list. Lookup: walk to the prefix node, return top-K.
  • Updates. Trie is rebuilt every N minutes from the snapshot, swapped in atomically. For real-time updates (rare), apply increments to the in-memory trie and sync periodically.
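
A sketch of the serving-side trie with per-node top-K completions; the naive build step and K=10 cutoff are illustrative assumptions.

```python
class TrieNode:
    __slots__ = ("children", "top_k")
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.top_k = []         # [(count, query), ...] best-first, capped at K

def build_trie(query_counts, k=10):
    root = TrieNode()
    for query, count in query_counts.items():
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((count, query))
            node.top_k.sort(reverse=True)
            del node.top_k[k:]              # keep only the K best per prefix
    return root

def complete(root, prefix):
    node = root
    for ch in prefix:                       # walk to the prefix node
        node = node.children.get(ch)
        if node is None:
            return []
    return [query for _, query in node.top_k]
```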

Follow-ups. Personalization (per-user history layer merged with the global top-K). Trending detection (decay older counts so a sudden surge surfaces fast). Typo tolerance (search the trie with edit distance ≤ 1, expensive — usually done with a separate spell-corrector before the prefix walk). How do you handle non-Latin scripts? (Tokenize on grapheme clusters; the trie structure is unchanged.)

10. Design a logging and metrics pipeline

Clarify before drawing. Log volume? Retention? Search latency target? Real-time alerting on metrics?

Capacity. 10TB/day of logs, 1B metric data points/day, search across 30 days.

Design.

  • Log path. Apps write to local agents (Fluent Bit / Vector). Agents batch and ship to Kafka. A consumer (sketched after this list) indexes hot logs (last 7 days) into Elasticsearch / OpenSearch and writes cold logs (older) to object storage (S3 / GCS) with Parquet for cheap scan-on-read via Athena/BigQuery.
  • Metrics path. Apps emit metrics via a Prometheus-style scrape or a push gateway. Time-series DB (Prometheus, M3, VictoriaMetrics) stores them. Grafana queries for dashboards. Alertmanager evaluates alerting rules.
  • Trace path. OpenTelemetry collector receives spans, forwards to a backend (Jaeger / Tempo / Honeycomb).
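
A sketch of the dual-write log consumer in the log path above; kafka_consumer, es_client, and s3_batcher are assumed interfaces standing in for the real clients.

```python
def consume_logs(kafka_consumer, es_client, s3_batcher):
    for record in kafka_consumer:
        log = record.value
        # Hot tier: index into Elasticsearch/OpenSearch, one index per day so
        # whole indices can be dropped once they age out of the hot window.
        es_client.index(index=f"logs-{log['date']}", document=log)
        # Cold tier: buffer into a Parquet batch bound for object storage.
        s3_batcher.add(log)
        if s3_batcher.full():
            s3_batcher.flush()
```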

Cost trade-off. Hot indexed logs are 100× more expensive per byte than cold object-store logs. Aggressive retention tiering (3 days hot, 30 days warm, 90 days cold) saves >80% on the log bill at scale.

Follow-ups. How do you avoid losing logs on Kafka outage? (Local on-disk buffer at the agent, with a cap.) How do you sample expensive trace data without losing rare-failure traces? (Tail-based sampling: buffer the trace, sample after the trace completes, keep all errors.) How do you de-dupe metrics from rolling deploys? (Per-pod labels; Prometheus dedupes natively.)

FAQ

What's the right framework for any system design question?

Roughly: clarify → capacity → API → high-level design → component deep-dive → trade-offs → follow-ups. Spend the first 5 minutes on clarification and capacity numbers — most candidates skip this and lose the round in the first 10 minutes by designing for the wrong constraints.

How do interviewers grade system design rounds?

The bar is whether you (a) ask about scale before drawing, (b) have some number for QPS, latency, and storage, (c) can name the trade-off of every component you reach for, and (d) handle follow-ups gracefully. Most candidates fail (a) or (b).

How much depth should I go into on each component?

Enough that the interviewer can see you'd be the one debugging it at 2 AM. If you said "we'd use Cassandra," be ready to explain why over DynamoDB, what happens on a partition, and what your read-consistency level is.

Do I need to memorize numbers like "Redis does 100K ops/sec"?

You need a rough order of magnitude for: a single-node SQL DB (10K writes/sec), Redis (100K ops/sec/core), a single Kafka broker (1M msg/sec), a typical RPC roundtrip (1ms intra-region, ~50ms cross-region). The exact numbers move, but interviewers are checking that you have a feel for scale.

What if I haven't worked on a system at this scale?

Say so. "I've worked at smaller scale, so I'd estimate the numbers from first principles and double-check them later — let me show how I'd reason about it." Senior interviewers respect that; they hate watching candidates make up numbers.
