Common System Design Interview Examples (With Solutions)

System design interviews can be challenging. They often require designing scalable architectures for real-world scenarios.

Below, we explore common system design interview examples (both small-scale and large-scale) with high-level solutions. Each example includes key considerations, design breakdowns, and important trade-offs.

1. URL Shortener (TinyURL)

A URL Shortener service (e.g., TinyURL or Bitly) converts long URLs into short, unique aliases. This is a classic system design question focusing on high QPS reads, data storage, and uniqueness of generated links.

Requirements & Challenges:

  • Generate a unique short link (alias) for each long URL (e.g., 6-8 character code).

  • When a user hits the short URL, redirect them to the original long URL with minimal latency.

  • System should be highly available and scalable (potentially millions of URLs and redirections).

  • Short URLs should ideally never collide and remain valid for years (durability).

  • Additional features: custom aliases, link expiration, and click analytics (optional).

High-Level Design:

  • Use a Base62 encoding (characters 0-9, A-Z, a-z) for generating short codes. For instance, a 6-character Base62 string yields 62^6 ≈ 56 billion possible combinations, ensuring plenty of unique IDs.

  • Maintain a database (DB) to store mappings from short code → original URL. A simple key-value schema works (short_code as key, long_url as value). This could be a relational DB or a NoSQL store for scalability.

  • On URL-shortening request: generate a unique ID for the new URL. This could be done via an auto-incrementing ID or a distributed ID generator (like Twitter Snowflake) to avoid collisions. Convert the ID to a Base62 string to get the short code (a minimal encoding sketch follows this list).

  • On redirection request: parse the short code from the URL, look up the mapping in the DB, and redirect (HTTP 302) the user to the original URL.
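
To make the encoding step concrete, here is a minimal Base62 sketch in Python; the alphabet ordering and function names are illustrative choices rather than a prescribed implementation:

```python
import string

# 0-9, A-Z, a-z — 62 characters total
BASE62_ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_base62(num: int) -> str:
    """Convert a non-negative integer ID into a Base62 short code."""
    if num == 0:
        return BASE62_ALPHABET[0]
    chars = []
    while num > 0:
        num, remainder = divmod(num, 62)
        chars.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(chars))

def decode_base62(code: str) -> int:
    """Convert a Base62 short code back into the original integer ID."""
    num = 0
    for ch in code:
        num = num * 62 + BASE62_ALPHABET.index(ch)
    return num

# Example: ID 125 encodes to '21' (2*62 + 1); 6 characters cover ~56 billion IDs.
assert decode_base62(encode_base62(123456789)) == 123456789
```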

Step-by-Step Solution:

  1. Shortening Flow: The user provides a long URL via an API or web form. The application server checks if this URL was seen before (to possibly return an existing short link). If not, it generates a new ID (ensuring global uniqueness) and encodes it in Base62 to produce the short alias. It then stores {short_code -> long_url} in the database and returns the short URL to the user.

  2. Redirection Flow: When a user hits a short URL (e.g., https://sho.rt/abc123), a request goes to the URL shortener service. The service extracts the code (abc123), looks up the long URL in the database, and then issues a redirect response to send the user to the original URL. This lookup should be very fast – using an in-memory cache (like Redis) for popular links can help reduce DB hits.
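
A minimal sketch of the redirection flow, assuming Flask for the HTTP layer and plain dictionaries standing in for the cache and database (a real deployment would use Redis and a persistent store):

```python
from flask import Flask, abort, redirect

app = Flask(__name__)
cache = {}   # stand-in for Redis; maps short_code -> long_url
db = {}      # stand-in for the mapping table in the database

@app.route("/<short_code>")
def follow_short_link(short_code):
    # 1. Try the cache first to avoid a DB round trip for hot links.
    long_url = cache.get(short_code)
    if long_url is None:
        # 2. Cache miss: fall back to the database.
        long_url = db.get(short_code)
        if long_url is None:
            abort(404)                # unknown or expired short code
        cache[short_code] = long_url  # populate the cache for next time
    # 3. Send the client to the original URL.
    return redirect(long_url, code=302)
```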

Scaling & Additional Considerations:

  • Database Scaling: Use sharding or partitioning to distribute URL data across multiple DB servers (e.g., by hash of the short code). This ensures the system can handle millions of entries.

  • Cache: Implement a cache layer for frequently accessed short URLs to reduce latency. Most accesses are reads (redirections), so caching popular mappings in memory greatly improves performance.

  • High Availability: Deploy multiple service instances behind a load balancer. Use DB replication so that if one DB node fails, a replica can take over.

  • Analytics & Expiration: For tracking click counts, you could have a separate analytics service or table that increments counters (possibly asynchronously). For expiration, store an expiration timestamp with each URL and check it on access (removing expired links via background jobs).

Learn how to design a URL shortener.

2. Design a Rate Limiter

A Rate Limiter restricts the number of requests a user or client can make to an API or service in a given time window. This prevents abuse (e.g., spam or DDoS attacks) and helps maintain service quality.

Requirements & Challenges:

  • Limit client actions based on defined rules (e.g., max 100 requests per minute per IP or user). Excess requests should be denied (or queued/throttled).

  • The rules might vary (per user, per API key, per IP, etc.) and should be configurable.

  • The implementation should be low-latency (introducing minimal overhead) and highly available (often sitting in a critical path of request handling).

  • Should work in a distributed environment – if the service is running on multiple servers, the rate limit counts must be consistent across them.

High-Level Design:

  • In-Memory Counters: The simplest approach is to keep a counter per user/IP along with the start of the current time window. For example, for a rule of “100 requests per minute”, track the count of requests in the current minute. If a new request arrives and the count is < 100, allow it and increment; if it is >= 100, reject it. (A Redis-backed version of this fixed-window counter is sketched after this list.)

  • Sliding Window or Token Bucket: A fixed time window can lead to bursts at window boundaries. A better approach is the token bucket algorithm, which continuously replenishes tokens at a fixed rate. Each request consumes a token; if tokens are available, the request passes, otherwise it’s limited. The bucket has a max capacity (burst size). This smooths out the rate limiting. For instance, a bucket might refill at 1 token per second with a capacity of 60 tokens – a sustained rate of one request per second with bursts of up to 60; each request checks for a token and consumes it if present.

  • Distributed Coordination: In a distributed system with multiple servers, use a centralized store (like Redis or Memcached) to maintain counters/tokens. Each server checks & updates the count in this shared store to ensure global consistency of the rate limits.
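
As a concrete example of the counter approach, here is a minimal fixed-window sketch assuming a shared Redis instance via the redis-py client; the key format and the 100-per-minute rule are illustrative:

```python
import time
import redis

r = redis.Redis()    # shared store so all app servers see the same counts

LIMIT = 100          # max requests per window
WINDOW_SECONDS = 60  # fixed one-minute window

def allow_request(client_id: str) -> bool:
    """Return True if the client is still under its per-minute limit."""
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{client_id}:{window}"
    count = r.incr(key)                # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, WINDOW_SECONDS)  # let the key age out after the window
    return count <= LIMIT
```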

Implementation Details:

  • Middleware or Service: The rate limiter can be implemented as a middleware in each service or as a separate service (e.g., an API gateway layer) that intercepts requests. A common practice is to put it in the API gateway or load balancer so all traffic passes through it.

  • Data Store: Use an in-memory store with high throughput (Redis is popular) to keep counters. Redis even has built-in commands (like INCR and EXPIRE) which are useful for implementing fixed-window counters. For token bucket, a slightly more complex logic is needed, but still achievable with atomic operations in Redis or using Lua scripts for atomicity.

  • Example (Token Bucket):

    • Every user/IP has a token bucket with a certain capacity (say 100 tokens) and refill rate (e.g., 1 token/second).

    • Each incoming request checks the bucket: if there’s at least 1 token, consume one and allow the request; if 0 tokens, reject the request with a “429 Too Many Requests” response.

    • The bucket is refilled continuously (or periodically) at the set rate. This can be done by tracking the last refill time and calculating how many tokens to add on each check (see the sketch after this list).

  • Alternatives: Other algorithms include Leaky Bucket (which enforces a fixed output rate) or Sliding Window Log (which records timestamps of requests and computes the rate over a sliding interval). Each has trade-offs in precision vs. memory. A sliding window counter approximation can combine fixed windows for efficiency with a sliding effect to avoid boundary bursts.
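
Here is a minimal in-memory sketch of the token bucket described above (single-process only; a distributed version would keep the bucket state in a shared store such as Redis):

```python
import time

class TokenBucket:
    """Single-process token bucket: `capacity` is the burst size, `refill_rate` is tokens/second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity              # start with a full bucket
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill based on the time elapsed since the last check.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # consume one token for this request
            return True
        return False                        # caller should respond with 429 Too Many Requests

# Example: bursts of up to 100 requests, refilling at 1 token per second.
bucket = TokenBucket(capacity=100, refill_rate=1.0)
```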

Considerations:

  • Accuracy vs. Performance: A distributed rate limiter might have slight delays syncing counters between nodes. It’s important to decide if a small overshoot is acceptable or if strong consistency is required. Often, eventual consistency is fine (a few extra requests passing during race conditions). Learn more about strong consistency vs eventual consistency.

  • Multi-level Limits: Systems may implement multiple limits (e.g., per-second, per-minute, per-day) simultaneously for different protections.

  • User Feedback: The system should inform clients when they are throttled (e.g., return error code or message) and possibly include a Retry-After header to tell when to try again.

3. Design a Message Queue (like Kafka)

Message Queues (or streams) enable asynchronous, decoupled communication between services. Apache Kafka is a popular example of a distributed message queue that can handle high throughput and scale horizontally.

Requirements & Use Cases:

  • Decouple producers (senders of messages) and consumers (receivers) so they can run independently. Producers publish messages without needing immediate processing, and consumers can process at their own pace.

  • Ensure durability of messages (they shouldn’t be lost if a server fails) and allow replays (consumers can re-read if needed).

  • Support scalability: handle very high volumes of messages (millions per second) by horizontal scaling (add more brokers/nodes).

  • Maintain order of messages where needed (often within a topic or partition).

  • Support pub/sub semantics: multiple consumers can subscribe to the same stream of messages (topic).

High-Level Architecture (Kafka-style):

  • Producers: client applications that publish messages to the queue (to a specific topic). They send data to a Kafka cluster over the network.

  • Topics & Partitions: A topic is like a category or feed name for messages (e.g., “UserSignups” topic). Each topic is partitioned into multiple logs. A partition is an append-only log file on disk – Kafka stores messages in these partitions in the order received. Partitions enable parallelism (multiple brokers can host different partitions of a topic) and scalability. Ordering is guaranteed within a single partition but not across partitions.

  • Brokers (Servers): Kafka runs as a cluster of one or more servers called brokers. Each broker handles some partitions. Partitions are replicated across brokers for fault tolerance (one broker is leader for a partition, others are followers). If a broker dies, a follower can take over as leader for its partitions. This replication ensures no single failure causes data loss. Kafka’s design of distributed, replicated partitions gives it high scalability and fault tolerance.

  • Consumers: client applications that subscribe to topics and process messages. Consumers often work as a consumer group (multiple instances in a group share the work). Kafka will assign different partitions to different consumers in the group, so they each get a subset of the messages. This allows scaling out consumers while each message is only processed by one consumer in the group. If a consumer in the group fails, Kafka rebalances and assigns its partitions to other consumers.

Message Flow (Step-by-Step):

  1. Production: A producer publishes a message to a specific topic. The producer library on the client picks a partition for the message (by round-robin or based on a key for partitioning, e.g., all messages for user X go to the same partition to preserve order). It then sends the message to the broker that is leader for that partition.

  2. Storage & Replication: The broker appends the message to the end of the partition log on disk. This write is persisted (Kafka relies on fast disk writes and OS page cache). The broker also forwards the message to follower replicas on other brokers; once enough replicas have stored it (per the replication factor and ack settings), the message is considered committed and durable. Kafka’s append-only log and sequential disk writes enable it to serve as a high-performance, persistent queue.

  3. Consumption: Consumers pull messages from the brokers. Each consumer group keeps track of its read position (offset) in each partition. A consumer will periodically poll the broker for new messages, reading sequentially. After processing messages, the consumer can acknowledge by committing the offsets (either automatically or manually), so it won’t re-read those on restart. Multiple consumer groups can consume the same topic independently (pub-sub pattern).

  4. Acknowledgments & Retention: Kafka doesn’t remove a message once consumed; instead, messages are retained for a configured time (e.g., 7 days) or until the log size limit is reached. This allows new consumers to join and replay past messages. Old messages get purged after the retention period, or Kafka can be configured as a compact log (keeping latest value for each key).
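
A minimal producer/consumer sketch of this flow, assuming the kafka-python client and a local broker; the topic name, group id, and key choice are illustrative:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message, keyed by user ID so one user's events stay
# ordered within a single partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("UserSignups", key=b"user-123", value=b'{"event": "signup"}')
producer.flush()  # block until the broker acknowledges the write

# Consumer: one member of a consumer group; Kafka assigns it a subset of partitions.
consumer = KafkaConsumer(
    "UserSignups",
    bootstrap_servers="localhost:9092",
    group_id="signup-processors",
    auto_offset_reset="earliest",   # start from the beginning if no committed offset exists
    enable_auto_commit=False,       # commit offsets explicitly after processing
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()               # mark this offset as processed
```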

Design Considerations:

  • Ordering Guarantees: If strict ordering is required for a category of messages, ensure they go to the same partition (since order is only guaranteed per partition). For example, all messages for a particular user or entity might use a partition key.

  • Delivery Semantics: Kafka by default provides at-least-once delivery (a consumer may process the same message twice in some failure cases). Exactly-once can be achieved with additional logic (Kafka supports exactly-once semantics in Kafka Streams or using idempotent producers and transaction APIs).

  • Throughput: Kafka can handle very high throughput by scaling partitions. More partitions = more parallel writes and reads. However, too many partitions can increase overhead. The design should choose a partition count based on throughput needs.

  • ZooKeeper/Coordination: Traditionally, Kafka used ZooKeeper for cluster metadata (tracking brokers, topics, etc.). Newer versions have built-in consensus (KRaft) to eliminate external dependency. The important point is that there’s a cluster coordinator to manage leader election and metadata.

  • Use Cases: This kind of message queue is used for event pipelines, log aggregation, microservice communication, etc., where decoupling and reliability are key.

4. Design a Video Streaming Service (YouTube/Netflix)

Designing a video streaming service like YouTube or Netflix is a complex large-scale problem. It involves handling user uploads, storing huge video files, processing (encoding) them, and delivering streams to potentially millions of viewers concurrently. We’ll focus on the core of on-demand video streaming (not live streaming, though there are similarities).

Requirements & Challenges:

  • Video Upload: Users can upload video files of large size. The system must store these and process them (e.g., convert to streamable formats and multiple resolutions).

  • Video Storage and Streaming: Store videos efficiently and serve them with low latency. Handle many concurrent viewers, which means lots of bandwidth and a need for content distribution.

  • Scalability: The service should scale to millions of videos and users. This entails huge storage and the ability to handle high request rates.

  • Playback: Support different client devices and network conditions, which likely requires adaptive streaming (multiple quality streams).

  • Other features: user authentication, video metadata (titles, descriptions), search, recommendations, comments, etc. (These add complexity, but the core focus is the video delivery pipeline.)

High-Level Design Components:

  • Client (Web/Mobile): Users upload videos via a web or mobile client and watch videos through a video player in their browser/app. The player handles streaming protocols and quality adaptation.

  • Upload Service & Processing: A service (or microservice) receives video uploads. Videos are typically stored initially in a raw storage (like a blob store). A processing pipeline (Video Encoding Service) then takes the uploaded file and transcodes it into various formats and resolutions (e.g., 240p, 480p, 1080p) to accommodate different devices and bandwidth. The video may also be sliced into segments (for adaptive bitrate streaming, e.g., HLS/DASH protocols split video into 2-10 second chunks).

  • Storage: Use a distributed object storage system for the video files (for example, cloud storage like AWS S3 or a Hadoop HDFS cluster). These systems can store massive files and replicate them for durability. Video files are large, so efficient storage and retrieval is key. Thumbnails and other static content can also go here.

  • Content Delivery Network (CDN): A CDN is crucial for a global video service. The CDN caches video content on edge servers around the world. When a user watches a video, the video segments stream from a nearby CDN edge, reducing latency and load on the origin servers. This improves content availability and reduces bandwidth costs by serving from caches close to users. YouTube and Netflix heavily use CDNs (Netflix even has its own CDN called Open Connect).

  • Backend Services: Various microservices and databases manage data like user info, video metadata (title, description, tags), likes, comments, and the mapping of videos to storage locations or CDN URLs. A database (SQL or NoSQL) stores metadata and maybe a search index for video search. There’s also a streaming backend that orchestrates which video files to serve to which user and handles licensing/DRM for services like Netflix.

Workflow Breakdown:

Video Upload Pipeline:

  1. Upload: A user uploads a video file via the app. The file is sent to an Upload Service, which stores the raw file in a temporary storage or directly in the permanent storage (like an S3 bucket).

  2. Encoding: After upload, a task is triggered to process the video. A transcoding service takes the file and converts it into multiple codecs and resolutions. For example, it might generate MP4/H.264 in 360p, 720p, 1080p, etc. It also splits the video into small segments and creates manifest files (playlist of segments) if using HLS/DASH for streaming. This step is CPU-intensive and might be done with a distributed cluster of encoder workers.

  3. Storage & CDN Ingestion: The processed video files/segments are then stored in the permanent storage. The CDN is updated (or it will fetch on demand) so that future user requests for this video will be served from the CDN edges. Often the first viewer might trigger the CDN to pull from origin, and subsequent viewers get it from cache.

  4. Metadata Update: The system updates the video catalog/database to mark the video as available, including metadata like its ID, title, location of files, duration, etc. Now the video can be searched and viewed by others.
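
A rough sketch of this pipeline’s hand-off, assuming S3-style object storage via boto3, with an in-memory list and dictionary standing in for the real task queue and metadata database:

```python
import uuid
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "videos-raw"   # illustrative bucket name
TRANSCODE_QUEUE = []        # stand-in for a real task queue (e.g., Kafka or SQS)
VIDEO_CATALOG = {}          # stand-in for the metadata database

def handle_upload(file_obj, title: str) -> str:
    """Store the raw file, then hand off to asynchronous transcoding."""
    video_id = str(uuid.uuid4())
    raw_key = f"{video_id}/original"

    # 1. Persist the raw upload in durable object storage.
    s3.upload_fileobj(file_obj, RAW_BUCKET, raw_key)

    # 2. Enqueue a transcode job; encoder workers consume it and produce
    #    360p/720p/1080p renditions plus HLS/DASH segments and manifests.
    TRANSCODE_QUEUE.append({"video_id": video_id, "source_key": raw_key,
                            "renditions": ["360p", "720p", "1080p"]})

    # 3. Register metadata as "processing"; it flips to "ready" once at least
    #    one rendition has been encoded and pushed to storage/CDN.
    VIDEO_CATALOG[video_id] = {"title": title, "status": "processing"}
    return video_id
```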

Video Streaming (Playback):

  1. Request Video: A user clicks a video to watch. The client app sends a request to the backend for the video stream. The request might go to a Video Streaming Service which handles serving the content.

  2. Authentication & Authorization: (if needed) Check if the video is allowed (e.g., not private or region-restricted) and record a view count or any DRM handling.

  3. Response (Video URL/Manifest): The server responds with the video playback info. For adaptive streaming, the client might get a manifest file (like an HLS playlist or MPEG-DASH manifest) which lists the URLs of video segment files for various bitrates. The URLs are often pointing to CDN domain names.

  4. CDN Streaming: The client’s video player then requests video segments (chunks) from the CDN URL. For example, segment1.ts, segment2.ts, etc., at a certain quality. The CDN edge closest to the user will deliver these segments. If an edge cache doesn’t have a segment yet, it fetches it from an origin (the centralized storage or an upstream CDN node) and then caches it.

  5. Adaptive Bitrate: As the video plays, the client monitors its bandwidth and can switch between quality streams (by requesting segments from a different quality level listed in the manifest); a simple selection heuristic is sketched after this list. The CDN and segment architecture make this seamless.

  6. Playback and Feedback: The video plays for the user. Meanwhile, stats can be sent back (view count increment, watch duration for recommendations, etc.). If a segment fails to load, the player might retry or drop to a lower quality.
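
The selection heuristic referenced in step 5 can be sketched as follows; the rendition table and the 0.8 safety factor are illustrative assumptions, and a real player reads the available bitrates from the manifest:

```python
# Available renditions and their required bandwidth in kilobits per second
# (illustrative values; a real player gets these from the HLS/DASH manifest).
RENDITIONS = [
    ("1080p", 5000),
    ("720p", 2800),
    ("480p", 1400),
    ("240p", 400),
]

def choose_rendition(measured_kbps: float, safety_factor: float = 0.8) -> str:
    """Pick the highest quality whose bitrate fits within the measured bandwidth."""
    budget = measured_kbps * safety_factor   # leave headroom to avoid rebuffering
    for name, required_kbps in RENDITIONS:   # ordered from highest to lowest quality
        if required_kbps <= budget:
            return name
    return RENDITIONS[-1][0]                 # fall back to the lowest quality

# Example: a connection measured at ~3.5 Mbps would select 720p.
print(choose_rendition(3500))
```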

Key Design Points:

  • Scalability via CDN: The use of a CDN is non-negotiable at YouTube scale. It offloads traffic from core servers. Popular videos will be distributed across many CDN nodes.

  • Storage and Bandwidth: Storing petabytes of video and serving terabits of data per second is a core challenge. Partition videos by ID across storage servers. Ensure redundancy (replicas) so that even if one storage node is down, the video is still accessible from another.

  • Data Consistency: When a video upload is completed and processed, the metadata service must reflect that so users can start viewing. There could be slight delays (a video might show up a few seconds/minutes after upload once encoding for at least one resolution is done).

  • Microservices: Typically, such a system would use many microservices (upload, encode, user service, video service, comment service, etc.) connected by APIs or message queues. But in an interview context, it’s fine to focus on major components and how data flows.

  • Additional: Implementing video recommendation algorithms or search would involve separate systems (e.g., Big Data processing to track views, ML models for recommendations, search indexes for titles). For the interview scope, these can be mentioned but kept brief.

5. Design a Social Media Feed (Twitter Timeline)

Designing a social media news feed (like Twitter’s home timeline or Facebook’s news feed) involves handling real-time updates and a high volume of data. The system must gather posts from all the people/Pages a user follows and display them (often sorted by time or relevance). The challenge is in fan-out – one user’s post might need to be delivered to millions of followers.

Requirements & Challenges:

  • Users can follow other users. When someone you follow posts a new update (tweet), it should appear in your feed.

  • The feed should show recent posts from all followed accounts, ideally in reverse chronological order (or with some ranking logic).

  • High fan-out: If a popular user with 10 million followers tweets, that generates 10 million feed insertions. The system must handle these spikes.

  • Feeds should update in near-real-time, so users see new content quickly. But the system can tolerate eventual consistency (slight delays) in exchange for stability.

  • Read vs Write: Typically, read volume (users scrolling feeds) is much higher than write volume (posts). So optimizing feed reads (low latency) is crucial, even if writes are heavy.

Design Approaches (Push vs Pull):
There are two main models to handle feed generation – Push (fan-out on write) and Pull (fan-out on read):

  • Push Model (Fan-out on write): When a user posts something, immediately push that post to all their followers’ feed storage. In practice, this means writing the new post ID into a feed datastore (like a timeline cache or database) for each follower. This way, when followers request their feed, it’s pre-assembled – the read is fast (just fetch from their feed store). Pros: Very fast feed reads (just read a precomputed list), and feeds are up-to-date immediately. Cons: Enormous write amplification if a user has many followers. It can overwhelm the system for users like celebrities (millions of writes for one tweet). It also stores duplicate data (the same tweet stored in many feeds).

  • Pull Model (Fan-out on read): New posts are not pushed to followers. Instead, each user’s feed is assembled on the fly when requested. When you open the app, the service fetches the latest posts from all the users you follow (e.g., query the posts database for the newest X posts of each followee, then merge-sort by time). Pros: Minimizes write work – each post is stored just once in a central place (no duplication). Good for very high-fan-out cases (popular users) because you don’t pre-compute 10M feeds. Cons: Reading the feed becomes expensive, especially if a user follows many people; it may require reading from many sources and sorting, leading to higher latency. Users might see slightly delayed results because the feed is computed on demand.
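
A minimal sketch of the pull model’s merge step, assuming each followee’s recent posts have already been fetched newest-first and heapq.merge combines them by timestamp:

```python
import heapq
import itertools

def assemble_feed(posts_by_followee: dict, limit: int = 50) -> list:
    """Merge each followee's recent posts into one reverse-chronological feed.

    posts_by_followee maps followee_id -> [(timestamp, post_id), ...],
    with each list already sorted newest-first (illustrative shape).
    """
    merged = heapq.merge(*posts_by_followee.values(),
                         key=lambda post: post[0], reverse=True)
    return list(itertools.islice(merged, limit))

# Example: two followees, three posts total; the newest overall comes first.
feed = assemble_feed({
    "alice": [(1700000300, "a2"), (1700000100, "a1")],
    "bob":   [(1700000200, "b1")],
})
print(feed)  # [(1700000300, 'a2'), (1700000200, 'b1'), (1700000100, 'a1')]
```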

Hybrid Approach:
In practice, large systems use a combination. For most users (who have a reasonable number of followers), use the push model to give instant, precomputed feeds. For users with extremely high follower counts (celebrities), use pull or a lazy fan-out – don’t push to all followers; instead, mark that those followers need to pull from the source when they check their feed. For example, Twitter might push tweets to the feeds of users with <10k followers, but for a celebrity with millions of followers, simply flag the followers to fetch that tweet later instead of pushing it to each feed.

High-Level Design Components:

  • User Service & Social Graph: A service/database to manage who follows whom (the follow graph). Given a user, we need to know their followers (for push) or followees (for pull). This data might be stored in adjacency lists, e.g., for each user store list of follower IDs and another list of followee IDs, possibly in a graph database or relational DB.

  • Post Storage: A database for all posts/tweets. This could be a distributed NoSQL store since the write rate is high (and each post is small text). Each post record will have author, timestamp, content, etc. Possibly partition posts by user (all tweets by user X in one partition for quick retrieval).

  • Feed Cache/Storage: In the push model, each user has a feed data store (could be an in-memory list or a database table) containing references to tweets that should appear in their timeline. This could be a Redis cache or a specialized timeline storage (even something like Cassandra can be used to store feeds sorted by time).

  • Feed Service: The service responsible for constructing and retrieving the feed for users. On a new post, it handles fan-out (push model logic) by updating follower feeds. On feed read, it fetches the timeline (either from the feed cache if push model or by aggregating data if pull).

  • Ranking Service (optional): If feeds are not purely chronological, a service might score and rank posts (e.g., based on engagement or relevance) before displaying. For simplicity, assume chronological order here.

Feed Delivery (Push Model) - Step-by-Step:

  1. User creates a post: The user’s request to post goes to the Post Service. The new post is stored in the posts database (with a unique ID and timestamp).

  2. Fan-out to feeds: The Feed Service retrieves the list of that user’s followers (from the social graph). For each follower, it inserts the new post’s ID into that follower’s timeline store. This could be done via a batch job or a distributed task queue to spread the work. If a follower’s feed storage is in Redis, for example, it might do an LPUSH to add the new post ID to the list for that follower (maintaining a capped list, e.g., last 1000 posts). For durability, it might also store in a persistent store (maybe Redis is backed by disk or a parallel write to Cassandra).

  3. Timeline read: When a user opens their feed, the Feed Service simply pulls their feed data (e.g., get list of latest N post IDs from their feed store) and then fetches those post details (joining with Post Storage to get content). The results are returned to the client sorted by time (already sorted if we always insert new posts at the top). This read is fast because it’s essentially one or two lookups (one to get IDs, another to fetch contents), all precomputed.
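
A minimal sketch of steps 2-3 above, assuming redis-py and storing each follower’s timeline as a capped Redis list of post IDs; the key format and 1000-entry cap are illustrative:

```python
import redis

r = redis.Redis()
FEED_CAP = 1000  # keep only the most recent 1000 post IDs per follower

def fan_out_post(post_id: str, follower_ids: list) -> None:
    """Push a new post ID onto every follower's timeline (fan-out on write).

    In production this loop would be spread across asynchronous worker tasks
    rather than run inline in the request path.
    """
    pipe = r.pipeline()
    for follower_id in follower_ids:
        key = f"feed:{follower_id}"
        pipe.lpush(key, post_id)          # newest posts sit at the head of the list
        pipe.ltrim(key, 0, FEED_CAP - 1)  # cap the list so feeds don't grow unbounded
    pipe.execute()

def read_timeline(user_id: str, count: int = 50) -> list:
    """Fetch the most recent post IDs from the user's precomputed feed."""
    return r.lrange(f"feed:{user_id}", 0, count - 1)
```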

Handling High Fan-out (Pull for certain users):

  • If a user has > N followers (a threshold), skip the push step for them. Mark their new post in a “fan-out deferred” list. For followers of such a user, when they load the feed, the system will know there are new posts not pre-pushed (perhaps via a pointer or by checking the post timestamps) and fetch them from Post Storage on the fly. This way, we avoid millions of writes at post time, trading for a slightly slower read for those followers. This is how a hybrid model ensures the system remains efficient.

Optimizations & Considerations:

  • Caching: Even with precomputed feeds, caching the feed results or the most recent posts in memory can speed up feed loads. Frequently accessed feeds (like active users refreshing often) might be kept in cache.

  • Feed Cap: We might not store all posts in a user’s timeline store (could be too large). Often, store the most recent X (like 1000) posts in fast storage, and for older, rely on on-demand fetch or pagination from the posts DB.

  • Eventually Consistent: Accept that some users might see a tweet a few seconds later than others. This is a reasonable trade-off for performance. The system design should prioritize not overwhelming the database.

  • Social Graph scaling: The follow relationships can be huge (billions of edges). Partition this data and possibly use in-memory solutions or graph databases optimized for these queries.

  • Ordering: If using push, the feed is naturally in time order. If using pull or hybrid, ensure you merge-sort properly by timestamp (or score).

  • Real-time updates: To show new tweets appearing in real-time (without manual refresh), the service can use a WebSocket or long-polling connection to push notifications to clients when new posts are available (small messages saying “user X posted, fetch new feed”). This adds complexity but enhances user experience.

Learn how to design Twitter.

6. Design a Distributed Cache System (like Redis)

A distributed cache (e.g., Redis or Memcached in cluster mode) is used to speed up data access by storing data in memory across multiple servers. It’s commonly placed in front of slower storage (databases) to reduce read latency and offload traffic. Designing a distributed cache involves data partitioning, replication, and consistency considerations.

Requirements & Use Cases:

  • Serve frequent read queries with low latency (a few milliseconds). Example: caching database query results, session data, or computed data.

  • Handle more data or traffic than a single cache server’s capacity by distributing load across multiple machines.

  • Provide high availability – the cache cluster should tolerate node failures without data loss (or with minimal loss) and with seamless failover.

  • Offer mechanisms for data expiration (TTL) and eviction (e.g., LRU policy when memory is full).

High-Level Design:

  • Cache Nodes (Servers): Instead of one monolithic cache, use multiple cache servers. Each server holds a portion of the overall cache data. Clients (or a proxy) need a way to determine which server holds a given piece of data (key). Commonly, consistent hashing is used to map keys to servers. This ensures even distribution and minimal remapping when scaling out or in (a minimal hash-ring sketch appears after this list).

  • Partitioning (Sharding): With consistent hashing or a similar scheme, each key (e.g., a cache key like user:123:profile) is hashed to find which node is responsible. This is the shard for that key. Partitioning allows the cache to scale – e.g., with 10 servers, roughly 1/10th of keys go to each (not exactly, but balanced by the hash function).

  • Replication: To achieve high availability, each cache shard can be replicated to one or more replica nodes. For example, use a master-slave model: each partition has one primary node handling reads/writes, and a secondary that keeps a copy of the data. If the primary fails, a replica can take over (possibly with some brief inconsistency). Replication ensures the data is not lost if one node dies. Some systems also allow reads from replicas to scale read traffic, though with eventual consistency.

  • Cache Coordination: A central or distributed mechanism is needed to know cluster topology (which nodes exist, which are masters vs replicas, etc.). In Redis Cluster, for example, the nodes themselves hold the configuration and communicate with each other to manage partitioning. Clients can be aware of the hash ring or use a discovery service.

  • Data Model: Typically key-value. The cache might store simple string values, or more complex structures (lists, hashes) if using an advanced cache like Redis. For system design, key-value is the simplest assumption. Keys and values are both in-memory (with optional persistence if needed).
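
A minimal consistent-hash ring sketch (without the virtual nodes real clients add for smoother balance), assuming MD5 as the key hash; the class and node names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map cache keys to nodes on a hash ring; adding or removing a node only
    remaps the keys in its segment (no virtual nodes in this simplified sketch)."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        if idx == len(self.ring):
            idx = 0                      # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
print(ring.get_node("user:123:profile"))  # e.g., 'cache-b'
```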

Cache Operations:

  • Cache Read (lookup): The client (or a library) hashes the key to find the correct node, then sends a get request to that cache server. The server returns the value if present. If the key isn’t in cache (cache miss), the client then fetches from the source (e.g., database), and can populate the cache with that key-value for next time (“cache aside” pattern).

  • Cache Write/update: When data is written to the database, the cache should be updated or invalidated to avoid stale data. There are a few strategies:
    • Write-through cache: Every write goes to the cache and the database in the same operation (cache updated with new value). This keeps cache and DB consistent, but adds write latency.

    • Write-behind (write-back): Write to cache first, and asynchronously write to the database after a short delay. Faster writes, but risk of data loss if cache node fails before DB write.

    • Cache-aside (lazy): Application writes to the DB, and simply invalidates or updates the cache entry. Many systems use cache-aside: the cache is populated on read misses or explicitly invalidated on changes (see the sketch after this list).

  • Eviction Policy: Since memory is limited, each node will evict entries when full. Policies include LRU (Least Recently Used), LFU (Least Frequently Used), etc. Redis by default uses an approximation of LRU. Expiration (TTL) on keys is also common for ephemeral data.
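
A minimal cache-aside sketch, assuming redis-py in front of a database accessed through stand-in query_db/update_db helpers, with a TTL as a staleness backstop:

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 300  # expire entries after 5 minutes as a staleness backstop

def query_db(user_id):             # stand-in for a real database read
    return {"id": user_id, "name": "example"}

def update_db(user_id, profile):   # stand-in for a real database write
    pass

def get_user_profile(user_id: str) -> dict:
    """Cache-aside read: try the cache, fall back to the DB, then populate the cache."""
    key = f"user:{user_id}:profile"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit
    profile = query_db(user_id)                      # cache miss -> hit the database
    r.set(key, json.dumps(profile), ex=TTL_SECONDS)  # populate for subsequent reads
    return profile

def update_user_profile(user_id: str, profile: dict) -> None:
    """Write path: update the DB, then invalidate the cache entry."""
    update_db(user_id, profile)
    r.delete(f"user:{user_id}:profile")              # next read repopulates the cache
```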

Ensuring Distribution & Consistency:

  • Consistent Hashing: Using consistent hashing to map keys to nodes allows dynamic scaling. When a node is added or removed, only a subset of keys move to different nodes, minimizing cache invalidation. Without consistent hashing (e.g., naive modulo hashing), adding a server would remap almost all keys.

  • Replication & Failover: Each primary node has replicas. On primary failure, a replica is promoted to primary. This failover can be handled by the cache software (e.g., Redis Sentinel or Redis Cluster handle promotion). During failover, there might be a brief window of unavailability or stale data. By replicating data, the system ensures data is still available from another node if one goes down.

  • Synchronization: Invalidation or updates need to propagate to replicas. In master-slave, writes go to master and the system propagates to replicas (with a replication lag). If a client reads from a replica, it might get slightly stale data if a recent write hasn’t applied yet – decide if that’s acceptable or route all reads to master for strong consistency (trade-off with scalability).

  • Scaling Capacity: To scale, you can add more cache nodes (increasing hash ring slots). This increases total memory and throughput. The system should rebalance keys (consistent hashing helps reduce churn). There’s usually a rehashing process or the consistent hash just automatically maps some keys to the new node.

Use Case Example: Suppose we’re using a distributed cache to speed up user profile lookups for a social app:

  • We have 4 cache servers (A, B, C, D). Keys are hashed to these servers. If user123’s profile hashes to server B, any application server in the cluster knows to query B for the “user123” key.

  • When user123’s profile is updated, the app writes to the primary database and then deletes or updates the “user123” entry in cache (on server B). The new read will fetch from DB (miss) and then set the new cache value on B.

  • If server B crashes, the system will promote its replica (say B2) to primary, or if no replica, then those keys are temporarily unavailable until B recovers or is replaced (in which case consistent hashing would route keys to other servers). Having replicas prevents downtime for those keys.

  • By scaling to 4 servers, our cache can handle 4x the data and traffic of a single server (approximately), and each request is handled by one node (making reads/writes distributed).

Additional Considerations:

  • Cache Stampede: When a popular key expires, many requests may hit the DB at once. Solutions include using request locking or regenerating the cache asynchronously (refresh ahead of expiration).

  • Persistence: Pure caches don’t persist data (aside from replication). But systems like Redis allow RDB or AOF persistence to disk. If persistence is on, after a crash the cache can reload data (less cold-start). This blurs the line between cache and database but can be useful for recovery.

  • Security: Consider if the cache contains sensitive data – it might need access control or encryption, especially if it’s distributed across data centers.

  • Geo-Distribution: In large systems, you might have clusters per region (to keep cache close to users or services). That introduces another layer of complexity with multiple distributed caches that perhaps invalidate each other’s entries.

By understanding these common system design examples – from small utilities like a URL shortener or rate limiter to large-scale systems like social media feeds and video streaming – you can approach system design interviews with confidence.

The key is to break down the problem into components, address bottlenecks (scaling, latency, fault tolerance), and clearly justify design decisions and trade-offs. Each of these examples teaches core principles that can be mixed and matched for designing new systems.
