System Design Master Template | Learn System Design

System Design

Learn System Design

Introduction to System Design

How to Learn System Design?

Functional vs. Non-functional Requirements

What are Back-of-the-Envelope Estimations?

Things to Avoid During System Design Interview

System Design Basics

Load Balancing

Introduction to Load Balancing

Load Balancing Algorithms

Uses of Load Balancing

Load Balancer Types

Stateless vs. Stateful Load Balancing

High Availability and Fault Tolerance

Scalability and Performance

Challenges of Load Balancers

API Gateway

Introduction to API Gateway

Usage of API gateway

Advantages and disadvantages of using API gateway

Key Characteristics of Distributed Systems

Scalability

Availability

Latency and Performance

Concurrency and Coordination

Monitoring and Observability

Resilience and Error Handling

Fault Tolerance vs. High Availability

Network Essentials

HTTP vs. HTTPS

TCP vs. UDP

HTTP: 1.0 vs. 1.1 vs 2.0 vs. 3.0

URL vs. URI vs. URN

Domain Name System (DNS)

Introduction to DNS

DNS Resolution Process

DNS Load Balancing and High Availability

Caching

Introduction to Caching

Why is Caching Important?

Types of Caching

Cache Replacement Policies

Cache Invalidation

Cache Read Strategies

Cache Coherence and Consistency Models

Caching Challenges

Cache Performance Metrics

CDN

What is CDN?

Origin Server vs. Edge Server

CDN Architecture

Push CDN vs. Pull CDN

Data Partitioning

Introduction to Data Partitioning

Partitioning Methods

Data Sharding Techniques

Benefits of Data Partitioning

Common Problems Associated with Data Partitioning

Proxies

What is a Proxy Server?

Uses of Proxies

VPN vs. Proxy Server

Redundancy and Replication

What is Redundancy?

What is Replication?

Replication Methods

Data Backup vs. Disaster Recovery

CAP & PACELC Theorems

Introduction to CAP Theorem

Components of CAP Theorem

Trade-offs in CAP Theorem

Examples of CAP Theorem in Practice

Beyond CAP Theorem

System Design Trade-offs in Interviews

Databases (SQL vs. NoSQL)

Introduction to Databases

SQL Databases

NoSQL Databases

SQL vs. NoSQL

ACID vs BASE Properties

Real-World Examples and Case Studies

SQL Normalization and Denormalization

In-Memory Database vs. On-Disk Database

Data Replication vs. Data Mirroring

Database Federation

Indexes

What are Indexes?

Types of Indexes

Bloom Filters

Introduction to Bloom Filters

Benefits & Limitations of Bloom Filters

Variants and Extensions of Bloom Filters

Applications of Bloom Filters

Long-Polling vs. WebSockets vs. Server-Sent Events

Difference Between Long-Polling, WebSockets, and Server-Sent Events

Quorum

Why Quorum?

What is Quorum?

Heartbeat

What is Heartbeat?

Checksum

What is Checksum?

Uses of Checksum

Leader and Follower

What is Leader and Follower Pattern?

Security

What is Security and Privacy?

What is Authentication?

What is Authorization?

Authentication vs. Authorization

OAuth vs. JWT for Authentication

What is Encryption?

What are DDoS Attacks?

Distributed Messaging System

Introduction to Messaging System

Introduction to Kafka

Messaging patterns

Popular Messaging Queue Systems

RabbitMQ vs. Kafka vs. ActiveMQ

Scalability and Performance

Distributed File Systems

What is a Distributed File System?

Architecture of a Distributed File System

Key Components of a DFS

Misc Concepts

Batch Processing vs. Stream Processing

XML vs. JSON

Synchronous vs. Asynchronous Communication

Push vs. Pull Notification Systems

Microservices vs. Serverless Architecture

Message Queues vs. Service Bus

Stateful vs. Stateless Architecture

Event-Driven vs. Polling Architecture

Quiz - System Design Fundamentals

Quiz

System Design Trade-offs

Importance of Discussing Trade-offs

Strong vs Eventual Consistency

Latency vs Throughput

ACID vs BASE Properties in Databases

Read-Through vs Write-Through Cache

Batch Processing vs Stream Processing

Load Balancer vs. API Gateway

API Gateway vs Direct Service Exposure

Proxy vs. Reverse Proxy

API Gateway vs. Reverse Proxy

SQL vs. NoSQL

Primary-Replica vs Peer-to-Peer Replication

Data Compression vs Data Deduplication

Server-Side Caching vs Client-Side Caching

REST vs RPC

Polling vs. Long-Polling vs. WebSockets vs. Webhooks

CDN Usage vs Direct Server Serving

Serverless Architecture vs Traditional Server-based

Stateful vs Stateless Architecture

Hybrid Cloud Storage vs All-Cloud Storage

Token Bucket vs Leaky Bucket

Read Heavy vs Write Heavy System

Quiz

System Design Master Template

System Design Interviews - A step by step guide

System Design Master Template

Designing a URL Shortening Service like TinyURL

Quiz - Designing URL Shortner

Designing Pastebin

Quiz - Designing Pastebin

Designing Instagram

Quiz - Designing Instagram

Designing Dropbox

Quiz - Designing Dropbox

Designing Facebook Messenger

Quiz - Designing Facebook Messenger

Designing Twitter

Quiz - Designing Twitter

Designing Youtube or Netflix

Quiz - Designing Youtube

Designing Typeahead Suggestion

Quiz - Designing Typeahead Suggestion

Designing an API Rate Limiter

Quiz - Designing an API Rate Limiter

Designing Twitter Search

Quiz - Designing Twitter Search

Designing a Web Crawler

Quiz - Designing a Web Crawler

Designing Facebook’s Newsfeed

Quiz - Designing Facebook’s Newsfeed

Designing Yelp or Nearby Friends

Quiz - Designing Yelp or Nearby Friends

Designing Uber backend

Quiz - Designing Uber backend

Designing Ticketmaster

Quiz - Designing Ticketmaster

Dynamo: How to design a key value store?

Dynamo: Introduction

High-Level Architecture

Data Partitioning

Replication

Vector Clocks and Conflicting Data

The Life of Dynamo’s put() & get() Operations

Anti-entropy Through Merkle Trees

Gossip Protocol

Dynamo Characteristics and Criticism

Summary: Dynamo

Quiz: Dynamo

Mock Interview: Dynamo

Designing YouTube Likes Counter (medium)

YouTube Likes Counter

Quiz

Cassandra: How to Design a Wide-column NoSQL Database?

Cassandra: Introduction

High-level Architecture

Replication

Cassandra Consistency Levels

Gossiper

Anatomy of Cassandra's Write Operation

Anatomy of Cassandra's Read Operation

Compaction

Tombstones

Summary: Cassandra

Quiz: Cassandra

Mock Interview: Cassandra

Kafka: How to Design a Distributed Messaging System?

Messaging Systems: Introduction

Kafka: Introduction

High-level Architecture

Kafka: Deep Dive

Consumer Groups

Kafka Workflow

Role of ZooKeeper

Controller Broker

Kafka Delivery Semantics

Kafka Characteristics

Summary: Kafka

Quiz: Kafka

Mock Interview: Kafka

Chubby: How to Design a Distributed Locking Service?

Chubby: Introduction

High-level Architecture

Design Rationale

How Chubby Works

File, Directories, and Handles

Locks, Sequencers, and Lock-delays

Sessions and Events

Master Election and Chubby Events

Caching

Database

Scaling Chubby

Summary: Chubby

Quiz: Chubby

Mock Interview: Chubby

HDFS: How to Design File Storage System?

Hadoop Distributed File System: Introduction

High-level Architecture

Deep Dive

Anatomy of a Read Operation

Anatomy of a Write Operation

Data Integrity & Caching

Fault Tolerance

HDFS High Availability (HA)

HDFS Characteristics

Summary: HDFS

Quiz: HDFS

Mock Interview: HDFS

GFS: How to Design a Distributed File System Storage?

Google File System: Introduction

High-level Architecture

Single Master and Large Chunk Size

Metadata

Master Operations

Anatomy of a Read Operation

Anatomy of a Write Operation

Anatomy of an Append Operation

GFS Consistency Model and Snapshotting

Fault Tolerance, High Availability, and Data Integrity

Garbage Collection

Criticism on GFS

Summary: GFS

Quiz: GFS

Mock Interview: GFS

BigTable: How to Design a Wide Column Storage System?

BigTable: Introduction

BigTable Data Model

System APIs

Partitioning and High-level Architecture

SSTable

GFS and Chubby

Bigtable Components

Working with Tablets

The Life of BigTable's Read & Write Operations

Fault Tolerance and Compaction

BigTable Refinements

BigTable Characteristics

Summary: BigTable

Quiz: BigTable

Mock Interview: BigTable

Designing Reddit (medium)

Design Reddit

Quiz

Designing Notification Service (medium)

Designing a Notification System

Quiz

Design Google Calendar (medium)

Design Google calendar (Medium)

Quiz

Design a Recommendation System (medium)

Design a Recommendation System for Netflix

Quiz

Designing Gmail (medium)

Design Gmail

Quiz

Designing Google News (medium)

Design Google News, a Global News Aggregator System (Medium)

Quiz

Designing Unique ID Generator (medium)

Design Unique ID Generator (Easy)

Quiz

Designing Code Judging System (medium)

Design Code Judging System like LeetCode (Medium)

Quiz

Designing Payment System (hard)

Design Payment System

Quiz

Designing Flash Sale System (hard)

Design a Flash Sale for an E-commerce Site (Hard)

Quiz

Designing Reminder Alert System (hard)

Design a Reminder Alert System

Quiz

System Design Patterns

Introduction: System Design Patterns

1. Bloom Filters

2. Consistent Hashing

3. Quorum

4. Leader and Follower

5. Write-ahead Log

6. Segmented Log

7. High-Water Mark

8. Lease

9. Heartbeat

10. Gossip Protocol

11. Phi Accrual Failure Detection

12. Split Brain

13. Fencing

14. Checksum

15. Vector Clocks

16. CAP Theorem

17. PACELC Theorem

18. Hinted Handoff

19. Read Repair

20. Merkle Trees

Quiz

System Design Master Template

distributed systems

microservices

dns

load balancing

hard

20 min

·Updated Oct 2025

System design interviews are unstructured by design. In these interviews, you are asked to take on an open-ended design problem that doesn’t have a standard solution.

The two biggest challenges of answering a system design interview question are:

To know where to start.
To know if you have talked about all the important parts of the system.

To simplify this process, this course offers a comprehensive system design template that can effectively guide you in addressing any system design interview question.

Have a look at the following image to understand the major components that could be part of any system design and how these components interact with each other.

System Design Master Template

System Design Master Template (video)

Here is a video discussing the System Design Master Template:

System Design Master Template

With this master template in mind, we will discuss the 18 essential system design concepts. Here is a brief description of each:

1. Domain Name System (DNS)

The Domain Name System (DNS) serves as a fundamental component of the internet infrastructure, translating user-friendly domain names into their corresponding IP addresses. It acts as a phonebook for the internet, enabling users to access websites and services by entering easily memorable domain names, such as www.designgurus.io, rather than the numerical IP addresses like "192.0.2.1" that computers utilize to identify each other.

When you input a domain name into your web browser, the DNS is responsible for finding the associated IP address and directing your request to the appropriate server. This process commences with your computer sending a query to a recursive resolver, which then searches a series of DNS servers, beginning with the root server, followed by the Top-Level Domain (TLD) server, and ultimately the authoritative name server. Once the IP address is located, the recursive resolver returns it to your computer, allowing your browser to establish a connection with the target server and access the desired content.

DNS

2. Load Balancer

A load balancer is a networking device or software designed to distribute incoming network traffic across multiple servers, ensuring optimal resource utilization, reduced latency, and maintained high availability. It plays a crucial role in scaling applications and efficiently managing server workloads, particularly in situations where there is a sudden surge in traffic or uneven distribution of requests among servers.

Load balancers employ various algorithms to determine the distribution of incoming traffic. Some common algorithms include:

Round Robin: Requests are sequentially and evenly distributed across all available servers in a cyclical manner.
Least Connections: The load balancer assigns requests to the server with the fewest active connections, giving priority to less-busy servers.
IP Hash: The client's IP address is hashed, and the resulting value is used to determine which server the request should be directed to. This method ensures that a specific client's requests are consistently routed to the same server, helping maintain session persistence.

Load Balancer

3. API Gateway

An API Gateway serves as a server or service that functions as an intermediary between external clients and the internal microservices or API-based backend services of an application. It is a vital component in contemporary architectures, particularly in microservices-based systems, where it streamlines the communication process and offers a single entry point for clients to access various services.

The primary functions of an API Gateway encompass:

Request Routing: The API Gateway directs incoming API requests from clients to the appropriate backend service or microservice, based on predefined rules and configurations.
Authentication and Authorization: The API Gateway manages user authentication and authorization, ensuring that only authorized clients can access the services. It verifies API keys, tokens, or other credentials before routing requests to the backend services.
Rate Limiting and Throttling: To safeguard backend services from excessive load or abuse, the API Gateway enforces rate limits or throttles requests from clients according to predefined policies.
Caching: In order to minimize latency and backend load, the API Gateway caches frequently-used responses, serving them directly to clients without the need to query the backend services.
Request and Response Transformation: The API Gateway can modify requests and responses, such as converting data formats, adding or removing headers, or altering query parameters, to ensure compatibility between clients and services.

Is it Possible to Use a Load Balancer and an API Gateway Together? Yes, in many real-world architectures, both of these components work together, where the Load Balancer effectively manages traffic across multiple instances of API Gateways or directly to services, depending on the setup. See details here.

API Gateway

4. CDN

A Content Delivery Network (CDN) is a distributed network of servers that store and deliver content, such as images, videos, stylesheets, and scripts, to users from locations that are geographically closer to them. CDNs are designed to enhance the performance, speed, and reliability of content delivery to end-users, irrespective of their location relative to the origin server. Here's how a CDN operates:

When a user requests content from a website or application, the request is directed to the nearest CDN server, also known as an edge server.
If the edge server has the requested content cached, it directly serves the content to the user. This process reduces latency and improves the user experience, as the content travels a shorter distance.
If the content is not cached on the edge server, the CDN retrieves it from the origin server or another nearby CDN server. Once the content is fetched, it is cached on the edge server and served to the user.
To ensure the content stays up-to-date, the CDN periodically checks the origin server for changes and updates its cache accordingly.

CDN

5. Forward Proxy vs. Reverse Proxy

A forward proxy, also referred to as a "proxy server" or simply "proxy," is a server positioned in front of one or more client machines, acting as an intermediary between the clients and the internet. When a client machine requests a resource on the internet, the request is initially sent to the forward proxy. The forward proxy then forwards the request to the internet on behalf of the client machine and returns the response to the client machine.

On the other hand, a reverse proxy is a server that sits in front of one or more web servers, serving as an intermediary between the web servers and the internet. When a client requests a resource on the internet, the request is first sent to the reverse proxy. The reverse proxy then forwards the request to one of the web servers, which returns the response to the reverse proxy. Finally, the reverse proxy returns the response to the client.

Forward Proxy vs. Reverse Proxy

6. Caching

Cache is a high-speed storage layer positioned between the application and the original data source, such as a database, file system, or remote web service. When an application requests data, the cache is checked first. If the data is present in the cache, it is returned to the application. If the data is not found in the cache, it is retrieved from its original source, stored in the cache for future use, and then returned to the application. In a distributed system, caching can occur in multiple locations, including the client, DNS, CDN, load balancer, API gateway, server, database, and more.

Cache

7. Data Partitioning

In a database, horizontal partitioning, often referred to as sharding, entails dividing the rows of a table into smaller tables and storing them on distinct servers or database instances. This method is employed to distribute the database load across multiple servers, thereby enhancing performance.

Conversely, vertical partitioning involves splitting the columns of a table into separate tables. This technique aims to reduce the column count in a table and boost the performance of queries that only access a limited number of columns.

Data partitioning

8. Database Replication

Database replication is a method employed to maintain multiple copies of the same database across various servers or locations. The main objective of database replication is to enhance data availability, redundancy, and fault tolerance, ensuring the system remains operational even in the face of hardware failures or other issues.

In a replicated database configuration, one server serves as the primary (or master) database, while others act as replicas (or slaves). This process involves synchronizing data between the primary database and replicas, ensuring all possess the same up-to-date information. Database replication provides several advantages, including:

Improved Performance: By distributing read queries among multiple replicas, the load on the primary database can be reduced, leading to faster query response times.
High Availability: If the primary database experiences failure or downtime, replicas can continue to provide data, ensuring uninterrupted access to the application.
Enhanced Data Protection: Maintaining multiple copies of the database across different locations helps safeguard against data loss due to hardware failures or other disasters.
Load Balancing: Replicas can handle read queries, allowing for better load distribution and reducing overall stress on the primary database.

9. Distributed Messaging Systems

Distributed messaging systems provide a reliable, scalable, and fault-tolerant means for exchanging messages between numerous, possibly geographically-dispersed applications, services, or components. These systems facilitate communication by decoupling sender and receiver components, enabling them to develop and function independently. Distributed messaging systems are especially valuable in large-scale or intricate systems, like those seen in microservices architectures or distributed computing environments. Examples of these systems include Apache Kafka and RabbitMQ.

10. Microservices

Microservices represent an architectural style wherein an application is organized as an assembly of small, loosely-coupled, and autonomously deployable services. Each microservice is accountable for a distinct aspect of functionality or domain within the application and communicates with other microservices via well-defined APIs. This method deviates from the conventional monolithic architecture, where an application is constructed as a single, tightly-coupled unit.

The primary characteristics of microservices include:

Single Responsibility: Adhering to the Single Responsibility Principle, each microservice focuses on a specific function or domain, making the services more straightforward to comprehend, develop, and maintain.
Independence: Microservices can be independently developed, deployed, and scaled, offering increased flexibility and agility in the development process. Teams can work on various services simultaneously without impacting the entire system.
Decentralization: Typically, microservices are decentralized, with each service possessing its data and business logic. This approach fosters separation of concerns and empowers teams to make decisions and select technologies tailored to their unique requirements.
Communication: Microservices interact with each other using lightweight protocols, such as HTTP/REST, gRPC, or message queues. This fosters interoperability and facilitates the integration of new services or the replacement of existing ones.
Fault Tolerance: As microservices are independent, the failure of one service does not necessarily result in the collapse of the entire system, enhancing the application's overall resiliency.

11. NoSQL Databases

NoSQL databases, or “Not Only SQL” databases, are non-relational databases designed to store, manage, and retrieve unstructured or semi-structured data. They offer an alternative to traditional relational databases, which rely on structured data and predefined schemas. NoSQL databases have become popular due to their flexibility, scalability, and ability to handle large volumes of data, making them well-suited for modern applications, big data processing, and real-time analytics.

NoSQL databases can be categorized into four main types:

Document-Based: These databases store data in document-like structures, such as JSON or BSON. Each document is self-contained and can have its own unique structure, making them suitable for handling heterogeneous data. Examples of document-based NoSQL databases include MongoDB and Couchbase.
Key-Value: These databases store data as key-value pairs, where the key acts as a unique identifier, and the value holds the associated data. Key-value databases are highly efficient for simple read and write operations, and they can be easily partitioned and scaled horizontally. Examples of key-value NoSQL databases include Redis and Amazon DynamoDB.
Column-Family: These databases store data in column families, which are groups of related columns. They are designed to handle write-heavy workloads and are highly efficient for querying data with a known row and column keys. Examples of column-family NoSQL databases include Apache Cassandra and HBase.
Graph-Based: These databases are designed for storing and querying data that has complex relationships and interconnected structures, such as social networks or recommendation systems. Graph databases use nodes, edges, and properties to represent and store data, making it easier to perform complex traversals and relationship-based queries. Examples of graph-based NoSQL databases include Neo4j and Amazon Neptune.

Types of NoSQL databases

12. Database Index

Database indexes are data structures that enhance the speed and efficiency of query operations within a database. They function similarly to an index in a book, enabling the database management system (DBMS) to swiftly locate data associated with a specific value or group of values, without the need to search through every row in a table. By offering a more direct route to the desired data, indexes can considerably decrease the time required to retrieve information from a database.

Indexes are typically constructed on one or more columns of a database table. The B-tree index is the most prevalent type, organizing data in a hierarchical tree structure, which allows for rapid search, insertion, and deletion operations. Other types of indexes, such as bitmap indexes and hash indexes, exist as well, each with their particular use cases and advantages.

Although indexes can significantly enhance query performance, they also involve certain trade-offs:

Storage Space: Indexes require additional storage space since they generate and maintain separate data structures alongside the original table data.
Write Performance: When data is inserted, updated, or deleted in a table, the corresponding indexes must also be updated, which may slow down write operations.

Database Index

13. Distributed File Systems

Distributed file systems are storage systems designed to manage and grant access to files and directories across multiple servers, nodes, or machines, frequently distributed across a network. They allow users and applications to access and modify files as though they were situated on a local file system, despite the fact that the actual files may be physically located on various remote servers. Distributed file systems are commonly employed in large-scale or distributed computing environments to offer fault tolerance, high availability, and enhanced performance.

14. Notification System

These are used to send notifications or alerts to users, such as emails, push notifications, or text messages.

15. Full-text Search

Full-text search allows users to search for particular words or phrases within an application or website. Upon receiving a user query, the application or website delivers the most relevant results. To accomplish this rapidly and effectively, full-text search utilizes an inverted index, a data structure that associates words or phrases with the documents where they are found. Elastic Search is an example of such systems.

16. Distributed Coordination Services

Distributed coordination services are systems engineered to regulate and synchronize the actions of distributed applications, services, or nodes in a dependable, efficient, and fault-tolerant way. They assist in maintaining consistency, addressing distributed synchronization, and overseeing the configuration and state of diverse components in a distributed setting. Distributed coordination services are especially valuable in large-scale or intricate systems, like those encountered in microservices architectures, distributed computing environments, or clustered databases. Apache ZooKeeper, etcd, and Consul are examples of such services.

17. Heartbeat

In a distributed environment, work/data is distributed among servers. To efficiently route requests in such a setup, servers need to know what other servers are part of the system. Furthermore, servers should know if other servers are alive and working. In a decentralized system, whenever a request arrives at a server, the server should have enough information to decide which server is responsible for entertaining that request. This makes the timely detection of server failure an important task, which also enables the system to take corrective actions and move the data/work to another healthy server and stop the environment from further deterioration.

To solve this, each server periodically sends a heartbeat message to a central monitoring server or other servers in the system to show that it is still alive and functioning.

Heartbeating is one of the mechanisms for detecting failures in a distributed system. If there is a central server, all servers periodically send a heartbeat message to it. If there is no central server, all servers randomly choose a set of servers and send them a heartbeat message every few seconds. This way, if no heartbeat message is received from a server for a while, the system can suspect that the server might have crashed. If there is no heartbeat within a configured timeout period, the system can conclude that the server is not alive anymore and stop sending requests to it and start working on its replacement.

18. Checksum

In a distributed system, while moving data between components, it is possible that the data fetched from a node may arrive corrupted. This corruption can occur because of faults in a storage device, network, software, etc. How can a distributed system ensure data integrity, so that the client receives an error instead of corrupt data?

To solve this, we can calculate a checksum and store it with data.

To calculate a checksum, a cryptographic hash-function like MD5, SHA-1, SHA-256, or SHA-512 is used. The hash function takes the input data and produces a string (containing letters and numbers) of fixed length; this string is called the checksum.

When a system is storing some data, it computes a checksum of the data and stores the checksum with the data. When a client retrieves data, it verifies that the data it received from the server matches the checksum stored. If not, then the client can opt to retrieve that data from another replica.

Download System Design Master Template (pdf).

With this, let's proceed to solve our first system design problem: 'Designing a URL Shortening Service'.

Mark as read

PreviousSystem Design Interviews - A step by step guide

NextDesigning a URL Shortening Service like TinyURL

Discussion

Have a question or insight about this topic? Share it with the community.

Reading Progress

On This Page