System Design
Learn System Design
Introduction to System Design
How to Learn System Design?
Functional vs. Non-functional Requirements
What are Back-of-the-Envelope Estimations?
Things to Avoid During System Design Interview
System Design Basics
Load Balancing
Introduction to Load Balancing
Load Balancing Algorithms
Uses of Load Balancing
Load Balancer Types
Stateless vs. Stateful Load Balancing
High Availability and Fault Tolerance
Scalability and Performance
Challenges of Load Balancers
API Gateway
Introduction to API Gateway
Usage of API gateway
Advantages and disadvantages of using API gateway
Key Characteristics of Distributed Systems
Scalability
Availability
Latency and Performance
Concurrency and Coordination
Monitoring and Observability
Resilience and Error Handling
Fault Tolerance vs. High Availability
Network Essentials
HTTP vs. HTTPS
TCP vs. UDP
HTTP: 1.0 vs. 1.1 vs 2.0 vs. 3.0
URL vs. URI vs. URN
Domain Name System (DNS)
Introduction to DNS
DNS Resolution Process
DNS Load Balancing and High Availability
Caching
Introduction to Caching
Why is Caching Important?
Types of Caching
Cache Replacement Policies
Cache Invalidation
Cache Read Strategies
Cache Coherence and Consistency Models
Caching Challenges
Cache Performance Metrics
CDN
What is CDN?
Origin Server vs. Edge Server
CDN Architecture
Push CDN vs. Pull CDN
Data Partitioning
Introduction to Data Partitioning
Partitioning Methods
Data Sharding Techniques
Benefits of Data Partitioning
Common Problems Associated with Data Partitioning
Proxies
What is a Proxy Server?
Uses of Proxies
VPN vs. Proxy Server
Redundancy and Replication
What is Redundancy?
What is Replication?
Replication Methods
Data Backup vs. Disaster Recovery
CAP & PACELC Theorems
Introduction to CAP Theorem
Components of CAP Theorem
Trade-offs in CAP Theorem
Examples of CAP Theorem in Practice
Beyond CAP Theorem
System Design Trade-offs in Interviews
Databases (SQL vs. NoSQL)
Introduction to Databases
SQL Databases
NoSQL Databases
SQL vs. NoSQL
ACID vs BASE Properties
Real-World Examples and Case Studies
SQL Normalization and Denormalization
In-Memory Database vs. On-Disk Database
Data Replication vs. Data Mirroring
Database Federation
Indexes
What are Indexes?
Types of Indexes
Bloom Filters
Introduction to Bloom Filters
Benefits & Limitations of Bloom Filters
Variants and Extensions of Bloom Filters
Applications of Bloom Filters
Long-Polling vs. WebSockets vs. Server-Sent Events
Difference Between Long-Polling, WebSockets, and Server-Sent Events
Quorum
Why Quorum?
What is Quorum?
Heartbeat
What is Heartbeat?
Checksum
What is Checksum?
Uses of Checksum
Leader and Follower
What is Leader and Follower Pattern?
Security
What is Security and Privacy?
What is Authentication?
What is Authorization?
Authentication vs. Authorization
OAuth vs. JWT for Authentication
What is Encryption?
What are DDoS Attacks?
Distributed Messaging System
Introduction to Messaging System
Introduction to Kafka
Messaging patterns
Popular Messaging Queue Systems
RabbitMQ vs. Kafka vs. ActiveMQ
Scalability and Performance
Distributed File Systems
What is a Distributed File System?
Architecture of a Distributed File System
Key Components of a DFS
Misc Concepts
Batch Processing vs. Stream Processing
XML vs. JSON
Synchronous vs. Asynchronous Communication
Push vs. Pull Notification Systems
Microservices vs. Serverless Architecture
Message Queues vs. Service Bus
Stateful vs. Stateless Architecture
Event-Driven vs. Polling Architecture
Quiz - System Design Fundamentals
Quiz
System Design Trade-offs
Importance of Discussing Trade-offs
Strong vs Eventual Consistency
Latency vs Throughput
ACID vs BASE Properties in Databases
Read-Through vs Write-Through Cache
Batch Processing vs Stream Processing
Load Balancer vs. API Gateway
API Gateway vs Direct Service Exposure
Proxy vs. Reverse Proxy
API Gateway vs. Reverse Proxy
SQL vs. NoSQL
Primary-Replica vs Peer-to-Peer Replication
Data Compression vs Data Deduplication
Server-Side Caching vs Client-Side Caching
REST vs RPC
Polling vs. Long-Polling vs. WebSockets vs. Webhooks
CDN Usage vs Direct Server Serving
Serverless Architecture vs Traditional Server-based
Stateful vs Stateless Architecture
Hybrid Cloud Storage vs All-Cloud Storage
Token Bucket vs Leaky Bucket
Read Heavy vs Write Heavy System
Quiz
Designing a URL Shortening Service like TinyURL
Designing a URL Shortening Service like TinyURL
Quiz - Designing URL Shortner
Designing Pastebin
Designing Pastebin
Quiz - Designing Pastebin
Designing Instagram
Designing Instagram
Quiz - Designing Instagram
Designing Dropbox
Designing Dropbox
Quiz - Designing Dropbox
Designing Facebook Messenger
Designing Facebook Messenger
Quiz - Designing Facebook Messenger
Designing Twitter
Designing Twitter
Quiz - Designing Twitter
Designing Youtube or Netflix
Designing Youtube or Netflix
Quiz - Designing Youtube
Designing Typeahead Suggestion
Designing Typeahead Suggestion
Quiz - Designing Typeahead Suggestion
Designing an API Rate Limiter
Designing an API Rate Limiter
Quiz - Designing an API Rate Limiter
Designing Twitter Search
Designing Twitter Search
Quiz - Designing Twitter Search
Designing a Web Crawler
Designing a Web Crawler
Quiz - Designing a Web Crawler
Designing Facebook’s Newsfeed
Designing Facebook’s Newsfeed
Quiz - Designing Facebook’s Newsfeed
Designing Yelp or Nearby Friends
Designing Yelp or Nearby Friends
Quiz - Designing Yelp or Nearby Friends
Designing Uber backend
Designing Uber backend
Quiz - Designing Uber backend
Designing Ticketmaster
Designing Ticketmaster
Quiz - Designing Ticketmaster
Dynamo: How to design a key value store?
Dynamo: Introduction
High-Level Architecture
Data Partitioning
Replication
Vector Clocks and Conflicting Data
The Life of Dynamo’s put() & get() Operations
Anti-entropy Through Merkle Trees
Gossip Protocol
Dynamo Characteristics and Criticism
Summary: Dynamo
Quiz: Dynamo
Mock Interview: Dynamo
Designing YouTube Likes Counter (medium)
YouTube Likes Counter
Quiz
Cassandra: How to Design a Wide-column NoSQL Database?
Cassandra: Introduction
High-level Architecture
Replication
Cassandra Consistency Levels
Gossiper
Anatomy of Cassandra's Write Operation
Anatomy of Cassandra's Read Operation
Compaction
Tombstones
Summary: Cassandra
Quiz: Cassandra
Mock Interview: Cassandra
Kafka: How to Design a Distributed Messaging System?
Messaging Systems: Introduction
Kafka: Introduction
High-level Architecture
Kafka: Deep Dive
Consumer Groups
Kafka Workflow
Role of ZooKeeper
Controller Broker
Kafka Delivery Semantics
Kafka Characteristics
Summary: Kafka
Quiz: Kafka
Mock Interview: Kafka
Chubby: How to Design a Distributed Locking Service?
Chubby: Introduction
High-level Architecture
Design Rationale
How Chubby Works
File, Directories, and Handles
Locks, Sequencers, and Lock-delays
Sessions and Events
Master Election and Chubby Events
Caching
Database
Scaling Chubby
Summary: Chubby
Quiz: Chubby
Mock Interview: Chubby
HDFS: How to Design File Storage System?
Hadoop Distributed File System: Introduction
High-level Architecture
Deep Dive
Anatomy of a Read Operation
Anatomy of a Write Operation
Data Integrity & Caching
Fault Tolerance
HDFS High Availability (HA)
HDFS Characteristics
Summary: HDFS
Quiz: HDFS
Mock Interview: HDFS
GFS: How to Design a Distributed File System Storage?
Google File System: Introduction
High-level Architecture
Single Master and Large Chunk Size
Metadata
Master Operations
Anatomy of a Read Operation
Anatomy of a Write Operation
Anatomy of an Append Operation
GFS Consistency Model and Snapshotting
Fault Tolerance, High Availability, and Data Integrity
Garbage Collection
Criticism on GFS
Summary: GFS
Quiz: GFS
Mock Interview: GFS
BigTable: How to Design a Wide Column Storage System?
BigTable: Introduction
BigTable Data Model
System APIs
Partitioning and High-level Architecture
SSTable
GFS and Chubby
Bigtable Components
Working with Tablets
The Life of BigTable's Read & Write Operations
Fault Tolerance and Compaction
BigTable Refinements
BigTable Characteristics
Summary: BigTable
Quiz: BigTable
Mock Interview: BigTable
Designing Reddit (medium)
Design Reddit
Quiz
Designing Notification Service (medium)
Designing a Notification System
Quiz
Design Google Calendar (medium)
Design Google calendar (Medium)
Quiz
Design a Recommendation System (medium)
Design a Recommendation System for Netflix
Quiz
Designing Gmail (medium)
Design Gmail
Quiz
Designing Google News (medium)
Design Google News, a Global News Aggregator System (Medium)
Quiz
Designing Unique ID Generator (medium)
Design Unique ID Generator (Easy)
Quiz
Designing Code Judging System (medium)
Design Code Judging System like LeetCode (Medium)
Quiz
Designing Payment System (hard)
Design Payment System
Quiz
Designing Flash Sale System (hard)
Design a Flash Sale for an E-commerce Site (Hard)
Quiz
Designing Reminder Alert System (hard)
Design a Reminder Alert System
Quiz
System Design Patterns
Introduction: System Design Patterns
1. Bloom Filters
2. Consistent Hashing
3. Quorum
4. Leader and Follower
5. Write-ahead Log
6. Segmented Log
7. High-Water Mark
8. Lease
9. Heartbeat
10. Gossip Protocol
11. Phi Accrual Failure Detection
12. Split Brain
13. Fencing
14. Checksum
15. Vector Clocks
16. CAP Theorem
17. PACELC Theorem
18. Hinted Handoff
19. Read Repair
20. Merkle Trees
Quiz
System Design Master Template
distributed systems
microservices
dns
load balancing
+2
System design interviews are unstructured by design. In these interviews, you are asked to take on an open-ended design problem that doesn’t have a standard solution.
The two biggest challenges of answering a system design interview question are:
- To know where to start.
- To know if you have talked about all the important parts of the system.
To simplify this process, this course offers a comprehensive system design template that can effectively guide you in addressing any system design interview question.
Have a look at the following image to understand the major components that could be part of any system design and how these components interact with each other.
System Design Master Template (video)
Here is a video discussing the System Design Master Template:
With this master template in mind, we will discuss the 18 essential system design concepts. Here is a brief description of each:
1. Domain Name System (DNS)
The Domain Name System (DNS) serves as a fundamental component of the internet infrastructure, translating user-friendly domain names into their corresponding IP addresses. It acts as a phonebook for the internet, enabling users to access websites and services by entering easily memorable domain names, such as www.designgurus.io, rather than the numerical IP addresses like "192.0.2.1" that computers utilize to identify each other.
When you input a domain name into your web browser, the DNS is responsible for finding the associated IP address and directing your request to the appropriate server. This process commences with your computer sending a query to a recursive resolver, which then searches a series of DNS servers, beginning with the root server, followed by the Top-Level Domain (TLD) server, and ultimately the authoritative name server. Once the IP address is located, the recursive resolver returns it to your computer, allowing your browser to establish a connection with the target server and access the desired content.
2. Load Balancer
A load balancer is a networking device or software designed to distribute incoming network traffic across multiple servers, ensuring optimal resource utilization, reduced latency, and maintained high availability. It plays a crucial role in scaling applications and efficiently managing server workloads, particularly in situations where there is a sudden surge in traffic or uneven distribution of requests among servers.
Load balancers employ various algorithms to determine the distribution of incoming traffic. Some common algorithms include:
- Round Robin: Requests are sequentially and evenly distributed across all available servers in a cyclical manner.
- Least Connections: The load balancer assigns requests to the server with the fewest active connections, giving priority to less-busy servers.
- IP Hash: The client's IP address is hashed, and the resulting value is used to determine which server the request should be directed to. This method ensures that a specific client's requests are consistently routed to the same server, helping maintain session persistence.
3. API Gateway
An API Gateway serves as a server or service that functions as an intermediary between external clients and the internal microservices or API-based backend services of an application. It is a vital component in contemporary architectures, particularly in microservices-based systems, where it streamlines the communication process and offers a single entry point for clients to access various services.
The primary functions of an API Gateway encompass:
- Request Routing: The API Gateway directs incoming API requests from clients to the appropriate backend service or microservice, based on predefined rules and configurations.
- Authentication and Authorization: The API Gateway manages user authentication and authorization, ensuring that only authorized clients can access the services. It verifies API keys, tokens, or other credentials before routing requests to the backend services.
- Rate Limiting and Throttling: To safeguard backend services from excessive load or abuse, the API Gateway enforces rate limits or throttles requests from clients according to predefined policies.
- Caching: In order to minimize latency and backend load, the API Gateway caches frequently-used responses, serving them directly to clients without the need to query the backend services.
- Request and Response Transformation: The API Gateway can modify requests and responses, such as converting data formats, adding or removing headers, or altering query parameters, to ensure compatibility between clients and services.
Is it Possible to Use a Load Balancer and an API Gateway Together? Yes, in many real-world architectures, both of these components work together, where the Load Balancer effectively manages traffic across multiple instances of API Gateways or directly to services, depending on the setup. See details here.
4. CDN
A Content Delivery Network (CDN) is a distributed network of servers that store and deliver content, such as images, videos, stylesheets, and scripts, to users from locations that are geographically closer to them. CDNs are designed to enhance the performance, speed, and reliability of content delivery to end-users, irrespective of their location relative to the origin server. Here's how a CDN operates:
- When a user requests content from a website or application, the request is directed to the nearest CDN server, also known as an edge server.
- If the edge server has the requested content cached, it directly serves the content to the user. This process reduces latency and improves the user experience, as the content travels a shorter distance.
- If the content is not cached on the edge server, the CDN retrieves it from the origin server or another nearby CDN server. Once the content is fetched, it is cached on the edge server and served to the user.
- To ensure the content stays up-to-date, the CDN periodically checks the origin server for changes and updates its cache accordingly.
5. Forward Proxy vs. Reverse Proxy
A forward proxy, also referred to as a "proxy server" or simply "proxy," is a server positioned in front of one or more client machines, acting as an intermediary between the clients and the internet. When a client machine requests a resource on the internet, the request is initially sent to the forward proxy. The forward proxy then forwards the request to the internet on behalf of the client machine and returns the response to the client machine.
On the other hand, a reverse proxy is a server that sits in front of one or more web servers, serving as an intermediary between the web servers and the internet. When a client requests a resource on the internet, the request is first sent to the reverse proxy. The reverse proxy then forwards the request to one of the web servers, which returns the response to the reverse proxy. Finally, the reverse proxy returns the response to the client.
6. Caching
Cache is a high-speed storage layer positioned between the application and the original data source, such as a database, file system, or remote web service. When an application requests data, the cache is checked first. If the data is present in the cache, it is returned to the application. If the data is not found in the cache, it is retrieved from its original source, stored in the cache for future use, and then returned to the application. In a distributed system, caching can occur in multiple locations, including the client, DNS, CDN, load balancer, API gateway, server, database, and more.
7. Data Partitioning
In a database, horizontal partitioning, often referred to as sharding, entails dividing the rows of a table into smaller tables and storing them on distinct servers or database instances. This method is employed to distribute the database load across multiple servers, thereby enhancing performance.
Conversely, vertical partitioning involves splitting the columns of a table into separate tables. This technique aims to reduce the column count in a table and boost the performance of queries that only access a limited number of columns.
8. Database Replication
Database replication is a method employed to maintain multiple copies of the same database across various servers or locations. The main objective of database replication is to enhance data availability, redundancy, and fault tolerance, ensuring the system remains operational even in the face of hardware failures or other issues.
In a replicated database configuration, one server serves as the primary (or master) database, while others act as replicas (or slaves). This process involves synchronizing data between the primary database and replicas, ensuring all possess the same up-to-date information. Database replication provides several advantages, including:
- Improved Performance: By distributing read queries among multiple replicas, the load on the primary database can be reduced, leading to faster query response times.
- High Availability: If the primary database experiences failure or downtime, replicas can continue to provide data, ensuring uninterrupted access to the application.
- Enhanced Data Protection: Maintaining multiple copies of the database across different locations helps safeguard against data loss due to hardware failures or other disasters.
- Load Balancing: Replicas can handle read queries, allowing for better load distribution and reducing overall stress on the primary database.
9. Distributed Messaging Systems
Distributed messaging systems provide a reliable, scalable, and fault-tolerant means for exchanging messages between numerous, possibly geographically-dispersed applications, services, or components. These systems facilitate communication by decoupling sender and receiver components, enabling them to develop and function independently. Distributed messaging systems are especially valuable in large-scale or intricate systems, like those seen in microservices architectures or distributed computing environments. Examples of these systems include Apache Kafka and RabbitMQ.
10. Microservices
Microservices represent an architectural style wherein an application is organized as an assembly of small, loosely-coupled, and autonomously deployable services. Each microservice is accountable for a distinct aspect of functionality or domain within the application and communicates with other microservices via well-defined APIs. This method deviates from the conventional monolithic architecture, where an application is constructed as a single, tightly-coupled unit.
The primary characteristics of microservices include:
- Single Responsibility: Adhering to the Single Responsibility Principle, each microservice focuses on a specific function or domain, making the services more straightforward to comprehend, develop, and maintain.
- Independence: Microservices can be independently developed, deployed, and scaled, offering increased flexibility and agility in the development process. Teams can work on various services simultaneously without impacting the entire system.
- Decentralization: Typically, microservices are decentralized, with each service possessing its data and business logic. This approach fosters separation of concerns and empowers teams to make decisions and select technologies tailored to their unique requirements.
- Communication: Microservices interact with each other using lightweight protocols, such as HTTP/REST, gRPC, or message queues. This fosters interoperability and facilitates the integration of new services or the replacement of existing ones.
- Fault Tolerance: As microservices are independent, the failure of one service does not necessarily result in the collapse of the entire system, enhancing the application's overall resiliency.
11. NoSQL Databases
NoSQL databases, or “Not Only SQL” databases, are non-relational databases designed to store, manage, and retrieve unstructured or semi-structured data. They offer an alternative to traditional relational databases, which rely on structured data and predefined schemas. NoSQL databases have become popular due to their flexibility, scalability, and ability to handle large volumes of data, making them well-suited for modern applications, big data processing, and real-time analytics.
NoSQL databases can be categorized into four main types:
- Document-Based: These databases store data in document-like structures, such as JSON or BSON. Each document is self-contained and can have its own unique structure, making them suitable for handling heterogeneous data. Examples of document-based NoSQL databases include MongoDB and Couchbase.
- Key-Value: These databases store data as key-value pairs, where the key acts as a unique identifier, and the value holds the associated data. Key-value databases are highly efficient for simple read and write operations, and they can be easily partitioned and scaled horizontally. Examples of key-value NoSQL databases include Redis and Amazon DynamoDB.
- Column-Family: These databases store data in column families, which are groups of related columns. They are designed to handle write-heavy workloads and are highly efficient for querying data with a known row and column keys. Examples of column-family NoSQL databases include Apache Cassandra and HBase.
- Graph-Based: These databases are designed for storing and querying data that has complex relationships and interconnected structures, such as social networks or recommendation systems. Graph databases use nodes, edges, and properties to represent and store data, making it easier to perform complex traversals and relationship-based queries. Examples of graph-based NoSQL databases include Neo4j and Amazon Neptune.
12. Database Index
Database indexes are data structures that enhance the speed and efficiency of query operations within a database. They function similarly to an index in a book, enabling the database management system (DBMS) to swiftly locate data associated with a specific value or group of values, without the need to search through every row in a table. By offering a more direct route to the desired data, indexes can considerably decrease the time required to retrieve information from a database.
Indexes are typically constructed on one or more columns of a database table. The B-tree index is the most prevalent type, organizing data in a hierarchical tree structure, which allows for rapid search, insertion, and deletion operations. Other types of indexes, such as bitmap indexes and hash indexes, exist as well, each with their particular use cases and advantages.
Although indexes can significantly enhance query performance, they also involve certain trade-offs:
- Storage Space: Indexes require additional storage space since they generate and maintain separate data structures alongside the original table data.
- Write Performance: When data is inserted, updated, or deleted in a table, the corresponding indexes must also be updated, which may slow down write operations.
13. Distributed File Systems
Distributed file systems are storage systems designed to manage and grant access to files and directories across multiple servers, nodes, or machines, frequently distributed across a network. They allow users and applications to access and modify files as though they were situated on a local file system, despite the fact that the actual files may be physically located on various remote servers. Distributed file systems are commonly employed in large-scale or distributed computing environments to offer fault tolerance, high availability, and enhanced performance.
14. Notification System
These are used to send notifications or alerts to users, such as emails, push notifications, or text messages.
15. Full-text Search
Full-text search allows users to search for particular words or phrases within an application or website. Upon receiving a user query, the application or website delivers the most relevant results. To accomplish this rapidly and effectively, full-text search utilizes an inverted index, a data structure that associates words or phrases with the documents where they are found. Elastic Search is an example of such systems.
16. Distributed Coordination Services
Distributed coordination services are systems engineered to regulate and synchronize the actions of distributed applications, services, or nodes in a dependable, efficient, and fault-tolerant way. They assist in maintaining consistency, addressing distributed synchronization, and overseeing the configuration and state of diverse components in a distributed setting. Distributed coordination services are especially valuable in large-scale or intricate systems, like those encountered in microservices architectures, distributed computing environments, or clustered databases. Apache ZooKeeper, etcd, and Consul are examples of such services.
17. Heartbeat
In a distributed environment, work/data is distributed among servers. To efficiently route requests in such a setup, servers need to know what other servers are part of the system. Furthermore, servers should know if other servers are alive and working. In a decentralized system, whenever a request arrives at a server, the server should have enough information to decide which server is responsible for entertaining that request. This makes the timely detection of server failure an important task, which also enables the system to take corrective actions and move the data/work to another healthy server and stop the environment from further deterioration.
To solve this, each server periodically sends a heartbeat message to a central monitoring server or other servers in the system to show that it is still alive and functioning.
Heartbeating is one of the mechanisms for detecting failures in a distributed system. If there is a central server, all servers periodically send a heartbeat message to it. If there is no central server, all servers randomly choose a set of servers and send them a heartbeat message every few seconds. This way, if no heartbeat message is received from a server for a while, the system can suspect that the server might have crashed. If there is no heartbeat within a configured timeout period, the system can conclude that the server is not alive anymore and stop sending requests to it and start working on its replacement.
18. Checksum
In a distributed system, while moving data between components, it is possible that the data fetched from a node may arrive corrupted. This corruption can occur because of faults in a storage device, network, software, etc. How can a distributed system ensure data integrity, so that the client receives an error instead of corrupt data?
To solve this, we can calculate a checksum and store it with data.
To calculate a checksum, a cryptographic hash-function like MD5, SHA-1, SHA-256, or SHA-512 is used. The hash function takes the input data and produces a string (containing letters and numbers) of fixed length; this string is called the checksum.
When a system is storing some data, it computes a checksum of the data and stores the checksum with the data. When a client retrieves data, it verifies that the data it received from the server matches the checksum stored. If not, then the client can opt to retrieve that data from another replica.
Download System Design Master Template (pdf).
With this, let's proceed to solve our first system design problem: 'Designing a URL Shortening Service'.
Discussion
On This Page