Grokking the Advanced System Design Interview
Ask Author
Back to course home

0% completed

High-level Architecture

This lesson gives a brief overview of HDFS’s architecture.

HDFS architecture

All files stored in HDFS are broken into multiple fixed-size blocks, where each block is 128 megabytes in size by default (configurable on a per-file basis). Each file stored in HDFS consists of two parts: the actual file data and the metadata, i.e., how many block parts the file has, their locations and the total file size, etc. HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

HDFS high-level architecture
  • All blocks of a file are of the same size except the last one.
  • HDFS uses large block sizes because it is designed to store extremely large files to enable MapReduce jobs to process them efficiently.
  • Each block is identified by a unique 64-bit ID called BlockID.
  • All read/write operations in HDFS operate at the block level.
  • DataNodes store each block in a separate file on the local file system and provide read/write access.
  • When a DataNode starts up, it scans through its local file system and sends the list of hosted data blocks (called BlockReport) to the NameNode.
  • The NameNode maintains two on-disk data structures to store the file system's state: an FsImage file and an EditLog. FsImage is a checkpoint of the file system metadata at some point in time, while the EditLog is a log of all of the file system metadata transactions since the image file was last created. These two files help NameNode to recover from failure.
  • User applications interact with HDFS through its client. HDFS Client interacts with NameNode for metadata, but all data transfers happen directly between the client and DataNodes.
  • To achieve high-availability, HDFS creates multiple copies of the data and distributes them on nodes throughout the cluster.
HDFS block replication

Comparison between GFS and HDFS

HDFS architecture is similar to GFS, although there are differences in the terminology. Here is the comparison between the two file systems:

<style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;border-color:black;} .tg td{font-family:Arial, sans-serif;font-size:17px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;color:black;background-color:#67AB9F;} .tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;color:#493F3a;background-color:#9DE0AD;} .tg .tg-rmb8{background-color:#C5D6C4;vertical-align:top} .tg .tg-1rmb8{background-color:#C5D6C4;vertical-align:top; font-weight:bold;} .tg .tg-1yw4l{vertical-align:top; font-weight:bold;} .tg .tg-yw4l{vertical-align:top;} </style> <table class="tg" style="border-style:solid;border-width:1px;border-color:black;"> <tbody><tr> <td class="tg-1yw4l" style="width:25%"></td> <td class="tg-1yw4l">GFS</td> <td class="tg-1yw4l">HDFS</td> </tr> <tr> <td class="tg-1rmb8">Storage node</td> <td class="tg-rmb8">ChunkServer</td> <td class="tg-rmb8">DataNode</td> </tr> <tr> <td class="tg-1yw4l">File part</td> <td class="tg-yw4l">Chunk</td> <td class="tg-yw4l">Block</td> </tr> <tr> <td class="tg-1rmb8">File part size</td> <td class="tg-rmb8">Default chunk size is 64MB (adjustable)</td> <td class="tg-rmb8">Default block size is 128MB (adjustable)</td> </tr> <tr> <td class="tg-1yw4l">Metadata Checkpoint</td> <td class="tg-yw4l">Checkpoint image</td> <td class="tg-yw4l">FsImage</td> </tr> <tr> <td class="tg-1rmb8">Write ahead log</td> <td class="tg-rmb8">Operation log</td> <td class="tg-rmb8">EditLog</td> </tr> <tr> <td class="tg-1yw4l">Platform</td> <td class="tg-yw4l">Linux</td> <td class="tg-yw4l">Cross-Platform</td> </tr> <tr> <td class="tg-1rmb8">Language</td> <td class="tg-rmb8">Developed in C++</td> <td class="tg-rmb8">Developed in Java</td> </tr> <tr> <td class="tg-1yw4l">Available Implementation</td> <td class="tg-yw4l">Only used internally by Google</td> <td class="tg-yw4l">Opensource</td> </tr> <tr> <td class="tg-1rmb8">Monitoring</td> <td class="tg-rmb8">Master receives HeartBeat from ChunkServers</td> <td class="tg-rmb8">NameNode receives HeartBeat from DataNodes</td> </tr> <tr> <td class="tg-1yw4l">Concurrency</td> <td class="tg-yw4l">Follows multiple writers and multiple readers model.</td> <td class="tg-yw4l">Does not support multiple writers. HDFS follows the write-once and read-many model.</td> </tr> <tr> <td class="tg-1rmb8">File operations</td> <td class="tg-rmb8">Append and Random writes are possible.</td> <td class="tg-rmb8">Only append is possible.</td> </tr> <tr> <td class="tg-1yw4l">Garbage Collection</td> <td class="tg-yw4l">Any deleted file is renamed into a particular folder to be garbage collected later.</td> <td class="tg-yw4l">Any deleted file is renamed to a hidden name to be garbage collected later.</td> </tr> <tr> <td class="tg-1rmb8">Communication</td> <td class="tg-rmb8">RPC over TCP is used for communication with the master.

To minimize latency, pipelining and streaming are used over TCP for data transfer.

</td> <td class="tg-rmb8">RPC over TCP is used for communication with the NameNode.

For data transfer, pipelining and streaming are used over TCP.

</td> </tr> <tr> <td class="tg-1yw4l">Cache Management</td> <td class="tg-yw4l">Clients cache metadata.

Client or ChunkServer does not cache file data.

ChunkServers rely on the buffer cache in Linux to maintain frequently accessed data in memory.

</td> <td class="tg-yw4l">HDFS uses distributed cache.

User-specified paths are cached explicitly in the DataNode's memory in an off-heap block cache.

The cache could be private (belonging to one user) or public (belonging to all the users of the same node).

</td> </tr> <tr> <td class="tg-1rmb8">Replication Strategy</td> <td class="tg-rmb8">Chunk replicas are spread across the racks. Master automatically replicates the chunks.

By default, three copies of each chunk are stored. User can specify a different replication factor.

The master re-replicates a chunk replica as soon as the number of available replicas falls below a user-specified number.

</td> <td class="tg-rmb8">The HDFS has an automatic rack-ware replication system.

By default, two copies of each block are stored at two different DataNodes in the same rack, and a third copy is stored on a Data Node in a different rack (for better reliability).

User can specify a different replication factor.

</td> </tr> <tr> <td class="tg-1yw4l">File system Namespace</td> <td class="tg-yw4l">Files are organized hierarchically in directories and identified by pathnames.</td> <td class="tg-yw4l">HDFS supports a traditional hierarchical file organization. Users or applications can create directories to store files inside.

HDFS also supports third-party file systems such as Amazon Simple Storage Service (S3) and Cloud Store.

</td> </tr> <tr> <td class="tg-1rmb8">Database</td> <td class="tg-rmb8">Bigtable uses GFS as its storage engine.</td> <td class="tg-rmb8">HBase uses HDFS as its storage engine.</td> </tr> </tbody></table>
Mark as Completed