Arslan Ahmad

Unveiling the Secrets of Apache ZooKeeper's Architecture

Embark on a journey to explore the inner workings of Apache ZooKeeper's architecture. Discover how this reliable, high-performance coordination service keeps distributed systems running smoothly.

Introduction

What is ZooKeeper?

ZooKeeper is an open-source Apache Software Foundation project that provides synchronization across a distributed system. The ZooKeeper service coordinates naming, configuration information, and group services in distributed applications. The main challenge for any distributed application is keeping the configuration of all its connected nodes synchronized throughout the application's life cycle. Such a system also needs to propagate any software, status, or data change that occurs anywhere in the cluster. ZooKeeper achieves this synchronization reliably and efficiently by providing client APIs in the Java and C languages. Its data model closely resembles the directory tree of a file system, which makes it easy to program against.

Background

ZooKeeper was introduced by Yahoo! to simplify access to the various processes running on its distributed systems. The project was later taken over by Apache, initially under Apache Hadoop, which uses ZooKeeper to manage distributed jobs in a cluster. It became a standalone Apache Foundation project in 2008.

Design Goals

The architecture was designed to:

  • be implemented easily in a variety of distributed applications and systems.
  • be fault tolerant and recover from errors.
  • be scaled up easily.
  • be fast and reliable.

Why ZooKeeper?

ZooKeeper provides a simple approach for the coordination, management, and interaction of nodes in a cluster. A few key features of ZooKeeper are highlighted below. First, there is no single point of failure, which makes the ZooKeeper service reliable: ZooKeeper replicates its service over multiple nodes in an ensemble, so the service remains available unless more than half of the nodes fail.

  1. The ZooKeeper data model follows a hierarchical namespace, similar to standard file systems. This lets programmers easily use ZooKeeper features in their distributed applications.
  2. The servers in a ZooKeeper ensemble maintain shared data and use robust synchronization techniques.
  3. ZooKeeper's capacity can easily be scaled by deploying more nodes into the cluster.
  4. The ZooKeeper service assigns an id to every update, which guarantees the order of messages. This matters when using data watches: the server sends notifications to clients asynchronously while maintaining the proper sequence. Ordered messages make it possible to implement higher-level abstractions where order is essential, and synchronization helps avoid deadlocks in distributed systems and applications.
  5. ZooKeeper lets you encode data according to a specified set of rules, and a node can join or leave a cluster in real time.
  6. The ZooKeeper service keeps track of all transactions, which can be used for synchronization.
  7. A transaction has only two possible outcomes: it either succeeds completely or fails completely. Partial transactions are not possible in ZooKeeper.

How ZooKeeper Works

ZooKeeper follows a client-server model for synchronization across the nodes in a cluster. Service-provider nodes are referred to as servers, while service-consumer nodes are the clients. A group of servers, called a ZooKeeper ensemble, shares the state of the data among its server nodes once the ensemble starts. Each server node keeps a copy of the state and transaction details in its local log file. A change on a server is considered successful only once it has been written to a majority of the servers in the ensemble, known as a quorum. If a server fails and later rejoins the ensemble, it simply synchronizes its state with the rest of the servers. The ensemble continues to offer its services as long as a majority of the servers (a quorum) remain available.

Client nodes can connect to any of the ZooKeeper servers in the ensemble. Once a client connects, the connection is verified by exchanging acknowledgment signals, and the client is assigned a unique ID. After the ensemble elects a leader, the followers synchronize their data with it, and clients can then connect to the servers.

Reading a znode

Reading is comparatively simple in ZooKeeper because each client is connected to a server that holds a full copy of the data. To read a znode, the client sends a request and the connected server returns the requested znode directly. Since multiple clients can read from different servers simultaneously, reading is much faster than writing in ZooKeeper.
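
As a sketch of this read path in Java (the connection string and the /app/config path are placeholder assumptions), the connected server answers getData() directly from its local replica:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            // Connect to any server in the ensemble (placeholder address).
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000, event -> {});

            // The connected server serves the read from its local copy;
            // no round trip to the leader is needed.
            Stat stat = new Stat();
            byte[] data = zk.getData("/app/config", false, stat);
            System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

            zk.close();
        }
    }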

Writing a znode

All write requests are handled by the leader of the ensemble. The client sends a request with the data to be written, and its server forwards that request to the leader. The write succeeds only once it has been acknowledged by a majority of the servers.
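
A corresponding write sketch under the same placeholder assumptions; behind this single call, the client's server forwards the request to the leader, which commits it once a quorum acknowledges:

    import org.apache.zookeeper.ZooKeeper;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000, event -> {});

            // setData is forwarded to the leader; version -1 means "any version".
            // The call returns once a majority of servers have accepted the write.
            zk.setData("/app/config", "new-value".getBytes(), -1);

            zk.close();
        }
    }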

The ZooKeeper Architecture

The main components of the ZooKeeper architecture are as follows:

  • Client: A client is a node in the cluster that requests a service by sending a message to a server. If the client does not receive a response from that server, it automatically connects to the next available server.
  • Server: A server is a node in the ZooKeeper ensemble that provides services to clients in the cluster and responds to their requests.
  • Leader: A leader is the server node elected at startup; it handles writes and performs automatic recovery if a node fails.
  • Follower: All server nodes except the leader are referred to as followers.
  • Ensemble: A group of servers (at least three) is called an ensemble. All servers in the ensemble keep a copy of the data.
Figure: Logical view of the ZooKeeper architecture.

ZooKeeper follows a simple client-server model in which servers offer services to clients in a cluster. ZooKeeper itself is a distributed application that offers coordination services to distributed systems. ZooKeeper servers can run in two modes: standalone and quorum. In standalone mode the ZooKeeper state is not replicated, since there is only one server; in quorum mode, every server in the ensemble keeps a copy of the data, so the ZooKeeper state is replicated. This data consists of transaction logs and snapshots, which are used for synchronization and data watches. When a leader is elected to form an ensemble, the followers synchronize their state with it so that the replicated state stays consistent.
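
As a concrete illustration, a minimal quorum-mode configuration might look like the sketch below (hostnames and paths are placeholders; the server.N entries list the ensemble members with their peer and leader-election ports):

    # zoo.cfg - a minimal quorum-mode configuration (illustrative values)
    tickTime=2000                # basic time unit in milliseconds
    initLimit=10                 # ticks a follower may take to connect and sync with the leader
    syncLimit=5                  # ticks a follower may lag behind the leader
    dataDir=/var/lib/zookeeper   # where snapshots (and, by default, transaction logs) are stored
    clientPort=2181              # port clients connect to
    # Ensemble members: host:peerPort:leaderElectionPort
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888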

Each client is connected to a server and sends/receives a periodic message to ensure that the session is alive. The client will connect to another server in the ensemble if it does not get the acknowledgment within a specified period. Multiple clients can connect to a single server at a time. Whenever a client requires service, it sends a request to ZooKeeper through its client library which handles the interaction with its server.

When a client wants to read data, it sends a request to its connected server, and ZooKeeper returns the required information. The write operation is different: when a client needs to write data, it sends the write request to its server, which forwards it to the ensemble leader. The leader issues the update to all followers, and the write is committed only when a majority of the followers respond positively.

ZooKeeper Data Model

Figure: ZooKeeper's hierarchical namespace.

ZooKeeper uses a hierarchical namespace, much like a standard file system, where a node can hold data as well as references to its child nodes. As shown in the diagram, names and path elements in the namespace are separated by "/", starting from the root znode. Paths are absolute, slash-separated, and canonical. Any Unicode character can be used in a path element except the following:

  1. the null character (\u0000)
  2. "." or ".."
  3. \ud800 - \uF8FF
  4. \uFFF0 - \uFFFF
  5. \u0001 - \u001F
  6. \u007F
  7. \u009F
  8. the reserved token "zookeeper"

Relative paths of any type are not allowed in the ZooKeeper data model. Each znode maintains a stat structure that comprises version numbers, an Access Control List (ACL), timestamps, and a data length. The version numbers track the changes made to the znode, alongside the transaction ids of the operations that created and last modified it (czxid and mzxid); the version is updated whenever a change is detected. The ACL governs all read and write operations on the znode. The timestamps are recorded in milliseconds and capture when the znode was created and last modified, so the elapsed time between creation and modification can be computed from them together with the corresponding zxid values. The data length represents the amount of data stored in the znode. The complete list of stat fields can be found in the ZooKeeper documentation.
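
To make this concrete, here is a small Java sketch (using the standard org.apache.zookeeper client; the connection string and the /app path are placeholder assumptions) that reads a znode's stat structure:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class StatExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string; requested session timeout of 15 seconds.
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000, event -> {});

            // exists() returns the stat structure of the znode (or null if absent).
            Stat stat = zk.exists("/app", false);
            if (stat != null) {
                System.out.println("data version: " + stat.getVersion());
                System.out.println("czxid (created by): " + stat.getCzxid());
                System.out.println("mzxid (last modified by): " + stat.getMzxid());
                System.out.println("ctime (ms): " + stat.getCtime());
                System.out.println("mtime (ms): " + stat.getMtime());
                System.out.println("data length: " + stat.getDataLength());
            }
            zk.close();
        }
    }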

Sessions in ZooKeeper

When a client connects to a server in an ensemble, a session is established between the client and the ZooKeeper service, linking the client library to that server. To establish the session, a handle is created that starts in the CONNECTING state; once the client library has connected to a server, the state changes to CONNECTED. If a disconnection or failure is detected, the handle changes to the CLOSED state. The client-side application provides the connection string for establishing a session. If a connection attempt fails for any reason, the client tries to establish a new connection to the next server, and this process continues until it is connected to a server in the ensemble. If, during this reconnection process, the client reaches a different server, it presents its previously assigned session id together with the password the service created for that id; any of the available servers can validate the session id and its password in order to reconnect the client. The client also sends a requested timeout, and the server returns the timeout it will actually use.

If the connection between the client and a ZooKeeper server is re-established within the session timeout, the state returns to CONNECTED; if it is re-established only after the timeout has elapsed, the state becomes EXPIRED. Session expiration is managed by the ZooKeeper cluster itself, not by the client. A session is considered expired when the cluster does not receive anything from the client within the session timeout period. On expiry, all connected nodes are notified and their state is updated, and any ephemeral nodes created in that session are deleted.
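
A minimal Java sketch of session establishment (again with placeholder hosts; the CountDownLatch pattern simply blocks until the SyncConnected event arrives):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // The connection string lists several ensemble members; the client
            // library picks one and fails over to the others automatically.
            ZooKeeper zk = new ZooKeeper(
                    "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
                    15000,                       // requested session timeout in ms
                    (WatchedEvent event) -> {
                        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                            connected.countDown(); // handle moved CONNECTING -> CONNECTED
                        }
                    });

            connected.await();
            System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
            System.out.println("negotiated timeout: " + zk.getSessionTimeout() + " ms");
            zk.close();
        }
    }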

Watches in ZooKeeper

Watches give clients a simple mechanism for receiving notifications about changes in a ZooKeeper ensemble. Any client can set a watch on data and will be notified when that data changes. Watches are useful to client applications in many scenarios, particularly in sequential processes.

How ZooKeeper Watches work

Clients can set watches through the read operations getChildren(), getData(), and exists(). A watch monitors the data for changes and sends a notification to the client when a change occurs. Watches trigger only once and are then removed: if the client needs a notification for the next change to that data, it must set the watch again.

Without ordering guarantees, a client could observe changes out of order. For example, suppose a first change occurs and the server sends a notification to the watching client, but the notification is delayed by the network; meanwhile a second change occurs, and the client hears about the second change first. To avoid such mix-ups, the ZooKeeper server dispatches notifications asynchronously but preserves their order, so a client sees watch events in the same order as the changes that triggered them.

The following events can trigger a watch:

  • Created event: Enabled with a call to exists.
  • Deleted event: Enabled with a call to exists, getData, and getChildren.
  • Changed event: Enabled with a call to exists and getData.
  • Child event: Enabled with a call to getChildren.
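
The one-shot behavior is easiest to see in code. The following Java sketch (the /config path is a placeholder) sets a data watch with exists() and re-registers it inside the callback so the client keeps receiving notifications:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DataWatchExample implements Watcher {
        private final ZooKeeper zk;

        public DataWatchExample(ZooKeeper zk) throws Exception {
            this.zk = zk;
            zk.exists("/config", this); // set the initial watch
        }

        @Override
        public void process(WatchedEvent event) {
            System.out.println("event: " + event.getType() + " on " + event.getPath());
            try {
                // Watches fire once, so re-register to hear about the next change.
                // exists() enables Created, Deleted, and Changed events on /config.
                zk.exists("/config", this);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }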

What types of data can be watched in ZooKeeper?

Two types of data can be watched, since ZooKeeper maintains two lists: one for data watches and one for child watches. As mentioned earlier, the three functions getChildren(), getData(), and exists() are used to set watches. getChildren() sets a child watch and returns the list of children, while the other two set data watches and return the node's metadata. Watches do not create bulky overheads in ZooKeeper and are easy to set: they are maintained locally on the server to which the watching client is connected.

A session event is triggered when a client switches to a new server, but the client will not receive watch notifications during the switching period, while it is not connected to any server. If a client reconnects to the same server where it previously set watches, it will receive notifications for those watches. Client connections and disconnections can occur in very diverse patterns, and maintaining watch state across them can get complicated, but ZooKeeper handles the maintenance and retention of watches very well. There is, however, one case where a watch can be missed: if a watch is set on the existence of a znode that has not yet been created, and that znode is both created and deleted during a disconnection period, the event will not be delivered.
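
For completeness, a child watch can be set the same way via getChildren(); in this sketch (with a hypothetical /tasks parent node) the callback fires whenever a child is added or removed:

    import java.util.List;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ChildWatchExample {
        // Lists the children of /tasks and re-registers the child watch each time
        // it fires, so additions and removals keep generating notifications.
        static void watchChildren(ZooKeeper zk) throws Exception {
            List<String> children = zk.getChildren("/tasks", event -> {
                if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                    try {
                        watchChildren(zk); // one-shot watch: set it again
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            System.out.println("current children of /tasks: " + children);
        }
    }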

Access Control Lists (ACLs) in ZooKeeper

The ZooKeeper data model controls different types of access using ACLs, in a way very similar to UNIX file permissions, where access to a znode is controlled by setting or unsetting permission bits. ACLs define the authentication mechanism between clients and the service, and they govern the operations permitted on znodes created in the cluster. Their main advantage is that they can specify which user or group may perform a read, create, update, or delete operation on a particular znode. ACLs apply only to the specific znode they are set on, not to its children; they cannot be used recursively. When a client is authenticated successfully, the corresponding ids are associated with its connection and are verified whenever a znode is accessed. An ACL is a pair of the form (scheme:expression, perms); in other words, it combines an authentication mechanism (the scheme), an identity for that mechanism, and a set of permissions.

List of permissions supported by ZooKeeper

ZooKeeper supports the following set of permissions:

  • CREATE: used for creating a child node.
  • READ: used to get data from a node and list its children.
  • WRITE: used to set data for a node.
  • DELETE: used to delete a child node.
  • ADMIN: used to set permissions.

Built-in Schemes

ZooKeeper has the following built-in schemes:

  • world: this scheme has a single id, "anyone", representing any user.
  • auth: this scheme ignores any provided expression and instead uses the current user's credentials and scheme. It lets a user create a znode and restrict access to that user alone; if authentication has not succeeded, auth cannot be used.
  • digest: this scheme uses a username:password string to generate a hashed digest, which is used as the ACL id.
  • ip: this scheme uses the client's IP address as the ACL id.
  • x509: this scheme uses the client's X500 principal as the id.

Predefined ACLs in ZooKeeper

  • ANYONE_ID_UNSAFE: represents anyone.
  • AUTH_IDS: used for setting ACLs.
  • OPEN_ACL_UNSAFE: a completely open ACL granting all permissions to anyone.
  • CREATOR_ALL_ACL: grants all permissions to the user who creates the znode.
  • READ_ACL_UNSAFE: grants only the ability to read.
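
A short Java sketch putting these pieces together (the digest credentials and the /secure path are illustrative assumptions): the client authenticates with the digest scheme and creates a znode that only its creator may access:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class AclExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000, event -> {});

            // Authenticate with the digest scheme (username:password are placeholders).
            zk.addAuthInfo("digest", "alice:secret".getBytes());

            // CREATOR_ALL_ACL grants all permissions, but only to the
            // authenticated identity that creates the znode.
            zk.create("/secure", "data".getBytes(),
                      ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

            zk.close();
        }
    }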

Strengths and Weaknesses of ZooKeeper

What are the advantages of ZooKeeper?

  1. Simple Coordination Processes: ZooKeeper uses a very simple approach to coordinate the nodes and servers in a cluster, which is easy for users to follow.
  2. Reliability: once an update is applied, it persists until a client overwrites it. Availability is safeguarded by running more than one server in the ensemble, and the watch mechanism further improves reliability.
  3. Scalability: performance can be improved by adding more servers to the ZooKeeper ensemble.
  4. Synchronization: mutual exclusion and coordinated processes keep the cluster synchronized; this is how ZooKeeper helps HBase with configuration management, for example.
  5. Ordered Messages: ZooKeeper keeps track of updates and assigns each an id, guaranteeing the order of messages. While using data-change watches, the server sends change notifications to clients asynchronously but maintains the sequence. Ordered messages make it possible to implement higher-level abstractions where order is required.
  6. Fast: Apache ZooKeeper is especially fast for read-dominated workloads.
  7. Atomicity: a transaction either succeeds completely or fails completely; partial transactions are not possible in ZooKeeper.
  8. Data can be encoded in line with a specific set of rules.
  9. A node can join or leave a cluster in real time.
  10. Services from any server are visible to all clients.

What are the limitations of ZooKeeper?

  1. ZooKeeper cannot be migrated between different versions without user intervention.
  2. There is no support for Rack placement.
  3. It is a challenging task to change the volume requirements after initial deployment.
  4. The data may get lost by adding a new server.
  5. If a service is deployed on a virtual network, the service cannot be switched to the host network without reinstalling.
  6. The number of pods is not allowed to be reduced.
  7. The ephemeral node lifecycle is tied to the TCP connection. This can be a problem when the owning process is unhealthy while its TCP connection remains active.
  8. Cross-cluster scenarios are partially supported.
  9. Client services sometimes develop hard dependencies on the ZooKeeper infrastructure, which can create scalability issues.

Leader Election in ZooKeeper

How a Leader Is Elected in ZooKeeper

A leader is a server node elected by the ensemble of servers, and ZooKeeper uses a very simple recipe to elect it. During the leader election process, every node creates a znode with the SEQUENCE and EPHEMERAL flags under the path /election, using the child path /election/guid-n_.

The ZooKeeper ensemble automatically appends a monotonically increasing 10-digit sequence number to each path. For example, the first znode's path will be /election/guid-n_0000000001, the second /election/guid-n_0000000002, the third /election/guid-n_0000000003, and so on.

The znode with the smallest appended sequence number is elected leader, and the rest become followers; in the example above, /election/guid-n_0000000001 becomes the leader. Each follower watches the znode immediately preceding its own, i.e. /election/guid-n_0000000003 watches /election/guid-n_0000000002, and /election/guid-n_0000000002 watches /election/guid-n_0000000001. This avoids the herd effect, where all nodes would otherwise watch the same znode. If the leader fails for any reason, the znode with the next smallest sequence number becomes the leader. The election process completes when a majority of the servers have synchronized their state with the leader.
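
A condensed Java sketch of this recipe (error handling simplified; it assumes the /election parent znode already exists):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElection {
        private final ZooKeeper zk;
        private String myPath; // e.g. /election/guid-n_0000000007

        public LeaderElection(ZooKeeper zk) { this.zk = zk; }

        public void volunteer() throws Exception {
            // EPHEMERAL_SEQUENTIAL: the znode disappears if this session dies,
            // and the server appends the 10-digit sequence number.
            myPath = zk.create("/election/guid-n_", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            checkLeadership();
        }

        private void checkLeadership() throws Exception {
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);
            String me = myPath.substring("/election/".length());
            int rank = children.indexOf(me);

            if (rank == 0) {
                System.out.println("I am the leader: " + myPath);
                return;
            }
            // Watch only the immediate predecessor to avoid the herd effect.
            String predecessor = "/election/" + children.get(rank - 1);
            if (zk.exists(predecessor, event -> {
                    try { checkLeadership(); } catch (Exception e) { e.printStackTrace(); }
                }) == null) {
                checkLeadership(); // predecessor vanished between calls; re-check
            }
        }
    }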

The Inner Workings of ZooKeeper

The ZooKeeper service is used through client APIs, which have bindings for the C and Java languages. Once a connection is established, the overall interaction between the client and server is handled by the client library. The interaction between application processes and the ZooKeeper service is shown in the following diagram.

Figure: Interaction of application processes and the ZooKeeper service.

As discussed in the 'The ZooKeeper Architecture' section, ZooKeeper servers can run in two modes: standalone and quorum. In standalone mode the ZooKeeper state is not replicated, since there is only one server; in quorum mode, every server in the ensemble keeps a copy of the data and the ZooKeeper state is replicated.

Client connection with the ZooKeeper service

To establish a connection between a client and a server, the client sends a request to one of the servers in the ensemble. If authentication fails, it sends the request to the next server in the list, and the process continues until a request is authenticated and a connection is established. A session is then created between the client and the server, and the client is assigned a unique id. The lifetime of an ephemeral znode is contingent on an active session: the ephemeral znode exists only while its session is alive and is deleted automatically otherwise. Each session has a timeout period specified by the application, chosen according to the nature of the application and the cluster environment. The session expires automatically when the connection remains idle for longer than the specified timeout period.

The session is kept alive by sending heartbeat signals to the ZooKeeper service; the client library sends them over the TCP connection that maintains the session between client and server. The client API returns a connection object to the application, and this object transitions through various states over the life of the connection. It is created in the CONNECTING state, while it tries to establish a connection to a server in the ensemble; when the connection request succeeds, it transitions to the CONNECTED state. If the session expires, authentication fails, or the connection is closed gracefully, the object transitions to the CLOSED state. The full set of state transitions is shown in the following diagram.

Figure: Connection state transitions.

Implementation of ZooKeeper transactions

When a leader is elected from an ensemble, the remaining servers become followers. The leader is responsible for handling write requests from clients, and all followers in the ensemble receive updates from the leader. The leader sends each update to all followers, and a consistent state is maintained across the ensemble once a majority of the followers accept it. If the leader fails, the next candidate is elected as the new leader, as discussed in the section 'Leader Election in ZooKeeper'.

Replication is ensured through persistent updates on all servers in the ensemble. Every server maintains an in-memory database holding the entire state of the ZooKeeper namespace, and the durability of updates is ensured by a simple mechanism: updates are logged to local disk. ZooKeeper uses an atomic messaging protocol, ZooKeeper Atomic Broadcast (ZAB), which ensures that the local replicas never diverge and that applying an update either succeeds or fails as a whole. Read requests (exists, getData, and getChildren) are processed locally by the server the client is connected to, while write and update requests are forwarded to the leader, where they are carried out as transactions. Transactions are applied atomically and do not interfere with one another; each is identified by a 64-bit integer called the transaction identifier (zxid). Transaction processing involves two phases: leader election and atomic broadcast.
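
The atomicity visible to clients can be illustrated with the Java client's multi() call, which submits several operations as one all-or-nothing transaction (the paths and values below are illustrative):

    import java.util.Arrays;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class MultiExample {
        // Creates one node and updates another in a single atomic transaction:
        // either both operations commit, or neither does.
        static void atomicUpdate(ZooKeeper zk) throws Exception {
            zk.multi(Arrays.asList(
                    Op.create("/app/lock", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL),
                    Op.setData("/app/state", "running".getBytes(), -1)));
        }
    }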

Figure: ZooKeeper service components.

Leader election phase: all servers in an ensemble start the leader election algorithm in the LOOKING state. If a leader already exists in the ensemble, the other servers simply inform the newly participating servers about the leader, and they all synchronize their state with it. If there is no leader, the servers exchange messages according to the election algorithm until a leader is elected. The algorithm stops once a leader emerges: the leader's state changes to LEADING, while the state of the remaining servers changes to FOLLOWING. Afterward, the leader is activated by proposing a NEW_LEADER proposal, which takes effect only once a majority of the followers accept it. Messages consisting of the server identification number (sid) and transaction id (zxid) are exchanged among all servers in the ensemble for synchronization.

Atomic broadcast phase: write requests from clients are forwarded to the leader, which broadcasts the updates to all followers. If a majority of the followers acknowledge an update, it is retained by all servers. This majority consensus is achieved through the ZAB protocol, which also ensures that transactions are delivered in the correct sequence.

Local storage and snapshots

ZooKeeper uses local storage to persist transactions, which are recorded in transaction logs kept in pre-allocated files on disk. Before a server acknowledges a transaction proposal, it forces the write of that transaction to its transaction log. Similarly, snapshots of the ZooKeeper tree are stored in the local file system. The snapshot files and transaction logs make data recovery possible in case of user or other errors.
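
In practice, the log and snapshot locations (and their automatic cleanup) are controlled through configuration; a sketch with illustrative values:

    # Illustrative storage-related settings in zoo.cfg
    dataDir=/var/lib/zookeeper           # snapshots of the in-memory tree
    dataLogDir=/var/lib/zookeeper-txlog  # transaction logs (a dedicated disk helps latency)
    autopurge.snapRetainCount=3          # how many snapshots/logs to retain
    autopurge.purgeInterval=24           # purge task interval in hours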

Summary

  • Apache ZooKeeper is an open-source distributed service that coordinates and manages the nodes in a cluster.
  • Key features of ZooKeeper include: simple, scalable, reliable, and fast.
  • ZooKeeper's architecture includes the leader, followers, clients, servers, and the ensemble.
  • ZooKeeper follows a simple, hierarchical data model.
  • Client APIs are available in Java and C.
  • A very simple quorum-based mechanism achieves reliability.
  • Renowned companies like Twitter, Facebook, and Yahoo use ZooKeeper.
  • A node in a ZooKeeper cluster can join or leave in real time.
  • Data may get lost when adding a new server.
  • ZooKeeper cannot be migrated between different versions without user intervention.
  • Client services may develop hard dependencies on the infrastructure.

➡ Learn more on architecture and system design in Grokking the System Design Interview and Grokking the Advanced System Design Interview.
