
ZooKeeper

Wait-free coordination for Internet-scale systems

I. INTRODUCTION

Coordination in large-scale distributed applications encompasses various crucial functionalities like configuration management, group membership, leader election, and locks. Configuration defines operational parameters, while group membership and leader election ensure fault tolerance and load balancing. Locks guarantee exclusive resource access, preventing data corruption. These mechanisms facilitate effective communication and collaboration, ensuring reliability, scalability, and efficient resource utilization in distributed systems. When designing the coordination system, we chose to expose an API allowing application developers to implement their own primitives. This decision led to the creation of a coordination kernel capable of accommodating new primitives without altering the service core, thus enabling diverse forms of coordination tailored to specific application requirements.
We moved away from blocking primitives like locks and implemented an API manipulating wait-free data objects organized hierarchically, akin to file systems. While ZooKeeper's API resembles that of a file system, its implementation of wait-free data objects distinguishes it from systems reliant on blocking primitives. ZooKeeper employs a replicated ensemble of servers to ensure high availability and performance through replication. Its efficient design effectively manages coordination for applications with numerous processes. With a pipelined architecture, ZooKeeper can concurrently handle hundreds or thousands of requests while maintaining low latency. This architecture guarantees FIFO order execution of operations from a single client, facilitating asynchronous submission of operations. This capability proves especially valuable in dynamic environments, ensuring smooth operation with sub-second initialization times, even when new clients become leaders requiring prompt metadata manipulation and updates.
II. THE ZOOKEEPER SERVICE

ZooKeeper provides clients with a hierarchical namespace abstraction, consisting of a set of data nodes known as znodes. This organization mirrors traditional file systems, offering users familiarity and facilitating better organization of application metadata. Clients interact with these znodes via the ZooKeeper API. Znodes are referenced using standard UNIX file system path notation. For example, "/A/B/C" refers to znode C, with B as its parent and A as its grandparent. This hierarchical structure enhances the management and access of distributed data within ZooKeeper, ensuring efficiency and ease of use for clients. All znodes store data, and all znodes, except for ephemeral znodes, can have children.

Fig. 1. Illustration of ZooKeeper namespace

There are two types of znodes that a client can create. Regular: clients manipulate regular znodes by creating and deleting them explicitly. Ephemeral: clients create such znodes, and they either delete them explicitly, or let the system remove them automatically when the session that created them terminates (deliberately or due to a failure). Additionally, when creating a new znode, a client can set a sequential flag. Nodes created with the sequential flag set have the value of a monotonically increasing counter appended to their names. If n is the new znode and p is the parent znode, then the sequence value of n is never smaller than the value in the name of any other sequential znode ever created under p.

ZooKeeper implements watches to allow clients to receive timely notifications of changes without requiring polling. When a client issues a read operation with the watch flag set, the operation completes as normal except that the server promises to notify the client when the information returned has changed. Watches are one-time triggers associated with a session; they are unregistered once triggered or once the session closes. Watches indicate that a change has happened, but do not provide the change itself.

Sessions. A client connects to ZooKeeper and initiates a session. Sessions have an associated timeout. ZooKeeper considers a client faulty if it does not receive anything from its session for more than that timeout. A session ends when the client explicitly closes the session handle or when ZooKeeper detects that the client is faulty. Sessions enable a client to move transparently from one server to another within a ZooKeeper ensemble, and hence persist across ZooKeeper servers.
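To make the znode types and the one-time watch semantics above concrete, here is a minimal sketch using the standard Apache ZooKeeper Java client; the ensemble address localhost:2181, the 5-second timeout, and the /app paths are illustrative assumptions, not part of this paper:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connecting establishes a session; wait for the first connection event.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> connected.countDown());
        connected.await();

        // Regular znode: stays until deleted explicitly.
        zk.create("/app", "config-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/app/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: a monotonically increasing counter is appended to the name.
        String task = zk.create("/app/task-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + task);   // e.g. /app/task-0000000000

        // Read with a watch: a one-time trigger fires when /app's data changes.
        byte[] data = zk.getData("/app",
                  event -> System.out.println("watch fired: " + event.getType()), null);
        System.out.println("read " + new String(data));

        zk.close();
    }
}
```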
A. Client API

create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag;
delete(path, version): Deletes the znode path if that znode is at the expected version;
exists(path, watch): Returns true if the znode with path name path exists, and returns false otherwise. The watch flag enables a client to set a watch on the znode;
getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode. The watch flag works in the same way as it does for exists(), except that ZooKeeper does not set the watch if the znode does not exist;
setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode;
getChildren(path, watch): Returns the set of names of the children of a znode;
sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to. The path is currently ignored.
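The version arguments in the operations listed above enable conditional updates. A hedged sketch of that pattern (the /app path is illustrative; zk is a connected org.apache.zookeeper.ZooKeeper handle as in the earlier sketch):

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdate {
    // Updates /app only if nobody else changed it since we read it.
    static boolean updateIfUnchanged(ZooKeeper zk, byte[] newData)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData("/app", false, stat);                     // read data and current version
        try {
            zk.setData("/app", newData, stat.getVersion());  // conditional write
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                                    // someone wrote in between; caller may retry
        }
    }
}
```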
All methods have both a synchronous and an asynchronous version available through the API. An application uses the synchronous API when it needs to execute a single ZooKeeper operation and has no concurrent tasks to execute, so it makes the necessary ZooKeeper call and blocks. The asynchronous API, however, enables an application to have both multiple outstanding ZooKeeper operations and other tasks executed in parallel. The ZooKeeper client guarantees that the corresponding callbacks for each operation are invoked in order.
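As a brief illustration of the two styles in the Java client (paths are illustrative; the callback ordering reflects the FIFO guarantee described above):

```java
import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncVsAsync {
    static void demo(ZooKeeper zk) throws Exception {
        // Synchronous: the call blocks until the result is available.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);
        System.out.println("sync read " + data.length + " bytes, version " + stat.getVersion());

        // Asynchronous: the call returns immediately; the callback runs later.
        // Callbacks for outstanding operations are invoked in submission order.
        DataCallback cb = (rc, path, ctx, bytes, st) ->
                System.out.println(path + " completed with rc=" + rc);
        zk.getData("/app", false, cb, null);
        zk.getData("/app/worker", false, cb, null);
    }
}
```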
B. ZooKeeper Guarantees

ZooKeeper provides two basic ordering guarantees:
1) Linearizable writes;
2) FIFO client order.

When a new leader takes charge of the system, it must change a large number of configuration parameters and notify the other processes once it finishes. We then have two important requirements:
• As the new leader starts making changes, we do not want other processes to start using the configuration that is being changed;
• If the new leader dies before the configuration has been fully updated, we do not want the processes to use this partial configuration.
Distributed locks, such as Chubby's, would address the first requirement but fall short for the second. ZooKeeper addresses both as follows.

1. Ready Znode Mechanism:
- When a new leader needs to update the configuration, it designates a specific znode as the "ready" indicator. Other processes are instructed to use the configuration only when this "ready" znode exists (a code sketch follows this list).
- The new leader performs configuration updates by first deleting the "ready" znode, then updating the various configuration znodes, and finally recreating the "ready" znode. This sequence ensures that configuration changes are applied atomically.
- By using the presence or absence of the "ready" znode as a signal, ZooKeeper enables coordinated and consistent configuration updates across distributed processes.

2. Ordering Guarantees:
- ZooKeeper provides strong ordering guarantees for operations and notifications: they are processed in the same order across all nodes in the ZooKeeper ensemble.
- When a process observes the creation or deletion of the "ready" znode, it is guaranteed to also observe all configuration changes made by the new leader, in the correct order.
- This ensures that processes always see a consistent view of the system state and configuration changes, avoiding race conditions or inconsistencies.

3. Sync Request:
- To handle potential inconsistencies between ZooKeeper and external communication channels, ZooKeeper offers the "sync" request.
- When a client issues a "sync" request followed by a read operation, it triggers a slow read on the server: the server applies all pending write requests before processing the read, ensuring that the read reflects the most up-to-date state of the data.
- This allows clients to synchronize their view of the configuration data with the latest changes made by other processes, even if their ZooKeeper replicas are slightly behind.
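A minimal sketch of the ready-znode sequence from the new leader's side, assuming the configuration lives under /config and the signal znode is /config/ready (these paths, and the two configuration znodes, are illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReadyZnodeUpdate {
    // The new leader hides /config/ready, rewrites the configuration znodes,
    // then recreates /config/ready to signal that the new configuration is complete.
    static void publishConfig(ZooKeeper zk, byte[] cfgA, byte[] cfgB) throws Exception {
        try {
            zk.delete("/config/ready", -1);           // -1: ignore the version
        } catch (KeeperException.NoNodeException ignored) {
            // first run: the ready znode does not exist yet
        }
        zk.setData("/config/a", cfgA, -1);
        zk.setData("/config/b", cfgB, -1);
        zk.create("/config/ready", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Other processes use the configuration only if /config/ready exists;
    // the watch notifies them when the signal znode appears or disappears.
    static boolean configIsReady(ZooKeeper zk) throws Exception {
        return zk.exists("/config/ready", true) != null;
    }
}
```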
By leveraging the "ready" znode mechanism, the ordering guarantees, and the "sync" request, ZooKeeper provides a robust solution for managing configuration updates in distributed systems. These features ensure consistency, coordination, and synchronization, enabling efficient and reliable operation even in dynamic and asynchronous environments.
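A hedged sketch of the sync-then-read pattern described in item 3 above; in the Java client, sync is exposed asynchronously, so the sketch waits for its callback before reading (the path argument is illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        // Ask the connected server to catch up with the leader before we read.
        CountDownLatch caughtUp = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> caughtUp.countDown(), null);
        caughtUp.await();

        // This read now reflects all writes that were pending when sync was issued.
        return zk.getData(path, false, new Stat());
    }
}
```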
C. Examples of primitives

Configuration Management. In ZooKeeper, dynamic configuration in a distributed application can be implemented using znodes, in a scheme where the configuration is stored in a znode, typically denoted zc. Processes within the system start with the full pathname of this znode. Upon startup, processes obtain their configuration by reading zc with the watch flag set to true. If the configuration in zc is updated at any point, the processes are notified of the change and read the new configuration, again setting the watch flag to true for future updates. Watches are employed to ensure that processes always have the most recent information. For instance, if a process watching zc is notified of a change to zc, but multiple subsequent changes to zc occur before it can issue a read for zc, the process does not receive multiple notification events. This behavior still ensures that the process operates on the most recent configuration information available: even if multiple changes occur before the process can update its configuration, it simply learns that the information it has is outdated, thereby maintaining system integrity and consistency.
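A minimal sketch of this configuration-watch pattern, assuming zc is the illustrative path /config/settings:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private static final String ZC = "/config/settings";  // the configuration znode "zc"
    private final ZooKeeper zk;
    private volatile byte[] config;

    ConfigWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        reload();                                          // initial read, watch set
    }

    // Read zc and re-register the watch; called at startup and after each notification.
    private void reload() throws Exception {
        config = zk.getData(ZC, this, new Stat());
    }

    @Override
    public void process(WatchedEvent event) {
        // Watches are one-time triggers: even if zc changed several times,
        // a single notification arrives and the next read fetches the latest data.
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                reload();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```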
Group Membership. Here we use the fact that ephemeral znodes allow us to observe the state of the session that created the node. Group membership is a distributed coordination pattern in which multiple clients form a group and maintain membership information. A znode zg serves as the parent znode for storing member information. Each client that joins the group creates a child znode under zg to represent its membership.
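A hedged sketch of this membership pattern, with /group standing in for zg (names are illustrative); each member creates an ephemeral child, so a crashed member disappears from the listing automatically:

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    private static final String ZG = "/group";    // the parent group znode "zg"

    // Join: create an ephemeral child; it vanishes when this member's session ends.
    static void join(ZooKeeper zk, String memberId, byte[] info) throws Exception {
        zk.create(ZG + "/" + memberId, info,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // List current members; the one-time watch fires when membership changes.
    static List<String> members(ZooKeeper zk) throws Exception {
        return zk.getChildren(ZG,
                event -> System.out.println("membership changed: " + event.getType()));
    }
}
```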
D. Simple Locks

The simplest lock implementation uses "lock files". The lock is represented by a znode. To acquire the lock, a client tries to create the designated znode with the EPHEMERAL flag. If the create succeeds, the client holds the lock. Otherwise, the client can read the znode with the watch flag set, so that it is notified if the current lock holder dies. A client releases the lock when it dies or explicitly deletes the znode. Other clients that are waiting for the lock try again to acquire it once they observe the znode being deleted. This simple lock mechanism suffers from the herd effect.
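A minimal sketch of this lock-file scheme as just described, i.e. the herd-prone variant in which every waiter watches the same znode (the /lock path is illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
    private static final String LOCK = "/lock";   // the designated lock znode

    // Blocks until the lock is acquired. Every waiter watches the same znode,
    // so all of them wake up when it is deleted: this is the herd effect.
    static void acquire(ZooKeeper zk) throws Exception {
        while (true) {
            try {
                zk.create(LOCK, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                            // create succeeded: we hold the lock
            } catch (KeeperException.NodeExistsException held) {
                // Someone else holds the lock: watch the znode and wait for its deletion.
                CountDownLatch released = new CountDownLatch(1);
                Watcher w = event -> {
                    if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                        released.countDown();
                    }
                };
                if (zk.exists(LOCK, w) != null) {
                    released.await();              // woken when the holder dies or unlocks
                }
                // If exists() returned null the lock was already released: retry the create.
            }
        }
    }

    // Release by deleting the znode (this also happens automatically if the
    // holder crashes, because the znode is ephemeral).
    static void release(ZooKeeper zk) throws Exception {
        zk.delete(LOCK, -1);
    }
}
```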
Fig. 2. Simple lock without herd effect

In summary, the locking scheme without the herd effect (Fig. 2) has the following advantages:
1) The removal of a znode only causes one client to wake up, since each znode is watched by exactly one other client, so we do not have the herd effect;
2) There is no polling or timeouts;
3) Because of the way we have implemented locking, we can see by browsing the ZooKeeper data the amount of lock contention, break locks, and debug locking problems.

E. Read/Write Locks

For this locking system we change the lock procedure slightly and have separate read lock and write lock procedures. The unlock procedure is the same as in the global lock case.

Fig. 3. Simple lock without herd effect

III. ZOOKEEPER IMPLEMENTATION

ZooKeeper ensures high availability through data replication on each server within the service ensemble, assuming servers may fail but can recover. Requests are processed by servers: read requests are served from local replicas, while write requests involve coordination using an agreement protocol before committing changes to the fully replicated database. The database is an in-memory structure containing the entire data tree, with each znode capable of storing up to 1MB of data. Efficient logging of updates to disk and periodic snapshots ensure recoverability. Clients connect to a single server to submit requests; write requests are processed via the agreement protocol to maintain consistency. As part of the agreement protocol, write requests are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals consisting of state changes from the leader and agree upon those state changes. (Details of leaders and followers, as part of the agreement protocol, are out of the scope of this paper.)

A. Request Processor

ZooKeeper's messaging layer ensures atomicity, preventing local replicas from diverging despite potential differences in applied transactions across servers. The transactions derived from client requests are idempotent. When the leader receives a write request, it calculates the future system state, considering pending transactions. This prediction allows ZooKeeper to generate the appropriate transaction, such as a setDataTXN for a successful update or an errorTXN for encountered errors like version mismatches or non-existent znodes.
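As a hedged illustration of why such transactions are idempotent, the following sketch uses hypothetical record types (they are not ZooKeeper's internal classes): the leader captures the resulting state, including the new version, so applying the same transaction more than once yields the same znode state.

```java
// Hypothetical stand-ins for illustration only; not ZooKeeper's internal classes.
record SetDataTxn(String path, byte[] data, int newVersion) {}
record ErrorTxn(String path, int errorCode) {}

class RequestToTxn {
    // The leader turns a conditional setData request into a transaction that
    // carries the final state (data + new version). Re-applying it is harmless,
    // which is what makes redelivery after recovery safe.
    static Object toTxn(String path, byte[] data, int expectedVersion, int currentVersion) {
        if (expectedVersion != -1 && expectedVersion != currentVersion) {
            return new ErrorTxn(path, -103);       // e.g. a "bad version" error code
        }
        return new SetDataTxn(path, data, currentVersion + 1);
    }
}
```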
B. Atomic Broadcast

All requests that modify ZooKeeper's state are directed to the leader, which executes them and broadcasts the changes using the Zab atomic broadcast protocol. The server responding to the client request confirms delivery of the corresponding state change. ZooKeeper relies on simple majority quorums provided by Zab, requiring at least 2f + 1 servers to tolerate f failures (for example, a five-server ensemble tolerates two failed servers). To optimize throughput, ZooKeeper maintains a full request processing pipeline, potentially handling thousands of requests simultaneously. Zab ensures strong ordering guarantees, delivering changes in the order they were sent and ensuring all changes from previous leaders are received before a new leader broadcasts its own changes. Implementation choices such as using TCP for transport, using the Zab-chosen leader as the ZooKeeper leader, and using the log as a write-ahead log for the in-memory database enhance performance and simplify the implementation. While Zab delivers messages in order and exactly once during normal operation, it may redeliver messages during recovery because it does not persist message IDs. ZooKeeper accommodates this through idempotent transactions, accepting multiple deliveries as long as they are applied in order. In fact, ZooKeeper requires Zab to redeliver all messages delivered after the start of the last snapshot during recovery.
cesses in distributed systems by providing wait-free objects to
C. Replicated Database clients. This approach has proven valuable for various appli-
Each ZooKeeper replica maintains an in-memory copy of cations within and beyond Yahoo!. Despite seemingly weak
the ZooKeeper state, and when a server recovers from a crash, consistency guarantees for reads and watches, ZooKeeper
it must restore this internal state. To avoid the prohibitively achieves high throughput, processing hundreds of thousands
long process of replaying all delivered messages, ZooKeeper of operations per second for read-dominant workloads. Fast
employs periodic snapshots and only requires redelivery of reads and watches served by local replicas contribute to this
messages since the start of the snapshot. These snapshots, efficiency. While reads may lack precedence ordering and data
termed ”fuzzy snapshots,” are taken without locking the object implementations are wait-free, ZooKeeper’s wait-free
ZooKeeper state and involve atomically scanning the tree property has been crucial for achieving high performance.
to read each node’s data and metadata before writing them Despite describing only a few applications, many others utilize
to disk. While this approach may result in a snapshot that ZooKeeper due to its simple interface and powerful ab-
reflects only a subset of the state changes delivered during its stractions. Additionally, its high throughput enables extensive
generation, the idempotent nature of state changes allows for usage, not limited to coarse-grained locking.
their application twice, as long as they are applied in order. For
example, if a snapshot records certain node values and versions
that do not correspond to a valid state of the ZooKeeper data
tree, upon recovery, Zab redelivers the state changes, restoring
the service to its state before the crash.

D. Client-Server Interactions

When processing a write request, a server also sends out and clears notifications corresponding to any associated watches. Writes are handled sequentially, ensuring a strict succession of notifications. Notifications are managed locally by servers: the server each client is connected to is responsible for tracking and triggering that client's notifications. Read requests are processed locally at each server and tagged with a zxid indicating the last transaction seen by the server, which establishes a partial order of read requests relative to write requests. This local processing optimizes read performance, avoiding disk activity and the agreement protocol. However, this design may return stale read values compared to more recent updates. To address this, ZooKeeper implements the sync primitive, which the leader orders after the writes pending to its local replica. Clients use sync followed by a read operation to ensure the latest value is returned. Sync operations are ordered by the leader without requiring atomic broadcast, minimizing additional broadcast traffic. Client-server communication follows FIFO order, with responses including zxids for reference. To detect client session failures, ZooKeeper employs timeouts: the leader declares a session failure if no other server receives communication from the client within the session timeout period. Clients send heartbeat messages during periods of low activity to prevent timeouts, and switch to a new server if communication is lost for an extended duration.

IV. EVALUATION

We performed all of our evaluation on a cluster of 50 servers. Each server has one Xeon dual-core 2.1GHz processor, 4GB of RAM, gigabit Ethernet, and two SATA hard drives. We split the discussion into two parts: throughput and latency of requests.

V. CONCLUSION

ZooKeeper adopts a wait-free approach to coordinating processes in distributed systems by providing wait-free objects to clients. This approach has proven valuable for various applications within and beyond Yahoo!. Despite seemingly weak consistency guarantees for reads and watches, ZooKeeper achieves high throughput, processing hundreds of thousands of operations per second for read-dominant workloads. Fast reads and watches served by local replicas contribute to this efficiency. While reads may lack precedence ordering and the data object implementations are wait-free, ZooKeeper's wait-free property has been crucial for achieving high performance. Although only a few applications are described here, many others use ZooKeeper because of its simple interface and powerful abstractions, and its high throughput enables extensive usage, not limited to coarse-grained locking.
