
ZooKeeper

Wait-free coordination for Internet-scale systems

I. INTRODUCTION

Coordination in large-scale distributed applications encompasses various crucial functionalities like configuration management, group membership, leader election, and locks. Configuration defines operational parameters, while group membership and leader election ensure fault tolerance and load balancing. Locks guarantee exclusive resource access, preventing data corruption. These mechanisms facilitate effective communication and collaboration, ensuring reliability, scalability, and efficient resource utilization in distributed systems. When designing the coordination system, we chose to expose an API allowing application developers to implement their own primitives. This decision led to the creation of a coordination kernel capable of accommodating new primitives without altering the service core, thus enabling diverse forms of coordination tailored to specific application requirements.
We moved away from blocking primitives like locks and implemented an API manipulating wait-free data objects organized hierarchically, akin to file systems. While ZooKeeper's API resembles that of a file system, its implementation of wait-free data objects distinguishes it from systems reliant on blocking primitives. ZooKeeper employs a replicated ensemble of servers to ensure high availability and performance through replication. Its efficient design effectively manages coordination for applications with numerous processes. With a pipelined architecture, ZooKeeper can concurrently handle hundreds or thousands of requests while maintaining low latency. This architecture guarantees FIFO order execution of operations from a single client, facilitating asynchronous submission of operations. This capability proves especially valuable in dynamic environments, ensuring smooth operation with sub-second initialization times, even when new clients become leaders requiring prompt metadata manipulation and updates.
II. THE ZOOKEEPER SERVICE

ZooKeeper provides clients with a hierarchical namespace abstraction, consisting of a set of data nodes known as znodes. This organization mirrors traditional file systems, offering users familiarity and facilitating better organization of application metadata. Clients interact with these znodes via the ZooKeeper API. Znodes are referenced using standard UNIX file system path notation. For example, "/A/B/C" refers to znode C, with B as its parent and A as its grandparent. This hierarchical structure enhances the management and access of distributed data within ZooKeeper, ensuring efficiency and ease of use for clients. All znodes store data, and all znodes, except for ephemeral znodes, can have children.

Fig. 1. Illustration of ZooKeeper namespace

There are two types of znodes that a client can create. Regular: clients manipulate regular znodes by creating and deleting them explicitly. Ephemeral: clients create such znodes, and they either delete them explicitly, or let the system remove them automatically when the session that created them terminates (deliberately or due to a failure). Additionally, when creating a new znode, a client can set a sequential flag. Nodes created with the sequential flag set have the value of a monotonically increasing counter appended to their names. If n is the new znode and p is the parent znode, then the sequence value of n is never smaller than the value in the name of any other sequential znode ever created under p.

ZooKeeper implements watches to allow clients to receive timely notifications of changes without requiring polling. When a client issues a read operation with the watch flag set, the operation completes as normal except that the server promises to notify the client when the information returned has changed. Watches are one-time triggers associated with a session; they are unregistered once triggered or once the session closes. Watches indicate that a change has happened, but do not provide the change itself.

Sessions. A client connects to ZooKeeper and initiates a session. Sessions have an associated timeout. ZooKeeper considers a client faulty if it does not receive anything from its session for more than that timeout. A session ends when the client explicitly closes the session handle or when ZooKeeper detects that the client is faulty. Sessions enable a client to move transparently from one server to another within a ZooKeeper ensemble, and hence persist across ZooKeeper servers.
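To make the znode types and the one-time watch semantics above concrete, here is a minimal sketch using the standard Apache ZooKeeper Java client; the ensemble address localhost:2181, the 5-second timeout, and the /app paths are illustrative assumptions, not part of this paper:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connecting establishes a session; wait for the first connection event.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> connected.countDown());
        connected.await();

        // Regular znode: stays until deleted explicitly.
        zk.create("/app", "config-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/app/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: a monotonically increasing counter is appended to the name.
        String task = zk.create("/app/task-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + task);   // e.g. /app/task-0000000000

        // Read with a watch: a one-time trigger fires when /app's data changes.
        byte[] data = zk.getData("/app",
                  event -> System.out.println("watch fired: " + event.getType()), null);
        System.out.println("read " + new String(data));

        zk.close();
    }
}
```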
A. Client API

create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag;
delete(path, version): Deletes the znode path if that znode is at the expected version;
exists(path, watch): Returns true if the znode with path name path exists, and returns false otherwise. The watch flag enables a client to set a watch on the znode;
getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode. The watch flag works in the same way as it does for exists(), except that ZooKeeper does not set the watch if the znode does not exist;
setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode;
getChildren(path, watch): Returns the set of names of the children of a znode;
sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to. The path is currently ignored.
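The version arguments in the operations listed above enable conditional updates. A hedged sketch of that pattern (the /app path is illustrative; zk is a connected org.apache.zookeeper.ZooKeeper handle as in the earlier sketch):

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdate {
    // Updates /app only if nobody else changed it since we read it.
    static boolean updateIfUnchanged(ZooKeeper zk, byte[] newData)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData("/app", false, stat);                     // read data and current version
        try {
            zk.setData("/app", newData, stat.getVersion());  // conditional write
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                                    // someone wrote in between; caller may retry
        }
    }
}
```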
All methods have both a synchronous and an asynchronous version available through the API. An application uses the synchronous API when it needs to execute a single ZooKeeper operation and has no concurrent tasks to execute, so it makes the necessary ZooKeeper call and blocks. The asynchronous API, however, enables an application to have both multiple outstanding ZooKeeper operations and other tasks executed in parallel. The ZooKeeper client guarantees that the corresponding callbacks for each operation are invoked in order.
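As a brief illustration of the two styles in the Java client (paths are illustrative; the callback ordering reflects the FIFO guarantee described above):

```java
import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncVsAsync {
    static void demo(ZooKeeper zk) throws Exception {
        // Synchronous: the call blocks until the result is available.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);
        System.out.println("sync read " + data.length + " bytes, version " + stat.getVersion());

        // Asynchronous: the call returns immediately; the callback runs later.
        // Callbacks for outstanding operations are invoked in submission order.
        DataCallback cb = (rc, path, ctx, bytes, st) ->
                System.out.println(path + " completed with rc=" + rc);
        zk.getData("/app", false, cb, null);
        zk.getData("/app/worker", false, cb, null);
    }
}
```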
B. ZooKeeper Guarantees

ZooKeeper provides two basic ordering guarantees:
1) Linearizable writes;
2) FIFO client order.

When a new leader takes charge of the system, it must change a large number of configuration parameters and notify the other processes once it finishes. We then have two important requirements:
• As the new leader starts making changes, we do not want other processes to start using the configuration that is being changed;
• If the new leader dies before the configuration has been fully updated, we do not want the processes to use this partial configuration.
Distributed locks, such as Chubby's, would address the first requirement but fall short for the second. ZooKeeper addresses both as follows.

1. Ready Znode Mechanism:
- When a new leader needs to update the configuration, it designates a specific znode as the "ready" indicator. Other processes are instructed to use the configuration only when this "ready" znode exists (a code sketch follows this list).
- The new leader performs configuration updates by first deleting the "ready" znode, then updating the various configuration znodes, and finally recreating the "ready" znode. This sequence ensures that configuration changes are applied atomically.
- By using the presence or absence of the "ready" znode as a signal, ZooKeeper enables coordinated and consistent configuration updates across distributed processes.

2. Ordering Guarantees:
- ZooKeeper provides strong ordering guarantees for operations and notifications: they are processed in the same order across all nodes in the ZooKeeper ensemble.
- When a process observes the creation or deletion of the "ready" znode, it is guaranteed to also observe all configuration changes made by the new leader, in the correct order.
- This ensures that processes always see a consistent view of the system state and configuration changes, avoiding race conditions or inconsistencies.

3. Sync Request:
- To handle potential inconsistencies between ZooKeeper and external communication channels, ZooKeeper offers the "sync" request.
- When a client issues a "sync" request followed by a read operation, it triggers a slow read on the server: the server applies all pending write requests before processing the read, ensuring that the read reflects the most up-to-date state of the data.
- This allows clients to synchronize their view of the configuration data with the latest changes made by other processes, even if their ZooKeeper replicas are slightly behind.
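A minimal sketch of the ready-znode sequence from the new leader's side, assuming the configuration lives under /config and the signal znode is /config/ready (these paths, and the two configuration znodes, are illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReadyZnodeUpdate {
    // The new leader hides /config/ready, rewrites the configuration znodes,
    // then recreates /config/ready to signal that the new configuration is complete.
    static void publishConfig(ZooKeeper zk, byte[] cfgA, byte[] cfgB) throws Exception {
        try {
            zk.delete("/config/ready", -1);           // -1: ignore the version
        } catch (KeeperException.NoNodeException ignored) {
            // first run: the ready znode does not exist yet
        }
        zk.setData("/config/a", cfgA, -1);
        zk.setData("/config/b", cfgB, -1);
        zk.create("/config/ready", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Other processes use the configuration only if /config/ready exists;
    // the watch notifies them when the signal znode appears or disappears.
    static boolean configIsReady(ZooKeeper zk) throws Exception {
        return zk.exists("/config/ready", true) != null;
    }
}
```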
By leveraging the "ready" znode mechanism, the ordering guarantees, and the "sync" request, ZooKeeper provides a robust solution for managing configuration updates in distributed systems. These features ensure consistency, coordination, and synchronization, enabling efficient and reliable operation even in dynamic and asynchronous environments.
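A hedged sketch of the sync-then-read pattern described in item 3 above; in the Java client, sync is exposed asynchronously, so the sketch waits for its callback before reading (the path argument is illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        // Ask the connected server to catch up with the leader before we read.
        CountDownLatch caughtUp = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> caughtUp.countDown(), null);
        caughtUp.await();

        // This read now reflects all writes that were pending when sync was issued.
        return zk.getData(path, false, new Stat());
    }
}
```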
C. Examples of primitives

Configuration Management. In ZooKeeper, dynamic configuration in a distributed application can be implemented using znodes, in a scheme where the configuration is stored in a znode, typically denoted zc. Processes within the system start with the full pathname of this znode. Upon startup, processes obtain their configuration by reading zc with the watch flag set to true. If the configuration in zc is updated at any point, the processes are notified of the change and read the new configuration, again setting the watch flag to true for future updates. Watches are employed to ensure that processes always have the most recent information. For instance, if a process watching zc is notified of a change to zc, but multiple subsequent changes to zc occur before it can issue a read for zc, the process does not receive multiple notification events. This behavior still ensures that the process operates on the most recent configuration information available: even if multiple changes occur before the process can update its configuration, it simply learns that the information it has is outdated, thereby maintaining system integrity and consistency.
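A minimal sketch of this configuration-watch pattern, assuming zc is the illustrative path /config/settings:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private static final String ZC = "/config/settings";  // the configuration znode "zc"
    private final ZooKeeper zk;
    private volatile byte[] config;

    ConfigWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        reload();                                          // initial read, watch set
    }

    // Read zc and re-register the watch; called at startup and after each notification.
    private void reload() throws Exception {
        config = zk.getData(ZC, this, new Stat());
    }

    @Override
    public void process(WatchedEvent event) {
        // Watches are one-time triggers: even if zc changed several times,
        // a single notification arrives and the next read fetches the latest data.
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                reload();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```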
Group Membership. Here we use the fact that ephemeral znodes allow us to observe the state of the session that created the node. Group membership is a distributed coordination pattern in which multiple clients form a group and maintain membership information. A znode zg serves as the parent znode for storing member information. Each client that joins the group creates a child znode under zg to represent its membership.
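A hedged sketch of this membership pattern, with /group standing in for zg (names are illustrative); each member creates an ephemeral child, so a crashed member disappears from the listing automatically:

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    private static final String ZG = "/group";    // the parent group znode "zg"

    // Join: create an ephemeral child; it vanishes when this member's session ends.
    static void join(ZooKeeper zk, String memberId, byte[] info) throws Exception {
        zk.create(ZG + "/" + memberId, info,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // List current members; the one-time watch fires when membership changes.
    static List<String> members(ZooKeeper zk) throws Exception {
        return zk.getChildren(ZG,
                event -> System.out.println("membership changed: " + event.getType()));
    }
}
```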
D. Simple Locks

The simplest lock implementation uses "lock files". The lock is represented by a znode. To acquire the lock, a client tries to create the designated znode with the EPHEMERAL flag. If the create succeeds, the client holds the lock. Otherwise, the client can read the znode with the watch flag set, so that it is notified if the current lock holder dies. A client releases the lock when it dies or explicitly deletes the znode. Other clients that are waiting for the lock try again to acquire it once they observe the znode being deleted. This simple lock mechanism suffers from the herd effect.
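A minimal sketch of this lock-file scheme as just described, i.e. the herd-prone variant in which every waiter watches the same znode (the /lock path is illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
    private static final String LOCK = "/lock";   // the designated lock znode

    // Blocks until the lock is acquired. Every waiter watches the same znode,
    // so all of them wake up when it is deleted: this is the herd effect.
    static void acquire(ZooKeeper zk) throws Exception {
        while (true) {
            try {
                zk.create(LOCK, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                            // create succeeded: we hold the lock
            } catch (KeeperException.NodeExistsException held) {
                // Someone else holds the lock: watch the znode and wait for its deletion.
                CountDownLatch released = new CountDownLatch(1);
                Watcher w = event -> {
                    if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                        released.countDown();
                    }
                };
                if (zk.exists(LOCK, w) != null) {
                    released.await();              // woken when the holder dies or unlocks
                }
                // If exists() returned null the lock was already released: retry the create.
            }
        }
    }

    // Release by deleting the znode (this also happens automatically if the
    // holder crashes, because the znode is ephemeral).
    static void release(ZooKeeper zk) throws Exception {
        zk.delete(LOCK, -1);
    }
}
```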
Fig. 2. Simple lock without herd effect

In summary, the locking scheme without the herd effect (Fig. 2) has the following advantages:
1) The removal of a znode only causes one client to wake up, since each znode is watched by exactly one other client, so we do not have the herd effect;
2) There is no polling or timeouts;
3) Because of the way we have implemented locking, we can see by browsing the ZooKeeper data the amount of lock contention, break locks, and debug locking problems.

E. Read/Write Locks

For this locking system we change the lock procedure slightly and have separate read lock and write lock procedures. The unlock procedure is the same as in the global lock case.

Fig. 3. Simple lock without herd effect

III. ZOOKEEPER IMPLEMENTATION

ZooKeeper ensures high availability through data replication on each server within the service ensemble, assuming servers may fail but can recover. Requests are processed by servers: read requests are served from local replicas, while write requests involve coordination using an agreement protocol before committing changes to the fully replicated database. The database is an in-memory structure containing the entire data tree, with each znode capable of storing up to 1MB of data. Efficient logging of updates to disk and periodic snapshots ensure recoverability. Clients connect to a single server to submit requests; write requests are processed via the agreement protocol to maintain consistency. As part of the agreement protocol, write requests are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals consisting of state changes from the leader and agree upon those state changes. (Details of leaders and followers, as part of the agreement protocol, are out of the scope of this paper.)

A. Request Processor

ZooKeeper's messaging layer ensures atomicity, preventing local replicas from diverging despite potential differences in applied transactions across servers. The transactions derived from client requests are idempotent. When the leader receives a write request, it calculates the future system state, considering pending transactions. This prediction allows ZooKeeper to generate the appropriate transaction, such as a setDataTXN for a successful update or an errorTXN for encountered errors like version mismatches or non-existent znodes.
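As a hedged illustration of why such transactions are idempotent, the following sketch uses hypothetical record types (they are not ZooKeeper's internal classes): the leader captures the resulting state, including the new version, so applying the same transaction more than once yields the same znode state.

```java
// Hypothetical stand-ins for illustration only; not ZooKeeper's internal classes.
record SetDataTxn(String path, byte[] data, int newVersion) {}
record ErrorTxn(String path, int errorCode) {}

class RequestToTxn {
    // The leader turns a conditional setData request into a transaction that
    // carries the final state (data + new version). Re-applying it is harmless,
    // which is what makes redelivery after recovery safe.
    static Object toTxn(String path, byte[] data, int expectedVersion, int currentVersion) {
        if (expectedVersion != -1 && expectedVersion != currentVersion) {
            return new ErrorTxn(path, -103);       // e.g. a "bad version" error code
        }
        return new SetDataTxn(path, data, currentVersion + 1);
    }
}
```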
B. Atomic Broadcast

All requests that modify ZooKeeper's state are directed to the leader, which executes them and broadcasts the changes using the Zab atomic broadcast protocol. The server responding to the client request confirms delivery of the corresponding state change. ZooKeeper relies on simple majority quorums provided by Zab, requiring at least 2f + 1 servers to tolerate f failures (for example, a five-server ensemble tolerates two failed servers). To optimize throughput, ZooKeeper maintains a full request processing pipeline, potentially handling thousands of requests simultaneously. Zab ensures strong ordering guarantees, delivering changes in the order they were sent and ensuring all changes from previous leaders are received before a new leader broadcasts its own changes. Implementation choices such as using TCP for transport, using the Zab-chosen leader as the ZooKeeper leader, and using the log as a write-ahead log for the in-memory database enhance performance and simplify the implementation. While Zab delivers messages in order and exactly once during normal operation, it may redeliver messages during recovery because it does not persist message IDs. ZooKeeper accommodates this through idempotent transactions, accepting multiple deliveries as long as they are applied in order. In fact, ZooKeeper requires Zab to redeliver all messages delivered after the start of the last snapshot during recovery.
cesses in distributed systems by providing wait-free objects to
C. Replicated Database clients. This approach has proven valuable for various appli-
Each ZooKeeper replica maintains an in-memory copy of cations within and beyond Yahoo!. Despite seemingly weak
the ZooKeeper state, and when a server recovers from a crash, consistency guarantees for reads and watches, ZooKeeper
it must restore this internal state. To avoid the prohibitively achieves high throughput, processing hundreds of thousands
long process of replaying all delivered messages, ZooKeeper of operations per second for read-dominant workloads. Fast
employs periodic snapshots and only requires redelivery of reads and watches served by local replicas contribute to this
messages since the start of the snapshot. These snapshots, efficiency. While reads may lack precedence ordering and data
termed ”fuzzy snapshots,” are taken without locking the object implementations are wait-free, ZooKeeper’s wait-free
ZooKeeper state and involve atomically scanning the tree property has been crucial for achieving high performance.
to read each node’s data and metadata before writing them Despite describing only a few applications, many others utilize
to disk. While this approach may result in a snapshot that ZooKeeper due to its simple interface and powerful ab-
reflects only a subset of the state changes delivered during its stractions. Additionally, its high throughput enables extensive
generation, the idempotent nature of state changes allows for usage, not limited to coarse-grained locking.
their application twice, as long as they are applied in order. For
example, if a snapshot records certain node values and versions
that do not correspond to a valid state of the ZooKeeper data
tree, upon recovery, Zab redelivers the state changes, restoring
the service to its state before the crash.

D. Client-Server Interactions

When processing a write request, a server also sends out and clears notifications corresponding to any associated watches. Writes are handled sequentially, ensuring a strict succession of notifications. Notifications are managed locally by servers: the server each client is connected to is responsible for tracking and triggering that client's notifications. Read requests are processed locally at each server and tagged with a zxid indicating the last transaction seen by the server, which establishes a partial order of read requests relative to write requests. This local processing optimizes read performance, avoiding disk activity and the agreement protocol. However, this design may return stale read values compared to more recent updates. To address this, ZooKeeper implements the sync primitive, which the leader orders after the writes pending to its local replica. Clients use sync followed by a read operation to ensure the latest value is returned. Sync operations are ordered by the leader without requiring atomic broadcast, minimizing additional broadcast traffic. Client-server communication follows FIFO order, with responses including zxids for reference. To detect client session failures, ZooKeeper employs timeouts: the leader declares a session failure if no other server receives communication from the client within the session timeout period. Clients send heartbeat messages during periods of low activity to prevent timeouts, and switch to a new server if communication is lost for an extended duration.

IV. EVALUATION

We performed all of our evaluation on a cluster of 50 servers. Each server has one Xeon dual-core 2.1GHz processor, 4GB of RAM, gigabit Ethernet, and two SATA hard drives. We split the discussion into two parts: throughput and latency of requests.

V. CONCLUSION

ZooKeeper adopts a wait-free approach to coordinating processes in distributed systems by providing wait-free objects to clients. This approach has proven valuable for various applications within and beyond Yahoo!. Despite seemingly weak consistency guarantees for reads and watches, ZooKeeper achieves high throughput, processing hundreds of thousands of operations per second for read-dominant workloads. Fast reads and watches served by local replicas contribute to this efficiency. While reads may lack precedence ordering and the data object implementations are wait-free, ZooKeeper's wait-free property has been crucial for achieving high performance. Although only a few applications are described here, many others use ZooKeeper because of its simple interface and powerful abstractions, and its high throughput enables extensive usage, not limited to coarse-grained locking.
