Causal Consistency in Geo-Replication
Geo-replicated storage provides copies of the same data at multiple, geographically distinct locations.
Facebook, for example, geo-replicates its data (profiles, friends lists, likes, etc.) to data centers on the
east and west coasts of the United States, and in Europe. In each data center, a tier of separate Web
servers accepts browser requests and then handles those requests by reading and writing data from
the storage system, as shown in figure 1.
Geo-replication brings two key benefits to Web services: fault tolerance and low latency. It
provides fault tolerance through redundancy: if one data center fails, others can continue to provide
the service. It provides low latency through proximity: clients can be directed to and served by a
nearby data center to avoid speed-of-light delays associated with cross-country or round-the-globe
communication.
Geo-replication brings challenges of its own, however. The famous CAP theorem, conjectured by Brewer1 and proved by Gilbert and Lynch,7 shows that it is impossible to create a system that has strong
consistency, is always available for reads and writes, and is able to continue operating during network
partitions. Each of these properties is highly desirable. Strong consistency—more formally known
as linearizability—makes programming easier. Availability ensures that front-end Web servers can
always respond to client requests. Partition tolerance ensures that the system can continue operating
even when data centers cannot communicate with one another. Faced with the choice of at most
two of these properties, many systems5,8,16 have chosen to sacrifice strong consistency to ensure
availability and partition tolerance. Other systems—for example, those that deal with money—
sacrifice availability and/or partition tolerance to achieve the strong consistency that is necessary for
the applications built on top of them.4,15
The former choice of availability and partition tolerance is not surprising, however, given that it
also enables the storage system to provide low latency—defined as latency for reads and writes that
is less than half the speed-of-light delay between the two most distant data centers. A proof that
predates the CAP theorem by 14 years10 shows that it is impossible to guarantee low latency and
provide strong consistency at the same time. Front-end Web servers read or write data from the storage
system potentially many times to answer a single request; therefore, low latency in the storage system
is critical for enabling fast page-load times, which are linked to user engagement with a service—and,
thus, revenue. An always-available and partition-tolerant system can provide low latency on the order
of milliseconds by serving all operations in the local data center. A strongly consistent system must
contact remote data centers for reads and/or writes, which takes hundreds of milliseconds.
Thus, systems that sacrifice strong consistency gain much in return. They can be always available,
guarantee responses with low latency, and provide partition tolerance. In COPS (Clusters of Order-preserving Servers),11 the system built in our original work on this subject, we coined the term ALPS for
systems that provide these three properties—always available, low latency, and partition tolerance—and
one more: scalability. Scalability implies that adding storage servers to each data center produces a
proportional increase in storage capacity and throughput. Scalability is critical for modern systems
because data has grown far too large to be stored or served by a single machine.
The question remains as to what consistency properties ALPS systems can provide. Before
answering this, let’s consider the consistency offered by existing ALPS systems. For systems such as
Amazon’s Dynamo, LinkedIn’s Project Voldemort, and Facebook/Apache’s Cassandra, the answer is
eventual consistency.
EVENTUAL CONSISTENCY
Eventual consistency is a widely used term that can have many meanings. Here it is defined as the
strongest property provided by all systems that claim to provide it: namely, writes to one data center
will eventually appear at other data centers, and if all data centers have received the same set of
writes, they will have the same values for all data.
Contrast this with the following part of the definition of strong consistency (linearizability): a
total order exists over all operations in the system. This makes programming a strongly consistent
storage system simple, or at least simpler: it behaves as a single entity. Eventual consistency does not
say anything about the ordering of operations. This means that different data centers can reflect
arbitrarily different sets of operations. For example, if someone connected to the West Coast data
center sets A=1, B=2, and C=3, then someone else connected to the East Coast data center may
see only B=2 (not A=1 or C=3), and someone else connected to the European data center may see
only C=3 (not A=1 or B=2). This makes programming eventually consistent storage complicated:
operations can appear out of order.
The out-of-order arrival leads to many potential anomalies in eventually consistent systems. Here
are a few examples for a social network:
Figure 2 shows that in the West Coast data center, Alice posts, she comments, and then Bob
comments. In the East Coast data center, however, Alice’s comment has not appeared, making Bob
look less than kind. Figure 3 shows that in the West Coast data center, a grad student carefully
deletes incriminating photos before accepting an advisor as a friend. Unfortunately, in the East Coast
data center, the friend-acceptance appears before the photo deletions, allowing the advisor to see the
photos.3
Figure 4 shows that in the West Coast data center, Alice uploads photos, creates an album, and
then adds the photos to the album, but in the East Coast data center, the operations appear out of
order and her photos do not end up in the album. Finally, in figure 5, Cindy and Dave have $1,000
in their joint bank account. Concurrently, Dave withdraws $1,000 from the East Coast data center
and Cindy withdraws $1,000 from the West Coast data center. Once both withdrawals propagate
to each data center, their account is in a consistent state (-$1,000), but it is too late to prevent the
mischievous couple from making off with their ill-gotten gains.
CAUSAL CONSISTENCY
Causal consistency ensures that operations appear in the order the user intuitively expects. More
precisely, it enforces a partial order over operations that agrees with the notion of potential causality.
If operation A happens before operation B, then any data center that sees operation B must see
operation A first.
Three rules define potential causality:9
1. Thread of execution. If a and b are two operations in a single thread of execution, then a -> b if
operation a happens before operation b.
2. Reads-from. If a is a write operation and b is a read operation that returns the value written by a,
then a -> b.
3. Transitivity. For operations a, b, and c, if a -> b and b -> c, then a -> c. Thus, the causal relationship
between operations is the transitive closure of the first two rules.
Causal consistency ensures that operations appear in an order that agrees with these rules. This
makes users happy because their operations are applied everywhere in the order they intended. It
makes programmers happy because they no longer have to reason about out-of-order operations.
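To make these rules concrete, here is a minimal sketch in Python (ours, for illustration; not code from COPS or Eiger) that derives the happens-before relation from a log of operations. The Op record and helper names are assumptions.

from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Op:
    op_id: str                        # globally unique operation ID
    client: str                       # thread of execution (client session)
    seq: int                          # position within that client's thread
    reads_from: Optional[str] = None  # op_id of the write this read returned

def direct_edges(ops):
    """Edges given directly by rules 1 (thread of execution) and 2 (reads-from)."""
    succ = defaultdict(set)
    by_client = defaultdict(list)
    for op in ops:
        by_client[op.client].append(op)
    for thread in by_client.values():           # rule 1: program order per client
        thread.sort(key=lambda o: o.seq)
        for earlier, later in zip(thread, thread[1:]):
            succ[earlier.op_id].add(later.op_id)
    for op in ops:                              # rule 2: write -> read that saw it
        if op.reads_from is not None:
            succ[op.reads_from].add(op.op_id)
    return succ

def happens_before(a, b, succ):
    """Rule 3 (transitivity): a -> b iff b is reachable from a."""
    frontier, seen = deque([a]), {a}
    while frontier:
        for nxt in succ.get(frontier.popleft(), ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# From the first example: Alice posts (w1) and comments (w2); Bob reads them (r3)
# and then comments (w4). The rules order w1 -> w2 -> w4, via Bob's read.
ops = [Op("w1", "alice", 1), Op("w2", "alice", 2),
       Op("r3", "bob", 1, reads_from="w2"), Op("w4", "bob", 2)]
assert happens_before("w1", "w4", direct_edges(ops))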
Causal consistency prevents each of our first three anomalies, turning them into regularities. Consider regularity 1: Alice's post is Op1, her comment is Op2, Bob's read of her post and comment is Op3, and Bob's comment is Op4. Only write operations are propagated and applied to other data centers, so the full causal ordering that is enforced is Op1 -> Op2 -> Op4.
Now, in the East Coast data center, operations can appear only in an order that agrees with causality. Thus, the data center may first show

Op1 Alice: I’ve lost my wedding ring.

then

Op1 Alice: I’ve lost my wedding ring.
Op2 Alice: Whew, found it upstairs.

then

Op1 Alice: I’ve lost my wedding ring.
Op2 Alice: Whew, found it upstairs.
Op4 Bob: I’m glad to hear that.

but never the anomaly that makes Bob look unkind. The other regularities follow in the same way: operations appear everywhere only in orders that agree with causality, never in a different order that results in an empty album or complicates what a programmer must think about.
Read operations establish causal links between write operations by different clients, but they are not
replicated to other data centers and thus do not need to have an ordering enforced on them. For
example, in anomaly/regularity 1, Bob’s read (Op3) of Alice’s post (Op1) and comment (Op2) creates
the causal link that orders Bob’s later comment (Op4) after Alice’s post and comment. A causal link
between two write operations is called a dependency—the later operation depends on the earlier
operation.
Figure 6 shows the relationship between the graph of causality and the graph of dependencies. A
dependency is a small piece of metadata that uniquely identifies a write operation. It has two fields:
a key, which is the data location that is updated by the write; and a timestamp, which is a globally
unique logical timestamp assigned by the logical clock of the server in the data center where it
was originally written. Figure 6 illustrates (a) a set of example operations; (b) the graph of causality
between them; (c) the corresponding dependency graph; and (d) a table listing dependencies with
one-hop dependencies shown in bold.
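Concretely, a dependency can be represented by a tiny record such as the following sketch (the names are ours; in practice the timestamp is made globally unique, for example by appending the ID of the server that assigned it):

from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    key: str        # the data location updated by the write
    timestamp: int  # globally unique logical timestamp assigned where the
                    # write was originally accepted

# From figure 6: w3 (Bob sets his town) depends on w1 (Alice's earlier write),
# whose value Bob read before writing.
w1 = Dependency(key="Alice:town", timestamp=1)
w3_dependencies = {w1}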
In the initial design the client library tracks the full set of dependencies for each client. Tracking
all dependencies for a client requires tracking three sets of write operations:
1. All of the client’s previous write operations, because of the thread-of-execution rule.
2. All of the operations that wrote values it read, because of the reads-from rule.
3. All of the operations that the operations in 1 and 2 depend on, because of the transitivity rule.
[Figure 6. (a) Example operations: Alice w1 write(Alice:town, NYC); Bob r2 read(Alice:town); Bob w3 write(Bob:town, LA); Alice r4 read(Bob:town); Carol w5 write(Carol:likes, ACM, 8/31/12); Alice w6 write(Alice:likes, ACM, 9/1/12); and later operations r7 and w8. (b) The graph of causality between the operations. (c) The corresponding dependency graph. (d) The dependencies of each operation, with one-hop dependencies in bold: w1: none; w3: w1; w5: none; w6: w3, w1; w8: w6, w5, w3, w1.]
Tracking the first set is straightforward: servers return the unique timestamp assigned to each
write to the client library, which then adds a dependency on that write. Tracking the second set
is also straightforward: servers return the timestamp of the write that wrote the value when they
respond to reads, and then the client library adds a dependency on that write. The third set of
operations is a bit trickier: it requires that every write carry with it all of its dependencies, and that
these dependencies are stored with the value, returned with reads of that value, and then added to
the reader’s set of dependencies by the client library.
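Putting the three sets together, the client-library bookkeeping looks roughly like this sketch, reusing the Dependency record from above. The put/get interface of the storage servers is an assumption for illustration, not the actual COPS or Eiger API.

class CausalClient:
    """Sketch of a client library that tracks a client's full dependency set."""

    def __init__(self, store):
        self.store = store
        self.deps = set()   # every write this client's next operation depends on

    def put(self, key, value):
        # Attach the full dependency set to the write; the server stores it with
        # the value so later readers inherit it (set 3, transitivity).
        ts = self.store.put(key, value, deps=frozenset(self.deps))
        # Set 1 (thread of execution): later operations depend on this write.
        self.deps.add(Dependency(key, ts))
        return ts

    def get(self, key):
        value, ts, stored_deps = self.store.get(key)
        # Set 2 (reads-from): depend on the write whose value was returned ...
        self.deps.add(Dependency(key, ts))
        # ... and, set 3 (transitivity), on everything that write depended on.
        self.deps |= stored_deps
        return value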
With the full set of dependencies for each client stored in its client library, all of these
dependencies can be attached to each write operation the client issues. Now when a server in a
remote data center receives a write with its full set of dependencies, it blocks the write and verifies
that each dependency is satisfied. Blocking these replicated write operations is acceptable because
they are not client-facing and do not block reads to whatever data they update. Here we have
explicitly chosen to delay these write operations until they can appear in the correct order, as shown
in figure 7. The dependency check for Bglad does not return until after Afound is applied on the East
Coast, which ensures Bob is never misunderstood.
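On the receiving side, the blocking behavior looks roughly like the sketch below, a single-server simplification with invented names (in the real systems the dependency checks are sent to the local servers responsible for each key); deps holds Dependency records as sketched earlier.

import threading

class ReplicaServer:
    """Sketch: apply a replicated write only after its dependencies are applied."""

    def __init__(self):
        self.values = {}    # key -> value of the latest applied write
        self.applied = {}   # key -> timestamp of the latest applied write
        self.cv = threading.Condition()

    def _satisfied(self, dep):
        # A dependency is satisfied once a write to dep.key with at least its
        # timestamp has been applied locally.
        return self.applied.get(dep.key, 0) >= dep.timestamp

    def on_replicated_write(self, key, value, ts, deps):
        with self.cv:
            # Blocking is acceptable here: the operation is not client-facing,
            # and local reads of `key` keep returning the previous value.
            self.cv.wait_for(lambda: all(self._satisfied(d) for d in deps))
            if ts > self.applied.get(key, 0):   # last-writer-wins by timestamp
                self.values[key] = value
                self.applied[key] = ts
            self.cv.notify_all()   # may unblock writes that depend on this one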
The system described thus far provides causal consistency and all of the ALPS properties. Causal
consistency is provided by tracking causality with a client library and enforcing the causal order
with dependency checks on replicated writes. Availability and partition tolerance are ensured
by keeping all operations inside the local data center. Low latency is guaranteed by keeping all
operations local, nonblocking, and lock-free. Finally, a fully decentralized design ensures that the
system has scalability.
The current system, however, is inefficient. It has a huge amount of dependency metadata that
travels around with write operations and a huge number of dependency checks to execute before
applying them. Both of these factors steal throughput from user-facing operations and reduce
the utility of the system. Luckily, our systems can exploit the transitivity inherent in the graph of causality to drastically reduce the dependencies that must be tracked and enforced. The subset that is tracked is the one-hop dependencies, which have an arc directly to the current operation in the graph of causality. (In graph-theoretic terms, the one-hop dependencies are the direct predecessor set of an operation.) In figure 6 the one-hop dependencies are shown in bold. They transitively capture all of the ordering constraints on an operation: every other dependency is, by definition, depended upon by at least one of the one-hop dependencies, so if the current operation occurs after its one-hop dependencies, then by transitivity it also occurs after all the others.
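As a sketch of the reduction (the function name is ours, and the dependency sets are the ones recovered from figure 6d), pruning every dependency that another dependency already covers leaves exactly the one-hop set:

def one_hop(full_deps, dep_graph):
    """Drop each dependency that some other dependency already depends on;
    the remainder transitively covers all of the original constraints."""
    covered = set()
    for d in full_deps:
        covered |= dep_graph.get(d, set())
    return full_deps - covered

# Full dependency sets from figure 6d.
dep_graph = {
    "w1": set(),
    "w3": {"w1"},
    "w5": set(),
    "w6": {"w3", "w1"},
    "w8": {"w6", "w5", "w3", "w1"},
}
# For w8, w6 covers w3 and w1, so checking only {w6, w5} orders w8 after all four.
assert one_hop(dep_graph["w8"], dep_graph) == {"w6", "w5"}
assert one_hop(dep_graph["w6"], dep_graph) == {"w3"}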
LIMITED TRANSACTIONS
In addition to causal consistency, our systems provide limited forms of transactions. These include
read-only transactions, which transactionally read data spread across many servers in a data center,
and write-only transactions, which transactionally update data spread across many servers in a data
center.
These limited transactions are necessitated—and complicated—by the current scale of data. Data
for many services is now far too large to fit on a single machine and instead must be spread across
many machines. With data resident on many machines, extracting a consistent view of that data
becomes tricky. Even though a data store itself may always be consistent, a client can extract an
inconsistent view because the client’s reads will be served at different times by different servers. This,
unfortunately, can reintroduce many of the anomalies inherent in eventual consistency. In figure 8,
for example, in the West Coast data center, a grad student removes photo-viewing permissions from
an advisor and uploads incriminating photos. The advisor concurrently tries to view the student’s
photos and, incorrectly, is shown the incriminating photos. To avoid these anomalies, causal
consistency must be extended from the storage system to the Web servers and then on to users of the
service. This can be done using read-only transactions.
Read-only transactions allow programmers to transactionally read data spread across many
servers, yielding a consistent view of the data. The interface for a read-only transaction is simple:
a list of data locations. Instead of issuing many individual reads for different data locations, a
programmer issues a single read for all those locations. This is similar to batching these operations,
which is often done to make dispatching reads more efficient—except that it also ensures that the
results are isolated.
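For example, the Web-server code for the photo page might issue one isolated read for both the permissions and the photos; the read_only_txn name and key layout here are illustrative, not the actual API.

def render_photo_page(client, viewer, owner):
    # One isolated read for both locations, instead of two separate reads that
    # different servers might answer at different times.
    acl, photos = client.read_only_txn([owner + ":acl", owner + ":photos"])
    return photos if viewer in acl else []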
With read-only transactions, anomaly 5 can now be converted into a regularity as well. Figure 9
shows that with read-only transactions, the permissions and photos are read together transactionally,
yielding any of the three valid states shown, but never the anomaly that leaks the incriminating
photos to the student’s advisor.
The basic idea behind our read-only transaction algorithm is that we want to read the entire distributed data store at a single logical time. (For logical time, each node in the system keeps a logical clock that is updated every time an event occurs, such as writing a value or receiving a message. When sending a message, a node includes a timestamp t set to its logical clock c; when receiving a message, a node sets c ← max(c, t + 1).9) Logical time provides a progressing logical view of the system even though the system is distributed and has no centralized coordination: if event a happens before event b, then LT(a) < LT(b), where LT denotes an event's logical time. Thus, if the distributed data store is read at a single logical time t, we know that every event that happened before the values seen at time t has a lower logical time and is therefore reflected in the results. Figure 10 shows an example of this graphically, with validity periods for values, represented by letters, written to different locations.
[Figure 10. Validity periods in logical time: location 1 holds A from time 1, B from 11, and C from 21; location 2 holds J from 2, K from 12, and L from 19; location 3 holds X from 3 and Y from 15.]
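A minimal logical clock of this kind looks like the following (a textbook Lamport clock,9 not the systems' exact implementation):

class LogicalClock:
    def __init__(self):
        self.c = 0

    def tick(self):
        """Local event, such as writing a value."""
        self.c += 1
        return self.c

    def send_timestamp(self):
        """Timestamp t to attach to an outgoing message."""
        return self.tick()

    def receive(self, t):
        """On receiving a message with timestamp t, set c <- max(c, t + 1)."""
        self.c = max(self.c, t + 1)
        return self.c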
You can determine whether the values within a set are consistent with one another by annotating each value with the logical times at which it became visible and at which it was overwritten. For example, in figure 10 consistent sets include {A,J,X}, {B,K,X}, {B,K,Y}, and {B,L,Y}, while inconsistent sets include {A,K,X}, {B,J,Y}, and {C,L,X}, among others. Our servers annotate values in this way and include the annotations when returning results to the client library, so it can determine whether values are mutually consistent.
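The check itself is an interval-intersection test. A sketch, using the validity periods from figure 10 (treated as closed intervals for simplicity):

def mutually_consistent(intervals):
    """True if a single logical time lies inside every (visible_from, valid_until)."""
    latest_start = max(start for start, _ in intervals)
    earliest_end = min(end for _, end in intervals)
    return latest_start <= earliest_end

# From figure 10: {A, J, X} are mutually consistent, {A, K, X} are not.
assert mutually_consistent([(1, 11), (2, 12), (3, 15)])       # A, J, X
assert not mutually_consistent([(1, 11), (12, 19), (3, 15)])  # A, K, X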
Our read-only transaction algorithm is run by the client library and takes at most two rounds of
parallel reads. In the first round, the client library sends out parallel requests for all requested data.
Servers respond with their currently visible values and validity intervals; an interval spans from the logical time at which the value was written to the current logical time at the server. The value may be valid at future logical times as well, but conservatively we know it is valid for at least this interval. Once the client receives
all responses, it determines if all the values are mutually consistent by checking for intersection in
their validity intervals. If there is intersection—which is almost always the case unless some of the
values are overwritten concurrently with the read-only transaction—then the client library knows
the values are consistent and returns them to the client.
If the validity intervals do not all intersect, then the process moves to the second round of the algorithm. The second round begins by calculating a single logical time at which to read values, called the effective time. The effective time is chosen to ensure an up-to-date view of the data, rather than an old consistent cut of it, while still allowing many of the values retrieved in the first round to be reused. The client library then issues a second round of parallel reads for all data for which it does not have a valid value at the effective time. These second-round reads ask for the value of the data at the effective time, and servers answer them by traversing the older versions of a value until they find the one that was valid at the effective time. Figure 11 shows the second round in action. Figure 11a is a read-only transaction that completes in a single round, while figure 11b is a read-only transaction that requires a second round: it requests data location 1 at the effective time 15 and receives value B in response.
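Putting the two rounds together, the client-side logic is roughly the sketch below. The server interface (read returning a value with its validity interval, and read_at returning the value that was valid at a given logical time) is assumed for illustration, and the effective-time choice shown is a simplification; Eiger's actual rule is more careful.12

def read_only_txn(servers, keys):
    # Round 1: read the current visible value of every key (in parallel in a
    # real implementation; sequentially here for brevity).
    first = {k: servers[k].read(k) for k in keys}   # k -> (value, start, end)
    starts = [start for _, start, _ in first.values()]
    ends = [end for _, _, end in first.values()]
    if max(starts) <= min(ends):
        # The validity intervals all intersect: the values are mutually consistent.
        return {k: value for k, (value, _, _) in first.items()}

    # Round 2: pick an effective time and reread only the keys whose first-round
    # values are not valid at that time.
    effective = max(starts)   # simplification; see the Eiger paper for the real rule
    result = {}
    for k, (value, start, end) in first.items():
        if start <= effective <= end:
            result[k] = value                        # first-round value is reusable
        else:
            result[k] = servers[k].read_at(k, effective)
    return result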
This read-only transaction algorithm is specifically designed to maintain all the ALPS properties
and provide high performance. It is available because all reads ask for a current value or an old value.
It is low-latency because it requires at most two nonblocking rounds of parallel reads. It is partition-
tolerant because all reads are in the local data center (partitions are assumed to occur only in the
wide area, not in the local data center). It is scalable because it is fully decentralized. Finally, it is
performant because it normally takes only a single round of parallel reads and only two rounds of
reads in the worst case.
Our previous work on Eiger12 has more details on how to choose the effective time, how to limit
server-side storage of old versions, and an algorithm for write-only transactions that also maintains
all the ALPS properties.
[Figure 11. Read-only transactions. (a) A transaction that completes in one round: the returned validity intervals (location 1: A over 1-7; location 2: J over 2-9; location 3: X over 3-8) all intersect. (b) A transaction that requires a second round: location 1 returns A over 1-10, location 2 returns K over 12-16, and location 3 returns X over 3-15; the intervals do not all intersect, so the client rereads location 1 at the effective time 15 and receives B.]
For a real-world Facebook workload, Eiger’s causal consistency and stronger semantics do not impose significant overhead.
To demonstrate the scalability of Eiger, we ran the Facebook TAO workload on N client machines
that fully loaded an N-server cluster that is replicating writes to another N-server cluster (i.e., the
N=128 experiment involves 384 machines). This experiment was run on PRObE’s Kodiak testbed,6
which uses Emulab to provide exclusive access to hundreds of machines. Figure 12 shows the
throughput for Eiger as N scales from eight to 128 servers/cluster. The bars show throughput
normalized against the throughput of the eight-server cluster. Eiger scales out as the number of
servers increases; each doubling of the number of servers increases cluster throughput by 96 percent
on average.
MORE INFORMATION
More information is available in our papers on COPS11 and Eiger,12 and Wyatt Lloyd’s dissertation.13
The code for Eiger is available from https://github.com/wlloyd/eiger.
ACKNOWLEDGEMENTS
This work was supported by funding from National Science Foundation Awards CSR-0953197
(CAREER), CCF-0964474, CNS-1042537 (PRObE), and CNS-1042543 (PRObE); and by Intel via the
Intel Science and Technology Center for Cloud Computing (ISTC-CC).
[Figure 12. Eiger throughput, normalized to the eight-server cluster, as the cluster size scales from 8 to 128 servers per cluster (both axes log scale).]
REFERENCES
1. Brewer, E. 2000. Towards robust distributed systems. In Proceedings of the 19th Annual ACM
Symposium on Principles of Distributed Computing.
2. Bronson, N., et al. 2013. TAO: Facebook’s distributed data store for the social graph. In Proceedings
of the Usenix Annual Technical Conference.
3. Cham, J. 2013. PhD Comics (June); http://www.phdcomics.com/comics.php?f=1592.
4. Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev,
A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S.,
Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang,
R., Woodford, D. 2013. Spanner: Google’s globally distributed database. ACM Transactions on
Computer Systems 31(3).
5. DeCandia, G., et al. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of the
21st ACM Symposium on Operating Systems Principles: 205-220.
6. Gibson, G., Grider, G., Jacobson, A., Lloyd, W. 2013. PRObE: A thousand-node experimental
cluster for computer systems research. Usenix ;login: 38(3).
7. Gilbert, S., Lynch, N. 2002. Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant Web services. ACM SIGACT News 33(2): 51-59.
8. Lakshman, A., Malik, P. 2009. Cassandra—a decentralized structured storage system. In the 3rd
ACM SIGOPS International Workshop on Large-scale Distributed Systems and Middleware.
9. Lamport, L. 1978. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM 21(7): 558-565.
10. Lipton, R. J., Sandberg, J. S. 1988. PRAM: a scalable shared memory. Technical Report TR-180-88.
Princeton University, Department of Computer Science.
11. Lloyd, W., Freedman, M. J., Kaminsky, M., Andersen, D. G. 2011. Don’t settle for eventual:
scalable causal consistency for wide-area storage with COPS. In Proceedings of the 23rd Symposium
on Operating Systems Principles: 401-416.
12. Lloyd, W., Freedman, M. J., Kaminsky, M., Andersen, D. G. 2013. Stronger semantics for low-
latency geo-replicated storage. In Proceedings of the 10th Usenix Conference on Networked Systems
Design and Implementation: 313-328.
13. Lloyd, W. 2013. Stronger consistency and semantics for low-latency geo-replicated storage. Ph.D.
Dissertation, Princeton University.
14. Santora, M. 2013. In hours, thieves took $45 million in ATM scheme. New York Times (May 9).
15. Sovran, Y., Power, R., Aguilera, M. K., Li, J. 2011. Transactional storage for geo-replicated systems.
In Proceedings of the 23rd Symposium on Operating Systems Principles: 385-400.
16. Voldemort. 2013; http://project-voldemort.com.