vCorfu: Cloud-Scale Object Store
Michael Wei, University of California, San Diego, and VMware Research Group; Amy Tai,
Princeton University and VMware Research Group; Christopher J. Rossbach, The University
of Texas at Austin and VMware Research Group; Ittai Abraham, VMware Research Group;
Maithem Munshed, Medhavi Dhawan, and Jim Stabile, VMware; Udi Wieder and Scott
Fritchie, VMware Research Group; Steven Swanson, University of California, San Diego;
Michael J. Freedman, Princeton University; Dahlia Malkhi, VMware Research Group
https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/wei-michael
materialized streams. Unlike streams proposed in previous systems [10], which support only sequential reads, materialized streams support fast, fully random reads, because all updates for a stream can be accessed from a single partition. This design enables log replicas to use SMR to service requests directly, relieving the burden of playback from clients. Like other shared log systems, vCorfu supports strong consistency, linearizable reads and transactions, but the locality advantages of materialization enable vCorfu to scale to thousands of clients. vCorfu also leverages a sequencer to implement a fast, lightweight transaction manager and can execute read-only transactions without introducing conflicts. A novel technique called composable state machine replication (CSMR) enables vCorfu to store huge objects while still allowing client queries to be expressed using a familiar object-based model.

We make the following contributions:

• We present the design and architecture of vCorfu, a cloud-scale distributed object store built on a shared log. vCorfu's novel materialization technique enables reads without playback while maintaining strong consistency and high availability.

• We show that vCorfu's innovative design provides the same strong consistency guarantees as shared log designs while enabling scalability and performance that is competitive with, and often better than, current NoSQL systems.

• We demonstrate that by conditionally issuing tokens, our sequencer performs lightweight transaction resolution, relieving clients of the burden of resolving transactions.

• We evaluate vCorfu against a popular NoSQL store, Cassandra, and show that vCorfu is just as fast for writes and much faster at reads, even while providing stronger consistency guarantees and advanced features such as transactions.

• We describe CSMR, a technique which enables efficient storage of huge objects by composition of a large state machine from smaller component state machines. vCorfu can store and support operations against a 10GB YCSB [16] database without sacrificing the strong consistency afforded by SMR.

2 Background

2.1 Data Stores

Modern web applications rely heavily on multi-tiered architectures to enable systems in which components may be scaled or upgraded independently. Traditional architectures consist of three layers: a front-end which communicates with users, an application tier with stateless logic, and a data tier, where state is held. This organization enabled early web applications to scale easily, because stateless front-end and application tiers enable scaling horizontally in the application tier with the addition of more application servers, or vertically in the data tier by upgrading to more powerful database servers.

As more and more applications move to cloud execution environments, system and application designers face increasingly daunting scalability requirements in the common case. At the same time, the end of Dennard scaling [21] leaves system builders unable to rely on performance improvements from the hardware: vertical scaling at the data tier is no longer feasible in most settings. As a consequence, modern cloud-scale systems generally trade off reduced functionality and programmability for scalability and performance at the data tier. A new class of NoSQL data stores [1, 4, 12, 14, 18, 26] has emerged, which achieve cloud scale by relaxing consistency, eliding transaction support, and restricting query and programming models.

A severe consequence of this trend is an increased burden on programmers. In practice, programmers of modern cloud systems are forced to cobble together tools and components to restore missing functionality when it is needed. For example, a lock server such as ZooKeeper [23] is often used in conjunction with a NoSQL store to implement atomic operations. Programmers commonly implement auxiliary indexes to support queries, typically with relaxed consistency since the auxiliary index is not maintained by the data store.

2.2 Scalable Shared Logs

Shared logs have been used to provide highly fault-tolerant distributed data stores since the 1980s [36, 38]. Logs are an extremely powerful tool for building strongly consistent systems, since data is never overwritten, only appended, which yields a total order over concurrent modifications to the log. Early shared logs had limited scalability, as all appends must be serialized through a single server, which quickly becomes an I/O bottleneck.

More recent shared log designs [9, 10, 11, 40, 41] address this scalability limitation to varying degrees. For example, the Corfu protocol [9] leverages a centralized sequencer which is not part of the I/O path, yielding a design in which append throughput is only limited by the speed at which a sequencer can issue log addresses.
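To make the role of the sequencer concrete, the following is a minimal sketch of a sequencer-based shared log append. It is illustrative only, not the Corfu implementation; the interfaces (Sequencer, StorageUnit, Layout) and method names are our own assumptions. The key point is that the sequencer only hands out addresses and stays off the data path, while the payload flows directly to the storage unit responsible for that address.

// Minimal sketch of a sequencer-based shared log append (illustrative; not the Corfu API).
interface Sequencer { long nextToken(); }                        // hypothetical: next free log address
interface StorageUnit { void write(long address, byte[] data); } // hypothetical storage server
interface Layout { StorageUnit storageUnitFor(long address); }   // hypothetical address-to-replica map

class SharedLogClient {
    private final Sequencer sequencer;
    private final Layout layout;

    SharedLogClient(Sequencer sequencer, Layout layout) {
        this.sequencer = sequencer;
        this.layout = layout;
    }

    long append(byte[] data) {
        long address = sequencer.nextToken();                 // reserve a slot in the total order
        layout.storageUnitFor(address).write(address, data);  // write the payload directly to storage
        return address;                                       // the address orders this entry globally
    }
}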
[Figure 2: The architecture of vCorfu. Components shown: vCorfu clients (Runtime §4.1, Local Views §4.2, Remote Views §4.2, CSMR §5) and the vCorfu stream store (Sequencer §3, Layout §3, vCorfu Stream Protocol §3, Log Replicas §3.1 (Even/Odd), Stream Replicas §3.1 (A-M/N-Z)). Solid lines highlight the write path, while dotted lines highlight the read path. Thin lines indicate control operations outside of the I/O path.]

2.3 State Machine Replication

Most shared log systems use state machine replication (SMR) [37], which relies on the log's total ordering of appends to implement an abstraction layer over the log. Data stored on the log is modeled as a state machine. Clients modify data by appending updates to the log and read data by traversing the log and applying those updates in order to an in-memory view. This approach enables strong consistency, and significantly simplifies support for transactions over multiple data items [10, 11].

The Achilles' heel of shared log systems, however, is playback. To service any request, a client must read every single update and apply it to in-memory state, regardless of whether the request has any dependency on those updates. In practice, this has limited the applicability of shared log systems to settings characterized by few clients or small global state [9], such as metadata services [10, 40]. In contrast, data tiers in typical web applications manage state at a scale that may make traditional playback prohibitively expensive. Worse, in systems relying on stateless application tiers, naïve use of shared logs induces a playback requirement to reconstruct state for every request. The goal of vCorfu is to eliminate these limitations, enabling SMR with shared logs over large state and without client playback overheads.

3 vCorfu Stream Store

vCorfu implements a shared log abstraction that removes the overhead and limitations of shared logs, enabling playback that does not force a client to play back potentially irrelevant updates. vCorfu virtualizes the log using a novel technique called stream materialization. Unlike streams in Tango, which are merely tags within a shared log, materialized streams are a first class abstraction which supports random and bulk reads just like scattered logs such as Kafka [26] and Kinesis [30], but with all the consistency benefits of a shared log like Corfu [9] and Tango [10].

The vCorfu stream store architecture is shown in Figure 2. In vCorfu, data are written to materialized streams, and data entries receive monotonically increasing tokens on both a global log and on individual streams from a sequencer server. The sequencer can issue tokens conditionally to enable fast optimistic transaction resolution, as described in Section 4. vCorfu writes data in the form of updates to both log replicas and stream replicas, each of which are indexed differently. This design replicates data for durability, but enables access to that data with different keys, similar to Replex [39]. The advantage is that clients can directly read the latest version of a stream simply by contacting the stream replica.

A layout service maintains the mapping from log and stream addresses to replicas. Log replicas and stream replicas in vCorfu contain different sets of updates, as shown in Figure 1. The log replicas store updates by their (global) log address, and stream replicas by their stream addresses. The replication protocol in vCorfu dynamically builds replication chains based on the global log offset, the streams which are written to, and the stream offsets. Subsequent sections consider the design and implementation of materialized streams in more detail.

vCorfu is elastic and scalable: replicas may be added or removed from the system at any time. The sequencer, because it merely issues tokens, does not become an I/O bottleneck. Reconfiguration is triggered simply by changing the active layout. Finally, vCorfu is fault tolerant: data stored in vCorfu can tolerate a limited number of failures based on the arrangement and number of replicas in the system, and recovery is handled similarly to the mechanism in Replex [39]. Generally, vCorfu can tolerate failures as long as a log replica and stream replica do not fail simultaneously. Stream replicas can be reconstructed from the aggregate of the log replicas, and log replicas can be reconstructed by scanning through all stream replicas.

Operation | Description
read(laddresses) | Get the data stored at log address(es).
read(stream, saddresses) | Read from a stream at stream address(es).
append(stream, data) | Append data to a stream.
check(stream) | Get the last address issued to a stream.
trim(stream, saddresses, prefix) | Release all entries with stream address < prefix.
fillhole(laddress) | Invoke hole-filling for a log address.

Table 1: Core operations supported by the vCorfu shared log.
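For concreteness, the operations in Table 1 could be expressed as a client-facing Java interface roughly as follows. This is only an illustrative sketch; the type and method names below are ours, not the actual vCorfu codebase, and trim() is simplified to a prefix-only form.

import java.util.List;
import java.util.UUID;

// Illustrative sketch of the operations in Table 1 (names are ours, not vCorfu's actual API).
interface VCorfuStreamStore {
    List<byte[]> read(long... laddresses);               // read by global log address(es)
    List<byte[]> read(UUID stream, long... saddresses);   // read by per-stream address(es)
    long append(UUID stream, byte[] data);                // append data to a stream
    long check(UUID stream);                              // last address issued to a stream
    void trim(UUID stream, long prefix);                  // simplified: release entries with stream address < prefix
    void fillhole(long laddress);                         // invoke hole-filling for a log address
}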
Operationally, stream materialization divides a single global log into materialized streams, which support logging operations: append, random and bulk reads, trim, check and fillhole; the full API is shown in Table 1. Each materialized stream maps to an object in vCorfu, and each stream stores an ordered history of modifications to that object, following the SMR [37] paradigm.

3.1 Fully Elastic Layout

In vCorfu, a mapping called a layout describes how offsets in the global log or in a given materialized stream map to replicas. A vCorfu client runtime must obtain a copy of the most current layout to determine which replica(s) to interact with. Each layout is stamped with an epoch number. Replicas will reject requests from clients with a stale epoch. A Paxos-based protocol [27] ensures that all replicas agree on the current layout. An example layout is shown in Figure 3. Layouts work like leases on the log: a client request with the wrong layout (and wrong epoch number) will be rejected by replicas. The layout enables clients to safely contact a stream replica directly for the latest update to a stream.

[Figure 3: An example layout, listing a sequencer (10.0.0.1) and a segment starting at address 0 with log replicas (10.0.1.1, 10.0.1.2) and stream replicas (10.0.2.1, 10.0.2.2).]

3.2 Appending to vCorfu materialized streams

A client appending to a materialized stream (or streams) first obtains the current layout and makes a request to the sequencer with a stream id. The sequencer returns both a log token, which is a pointer to the next address in the global log, and a stream token, which is a pointer to the next address in the stream. Using these tokens and the layout, the client determines the set of replicas to write to.

In contrast to traditional designs, replica sets in vCorfu are dynamically arranged during appends. For fault tolerance, each entry is replicated on two replica types: the first indexed by the address in the log (the log replica), and the second by the combination of the stream id and the stream address (the stream replica). To perform a write, the client writes to the log replica first, then to the stream replica. If a replica previously accepted a write to a given address, the write is rejected and the client must retry with a new log token. Once the client writes to both replicas, it commits the write by broadcasting a commit message to each replica it accessed (except the final replica, since the final write is already committed). Replicas will only serve reads for committed data. This enables stream replicas to provide a dense materialized stream, without holes. The write path of a client, which takes four roundtrips in normal operation, is shown in Figure 4. A server-driven variant where the log replica writes to the stream replica takes 6 messages; we leave implementation of this variant for future work.

[Figure 4: Normal write path of a vCorfu log write, which takes four roundtrips: one for token acquisition, two for writing to each replica (and committing at the stream replica), and one to send a commit message to the log replica. Residual diagram labels: Sequencer, Log Replicas Even (mod 0) / Odd (mod 1); messages 1a/1b nextToken, 2a write(4, data), 3b commit.]

3.3 Atomically appending to multiple streams

The primary benefit of materialized streams is that they provide an abstraction of independent logs while maintaining a total global order over all appends. This enables vCorfu to support atomic writes across streams, which form the basic building block for supporting transactions.

To append to multiple streams atomically, the client obtains a log token and stream tokens for each stream it wishes to append to. The client first writes to the log replica using the log token. Then, the client writes to the stream replica of each stream (multiple streams mapped to the same replica are written together so each replica is visited only once). The client then sends a commit message to each participating replica (the commit and write are combined for the last replica in the chain). The resulting write is ordered in the log by a single log token, but multiple stream tokens.
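The multi-stream append just described can be sketched in client-side code as follows. This is a simplified illustration of the protocol in §3.2 and §3.3 under our own assumed types (TokenResponse, Sequencer, Replica, Layout); it omits hole-filling, retries on rejected tokens, and the details of chain construction.

import java.util.*;

// Simplified sketch of an atomic multi-stream append (§3.2-§3.3); types and names are illustrative.
class MultiStreamAppend {
    static class TokenResponse {            // assumed shape of the sequencer's reply
        long logToken;                      // next address in the global log
        Map<UUID, Long> streamTokens;       // next address in each requested stream
    }
    interface Sequencer { TokenResponse nextToken(Set<UUID> streams); }
    interface Replica {
        void write(long logToken, Map<UUID, Long> streamTokens, byte[] data); // rejects if the address is taken
        void commit(long logToken);
    }
    interface Layout {
        Replica logReplicaFor(long logToken);
        Set<Replica> streamReplicasFor(Set<UUID> streams);
    }

    static void append(Sequencer seq, Layout layout, Set<UUID> streams, byte[] data) {
        TokenResponse t = seq.nextToken(streams);             // one log token plus one token per stream
        List<Replica> chain = new ArrayList<>();
        chain.add(layout.logReplicaFor(t.logToken));          // write the log replica first
        chain.addAll(layout.streamReplicasFor(streams));      // then each stream replica, visited once
        for (Replica r : chain) {
            r.write(t.logToken, t.streamTokens, data);        // a rejected write means: retry with a new token
        }
        for (int i = 0; i < chain.size() - 1; i++) {
            chain.get(i).commit(t.logToken);                  // the last replica's write already commits
        }
    }
}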
3.4 Properties of the vCorfu Stream Store

Materialized streams are a first class abstraction in vCorfu, unlike streams in Tango [10], which are merely tags within a shared log. Materialized streams strike a balance that combines the global consistency advantages of shared logs with the locality advantages of distributed data platforms. Specifically, these properties enable vCorfu to effectively support SMR at scale:

The global log is a single source of scalability, consistency, durability and history. One may wonder, why have log replicas at all, if all we care to read from are materialized streams? First, the global log provides a convenient, scalable mechanism to obtain a consistent snapshot of the entire system. This can be used to execute long running read-only transactions, a key part of many analytics workloads, or a backup utility could constantly scan the log and move it to cold storage. Second, the log provides us with a unique level of fault tolerance: even if all the stream replicas were to fail, vCorfu can fall back to using the log replicas only, continuing to service requests.

Materialized streams are true virtual logs, unlike streams. Tango streams enable clients to selectively consume a set of updates in a shared log. Clients read sequentially from streams using a readNext() call, which returns the next entry in the stream. Tango clients cannot randomly read from anywhere in a stream because streams are implemented using a technique called backpointers: each entry in a stream points to the previous entry, inducing a requirement for sequential traversal. Materializing the stream removes this restriction: since clients have access to a replica which contains all the updates for a given stream, clients can perform all the functions they would call on a log, including a random read given a stream address, or a bulk read of an entire stream. This support is essential if clients randomly read from different streams, as backpointers would require reading each stream from the tail in order.

vCorfu avoids backpointers, which pose performance, concurrency and recovery issues. Backpointers can result in performance degradation when concurrent clients are writing to the log and a timeout occurs, causing a hole filling protocol to be invoked [9]. Since holes have no backpointers, timeouts force a linear scan of the log, with a cost proportional to the number of streams in the log. Tango mitigates this problem by keeping the number of streams low and storing multiple backpointers, which has significant overhead because the sequencer must maintain a queue for each stream. Furthermore, backpointers significantly complicate recovery: if the sequencer fails, the entire log must be read to determine the most recent writes to each stream. vCorfu instead relies on stream replicas, which contain a complete copy of updates for each stream, free of holes thanks to vCorfu's commit protocol, resorting to a single backpointer only when stream replicas fail. Sequencer recovery is fast, since stream replicas can be queried for the most recent update.

Stream replicas may handle playback and directly serve requests. In most shared log designs, clients must consume updates, which are distributed and sharded for performance. The log itself cannot directly serve requests because no single storage unit for the log contains all the updates necessary to service a request. Stream replicas in vCorfu, however, contain all the updates for a particular stream, so a stream replica can play back updates locally and directly service requests to clients, a departure from the traditional client-driven shared log paradigm. This removes the burden of playback from clients and avoids the playback bottleneck of previous shared log designs [10, 11].

Garbage collection is greatly simplified. In Tango, clients cannot trim (release entries for garbage collection) streams directly. Instead, they must read the stream to determine which log addresses should be released, and issue trim calls for each log address, which can be a costly operation if many entries are to be released. In vCorfu, clients issue trim commands to stream replicas, which release storage locally and issue trim commands to the global log. Clients may also delegate the task of garbage collection directly to a stream replica.
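As a rough illustration of the read side of these properties, a client that holds the current layout can read a stream entirely through its stream replica, either at a specific stream address or as a bulk read, without touching the global log or performing client-side playback. The sketch below uses our own illustrative types, not the actual vCorfu client classes.

import java.util.List;
import java.util.UUID;

// Illustrative read path against a stream replica (§3.4); types and names are assumptions.
class StreamReader {
    interface StreamReplica {
        byte[] read(UUID stream, long saddress);                   // random read at a stream address
        List<byte[]> readRange(UUID stream, long from, long to);   // bulk read, inclusive
        long check(UUID stream);                                   // last committed stream address
    }
    interface Layout { StreamReplica streamReplicaFor(UUID stream); }

    // Bulk-read every committed update of a stream directly from its replica.
    static List<byte[]> readAll(Layout layout, UUID stream) {
        StreamReplica replica = layout.streamReplicaFor(stream);
        long tail = replica.check(stream);
        return replica.readRange(stream, 0, tail);
    }
}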
4 The vCorfu Architecture

vCorfu presents itself as an object store to applications. Developers interact with objects stored in vCorfu, and a client library, which we refer to as the vCorfu runtime, provides consistency and durability by manipulating and appending to the vCorfu stream store. Today, the vCorfu runtime supports Java, but we envision supporting many other languages in the future.

The vCorfu runtime is inspired by the Tango [10] runtime, which provides a similar distributed object abstraction in C++. On top of the features provided by Tango, such as linearizable reads and transactions, vCorfu leverages Java language features which greatly simplify writing vCorfu objects. Developers may store arbitrary Java objects in vCorfu; we only require that the developer provide a serialization method and annotate the object to indicate which methods read or mutate the object, as shown in Figure 5.

class User {
  String name; String password;
  DateTime lastLogin; DateTime lastLogout;

  @Accessor
  public String getName() {
    return name;}

  @MutatorAccessor
  public boolean login(String pass, DateTime time){
    if (password.equals(pass)) {
      lastLogin = time;
      return true;}
    return false;}

  @Mutator
  public void logout(DateTime time) {
    lastLogout = time;}}

Figure 5: A Java object stored in vCorfu. @Mutator indicates that the method modifies the object, @Accessor indicates that the method reads the object, and @MutatorAccessor indicates that the method reads and modifies the object.

Like Tango, vCorfu fully supports transactions over objects with stronger semantics than most distributed data stores, thanks to inexpensive global snapshots provided by the log. In addition, vCorfu also supports transactions involving objects not in the runtime's local memory (case D, §4.1 in [10]), opacity [22], which ensures that transactions never observe inconsistent state, and read-own-writes, which greatly simplifies concurrent programming. Unlike Tango, the vCorfu runtime never needs to resolve whether transactional entries in the log have succeeded, thanks to a lightweight transaction mechanism provided by the sequencer.

4.1 vCorfu Runtime

To interact with vCorfu as an object store, clients load the vCorfu runtime, a library which manages interactions with the vCorfu stream store. Developers never interact with the store directly; instead, the runtime manipulates the store whenever an object is accessed or modified. The runtime provides each client with a view of objects stored in vCorfu, and these views are synchronized through the vCorfu stream store.

The runtime provides three functions to clients: open(), which retrieves an in-memory view of an object stored in the log, TXBegin(), which starts a transaction, and TXEnd(), which commits a transaction.

4.2 vCorfu Objects

As we described earlier, vCorfu objects can be arbitrary Java objects such as the one shown in Figure 5. Objects map to a stream, which stores updates to that object. Like many shared log systems, we use state machine replication (SMR) [37] to provide strongly consistent accesses to objects. When a method annotated with @Mutator or @MutatorAccessor is called, the runtime serializes the method call and appends it to the object's stream first. When an @Accessor or @MutatorAccessor is called, the runtime reads all the updates to that stream, and applies those updates to the object's state before returning. In order for SMR to work, each mutator must be deterministic (a call to random() or new Date() is not supported). Many method calls can be easily refactored to take non-deterministic calls as a parameter, as shown in the login method in Figure 5.

The SMR technique extracts several important properties from the vCorfu stream store. First, the log acts as a source of consistency: every change to an object is totally ordered by the sequencer, and every access to an object reflects all updates which happen before it. Second, the log is a source of durability, since every object can be reconstructed simply by playing back all the updates in the log. Finally, the log is a source of history, as previous versions of the object can be obtained by limiting playback to the desired position.

Each object can be referred to by the id of the stream it is stored in. Stream ids are 128 bits, and we provide a standardized hash function so that objects can be stored using human-readable strings (i.e., "person-1").

vCorfu clients call open() with the stream id and an object type to obtain a view of that object. The client also specifies whether the view should be local, which means that the object state is stored in-memory locally, or remote, which means that the stream replica will store the state and apply updates remotely (this is enabled by the remote class loading feature of Java). Local views are similar to objects in Tango [10] and especially powerful when the client will read an object frequently throughout the lifespan of a view: if the object has not changed, the runtime only performs a quick check() call to verify no other client has modified the object, and if it has, the runtime applies the relevant updates. Remote views, on the other hand, are useful when accesses are infrequent, the state of the object is large, or when there are many remote updates to the object: instead of having to play back and store the state of the object in-memory, the runtime simply delegates to the stream replica, which services the request with the same consistency as a local view. To ensure that it can rapidly service requests, the stream replicas generate periodic checkpoints. Finally, the client can optionally specify a maximum position to open the view to, which enables the client to access the history, version or snapshot of an object. Clients may have multiple views of the same object: for example, a client may have a local view of the present state of the object with a remote view of a past version of the object, enabling the client to operate against a snapshot.
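To tie §4.1 and §4.2 together, the following sketch shows how a client might open and use the User object from Figure 5. The open() call and the local/remote distinction come from the text above; the exact signature, the VCorfuRuntime interface, and the ViewType enum are our own illustrative assumptions, not the runtime's actual API (DateTime is the Joda-Time class used in Figure 5).

// Illustrative client-side usage of the vCorfu runtime (§4.1-§4.2); the interface shown is an assumption.
class UserExample {
    enum ViewType { LOCAL, REMOTE }
    interface VCorfuRuntime {
        <T> T open(String streamId, Class<T> type, ViewType view);
    }

    static void run(VCorfuRuntime runtime) {
        // A local view: the object's state is played back and cached in this client's memory.
        User alice = runtime.open("person-1", User.class, ViewType.LOCAL);

        // @MutatorAccessor: the call is serialized and appended to the object's stream,
        // then applied, so every client observes the same totally ordered history.
        boolean ok = alice.login("secret", org.joda.time.DateTime.now());

        // A remote view: playback is delegated to the stream replica, which is useful
        // for large or infrequently accessed objects.
        User aliceRemote = runtime.open("person-1", User.class, ViewType.REMOTE);
        System.out.println(aliceRemote.getName() + " logged in: " + ok);
    }
}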
4.3 Transactions in vCorfu

Transactions enable developers to issue multiple operations which either succeed or fail atomically. Transactions are a pain point for partitioned data stores, since a transaction may span multiple partitions, requiring locking or schemes such as 2PL [32] or MVCC [35] to achieve consistency.

vCorfu leverages atomic multi-stream appends and global snapshots provided by the log, and exploits the sequencer as a lightweight transaction manager. Transaction execution is optimistic, similar to transactions in shared log systems [10, 11]. However, since our sequencer supports conditional token issuance, we avoid polluting the log with transactional aborts.

To execute a transaction, a client informs the runtime that it wishes to enter a transactional context by calling TXBegin(). The client obtains the most recently issued log token once from the sequencer and begins optimistic execution by modifying reads to read from a snapshot at that point. Writes are buffered into a write buffer. When the client ends the transaction by calling TXEnd(), the client checks if there are any writes in the write buffer. If there are not, then the client has successfully executed a read-only transaction and ends transactional execution. If there are writes in the write buffer, the client informs the sequencer of the log token it used and the streams which will be affected by the transaction. If the streams have not changed, the sequencer issues log and stream tokens to the client, which commits the transaction by writing the write buffer. Otherwise, the sequencer issues no token and the transaction is aborted by the client without writing an entry into the log. This important optimization ensures only committed entries are written, so that when a client encounters a transactional commit entry, it may treat it as any other update. In other shared log systems [10, 11, 40], each client must determine whether a commit record succeeds or aborts, either by running the transaction locally or looking for a decision record. In vCorfu, we have designed transactional support to be as general as possible and to minimize the amount of work that clients must perform to determine the result of a transaction. We treat each object as an opaque object, since fine-grained conflict resolution (for example, determining if two updates to different keys in a map conflict) would either require the client to resolve conflicts or a much more heavyweight sequencer.

Opacity is ensured by always operating against the same global snapshot, leveraging the history provided by the log. Opacity [22] is a stronger guarantee than strict serializability, as opacity prevents programmers from observing inconsistent state (e.g., a divide-by-zero error when system invariants prevent such a state from occurring). Since global snapshots are expensive in partitioned systems, these systems [1, 2, 3, 4] typically provide only a weaker guarantee, allowing programs to observe inconsistent state but guaranteeing that such transactions will be aborted. Read-own-writes is another property which vCorfu provides: transactional reads will also apply any writes in the write buffer. Many other systems [1, 4, 10] do not provide this property, since it requires writes to be applied to data items. The SMR paradigm, however, enables vCorfu to generate the result of a write in-memory, simplifying transactional programming.

vCorfu fully supports nested transactions, where a transaction may begin and end within a transaction. Whenever transaction nesting occurs, vCorfu buffers each transaction's write set and the transaction takes the timestamp of the outermost transaction.

4.4 Querying Objects

vCorfu supports several mechanisms for finding and retrieving objects. First, a developer can use vCorfu like a traditional key-value store just by using the stream id of an object as a key. We also support a much richer query model: a set of collections, which resemble the Java collections, are provided for programmers to store and access objects in. These collections are objects just like any other vCorfu object, so developers are free to implement their own collection. Developers can take advantage of multiple views on the same collection: for instance, a List can be viewed as a Queue or a Stack simultaneously. Some of the collections we provide include a List, Queue, Stack, Map, and RangeMap.

Collections, however, tend to be very large objects which are highly contended. In the next section, we discuss composable state machine replication, a technique which allows vCorfu to build a collection out of multiple objects.
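Before moving on, here is a small sketch of how the transaction interface of §4.3 is used from a client, combining TXBegin()/TXEnd() with objects opened through the runtime. As in the earlier sketches, the runtime interface, the Account object, and the TransactionAbortedException are our own simplifications rather than the actual vCorfu API.

// Illustrative use of optimistic transactions (§4.3); the runtime interface shown is a simplification.
class TransferExample {
    interface VCorfuRuntime {
        <T> T open(String streamId, Class<T> type);
        void TXBegin();   // snapshot at the most recently issued log token
        void TXEnd();     // asks the sequencer for a conditional token; aborts on conflict
    }
    static class TransactionAbortedException extends RuntimeException {}

    interface Account { long balance(); void credit(long amount); void debit(long amount); }

    static void transfer(VCorfuRuntime rt, String from, String to, long amount) {
        Account src = rt.open(from, Account.class);
        Account dst = rt.open(to, Account.class);
        try {
            rt.TXBegin();                      // reads now come from a single global snapshot (opacity)
            if (src.balance() >= amount) {     // read-own-writes: buffered updates are visible to reads
                src.debit(amount);             // mutators are buffered, not yet appended to the log
                dst.credit(amount);
            }
            rt.TXEnd();                        // sequencer issues tokens only if the streams are unchanged
        } catch (TransactionAbortedException e) {
            // nothing was written to the log; the caller may simply retry
        }
    }
}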
5 Composable State Machine Replication

In vCorfu, objects may be composed of other objects, a technique which we refer to as composable state machine replication (CSMR). The simplest example of CSMR is a hash map composed of multiple hash maps, but much more sophisticated objects can be created.

Composing SMR objects has several important advantages. First, CSMR divides the state of a single object into several smaller objects, which reduces the amount of state stored at each stream. Second, smaller objects reduce contention and false sharing, providing for higher concurrency. Finally, CSMR resembles how data structures are constructed in memory; this allows us to apply standard data structure principles to vCorfu. For example, a B-tree constructed using CSMR would result in a structure with O(log n) time complexity for search, insert and delete operations. This opens a plethora of familiar data structures to developers.

Programmers manipulate CSMR objects just as they would any other vCorfu object. A CSMR object starts with a base object, which defines the interface that a developer will use to access the object. An example of a CSMR hash map is shown in Figure 6. The base object manipulates child objects, which store the actual data. Child objects may reuse standard vCorfu objects, like a hash map, or they may be custom-tailored for the CSMR object, like a B-tree node.

class CSMRMap<K,V> implements Map<K,V> {
  final int numBuckets;

  int getChildNumber(Object k) {
    int hashCode = lubyRackoff(k.hashCode());
    return Math.abs(hashCode % numBuckets);}

  SMRMap<K,V> getChild(int partition) {
    return open(getStreamID() + partition);}

  V get(K key) {
    return getChild(getChildNumber(key)).get(key);}

  @TransactionalMethod(readOnly = true)
  int size() {
    int total = 0;
    for (int i = 0; i < numBuckets; i++) {
      total += getChild(i).size();}
    return total;}

  @TransactionalMethod
  void clear() {
    for (int i = 0; i < numBuckets; i++) {
      getChild(i).clear();}}}

Figure 6: A CSMR Java Map in vCorfu. @TransactionalMethod indicates that the method must be executed transactionally.

In the example CSMR map shown in Figure 6, the object shown is the base object and the child objects are standard SMR maps (backed by a hash map). The number of buckets is set at creation in the numBuckets variable. Two functions, getChildNumber() and getChild(), help the base object locate child objects deterministically. In our CSMR map, we use the Luby-Rackoff [28] algorithm to obtain an improved key distribution over the standard Java hashCode() function. Most operations such as get and put operate as before, and the base object needs only to select the correct child to operate on. However, some operations such as size() and clear() touch all child objects. These methods are annotated with @TransactionalMethod so that, under the hood, the vCorfu runtime uses transactions to make sure objects are modified atomically and read from a consistent snapshot. The vCorfu log provides fast access to snapshots of arbitrary objects, and the ability to open remote views, which avoids the cost of playback, enables clients to quickly traverse CSMR objects without reading many updates or storing large local state.

In a more complex CSMR object, such as our CSMR B-tree, the base object and the child object may have completely different interfaces. In the case of the B-tree, the base object presents a map-like interface, while the child objects are nodes which contain either keys or references to other child objects. Unlike a traditional B-tree, every node in the CSMR B-tree is versioned like any other object in vCorfu. CSMR takes advantage of this versioning when storing a reference to a child object: instead of storing a static pointer to a particular version of a node, as in a traditional B-tree, references in vCorfu are dynamic. Normally, references point to the latest version of an object, but they may point to any version during a snapshotted read, allowing the client to read a consistent version of even the most sophisticated CSMR objects. With dynamic pointers, all pointers are implicitly updated when an object is updated, avoiding a problem in traditional trees, where an update to a single child node can cause an update cascade requiring all pointers up to the root to be explicitly updated, known as the recursive update problem [42].

6 Evaluation

Our test system consists of sixteen 12-core machines running Linux (v4.4.0-38) with 96GB RAM and 10G NICs on each node with a single switch. The average latency measured by ping (56 data bytes) between two hosts is 0.18±0.01 ms when the system is idle. All benchmarks are done in-memory, with persistence disabled. Due to the performance limitations and overheads from Java and serialization, our system was CPU-bound and none of our tests were able to saturate the NIC (the maximum bandwidth we achieved from a single node was 1Gb/s, with 4KB writes).

Our evaluation is driven by the following questions:

• What advantages do we obtain by materializing streams? (§ 6.1)

• Do remote views offer NoSQL-like performance with the global consistency of a shared log? (§ 6.2)

• How does the sequencer act as a lightweight, lock-free transaction manager and offer inexpensive read-only transactions? (§ 6.3)

• How does CSMR keep state machines small, while reducing contention and false conflicts? (§ 6.4)

6.1 vCorfu Stream Store

The design of vCorfu relies on performant materialization. To show that materializing streams is efficient, we implement streams using backpointers in vCorfu with chain replication, similar to the implementation described in Tango [10].
[Figure 7: vCorfu's replication protocol imposes a small penalty on writes to support materialization. Each run is denoted with the number of streams used. Panels: vCorfu, Backpointers; y-axis: appends per second; x-axis: streams (10 to 100K).]

[Figure residue: distribution of backpointer seek sizes, shown as a CDF over seek size (1e+00 to 1e+06) for 32, 1K, 10K, 100K, and 500K streams.]

[Figure residue: read and write operations per second for 32, 1K, and 100K streams versus the number of entries per stream.]
[...] back to using backpointers. Since the local view contains all the previous updates in the stream, reading the [...] however, vCorfu would have to read the entire stream to [...]

[Figure 11: Latency of requests under load with local views and remote views. As the number of clients opening local views on an object increases, so does the latency for a linearized read. Legend: LocalView, RemoteView; y-axis: latency (s); x-axis: number of clients.]

6.2 Remote vs. Local Views

Next, we examine the power of remote views. We first show that remote views address the playback bottleneck: in Figure 11, we append to a single local view and increase the number of clients reading from their own local views. As the number of views increases, read throughput decreases because each view must play back the stream and read every update. Once read throughput is saturated, readers are unable to keep up with the updates in the system and read latency skyrockets: with just 32 clients, the read latency jumps to above one second. With a remote view, the stream replica takes care [...]

We then substantiate our claim that remote views offer performance comparable to many NoSQL data stores. In Figure 12, we run the popular Yahoo! cloud serving benchmark with Cassandra [1] (v2.1.9), a popular distributed key-value store, as well as the backpointer-based implementation of vCorfu described in the previous section. In vCorfu, we implement a key-value store using the CSMR map described in Section 5 with a bucket size of 1024, and configure the system in a symmetrical configuration with 12 replicas and a chain length of 2. Since the Java map interface returns the previous key (a read-modify-write), we implement a spe[...]

[Figure 12: YCSB suite throughput over 3 runs. Error bars indicate standard deviation. Results in order executed by benchmark. Legend: vCorfu-Remote, Cassandra; y-axis: throughput (op/sec); x-axis: workloads Load, A, B, C, F, D, E.]

Name | Workload | Description
Load | 100% Insert | Propagating Database
A | 50% Read / 50% Update | Session Store
B | 95% Read / 5% Update | Photo Tagging
C | 100% Read | User Profile Cache
D | 95% Read / 5% Insert | User Status Updates
E | 95% Scan / 5% Insert | Threaded Conversations
F | 50% Read / 50% Read-Modify | User Database

Table 2: YCSB workloads.
[Figure residue (evaluation, §6.3): transaction throughput and latency plots comparing vCorfu with 2PL. Recoverable labels: "TX per sec", "op per second", "Latency (us)", "2PL Abort %"; legends: "2PL", "vCorfu", "2PL AbortRate", "WithoutAnalytics-TPut", "WithAnalytics-TPut", "WithoutAnalytics-Latency", "WithAnalytics-Latency".]
[Figure 16: Top: The latency of initializing a local view versus the number of updates to the object, for different bucket sizes and on a primitive SMR map. Bottom: The abort rate of optimistic transactions with varying concurrency and bucket sizes on a primitive SMR map. Panels: Uniform, Zipf; legend: Primitive, C10, C1000; x-axis: number of clients; y-axis: latency (ms) and abort rate.]

[Figure residue: latency (s) plot with legend "FirstRead", "InMemory".]

[...] include variants of two-phase locking [17, 32], a serialization oracle [8, 15], or a two-round distributed or[...]

vCorfu is built around SMR [37], which has been used both with [10, 11] and without [7, 23] shared logs to implement strongly consistent services. The SMR paradigm requires that each replica store a complete copy of the state, which is impractical for replicating large systems at scale, such as databases. vCorfu takes advantage of CSMR to logically partition large state machines, and [...]
References

[1] Cassandra. http://cassandra.apache.org/.

[2] Cockroach Labs. http://www.cockroachlabs.com/.

[3] Couchbase. http://www.couchbase.com/.

[4] Gemfire. http://www.gemfire.com/.

[5] Marcos K. Aguilera, Joshua B. Leners, and Michael Walfish. Yesquel: Scalable SQL storage for web applications. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 245–262, Monterey, California, USA, 2015. ACM.

[6] Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis. Sinfonia: A new paradigm for building scalable distributed systems. ACM Transactions on Computer Systems (TOCS), 27(3):5:1–5:48, November 2009.

[7] Deniz Altinbuken and Emin Gun Sirer. Commodifying replicated state machines with OpenReplica. 2012.

[8] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J. M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data Systems Research, CIDR, pages 223–234, Asilomar, CA, USA, 2011.

[9] Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. Corfu: A distributed shared log. ACM Transactions on Computer Systems (TOCS), 31(4):10, 2013.

[10] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. Tango: Distributed data structures over a shared log. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, Nemacolin, PA, USA.

[11] Philip A. Bernstein, Colin W. Reid, and Sudipto Das. Hyder: A transactional record manager for shared flash. CIDR, Asilomar, CA.

[12] Josiah L. Carlson. Redis in Action. Manning Publications Co., 2013.

[13] William B. Cavnar, John M. Trenkle, et al. N-gram-based text categorization. In 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR '94), volume 48113, pages 161–175, Las Vegas, NV, USA, 1994.

[14] Kristina Chodorow. MongoDB: The Definitive Guide. O'Reilly Media, 2013.

[15] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the Very Large Data Base Endowment, 1(2):1277–1288, August 2008.

[16] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC), pages 143–154, Indianapolis, IN, USA, 2010. ACM.

[17] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 251–264, Hollywood, CA, USA, 2012. USENIX Association.

[18] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review (OSR), 41(6):205–220, 2007.

[19] John K. Edwards, Daniel Ellard, Craig Everhart, Robert Fair, Eric Hamilton, Andy Kahn, Arkady Kanevsky, James Lentini, Ashish Prakash, Keith A. Smith, et al. FlexVol: Flexible, efficient file volume virtualization in WAFL. In USENIX 2008 Annual Technical Conference (ATC '08), pages 129–142, Boston, MA, USA, 2008. USENIX Association.
[20] Robert Escriva, Bernard Wong, and Emin Gün Sirer. HyperDex: A distributed, searchable key-value store. ACM SIGCOMM Computer Communication Review, 42(4):25–36, 2012.

[21] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In 38th Annual International Symposium on Computer Architecture (ISCA '11), pages 365–376, San Jose, CA, USA, 2011. IEEE.

[22] Rachid Guerraoui and Michal Kapalka. On the correctness of transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08), pages 175–184, Salt Lake City, UT, 2008. ACM.

[23] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference (ATC '10), Boston, MA, USA.

[24] Markus Klems, David Bermbach, and Rene Weinert. A runtime quality measurement framework for cloud database service systems. In Quality of Information and Communications Technology (QUATIC), 2012 Eighth International Conference on the, pages 38–46. IEEE, 2012.

[25] Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and Alan Fekete. MDCC: Multi-data center consistency. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 113–126, Prague, Czech Republic, 2013. ACM.

[26] Jay Kreps, Neha Narkhede, Jun Rao, et al. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB, pages 1–7, 2011.

[27] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.

[28] Michael Luby and Charles Rackoff. How to construct pseudorandom permutations from pseudorandom functions. SIAM Journal on Computing, 17(2):373–386, 1988.

[29] Dahlia Malkhi and Jean-Philippe Martin. Spanner's concurrency control. ACM SIGACT News, 44(3):73–77, 2013.

[30] Sajee Mathew. Overview of Amazon Web Services. Amazon Whitepapers, 2014.

[31] Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. Extracting more concurrency from distributed transactions. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), pages 479–494, Broomfield, CO, USA, October 2014. USENIX Association.

[32] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada.

[33] Ohad Rodeh, Josef Bacik, and Chris Mason. btrfs: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 9(3):9, 2013.

[34] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.

[35] Pierangelo Di Sanzo, Bruno Ciciani, Francesco Quaglia, and Paolo Romano. A performance model of multi-version concurrency control. In Modeling, Analysis and Simulation of Computers and Telecommunication Systems (MASCOTS 2008), IEEE International Symposium on, pages 1–10. IEEE, 2008.

[36] Frank Schmuck and Jim Wylie. Experience with transactions in QuickSilver. In ACM SIGOPS Operating Systems Review (OSR), volume 25, pages 239–253. ACM, 1991.

[37] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR), 22(4):299–319, 1990.

[38] Alfred Z. Spector, Joshua J. Bloch, Dean S. Daniels, and Richard P. Draves. The Camelot project. 1986.

[39] Amy Tai, Michael Wei, Michael J. Freedman, Ittai Abraham, and Dahlia Malkhi. Replex: A scalable, highly available multi-index data store. In 2016 USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, June 2016. USENIX Association.
[40] Alexander Thomson and Daniel J. Abadi. CalvinFS: Consistent WAN replication and scalable metadata management for distributed file systems. In 13th USENIX Conference on File and Storage Technologies (FAST '15), pages 1–14, Santa Clara, CA, 2015.

[41] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 1–12, Scottsdale, Arizona, USA, 2012. ACM.

[42] Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), page 1, Santa Clara, CA, USA, 2012.