0% found this document useful (0 votes)
106 views19 pages

Beldi

This document summarizes a research paper that introduces Beldi, a library and runtime system that provides fault tolerance and transactional semantics for stateful serverless workflows. Beldi extends the log-based approach of Olive (OSDI 2016) with new data structures, transaction protocols, and function invocations. It also adapts this framework to work across a federated environment where each serverless function has sovereignty over its own data. The paper implements Beldi to evaluate its effectiveness on an AWS Lambda deployment of 1,000 functions for applications including a movie review service, travel reservation system, and social media site.

Uploaded by

Rafael Alexandre
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
106 views19 pages

Beldi

This document summarizes a research paper that introduces Beldi, a library and runtime system that provides fault tolerance and transactional semantics for stateful serverless workflows. Beldi extends the log-based approach of Olive (OSDI 2016) with new data structures, transaction protocols, and function invocations. It also adapts this framework to work across a federated environment where each serverless function has sovereignty over its own data. The paper implements Beldi to evaluate its effectiveness on an AWS Lambda deployment of 1,000 functions for applications including a movie review service, travel reservation system, and social media site.

Uploaded by

Rafael Alexandre
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

Fault-tolerant and transactional stateful

serverless workflows
Haoran Zhang, University of Pennsylvania; Adney Cardoza,
Rutgers University–Camden; Peter Baile Chen, Sebastian Angel,
and Vincent Liu, University of Pennsylvania
https://www.usenix.org/conference/osdi20/presentation/zhang-haoran

This paper is included in the Proceedings of the


14th USENIX Symposium on Operating Systems
Design and Implementation
November 4–6, 2020
978-1-939133-19-9

Open access to the Proceedings of the


14th USENIX Symposium on Operating
Systems Design and Implementation
is sponsored by USENIX
Fault-tolerant and Transactional Stateful Serverless Workflows
Haoran Zhang, Adney Cardoza† , Peter Baile Chen, Sebastian Angel, and Vincent Liu

University of Pennsylvania Rutgers University-Camden

Abstract less, but becomes involved when the functions maintain their
own state (e.g., modify a data structure that persists across
This paper introduces Beldi, a library and runtime system
invocations). Composing such stateful serverless functions
for writing and composing fault-tolerant and transactional
(SSFs) requires reasoning about consistency and isolation
stateful serverless functions. Beldi runs on existing providers
semantics in the presence of concurrent requests and deal-
and lets developers write complex stateful applications that
ing with component failures. While these requirements are
require fault tolerance and transactional semantics without the
common in distributed systems and are addressed by existing
need to deal with tasks such as load balancing or maintaining
proposals [8, 28, 33, 35, 46], SSFs have unique idiosyncrasies
virtual machines. Beldi’s contributions include extending the
that make existing approaches a poor fit.
log-based fault-tolerant approach in Olive (OSDI 2016) with
The first peculiarity is that request routing is stateless. Ap-
new data structures, transaction protocols, function invoca-
proaches based on state machine replication are hard to im-
tions, and garbage collection. They also include adapting the
plement because a follow-up message might be routed by the
resulting framework to work over a federated environment
infrastructure to a different SSF instance from the one that
where each serverless function has sovereignty over its own
processed a prior message (e.g., an “accept” routed differently
data. We implement three applications on Beldi, including a
than its “prepare”). A second characteristic is that SSFs can be
movie review service, a travel reservation system, and a social
independent and have sovereignty over their own data. For ex-
media site. Our evaluation on 1,000 AWS Lambdas shows
ample, different organizations may develop and deploy SSFs,
that Beldi’s approach is effective and affordable.
and an application may stitch them together to achieve some
end-to-end functionality. As a result, there is no component
1 Introduction in the system that has full visibility (or access) to all the state.
Serverless computing is changing the way in which we Lastly, SSF workflows (directed graphs of SSFs) can be com-
structure and deploy computations in Internet-scale systems. plex and include cycles to express recursion and loops over
Enabled by platforms like AWS Lambda [2], Azure Func- SSFs. If a developer wishes to define transactions over such
tions [3], and Google Cloud Functions [18], programmers workflows (or its subgraphs), all transactions (including those
can break their application into small functions that providers that will abort) must observe consistent state to avoid infinite
then automatically distribute over their data centers. When a loops and undefined behavior. This is a common requirement
user issues a request to a serverless-based system, this request in transactional memory systems [20, 23, 32, 37, 38], but is
flows through the corresponding functions to achieve the de- seldom needed in distributed transaction protocols
sired end-to-end semantics. For example, in an e-commerce To bring fault-tolerance and transactions to this challeng-
site, a user’s purchase might trigger a product order, a ship- ing environment, this paper introduces Beldi, a library and
ping event, a credit card charge, and an inventory update, all runtime system for building workflows of SSFs. Beldi runs
of which could be handled by separate serverless functions. on existing cloud providers without any modification to their
During development, structuring an application as a set of infrastructure and without the need for servers. The SSFs
serverless functions brings forth the benefits of microservice used in Beldi can come from either the app developer, other
architectures: it promotes modular design, quick iteration, developers in the same organization, third-party open-source
and code reuse. During deployment, it frees programmers developers, or the cloud providers. Regardless, Beldi helps
from the prosaic but difficult tasks associated with provision- to stitch together these components in a way that insulates
ing, scaling, and maintaining the underlying computational, the developer from the details of concurrency control, fault
storage, and network resources of the system. In particular, tolerance, and SSF composition.
app developers need not worry about setting up virtual ma- A well-known aspect of SSFs is that even though they
chines or containers, starting or winding down instances to can persist state, this state is usually kept in low-latency
accommodate demand, or routing user requests to the right NoSQL databases (possibly different for each SSF) such as
set of functional units—all of this is automated once an app DynamoDB, Bigtable, and Cosmos DB that are already fault
developer describes the connectivity of the units. tolerant. Viewed in this light, SSFs are clients of scalable
A key challenge in increasing the general applicability of fault-tolerant storage services rather than stateful services
serverless computing lies in correctly and efficiently compos- themselves. Beldi’s goal is therefore to guarantee exactly-
ing different functions to obtain nontrivial end-to-end applica- once semantics to workflows in the presence of clients (SSFs)
tions. This is fairly straightforward when functions are state- that fail at any point in their execution and to offer synchro-

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1187
nization primitives (in the form of locks and transactions) to 2.1 Serverless functions
prevent concurrent clients from unsafely handling state.
To realize this vision, Beldi extends Olive [36] and adapts Serverless computing aims to eliminate the need to manage
its mechanisms to the SSF setting. Olive is a recent frame- machines, runtimes, and resources (i.e., everything except
work that exposes an elegant abstraction based on logging and the app logic). It provides an abstraction where developers
request re-execution to clients of cloud storage systems; oper- upload a simple function (or ‘lambda’) to the cloud provider
ations that use Olive’s abstraction enjoy exactly-once seman- that is invoked on demand; an identifier is provided with
tics. Beldi’s extensions include support for operations beyond which clients and other services can invoke the function.
storage accesses such as synchronous and asynchronous invo- The cloud provider is then responsible for provisioning the
cations (so that SSFs can invoke each other), a new data struc- VMs or containers, deploying the user code, and scaling the
ture for unifying storage of application state and logs, and pro- allocated resources up and down based on current demand—
tocols that operate efficiently on this data structure (§4). The all of this is transparent to users. In practice, this means that
purpose of Beldi’s extensions is to smooth out the differences on every function invocation the provider will spawn a new
between Olive’s original use case and ours. As one example, worker (VM or container) with the necessary runtime and
Olive’s most critical optimization assumes that clients can dispatch the request to this worker (‘cold start’). The provider
store a large number of log entries in a database’s atomicity may also use an existing worker, if one is free (‘warm start’).
scope (the scope at which the database can atomically update Note that while workers can stay warm for a while, running
objects). However, this assumption does not hold for many functions are limited by a timeout, after which they are killed.
databases commonly used by SSFs. In DynamoDB, for ex- This time limit is configurable (up to 15 min in 1 s increments
ample, the atomicity scope is a single row that can store at on AWS, up to 9 min in 1 ms increments on Google Cloud,
most 400 KB of data [14]—the row would be full in less than and unbounded time in 1 s increments on Azure) and helps in
a minute in our applications. budgeting and limiting the effect of bugs.
Beldi also adapts existing concurrency control and dis- Serverless functions are often used individually, but they
tributed commit protocols to support transactions over SSF can also be composed into workflows: directed graphs of func-
workflows. A salient aspect of our setting is that there is no en- tions that may contain cycles to express recursion or loops
tity that can serve as a coordinator: a user issues its request to over one or more functions. Some ways to create workflows
the first SSF in the workflow, and SSFs interact only with the include AWS’s step functions [41] and driver functions. A
SSFs in their outgoing edges in the workflow. Consequently, step function specifies how to stitch together different func-
we design a protocol where SSFs work together (while re- tions (represented by their identifiers) and their inputs and
specting the workflow’s underlying communication pattern) outputs; the step function takes care of all scheduling and
to fulfill the duties of the coordinator and collectively decide data movement, and users get an identifier to invoke it. In
whether to commit or abort a transaction (§6). contrast, a driver function is a single function specified by the
To showcase the costs and the benefits of Beldi, we imple- developer that invokes other functions (similar to the main
ment three applications as representative case studies: (1) a function of a traditional program). Control flow can form a
travel reservation system, (2) a social media site, and (3) a graph because functions (including the driver function) can
movie review service. These applications are based on Death- be multi-threaded or perform asynchronous invocations.
StarBench [12, 16], which is an open-source benchmark for Stateful serverless functions (SSFs). Serverless functions
microservices; we have ported and extended these applica- were originally designed to be stateless. As such, state is not
tions to work without servers using SSFs. Our evaluation on guaranteed to persist between function invocations—even
AWS reveals that, at saturation, Beldi’s guarantees come at an when writing to a worker’s local disk, the function’s context
increase in the median request completion time of 2.4–3.3×, can be terminated as part of dynamic resource management,
and 99th percentile completion time of 1.2–1.8× (§7.4). At or load balancing might direct follow-up requests to different
low load, the median completion time increase is under 2×. or new instances. Accordingly, a common workaround to per-
In summary, Beldi helps developers build fault-tolerant and sist data is to store it in fault-tolerant low-latency NoSQL
transactional applications on top of SSFs at a modest cost. In databases. For example, AWS Lambdas can persist their
doing so, Beldi simplifies reasoning about compositions of state in DynamoDB, Google cloud functions can use Cloud
SSFs, runs on existing serverless platforms without modifica- Bigtable, and Azure functions can use Cosmos DB. Through
tions, and extends an elegant fault-tolerant abstraction. these intermediaries, stateful serverless functions (SSFs) can
save state and expose it to other instances.
2 Background and Goals Unfortunately, the above approach to state interacts poorly
In this section, we describe the basics of serverless computing with the way that serverless platforms handle failures. If
(sometimes known as Function-as-a-Service), the challenge a function in a workflow crashes or its worker hangs, the
of deploying complex serverless applications that incorporate provider will either (1) do nothing, leaving the workflow in-
state, and a list of requirements that Beldi aims to satisfy. complete, or (2) restart the function on a different worker,

1188 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
potentially incrementing a counter twice, popping a queue Workflow
Intent Collector
multiple times, or corrupting database state and violating ap- SSF Function
Instance
plication semantics. Indeed, serverless providers currently SSF SSF
Container Invocation API
recommend that developers write SSFs that are idempotent SSF
to ensure that re-execution is safe [17]. While helpful, these Database API
Data + Intent
recommendations place the burden entirely on developers. Write Log Table
In contrast, Beldi simplifies this process so developers need Transaction API
only worry about their application logic and not the low-level Read Log Call Log Library
Request
details of how serverless providers respond to failures. Garbage Collector
Database

Client Single SSF Beldi Runtime


2.2 Requirements and assumptions
F IGURE 1—Beldi’s architecture. Developers write SSFs as they do
We strive to design a framework that helps developers build today, but use the Beldi API for transactions and externally visible
serverless applications that tolerate failures and handle con- operations. At runtime, operations for each SSF are logged to a
current operations correctly. Our concrete goals are: database, which, when combined with a per-SSF intent and garbage
collector, guarantees exactly-once semantics.
Exactly-once semantics: Beldi should guarantee exactly-once
execution semantics in the presence of SSF or worker crash Deployable today: Beldi should work on existing serverless
failures. That is, even if an SSF crashes in the midst of its exe- platforms without any modifications on their end. This allows
cution and is restarted by the provider an arbitrary number of developers to use Beldi on any provider of their choosing (or
times, the resulting state must be equivalent to that produced even a multi-provider setup), and lowers the barrier to switch
by an execution in which the SSF ran exactly once, from start providers. Additionally, developers should not need to run
to finish, without crashing. any servers in order to use Beldi. After all, a big appeal of
serverless is that it frees developers from such burdens.
SSF data sovereignty: Beldi should support SSFs that are de-
veloped and managed independently. For example, multiple Assumptions. To achieve these goals, Beldi makes some as-
instances of an SSF may all access the same database, but sumptions about the storage provided to SSFs: that it supports
they might not have access to the databases of other SSFs, strong consistency, tolerates faults, supports atomic updates
even those in the same workflow. Instead, state should only on some atomicity scope (e.g., row, partition), and has a
be exposed by choice through an SSF’s outputs. This type of scan operation with the ability to filter results and create pro-
encapsulation is important to support a paradigm in which dif- jections. These assumptions hold for the NoSQL databases
ferent developers, organizations, and teams within the same commonly used by SSFs: Amazon’s DynamoDB, Azure’s
organization are responsible for designing and maintaining Cosmos DB, and Google’s Bigtable.
their own SSFs. An application developer can then contract
with SSF developers (or teams) to integrate their SSFs into 3 Design Overview
the application’s workflow via the SSF’s identifier (§2). Fur-
thermore, data sovereignty is key to enabling developers to Beldi consists of a library that developers use to write their
offer proprietary functions-as-a-service to others, and is a best SSFs and a serverless-based runtime system to support them.
practice in microservice architectures [11, §4]. For example, Beldi’s approach to handling SSF failures is based on an idea
Microsoft’s eShopOnContainers [29] serves as a blueprint for most recently explored by Olive [36] and inspired by decades
applying these ideas to real-world applications. of work on log-based fault tolerance [19, 30]. Specifically,
Beldi executes SSF operations while atomically logging these
SSF reusability: Beldi should allow multiple applications to actions and periodically re-executes SSFs that have not yet
use the same SSFs in their workflows at the same time. This finished. The logs prevent duplicating work that has already
may require each SSF to have different tables or databases been done, guaranteeing at-most-once execution semantics,
to maintain the state of each application separately, though while the re-execution ensures at-least-once semantics.
cross-application state should also be supported. Figure 1 depicts Beldi’s high-level architecture. Beldi con-
Workflow transactions: Beldi should support an optional trans- sists of four components: (1) the Beldi library, which exposes
actional API that allows an application to specify any sub- APIs for invocations, database reads/writes, and transactions;
graph of a workflow that should be processed transactionally (2) a set of database tables that store the SSF’s state, as well as
with ACID semantics. We target opacity [20] as the isola- logs of reads, writes, and invocations; (3) an intent collector,
tion level. Opacity ensures that (1) the effects of concurrent which is a serverless function that restarts any instances of
transactions are equivalent to some serial execution, and (2) the corresponding SSF that have stalled or crashed; and (4) a
every transaction, including those that are aborted, always garbage collector, which is a serverless function that keeps
observes a consistent view of the database. We discuss why the logs from growing unboundedly.
these requirements are important in SSFs in Section 6.2. To ensure data sovereignty (§2.2), the runtimes and logs

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1189
of different SSFs are independently managed and stored; Beldi Library Function Description
however, all instances of related SSFs may share the same read(k) → v Read operation
Beldi infrastructure. An app developer composes multiple write(k, v) Write operation
SSFs into a workflow by chaining them together using a driver condWrite(k, v, c) → T/F Write if c is true
function, a step function, or a combination of the two. In the syncInvoke(s, params) → v Calls s and waits for answer
following sections we expand on each of these components. asyncInvoke(s, params) Calls s without waiting
lock() Acquire a lock
3.1 Initial inspiration: Olive unlock() Release a lock
begin_tx() Begin a transaction
Olive [36] guarantees exactly-once execution semantics for
end_tx() End a transaction
clients that may fail while interacting with a fault-tolerant
storage server. This is a similar objective as ours, though our F IGURE 2—Beldi’s API for SSFs, which includes all of Beldi’s
setting makes applying Olive’s ideas nontrivial. An intent in primitives and its transactional support (§6).
Olive is an arbitrary code snippet that the client intends to
execute with exactly-once semantics. Each intent is assigned Intent collector. To guarantee at-least-once semantics, Olive
a unique identifier (intent id), which Olive uses as the primary must ensure that some entity finishes the intent if the client
key to save its progress. A client in Olive enjoys at-most-once crashes. This is the job of the intent collector (IC), which
semantics by checking the intent’s progress and skipping com- is a background process that periodically scans the intent
pleted operations during re-execution. Intents consist of local table and completes unfinished intents by running their code.
and external operations. For example, incrementing a local Before the IC executes an external operation, it consults the
variable is a local operation, whereas reading a value from operation log table with the operation’s step number to see
storage is an external operation. Each external operation in if the operation has already been done and to retrieve any
the intent is assigned a monotonically increasing step number, return value; regular clients also perform this check between
starting at 0, that uniquely identifies it. actions (1) and (2). If the operation has not been done, the
There are two key requirements for intents. First, intents IC atomically executes the operation and logs the result to
must be deterministic; developers can make non-deterministic the operation log table. Even if multiple IC instances execute
operations (e.g., a call to a random number generator) deter- concurrently, or if the IC starts executing the intent of a client
ministic by logging their results and replaying the same values that has not crashed, this is safe because of Olive’s assurance
in the event of a re-execution. Second, intents must be guaran- of at-most-once semantics.
teed to always complete in the absence of failures (e.g., they Beldi vs. Olive. Beldi is inspired by Olive’s high-level ap-
must be free from bugs such as deadlock or infinite loops). proach but makes key changes and introduces new data struc-
After an intent has been successfully logged, the client in tures and tables, support for invocations so that SSFs can
Olive executes the intent’s code normally until it reaches an call each other (Olive only supports storage operations), and
external operation (e.g., reading or writing to the database). garbage collection mechanisms to keep overheads low.
Then, the client: (1) determines the operation’s step number; An important difference between the two is the definition
(2) performs the operation (e.g., writes to the database); (3) of an ‘intent.’ In Olive, intents are code snippets—logged by
logs the intent id, step number, and the operation’s return the client—and all intents are logged in the same intent table.
value (if any) into a separate database table called the opera- In Beldi, the client (which is the SSF) is the code snippet. As a
tion log. When the client completes all operations, it marks result, an intent in Beldi is not code but rather the parameters
the intent as ‘done’ in the intent table. that identify a particular running instance of the SSF: its
To ensure at-most-once execution semantics, the client in inputs, start time, whether it was launched asynchronously,
Olive must perform actions (2) and (3) above atomically. This etc. Accordingly, Beldi uses the term ‘instance id’ instead of
ensures that if Olive re-executes an intent, there will be a ‘intent id’ to capture this distinction.
record in the operation log showing that a particular step Another critical difference is that, as shown in Figure 1,
has already been completed and should not be re-executed. each SSF in Beldi is backed by a different database and Beldi
Instead, the entity re-executing the intent should resume ex- runtime to ensure data sovereignty, though different SSFs
ecution from the last completed step, using logged return developed by the same engineering team may reuse these
values from previous steps as needed. To make these two ac- components if desired. We will expand on these details in the
tions atomic, Olive introduces a technique called Distributed following sections, but we begin by introducing Beldi’s API.
Atomic Affinity Logging (DAAL), which collocates log entries
for an item in the same atomicity scope (the scope at which 3.2 Beldi’s API
the database supports atomic operations) with the item’s data. Beldi exposes the API in Figure 2, which includes key-value
For example, in a storage system where operations are atomic operations such as read, write, and condWrite (a write
at the row level, Olive would store the item’s value and its log that succeeds only if the provided condition evaluates to
entries in different columns of the same row. true), and functions to invoke other SSFs (syncInvoke and

1190 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
Log Key Value log. Their schema is also in Figure 3. For each operation, the
intent instance id done, async, args, ret, ts key into the log is the combination of the executing SSF’s
read instance id, step number value instance id and the step number, which (like in Olive) is a
write instance id, step number true / false counter that identifies each unique external operation. Each
invoke instance id, step number instance id of callee, result read operation adds the value read from the database into
the read log. Writes, meanwhile, write to the write log with a
F IGURE 3—Beldi maintains four logs for each SSF. The intent table
boolean flag that states whether the write operation took effect.
keeps track of an instance’s completion status, arguments, return
value, type of invocation, and timestamp assigned by its garbage Regular writes always set this flag to true, while conditional
collector (ts). The read log stores the value read. The write log stores writes set it to the outcome of the condition at the time of the
true for writes, or the condition evaluation for a conditional write. write. The actual data being written is stored in a data table,
The invoke log stores the instance id of the callee and its result. although in Section 4 we discuss a data structure that general-
izes Olive’s DAAL and collocates the write log in the same
asyncInvoke). These operations are meant as drop-in re- table as the data to avoid cross-table transactions. The invoke
placements for the existing interface used by SSFs. Further- log is new to Beldi and ensures at-most-once semantics for
more, Beldi supports the ability to begin and end transac- calls to other SSFs; we describe it in Section 4.5.
tions; operations between these calls enjoy ACID semantics.
Beldi’s API hides from developers all of the complexity Intent and Garbage Collectors. For each SSF, Beldi intro-
of logging, replaying, and concurrency control protocols that duces a pair of serverless functions that are triggered period-
take place under the hood to guarantee exactly-once semantics ically by a timer. The first function acts as the SSF’s intent
and support transactions. For example, an SSF using Beldi’s collector (IC). The IC scans the SSF’s intent table to discover
API automatically determines (from the input, environment, instances of the SSF that have not yet finished (lack the ‘done’
and global variables) the SSF’s instance id, step number, and flag). The IC restarts each unfinished SSF by re-executing it
whether it is part of a transaction. Beldi takes actions before with the original instance id and arguments. Note that it is
and after the main body of the SSF as well as around any safe for the IC to restart an SSF instance even if the original
Beldi API operations. instance is still running and has not crashed, owing to Beldi’s
use of logs to guarantee at-most-once semantics for each step
3.3 Beldi’s runtime infrastructure of the SSF. We implement two natural optimizations for the
Developers write SSF code as they do today, but link Beldi’s IC. First, the IC restarts instances only after some amount
library and use its API. The rest of Beldi’s mechanisms hap- of time has passed since the last time they were launched to
pen behind the scenes. avoid spawning too many duplicate instances in cases where
the IC runs very frequently. Second, the IC speeds up the
Intent table. Beldi associates with every SSF invocation an process of finding unfinished instances among all instance ids
instance id, which uniquely identifies an intent to execute in the intent table by maintaining a secondary index.
a given SSF. For the first SSF in a workflow, the instance The second function acts as a Garbage Collector (GC)
id is the UUID assigned by the serverless platform to the for completed intents, taking care to ensure safety in the
initial request. For example, in AWS this UUID is called the presence of concurrent SSF instances, IC instances, and even
‘request id,’ in GCP it is called the ‘event id,’ and in Azure it GC instances. This component is described in Section 5.
is the ‘invocation id.’ For subsequent SSFs in the workflow,
each caller in the graph will generate a new UUID to be used 4 Executing and Logging Operations in Beldi
by the callee as its instance id. A new id is generated even
As we mention in Section 3.1, guaranteeing exactly-once se-
if the SSF has been invoked earlier in the workflow or if the
mantics requires atomically logging and executing operations.
callee is another instance of the caller SSF (in the case of
This section discusses how Beldi achieves this.
recursive functions). Thus, every SSF instance will have a
distinct instance id, even if the instances are of the same SSF 4.1 Linked DAAL
and in the same workflow. The logging approach taken by Olive (§3.1) requires an atom-
Beldi keeps an intent table that contains the instance id, icity scope with high storage capacity, as otherwise few log
arguments, completion status, and other information listed entries can be added. In the context of Cosmos DB (the suc-
in Figure 3 for every SSF instance that users and other SSFs cessor of the database used by Olive), the atomicity scope is a
intend to execute. It does this by modifying SSFs to ensure database partition, and the atomic operation is a transactional
that the first operation is to check the intent table to see if batch update. Olive’s DAAL is a good fit for Cosmos DB
their instance id is already present and, if not, to log a new because partitions can hold up to 20 GB of data [10], which
entry. Beldi performs a similar modification to set the intent is enough to collocate a data item and a large number of log
as ‘done’ at the end of the SSF execution. entries. However, other databases adopt designs with more
Operation logs. In addition to the intent table, Beldi main- limited atomicity scopes. For example, the atomicity scope
tains three logs for each SSF: a read log, write log, and invoke of DynamoDB and Bigtable is one row, which can hold up to

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1191
HEAD Key Value Lock Owner Recent Writes Log Size Next Row
def read ( table , key ) :
linkedDAAL = rawScan ( table ,
cond : "Key is {key}" ,
Row Id Key Value Lock Owner Recent Writes Log Size Next Row project : [ "RowId" , "NextRow" ] )
tail = getTail ( linkedDAAL )
F IGURE 4—Linked DAAL for a single item. Each row contains val = rawRead ( table , tail )
the item’s key, previous values (except the last row which contains logKey = [ ID , STEP ]
STEP = STEP + 1
the current value), lock information (used for transactions), a log of
ok = rawCondWrite ( ReadLog , logKey ,
recent writes, and information for traversal and garbage collection. cond : "{logKey} does not exist"
update : "Value = {val}" )
400 KB [14] and 256 MB [7], respectively; the recommended if ok :
limits are much lower. If we were to use Olive’s DAAL with return val
DynamoDB, an SSF could only perform hundreds of writes else :
to a given key before filling up the row. At such point, Olive return rawRead ( ReadLog , logKey )
would be unable to make further progress until the logs are F IGURE 5—Pseudocode for Beldi’s read wrapper function. Func-
pruned. This is hard to do in our setting: reaching a state of tions beginning with “raw” refer to native (unwrapped) access to the
quiescence where it is safe to garbage collect logs is challeng- database tables storing the data or the logs. Identifiers starting with
ing since existing platforms expose no mechanism to kill or capital letters indicate a member of the log structures.
pause SSFs (§5).
To support all common databases, Beldi introduces a new
data structure called the linked DAAL that allows logs to exist the database that returns every row containing a target Key.
on multiple rows (or atomicity scopes), with new rows being On its own, the scan operation returns all contents of each row
added as needed. There are three reasons why this simple data (including the values, write logs, etc.). To reduce this over-
structure is interesting for our purposes: (1) linked DAALs head, Beldi applies a projection that filters out all columns
continue to avoid the overheads of cross-table transactions except for RowId and NextRow. This combination of scan
and work on databases that do not support such transactions; and projection allows Beldi to download only 256 bits per
(2) linked DAALs are a type of non-blocking linked list [21, row of the linked DAAL. From these rows, Beldi constructs
42, 47], allowing multiple SSFs to access them concurrently a skeleton version of the linked DAAL locally, which it can
with the operations supported by the atomicity scope (e.g., quickly traverse to find the RowId of the tail.
atomic partial updates); (3) even with frequent accesses, our We note that the individual reads in a scan are not exe-
garbage collection protocol can ensure that the length of the cuted atomically. For example, Beldi might see a row with no
list for each item is kept consistently small (§5). NextRow, and also receive a row that was subsequently ap-
pended to it. This operation might even retrieve rows that are
Structure. Figure 4 gives an example of a linked DAAL for orphaned from a failed append operation. Regardless, when
an item with two rows of logs. Every row stores the item’s these databases are configured to be linearizable [6, 9, 13],
key, value, owner of the lock (used for transactions in Sec- the set of rows traversed from the head to the first instance of
tion 6), the log of writes, and metadata needed to traverse the an empty NextRow form a consistent snapshot of the linked
linked DAAL and perform garbage collection. The first row DAAL—any write that completes strictly before the scan
is the ‘head,’ which has a special RowId and is never garbage begins will be reflected in the constructed local linked DAAL.
collected. The primary key for rows is RowId + Key, the hash While the linked DAAL is structurally simple, operating
key is Key, and the sort key is RowId. When a row is full and on it requires care. The following sections detail how Beldi’s
the SSF issues a write operation, a new row is appended with API functions read and modify the linked DAAL.
the updated value and a log entry describing the write; the
previous row’s value and logs are not modified once filled. 4.2 Read
Thus, the tail always has the most recent value. We begin by discussing Beldi’s read operation. While
Traversal. Most operations in Beldi require traversal to the read has no externally visible effects on its own, the po-
tail of the list. The simplest way to accomplish this is to tential use of its non-deterministic results in a subsequent
start at the designated head row and iteratively issue read external operation means that Beldi must record the result of
requests for each NextRow until the field is empty. While every read in a dedicated ReadLog. Unlike write operations,
this procedure will eventually reach the tail, the number of however, the read from the database and the log to the Read-
database operations grows with the length of the list. Garbage Log need not happen atomically—if the SSF crashes before
collection can control this length, but Beldi applies an ad- logging the outcome, it is fine to fetch a fresh value as the
ditional optimization that leverages the scan and projection previous result did not have any externally visible effect.
operations available in the three NoSQL databases that we Figure 5 shows the pseudocode of the read API function,
surveyed. Specifically, Beldi issues a single scan operation to which involves two steps: (1) read the most recent value of the
key from the tail of the linked DAAL, and (2) log the result

1192 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
logKey ∈ logs logSize < N ∃ nextRow
def write ( table , key , val ) :
logKey = [ ID , STEP ] A True * *
linkedDAAL = rawScan ( table , B False True False
cond : "Key is {key}" C False False True
project : [ "RowId" , "NextRow" , D False False False
"RecentWrites [{ logKey }]" ] )
if logKey not in linkedDAAL : (a) Cases (b) Transitions
tail = getTail ( linkedDAAL ) F IGURE 7—Possible cases for the state of a candidate tail in the
tryWrite ( table , key , val , tail ) linked DAAL during a write and its potential transitions.
STEP = STEP + 1
def tryWrite ( table , key , val , row ) : never been executed previously, and there is room in the
logKey = [ ID , STEP ] current row to execute/log the write.
ok = rawCondWrite ( table , row [ RowId ] , C. The operation is not in the log, but the log is full and
cond : "({ logKey} not in RecentWrites)
&& (LogSize < N)" ,
there is a pointer to the next row. Beldi should follow
update : "Value = {val}; the provided pointer toward the tail.
LogSize = LogSize + 1; D. The operation is not in the log and the log is full, but
RecentWrites [{ logKey }] = NULL" ) there is no next row. Beldi should append a new row and
if ok : # Case B
return advance to that new row.
row = rawRead ( table , row [ RowId ] ) We formulate a lock-free algorithm to handle all the cases
if logKey in row [ RecentWrites ] : # Case A above by examining the transitions induced by concurrent
return
elif row [ NextRow ] does not exist : # Case D SSF accesses. For example, if Beldi is in case B, where the
row = appendRow ( table , key , row ) operation is not in any log and there is still space to execute
else : # Case C it in the current row, a concurrent SSF can, without warning,
row = rawRead ( table , row [ NextRow ] ) execute the current operation (→A) or fill the remaining space
tryWrite ( table , key , val , row )
in the log (→C/D). The reverse is not true: once there is
F IGURE 6—Pseudocode for Beldi’s write wrapper function. a NextRow pointer, the linked DAAL will never revert to
having extra space for logs. The cases and their transitions are
summarized in Figure 7, where N is the maximum number of
to the ReadLog if it has not yet been completed. For the first log entries that can fit in a row when accounting for the size of
step, Beldi retrieves the tail as described in Section 4.1. For the key, value, and other metadata. The exception is garbage
the second step, Beldi uses an atomic conditional update to collection (not covered in Figure 7), whose operation and
efficiently log the operation without overwriting a previously correctness we describe in Section 5. An arrow in Figure 7b
executed read. If it encounters a conflict during the update, it indicates a possible effect of concurrent SSF instances.
returns the previous result from the ReadLog. To safely identify the state of a row, Beldi checks for each
case starting at the node(s) in the transition graph without
4.3 Write incoming edges. In this case there is only one such node (B),
A write is more complex as the update and logging must so Beldi performs a conditional write with the condition given
be done atomically—within the same atomicity scope—and in case B of Figure 7a (i.e., that the logKey is not in the logs,
Beldi needs to handle cases where other SSFs are access- that the logSize is less than N, and that there is no nextRow).
ing and appending to the linked DAAL concurrently. At If the conditional check fails, the state of the row will not
a high level, the write operation must find the tail of the revert back to case B later because B has no incoming edges.
linked DAAL, check if the write has been previously exe- Therefore, it is safe to remove B from the transition graph and
cuted, log/update the tail if it has not, and extend the linked check the remaining cases. Beldi repeats the above process
DAAL if the current tail is full. Like read, Beldi can use with cases A and D (in any order) because they they have no
scan and projection to assemble a minimal local version of incoming edges in the remaining graph. Finally, if all prior
the linked DAAL. Unlike read, Beldi cannot skip directly to conditions fail, the row is in case C.
the tail; instead, Beldi must check that none of the scanned
rows contains a record of the current operation. Furthermore, 4.4 Conditional write
once Beldi has a candidate for the tail, Beldi needs to update Beldi also provides support for conditional writes, which
its value and add an entry to its log atomically. For a given only execute if a user-defined condition is true at the time
tail candidate there are exactly four possible scenarios: of the write. The initial scan and subsequent scenarios are
A. The operation has already been executed and the [in- similar to the scenarios for unconditional writes. The only
stance ID, step number] tuple can be found in the current exception is the case where the operation has not previously
row. Beldi can return immediately in this case. executed and the current row still has remaining space in the
B. The operation is not in the log and there is still space. log (i.e., case B from Section 4.3). We split this case into two:
This indicates that Beldi is at the tail, the operation has in B1 , the condition is true, and in B2 , the condition is false.

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1193
SSF1 (Original) SSF2 SSF1 (Callback)

def syncInvoke ( callee , input ) :


calleeId = UUID ( ) invoke
logKey = [ ID , STEP ] run logic
STEP = STEP + 1 callback
ok = rawCondWrite ( InvokeLog , logKey , (with result)
cond : "{logKey} not in InvokeLog" log result
waiting for response (in invoke log)
update : "Id = {calleeId };
Result = NULL" ) OK
if not ok : log as done
record = rawRead ( InvokeLog , logKey ) (in intent table)
calleeId = record [ Id ] fail to return
result = record [ Result ]
if result does not exist : F IGURE 9—SSF1 synchronously invokes SSF2, which then fails to
return rawSyncInvoke ( callee , return after logging the operation as done. The callback ensures that
[ calleeId , input ] )
SSF1 has the result of SSF2 before SSF2 marks itself as done.
# When the Callee is done it issues a callback
# to the caller. Below is the callback handler.
but before it returns the result to the caller (SSF1). Suppose
def syncInvokeCallbackHandler ( calleeId , result ) :
rawWrite ( InvokeLog , cond : "Id = {calleeId}" , that there is no callback, i.e., that SSF2 logs itself as complete
update : "Result = {result}" ) immediately after completing execution. Beldi’s federated
setup means that each SSF has a garbage collector running at
F IGURE 8—Pseudocode for synchronous invocation of other SSFs.
its own pace. If SSF2 were to fail after logging itself as done,
Asynchronous invocations are similar, but since they do not have
it is, therefore, possible that SSF2’s GC will garbage collect
return values, the callback is invoked as soon as the callee logs the
intent in its intent table. We give the code for the callee’s actions in the intent before SSF1 gets any value. Later, when SSF1’s
Appendix A of our tech report [45]. IC re-executes the unfinished SSF1 instance, the caller will
see the lack of result in the invoke log, re-invoke SSF2 (with
the existing callee id), and SSF2 will mistakenly perform
Beldi handles these cases by first checking B1 and B2 with the operation again. In some ways, this is similar to why
conditional writes before covering the other states exactly as write operations in Beldi must be atomically logged and
in the unconditional-write case. We give a detailed description executed (§3.1). Unfortunately, there are no mechanisms for
in Appendix A of our extended technical report [45]. atomically logging into a database and executing other SSFs.
We address this issue by decomposing an invocation into
4.5 Invocation of SSFs and local functions two steps: (1) the invocation itself, performed by the caller;
and (2) the recording of results, done via a second, auto-
Finally, Beldi supports three types of function invocations: matic invocation by the callee to some instance of the caller.
synchronous calls (syncInvoke), which block and return We emphasize ‘some’ and ‘original’ because request routing
a value; asynchronous calls (asyncInvoke), which return in serverless is stateless: if SSF1 invokes SSF2, and SSF2
immediately; and calls to functions that do not use Beldi’s then invokes SSF1, the two SSF1 instances could be differ-
API (e.g., legacy libraries or legacy SSFs). In the first two ent (§2.1). We call this automatic invocation a callback. When
cases, Beldi guarantees exactly-once semantics. In the last, it the second instance of the caller receives the callback, it logs
only guarantees that the operation is performed at least once. the provided result in its invoke log and returns. At this point,
Figure 8 shows pseudocode for synchronous SSF invoca- it is safe for the callee to mark its intent as done since it knows
tions. As mentioned in Section 3.3, to help SSFs that are the caller’s invoke log already contains the result. Note that
being invoked (“callees”) differentiate between re-executions callbacks only require at-least-once semantics, so there is no
and new executions, Beldi passes an instance id to the callee need for additional logging of the callback invocation.
(the “callee id”) along with the parameters of the call. In Figure 9 illustrates the idea of Beldi’s callback mechanism.
the first invocation, the callee id is generated using UUID(); The callback ensures that the result of SSF2 is properly re-
for re-executions, it retrieves the id from the invoke log. If ceived by SSF1. As such, we note that SSF2’s response to
there is already an entry in the invoke log for this caller id SSF1 is merely an optimization and not necessary for cor-
and step number, there are two cases: (1) a result is already rectness. We also note that if SSF2 fails after a successful
present, in which case the caller reuses that result; or (2) the callback but before logging the completion of the intent, it
entry is present but there is no result, in which case the caller may result in a case where SSF1 completes, gets garbage
re-invokes the callee with the existing callee id. collected, and then a re-execution of SSF2 invokes a spuri-
Callbacks. Note that syncInvoke (Figure 8) does not log ous callback. SSF1 can detect and ignore this case when a
the result of the actual call or otherwise mark the call as callback occurs for an invoke that does not exist.
complete. To see why this is important, consider the example Asynchronous invocations. This procedure is similar to that
trace in Figure 9, which shows the result of a failure of the of synchronous invocations, but with the two steps flipped
callee (SSF2) after it marks itself as done in the intent table

1194 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
on the callee. The caller first makes a rawSyncInvoke call def garbageCollection ( ) :
to the callee, but rather than execute the function, the callee time = now ( )
(observing an ‘async’ flag) simply registers the intent in its recyclable = [ ]
for id , intent in IntentTable :
intent table, issues a callback, and then immediately returns
if intent [ Done ] :
to the caller. In the second step, Beldi performs the actual if FinishTime not in intent :
asynchronous invocation of SSF2’s logic. We describe this intent [ FinishTime ] = time
operation in detail in Appendix A of our tech report [45]. elif time - intent [ FinishTime ] > T :
recyclable . append ( id )
for id in recyclable :
5 Garbage Collection remove from ReadLog
where "LogKey[Id] == {id}"
If left alone, the linked DAAL will grow indefinitely. While remove from InvokeLog
where "LogKey[Id] == {id}"
Beldi’s use of scans means that the linked DAAL’s length is for table , key in getAllDataKeys ( ) :
generally not the performance bottleneck, unbounded growth rows = rawScan ( table ,
of the linked DAAL and logs (intent table, read log, invoke cond : "Key == {key}" )
log) can lead to significant overheads and storage costs. Beldi for row in rows :
for log in row [ RecentWrites ] :
ensures that logs are pruned and the linked DAAL remains mark if log [ Id ] in recyclable
shallow with a garbage collector (GC) that deletes old rows if fullyMarked ( row [ RecentWrites ] )
and log entries without blocking SSFs that are concurrently and row [ NextRow ] exists :
accessing the list. The GC is an SSF triggered by a timer. prev ( row ) [ NextRow ] = row [ NextRow ]
if DangleTime not in row :
At a high level, the protocol has six parts. First, the GC row [ DangleTime ] = time
finds intents that have finished since the last time a GC in- rows = rawScan ( table , cond : "Key == {key}
stance ran and assigns them the current time as a finish time- && {time} - DangleTime > T" )
for row in rows :
stamp. Second, the GC looks up all intents whose finish time-
if row not reachable from head ( key )
stamp is ‘old enough’ (we expand on this next), and marks delete row
them as ‘recyclable.’ Third, the GC removes log entries (in for id in recyclable :
the read and invoke logs) that belong to recyclable intents. remove from IntentTable
where "LogKey[Id] == {id}"
Fourth, the GC disconnects, for every item, the non-tail rows
of their linked DAAL that have empty logs, marks these rows F IGURE 10—Pseudocode for Beldi’s lock-free, thread-safe garbage
as ‘dangling’, and assigns them the current time as a dangling collection algorithm. T is the maximum lifetime of an SSF instance.
timestamp. Fifth, the GC removes all rows whose dangling
timestamp is ‘old enough.’ Finally, the GC removes the log en-
tries from the intent table. The algorithm is given in Figure 10, instances that start after the change are fine.
with more details in Appendix A of our tech report [45]. Note Safety of concurrent access. With the above assumption,
that GCs only need at-least-once semantics to avoid mem- Beldi’s GC preserves exactly-once semantics without need-
ory leaks in the presence of crashes; they do not use Beldi’s ing to interrupt SSF instances. First, observe that an intent
exactly-once API. Instead, GCs defer the removal of entries is marked as recyclable only after Beldi is sure that no live
in the intent table until the end. SSF instance requires the intent. Accordingly, the read log,
Assumption. The safety of garbage collection relies on a syn- invoke log, and intent table entries for the intent will never
chrony assumption. In particular, it assumes that an individual be accessed again. For the linked DAAL, the GC only dis-
SSF instance terminates, one way or another, in at most T connects a row when all of the contained logs are marked as
time. This allows the GC to delete the logs of completed recyclable and it is not the tail. New traversals of the linked
intents after waiting T time for all running instances of the DAAL for read or write operations will not observe the dis-
completed intents to finish. Note that no new instances will connected row (technically the rawScan operation will return
be started by an SSF’s IC after the intent is marked as ‘done.’ these disconnected rows, but they will be ignored during the
Our assumption is based on the observation that serverless traversal of the local linked DAAL). Running SSF and GC
providers enforce user-defined execution timeouts on SSF instances, however, may be in the process of traversing the
instances (§2.1), but otherwise provide no interface for de- disconnected row—if Beldi deleted it immediately, the SSF
velopers to kill or stop running functions. We can derive a or GC might become stranded. To prevent this, Beldi keeps
conservative bound for T from these user-defined timeouts. the disconnected row for an additional T time to ensure that
Note that even if providers refuse to kill SSFs after the time- instances with such references terminate successfully.
out, we can work around this issue (at high cost) by having Safety of concurrent modifications. The linked DAAL also
the GC change the database’s permissions or rename tables supports garbage collection in the presence of concurrent ap-
so that ongoing SSF instances (including stragglers that stick pends from SSFs and deletions from other GC instances ow-
around after the intent is done) fail to corrupt the database;

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1195
ing to it being a type of non-blocking linked list. In fact, it is def lock ( table , key ) :
simpler than traditional non-blocking linked lists [21, 42, 47] ok = condWrite ( table , key ,
because new rows are always appended to the tail, and GCs cond : "LockOwner = NULL
|| LockOwner.id = TXNID" ,
never touch the tail. The only interesting case is the concur-
update : "LockOwner = [TXNID , START_TIME]" )
rent disconnection of neighboring rows such as X and Y in if not ok :
A → X → Y → B. In this case, the disconnection of X suc- row = read ( table , key )
ceeds, but the disconnection of Y will not be visible because ownerId , ownerTime = row [ LockOwner ]
if ownerTime <= TXNID :
the updated NextRow pointer in X is no longer part of the abort
linked DAAL. The next GC run disconnects Y permanently. else :
lock ( table , key )
6 Supporting Locks and Transactions
F IGURE 11—Pseudocode for the lock operation with wait-die dead-
In addition to exactly-once semantics, Beldi also provides lock prevention used during the ‘Execute’ mode of a transaction.
support for locks and transactions with user-generated aborts.
6.1 Locks agating abort/commit signals throughout a workflow. Note
Beldi’s approach to mutual exclusion borrows an abstraction that Beldi does not currently support asyncInvoke in trans-
in Olive called “locks with intent”, where locks over data actions; however, it does support spawning threads that issue
items are owned by an intent rather than a specific client. This syncInvoke operations and are then joined.
means that, if an SSF instance calls lock(item) and then Transaction contexts. In Beldi, transactions are defined with
crashes, the lock is not lost and held indefinitely; rather, the the begin_tx and end_tx API calls. Beldi assumes that
IC will soon restart the instance. The re-executed instance, both the begin and the end statements are placed in the same
upon arriving at the lock(item) call, will see that it already SSF, but SSFs can invoke other SSFs inside a transaction, so
acquired the lock and be able to continue with the remaining transactions can span across multiple SSFs. When an SSF
operations as if the original SSF instance had never crashed. calls begin_tx it creates a new top-level transaction context
In Beldi, the ownership of a lock on a given item is kept which consists of a unique transaction id and a mode (‘Exe-
alongside the data and logs in the “lock owner” column of the cute’, ‘Commit’, or ‘Abort’). Contexts start in ‘Execute’ mode.
item’s linked DAAL. Lock acquisition and release are logged The SSF instance will also, upon creating a new context, ex-
to the DAAL as writes to the item using Beldi’s condWrite ecute the transaction’s operations in a new thread/goroutine
semantics, where the condition is that the lock is either owned to catch any runtime exceptions. The matching end_tx waits
by the current SSF or has an empty lock-owner column in for the result and runs either a commit or abort protocol de-
the DAAL. The exactly-once semantics are needed for cases pending on the outcome of the contained operations.
where an SSF is re-run after successfully releasing a lock. Transaction contexts are passed along with any SSF invo-
Note that Beldi only guarantees exactly-once semantics—it cations that occur inside the transaction. Thus, whenever a
does not absolve the developer from writing bug-free code. Beldi-enabled SSF starts, it first determines whether it is a
Thus, problems like infinite loops within critical sections and part of an ongoing top-level transaction by checking whether
deadlock need to be handled with higher-level mechanisms a context was provided as part of the input. This is necessary
(like the one below) if the user wishes to guarantee liveness. even if the SSF never creates a transaction itself. If the SSF
does create a transaction, the begin_tx/end_tx statements
6.2 Transactions
will be ignored and all operations will be inherited by the
Beldi uses an extension of the locking mechanism of the top-level transaction context. Beldi does not currently support
preceding section to implement transactions within and across nested transaction semantics [31] (e.g., a sub-transaction can
SSF boundaries. Beldi transactions are based on a variant of abort without causing the top-level transaction to abort).
2PL with wait-die deadlock prevention and two-phase commit.
Opacity. Beldi chooses opacity as the isolation level for trans-
Note that the choice of wait-die (rather than something like
actions. Opacity [20] captures strict serializability [5, 34] with
wound-wait) is deliberate as SSF instances generally cannot
the additional requirement that even transactions that abort do
kill other instances. To implement this, we need to track
not observe inconsistent state. The rationale is that observing
the intent-creation time of each SSF. We do so by adding
inconsistent state can lead to undefined behavior and infinite
to the lock-owner column an intent-creation timestamp and
loops. For example, if an SSF instance reads inconsistent
checking upon lock-acquisition failure whether the existing
state that results in division by zero, it may crash. Beldi’s
lock owner is older or younger than the current SSF instance;
IC will restart the SSF instance and deterministically replay
if older, abort, otherwise, try again (see Figure 11).
the (inconsistent) values to ensure exactly-once semantics,
There are three main parts to Beldi’s transaction-handling
re-triggering the crash. Figure 12 gives another example of
protocol: (1) creating and forwarding a transaction context,
how OCC [26], which provides serializability but not opacity,
(2) executing Beldi calls inside a transaction, and (3) prop-

1196 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
begin_tx ( ) When an SSF is invoked with a transaction context that
x = read ( "x" ) ; y = read ( "y" ) includes a Commit mode, Beldi skips the SSF’s logic, and
while ( x ! = y ) : instead performs only the aforementioned commit protocol:
/ / some logic
flushes the final value of the items, releases any held locks,
x++
write ( "x" , x + 2 ) ; write ( "y" , y + 4 ) and notifies its own callees by invoking them with the pro-
end_tx ( ) vided transaction context. An Abort mode similarly skips the
SSF’s logic, releases all locks, and notifies its callees. This
F IGURE 12—OCC leads to an infinite loop when two instances of
recursive invocation of callees with a Commit or Abort mode
the above transaction, T1 and T2 , execute concurrently. Suppose x =
0, y = 1 initially. T1 reads x = 0, y = 1, executes the logic, acquires
mimics the role of a coordinator in two-phase commit.
locks on x and y, validates the read set, and writes x = 3, y = 4. T2 Supporting step functions. The previous discussion as-
reads x = 3, y = 1 (corresponding to a state after which T1 updated x sumes a begin_tx and end_tx in the same SSF. To sup-
but before it updated y), and is stuck in an infinite loop. Even though port transactions across SSFs defined in step functions, the
T2 is destined to abort, it will never reach the read set validation step. developers must introduce ‘begin’ and ‘end’ SSFs in their
workflow (we give an example in Appendix A of our tech
leads to infinite loops. These issues are not present with iso- report [45]). These SSFs create the transaction context and
lation levels that guarantee that all transactions read from a kickstart the commit or abort protocol. SSFs that fall between
consistent snapshot. the ‘begin’ and ‘end’ SSFs in the workflow execute trans-
actionally. If an SSF aborts it sends ‘abort’ on its outgoing
Operation semantics inside a transaction. If an SSF is in edges in the workflow; an SSF that receives an abort as in-
a transactional context, Beldi modifies the semantics of its put skips its operations and propagates the abort message on
API based on the mode to ensure ACID semantics. We have its outgoing edges. This continues until the abort message
already discussed two operation modifications that occur in reaches the ‘end’ SSF, which then sets the transaction context
‘Execute’ mode—one to locks in Figure 11 and another to mode to Abort and invokes the ‘begin’ SSF. If ‘end’ executes
begin_tx/end_tx, which are ignored. ‘Execute’ mode also without receiving any abort message, it sets the context mode
causes Beldi to call lock before every read, write, and to Commit instead. This invocation initiates the second phase
condWrite operation, using the transaction id as the lock of 2PC over the transactional subgraph of the workflow.
holder. In addition to acquiring locks, Beldi also changes
where reads and writes look up and record values. While lock Non-transactional SSFs inside transactions. While an SSF
acquisition still goes to the original tables, Beldi redirects that does not use transactions can be invoked inside a transac-
written values to a shadow table that acts as a local copy of tion by another SSF (which automatically forces the non-
state for the transaction. Like the original table, this shadow transactional SSF to acquire locks before any accesses),
table is also stored as a linked DAAL and is garbage collected app developers must ensure that the non-transactional SSF
along with the normal DAAL (except the GC also deletes the is only used inside transactional contexts. Otherwise, non-
head and tail). Unlike the original, the shadow table is parti- transactional instances may access the database without ac-
tioned by transaction id, with Key relegated to a secondary quiring locks or obeying the wait-die protocol.
index. All read operations check the shadow table first before
consulting the real table to ensure that transactions read their
7 Evaluation
own writes. If, before an operation, an SSF fails to acquire Beldi brings forth an array of programmability and fault-
a lock and must kill itself (due to wait-die), it returns to its tolerance benefits, but with these benefits come costs. In this
caller with an ‘abort’ outcome. section we are interested in answering three questions:
Propagation of commit or aborts. Eventually, a begin_- 1. What is the cost of maintaining and accessing the linked
tx/end_tx code block will reach the end_tx with an abort/- DAAL, and how does it compare to applicable baselines?
commit decision. For commit, Beldi changes the mode of the 2. What are the latency and throughput of representative
context to ‘Commit’, flushes the final values of the items in applications running on Beldi, and how does Beldi com-
the shadow table to the real linked DAAL, and releases any pare to existing serverless platforms that provide neither
held locks. Beldi then calls the SSF’s callees and passes them exactly-once semantics nor transactional support?
the transaction context in Commit mode. Note that if an SSF 3. What effect does Beldi’s GC have on linked DAAL traver-
instance fails between flushing the shadow table and notify- sal, and how does it change as we adjust the timeout (T)?
ing the callees of the commit decision, Beldi’s exactly-once We answer the above questions in the context of the fol-
semantics ensure that once the SSF instance is re-executed, lowing implementation, applications, and experimental setup.
it will pick up from where it left off. For abort, none of the
7.1 Implementation
values have been written to the actual table, so Beldi just
releases all locks and invokes all callees in ‘Abort’ mode. We have implemented a prototype of Beldi for Go applications
that runs transparently on AWS Lambda and DynamoDB. In

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1197
total, Beldi’s implementation consists of 1,823 lines of Go 60
for the API library and the intent and garbage collectors. Baseline
50 Beldi
Case studies. To evaluate Beldi’s ability to support interest- Beldi (cross-table txn)

Latency (ms)
40
ing applications at low cost, we implement three case stud-
30
ies: a social media site, a travel reservation system, and a
media streaming and review service. We adapt and extend 20
these applications from DeathStarBench [12, 16], which is a 10
recent open-source benchmark suite for microservices, and
0
port them to a serverless environment (using Go and AWS Read Write CondWrite Invoke
Lambda). This port took around 200 person-hours. Combined,
our implementations total 4,730 lines of Go. We provide de- F IGURE 13—Median latency of Beldi’s operations. Error bar repre-
tails of the corresponding workflows in Appendix B of our sents the 99th percentile, and “cross-table tx” is an implementation of
tech report [45], and give a brief description below. Beldi that uses cross-table transactions instead of the linked DAAL.
Movie review service (Cf. IMDB or Rotten Tomatoes): Users 7.3 What are the costs of Beldi’s primitives?
can create accounts, read reviews, view the plot and cast of
movies, and write their own movie reviews and articles. Our We start our evaluation with a microbenchmark that mea-
implementation of this app consists of a workflow of 13 SSFs. sures the cost of each of Beldi’s primitive operations: read,
Travel reservation (Cf. Expedia): Users can create an account, write, condWrite, and invoke. The keys are one byte and
search for hotels and flights, sort them by price/distance/rate, the values are 16 bytes. We measure the median and 99th per-
find recommendations, and reserve hotel rooms and flights. centile completion time of the four operations over a period
The workflow consists of 10 SSFs, and includes a cross-SSF of 10 minutes at very low load (1 req/s). As baselines, we
transaction to ensure that when a user reserves a hotel and a also measure the completion time (1) without Beldi’s exactly-
flight, the reservation goes through only if both SSFs succeed. once guarantees and (2) using a design that logs writes to a
Note that we extend this app to support flight reservations, as separate table using cross-table transactions. Since Beldi’s
the original implementation [12] only supports hotels. database operations depend on the length of the linked DAAL,
we populate the chosen key’s linked DAAL with a conserva-
Social media site (Cf. Twitter): Users can log in/out, see
tive value of 20 rows, which corresponds to the length of the
their timeline, search for other users, and follow/unfollow
linked DAAL after 30 minutes without garbage collection as
others. Users can also create posts that tag other users, attach
described in the experiment of Section 7.5.
media, and link URLs. The workflow consists of 13 SSFs that
Figure 13 shows the overhead of Beldi’s reads/writes com-
perform tasks like constructing the user’s timeline, shortening
pared to those of the baseline stem from two sources: scan-
URLs, handling user mentions, and composing posts.
ning the linked DAAL (instead of reading a single row) and
logging. For invoke, the overheads come from our callback
7.2 Experimental setup mechanism and logging to the invoke log. Consequently, all
We run all of our experiments on AWS Lambda. We configure of Beldi’s operations are around 2–4× more expensive than
lambdas to use 1 GB of memory and set DynamoDB to use the baseline. In contrast, the approach using cross-table trans-
autoscaling in on-demand mode. All of the read and scan actions does not use a DAAL so reads avoid the scan (but
operations for Beldi and the baseline use DynamoDB’s strong not the logging), and writes perform an atomic transaction
read consistency. We turn off automatic Lambda restarts and where the value is written to one table and the log entry is
let Beldi’s intent collectors take care of restarting failed Lamb- added to another. The cost of this operation is 2–2.5× higher
das. Our garbage and intent collectors are triggered by a timer than Beldi’s linked DAAL. Appendix C in our tech report [45]
every 1 minute, which is the finest resolution supported by describes the same experiment with a more optimistic setting
AWS. Note that AWS currently has a limit of 1,000 concur- (5 rows in the linked DAAL); the results are similar.
rent Lambdas per account. As we will see in some of our Note that not all existing databases (e.g., Bigtable) support
experiments, this limit is often the bottleneck in both the cross-table transactions. Even for those that do, the perfor-
baseline and Beldi. Finally, consistent with our deployability mance gain that cross-table transactions have on read opera-
requirement (§2.2), Beldi uses no servers. tions over using a linked DAAL goes away whenever SSFs
The baseline for our experiments is running our ported use transactions because read locks use condWrite which is
applications on AWS Lambda without Beldi’s library and run- a cheaper operation on the linked DAAL.
time. Consequently, these applications will not enjoy exactly- Other costs. Another consideration beyond performance is
once semantics or support transactions: when running on the the additional storage and network I/O required by Beldi to
baseline, the travel reservation app outputs inconsistent re- maintain and access all logs and linked DAAL metadata. For
sults, and all apps can corrupt state in the presence of crashes. our setup above, the 20-row DAAL for the item takes up

1198 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
3500 2500
3000 Baseline 50p Baseline 50p
Baseline 99p 2000 Baseline 99p
Latency (ms)

Latency (ms)
2500 Beldi 50p Beldi 50p
2000 Beldi 99p 1500 Beldi 99p
1500 1000
1000
500 500
0 0
0 100 200 300 400 500 600 700 0 100 200 300 400 500 600 700
Throughput (request/second) Throughput (request/second)

F IGURE 14—Median response time and throughput for our movie F IGURE 15—Median response time and throughput for travel reser-
review service. Dashed lines represent 99th-percentile response time. vation service. Dashed lines represent 99th-percentile response time.
Beldi performs transactions over multiple SSFs to reserve a hotel
8 MB of storage. Counting all logs and metadata, each op- room and a flight, while the baseline returns inconsistent results.
eration requires storing between 20 to 36 bytes in addition
to the value. In terms of the network overhead introduced 120
without GC
by the scan and projection approach that we use to traverse 100 with GC (1 min)
Beldi’s linked DAAL, for a 20-row DAAL, each scan fetches with GC (10 min)

Latency (ms)
80 with GC (30 min)
2 KB more data than a baseline read to a single cell when
cross-table txn
measured at the network layer. Compared to the baseline, 60
Beldi induces one extra scan and write for each read oper- 40
ation, at least one scan for an unconditional write (and po-
tentially more scans and writes depending on the scenario), 20
and one read and two writes for a function invocation. In 0
DynamoDB’s on-demand mode, each read costs an additional 0 10 20 30 40 50 60
$2.5 × 10−7 , whereas writes cost an additional $1.25 × 10−6 . Time (minute)
In provisioned-capacity mode, costs are lower but depend on
F IGURE 16—Median response time for an SSF that uses one write
the specified capacity. operation under different GC configurations. Without GC, the linked
7.4 How does Beldi perform on our applications? DAAL grows over time. As a baseline, we configure Beldi with
cross-table transactions that do not use a linked DAAL.
In this section, we discuss the results of our large-scale exper-
iments for the movie review and travel reservation services; reservation). At this high load, Beldi’s 99th-percentile latency
the social networking site has similar results, so we defer its is only 20% higher for the movie review service, and 80%
results to our tech report [45]. The workloads that we use are higher for the transaction-enabled travel site. We also test
adapted from DeathStarBench [12, 16] with a minor modifi- a configuration of the travel site that uses Beldi for fault-
cation to support our extended travel reservation service: the tolerance but without transactions. The median latency at
transactions to reserve a hotel and flight randomly pick a hotel saturation for that configuration is 16% lower and the 99th-
and a flight out of 100 choices each following a normal distri- percentile latency is 20% lower than Beldi with transactions.
bution. Requests contain random content within the expected
7.5 What is the effect of garbage collection?
schema and are generated and measured using wrk2 [44].
We issue load at a constant rate for 5 minutes, starting at Finally, we evaluate the importance of the choice of garbage
100 req/s and increasing in increments of 100 req/s until the collector timeout (T) on performance. Note that this is differ-
system is saturated. For our applications, we achieve satura- ent from the 1-minute timer that triggers the GC SSF (§7.2).
tion at around 800 req/s. The primary bottleneck in all cases is T is instead proportional to the maximum lifetime of an SSF
compute: AWS enforces a limit of 1,000 concurrent Lambdas and determines when a GC can remove a row from the Linked
per account (even if the Lambdas are for different functions), DAAL. Thus, this value is important for safety, whereas the
and the HTTP Gateway (or some internal scheduler) rejects trigger only determines when the GC runs.
requests in excess of this limit. Since T is important to ensure exactly-once semantics, we
Figures 14 and 15 depict the results. In all cases (including could imagine performing a similar actuarial analysis to those
the social media app), we observe that, until around 400 req/s involved in setting the end-to-end timeouts of reliable failure
(34M per day), Beldi’s median and 99th-percentile response detectors [1, 27]. However, as Figure 16 shows, the median
time are each around 2× higher than that of the baseline. response times for SSFs that access the linked DAAL are only
At the highest loads that we could test on AWS, Beldi still lightly impacted by the choice of T, even as we run the system
achieves the same throughput as the baseline at a slightly for 30 minutes at constant load under pessimistic conditions
higher median response time (around 3.3× for the travel (all SSF instances write to the same key). As a result, we

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1199
can be relatively conservative about T. To be clear, this is innovations is a simple API that SSF developers can use to
a testament to the heroic efforts of DynamoDB engineers build exciting applications without worrying about fault toler-
that have optimized its scan, filter, and projection operations. ance, concurrency control, or managing any infrastructure!
Nevertheless, we take some slight credit for ensuring that In the context of serverless, the observation that existing
Beldi’s linked DAAL is compatible with such operators. designs are currently a poor fit for applications that require
It is worth noting, however, that while T has a minor impact state has been the subject of much prior work [15, 22, 24, 25,
on performance, it does impact storage overhead and I/O, 43]. For example, Cloudburst [40] proposes a new architecture
since read and write operations still fetch a projection of the for incorporating state into serverless functions, and gg [15]
linked DAAL which scales with the number of rows (§7.3). proposes workarounds to state-management issues that arise
in desktop workloads that are outsourced to thousands of
8 Discussion serverless functions. However, the general approach to fault-
We now discuss a few aspects of Beldi, such as the implica- tolerance in these works is to re-execute the entire workflow
tion of relying on strongly consistent databases, the potential when there is a crash or timeout—violating exactly-once
benefit of using SQL databases like Amazon Aurora, and the semantics if any SSF in the workflow is not idempotent.
security implications of SSF federation and reusability. AFT [39] is the closest proposal to Beldi and introduces a
fault-tolerant shim layer for SSFs. However, AFT’s deploy-
Strongly consistent databases. Beldi enables developers to
ment setting, guarantees, and mechanisms are very different.
write stateful serverless applications without having to worry
First, Beldi runs entirely on serverless functions, whereas
about concurrency control, fault tolerance, or manually mak-
AFT requires servers to interpose and coordinate all database
ing all of their functions idempotent. In doing so, Beldi lever-
accesses. As a result, Beldi can run on any existing serverless
ages one or more fault-tolerant databases configured to be
platform (or even in a multi-provider setup) without requiring
strongly consistent. If these databases were to become unavail-
any modification on their part and without the user needing
able, for example due to network partitions, SSFs that write
to administer their own VMs. Second, Beldi seamlessly en-
to these unavailable databases would also become unavailable
ables transactions within SSFs and across workflows with
until the partition was resolved.
opacity, whereas AFT targets the much weaker (but more
ACID databases. A natural question is whether SSFs that efficient) read atomic isolation level [4]. Due to the weaker
use ACID databases need all of Beldi. For such SSFs, the ben- isolation, it would be more difficult to implement our travel
efit is not having to maintain a read or write log (or a linked reservation system on AFT. Finally, Beldi allows SSFs to be
DAAL) since the database does its own logging. However, managed independently and to keep their data private from
ACID databases are not enough to guarantee exactly-once se- each other, while AFT’s servers manage all SSF data, handle
mantics for function invocations since they provide atomicity failures and garbage collection, and serve as a central point
for read and write operations, but have no support for invoca- of coordination for transactions.
tions. As a result, Beldi would still need to implement mech-
anisms such as callbacks (§4.5) to ensure that a failed SSF 10 Conclusion
is not mistakenly re-executed despite independent garbage Beldi makes it possible for developers to build transactional
collectors. Furthermore, workflows that contain transactions and fault-tolerant workflows of SSFs on existing serverless
across SSFs would still need a collaborative coordination platforms. To do so, Beldi introduces novel refinements to
protocol such as the one proposed in Section 6.2. an existing log-based approach to fault tolerance, including
Independence of separate applications. We view SSFs as a new data structure and algorithms that operate on this data
owning all the data on which they operate, similar to mi- structure (§4.1), support for invocations of other SSFs with
croservice architectures [11]. SSFs can isolate the state of a novel callback mechanism (§4.5), and a collaborative dis-
different applications by storing each application’s state on tributed transaction protocol (§6). With these refinements,
a different database. To ensure that a malicious request from Beldi extracts the fault tolerance already available in today’s
one application cannot observe the state of another, standard NoSQL databases, and extends it to workflows of SSFs at low
authentication mechanisms such as capabilities and public cost with minimal effort from application developers.
key encryption could be used.
Acknowledgments
9 Related Work We thank the OSDI reviewers for their feedback and our shep-
We already discuss Beldi’s differences with Olive [36] herd, Jay Lorch, for going above and beyond and providing
throughout. To summarize, Beldi builds upon Olive’s elegant suggestions that dramatically improved the content and pre-
approach to fault tolerance and mutual exclusion, and adapts sentation of our work. We also thank Srinath Setty for many
it to an entirely new domain. This adaptation is nontrivial and invaluable discussions and his help with Olive. This work was
requires us to introduce new data structures, algorithms, and funded in part by VMWare, NSF grants CNS-1845749 and
abstractions (e.g., transactions across SSFs). The result of our CCF-1910565, and DARPA contract HR0011-17-C0047.

1200 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
References Conference on Architectural Support for Programming
[1] M. K. Aguilera and M. Walfish. No time for asynchrony. In Languages and Operating Systems (ASPLOS), Apr. 2019.
Proceedings of the Workshop on Hot Topics in Operating [17] Google Cloud Functions. Retrying background functions.
Systems (HotOS), 2009. https://cloud.google.com/functions/docs/
[2] AWS Lambda. https://aws.amazon.com/lambda/. bestpractices/retries.
[3] Azure Functions. https://azure.microsoft.com/en- [18] Google Cloud Functions.
us/services/functions/. https://cloud.google.com/functions.
[4] P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. [19] J. Gray. Notes on data base operating systems. In Operating
Scalable atomic visibility with RAMP transactions. In Systems, An Advanced Course, 1978.
Proceedings of the ACM SIGMOD Conference, June 2014. [20] R. Guerraoui and M. Kapałka. On the correctness of
[5] P. A. Bernstein, D. W. Shipman, and W. S. Wong. Formal transactional memory. In Proceedings of the ACM SIGPLAN
aspects of serializability in database concurrency control. Symposium on Principles and Practice of Parallel
IEEE Transactions on Software Engineering, SE-5(3), May Programming (PPoPP), Feb. 2008.
1979. [21] T. Harris. A pragmatic implementation of non-blocking linked
[6] Cloud Bigtable overview of replication. https://cloud. lists. In Proceedings of the International Symposium on
google.com/bigtable/docs/replication-overview. Distributed Computing (DISC), Oct. 2001.
[7] Quotas and limits for Cloud BigTable. [22] J. M. Hellerstein, J. Faleiro, J. E. Gonzalez, J. Schleier-Smith,
https://cloud.google.com/bigtable/quotas. V. Sreekanti, A. Tumanov, and C. Wu. Serverless computing:
[8] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, One step forward, two steps back. In Conference on
J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, Innovative Data Systems Research (CIDR), Jan. 2019.
P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, [23] N. Herman, J. P. Inala, Y. Huang, L. Tsai, E. Kohler,
A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, B. Liskov, and L. Shrira. Type-aware transactions for faster
R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, concurrent code. In Proceedings of the ACM European
and D. Woodford. Spanner: Google’s globally-distributed Conference on Computer Systems (EuroSys), Apr. 2016.
database. In Proceedings of the USENIX Symposium on [24] A. Jangda, D. Pinckney, Y. Brun, and A. Guha. Formal
Operating Systems Design and Implementation (OSDI), Oct. foundations of serverless computing. In Proceedings of the
2012. ACM SIGPLAN Conference on Object-Oriented Programming
[9] Consistency levels in Azure Cosmos DB. Systems, Languages and Applications (OOPSLA), Oct. 2019.
https://docs.microsoft.com/en-us/azure/cosmos- [25] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and
db/consistency-levels. C. Kozyrakis. Pocket: Elastic ephemeral storage for serverless
[10] Azure Cosmos DB service quotas. analytics. In Proceedings of the USENIX Symposium on
https://docs.microsoft.com/en-us/azure/cosmos- Operating Systems Design and Implementation (OSDI), 2018.
db/concepts-limits. [26] H. T. Kung and J. T. Robinson. On optimistic methods for
[11] C. de la Torre, B. Wagner, and M. Rousos. .NET concurrency control. ACM Transactions on Database Systems
Microservices: Architecture for Containerized .NET (TODS), 6(2), June 1981.
Applications. Microsoft Developer Division, .NET and Visual [27] J. B. Leners, H. Wu, W.-L. Hung, M. K. Aguilera, and
Studio product teams, v3.1 edition, Jan. 2020. M. Walfish. Detecting failures in distributed systems with the
https://docs.microsoft.com/en- FALCON spy network. In Proceedings of the ACM
us/dotnet/architecture/microservices/. Symposium on Operating Systems Principles (SOSP), 2011.
[12] DeathStarBench. [28] H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, and A. El
https://github.com/delimitrou/DeathStarBench/. Abbadi. Low-latency multi-datacenter databases using
[13] Amazon DynamoDB read consistency. https: replicated commit. In Proceedings of the International
//docs.aws.amazon.com/amazondynamodb/latest/ Conference on Very Large Data Bases (VLDB), Aug. 2013.
developerguide/HowItWorks.ReadConsistency.html. [29] .Net Microservices Sample Reference Application.
[14] Limits in DynamoDB. https://github.com/dotnet-
https://docs.aws.amazon.com/amazondynamodb/ architecture/eShopOnContainers.
latest/developerguide/Limits.html. [30] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and
[15] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, P. Schwarz. ARIES: A transaction recovery method
C. Kozyrakis, M. Zaharia, and K. Winstein. From laptop to supporting fine-granularity locking and partial rollbacks using
lambda: Outsourcing everyday jobs to thousands of transient write-ahead logging. ACM Transactions on Database Systems
functional containers. In Proceedings of the USENIX Annual (TODS), 17(1), 1992.
Technical Conference (ATC), 2019. [31] E. B. Moss. Nested transactions: An approach to reliable
[16] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, distributed computing. Technical report, Massachusetts
A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Institute of Technology, 1981.
Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, [32] S. Mu, S. Angel, and D. Shasha. Deferred runtime pipelining
L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and for contentious multicore software transactions. In
C. Delimitrou. An open-source benchmark suite for Proceedings of the ACM European Conference on Computer
microservices and their hardware-software implications for Systems (EuroSys), 2019.
cloud & edge systems. In Proceedings of the International [33] S. Mu, L. Nelson, W. Lloyd, and J. Li. Consolidating

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1201
concurrency control and consensus for commits under Cloudburst: Stateful functions-as-a-service.
conflicts. In Proceedings of the USENIX Symposium on arXiv:2001/04592, Jan. 2020.
Operating Systems Design and Implementation (OSDI), Nov. https://arxiv.org/abs/2001.04592.
2016. [41] AWS Step Functions.
[34] C. H. Papadimitriou. The serializability of concurrent https://aws.amazon.com/step-functions/.
database updates. Journal of the ACM, 26(4), Oct. 1979. [42] J. D. Valois. Lock-free linked lists using compare-and-swap.
[35] D. Peng and F. Dabek. Large-scale incremental processing In Proceedings of the Symposium on Principles of Distributed
using distributed transactions and notifications. In Computing (PODC), Aug. 1995.
Proceedings of the USENIX Symposium on Operating Systems [43] L. Wang, M. Li, Y. Zhang, T. Ristenpart, and M. Swift.
Design and Implementation (OSDI), Oct. 2010. Peeking behind the curtains of serverless platforms. In
[36] S. Setty, C. Su, J. R. Lorch, L. Zhou, H. Chen, P. Patel, and Proceedings of the USENIX Annual Technical Conference
J. Ren. Realizing the fault-tolerance promise of cloud storage (ATC), 2018.
using locks with intent. In Proceedings of the USENIX [44] wrk2: A constant throughput, correct latency recording variant
Symposium on Operating Systems Design and Implementation of wrk. https://github.com/giltene/wrk2.
(OSDI), 2016. [45] H. Zhang, A. Cardoza, P. B. Chen, S. Angel, and V. Liu.
[37] A. Shamis, M. Renzelmann, S. Novakovic, G. Chatzopoulos, Fault-tolerant and transactional stateful serverless workflows
A. Dragojević, D. Narayanan, and M. Castro. Fast general (extended version). arXiv:2010/06706, 2020.
distributed transactions with opacity. In Proceedings of the https://arxiv.org/abs/2010.06706.
ACM SIGMOD Conference, 2019. [46] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and
[38] M. F. Spear, V. J. Marathe, W. N. Scherer III, and M. L. Scott. D. R. K. Ports. Building consistent transactions with
Conflict detection and validation strategies for software inconsistent replication. In Proceedings of the ACM
transactional memory. In Proceedings of the International Symposium on Operating Systems Principles (SOSP), Oct.
Symposium on Distributed Computing (DISC), Sept. 2006. 2015.
[39] V. Sreekanti, C. Wu, S. Chhatrapati, J. E. Gonzalez, J. M. [47] K. Zhang, Y. Zhao, Y. Yang, Y. Liu, and M. Spear. Practical
Hellerstein, and J. M. Faleiro. A fault-tolerance shim for non-blocking unordered lists. In Proceedings of the
serverless computing. In Proceedings of the ACM European International Symposium on Distributed Computing (DISC),
Conference on Computer Systems (EuroSys), Apr. 2020. Oct. 2013.
[40] V. Sreekanti, C. Wu, X. C. Lin, J. Schleier-Smith, J. M.
Faleiro, J. E. Gonzalez, J. M. Hellerstein, and A. Tumanov.

1202 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
A Artifact Appendix A.5 Evaluation and expected result
A.1 Abstract A.5.1 Primitives (Figure 13)
Our artifact runs on Amazon AWS Lambda without additional re- To run the experiment
quirements or dependencies. Deploying the code, performing the $ ./scripts/singleop/run.sh
measurements, generating the plots, and running the benchmarks de- The script has two modes
pend on some third-party frameworks including serverless, gnuplot
and wrk2. 1. fast mode: less time, approximate result (around 5 min)

A.2 Artifact check-list 2. full mode: full experiment (around 30 min)


• Program: Golang The script will ask you which mode to run when it starts.
• Run-time environment: AWS Lambda Figure 13 includes three experiments, baseline (without beldi),
beldi and beldi-txn. Their function names are bsingleop, singleop
• Metrics: Throughput and latency and tsingleop respectively. After deployment, the script will ask for
• Experiments: Our serverless port of DeathStarBench the HTTP endpoint for these three lambdas, which needs manual
• Expected experiment run time: Around 20 hours setup at AWS.
Take bsingleop as an example:
• Public link: https://github.com/eniac/Beldi
• Code licenses: MIT License 1. Go to the lambda console, click the function

A.3 Description
A.3.1 How to access
https://github.com/eniac/Beldi
A.4 Installation
A.4.1 Set up docker container
1. login to a registry
$ docker login
2. pull the docker image
for github packages users:
2. Click add trigger
$ docker run -it \
> docker.pkg.github.com/eniac/beldi/beldi:latest
/bin/bash
for docker hub users:
$ docker run -it tauta/beldi:latest /bin/bash

The purpose of this container is to setup the environment needed


to run our configuration, deployment, and graph plotting scripts. The
actual code of Beldi runs on AWS lambda.
A.4.2 Set AWS Credentials
Inside the container run
$ aws configure
It will ask you for an access key ID, a secret access key, region
and output format. The first two can be found/created at: 3. Choose API Gateway

Set the region to us-east-1 and the output format to json.

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 1203
4. Configure as below To generate the figure,
$ gnuplot < scripts/gctest/gc.pg
The figure will show up as beldi/result/gctest/res.png.
A.5.3 Movie review service (Figure 14)

Baseline.
$ ./scripts/media/run-baseline.sh
Each data point takes around 20 min.
The script will first ask you for a request rate (the default is
100). After deployment, it will ask for the HTTP endpoint for
beldi-dev-bFrontend. When it finishes, it will print to the termi-
nal the median and p99 latency. This result will also be saved to
result/media/baseline.json. Alternatively, you can view the
metrics on AWS CloudWatch.
5. Click the trigger created
Beldi.
$ ./scripts/media/run.sh

A.5.4 Travel Reservation (Figure 15)

Baseline.
$ ./scripts/hotel/run-baseline.sh
Each data point takes around 20 min.
It will ask for the HTTP endpoint for beldi-dev-bgateway. When
it finishes, it will print to terminal the median and p99 latency. It
will also save the result to result/hotel/baseline.json.
Beldi.
$ ./scripts/hotel/run.sh

6. Copy the link and paste in terminal

After all three endpoints get set, the experiment will start running.
The result will be saved at beldi/result/singleop/singleop,
which can be loaded by gnuplot
$ gnuplot < scripts/singleop/singleop.pg
The figure will show up as beldi/result/singleop/res.png.
You can use docker cp to copy it to your host.
A.5.2 Garbage Collection (Figure 10)
To run the experiment,
$ ./scripts/gctest/run.sh
The script has two modes
1. fast mode: less time, prefix of Figure 10 (around 30 min)
2. full mode: full experiment (around 150 min)
The script will ask you which mode to run when it starts.
The script compiles the code and deploys the binary to AWS.
After that, it will ask for the HTTP endpoint for beldi-dev-gctest.
The result will be saved as beldi/result/gctest/gc.

1204 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

You might also like