11 Low-latency programming
                                               Thalesians Ltd
                          Level39, One Canada Square, Canary Wharf, London E14 5AB
                                             2022.06.16
   The microsecond
Colocation (i)
        From Latency equalisation: The need for fair and non-discriminatory colocation services1
        (2018.08.17):
               The significant upgrades to trading infrastructure, improved transparency and general best
               practices necessitated by MiFID II have turned a spotlight on low-latency market access and
               are driving a renewed demand for functional and scalable colocation provision.
               It’s no longer enough to be optimally connected for maximum trading advantage—under
               MiFID II, it also has to be equitable as per the regulation’s ‘latency equalisation’ provisions.
               Under Articles 48(8) and (9) of Directive 2014/65/EU in MiFID II, trading venues are required
               to provide “transparent, fair and non-discriminatory” colocation services that “do not create
               incentives for disorderly trading conditions or market abuse.” The regulation extended the
               provisions to cover multilateral and organised trading facilities, and the common requirements
               of the regulation apply to all types of colocation services as well as to trading venues that
               organise their own data centres or that use data centres owned or managed by third parties.
               Trading venues do have a modicum of control—they are able to determine their own com-
               mercial colocation policy, provided that it is based on “objective, transparent and non-
               discriminatory criteria to the different types of users of the venue within the limits of the space,
               power, cooling or similar facilities available.”
           1 https://www.interxion.com/uk/blogs/2018/082/latency-equalisation-the-need-for-fair-and-non-discriminatory-colocation-services
Colocation (ii)
               In other words, trading venues cannot discriminate between users—they must provide all sub-
               scribers with access to the same services on the same platform under the same conditions—
               from space, power, cooling and cable length to data access, market connectivity, technology,
               technical support and messaging types. In addition, trading venues have to monitor all con-
               nections and latency measurements within the colocation services they offer, to ensure the
               non-discriminatory treatment of all users with each type of service offered.
               Fee structure plays an important part in this. Trading venues must be sure that users are
               able to subscribe to only certain colocation services, and are not required to purchase bun-
               dled services. They must also use objective criteria when determining rebates, incentives
               and disincentives. Fee structures that contribute to disorderly trading conditions by encour-
               aging intensive trading, and that may stress market infrastructures, are therefore prohibited—
               although volume discounts are allowed under certain circumstances.
               Transparency is a significant component of this requirement too. Trading venues must pub-
               lish the details of their arrangements—including information on the colocation services they
               offer, the price of the services, the conditions for accessing the services, the different types
               of latency access that are available, procedures for the allocation of colocation space and the
               requirements that third-party providers of colocation services must meet.
Colocation (iii)
               In the light of these new requirements for electronic execution venues, and with the new
               Systematic Internaliser regime arriving on September 1, 2018, choosing the right hosting
               venue has become increasingly important. Financial services firms must be sure
               they are working with a data centre operator that understands the specific requirements of
               MiFID II and is able to partner with the firm towards complete compliance.
               Interxion is leading the way with the launch of LON3 in October, the latest addition to its Lon-
               don data centre campus, situated right between the Square Mile, the centre of global finance,
               and Tech City, the 3rd largest technology start-up cluster in the world. Its geographical prox-
               imity to the London Stock Exchange, equidistant between the key hosting centres of Slough
               (BATS, EBS) and Basildon (ICE Futures, Euronext), and access to microwave connectivity to
               Frankfurt (Eurex) make it unequaled in terms of European trading venue access. This central
               location yields major speed advantages for multi-venue trading strategies and enables optimal
               order book aggregation / consolidation for best execution under MiFID II.
   Kernel bypass
            ▶ Vanilla Linux can only process about 1M packets per second (pps).
            ▶ Modern 10 Gbps NICs can usually process at least 10M pps.
            ▶ The only way to squeeze more packets out of the hardware is to work around the Linux kernel networking stack. This is called kernel bypass.
First experiment
            ▶ The fastest way to drop packets in Linux, without hacking the kernel sources, is by placing a DROP rule in the PREROUTING iptables chain.
            ▶ By manipulating an indirection table on a NIC with ethtool -X, we direct all the packets to RX queue #0.
            ▶ ethtool statistics show that the network card receives a line rate of 12M pps.
            ▶ As we can see, the kernel is able to process 1.4M pps on that queue with a single CPU.
Second experiment
            ▶ Branch prediction is a technique used in CPU design that attempts to guess the outcome of a conditional operation and prepare for the most likely result.
            ▶ A digital circuit that performs this operation is known as a branch predictor. It is an important component of modern CPU architectures, such as the x86.
            ▶ Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be “not taken” and continue execution with the first branch of code which follows immediately after the conditional jump, or it can be “taken” and jump to a different place in program memory where the second branch of code is stored.
            ▶ It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.
   Branch prediction
Instruction pipeline
         Figure: Example of a 4-stage pipeline. The coloured boxes represent instructions independent of each
         other.
            ▶ Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline.
            ▶ The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken.
            ▶ The branch that is guessed to be the most likely is then fetched and speculatively executed.
            ▶ If it is later detected that the guess was wrong, then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
Branch misprediction
            ▶ The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage.
            ▶ Modern microprocessors tend to have quite long pipelines, so the misprediction delay is between 10 and 20 clock cycles.
            ▶ As a result, making a pipeline longer increases the need for a more advanced branch predictor.
            ▶ The first time a conditional jump instruction is encountered, there is not much information to base a prediction on.
            ▶ But the branch predictor keeps records of whether branches are taken or not taken.
            ▶ When it encounters a conditional jump that has been seen several times before, it can base the prediction on the history.
            ▶ The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
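            ▶ A classic way to observe this (a hypothetical microbenchmark, not from the slides): summing with a data-dependent branch is much faster once the data are sorted, because the branch becomes almost perfectly predictable.

                #include <algorithm>
                #include <chrono>
                #include <cstdio>
                #include <cstdlib>
                #include <vector>

                // Sums all elements >= 128; the branch is unpredictable on random
                // data but almost perfectly predictable once the data are sorted.
                long long sum_big(const std::vector<int>& v) {
                    long long sum = 0;
                    for (int x : v) {
                        if (x >= 128) sum += x;
                    }
                    return sum;
                }

                int main() {
                    std::vector<int> v(1 << 24);
                    for (int& x : v) x = std::rand() % 256;

                    auto time_it = [&] {
                        auto start = std::chrono::steady_clock::now();
                        volatile long long s = sum_big(v);    // volatile: keep the call alive.
                        (void)s;
                        return std::chrono::duration_cast<std::chrono::nanoseconds>(
                            std::chrono::steady_clock::now() - start).count();
                    };

                    std::printf("unsorted: %lld ns\n", time_it());
                    std::sort(v.begin(), v.end());
                    std::printf("sorted:   %lld ns\n", time_it());
                }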
Static dispatch
                class A {
                public:
                    void foo();
                };
            ▶ Here the compiler will create a routine for foo() and remember its address.
            ▶ This routine will be executed every time foo() is called on an instance of A.
            ▶ Keep in mind that only one routine exists per class method, and it is shared by all instances of the class.
            ▶ This process is known as static dispatch or early binding: the compiler knows which routine to execute during compilation.
   The curiously recurring template pattern (CRTP)
Virtual functions
            ▶ Virtual functions...
                class B {
                public:
                    virtual void bar();
                    virtual void qux();
                };

                class C : public B {
                public:
                    void bar() override;
                };
Dynamic dispatch
                B* b = new C();
                b->bar();
            ▶ If we used static dispatch as above, the call b->bar() would execute B::bar(), since (from the point of view of the compiler) b points to an object of type B.
            ▶ This would be wrong, because b actually points to an object of type C and C::bar() should be called instead.
            ▶ Given that virtual functions can be redefined in subclasses, calls via pointers (or references) to a base type cannot be dispatched at compile time.
            ▶ The compiler has to find the right function definition (i.e. the most specific one) at runtime.
            ▶ This process is called dynamic dispatch or late method binding.
Implementation (i)
            ▶ For every class that contains virtual functions, the compiler constructs a virtual table (vtable).
            ▶ The vtable contains an entry for each virtual function accessible by the class and stores a pointer to its definition.
            ▶ Only the most specific function definition callable by the class is stored in the vtable.
            ▶ Entries in the vtable can point either to functions declared in the class itself (e.g. C::bar()), or to virtual functions inherited from a base class (e.g. C::qux()).
            ▶ In our example, the compiler will create the following virtual tables.
Implementation (ii)
Implementation (iii)
            ▶ The vtable of class B has two entries, one for each of the two virtual functions declared in B’s scope: bar() and qux().
            ▶ Additionally, the vtable of B points to the local definitions of these functions, since they are the most specific (and only) ones from B’s point of view.
            ▶ More interesting is C’s table. In this case, the entry for bar() points to C’s own implementation, given that it is more specific than B::bar().
            ▶ Since C doesn’t override qux(), its entry in the vtable points to B’s definition (the most specific definition).
            ▶ Note that vtables exist at the class level, meaning there exists a single vtable per class, shared by all instances.
Vpointers (i)
            ▶ When the compiler sees b->bar() in the example above, it will look up B’s vtable for bar’s entry and follow the corresponding function pointer, right?
            ▶ We would still be calling B::bar() and not C::bar()...
            ▶ We haven’t told you the second part of the story: vpointers.
            ▶ Every time the compiler creates a vtable for a class, it adds an extra hidden member to the class: a pointer to the corresponding virtual table, called the vpointer.
            ▶ The vpointer is just another class member added by the compiler and increases the size of every object that has a vtable by sizeof(vpointer).
            ▶ When a call to a virtual function on an object is performed, the vpointer of the object is used to find the corresponding vtable of the class.
            ▶ Next, the function’s index into the vtable is used to find the correct (most specific) routine to be executed.
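            ▶ As a rough sketch of the mechanism (illustrative only, not what any particular compiler literally emits): the vtable is a per-class table of function pointers, and the vpointer is a hidden member pointing at it.

                #include <cstdio>

                struct Obj;    // Forward declaration.
                using Fn = void (*)(Obj*);

                struct VTable { Fn bar; Fn qux; };

                struct Obj { const VTable* vptr; };    // Every polymorphic object carries a vpointer.

                void B_bar(Obj*) { std::puts("B::bar"); }
                void B_qux(Obj*) { std::puts("B::qux"); }
                void C_bar(Obj*) { std::puts("C::bar"); }

                const VTable B_vtable{B_bar, B_qux};
                const VTable C_vtable{C_bar, B_qux};    // C overrides bar(), inherits qux().

                int main() {
                    Obj c{&C_vtable};
                    Obj* b = &c;          // A "base" pointer to a C object.
                    b->vptr->bar(b);      // Dynamic dispatch: prints C::bar.
                    b->vptr->qux(b);      // Prints B::qux.
                }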
Vpointers (ii)
Virtual destructors
An example (i)
        class Base
        {
        public:
            ~Base()    // Note: not declared virtual.
            {
                std::cout << "Destroying base" << std::endl;
            }
        };
An example (ii)
        class Derived : public Base
        {
        public:
            Derived(int value) : some_resource_(new int(value)) {}

            ~Derived()
            {
                std::cout << "Destroying derived" << std::endl;
                delete some_resource_;
            }
        private:
            int* some_resource_;
        };
An example (iii)
        int main()
        {
            Base* p = new Derived(5);
            delete p;    // Only ~Base() runs: the destructor is not virtual, so some_resource_ leaks.
        }
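        Strictly speaking, deleting a Derived through a Base* whose destructor is not virtual is undefined behaviour; in practice, typically only ~Base() runs and some_resource_ leaks. Declaring the base destructor virtual fixes this, as in the following sketch:

        class Base
        {
        public:
            virtual ~Base()    // virtual: delete through a Base* now runs ~Derived() first.
            {
                std::cout << "Destroying base" << std::endl;
            }
        };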
Summary
Classic polymorphism
        class order {
        public:
            virtual void place_order() { /* generic implementation */ }
        };
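        The section’s namesake avoids virtual dispatch altogether. A minimal CRTP sketch (the names limit_order and place_order_impl are hypothetical): the base class template calls into the derived class through a static_cast, so the call is resolved at compile time and no vtable or vpointer is needed.

        template <typename Derived>
        class order {
        public:
            void place_order() {
                // Compile-time ("static") dispatch to the derived implementation.
                static_cast<Derived*>(this)->place_order_impl();
            }
        };

        class limit_order : public order<limit_order> {
        public:
            void place_order_impl() { /* concrete implementation */ }
        };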
Microbenchmarking (i)
                // time and get_time() are pseudocode stand-ins for a clock type and a clock read.
                void test_for_loop() {
                    int iter_count = 1000000000;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {}
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                }
            ▶ Compilers can see that the loop does nothing and will not generate any code for it. The values of elapsed and elapsed_per_iteration end up meaningless.
   Benchmarking
Microbenchmarking (ii)
                void test_for_loop() {
                    int iter_count = 1000000000;
                    int sum = 0;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {
                        ++sum;
                    }
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                }
            ▶ The compiler may see that the variable sum isn’t used and optimize it away along with the loop.
Microbenchmarking (iii)
                void test_for_loop() {
                    int iter_count = 1000000000;
                    int sum = 0;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {
                        ++sum;
                    }
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                    printf("Sum: %d\n", sum);    // Added
                }
            ▶ The compiler may see that the variable sum will always have a constant value and optimize all of that away as well.
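            ▶ One common remedy is to route the result through an optimization barrier. A minimal sketch, assuming the Google Benchmark library introduced below: benchmark::DoNotOptimize forces sum to be materialized, though a sufficiently clever compiler may still constant-fold the inner loop itself.

                #include <benchmark/benchmark.h>

                static void BM_for_loop(benchmark::State& state) {
                    for (auto _ : state) {
                        int sum = 0;
                        for (int i = 0; i < 1000; ++i) {
                            ++sum;
                        }
                        benchmark::DoNotOptimize(sum);    // Barrier: sum cannot be discarded.
                    }
                }
                BENCHMARK(BM_for_loop);
                BENCHMARK_MAIN();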
More caveats
            ▶ When we are testing I/O, the OS may preload files into memory to improve performance.
            ▶ Locality of reference (e.g. arrays versus linked lists).
            ▶ Effects of caches.
            ▶ Effects of memory bandwidth.
            ▶ Compiler inlining.
            ▶ Compiler implementation.
            ▶ Compiler switches.
            ▶ Number of processor cores.
            ▶ Optimizations at the processor level.
            ▶ Operating system schedulers.
            ▶ Operating system background processes.
Profiling
Google Benchmark
        #ifdef _WIN32
        #pragma comment(lib, "Shlwapi.lib")
        #endif

        #include <benchmark/benchmark.h>
        #include <cstdlib>
        #include <vector>

        std::vector<int> rng(int begin, int end, int count) {
            std::vector<int> v;
            for (int i = 0; i < count; ++i) {
                v.push_back((std::rand() % end) + begin);
            }
            return v;
        }
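        A minimal sketch of a benchmark that exercises rng (hypothetical benchmark name and arguments; timing is paused while the input is generated, so that only the sort is measured):

        #include <algorithm>

        static void BM_sort_random_vector(benchmark::State& state) {
            for (auto _ : state) {
                state.PauseTiming();    // Exclude data generation from the timing.
                std::vector<int> v = rng(1, 100, static_cast<int>(state.range(0)));
                state.ResumeTiming();
                std::sort(v.begin(), v.end());
                benchmark::DoNotOptimize(v.data());    // Keep the result alive.
            }
        }
        BENCHMARK(BM_sort_random_vector)->Range(8, 8 << 10);

        BENCHMARK_MAIN();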
        Source: [PZ12]
   Cache warming
No cache warming
Cache warming
Consequences
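        A minimal sketch of the idea (the OrderBook type and sizes are hypothetical): between real events, touch the data the hot path will need, so that it is still resident in cache when the next real event arrives; without warming, the first event after a quiet period pays the cache-miss penalty.

        #include <cstddef>

        // Hypothetical order book that the hot path reads on every market-data event.
        struct OrderBook {
            static constexpr std::size_t kLevels = 1024;
            double bid_prices[kLevels];
            double ask_prices[kLevels];
        };

        // Cache warming: between real events, read through the structure so its
        // cache lines stay resident and the next real event does not miss.
        inline void warm_cache(const OrderBook& book) {
            double sink = 0.0;
            for (std::size_t i = 0; i < OrderBook::kLevels; ++i) {
                sink += book.bid_prices[i] + book.ask_prices[i];
            }
            volatile double keep = sink;    // Keep the reads from being optimized away.
            (void)keep;
        }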
Thread
std::thread
        #include <chrono>
        #include <cstdlib>
        #include <iostream>
        #include <thread>

        void foo() {
            std::cout << "foo starting" << std::endl;
            for (int i = 0; i < 10; ++i) {
                std::cout << i << std::endl;
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
            std::cout << "foo finished" << std::endl;
        }

        void bar(int x) {
            std::cout << "bar starting" << std::endl;
            for (int i = 0; i < x; i += 10) {
                std::cout << i << std::endl;
                std::this_thread::sleep_for(std::chrono::milliseconds(500));
            }
            std::cout << "bar finished" << std::endl;
        }
   Multithreading
        int main() {
            std::thread first(foo);           // Spawn new thread that calls foo().
            std::thread second(bar, 100);     // Spawn new thread that calls bar(100).
            std::cout << "main, foo, and bar are now executing concurrently...\n";
            // Synchronize threads:
            first.join();     // Pauses until first finishes.
            second.join();    // Pauses until second finishes.
            return EXIT_SUCCESS;
        }
std::future
Example: std::future
        #include <cstdlib>
        #include <future>
        #include <iostream>
        #include <thread>

        int main() {
            // std::future from a packaged task.
            std::packaged_task<int()> task([] { return 7; });    // Wrap the function.
            std::future<int> f1 = task.get_future();             // Get a future.
            std::thread t(std::move(task));

            // std::future from an async().
            std::future<int> f2 = std::async(std::launch::async, [] { return 8; });

            // std::future from a std::promise.
            std::promise<int> p;
            std::future<int> f3 = p.get_future();
            std::thread([&p] { p.set_value_at_thread_exit(9); }).detach();

            std::cout << "Results: " << f1.get() << ' ' << f2.get() << ' '
                      << f3.get() << '\n';
            t.join();
            return EXIT_SUCCESS;
        }
std::promise
Example: std::promise
        #include <chrono>
        #include <future>
        #include <iostream>
        #include <numeric>
        #include <thread>
        #include <vector>

        void accumulate(
            std::vector<int>::iterator first,
            std::vector<int>::iterator last,
            std::promise<int> accumulate_promise) {
            int sum = std::accumulate(first, last, 0);
            accumulate_promise.set_value(sum);
        }

        int main() {
            // Demonstrate using promise<int> to transmit a result between threads.
            std::vector<int> numbers = { 1, 2, 3, 4, 5, 6 };
            std::promise<int> accumulate_promise;
            std::future<int> accumulate_future = accumulate_promise.get_future();
            std::thread work_thread(accumulate, numbers.begin(), numbers.end(),
                std::move(accumulate_promise));

            // future::get() will wait until the future has a valid result and retrieve it.
            // Calling accumulate_future.wait() before accumulate_future.get() is not needed.
            std::cout << "result = " << accumulate_future.get() << '\n';
            work_thread.join();    // Wait for thread completion.
        }
Mutex
            ▶ A mutex is a lockable object designed to protect critical sections of code, giving one thread exclusive access and preventing other threads from executing those sections concurrently.
            ▶ std::mutex objects provide exclusive ownership and do not support recursive locking (i.e., a thread must not lock a mutex it already owns).
            ▶ For example, we can use a mutex to prevent multiple threads from performing I/O at the same time or from accessing the same memory locations.
Example: std::mutex
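        The slides use print_thread_id without showing its definition; a minimal sketch, assuming a global std::mutex that serializes access to std::cout:

        #include <cstdlib>
        #include <iostream>
        #include <mutex>
        #include <thread>

        std::mutex mtx;    // Protects std::cout from interleaved output.

        void print_thread_id(int id) {
            std::lock_guard<std::mutex> lock(mtx);    // Held for the scope of the function.
            std::cout << "thread #" << id << '\n';
        }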
        int main() {
            std::thread threads[10];
            for (int i = 0; i < 10; ++i)
                threads[i] = std::thread(print_thread_id, i + 1);
            for (auto& t : threads)
                t.join();    // Join all threads; a joinable thread must not be destroyed.
            return EXIT_SUCCESS;
        }
   The Disruptor
The Disruptor
Cost of contention
Lock-free programming
            ▶ People often describe lock-free programming as programming without mutexes, which are also referred to as locks.
            ▶ That’s true, but it’s only part of the story.
            ▶ The generally accepted definition, based on the academic literature, is a bit broader. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.
            ▶ Jeff Preshing (https://preshing.com/) proposes the following definition (paraphrased): code is lock-free if, no matter how the threads are scheduled, suspending any single thread never prevents the remaining threads, as a group, from making progress.
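            ▶ A minimal illustration (hypothetical example): a shared counter incremented with a compare-and-swap (CAS) loop instead of a mutex. A thread suspended mid-increment cannot block the others, which is the essence of the lock-free property.

                #include <atomic>

                std::atomic<int> counter{0};

                void increment() {
                    int expected = counter.load(std::memory_order_relaxed);
                    // compare_exchange_weak updates expected with the current
                    // value on failure, so the loop simply retries.
                    while (!counter.compare_exchange_weak(expected, expected + 1,
                                                          std::memory_order_relaxed)) {
                    }
                }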
The Disruptor
Going parallel
Videos on benchmarking
            ▶ Bart Massey (Portland C/Assembly Users Group). Popcount and the Black Art of Microbenchmarking. pdxbyte, 2014.03.11, https://www.youtube.com/watch?v=opJvsJk1B68
            ▶ Chandler Carruth (Google). Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My! CppCon 2015, https://www.youtube.com/watch?v=nXaxk27zwlk
            ▶ Alexander Radchenko (Optiver). Benchmarking C++: From video games to algorithmic trading. Pacific++, 2018.10, https://www.youtube.com/watch?v=MHaZwyhpl-M
            ▶ Alexander Radchenko (Optiver). Benchmarking C++: From video games to algorithmic trading. Meeting C++ 2018, https://www.youtube.com/watch?v=7YVMC5v4qCA
            ▶ CoffeeBeforeArch. C++ Crash Course: Google Benchmark. 2019.07.21, https://www.youtube.com/watch?v=eKODykkIZTE
            ▶ Trisha Gee and Michael Barker (LMAX). The Disruptor—A Beginner’s Guide to Hardcore Concurrency. JAX London 2011, https://www.youtube.com/watch?v=DCdGlxBbKU4
            ▶ Trisha Gee (LMAX). Concurrent Programming Using The Disruptor. 2013.01.25, https://www.youtube.com/watch?v=eTeWxZvlCZ8
            ▶ Peter Lawrey. Low Latency Code in Java 8. MelbJVM, 2015.04, https://www.youtube.com/watch?v=t49bfPLp0B0
            ▶ Daniel Shaya. How Low Can You Go? Ultra Low Latency Java in the Real World. London Java Community, 2018, https://www.youtube.com/watch?v=BD9cRbxWQx8
Books on optimization
                  Dennis Andriesse.
                  Practical Binary Analysis: Build Your Own Linux Tools for Binary Instrumentation,
                  Analysis, and Disassembly.
                  No Starch Press, 2018.
                  Björn Andrist and Viktor Sehr.
                  C++ High Performance: Master the art of optimizing the functioning of your C++ code.
                  Packt, 2020.
                  Michael Born.
                  Binary Analysis Cookbook: Actionable recipes for disassembling and analyzing
                  binaries for security risks.
                  Packt, 2019.
                  Benjamin J. Evans and James Gough.
                  Optimizing Java: Practical Techniques for Improved Performance Tuning.
                  O’Reilly, 2016.
                  Randall Hyde.
                  Write Great Code, Volume 1: Understanding the Machine.
                  No Starch Press, 2004.
                  Randall Hyde.
                  Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level.
                  No Starch Press, 2006.
                   Christopher Kormanyos.
                   Real-Time C++: Efficient Object-Oriented and Template Microcontroller Programming.
                   Springer, 2018.
   Bibliography