11 Low-latency programming
                                               Thalesians Ltd
                          Level39, One Canada Square, Canary Wharf, London E14 5AB
                                             2022.06.16
   The microsecond
Colocation (i)
        From Latency equalisation: The need for fair and non-discriminatory colocation services1
        (2018.08.17):
               The significant upgrades to trading infrastructure, improved transparency and general best
               practices necessitated by MiFID II have turned a spotlight on low-latency market access and
               are driving a renewed demand for functional and scalable colocation provision.
               It’s no longer enough to be optimally connected for maximum trading advantage—under
               MiFID II, it also has to be equitable as per the regulation’s ‘latency equalisation’ provisions.
               Under Articles 48(8) and (9) of Directive 2014/65/EU in MiFID II, trading venues are required
               to provide “transparent, fair and non-discriminatory” colocation services that “do not create
               incentives for disorderly trading conditions or market abuse.” The regulation extended the
               provisions to cover multilateral and organised trading facilities, and the common requirements
               of the regulation apply to all types of colocation services as well as to trading venues that
               organise their own data centres or that use data centres owned or managed by third parties.
               Trading venues do have a modicum of control—they are able to determine their own com-
               mercial colocation policy, provided that it is based on “objective, transparent and non-
               discriminatory criteria to the different types of users of the venue within the limits of the space,
               power, cooling or similar facilities available.”
           1 https://www.interxion.com/uk/blogs/2018/082/latency-equalisation-the-need-for-fair-and-non-discriminatory-colocation-services
Colocation (ii)
               In other words, trading venues cannot discriminate between users—they must provide all sub-
               scribers with access to the same services on the same platform under the same conditions—
               from space, power, cooling and cable length to data access, market connectivity, technology,
               technical support and messaging types. In addition, trading venues have to monitor all con-
               nections and latency measurements within the colocation services they offer, to ensure the
               non-discriminatory treatment of all users with each type of service offered.
               Fee structure plays an important part in this. Trading venues must be sure that users are
               able to subscribe to only certain colocation services, and are not required to purchase bun-
               dled services. They must also use objective criteria when determining rebates, incentives
               and disincentives. Fee structures that contribute to disorderly trading conditions by encour-
               aging intensive trading, and that may stress market infrastructures, are therefore prohibited—
               although volume discounts are allowed under certain circumstances.
               Transparency is a significant component of this requirement too. Trading venues must pub-
               lish the details of their arrangements—including information on the colocation services they
               offer, the price of the services, the conditions for accessing the services, the different types
               of latency access that are available, procedures for the allocation of colocation space and the
               requirements that third-party providers of colocation services must meet.
Colocation (iii)
               In the light of these new requirements for electronic execution venues, and with the new
               Systematic Internaliser regime arriving on September 1, 2018, choosing the right hosting
               venue has become increasingly important. Financial services firms must be sure
               they are working with a data centre operator that understands the specific requirements of
               MiFID II and is able to partner with the firm towards complete compliance.
               Interxion is leading the way with the launch of LON3 in October, the latest addition to its Lon-
               don data centre campus, situated right between the Square Mile, the centre of global finance,
               and Tech City, the 3rd largest technology start-up cluster in the world. Its geographical prox-
               imity to the London Stock Exchange, equidistant between the key hosting centres of Slough
               (BATS, EBS) and Basildon (ICE Futures, Euronext), and access to microwave connectivity to
               Frankfurt (Eurex) make it unequaled in terms of European trading venue access. This central
               location yields major speed advantages for multi-venue trading strategies and enables optimal
               order book aggregation / consolidation for best execution under MiFID II.
   Kernel bypass
            ▶ Vanilla Linux can only process about 1M packets per second (pps).
            ▶ Modern 10 Gbps NICs can usually process at least 10M pps.
            ▶ The only way to squeeze more packets out of the hardware is to work around the Linux kernel networking stack. This is called kernel bypass.
First experiment
            ▶ The fastest way to drop packets in Linux, without hacking the kernel sources, is by placing a DROP rule in the PREROUTING iptables chain.
            ▶ By manipulating an indirection table on a NIC with ethtool -X, we direct all the packets to RX queue #0.
            ▶ ethtool statistics show that the network card receives a line rate of 12M pps.
            ▶ As we can see, the kernel is able to process 1.4M pps on that queue with a single CPU.
Second experiment
            ▶ Branch prediction is a technique used in CPU design that attempts to guess the outcome of a conditional operation and prepare for the most likely result.
            ▶ A digital circuit that performs this operation is known as a branch predictor. It is an important component of modern CPU architectures, such as the x86.
            ▶ Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be “not taken” and continue execution with the first branch of code which follows immediately after the conditional jump, or it can be “taken” and jump to a different place in program memory where the second branch of code is stored.
            ▶ It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.
   Branch prediction
Instruction pipeline
         Figure: Example of a 4-stage pipeline. The coloured boxes represent instructions independent of each
         other.
            ▶ Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline.
            ▶ The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken.
            ▶ The branch that is guessed to be the most likely is then fetched and speculatively executed.
            ▶ If it is later detected that the guess was wrong, then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
Branch misprediction
            ▶ The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage.
            ▶ Modern microprocessors tend to have quite long pipelines, so the misprediction delay is between 10 and 20 clock cycles.
            ▶ As a result, making a pipeline longer increases the need for a more advanced branch predictor.
            ▶ The first time a conditional jump instruction is encountered, there is not much information to base a prediction on.
            ▶ But the branch predictor keeps records of whether branches are taken or not taken.
            ▶ When it encounters a conditional jump that has been seen several times before, it can base the prediction on the history.
            ▶ The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
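            ▶ A classic way to observe this (a hypothetical microbenchmark, not from the slides): summing with a data-dependent branch is much faster once the data are sorted, because the branch becomes almost perfectly predictable.

                #include <algorithm>
                #include <chrono>
                #include <cstdio>
                #include <cstdlib>
                #include <vector>

                // Sums all elements >= 128; the branch is unpredictable on random
                // data but almost perfectly predictable once the data are sorted.
                long long sum_big(const std::vector<int>& v) {
                    long long sum = 0;
                    for (int x : v) {
                        if (x >= 128) sum += x;
                    }
                    return sum;
                }

                int main() {
                    std::vector<int> v(1 << 24);
                    for (int& x : v) x = std::rand() % 256;

                    auto time_it = [&] {
                        auto start = std::chrono::steady_clock::now();
                        volatile long long s = sum_big(v);    // volatile: keep the call alive.
                        (void)s;
                        return std::chrono::duration_cast<std::chrono::nanoseconds>(
                            std::chrono::steady_clock::now() - start).count();
                    };

                    std::printf("unsorted: %lld ns\n", time_it());
                    std::sort(v.begin(), v.end());
                    std::printf("sorted:   %lld ns\n", time_it());
                }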
Static dispatch
                class A {
                public:
                    void foo();
                };
            ▶ Here the compiler will create a routine for foo() and remember its address.
            ▶ This routine will be executed every time foo() is called on an instance of A.
            ▶ Keep in mind that only one routine exists per class method, and it is shared by all instances of the class.
            ▶ This process is known as static dispatch or early binding: the compiler knows which routine to execute during compilation.
   The curiously recurring template pattern (CRTP)
Virtual functions
            ▶ Virtual functions...
                class B {
                public:
                    virtual void bar();
                    virtual void qux();
                };

                class C : public B {
                public:
                    void bar() override;
                };
Dynamic dispatch
                B* b = new C();
                b->bar();
            ▶ If we used static dispatch as above, the call b->bar() would execute B::bar(), since (from the point of view of the compiler) b points to an object of type B.
            ▶ This would be wrong, because b actually points to an object of type C and C::bar() should be called instead.
            ▶ Given that virtual functions can be redefined in subclasses, calls via pointers (or references) to a base type cannot be dispatched at compile time.
            ▶ The compiler has to find the right function definition (i.e. the most specific one) at runtime.
            ▶ This process is called dynamic dispatch or late method binding.
Implementation (i)
            ▶ For every class that contains virtual functions, the compiler constructs a virtual table (vtable).
            ▶ The vtable contains an entry for each virtual function accessible by the class and stores a pointer to its definition.
            ▶ Only the most specific function definition callable by the class is stored in the vtable.
            ▶ Entries in the vtable can point either to functions declared in the class itself (e.g. C::bar()), or to virtual functions inherited from a base class (e.g. C::qux()).
            ▶ In our example, the compiler will create the following virtual tables.
Implementation (ii)
Implementation (iii)
            ▶ The vtable of class B has two entries, one for each of the two virtual functions declared in B’s scope: bar() and qux().
            ▶ Additionally, the vtable of B points to the local definitions of these functions, since they are the most specific (and only) ones from B’s point of view.
            ▶ More interesting is C’s table. In this case, the entry for bar() points to C’s own implementation, given that it is more specific than B::bar().
            ▶ Since C doesn’t override qux(), its entry in the vtable points to B’s definition (the most specific definition).
            ▶ Note that vtables exist at the class level, meaning there exists a single vtable per class, shared by all instances.
Vpointers (i)
            ▶ When the compiler sees b->bar() in the example above, it will look up B’s vtable for bar’s entry and follow the corresponding function pointer, right?
            ▶ We would still be calling B::bar() and not C::bar()...
            ▶ We haven’t told you the second part of the story: vpointers.
            ▶ Every time the compiler creates a vtable for a class, it adds an extra hidden member to the class: a pointer to the corresponding virtual table, called the vpointer.
            ▶ The vpointer is just another class member added by the compiler and increases the size of every object that has a vtable by sizeof(vpointer).
            ▶ When a call to a virtual function on an object is performed, the vpointer of the object is used to find the corresponding vtable of the class.
            ▶ Next, the function’s index into the vtable is used to find the correct (most specific) routine to be executed.
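            ▶ As a rough sketch of the mechanism (illustrative only, not what any particular compiler literally emits): the vtable is a per-class table of function pointers, and the vpointer is a hidden member pointing at it.

                #include <cstdio>

                struct Obj;    // Forward declaration.
                using Fn = void (*)(Obj*);

                struct VTable { Fn bar; Fn qux; };

                struct Obj { const VTable* vptr; };    // Every polymorphic object carries a vpointer.

                void B_bar(Obj*) { std::puts("B::bar"); }
                void B_qux(Obj*) { std::puts("B::qux"); }
                void C_bar(Obj*) { std::puts("C::bar"); }

                const VTable B_vtable{B_bar, B_qux};
                const VTable C_vtable{C_bar, B_qux};    // C overrides bar(), inherits qux().

                int main() {
                    Obj c{&C_vtable};
                    Obj* b = &c;          // A "base" pointer to a C object.
                    b->vptr->bar(b);      // Dynamic dispatch: prints C::bar.
                    b->vptr->qux(b);      // Prints B::qux.
                }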
Vpointers (ii)
Virtual destructors
An example (i)
        class Base
        {
        public:
            ~Base()    // Note: not declared virtual.
            {
                std::cout << "Destroying base" << std::endl;
            }
        };
An example (ii)
        class Derived : public Base
        {
        public:
            Derived(int value) : some_resource_(new int(value)) {}

            ~Derived()
            {
                std::cout << "Destroying derived" << std::endl;
                delete some_resource_;
            }
        private:
            int* some_resource_;
        };
An example (iii)
        int main()
        {
            Base* p = new Derived(5);
            delete p;    // Only ~Base() runs: the destructor is not virtual, so some_resource_ leaks.
        }
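        Strictly speaking, deleting a Derived through a Base* whose destructor is not virtual is undefined behaviour; in practice, typically only ~Base() runs and some_resource_ leaks. Declaring the base destructor virtual fixes this, as in the following sketch:

        class Base
        {
        public:
            virtual ~Base()    // virtual: delete through a Base* now runs ~Derived() first.
            {
                std::cout << "Destroying base" << std::endl;
            }
        };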
Summary
Classic polymorphism
        class order {
        public:
            virtual void place_order() { /* generic implementation */ }
        };
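        The section’s namesake avoids virtual dispatch altogether. A minimal CRTP sketch (the names limit_order and place_order_impl are hypothetical): the base class template calls into the derived class through a static_cast, so the call is resolved at compile time and no vtable or vpointer is needed.

        template <typename Derived>
        class order {
        public:
            void place_order() {
                // Compile-time ("static") dispatch to the derived implementation.
                static_cast<Derived*>(this)->place_order_impl();
            }
        };

        class limit_order : public order<limit_order> {
        public:
            void place_order_impl() { /* concrete implementation */ }
        };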
Microbenchmarking (i)
                // time and get_time() are pseudocode stand-ins for a clock type and a clock read.
                void test_for_loop() {
                    int iter_count = 1000000000;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {}
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                }
            ▶ Compilers can see that the loop does nothing and will not generate any code for it. The values of elapsed and elapsed_per_iteration end up meaningless.
   Benchmarking
Microbenchmarking (ii)
                void test_for_loop() {
                    int iter_count = 1000000000;
                    int sum = 0;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {
                        ++sum;
                    }
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                }
            ▶ The compiler may see that the variable sum isn’t used and optimize it away along with the loop.
Microbenchmarking (iii)
                void test_for_loop() {
                    int iter_count = 1000000000;
                    int sum = 0;
                    time start = get_time();
                    for (int i = 0; i < iter_count; ++i) {
                        ++sum;
                    }
                    time elapsed = get_time() - start;
                    time elapsed_per_iteration = elapsed / iter_count;
                    printf("Time elapsed for each iteration: %d\n", elapsed_per_iteration);
                    printf("Sum: %d\n", sum);    // Added
                }
            ▶ The compiler may see that the variable sum will always have a constant value and optimize all of that away as well.
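            ▶ One common remedy is to route the result through an optimization barrier. A minimal sketch, assuming the Google Benchmark library introduced below: benchmark::DoNotOptimize forces sum to be materialized, though a sufficiently clever compiler may still constant-fold the inner loop itself.

                #include <benchmark/benchmark.h>

                static void BM_for_loop(benchmark::State& state) {
                    for (auto _ : state) {
                        int sum = 0;
                        for (int i = 0; i < 1000; ++i) {
                            ++sum;
                        }
                        benchmark::DoNotOptimize(sum);    // Barrier: sum cannot be discarded.
                    }
                }
                BENCHMARK(BM_for_loop);
                BENCHMARK_MAIN();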
More caveats
            ▶ When we are testing I/O, the OS may preload files into memory to improve performance.
            ▶ Locality of reference (e.g. arrays versus linked lists).
            ▶ Effects of caches.
            ▶ Effects of memory bandwidth.
            ▶ Compiler inlining.
            ▶ Compiler implementation.
            ▶ Compiler switches.
            ▶ Number of processor cores.
            ▶ Optimizations at the processor level.
            ▶ Operating system schedulers.
            ▶ Operating system background processes.
Profiling
Google Benchmark
        #ifdef _WIN32
        #pragma comment(lib, "Shlwapi.lib")
        #endif

        #include <benchmark/benchmark.h>
        #include <cstdlib>
        #include <vector>

        std::vector<int> rng(int begin, int end, int count) {
            std::vector<int> v;
            for (int i = 0; i < count; ++i) {
                v.push_back((std::rand() % end) + begin);
            }
            return v;
        }
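        A minimal sketch of a benchmark that exercises rng (hypothetical benchmark name and arguments; timing is paused while the input is generated, so that only the sort is measured):

        #include <algorithm>

        static void BM_sort_random_vector(benchmark::State& state) {
            for (auto _ : state) {
                state.PauseTiming();    // Exclude data generation from the timing.
                std::vector<int> v = rng(1, 100, static_cast<int>(state.range(0)));
                state.ResumeTiming();
                std::sort(v.begin(), v.end());
                benchmark::DoNotOptimize(v.data());    // Keep the result alive.
            }
        }
        BENCHMARK(BM_sort_random_vector)->Range(8, 8 << 10);

        BENCHMARK_MAIN();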
        Source: [PZ12]
   Cache warming
No cache warming
Cache warming
Consequences
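        A minimal sketch of the idea (the OrderBook type and sizes are hypothetical): between real events, touch the data the hot path will need, so that it is still resident in cache when the next real event arrives; without warming, the first event after a quiet period pays the cache-miss penalty.

        #include <cstddef>

        // Hypothetical order book that the hot path reads on every market-data event.
        struct OrderBook {
            static constexpr std::size_t kLevels = 1024;
            double bid_prices[kLevels];
            double ask_prices[kLevels];
        };

        // Cache warming: between real events, read through the structure so its
        // cache lines stay resident and the next real event does not miss.
        inline void warm_cache(const OrderBook& book) {
            double sink = 0.0;
            for (std::size_t i = 0; i < OrderBook::kLevels; ++i) {
                sink += book.bid_prices[i] + book.ask_prices[i];
            }
            volatile double keep = sink;    // Keep the reads from being optimized away.
            (void)keep;
        }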
Thread
std::thread
        #include <chrono>
        #include <cstdlib>
        #include <iostream>
        #include <thread>

        void foo() {
            std::cout << "foo starting" << std::endl;
            for (int i = 0; i < 10; ++i) {
                std::cout << i << std::endl;
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
            std::cout << "foo finished" << std::endl;
        }

        void bar(int x) {
            std::cout << "bar starting" << std::endl;
            for (int i = 0; i < x; i += 10) {
                std::cout << i << std::endl;
                std::this_thread::sleep_for(std::chrono::milliseconds(500));
            }
            std::cout << "bar finished" << std::endl;
        }
   Multithreading
        int main() {
            std::thread first(foo);           // Spawn new thread that calls foo().
            std::thread second(bar, 100);     // Spawn new thread that calls bar(100).
            std::cout << "main, foo, and bar are now executing concurrently...\n";
            // Synchronize threads:
            first.join();     // Pauses until first finishes.
            second.join();    // Pauses until second finishes.
            return EXIT_SUCCESS;
        }
std::future
Example: std::future
        #include <cstdlib>
        #include <future>
        #include <iostream>
        #include <thread>

        int main() {
            // std::future from a packaged task.
            std::packaged_task<int()> task([] { return 7; });    // Wrap the function.
            std::future<int> f1 = task.get_future();             // Get a future.
            std::thread t(std::move(task));

            // std::future from an async().
            std::future<int> f2 = std::async(std::launch::async, [] { return 8; });

            // std::future from a std::promise.
            std::promise<int> p;
            std::future<int> f3 = p.get_future();
            std::thread([&p] { p.set_value_at_thread_exit(9); }).detach();

            std::cout << "Results: " << f1.get() << ' ' << f2.get() << ' '
                      << f3.get() << '\n';
            t.join();
            return EXIT_SUCCESS;
        }
std::promise
Example: std::promise
        #include <chrono>
        #include <future>
        #include <iostream>
        #include <numeric>
        #include <thread>
        #include <vector>

        void accumulate(
            std::vector<int>::iterator first,
            std::vector<int>::iterator last,
            std::promise<int> accumulate_promise) {
            int sum = std::accumulate(first, last, 0);
            accumulate_promise.set_value(sum);
        }

        int main() {
            // Demonstrate using promise<int> to transmit a result between threads.
            std::vector<int> numbers = { 1, 2, 3, 4, 5, 6 };
            std::promise<int> accumulate_promise;
            std::future<int> accumulate_future = accumulate_promise.get_future();
            std::thread work_thread(accumulate, numbers.begin(), numbers.end(),
                std::move(accumulate_promise));

            // future::get() will wait until the future has a valid result and retrieve it.
            // Calling accumulate_future.wait() before accumulate_future.get() is not needed.
            std::cout << "result = " << accumulate_future.get() << '\n';
            work_thread.join();    // Wait for thread completion.
        }
Mutex
            ▶ A mutex is a lockable object designed to protect critical sections of code, giving one thread exclusive access and preventing other threads from executing those sections concurrently.
            ▶ std::mutex objects provide exclusive ownership and do not support recursive locking (i.e., a thread must not lock a mutex it already owns).
            ▶ For example, we can use a mutex to prevent multiple threads from performing I/O at the same time or from accessing the same memory locations.
Example: std::mutex
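        The slides use print_thread_id without showing its definition; a minimal sketch, assuming a global std::mutex that serializes access to std::cout:

        #include <cstdlib>
        #include <iostream>
        #include <mutex>
        #include <thread>

        std::mutex mtx;    // Protects std::cout from interleaved output.

        void print_thread_id(int id) {
            std::lock_guard<std::mutex> lock(mtx);    // Held for the scope of the function.
            std::cout << "thread #" << id << '\n';
        }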
        int main() {
            std::thread threads[10];
            for (int i = 0; i < 10; ++i)
                threads[i] = std::thread(print_thread_id, i + 1);
            for (auto& t : threads)
                t.join();    // Join all threads; a joinable thread must not be destroyed.
            return EXIT_SUCCESS;
        }
   The Disruptor
The Disruptor
Cost of contention
Lock-free programming
            ▶ People often describe lock-free programming as programming without mutexes, which are also referred to as locks.
            ▶ That’s true, but it’s only part of the story.
            ▶ The generally accepted definition, based on the academic literature, is a bit broader. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.
            ▶ Jeff Preshing (https://preshing.com/) proposes the following definition (paraphrased): code is lock-free if, no matter how the threads are scheduled, suspending any single thread never prevents the remaining threads, as a group, from making progress.
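            ▶ A minimal illustration (hypothetical example): a shared counter incremented with a compare-and-swap (CAS) loop instead of a mutex. A thread suspended mid-increment cannot block the others, which is the essence of the lock-free property.

                #include <atomic>

                std::atomic<int> counter{0};

                void increment() {
                    int expected = counter.load(std::memory_order_relaxed);
                    // compare_exchange_weak updates expected with the current
                    // value on failure, so the loop simply retries.
                    while (!counter.compare_exchange_weak(expected, expected + 1,
                                                          std::memory_order_relaxed)) {
                    }
                }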
The Disruptor
Going parallel
Videos on benchmarking
            ▶ Bart Massey (Portland C/Assembly Users Group). Popcount and the Black Art of Microbenchmarking. pdxbyte, 2014.03.11, https://www.youtube.com/watch?v=opJvsJk1B68
            ▶ Chandler Carruth (Google). Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My! CppCon 2015, https://www.youtube.com/watch?v=nXaxk27zwlk
            ▶ Alexander Radchenko (Optiver). Benchmarking C++: From video games to algorithmic trading. Pacific++, 2018.10, https://www.youtube.com/watch?v=MHaZwyhpl-M
            ▶ Alexander Radchenko (Optiver). Benchmarking C++: From video games to algorithmic trading. Meeting C++ 2018, https://www.youtube.com/watch?v=7YVMC5v4qCA
            ▶ CoffeeBeforeArch. C++ Crash Course: Google Benchmark. 2019.07.21, https://www.youtube.com/watch?v=eKODykkIZTE
            ▶ Trisha Gee and Michael Barker (LMAX). The Disruptor—A Beginner’s Guide to Hardcore Concurrency. JAX London 2011, https://www.youtube.com/watch?v=DCdGlxBbKU4
            ▶ Trisha Gee (LMAX). Concurrent Programming Using The Disruptor. 2013.01.25, https://www.youtube.com/watch?v=eTeWxZvlCZ8
            ▶ Peter Lawrey. Low Latency Code in Java 8. MelbJVM, 2015.04, https://www.youtube.com/watch?v=t49bfPLp0B0
            ▶ Daniel Shaya. How Low Can You Go? Ultra Low Latency Java in the Real World. London Java Community, 2018, https://www.youtube.com/watch?v=BD9cRbxWQx8
Books on optimization
                  Dennis Andriesse.
                  Practical Binary Analysis: Build Your Own Linux Tools for Binary Instrumentation,
                  Analysis, and Disassembly.
                  No Starch Press, 2018.
                  Björn Andrist and Viktor Sehr.
                  C++ High Performance: Master the art of optimizing the functioning of your C++ code.
                  Packt, 2020.
                  Michael Born.
                  Binary Analysis Cookbook: Actionable recipes for disassembling and analyzing
                  binaries for security risks.
                  Packt, 2019.
                  Benjamin J. Evans and James Gough.
                  Optimizing Java: Practical Techniques for Improved Performance Tuning.
                  O’Reilly, 2016.
                  Randall Hyde.
                  Write Great Code, Volume 1: Understanding the Machine.
                  No Starch Press, 2004.
                  Randall Hyde.
                  Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level.
                  No Starch Press, 2006.
                   Christopher Kormanyos.
                   Real-Time C++: Efficient Object-Oriented and Template Microcontroller Programming.
                   Springer, 2018.
   Bibliography