Osbook V0.78
Kernel-Oriented Approach
(Partially written. Expect grammatical mistakes
and minor technical errors.
Updates are released every week on Fridays.)
Send bug reports/suggestions to
srsarangi@cse.iitd.ac.in
Version 0.78
Smruti R. Sarangi
List of Trademarks
• Linux is a registered trademark owned by Linus Torvalds.
• Intel, Intel SGX and Intel TDX are registered trademarks of Intel Corporation.
• AMD is a registered trademark of AMD Corporation.
• Microsoft and Windows are registered trademarks of Microsoft Corporation.
1 Introduction 9
1.1 Types of Operating Systems 11
1.2 The Linux OS 12
1.2.1 Versions, Statistics and Conventions 14
1.3 Organization of the Book 17
3 Processes 63
3.1 The Process Descriptor 66
3.1.1 The Notion of a Process 66
3.1.2 struct task_struct 67
3.1.3 struct thread_info 67
3.1.4 Task States 69
3.1.5 Kernel Stack 71
3.1.6 Task Priorities 75
3.1.7 Computing Actual Task Priorities 76
Figure 1.1: Diagram of the overall system (CPU, memory, hard disk and I/O devices)
Figure 1.2: Place of the OS in the overall system (programs P1, P2 and P3 run on top of the operating system, which manages the hardware: CPUs, memory, and I/O and storage devices)
Summary 1.0.1
• Programs share hardware such as the CPU, the memory and storage devices. These devices have to be allocated to different programs based on user-specified priorities. The job of the OS is to perform this resource allocation fairly.
• There may be many running programs that are trying to access the same set of memory locations. This may be a security violation, or it may be a genuine shared memory-based communication pattern. There is a need to differentiate between the two by providing neat and well-defined mechanisms.
• Power, temperature and security concerns have become very im-
portant over the last decade. Any operating system that is being
designed today needs to run on very small devices such as mobile
phones, tablets and even smartwatches. In the foreseeable future,
they may run on even smaller devices such as smart glasses or de-
vices that are embedded within the body. Hence, it is important
for an OS to be extremely power-aware.
that they will work under severe power constraints and deliver a good user
experience. Moreover, in this case the code size and the memory footprint of
the OS need to be very small.
which most likely will be improvements, and then build on them. There were no
proprietary walls; this allowed the community to make rapid progress because
all incremental improvements had to be shared. However, at that point in time,
this was not the case with other pieces of software. Users or developers were not
duty-bound to contribute back to the mother repository. This meant that a lot
of the innovations made by large research groups and multinational companies
were not given back to the community.
Over the years, Linux has grown by leaps and bounds in terms of function-
ality and popularity. By 2000, it had established itself as a worthy desktop and
server operating system. People started taking it seriously and many academic
groups started moving away from UNIX to adopt Linux. Given that Linux was
reasonably similar to UNIX in terms of the interface and some other high-level
design decisions, it was easy to migrate from UNIX to Linux. The year 2003
was a pivotal year for the Linux community. That year, Linux kernel version 2.6
was released. It had a lot of advanced features and was very different from the
previous kernel versions. After this, Linux started being taken very seriously in
both academic and industry circles. In a certain sense, it had come of age and
had entered the big league. Many companies sprang up offering Linux-based
products, which included the kernel bundled with a set of packages (software
programs) along with custom support.
Over the years, Linux distributions such as Red Hat®, SUSE® and Ubuntu®
(Canonical®) have come to dominate the scene. As of writing this book, circa
2024, they continue to be major Linux vendors. Since 2003, a lot of other
changes have also happened. Linux has found many new applications – it has
made major inroads into the mobile and handheld market. The Android operating
system, which, as of 2023, dominates the mobile operating system space, is
based on the Linux kernel. Many of the operating systems for smart devices and
other wearable gadgets are based on Android. In addition, Google®'s Chrome
OS is also a Linux-derived variant. So are other operating systems for smart
TVs such as LG®'s webOS and Samsung®'s Tizen.
Point 1.2.1
It is important to understand the economic model. Especially in the
early stages, the GPL licensing model made a lot of difference and
was very successful in single-handedly propelling the Linux movement.
We need to understand that Linux carved a niche of its own in terms
of performance roughly a decade after the project began. The reason
it was grown and sustained by a large team of developers in its first,
formative decade is that they saw a future in it. This also included large
for-profit companies. The financial logic behind such extraordinary
foresight is quite straightforward.
Linux is not the only free open-source operating system. There are many
others that are derived from classical UNIX, notably the BSD (Berkeley
Software Distribution) family of operating systems. Some important
variants are FreeBSD®, OpenBSD and NetBSD®. Akin to Linux, their code
is also free to use and distribute. Of course, they follow a different licensing
mechanism, which is not as restrictive as the GPL. However, they are also very good
operating systems in their own right. They have their niche markets, and they
have a large developer community that actively adds features and ports them
to new hardware. A paper by your author and his student S. S. Singh [Singh
and Sarangi, 2020] nicely compares three operating systems – Linux, FreeBSD
and OpenBSD – in terms of performance across different workloads.
Linux versions are written as x.y.z-rc<num>, where x is the major version number, y the minor version number, z the patch number, and the optional -rc suffix denotes a release candidate (a test release).
Consider Linux version 6.2.16. Here, 6 is the major version number, 2 is the minor
version number and 16 is the patch number. Every ⟨major, minor⟩ version
pair has multiple patch numbers associated with it. A major version represents
important architectural changes. The minor version adds important bug fixes
and feature additions. A patch mostly focuses on minor issues and security-related
bug fixes. Every time there is an important feature-related commit, a
patch is created. Prior to 2004, even minor versions were associated with stable
versions and odd minor versions were associated with development versions.
Since Linux kernel version 3.0, this practice has not been adhered to. Every
version is stable now. Development versions are now release candidates that
predate stable versions.
Every new patch is associated with multiple release candidates. A release
candidate does not have major bugs; it incorporates multiple smaller fixes and
feature additions that are not fully verified. These release candidates are considered
experimental and are not fully ready to be used in a production setting.
They are numbered -rc1, -rc2, and so on. They are mainly aimed at other
Linux developers, who can download these release candidates, test their features,
suggest improvements and initiate a process of (mostly) online discussion. Once
the discussions have converged, the release candidates are succeeded by a stable
version (read patch or major/minor version).
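To make the numbering scheme concrete, the short C sketch below splits a version string into its components. The string and the parsing logic are purely illustrative and are not part of any kernel tool.

#include <stdio.h>

/* Illustrative sketch: split a kernel version string of the form
 * "x.y.z" or "x.y.z-rcN" into its components. */
int main(void) {
    const char *version = "6.2.16-rc3";   /* example string for illustration */
    int major = 0, minor = 0, patch = 0, rc = 0;

    int n = sscanf(version, "%d.%d.%d-rc%d", &major, &minor, &patch, &rc);
    printf("major=%d minor=%d patch=%d", major, minor, patch);
    if (n == 4)
        printf(" (release candidate %d)", rc);
    printf("\n");
    return 0;
}

For the example string above, the program prints "major=6 minor=2 patch=16 (release candidate 3)".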
Let us now provide an overview of the Linux code base (see Figure 1.4). The
architecture subsystem of the kernel contains all the code that is architecture
specific. The Linux kernel has a directory called arch that contains various
subdirectories. Each subdirectory corresponds to a distinct architecture such as
x86, ARM, Sparc, etc. An OS needs to rely on processor-specific code for various
critical tasks such as booting, low-level device handling and access to privileged
hardware operations. All of this code is neatly bundled up in the arch directory.
The rest of the operating system's code is independent of the architecture; it
is not dependent on the ISA or the machine. It relies on primitives, macros
and functions defined in the corresponding arch subdirectory. These abstractions
ensure that developers do not have to concern themselves with details of the
architecture such as whether it is 16-bit or 32-bit, little endian or big endian,
CISC or RISC. This subsystem contains more than 1.7 million lines of code.
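To make this separation concrete, consider the illustrative sketch below. The header and macro names are invented for the purpose of this example; the real kernel defines its own, much richer set of architecture-specific headers under the arch directory.

/* arch_example.h -- a hypothetical per-architecture header (illustration only).
 * Each subdirectory under arch/ would provide its own version of these
 * definitions; the names below are not the kernel's real macros. */
#ifndef ARCH_EXAMPLE_H
#define ARCH_EXAMPLE_H

#define ARCH_BITS            64      /* word size of this architecture    */
#define ARCH_LITTLE_ENDIAN   1       /* 1 = little endian, 0 = big endian */

static inline void arch_halt_cpu(void) {
    /* On x86 this could execute the hlt instruction; every other
     * architecture would supply its own implementation. */
}

#endif

/* Architecture-independent code only uses the abstractions above and never
 * needs to know which ISA it is actually running on. */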
The other subsystems that contain large volumes of code are the code
bases for the file system and the network stack. Note that a popular OS such
as Linux needs to support many file systems and network protocols. As a result,
the code base for these directories is quite large. The other subsystems, such as
memory management and security, are comparatively much smaller.
Figure 1.5 shows the list of prominent directories in the Linux kernel. The
kernel directory contains all the core features of the Linux kernel. Some of the
most important subsystems are the scheduler, time manager, synchronization
manager and the debugging subsystem. It is by far the most important subsystem
– the core kernel. We will focus a lot on this subsystem.
We have already seen the arch directory. A related directory is the init
directory that contains all the booting code. Both these directories are hardware
dependent.
The mm, fs, block and io_uring directories contain important code for the
memory subsystem, the file system and the I/O modules, respectively. The code for
virtualizing an operating system is resident in the virt directory. Virtualizing
the OS means that we can run an OS as a regular program on top of the Linux
OS. This subsystem is tightly coupled with the memory, file and I/O subsystems.
Finally, note that the largest directory is drivers, which contains drivers (specialized
programs for talking to devices) for a large number of I/O devices. This
directory is so large because an operating system such as Linux needs to support
a large amount of hardware. For every hardware device, we should not expect
the user to browse the web, locate its driver and install it. Hence, there is a
need to include its code in the code base of the kernel itself. At the same time,
we do not want to include the code of every single device driver on the planet
in the kernel's code base, because it would become prohibitively large. Rarely
used and obsolescent devices can be left out. Hence, the developers of the kernel
need to judiciously choose the set of drivers that are included in
the kernel’s code base, which is released and distributed. These devices should
be reasonably popular, and the drivers should be deemed to be safe (devoid of
security issues).
Figure 1.6 shows the list of chapters and appendixes in the book. All the
chapters use concepts that may require the reader to refer to the appendixes.
There are three appendixes in the book. Appendix A introduces the x86 as-
sembly language (the 64-bit variant). We shall refer to snippets of assembly
code throughout the text. To understand them thoroughly, it is necessary to be
familiar with x86 assembly. Most of the critical routines in operating systems
are still written in assembly language for speed and efficiency. Appendix B
describes the compiling, linking and loading process. This appendix should be
read thoroughly because it is important to understand how large C-based software
projects are structured. Readers should know the specific roles of C files,
header files, .o files, and statically and dynamically linked libraries. These concepts
are described in detail in that appendix. Finally, Appendix C introduces the most
commonly used data structures in the Linux kernel. A lot of the data structures
that we typically study in a basic undergraduate data structures course have
the urgent work is completed immediately and the rest of the work is completed
later. There are different types of kernel tasks that can do such deferred work.
They run with different priorities and have different features. Specifically, we
shall introduce softirqs, threaded IRQs and work queues. Finally, we shall
introduce signals, which are the reverse of system calls. The OS uses signals to
send messages to running processes. For example, if we press a mouse button,
then a message goes to the running process regarding the mouse click and its
coordinates. This happens via signals. Here also there is a need to change the
context of the user application because it now needs to start processing the
signal.
Chapter 5 is a long chapter on synchronization and scheduling. In any
modern OS, we have hundreds of running tasks that often try to access shared
resources concurrently. Many such shared resources can only be accessed by
one thread at a time. Hence, there is a need for synchronizing the accesses.
This is known as locking in the context of operating systems. Locking is a large
and complex field that has a fairly strong overlap with advanced multiprocessor
computer architecture. We specifically need to understand it in the context of
memory models and data races. Memory models determine the valid outcomes
of concurrent programs on a given architecture. We shall observe that it is
often necessary to restrict the space of outcomes using special instructions to
correctly implement locks. If locks are correctly implemented and used, then
uncoordinated accesses known as data races will not happen. Data races are
the source of a lot of synchronization-related bugs. Once the basic primitive
has been designed, we shall move on to discussing different types of locks and
advanced synchronization mechanisms such as semaphores, condition variables,
reader-writer locks and barriers. The kernel needs many concurrent data struc-
tures such as producer-consumer queues, mutexes, spinlocks and semaphores to
do its job. We shall look at their design and implementation in detail.
Next, we shall move on to explaining a very interesting synchronization
primitive that is extremely lightweight and derives its correctness by stopping
task preemption at specific times. It is known as the read-copy-update (RCU)
mechanism, which is widely used in the kernel code. It is arguably one of the
most important innovations made by the designers of the kernel, which has had
far-reaching implications. It has obviated the need for a garbage collector. We
shall then move on to discussing scheduling algorithms. After a cursory intro-
duction to trivial algorithms like shortest-job first and list scheduling, we shall
move on to algorithms that are actually used in the kernel such as completely
fair scheduling (CFS). This discussion will segue into a deeper discussion on
real-time scheduling algorithms where concrete guarantees can be made about
schedulability and tasks getting a specific pre-specified amount of CPU time.
In the context of real-time systems, another important family of algorithms
deals with locking and acquiring resources exclusively. It is possible that a low-
priority process may hold a resource for a long time while a high-priority process
is waiting for it. This is known as priority inversion, which needs to be avoided.
We shall study a plethora of mechanisms to avoid this and other problems in
the domain of real-time scheduling and synchronization.
Chapter 6 discusses the design of the memory system in the kernel. We shall
start with extending the concepts that we studied in Chapter 2 (architecture
fundamentals). The role of the page table, TLB, address spaces, pages and folios
will be made clear. For a course on operating systems, understanding these
Basically, the CPU and the devices are being virtualized here. As of today,
virtualization and its lightweight version, namely containers, are the most popular
technologies in the cloud computing ecosystem. Some popular virtualization
products are VMware vSphere®, Oracle VirtualBox® and XenServer®. They
are also known as hypervisors. Linux has a built-in hypervisor known as
KVM (the kernel-based virtual machine). We will study more about them in this chapter.
We will also look at lightweight virtualization techniques using containers that
virtualize processes, users, the network, file systems, configurations and devices.
Docker® and Podman® are important technologies in this space. In the last
part of this chapter we shall look at specific mechanisms for virtualizing the I/O
system and file systems, and finally conclude.
Exercises
Ex. 1 — What are the roles and functions of a modern operating system?
Ex. 2 — Is a system call like a regular function call? Why or why not?
Ex. 3 — Why is the drivers directory the largest directory in the kernel’s code
base?
Ex. 4 — What are the advantages of having a single arch directory that stores
all the architecture-specific code? Does it make writing the rest of the kernel
easier?
Ex. 5 — Write a report about all the open-source operating systems in use
today. Trace their evolution.
Figure 2.1 shows the organization of this chapter. The main aim of this
chapter is to cover all the computer architecture concepts needed to understand
modern operating systems. The objective is not to explain well-known computer
architecture concepts such as cores, caches and the memory system. The focus
is only on specific hardware features that are relevant for understanding a book
on operating systems.
We shall start by looking at the privileged mode of execution, which oper-
ating systems use. This is normally not taught in regular computer architecture
courses because regular user programs cannot access privileged registers and
privileged instructions. We need to look at such instructions because they are
very useful for writing software such as the Linux kernel. Privileged registers
can be used for controlling the underlying hardware such as turning off the dis-
play or the hard disk. Next, we shall discuss methods to invoke the OS and
application-OS communication. Normally, no OS code runs. The kernel
(core part of the OS) begins to run only when there is an event of interest: the
system boots, a hardware device raises an interrupt, there is a software bug such
as an illegal access or the running program raises a dummy software interrupt
to get the attention of the OS kernel. If interrupts are not naturally being gen-
erated, then there is a need to create dummy interrupts using a timer chip – a
Figure 2.1: Organization of this chapter – computer architecture basics, the memory system, the memory map, virtual memory and segmentation
approaches require the active involvement of the CPU. The third approach relies
on outsourcing this work to DMA (Direct Memory Access) engines that often
reside outside the chip. They do the entire job of transferring data to or from
the I/O device. Once the transfer is done, they raise an interrupt.
After reading this entire chapter, the reader will have sufficient knowledge
in specific aspects of computer architecture that are relevant from an OS per-
spective. The reader is strongly encouraged to also go through Appendixes A and
B. They cover the x86 assembly language and an introduction to the process of
compiling, linking and loading, respectively. We shall continuously be referring
to concepts discussed in these appendixes. Hence, it makes a lot of sense to go
through them after completing this chapter.
Figure: The memory hierarchy – core, caches and main memory
a very high storage density. The most important point that we need to keep in
mind here is that it is only the main memory – DRAM memory located outside
the chip – that is visible to software, notably the OS. The rest of the smaller
memory elements within the chip such as the L1, L2 and L3 caches are normally
not visible to the OS. Some ISAs have specialized instructions that can flush
certain levels of the cache hierarchy either fully or partially. Sometimes even
user applications can use these instructions. However, this is the only notable
exception. Otherwise, we can safely assume that almost all software, including
privileged software like the operating system, is unaware of the caches. Let us
live with the assumption that the highest level of memory that an OS can see
or access is the main memory.
Let us define the term memory space as the set of all addressable memory
locations. A software program, including the OS, perceives this memory space
as one large array of bytes. Any location in this space can be accessed and
modified at will. Later on, when we discuss virtual memory, we will refine this
abstraction.
Next, let us differentiate between CISC and RISC processors. RISC stands
for “Reduced Instruction Set Computer”. A lot of the modern ISAs such as
ARM and RISC-V are RISC instruction sets, which are regular and simple.
RISC ISAs and processors tend to use registers much more than their CISC
(complex instruction set) counterparts. CISC instructions can have long im-
mediates (constants) and may also use more than one memory operand. The
instruction set used by Intel and AMD processors, x86, is a CISC ISA. Regard-
less of the type of the ISA, registers are central to the operation of any program
(be it RISC or CISC). The compiler needs to manage them efficiently.
2.1.3 Registers
General Purpose Registers
Let us look at the space of registers in some more detail. All the registers that
regular programs use are known as general purpose registers. They are visible
to all software including the compiler. Note that almost all the programs that
are compiled today use registers and the author is not aware of any compilation
model or any architectural model that does not rely on registers.
Privileged Registers
A core also has a set of registers known as privileged registers, which only the OS
or software with similar privileges can access. In Chapter 8, we shall look at hy-
pervisors or virtual machine managers (VMMs) that run with OS privileges. All
such software are known as system software or privileged mode software. They
are given special treatment by the CPU – they can access privileged registers.
For instance, an ALU has a flags register that stores its state, especially the
state of instructions that have executed in the past such as comparison instruc-
tions. Often these flags registers are not fully visible to regular application-level
software. However, they are visible to the OS and anything else that runs with
OS privileges such as VMMs. It is necessary to have full access to these registers
to enable multitasking: running multiple programs on a core one after the other.
We also have control registers that can enable or disable specific hardware
features such as the fan or the LED lights on the chassis, and can even turn off the
system itself. We do not want all the instructions that change the values stored
in these registers to be visible to regular programs because then a user appli-
cation can create havoc. Hence, we entrust only a specific set of programs (OS
and VMM) with access rights to these registers.
Then, there are debug registers that are meant to debug hardware and sys-
tem software. Given the fact that they are privy to additional information and
can be used to extract information out of running programs, we do not allow
regular programs to access these registers. Otherwise, there will be serious se-
curity violations. However, from a system designer’s point of view or from the
OS’s point of view these registers are very important. This is because they
give us an insight into how the system is operating before and after an error
is detected – this information can potentially allow us to find the root cause of
bugs.
Finally, we have I/O registers that are used to communicate with externally
placed I/O devices such as the monitor, printer and network card. Here again,
we need privileged access. Otherwise, we can have serious security violations,
c Smruti R. Sarangi 28
and different applications may try to monopolize an I/O resource. They may
not allow other applications to access them. Hence, the OS needs to act as a
broker. Its job is to manage, restrict and regulate accesses.
Given the fact that we have discussed so much about privileged registers, let
us see how the notion of privileges is implemented. Note that we need to ensure
that only the OS and related system software such as the VMM can have access
to privileged resources such as the privileged registers.
Figure: Privilege rings – user applications run in Ring 3, whereas the OS kernel runs in Ring 0
Signal A system call is a message that is sent from the application to the
OS. A signal is the reverse. It is a message that is sent from the OS
to the application. An example of this would be a key press. In this
case, a hardware interrupt is generated, which is processed by the OS.
The OS reads the key that was pressed, and then figures out the process
that is running in the foreground. The ASCII value of this key needs to
be communicated to this process. The signal mechanism is the method
that is used. In this case, a function registered by the process with the
OS to handle a “key press” event is invoked. The running application
process then gets to know that a certain key was pressed and depending
upon its logic, appropriate action is taken. A signal is basically a callback
function that an application registers with the OS. When an event of
interest happens (pertaining to that signal), the OS calls the callback
function in the application context. This callback function is known as
the signal handler.
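As a small user-space illustration of the signal-handler idea described above, the C program below registers a callback for SIGINT (the signal delivered when the user presses Ctrl+C). This is a minimal sketch using the standard POSIX sigaction API; it is not kernel code.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* The signal handler: the callback that the OS invokes in the
 * application's context when the event of interest occurs. */
static void on_sigint(int signo) {
    (void)signo;
    got_signal = 1;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;          /* register the callback with the OS */
    sigaction(SIGINT, &sa, NULL);

    printf("Press Ctrl+C to deliver SIGINT...\n");
    while (!got_signal)
        pause();                        /* sleep until a signal arrives */
    printf("Caught SIGINT via the registered signal handler\n");
    return 0;
}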
As we can see, communicating with the OS does require some novel and un-
conventional mechanisms. Traditional methods of communication that include
writing to shared memory or invoking functions are not used because the OS
runs in a separate address space and also switching to the OS is an onerous
activity. It also involves a change in the privilege level and a fair amount of
bookkeeping is required at both the hardware and software levels, as we shall
see in subsequent chapters.
As we can see, all that we need to do is load the number of
the system call into the rax register. The syscall instruction subsequently does
the rest. It generates a dummy interrupt, stores some data corresponding to
the state of the executing program (for more details, refer to [Sarangi, 2021])
and loads the appropriate system call handler. An older approach is to directly
generate an interrupt using the instruction int 0x80. Here, the code 0x80
stands for a system call. However, as of today, this method is not used for x86
processors.
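The snippet below shows what this looks like in practice: a write system call invoked directly via the syscall instruction using GCC inline assembly on x86-64 Linux. The system call number for write (1) and the argument registers (rdi, rsi, rdx) follow the standard x86-64 Linux convention; normally one would simply call the C library wrapper instead. This is a minimal sketch, not kernel code.

/* Invoke write(1, msg, len) directly via the syscall instruction (x86-64 Linux).
 * rax holds the system call number; rdi, rsi and rdx hold the first three
 * arguments. */
int main(void) {
    const char msg[] = "hello via syscall\n";
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)
        : "a"(1L),                      /* system call number: 1 = write   */
          "D"((long)1),                 /* rdi: file descriptor 1 (stdout) */
          "S"(msg),                     /* rsi: buffer                     */
          "d"((long)(sizeof(msg) - 1))  /* rdx: number of bytes            */
        : "rcx", "r11", "memory");      /* syscall clobbers rcx and r11    */
    return ret < 0;
}

Running this program prints the message without going through the C library's write() wrapper.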
Figure 2.4: The register state (context) of a running program – the general purpose registers, the flags and special registers, memory, and the PC
Figure 2.4 shows an overview of the process to store the context of a running
program. The state of the running program comprises the contents of the general
purpose registers, contents of the flags and special purpose registers, the memory
and the PC (program counter). Towards the end of this chapter, we shall
see that the virtual memory mechanism stores the memory space of a process
very effectively and stops other processes from unintentionally or maliciously
modifying it. Hence, we need not bother about storing and restoring the memory
contents of a process. It is not affected by the context switch and restore process.
Insofar as the remaining three elements are concerned, we can think of
them as the volatile state of the program, which is erased when there is a
context switch. As a result, a hardware mechanism is needed to read all of
them and store them in memory locations that are known a priori. We shall see
that there are many ways of doing this and there are specialized and privileged
instructions that are used.
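A rough C-style sketch of what such a saved context might look like is shown below. The field list is illustrative only; the real layout (for example, Linux's struct pt_regs on x86-64) is architecture specific and contains more fields.

#include <stdint.h>

/* Illustrative sketch of the volatile state saved on a context switch
 * (x86-64 flavored). This is not the kernel's actual structure. */
struct saved_context {
    uint64_t gpr[16];   /* general purpose registers (rax, rbx, ..., r15) */
    uint64_t rip;       /* program counter                                */
    uint64_t rflags;    /* flags register                                 */
    uint64_t rsp;       /* stack pointer                                  */
    uint64_t cs, ss;    /* code and stack segment selectors               */
};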
For more details about what exactly the hardware needs to do, readers can
refer to the computer architecture text by your author [Sarangi, 2021]. In the
example pipeline in the reference, the reader will appreciate the need for having
specialized hardware instructions for automatically storing the PC, the flags and
c Smruti R. Sarangi 32
special registers, and possibly the stack pointer in either privileged registers or
a dedicated memory region. Regardless of the mechanism, we have a known
location where the volatile state of the program is stored, and it can later on
be retrieved by the interrupt handler. For clarity and readability, we will use
the term interrupt handler to refer to traditional interrupt handlers, as well as
exception handlers and system call handlers, whenever the context makes this
clear.
Subsequently, the first task of the interrupt handler is to retrieve the program
state or context of the executing program – either from specialized registers or
a dedicated memory area. Note that these temporary locations may not store
the entire state of the program, for instance they may not store the values of all
the general purpose registers. The interrupt handler will thus have to do more
work and retrieve the full program state. Regardless of the specific mechanism,
the role of the interrupt handler is to collect the full state of the executing
program and ultimately store it somewhere in memory, from where it can easily
be retrieved later.
Restoring the context of a program is quite straightforward. We need to
follow the reverse sequence of steps.
The life cycle of a process can thus be visualized as shown in Figure 2.5. The
application program executes; it is interrupted for a certain duration during which the
OS takes over; then the application program is resumed at the point at which
it was interrupted. Here, the word “interrupted” needs to be understood in a
very general sense. It could be a hardware interrupt, a software interrupt like a
system call or a program-generated exception.
Figure 2.5: The life cycle of a process (active and interrupted phases)
Timer Interrupts
There is an important question to think about here. Consider a system where
there are no interrupts and executing processes do not generate system calls
and exceptions. Assume that there are n cores, and each core runs such a
process that does not lead to system calls or exceptions. This means that the
OS will never get executed because its routines will never get invoked. Note
that the operating system never executes in the background (as one might
naively believe) – it is a separate program that needs to be invoked by a very
special set of mechanisms, namely system calls, exceptions and interrupts. Let
us refer to these as events of interest. The OS cannot come into the picture
(execute on a core) any other way.
Now, we are looking at a very peculiar situation where all the cores are
occupied with programs that do none of the above. There are no events of
interest. The key question that we need to answer is whether the system becomes
unresponsive if these programs decide to run for a long time. Is rebooting the
system the only option?
Question 2.1.1
Figure: A timer chip periodically sends timer interrupts to the CPU's cores
The timer chip is thus perhaps the most integral part of a machine that supports an operating system. The key
insight is that it is needed to ensure that the system stays responsive and that
the OS code is executed periodically. The operating system kernel has full control
over the processes that run on cores, the memory, storage devices and I/O
systems. Hence, it needs to run periodically such that it can effectively manage
the system and provide a good quality of experience to users.
Inter-processor interrupts
As we have discussed, the OS gets invoked on one core and its subsequent job is
to take control of the system and basically manage everything such as running
processes, waiting processes, cores, devices and memory. Often there is a need
to ascertain whether a process has been running for a long time and whether it
needs to be swapped out. If there is a need to swap it out, then the OS
finds the most eligible process (using its scheduler) and runs it.
If the new process needs to run on the core on which the OS is executing,
then the task is simple: the OS merely needs to load the context of the
process that it wants to run. If a process on a different core
needs to be swapped out to make room for the selected process, the mechanism
Trivia 2.2.1
In most modern ISAs, load and store instructions read their base ad-
dresses from registers. They add a constant offset to it. The size of
registers thus determines the range of addresses that can be accessed. If
registers are 32 bits wide, then the size of the address space is naturally
constrained to 2^32 bytes.
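As a small illustration of base-plus-offset addressing, consider the C fragment below. The comment shows the kind of instruction a compiler typically generates for it; the exact register allocation is, of course, compiler dependent.

#include <stdint.h>

int64_t fourth_element(int64_t *arr) {
    /* A compiler typically turns this into a single load with a base
     * register and a constant offset, e.g. (x86-64, AT&T syntax):
     *     movq 24(%rdi), %rax
     * i.e., address = base register + offset 24 (3 * 8 bytes). */
    return arr[3];
}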
Compatibility Problem
As a rule, in an n-bit architecture, where the register size is n bits, we assume
that the instructions can access any of the addressable 2^n bytes unless there are
specific constraints. The same assumption needs to be made by the program-
mer and the compiler because they only see the registers. Other details of the
memory system are not directly visible to them.
Note that a program is compiled only once on the developers’ machines and
then distributed to the world. If a million copies are running, then we can rest
assured that they are running on a very large number of heterogeneous devices.
These devices can be very different from each other. Of course, they will have
to share the same ISA, but they can have radically different main memory sizes
and even cache sizes. Unless we assume that all the 2^n addresses are accessible
to a program or process, no other elegant assumption can be made. This may
sound impractical for 64-bit machines, but it is the cleanest assumption
that can be made.
This has the potential to cause problems. For example, if we assume that a
process can access 4 GB at will, it will not run on a system with 1 GB of memory,
unless we find a mechanism to do so. We thus have a compatibility problem
here, where we want our process to assume that addresses are n bits wide (n is
typically 32 or 64), yet run on machines with all memory sizes (typically much
lower than the theoretical maximum).
Definition 2.2.1 Compatibility Problem
Processes assume that they can access any byte in a hypothetically large
memory region of size 2^32 or 2^64 bytes at will (for 32-bit and 64-bit
systems, respectively). Even if processes are actually accessing very little
data, there is a need to create a mechanism to run them on physical
machines with far lower memory (let’s say a few GBs). The memory
addressing scheme is not compatible with the physical memory system
of real machines. This is the compatibility problem.
that are either running one after the other (using multitasking mechanisms) or
are running in parallel on different cores. These processes can access the same
address because nothing prohibits them from doing so.
Overlap Problem
In this case, unbeknownst to them, multiple processes can corrupt each other's
state by writing to the same address. One program can be malicious, and then
it can easily get access to the other’s secrets. For example, if one process stores
a credit card number, another process can read it straight out of memory. This
is clearly not allowed and presents a massive security risk. Hence, we have two
opposing requirements over here. First, we want an addressing mechanism that
is as simple and straightforward as possible such that programs and compilers
remain simple and assume that the entire memory space is theirs. This is a
very convenient abstraction. However, on a real system, we also want different
processes to access a different set of addresses such that there is no unintended
overlap between the set of memory addresses that they access. This is known
as the overlap problem.
Definition 2.2.2 Overlap Problem
Unless adequate steps are taken, it is possible for two processes to access
overlapping regions of memory; it is also possible to gain unauthorized
access to other processes' data by simply reading values that they write
to memory. This is known as the overlap problem.
Size Problem
We are sadly not done with our set of problems; it turns out that we have
another serious problem on our hands. It may happen that we want to run a
program whose memory footprint is much more than the physical memory that
is present on the system. For instance, the memory footprint could be 2 GB
whereas the total physical memory is only 1 GB. It may be convenient to
say that we can simply deny the user the permission to execute the program on
such a machine. However, the implications of this are severe. It basically means
that any program that is compiled for a machine with more physical memory
cannot run on a machine with less physical memory. This means that it will
cease to be backward compatible – not compatible with older hardware that has
less memory. In terms of a business risk, this is significant.
Hence, all attempts should be made to ensure that such a situation does not
arise. It turns out that this problem is very closely related to the overlap and
compatibility problems that we have seen earlier. It is possible to slightly repur-
pose the solution that we shall design for solving the overlap and compatibility
problems.
Summary 2.2.1
a real system.
The memory map is partitioned into distinct sections. It starts from address
zero. Then after a fixed offset, the text section starts, which contains all the
program’s instructions. The processor starts executing the first instruction at
the beginning of the text section and then starts fetching subsequent instructions
as per the logic of the program. Once the text section ends, the data section
begins. It stores initialized data that comprises global and static variables that
are typically defined outside the scope of functions. After this, we have the bss
(block starting symbol) section, which stores the same kind of variables; however,
they are uninitialized. It is possible that one process has a very small data
section and another process has a very large data section – it all depends upon
how the program is written.
Then we have the heap and the stack. The heap is a memory region that
stores dynamically allocated variables and data structures, which are typically
allocated using the malloc call in C and the new operator in C++ and Java. Tra-
ditionally, the heap section has grown upwards (towards increasing addresses).
As and when we allocate new data, the heap size increases. It is also possible
for the heap size to decrease as we free or dynamically delete allocated data
structures. Then there is a massive hole, which basically means that there is
a very large memory region that doesn’t store anything. Particularly, in 64-bit
machines, this region is indeed extremely large.
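The small C program below illustrates which section of the memory map each kind of variable typically lands in. The section assignments in the comments reflect common compiler behavior on Linux; they are not mandated by the C standard. (The stack, used for local variables, is discussed next.)

#include <stdlib.h>

int initialized_global = 42;   /* data section (initialized global)        */
int uninitialized_global;      /* bss section (uninitialized global)       */
static int counter;            /* also bss: uninitialized static variable  */

int main(void) {               /* the code of main() lives in the text section */
    int local = 7;             /* local variable: lives on the stack        */
    int *dynamic = malloc(100 * sizeof(int));  /* allocated on the heap     */
    if (dynamic == NULL)
        return 1;

    dynamic[0] = initialized_global + local + counter + uninitialized_global;
    free(dynamic);             /* freeing may shrink the heap again         */
    return 0;
}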
Next, at a very high memory location (0xC0000000 in 32-bit Linux), the
stack starts. The stack typically grows downwards (grows towards decreasing
addresses). Given the fact that there is a huge gap between the end of the heap
and the top of stack, both of them can grow to be very large. If we consider
the value 0xC0000000, it is actually 3 GB. This basically means that in a 32-bit
process. The starting address of this region is set as the contents of the base
register. The address sent to the memory system is computed by adding the
address computed by the CPU to the contents of the base register. All the
addresses computed by the CPU are as per the memory map of the process;
however, the addresses sent to the memory system are different. In this system,
if the process accesses an address that is beyond the limit register, then a fault
is generated. Refer to Figure 2.9 for a graphical illustration of the base-limit
scheme.
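The base-limit scheme can be summarized by the small sketch below. It models the check and the addition that the hardware performs; the function names and the fault handling are illustrative and do not describe any specific machine.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Per-process relocation information (illustrative). */
struct base_limit {
    uint64_t base;    /* start of the process's region in physical memory */
    uint64_t limit;   /* size of the region in bytes                      */
};

/* Translate a CPU-generated address to the address sent to memory. */
uint64_t translate(const struct base_limit *bl, uint64_t cpu_addr) {
    if (cpu_addr >= bl->limit) {          /* beyond the limit register */
        fprintf(stderr, "fault: address 0x%llx out of bounds\n",
                (unsigned long long)cpu_addr);
        exit(1);                          /* a real CPU would raise a fault */
    }
    return bl->base + cpu_addr;           /* add the base register */
}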
We observe that there are many processes, and they have their memory re-
gions clearly demarcated. Therefore, there is no chance of an overlap. This idea
does seem encouraging, but this is not going to work in practice for a combi-
nation of several reasons. The biggest problem is that neither the programmer
nor the compiler knows for sure how much memory a program requires at run
time. This is because for large programs, the user inputs are not known, and
thus the total memory footprint is not predictable. Even if it is predictable, we
will have to budget for a very large footprint (conservative maximum). In most
cases, this conservative estimate is going to be much larger than the memory
footprints we may see in practice. We may thus end up wasting a lot of memory.
Hence, in the memory region that is allocated to a process between the base and
limit registers, there is a possibility of a lot of memory getting wasted. This is
known as internal fragmentation.
Let us again take a deeper look at Figure 2.9. We see that there are holes or
unallocated memory regions between allocated memory regions. Whenever we
want to allocate memory for a new process, we need to find a hole that is larger
than or equal to what we need and then split it into an allocated region and
a smaller hole. Very soon we will have many such holes in the memory space,
which cannot be used for allocating memory to any other process. It may be the
case that we have enough memory available, but it is just that it is partitioned
among so many processes that we do not have a contiguous region that is large
enough. This situation where a lot of memory is wasted in such holes is known
as external fragmentation.
There are many ways of solving this problem. Some may argue that periodically
we can compact the memory space by reading data, transferring it
to a new region and updating the base and limit registers for each process. In
this case, we can essentially merge holes and create enough space by creating
one large hole. The problem is that a lot of reads and writes are involved in
this process and during that time the process needs to remain mostly stalled.
Another problem is that the prediction of the maximum memory usage may
be wrong. A process may try to access memory that is beyond the limit register.
As we have argued, in this case a fault is generated. However, this can be avoided
if we allocate another memory region and link the second memory region to the
first (using a linked list like structure). The algorithm now is that we first
access the memory region that is allocated to the process and if the offset is
beyond the limit register, then we access the second memory region. The
second memory region will also have base and limit registers. We can
extend this idea and create a linked list of such memory regions. We can also
save time by having a lookup table. It will not be necessary to traverse linked
lists. Given an address, we can quickly figure out the memory region in which
it lies. Many of the early approaches focused on such techniques, and
they grew to become very complex, but soon the community realized that this
is not a scalable solution, and it is definitely not elegant.
However, an important insight came out of this exercise. It was that the ad-
dress that is generated by the CPU, which is also the same address that the
programmer, process and compiler see, is not the address that is ultimately sent
to the memory system. Even in this simple case, where we use a base and limit
register, the address generated by the program is actually added to the contents
of the base register to generate the real memory address. The real or physical
address is sent to the memory system. The gateway to the memory system is
the instruction cache for instructions and the L1 data cache for data. They
only see the physical address. On the other hand, the address generated by the
CPU is known as the virtual address. There is a need to translate or convert
the virtual address to a physical address such that we can access memory and
solve the overlap problem, as well as the compatibility problem.
A few ideas emerge from this discussion. Given a virtual address, there
should be a table that we can look up, and find the physical address that it
maps to. Clearly, one virtual address will always be mapped to one physical
address. This is a common sense requirement. However, if we can also ensure
that every physical address maps to only one virtual address across processes
(barring special cases), or in other words there is a strict one-to-one mapping,
then we observe that no overlaps between processes are possible. Regardless of
how hard a process tries, it will not be able to access or overwrite the data that
belongs to any other process. In this case we are using the term data in the
general sense – it encompasses both code and data. Recall that in the memory
system, code is actually stored as data.
The crux of the entire definition of virtual memory (see Definition 2.2.5) is
that we have a mapping table that maps each virtual address (that is used by
the program) to a physical address. If the mapping satisfies some conditions,
then we can solve all the three problems. So the main technical challenge in
front of us is to properly and efficiently create the mapping table to implement
an address translation system.
Figure 2.10: Conceptual overview of the virtual memory-based page mapping system – each process's virtual pages are mapped to physical frames
storage overhead per process, because every process needs its own page table.
Now assume that we have 100 processes in the system; we therefore need 250
MB just to store page tables!!!
This is a prohibitive overhead. If we consider a 64-bit memory system,
then the page table storage overhead is even larger and clearly this idea will not
work. It represents a tremendous wastage of physical memory space. Let us thus
propose optimizations. To start with, note that most of the virtual address space
is actually not used. In fact, it is quite sparse particularly between the stack and
the heap. This region can actually be quite large (refer to Section 2.2.1). Recall
that the beginning of the virtual address space is populated with the text,
data, bss and heap sections. Then there is a massive hole between the heap
and the stack. The stack starts at the upper boundary of the virtual address
space. In some cases, memory regions corresponding to memory mapped files
and dynamic libraries can occupy a part of this region. We shall still have large
gaps and have a significant amount of sparsity. This insight can be used to
design a multilevel page table, which can leverage this pattern and prove to be
a far more space-efficient solution.
Figure: Splitting a 48-bit virtual address for a four-level page table walk. The top 16 bits of the 64-bit virtual address are assumed to be zero. Bits 48-40 index the Level 1 page table, bits 39-31 the Level 2 table, bits 30-22 the Level 3 table and bits 21-13 the Level 4 table; the lowest 12 bits are the intra-page offset. The CR3 register points to the Level 1 page table, and the walk yields a 52-bit frame address.
vary in their lower bits; however, in all likelihood their more significant bits will
be the same. To cross-check, count from 0 to 999 (in base 10). The unit's digit
changes the most frequently. It changes with every number. The ten’s digit on
the other hand changes more infrequently. It changes after every 10 numbers,
and the hundred’s digit changes even more infrequently. It changes once for
every 100 numbers. By the same logic, when we consider binary addresses we
expect the more significant bits to change far less often than the less significant
bits. Given this insight let us proceed to design an optimized version of the
page table.
Let us consider the first set of 9 bits (bits 40-48). They can be used to
access 2^9 (= 512) entries in a table. Let us create a Level 1 page table that is
indexed using these 9 bits. An entry in this table is either null (empty) or points
to a Level 2 page table. Given our earlier explanation about the structure of
the memory map, we expect most of the entries in the Level 1 page table to
be null. This will happen because the memory map is sparse and each set of
top-level 9 bits points to a large contiguous region. Most of these regions will
be unallocated. This means that we shall have to allocate space for very few
Level 2 page tables. There is no need to allocate a Level 2 page table if the
set of addresses that it corresponds to are all unallocated. For example, assume
that there are no allocated virtual addresses whose top 9 bits (bits 40-48) are
equal to the binary sequence 011010100. Then, the corresponding row in the
L1 page table will store a null value and no corresponding Level 2 page table
will be allocated. This is the key insight that allows us to save space.
Note that we need to store the address of the Level 1 page table somewhere.
This is typically stored in a machine specific register on Intel hardware called
the CR3 register. Whenever a process is loaded, the address of its Level 1
page table is loaded into the CR3 register. Whenever there is a need to find a
mapping in the page table, the first step is to read the CR3 register and find
the base address of the Level 1 page table. It is a part of the process’s context.
We follow a similar logic at the next level. The only difference is that in this
case there may be multiple Level 2 page tables. Unlike the earlier case, we don’t
need to store their starting addresses in dedicated registers. The Level 1 entries
point to their respective base addresses. We use the next 9 bits to index entries
in the Level 2 page tables. In this case, we expect more valid non-null entries.
We continue with the same method. Each Level 2 page table entry points to the
starting address of a Level 3 page table, and finally each Level 3 page table entry
points to a Level 4 page table. We expect more and more valid entries at each
level. Finally, an entry in a Level 4 page table stores the corresponding frame’s
address. The process of address translation is thus complete. Each entry of the
Level 4 page table is known as a page table entry. We shall later see that it
contains the physical address of the frame and contains a few more important
pieces of information.
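The logic of the four-level walk can be captured by the short sketch below. It assumes the 9 + 9 + 9 + 9 + 12 bit split described above (note that the code numbers bits from 0, whereas the text numbers them from 1) and uses ordinary pointers to stand in for page table entries; a real implementation stores physical addresses along with permission and status bits in each entry.

#include <stdint.h>

/* A simplified page table level: 512 entries, each pointing to the next
 * level (or, at Level 4, storing the frame's base address). Real entries
 * also carry permission and status bits. */
struct page_table {
    uint64_t entry[512];
};

/* Walk a 4-level page table for a 48-bit virtual address.
 * Returns 0 if any level is unmapped (a page fault in a real system). */
uint64_t walk(struct page_table *level1, uint64_t vaddr) {
    uint64_t offset = vaddr & 0xFFF;              /* lowest 12 bits         */
    struct page_table *table = level1;            /* from the CR3 register  */

    for (int level = 0; level < 4; level++) {
        /* Extract the 9-bit index for this level (bits 47..39, 38..30, ...). */
        unsigned idx = (unsigned)((vaddr >> (39 - 9 * level)) & 0x1FF);
        uint64_t entry = table->entry[idx];
        if (entry == 0)
            return 0;                             /* unmapped: page fault   */
        if (level == 3)
            return entry + offset;                /* frame address + offset */
        table = (struct page_table *)entry;       /* descend to next level  */
    }
    return 0;                                     /* not reached */
}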
Note that we had to go through 4 levels to translate a virtual address to a
physical address. Reading the page table is thus a slow operation. If parts of
this table are in the caches, then the operation may be faster. However, in the
worst case, we need to make 4 reads to main memory, which requires more than
1000 cycles. This is a very slow operation. The sad part is that we need to
do this for every memory access!!! This is clearly an infeasible idea given that
roughly a third of the instructions are memory accesses.
The page tables and TLBs store some additional information. They store some
permission information. For security reasons, a program is typically not allowed
to write to code pages. Otherwise, it is easy for viruses to modify the code pages
such that a program can execute code that an attacker wants it to execute.
Sometimes, we want to create an execute-only page, for instance when there are
specific licensing requirements and the developers don't want user programs to
read the code that is being executed, since reading it would make it easy to find
loopholes. We can thus
associate three permission bits with each page: read, write and execute. If a
bit is 1, then the corresponding permission is granted to the page. For instance,
if the bits are 101, then it means that the user process can read and execute
code in the page, but it cannot write to the page. These bits are stored in each
page table entry and also in each TLB entry. The core needs to ensure that the
permission bits are respected.
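A minimal sketch of how such permission bits might be checked is shown below; the bit layout (read, write, execute) mirrors the 3-bit example in the text, and the function is purely illustrative.

#include <stdbool.h>
#include <stdint.h>

/* Permission bits stored in a page table / TLB entry (illustrative layout). */
#define PAGE_READ    0x4   /* bit pattern 100 */
#define PAGE_WRITE   0x2   /* bit pattern 010 */
#define PAGE_EXEC    0x1   /* bit pattern 001 */

/* Check whether a requested access is allowed by the page's permission bits.
 * For example, perms = PAGE_READ | PAGE_EXEC (binary 101) allows reading and
 * executing the page, but not writing to it. */
bool access_allowed(uint8_t perms, uint8_t requested) {
    return (perms & requested) == requested;
}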
We can additionally have a bit that indicates whether the page can be ac-
cessed by the kernel or not. Most OS kernel pages are not accessible in user
space. The page table can store this protection information, which stops user
processes from accessing and mapping kernel pages.
Sometimes, the page is present in memory, but the user process does not
have adequate permissions to access the page. This is known as a soft page
fault, which usually generates an exception. When we discuss the MGLRU page
replacement algorithm, we shall observe that sometimes this mechanism proves
to be quite handy. We can deliberately induce soft page faults to track page
accesses. This gives us an idea about the popularity of a process’s pages.
An inverted page table maps a physical frame to all the virtual pages
(across processes) that are mapped to it.
Figure: The x86 segment registers – cs, ds, ss, es, fs and gs
Figure 2.14 shows the way that segmented memory is addressed. The phi-
losophy is broadly similar to the paging system for virtual memory where the
Figure 2.14: Segmented memory addressing – the segment base address, obtained from the segment register (via the GDT), is added to the logical address to produce a linear address, which is then translated by the virtual memory system
As seen in Figure 2.14, the base address stored in the relevant segment
register is added to the logical address. The resultant linear address further
undergoes translation to generate the physical address. This is then sent to the
memory system.
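The two-step computation described above can be summarized in a couple of lines of C. The function below is a conceptual sketch: it only models the addition of the segment base and treats the subsequent paging translation as a separate step.

#include <stdint.h>

/* Conceptual sketch of segmented addressing:
 * linear address = segment base (from the segment register / descriptor)
 *                + logical address generated by the program.
 * The linear address is subsequently translated by the paging mechanism. */
uint64_t to_linear_address(uint64_t segment_base, uint64_t logical_addr) {
    return segment_base + logical_addr;
}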
the I/O devices are properly interfaced. These additional chips comprise the
chipset. The motherboard is the printed circuit board that houses the CPUs,
memory chips, the chipset and the I/O interfacing hardware ports.
2.3.1 Overview
Figure 2.15: A traditional motherboard design – the CPU connects to the Northbridge chip (which interfaces with the GPU and the PCI slots), and the Southbridge chip connects the keyboard, mouse, USB ports and other I/O chips
Any processor chip has hundreds of pins. Complex designs have a thousand or
more pins. Most of them are there to supply current to the chip: power and
ground pins. We need so many pins because modern processors draw a lot of
current. Note that a pin has limited current delivery capacity. However, a few
hundred pins are typically left for communication with external entities such as
the memory chips, off-chip GPUs and I/O devices.
Memory chips have their dedicated memory controllers on-chip. These mem-
ory controllers are aware of the number of memory chips that are connected and
how to interact with them. This happens at the hardware level and the OS is
blissfully unaware of what goes on here. Depending on the motherboard, there
could be a dedicated connection to an off-chip GPU. An ultra-fast and high-
bandwidth connection is required to a GPU that is housed separately on the
motherboard. Such buses (sets of copper wires) have their own controllers that
are typically on-chip.
Figure 2.15 shows a traditional design where the dedicated circuitry for com-
municating with the main memory modules and the GPU are combined, and
added to the Northbridge chip. The Northbridge chip was traditionally
resident on the motherboard (outside the CPU chip). However, in most modern pro-
cessors today, the logic used in the Northbridge chip has moved into the main
CPU chip. It is much faster for the cores and caches to communicate with an
on-chip component. Given that both the main memory and GPU have very high
bandwidth requirements, this design decision makes sense. Alternative designs
are also possible where the Northbridge logic is split into two and is placed at
different ends of the chip: one part communicates with the GPU and the other
part communicates with the memory modules.
To communicate with other slower I/O devices such as the keyboard, mouse,
USB devices and the hard disk, a dedicated controller chip called the South-
bridge chip is used. In most modern designs, this chip is resident outside the
Once the read/write operation is done, the data read from the device and the
status of the operation are passed on to the program that requested the I/O
operation.
If we dive in further, we observe that an in instruction is a message that is
sent to the chip on the motherboard that is directly connected to the I/O device.
Its job is to further interpret this instruction and send device-level commands
to the device. It is expected that the chip on the motherboard knows which
message needs to be sent. The OS need not concern itself with such low-level
details. For example, a small chip on the motherboard knows how to interact
with USB devices. It handles all the I/O. It just exposes a set of I/O ports
to the CPU that are accessible via the in/out instructions. Similar is the case for
out instructions, where the device drivers simply write data to I/O ports. The
corresponding chip on the motherboard knows how to translate this to device-
level commands.
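To make this concrete, here is a minimal sketch (not one of the book's listings) of how a legacy Linux driver might use these instructions. The port number 0x64 is the classic PS/2 keyboard controller status/command port; the function names are made up for illustration.

#include <linux/io.h>   /* inb(), outb() */

/* Hedged sketch: read the PS/2 keyboard controller's status port and send
   it a command. Each call maps to a single in/out instruction on x86. */
static u8 read_keyboard_status(void)
{
    return inb(0x64);        /* 'in' instruction: read the status port */
}

static void send_keyboard_command(u8 cmd)
{
    outb(cmd, 0x64);         /* 'out' instruction: write the command port */
}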
Using I/O ports is the oldest method to realize I/O operations and has
been around for the last fifty years. It is known as port-mapped I/O (PMIO).
It is, however, a very slow method: only a small amount of data (1-4 bytes) can
be transferred per I/O instruction, so moving more data requires issuing many
such instructions. This method is fine for control messages but not for data
messages in high-bandwidth devices such as network cards. There is a need for
a faster method.
Figure 2.16: Memory-mapped I/O: a region of the virtual address space is mapped to the I/O addresses (ports) of an I/O device.
The faster method is to directly map regions of the virtual address space
to an I/O device. Insofar as the OS is concerned, it makes regular reads and
writes. The TLB however stores an additional bit indicating that the page is
an I/O page. The hardware automatically translates memory requests to I/O
requests. There are several advantages of this scheme (refer to Figure 2.16).
The first is that we can send a large amount of data in one go. The x86
architecture has instructions such as rep movs and rep stos that enable the
programmer to move hundreds of bytes between addresses in one go. These
instructions can be used to transfer kilobytes to/from I/O space. The hardware
on the chipset can then use fast mechanisms to ensure that this process is
realized as soon as possible.
On the processor's side, the advantage is clear. All that we
need is a few instructions to transfer a large amount of data. This reduces the
instruction processing overhead at the CPU's end and keeps the program
simple – we only need to use load and store instructions. I/O devices and
chips in the chipset have also evolved to support memory-mapped I/O. Along
with their traditional port-based interface, they are also incorporating small
memories that are accessible to chips in the chipset. The data that is in the
process of being transferred to/from I/O devices can be temporarily buffered in
these small memories.
A combination of these technologies makes memory-mapped I/O very effi-
cient. Hence, it is very popular as of 2025. In many reference manuals, it is
conveniently referred to by its acronym MMIO.
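As a rough illustration (not one of the book's listings), a device driver might access an MMIO region as follows. The base address, size and register offsets are made-up placeholders, while ioremap(), readl() and writel() are the standard kernel helpers.

#include <linux/errno.h>
#include <linux/io.h>

#define DEV_MMIO_BASE  0xFED00000UL   /* hypothetical device base address */
#define DEV_MMIO_SIZE  0x1000
#define REG_STATUS     0x00           /* hypothetical register offsets */
#define REG_DATA       0x04

static int talk_to_device(void)
{
    void __iomem *regs = ioremap(DEV_MMIO_BASE, DEV_MMIO_SIZE);
    if (!regs)
        return -ENOMEM;

    u32 status = readl(regs + REG_STATUS);   /* a regular load becomes an I/O read */
    writel(0xCAFE, regs + REG_DATA);         /* a regular store becomes an I/O write */

    iounmap(regs);
    return status ? 0 : -EIO;
}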
2.3.4 DMA
Figure 2.17: DMA: the CPU hands the transfer over to the DMA chip, which interrupts the CPU once the transfer is done.
Even though memory-mapped I/O is much more efficient than the older
method that relied on primitive instructions and basic I/O ports, it turns out
that we can do far better. Even in the case of memory-mapped I/O, the proces-
sor needs to wait for the load/store instruction that is doing the I/O to finish.
Given that I/O operations take a lot of time, the entire pipeline fills up and
the processor remains stalled until the outstanding I/O operations complete.
One simple solution is to perform the memory-mapped I/O operations in small
chunks and do other work in between; however, this slows down the entire
transfer process. We can also remove write operations from the critical path
and assume that they are done asynchronously. The problem of slow reads,
however, still remains.
Our main objective here is that we would like to do other work while I/O
operations are in progress. We can extend the idea of asynchronous writes to
also have asynchronous reads. In this model, the processor does not wait for
the read or write operation to complete. The key idea is shown in Figure 2.17,
where there is a separate DMA (direct memory access) chip that effects the
transfers between the I/O device and memory. The CPU basically outsources
the I/O operation to the DMA chip. The chip is provided with the addresses in
memory as well as the addresses on the I/O device along with the direction of
data transfer. Subsequently, the DMA chip initiates the process of data transfer.
In the meanwhile, the CPU can continue executing programs without stalling.
Once the DMA operation completes, it is necessary to let the OS know about
it.
Hence, the DMA chip issues an interrupt, the OS comes into play, and then
it realizes that the DMA operation has completed. Since user programs cannot
directly issue DMA requests, they instead just make system calls and let the
OS know about their intent to access an I/O device. This interface can be kept
simple primarily because it is only the OS’s device drivers that interact with
the DMA chip.
When the interrupt from the DMA controller arrives, the OS knows what
to do with it and how to signal the device drivers that the I/O operation is
done. The device driver can then either read the data that has been fetched
from an I/O device or assume that the write has completed. In many cases,
it is important to let the user program also know that the I/O operation has
completed. For example, when the printer successfully finishes printing a page,
the icon changes from “printing in progress” to “printing complete”. Signals
can be used for this purpose.
To summarize, in this section we have seen three different approaches for
interacting with I/O devices. The first approach is also the oldest approach
where we use old-fashioned I/O ports. This is a simple approach especially
when we are performing extremely low-level accesses, and we are not reading
or writing a lot of data. Currently, I/O ports are primarily used for interacting
with the BIOS (booting system), simple devices like LEDs and in embedded sys-
tems. This method has mostly been replaced by memory-mapped I/O (MMIO).
MMIO is easy for programmers, and it leverages the natural strengths of the
virtual memory system. It provides a convenient and elegant interface for device
drivers – they use regular load/store instructions to perform I/O. Also, another
advantage is that it is possible to implement a zero-copy mechanism where if
some data is read from an I/O device, it is very easy to transfer it to a user
program. The device driver can simply change the mapping of the pages and
map them to the user program after the I/O device has populated the pages.
Consequently, there is no need to first read data from an I/O device into pages
that are accessible only to the OS and then copy all the data once again into
user pages – that would be inefficient.
Subsequently, we looked at a method, which provides much more bandwidth
and also does not stall the CPU. This is known as DMA (direct memory access).
Here, the entire role of interacting with I/O devices is outsourced to an off-chip
DMA device; it finally interrupts the CPU once the I/O operation completes.
After that the device driver can take appropriate action, which also includes
letting the user program know that its I/O operation is over.
Point 2.3.1
DMA: The entire job of effecting the transfer is outsourced to the DMA
chip (or DMA controller). After performing the transfer, it raises
an interrupt to let the OS know that the transfer has completed.
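As a hedged sketch of how a device driver typically sets up such a transfer using the kernel's DMA API (the device-programming step and the dev pointer are assumptions specific to a particular driver):

#include <linux/dma-mapping.h>

static int setup_dma_transfer(struct device *dev, size_t size)
{
    dma_addr_t bus_addr;   /* address that the DMA controller/device will use */
    void *cpu_addr;        /* kernel virtual address of the same buffer */

    cpu_addr = dma_alloc_coherent(dev, size, &bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* program the device/DMA controller with bus_addr and size here;
       it raises an interrupt once the transfer completes */

    dma_free_coherent(dev, size, cpu_addr, bus_addr);
    return 0;
}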
processes lay out their code, data, heap and stack sections in a
similar manner in memory. This is known as the memory map of
the process. The stack typically starts at a very high address and
grows downwards.
10. This “virtual view” of memory is an elegant and convenient ab-
straction for the compiler, programmer and processor. However,
in a practical real-world system, there are three problems.
13. To augment the size of the physical address space, some space on
storage devices can be used. This is known as the swap space. If a
page's frame is not found in main memory, this event is known as a
page fault. There is a need to bring in the page from the swap space
and possibly evict a frame already resident in main memory.
14. x86 uses six different segment registers. The CPU generates a
logical address that is added to the base address stored in the
associated segment descriptor. The resultant linear address acts
like a virtual address that is translated to a physical address.
15. x86-64 primarily uses the fs and gs segments.
16. All these segment registers contain an index that maps to a segment
descriptor in the GDT table. The lookup process is accelerated by
using a segment descriptor cache.
17. There are three methods for performing I/O: port-mapped I/O (us-
ing regular I/O ports, 64-KB I/O address space), memory-mapped
I/O (map a region of the virtual address space to the I/O device’s
internal memory) and DMA (outsource the job of transferring data
to a dedicated off-chip circuit).
Exercises
Ex. 1 — Why are there multiple rings in an x86 processor? Isn't having just
two rings enough?
Ex. 2 — How does a process know that it is time for another process to run
in a multitasking system? Explain the mechanism in detail.
Ex. 3 — Assume a 16-core system. There are 25 active threads that are purely
computational. They do not make system calls. The I/O activity in the system
is negligible. Answer the following questions:
a) How will the scheduler get invoked?
b) Assume that the scheduler has a special feature. Whenever it is invoked,
it will schedule a new thread on the core on which it was invoked and
replace the thread running on a different core with another active (ready
to run) thread. How do we achieve this? What kind of hardware support
is required?
Ex. 4 — What is the need for having privileged registers in a system? How
does Intel avoid them to a large extent?
Ex. 5 — How can we design a virtual memory system for a machine that does
not have any kind of storage device such as a hard disk attached to it? How do
we boot such a system?
Ex. 7 — Do the processor and compiler work with physical addresses or vir-
tual addresses?
Ex. 8 — How does the memory map of a process influence the design of a
page table for 64-bit systems?
Ex. 12 — When is it preferred to use an inverted page table over the tradi-
tional (tree-based) page table?
Ex. 13 — Why are the memory contents not a part of a process’s context?
Ex. 14 — Assume two processes access a file in read-only mode. They use
memory-mapped I/O. Is there a possibility of saving physical memory space
here?
Chapter 3
Processes
A file can be thought of as a logically contiguous sequence of bytes on a storage device. Files can be of various
types such as documents, video files, audio files, and so on.
systems use. They first fully copy the parent process’s memory image and then
replace the memory image if there is a need. This approach is however not
followed in other operating systems like Windows. We shall learn more about
Linux’s approach and its pros and cons. Finally, we shall also spend some time in
understanding the different kinds of context switch mechanisms that are needed
in a modern operating system. Some of them can be made more efficient and
admit optimizations.
Organization of this Chapter
Figure 3.1: Organization of this chapter (processes and threads).
This chapter has three subparts (refer to Figure 3.1). We will start with
discussing the main concepts underlying a process in the latest Linux kernel.
A process is a very complex entity because the kernel needs to create several
data structures to represent all the runtime state of the running program. This
would, for example, include creating elaborate data structures to manage all the
memory regions that the process owns. This makes it easy to allocate resources
to processes and later on deallocate them. The kernel uses the task struct
structure to maintain this information.
Subsequently, we shall discuss the relevant code for managing process ids
(pids in Linux) and the overall state of the process. We shall specifically look
at a data structure called a maple tree, which the current version of the Linux
kernel uses extensively. We shall then also look at two more kinds of trees, which
are very useful for searching data, namely the radix tree and the augmented tree.
Appendix C describes these data structures in great detail. It is thus necessary
to keep referring to it.
In the subsequent section, we shall look at the methods of process creation
and destruction. Specifically, we shall look at the fork and exec system calls.
Using the fork system call, we can clone an existing process. Then, we can use
the exec family of calls to superimpose the image of a different executable on
top of the currently running process. This is the standard mechanism by which
new processes are created in Linux.
Finally, we shall discuss the context switch mechanism in a fair amount of
detail. We shall first introduce the different types of context switches and the
state that the kernel needs to maintain to suspend a running process and resume
it later. The process of suspension and resumption of a process is different for
different kinds of processes. For instance, if we are running an interrupt handler,
then certain rules apply that are quite restrictive whereas if we are running a
regular program, then some other rules apply.
Summary: Data Structures used in this Chapter
The reader is requested to kindly take a look at some important data structures
that are used in the Linux kernel (see Appendix C). Before proceeding forward,
we would like the reader to be fully familiar with the following data structures:
B-tree, B+ tree, maple tree and radix tree. They are extensively used through-
out the kernel. It is important to understand them before we proceed.
The kernel heavily relies on tree-based data structures. We frequently face
problems such as identifying the virtual memory region that contains a given
virtual memory address. This boils down to a search problem – given a key
find the value. Using a hash table is often not a very wise idea in such cases,
particularly when we are not sure how many keys we need to store. Hash tables
also have poor cache locality and do not lend themselves to easily implementing
range queries. Trees, on the other hand, are very versatile data structures. With
logarithmic-time complexity they can implement a wide variety of functions.
They support concurrent accesses and highly cache-efficient organizations. This
is why they are often used in high-performance implementations. Trees are
naturally scalable as well in terms of the number of nodes that they store.
We can always use the classical red-black and AVL trees. However, it is far
more common to use m-ary B-trees where a node’s size is equal to that of one
or more cache blocks. This leads to minimizing cache block fetches and also
allows convenient node-level locking. A B+ tree is a variation of a B-tree where
the keys are only stored at the leaves. A maple tree is a specialized B+ tree
that is used in the Linux kernel. The arity of the nodes changes with the level.
Internal nodes close to the root typically have fewer children and nodes with
higher depths have more children. Such adaptive node sizing is done to improve
memory efficiency.
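As a hedged sketch of this range-keyed interface, the kernel's maple tree API (include/linux/maple_tree.h) can be used as follows; the wrapper names remember_region and find_region are hypothetical.

#include <linux/maple_tree.h>

static DEFINE_MTREE(mt);

static int remember_region(unsigned long start, unsigned long end, void *region)
{
    /* associate the key range [start, end] with the value 'region' */
    return mtree_store_range(&mt, start, end, region, GFP_KERNEL);
}

static void *find_region(unsigned long addr)
{
    /* return the value whose range contains addr, or NULL if addr lies in a hole */
    return mtree_load(&mt, addr);
}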
Another noteworthy structure for storing keys and values is the radix tree.
It works well if keys share common prefixes. We traverse such a tree based on
the digits in the key. The search time is linear in terms of the number of digits
in the key.
The kernel also uses augmented trees that help us solve problems of the
following type: given a bit vector, find the location of the first 0 or 1 in log-
arithmic time (starting from a given location and proceeding towards higher
or lower indexes). Such trees are used to accelerate operations on bit vectors
especially scan operations that attempt to find the next 0 or 1 in a bit vector.
The implementation can be optimized. Each leaf of the augmented tree need
not correspond to a single bit. It can instead correspond to a set of 32 or 64
bits (the size of a memory word). The parent node in the tree just needs to store
whether any of those 32 or 64 bits is equal to a 0 or a 1. Note that in modern
machines data can only be stored at the granularity of 32 or 64 bits; hence, the
We shall revisit this definition in Section 3.3.2, where we shall look at how
Field                                   Description
struct thread info thread info          Low-level information
uint state                              Process state
void *stack                             Kernel stack
prio, static prio, normal prio          Priorities
struct sched info sched info            Scheduling information
struct mm struct *mm, *active mm        Pointers to memory information
pid t pid                               Process id
struct task struct *parent              Parent process
struct list head children, sibling      Child and sibling processes
Other fields                            File system, I/O, synchronization and debugging fields
In the code listings, we will sometimes use the ellipses . . . symbol to indicate that something is omitted, but most of the
time for the sake of readability, we will not have any ellipses.
The declaration of thread info is shown in Listing 3.1.
struct thread_info {
    unsigned long flags;          /* low-level flags */
    unsigned long syscall_work;   /* state of the executing system call */
    u32 status;                   /* thread synchronous flags */
    u32 cpu;                      /* current CPU */
};
This structure basically stores the current state of the thread, the state of
the executing system call and synchronization-related information. Along with
that, it stores another vital piece of information, which is the number of the
CPU on which the thread is running or is scheduled to run at a later point
in time. We shall see in later sections that finding the id of the current CPU
(and the state associated with it) is a very frequent operation and thus there is a
pressing need to realize it as efficiently as possible. In this context, thread info
provides a somewhat suboptimal implementation. There are faster mechanisms
of doing this, which we shall discuss in later sections. It is important to note
that the reader needs to figure out whether we are referring to a thread or a
process depending upon the context. In most cases, it does not matter because
a thread is treated as a process by the kernel. However, given that we allow
multiple threads or a thread group to also be referred to as a multi-threaded
process (albeit, in limited contexts), the term thread will more often be used
because it is more accurate. It basically refers to a single program executing as
opposed to multiple related programs (threads) executing.
Figure 3.2: Task state transitions: a newly created task enters TASK_RUNNING (ready but not running); when the scheduler asks it to execute, it moves to TASK_RUNNING (currently running); a SIGSTOP moves it to TASK_STOPPED (stopped); when the task finishes or is terminated, it moves to TASK_ZOMBIE.
message sent by the OS, which we refer to as a signal. For instance, it is possible
for other tasks to send the interrupted process a message (via the OS) and in
response it can invoke a signal handler. Recall that a signal handler is a specific
function defined in the program that is conceptually similar to an interrupt
handler, however, the only difference is that it is implemented in user space.
In comparison, in the UNINTERRUPTIBLE state, the task does not respond to
signals.
Zombie Tasks
The process of deleting the state of a task after it exits is quite elaborate in
Linux. To start with, note that the processor has no way of knowing when
a task has completed. It will continue to fetch bytes from memory and try
to execute them. It is thus necessary to explicitly inform the kernel that a
task has completed by making the exit system call. However, a task’s state is
not cleaned up at this stage. Instead, the task’s parent is informed using the
SIGCHLD signal. Every task has a parent. It is the task that has spawned the
current task. The parent then needs to call the system call wait to read the
exit status of the child. It is important to understand that every time the exit
system call is called, the exit status is passed as an argument. Typically, the
value zero indicates that the task completed successfully. On the other hand,
a non-zero status indicates that there was an error. The status in this case
represents the error code.
Here again, there is a convention. The exit status ‘1’ indicates that there
was an error; however, it does not provide any additional details. We can refer to
this situation as a non-specific error. Given that we have a structured hierarchy
of tasks with parent-child relationships, Linux explicitly wants every parent to
read the exit status of all its children. Until a parent task has read the exit status
of the child, the child remains a zombie task – neither dead nor alive. After the
parent has read the exit status, all the state associated with the completed child
task can be deleted.
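The following user-space sketch illustrates the protocol described above: the child passes an exit status to exit, and the parent reaps the zombie by reading that status with waitpid. The status value 3 is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        exit(3);                  /* child: non-zero status => error code 3 */

    int status;
    waitpid(pid, &status, 0);     /* parent reaps the zombie child */
    if (WIFEXITED(status))
        printf("child exit status = %d\n", WEXITSTATUS(status));
    return 0;
}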
Hence, typically all versions of the Linux kernel have placed strict hard limits
on the size of the kernel stack.
Figure 3.3: Structure of the kernel stack in older kernels: the thread_info structure at the base of the stack stores a pointer to the task_struct, which the current macro retrieves.
The size of the kernel stack is limited to two 4-KB pages, i.e., 8 KB. It
contains useful data about the running thread. These are basically per-thread
stacks. In addition, the kernel maintains a few other stacks, which are specific to
a CPU. The CPU-specific stacks are used to run interrupt handlers, for instance.
Sometimes, we have very high-priority interrupts, and some interrupts cannot
be ignored (they are not maskable). The latter kind of interrupts are known as NMIs
(non-maskable interrupts). This basically means that if we are executing an
interrupt handler and a higher-priority interrupt arrives, we need to do a context
switch and run the interrupt handler for the higher-priority interrupt. This is
conceptually similar to the regular context switch process for user-level tasks.
It is just that in this case interrupt handlers are being paused and subsequently
resumed. This is happening within the kernel. Note that each such interrupt
handler needs its own stack to execute. Every time an interrupt handler runs,
we need to find a free stack and assign it to the handler. Once the handler
finishes running, the stack’s contents can be cleared and the stack is ready to
be used by another handler. Given that nested interrupts (running interrupt
handlers by pausing other handlers) are supported, we need to provision for
many stacks. Linux has a limit of 7. This means that the level of interrupt
handler nesting is limited to 7.
Figure 3.3 shows the structure of the kernel stack in older kernels. The
thread info structure was kept at the lowest address in the 8-KB memory re-
gion that stored the kernel stack. Even in current kernels, this memory region
is always aligned to an 8-KB boundary. The thread info structure had a vari-
able called task that pointed to the corresponding task struct structure. The
current macro subsumed the logic for getting the thread info of the current
task and then getting a pointer to the associated task struct from it. The main
aim here is to design a very quick method for retrieving the task struct asso-
ciated with the current task. This is a very time-critical operation in modern
kernels and is invoked very frequently. Hence, a need was felt to optimize this operation.
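A hedged sketch of this older mechanism is shown below. It assumes the older thread_info layout that contained a task pointer; the structure name old_thread_info is hypothetical, and the real kernel code differs in its details.

#define THREAD_SIZE (2 * 4096)          /* two 4-KB pages */

struct old_thread_info {                /* hypothetical name for the old layout */
    struct task_struct *task;           /* pointer back to the owning task */
    /* ... other low-level fields ... */
};

static inline struct old_thread_info *current_thread_info(void)
{
    unsigned long sp;
    asm ("mov %%rsp, %0" : "=r" (sp));  /* read the kernel stack pointer */
    /* the stack region is 8-KB aligned, so masking the low bits finds its base */
    return (struct old_thread_info *)(sp & ~(THREAD_SIZE - 1));
}

#define current (current_thread_info()->task)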
Example 3.1.2
Write a function to extract the ith bit in the number x. The LSB is the
first bit.
Answer:
int extract(int x, int i) {
    return (x & (1 << (i - 1))) >> (i - 1);
}
Point 3.1.2
Often a need is felt to store a set of bits. Each bit could be a flag or
some other status code. The most space-efficient data structure to store
such bits is often a variant of the classical unsigned integer. For example,
if we want to store 12 bits, it is best to use an unsigned short integer
(u16). The 12 LSB bits of the primitive data type can be used to store
the 12 bits, respectively. Similarly, if we wish to store 40 bits, it is best
to use an unsigned long integer (u64). The bits can be extracted using
the logic followed in Example 3.1.2.
Refer to the code in Listing 3.2. It defines a macro current that returns a
pointer to the current task struct via a chain of macros and in-line functions. 2
The code ultimately resolves to a single instruction that reads the address of
the current task’s task struct in the gs segment [Lameter and Kumar, 2014].
The gs segment thus serves as a dedicated region that stores information that
is quickly accessible to a kernel thread. In fact, the kernel partitions a part of
this region to store information specific to each core (CPU in kernel’s parlance).
It can thus instantly access the task struct structures of processes running on
all the CPUs.
Note that here we are using the term “CPU” as a synonym for a “core”.
This is Linux’s terminology. We can store a lot of important information in a
dedicated per-CPU/per-core area, notably the current (task) variable, which
is needed very often. It is clearly a global variable insofar as the kernel code
running on the CPU is concerned. We thus want to access it with as few
memory accesses as possible. In our current solution with segmentation, we
are reading the variable with just a single instruction. This was made possible
because the gs register directly stores a pointer to the beginning of the dedicated
storage region, and the offset of the task struct from that region is known.
An astute reader can clearly make out that this mechanism is more efficient
than the earlier method that used a redirection via the thread info structure.
The slower redirection-based mechanism is still used in architectures that do
not have support for segmentation.
There are many things to be learned here. The first is that for something as
important as the current task, which is accessed very frequently, and is often on
the critical path, there is a need to devise a very efficient mechanism. Further-
more, we also need to note the diligence of the kernel developers in this regard
and appreciate how much they have worked to make each and every mechanism
as efficient as possible – save memory accesses wherever and whenever possible.
In this case, several conventional solutions are clearly not feasible such as storing
the current task pointer in CPU registers, a privileged/model-specific register
(not a portable choice), or even a known memory address. The issue with stor-
ing this pointer at a known memory address is that it significantly limits our
2 In an inline function, the code of the function is expanded at the point of invocation.
There is no function call and return. This method enhances the performance of very small
functions.
flexibility in using the virtual address space. This may create portability issues
across architectures. As a result, the developers chose the segmentation-based
method for x86 hardware.
There is a small technicality here. We need to note that different CPUs
(cores on a machine) will have different per-CPU regions. This, in practice,
can be realized very easily with this scheme because different CPUs have dif-
ferent segment registers. We also need to ensure that these per-CPU regions
are aligned to cache line boundaries. This means that a cache line is uniquely
allocated to a per-CPU region – there are no overlaps. If this is not the case, we
will have a lot of false sharing misses across the CPUs, which will prove to be
detrimental to the overall performance. Recall that false sharing misses are an
artifact of cache coherence. A cache line may end up continually bouncing be-
tween cores if they are interested in accessing different non-overlapping chunks
of that same cache line.
Linux uses 140 task priorities. The priority range as shown in Table 3.2 is
from 0 to 139. The priorities 0-99 are for real-time tasks. These tasks are for
mission-critical operations, where deadline misses are often not allowed. The
scheduler needs to execute them as soon as possible.
The reason we have 100 different priorities for such real-time processes is
because we can have real-time tasks that have different degrees of importance.
We can have some that have relatively “soft” requirements, in the sense that
it is fine if they are occasionally delayed. Whereas, we may have some tasks
where no delay is tolerable. The way we interpret the priority range 0-99 is as
follows. In this space, 0 corresponds to the least priority real-time task and the
task with priority 99 has the highest priority in the overall system.
Some kernel threads run with real-time priorities, especially if they are in-
volved in important bookkeeping activities or interact with sensitive hardware
devices. Their priorities are typically in the range of 40 to 60. In general, it is
not advisable to have a lot of real-time tasks with very high priorities (more than
60) because the system tends to become quite unstable. The reason is that the
CPU time is completely monopolized by these real-time tasks, resulting in the
rest of the tasks, including many OS tasks, not getting enough time to execute.
Hence, a lot of important kernel activities get delayed.
Now for regular user-level tasks, we interpret their priority slightly differ-
ently. In this case, the higher the priority number, the lower the actual priority. This
basically means that in the entire system, the task with priority 139 has the
least priority. On the other hand, the task with priority 100 has the highest
priority among all regular user-level tasks. It still does not have a real-time
priority but among non-real-time tasks it has the highest priority. The impor-
tant point to understand is that the way that we understand these numbers
is quite different for real-time and non-real-time tasks. We interpret them in
diametrically opposite manners in both the cases (refer to Figure 3.4).
Figure 3.4: The priority value range: 0 to 99 for real-time tasks and 100 to 139 for regular user tasks.
There are two concepts here. The first is the number that we assign in the
range 0-139, and the second is the way that we interpret the number as a task
priority. It is clear from the preceding discussion that the number is interpreted
differently for regular and real-time tasks. However, if we consider the kernel,
it needs to resolve the ambiguity and use a single number to represent the
priority of a task. We would ideally like to have some degree of monotonicity.
Either a lower value should always correspond to a higher
priority or the reverse; we never want a combination of the two in the actual
kernel code. This is exactly what is being rectified in the code snippet shown
in Listing 3.3. We need to note that there are historical reasons for interpreting
user and real-time priority numbers at the application level differently, but in
the kernel code this ambiguity needs to be resolved and monotonicity needs to
be ensured.
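Before looking at the kernel's code, this mapping can be sketched in a few lines of self-contained C. The helper name normalized_prio is hypothetical, and the nice-value handling is an assumption based on the 100-139 user range discussed above.

#include <stdio.h>

#define MAX_RT_PRIO 100

/* real-time priorities 0..99 map to 99..0 (lower value = higher priority);
   user nice values -20..19 map to 100..139 */
static int normalized_prio(int is_realtime, int rt_priority, int nice)
{
    if (is_realtime)
        return MAX_RT_PRIO - 1 - rt_priority;   /* 0..99 -> 99..0 */
    return MAX_RT_PRIO + nice + 20;             /* -20..19 -> 100..139 */
}

int main(void)
{
    printf("%d\n", normalized_prio(1, 99, 0));    /* highest real-time task -> 0 */
    printf("%d\n", normalized_prio(0, 0, -20));   /* highest user task -> 100 */
    return 0;
}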
In line with this philosophy, let us consider the first else if condition that
corresponds to real-time tasks. In this case, the value of MAX RT PRIO is 100.
Hence, the range [0-99] gets translated to [99-0]. This basically means that the lower
the value of prio, the greater the priority. We would want user-level priorities
struct sched_info {
    unsigned long pcount;             /* number of times run on this CPU */
    unsigned long long run_delay;     /* time spent waiting on a runqueue */
    /* Timestamps : */
    unsigned long long last_arrival;  /* when we last ran on a CPU */
    unsigned long long last_queued;   /* when we were last queued to run */
};
The structure sched info (shown in Listing 3.4) contains some meta-information
about the overall scheduling process. The variable pcount denotes the number
of times this task has run on the CPU. run delay is the time spent waiting in
the runqueue. The runqueue is a structure that stores all the tasks whose status
is TASK RUNNING.3 As we have discussed earlier, this includes tasks that are
currently running on CPUs as well as tasks that are ready to run. Then we have
a bunch of timestamps. The most important timestamps are last arrival and
last queued, which store when the task last ran on a CPU and when it was last
queued to run, respectively. In general, the unit of time within a CPU is either
in milliseconds or in jiffies (refer to Section 2.1.4).
Figure 3.5: Key components of struct mm struct.
are, and this information will remain the same throughout the execution of the
process. The kernel can access the page table using its physical address. There
is no need to issue a lookup operation to find where the page table of a given
process is currently located. This approach also reduces TLB misses because a
lot of the mappings do not change.
The kernel uses a very elaborate data structure known as struct mm struct
to maintain all memory-related information of this nature as shown in Figure 3.5.
The core data structure, mm struct, has many fields as shown in the figure.
As we just discussed, one of the key roles of this structure is to keep track
of the memory map (refer to Section 2.2.1). This means that we need to keep
track of all the virtual memory regions that are owned by a process. The kernel
uses a dedicated structure known as a maple tree that keeps track of all these
regions. It is a sophisticated variant of a traditional B+ tree (see Appendix C).
Each key in the maple tree is actually a 2-tuple: starting and ending address of
the region. The key thus represents a range. In this case, the keys (and their
corresponding ranges) are non-overlapping. Hence, it is easily possible to find
which region a virtual address is a part of by just traversing the maple tree.
This takes logarithmic time.
Along with the memory map, the other important piece of information that
the mm struct stores is a pointer to the page table (pgd). Linux uses a multi-
level page table, where each entry contains a lot of information – this is necessary
for address translation, security and high performance. Readers should note the
high level of abstraction here. The entire page table is referenced using just a
single pointer: pgd t* pgd. All the operations performed on the page table
require nothing more than this single pointer. This is a very elegant design
pattern and is repeated throughout the kernel.
Next, the structure contains a bunch of statistics about the total number of
pages, the number of locked pages, the number of pinned pages and the details of
different memory regions in the memory map. For example, this structure stores
the starting and ending virtual addresses of the code, data and stack sections.
Next, the id of the owner process (pointer to a task struct) is stored. There
is a one-to-one correspondence between a process and its mm struct.
The last field cpu bitmap is somewhat interesting. It is a bitmap of all the
CPUs on which the current task has executed in the past. For example, if there
are 8 CPUs in the system, then the bitmap will have 8 bits. If bit 3 is set to 1,
then it means that the task has executed on CPU #3 in the past. This is an
important piece of information because we need to understand that if a task has
executed on a CPU in the past, then most likely its caches will have warm data.
In this case “warm data” refers to data that the current task is most likely going
to use in the near future. Given that programs exhibit temporal locality, they
tend to access data that they have recently accessed in the past. This is why it
is a good idea to record the past history of the current task’s execution. Given
a choice, it should always be relocated to a CPU on which it has executed in
the recent past. In that case, we are maximizing the chances of finding data in
the caches, which may still prove to be useful in the near future.
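An abridged sketch of the structure, showing only the fields discussed in this section, is given below. The real definition in include/linux/mm_types.h is much larger and some fields are configuration-dependent.

struct mm_struct {
    struct maple_tree mm_mt;             /* maple tree of VM areas (the memory map) */
    pgd_t *pgd;                          /* pointer to the top-level page table */
    unsigned long start_code, end_code;  /* code section boundaries */
    unsigned long start_data, end_data;  /* data section boundaries */
    unsigned long start_brk, brk;        /* heap boundaries */
    unsigned long start_stack;           /* start of the stack */
    struct task_struct *owner;           /* owner process */
    unsigned long cpu_bitmap[];          /* CPUs that the task has run on */
};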
map of a process, we had observed that there are a few contiguous regions
interspersed with massive holes. The memory map, especially in a 64-bit system,
is a very sparse structure. In the middle of the large sparse areas, small chunks
of virtual memory are used by the process. Hence, any data structure that is
chosen needs to take this sparsity into account.
Many of the regions in the memory map have already been introduced such
as the heap, stack, text, data and bss regions. In between the stack and heap
there is a huge empty space. In the middle of this space, some virtual memory
regions are used for mapping files and loading shared libraries. There are many
other miscellaneous entities that are stored in the memory map such as handles
to resources that a process owns. Hence, it is advisable to have an elaborate
data structure that keeps track of all the used virtual memory regions regardless
of their actual purpose. All the pages within a virtual memory region share the
same memory protection, read/write policies and methods to handle page faults.
Instead of treating each virtual page distinctly, it is a good idea to group them
into regions and assign common attributes and policies to each region. Hence,
we need to design a data structure that answers the following question.
Question 3.1.1
Given a virtual memory address, find the virtual memory region that it
is a part of.
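Inside the kernel, this question is answered with the find_vma helper; a hedged sketch is shown below. The wrapper name region_of is hypothetical, and the caller is expected to hold the mmap lock.

#include <linux/mm.h>

static struct vm_area_struct *region_of(struct mm_struct *mm, unsigned long addr)
{
    /* find_vma returns the first region whose end lies above addr */
    struct vm_area_struct *vma = find_vma(mm, addr);

    if (vma && addr >= vma->vm_start)
        return vma;     /* addr lies inside [vm_start, vm_end) */
    return NULL;        /* addr falls inside a hole of the memory map */
}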
Listing 3.5 shows the code of vm area struct that represents a contiguous
virtual memory region. As we can see from the code, it maintains the details of
each virtual memory (VM) region including its starting and ending addresses. It
also contains a pointer to the parent mm struct. For understanding the rest of
the fields, let us introduce the two kinds of memory regions in Linux: anonymous
and file-backed.
Anonymous memory region These are memory regions that are not
mirrored or copied from a file, such as the stack and the heap. These memory
regions are created during the execution of the process and store dynam-
ically allocated data. Hence, these are referred to as anonymous memory
regions. They have a dynamic existence, and are not linked to specific
sections in a binary or object file.
File-backed memory region In this case, a region of the virtual memory
space is mapped to a file. This means that the contents of the file
are physically copied to memory and that region is mapped to a virtual
memory region. Typically, if we write to that region in memory, the
changes will ultimately reflect in the backing file. This backing file is
referred to as vm file in struct vm area struct.
3.1.12 Namespaces
Containers
Traditional cloud computing is quickly being complemented with many new
technologies: microservices, containers and serverless computing. We shall focus
on them in Chapter 8. The basic idea is that a virtual machine (VM) is a
fully virtualized environment where every processor resource, including the CPUs
and memory, is virtualized. These VMs can be suspended, moved to a new
machine and restarted. However, this is a heavy-duty solution. Containers, on the
other hand, are a lightweight solution where the underlying hardware is not virtualized.
A container is used to create a small isolated environment within a machine that
is a “mini-virtual machine”. Processes within a container perceive an isolated
environment. They have their own set of processes, network stack and file
system. They also provide strong security guarantees.
Almost all modern versions of Linux support containers such as Docker4 ,
Podman5 and LXC6 . A container is primarily a set of processes that own file
and network resources. These resources are exclusive to the container, which allows it to
host a custom environment. For example, if the user has spent a lot of effort in
creating a custom software environment, she would not like to again install the
same software programs on another machine. Along with re-installing the same
software, configuring the system is quite cumbersome. A lot of environment
variables need to be set and a lot of script files need to be written. Instead
of repeating this burdensome sequence of steps on every machine, it is a better
idea to create a custom file system, mount it on a Docker container and simply
distribute the Docker container. All that one needs to do on a remote machine
4 https://www.docker.com/
5 https://podman.io/
6 https://linuxcontainers.org/
Details of Namespaces
Let us discuss the idea of namespaces, which underlie the key process man-
agement subsystem of containers. They need to provide a virtualized process
environment where processes retain their pid numbers, inter-process communi-
cation structures and state after migration.
Specifically, the kernel groups processes into namespaces. Recall that the
processes are arranged as a tree. Every process has a parent process, and there
is one global root process. Similarly, the namespaces are also hierarchically
organized as a tree. There is a root namespace. Every process is visible to its
own namespace and is additionally also visible to all ancestral namespaces. No
process is visible to any child namespace.
Point 3.1.4
Every process is visible to its own namespace and is additionally also
visible to all ancestral namespaces.
In this case, a pid (number) is defined only within the context of a names-
pace. When we migrate a container, we also migrate its namespace. Then the
container is restarted on a remote machine, which is tantamount to re-instating
its namespace. This means that all the paused processes in the namespace are
activated. Given that this needs to happen unbeknownst to the processes in
the container, the processes need to maintain the same pids even on the new
machine.
As discussed earlier, a namespace itself can be embedded in a hierarchy of
namespaces. This is done for the ease of managing processes and implementing
containers. Every container is assigned its separate namespace. It is possible for
the system administrator to provide only a certain set of resources to the parent
namespace. Then the parent namespace needs to appropriately partition these
7 https://criu.org/
resources among its child namespaces. This allows for fine-grained resource
management and tracking.
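From user space, a new pid namespace can be created with the clone system call and the CLONE_NEWPID flag. The sketch below (which requires root privileges; all calls are standard POSIX/Linux APIs) shows that the child sees itself as pid 1 inside the new namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child_fn(void *arg)
{
    printf("child: pid inside the new namespace = %d\n", getpid());  /* prints 1 */
    return 0;
}

int main(void)
{
    /* the stack grows downwards, so pass the top of the buffer */
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid < 0) {
        perror("clone");
        exit(1);
    }
    printf("parent: child's pid in my namespace = %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}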
struct pid_namespace {
    struct idr idr;                 /* radix tree of allocated pid structures */
    struct kmem_cache *pid_cachep;  /* pool (kmem cache) for struct pid objects */
    unsigned int level;             /* depth in the namespace hierarchy */
    struct pid_namespace *parent;   /* parent namespace */
};
The code of struct pid namespace is shown in Listing 3.6. The most
important structure that we need to consider is idr (IDR tree). This is an
annotated Radix tree (of type struct idr) and is indexed by the pid. The
reason that there is such a sophisticated data structure here is because, in
principle, a namespace could contain a very large number of processes. Hence,
there is a need for a very fast data structure for storing and indexing them.
We need to understand that often there is a need to store additional data
associated with a process. It is stored in a dedicated structure called (struct
pid). The idr tree returns the pid structure for a given pid number. We need
to note that some confusion is possible here given that both are referred to using
the same term “pid”.
Next, we have a kernel object cache (kmem cache) or pool called pid cachep.
It is important to understand what a pool is. Typically, free and malloc calls for
allocating and deallocating memory in C take a lot of time. There is also need
for maintaining a complex heap memory manager, which needs to find a hole of
a suitable size for allocating a new data structure. It is a much better idea to
have a set of pre-allocated objects of the same type in an 1D array called a pool.
It is a generic concept and is used in a lot of software systems including the
kernel. Here, allocating a new object is as simple as fetching it from the pool
and deallocating it is also simple – we need to return it back to the pool. These
are very fast calls and do not involve the action of the heap memory manager,
which is far slower. Furthermore, it is very easy to track memory leaks. If we
forget to return objects back to the pool, then in due course of time the pool
will become empty. We can then throw an exception, and let the programmer
know that this is an unforeseen condition and is most likely caused by a memory
leak. The programmer must have forgotten to return objects back to the pool.
To initialize the pool, the programmer should have some idea about the
maximum number of instances of objects that may be active at any given point
of time. After adding a safety margin, the programmer needs to initialize the
pool and then use it accordingly. In general, it is not expected that the pool
will become empty because, as discussed earlier, an empty pool typically indicates a memory leak.
However, there could be legitimate reasons for this to happen such as a wrong
initial estimate. In such cases, one of the options is to automatically enlarge
the pool size up till a certain limit. Note that a pool can store only one kind
of object. In almost all cases, it cannot contain two different types of objects.
Sometimes exceptions to this rule are made if the objects are of the same size.
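A hedged sketch of using such a pool with the kernel's slab interface is shown below. The object type my_obj and the cache name are made up, while kmem_cache_create, kmem_cache_alloc and kmem_cache_free are the standard calls.

#include <linux/slab.h>

struct my_obj {
    int id;
    char payload[60];
};

static struct kmem_cache *my_pool;

static int pool_init(void)
{
    my_pool = kmem_cache_create("my_obj_pool", sizeof(struct my_obj),
                                0, SLAB_HWCACHE_ALIGN, NULL);
    return my_pool ? 0 : -ENOMEM;
}

static void pool_example(void)
{
    struct my_obj *obj = kmem_cache_alloc(my_pool, GFP_KERNEL);  /* fetch from the pool */
    if (obj)
        kmem_cache_free(my_pool, obj);                           /* return it to the pool */
}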
Next, we store the level field that indicates the level of the namespace.
Recall that namespaces are stored in a hierarchical fashion. This is why, every
namespace has a parent field.
struct upid {
int nr ; /* pid number */
struct pid_namespace * ns ; /* namespace pointer */
};
struct pid
{
    refcount_t count;                      /* reference count */
    unsigned int level;                    /* level of its original namespace */
    struct hlist_head tasks[PIDTYPE_MAX];  /* lists of tasks using this pid */
};
Let us now look at the code of struct pid in Listing 3.7. As discussed
earlier, often there is a need to store additional information regarding a process,
which may be used after the pid has been reused and the process has terminated.
The count field is a reference count: the number of entities that are currently using this pid structure.
Ideally, it should be 0 when the process is freed. Also, every process has a
default level, which is captured by the level field. This is the level of its
original namespace.
The linked list tasks stores several lists of tasks. Note that hlist head
points to a linked list (singly-linked). It has several members. The most impor-
tant members are as follows:
• tasks[PIDTYPE TGID] (list of processes in the thread group)
• tasks[PIDTYPE PGID] (list of processes in the process group)
• tasks[PIDTYPE SID] (list of processes in the session)
We have already looked at a thread group. A process group is a set of
processes that are all started from the same shell command. For example, if we
start an instance of the Chrome browser, and it starts a set of processes, they
are all a part of the same process group. If the user presses Ctrl+C on the shell
then the Chrome browser process and all its child processes get terminated. A
session consists of a set of process groups. For example, all the processes created
by the login shell are a part of the same session.
Point 3.1.5
A process may belong to a thread group. Each thread group has a thread
group id, which is the pid of the leader process. A collection of processes
and thread groups is referred to as a process group. All of them can be
sent SIGINT (Ctrl+C) and SIGTSTP (Ctrl+Z) signals from the shell. It
is possible to terminate all of them in one go. A collection of process
groups forms a session. For example, all the processes started by the same
login shell are a part of the same session.
IDR Tree
Each namespace has a data structure called an IDR tree (struct idr). IDR
stands for “ID Radix”. We can think of the IDR tree as an augmented version
of the classical radix tree. Its nodes are annotated with additional information,
which allows it to function as an augmented tree as well. Its default operation
is to work like a hash table, where the key is the pid number and the value
is the pid structure. This function is very easily realized by a classical radix
tree. However, the IDR tree can do much more in terms of finding the lowest
unallocated pid. This functionality is normally provided by an augmented tree
(see Appendix C). The IDR tree is a beneficial combination of a radix tree and
an augmented tree. It can thus be used for mapping pids to pid structures and
for finding the lowest unallocated pid number in a namespace in logarithmic
time.
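A hedged sketch of the IDR interface (include/linux/idr.h) is shown below. The wrapper names are hypothetical, but idr_alloc, idr_find and idr_remove are the standard calls, and idr_alloc hands out the lowest unused id.

#include <linux/idr.h>

static DEFINE_IDR(my_idr);

static int track_object(void *obj)
{
    /* returns the lowest unused id >= 1, or a negative error code */
    return idr_alloc(&my_idr, obj, 1, 0, GFP_KERNEL);
}

static void *lookup_object(int id)
{
    return idr_find(&my_idr, id);    /* radix-tree style lookup by id */
}

static void untrack_object(int id)
{
    idr_remove(&my_idr, id);         /* the id becomes available again */
}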
A node in the IDR tree is an xa node, which typically contains an array of
64 pointers. Each entry can either point to another internal node (xa node) or
an object such as a struct pid. In the former case, we are considering internal
nodes in the augmented tree. The contiguous key space assigned to each subtree
is split into non-overlapping regions and assigned to each child node. The leaves
are the values stored in the tree. They are the objects stored in the tree (values
in the key-value pairs). We reach an object (leaf node) by traversing a path
based on the digits in the key.
Let us explain the method to perform a key lookup using the IDR tree.
We start from the most significant bits (MSBs) of the pid and gradually proceed
towards the least significant bit (LSB). This ensures that the leaves of the tree that correspond to
unique pids are in sorted order if we traverse the tree using a preorder traversal.
Each leaf (struct pid) corresponds to a valid pid.
Trivia 3.1.1
It is important to note that we do not store a bit vector explicitly at one
place. The bit vector is distributed across all the internal nodes at the
second-last level. Nodes at this level point to the leaves.
Sequentially scanning every bit of the marks bit vector stored in an xa node
can take a lot of time (refer to Figure 3.7). If it is a 64-bit wide field, we need
to run a for loop that has 64 iterations. Fortunately, on x86 machines, there
is an instruction called bsf (bit scan forward) that returns the position of the
first (least significant) 1. This is a very fast hardware instruction that executes
in 2-3 cycles. The kernel uses this instruction to almost instantaneously find
the location of the first 1 bit (free bit).
Figure 3.7: Structure of an xa node: a marks bit vector and an array of slots,
which are pointers to child xa nodes or pid structs.
Once a free bit is found, it is set to 0, and the corresponding pid number
is deemed to be allocated. This is equivalent to converting a 1 to a 0 in an
augmented tree (see Appendix C). There is a need to traverse the path from the
leaf to the root and change the status of nodes accordingly. Similarly, when a
pid is deallocated, we convert the corresponding bit from 0 to 1, and appropriate
changes are made to the nodes in the path from the leaf to the root.
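The same effect can be obtained portably in C with the compiler's find-first-set builtin, which typically compiles down to a single bsf/tzcnt instruction on x86. The following self-contained sketch uses a made-up marks value.

#include <stdint.h>
#include <stdio.h>

static int first_free_slot(uint64_t marks)
{
    /* returns the 1-based position of the least significant 1 bit, or 0 if none */
    return __builtin_ffsll(marks);
}

int main(void)
{
    uint64_t marks = 0xF0;                      /* bits 4-7 are set (free) */
    printf("%d\n", first_free_slot(marks));     /* prints 5 (1-based) */
    return 0;
}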
and flash drives to be block-level storage devices – their atomic storage units
are blocks (512 B to 4 KB). It is necessary to also maintain some information
regarding the I/O requests that have been sent to different block devices. Linux
also defines character devices such as the keyboard and mouse that typically
send a single character (a few bytes) at a time. Whenever some I/O operation
completes or a character device sends some data, it is necessary to call a signal
handler. Recall that a signal is a message sent from the operating system to
a task. The signal handler is a specialized function that is registered with the
kernel.
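For reference, the following user-space sketch registers a signal handler with the kernel using sigaction; the choice of SIGINT and the message are arbitrary.

#include <signal.h>
#include <unistd.h>

static void handler(int sig)
{
    (void)sig;
    /* only async-signal-safe calls are allowed here; write(2) is safe */
    const char msg[] = "signal received\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = handler;
    sigaction(SIGINT, &sa, NULL);   /* register the handler for Ctrl+C */
    pause();                        /* wait until a signal arrives */
    return 0;
}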
The fields that store all this information in the task struct are as follows.
/* I/O device context */
struct io_context *io_context;
An example of using the fork system call is shown in Listing 3.9. Here, the
fork library call is used, which encapsulates the fork system call. The fork
library call returns a process id (variable pid in the code) after creating the
child process.
It is clear that inside the code of the forking procedure, a new process is
created, which is a child of the parent process that made the fork call. It is a
perfect copy of the parent process. It inherits the parent’s code as well as its
memory state. In this case, inheriting means that all the memory regions and
the state are fully copied and the copy is assigned to the child. For example, if
a variable x is defined to be 7 in the code before executing the fork call, then
after the call is over and the child is created, both of the processes can read x.
They will see its value to be 7. However, there is a point to note here. The
variable x is different for both the processes even though it has the same value,
i.e., 7. This means that if the parent changes x to 19, the child will still read
its value as 7.
Figure 3.8: The fork() call creates a copy of the original process: the parent is returned the child's pid and the child process is returned 0.
Herein lies the brilliance of this mechanism – the parent and child are re-
turned different values.
Point 3.2.1
The child is returned 0 and the parent is returned the pid of the child.
This part is crucial because it helps the rest of the code differentiate between
the parent and the child. A process knows whether it is the parent process or
the child process from the return value: 0 for the child and the child’s pid for
the parent. Subsequently, the child and parent go their separate ways. Based
on the return value of the fork call, the if statement is used to differentiate
between the child and parent. Both can execute arbitrary code beyond this
point and their behavior can completely diverge. In fact, we shall see that the
child can completely replace its memory map and execute some other binary.
However, before we go that far, let us look at how the address space of one
process is logically copied in its entirety. This is known as the copy-on-write mechanism.
Copy-on-Write
Figure 3.9: (a) The parent and child sharing a page; (b) after a write, the parent and child have their own separate copies of the page.
work. Since we are only performing read operations, we are only interested in
getting the correct values – the same will be obtained. However, the moment
there is a write operation, initiated by either the parent or the child after the
fork operation, some additional work needs to be done.
Let us understand our constraints. We do not share any variables between
the parent and the child. As we have discussed earlier, if a variable x is de-
fined before the fork call, after the call it actually becomes two variables: x in
the parent’s address space and x in the child’s address space. This cannot be
achieved by just copying the page table of the parent. We clearly need to do
more if there is a write.
This part is shown in Figure 3.9(b). Whenever there is a write operation
that is initiated by the parent or the child, we create a new copy of the data
for the writing process. This is done at the page level. This means that a
new physical copy of the frame is created and mapped to the respective virtual
address space. This requires changes in the TLB and page table of the writing
process. The child and parent now have different mappings in their TLBs and
page tables. The virtual addresses that were written to now point to different
physical addresses. Assume that the child initiated the write, then it gets a new
copy of the frame and appropriate changes are made to its TLB and page table
to reflect the new mapping. Subsequently, the write operation is realized. To
summarize, the child writes to its “private” copy of the page. This write is not
visible to the parent.
As the name suggests, this is a copy-on-write mechanism where the child
and parent continue to use the same physical page (frame) until there is a write
operation initiated by either one. This approach can easily be realized by just
copying the page table, which is a very fast operation. The moment there is a
write, there is a need to create a new copy of the corresponding frame, assign
it to the writing process, and then proceed with the write operation. This
increases the performance overheads when it comes to the first write operation
after a fork call; however, a lot of this overhead gets amortized and is seldom
visible.
There are several reasons for this. The first is that the parent and child
may not subsequently write to a large part of the memory space such as the
code and data sections. In this case, the copy-on-write mechanism will never
get activated. The child may end up overwriting its memory image with that of
another binary and this will end up erasing its entire memory map. There will
thus be no need to invoke the CoW mechanism. Furthermore, lazily creating
copies of frames as and when there is a demand, distributes the overheads over
a long period of time. Most applications can absorb this overhead very easily.
Hence, the fork mechanism has withstood the test of time.
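The effect can be demonstrated with a short user-space program (a sketch; the values are arbitrary): after fork, a write by the child triggers copy-on-write, so the parent still sees its own value.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int x = 7;
    pid_t pid = fork();

    if (pid == 0) {                 /* child */
        x = 19;                     /* the write triggers copy-on-write of this page */
        printf("child:  x = %d\n", x);    /* prints 19 */
        return 0;
    }
    wait(NULL);
    printf("parent: x = %d\n", x);        /* still prints 7 */
    return 0;
}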
Details
We would like to draw the reader’s attention to the file in the kernel that lists
all the supported system calls: include/linux/syscalls.h. It has a long list of
system calls. However, the system calls of our interest are clone and vfork.
The clone system call is the preferred mechanism to create a new process or
thread in a thread group. It is extremely flexible and takes a wide variety of
arguments. However, the vfork call is optimized for the case when the child
process immediately makes an exec call to replace its memory image. In this
case, there is no need to fully initialize the child and copy the page tables of the
parent. Finally, note that in a multithreaded process (thread group), only the
calling thread is forked.
Inside the kernel, all of these functions ultimately end up calling the copy process
function in kernel/fork.c. While forking a process, the vfork call is preferred,
whereas while creating a new thread, the clone call is preferred. The latter
allows the caller to accurately indicate which memory regions need to be shared
with the child and which memory regions need to be kept private. The signature
of the copy process function is as follows:
Here, the ellipses . . . indicate that there are more arguments, which we are
not specifying for the sake of readability. The main tasks that are involved in
copying a process are as follows:
2. Copy all the information about open files, network connections, I/O, and
other resources from the parent task.
(a) Copy all the connections to open files. This means that from now
on the parent and child can access the same open file (unless it is
exclusively locked by the parent).
(b) Copy a reference to the current file system.
(c) Copy all information regarding signal handlers to the child.
(d) Copy the page table and other memory-related information (the com-
plete struct mm struct).
(e) Recreate all namespace memberships and copy all the I/O permis-
sions. By default, the child has the same level of permissions as the
parent.
(a) Add the new child task to the list of children of the parent task.
(b) Fix the parent and sibling list of the newly added child task.
(c) Add thread group, process group and session information to the child
task’s struct pid.
The exec family of system calls is used to replace a process’s memory image with that of a new program. In Listing 3.10,
an example is shown where the child process runs the execv library call. Its
arguments are a null-terminated string representing the path of the executable
and an array of arguments. The first argument is by default the file name – pwd
in this case. The next few arguments should be the command-line arguments to
the executable and the last argument needs to be NULL. Since we do not have
any arguments, our second argument is NULL. There are many library calls in
the exec family. All of them wrap the exec system call.
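(Listing 3.10 is not reproduced here; the following is an equivalent sketch of what the text describes, assuming that the pwd binary resides at /bin/pwd.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* argv[0] is the file name by convention; the array is NULL-terminated */
        char *args[] = { "pwd", NULL };
        execv("/bin/pwd", args);
        perror("execv");            /* reached only if execv fails */
        exit(1);
    }
    waitpid(pid, NULL, 0);          /* the parent waits for the child */
    return 0;
}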
There are many steps involved in this process. The first action is to clean
up the memory space (memory map) of a process and reinitialize all the data
structures. We need to then load the starting state of the new binary in the
process’s memory map. This includes the contents of the text and data sections.
Then there is a need to initialize the stack and heap sections, and set the starting
value of the stack pointer. In general, file and network connections are preserved
in an exec call. Hence, there is no need to modify, clean up or reinitialize them.
After the exec call returns, we can start executing the process from the start of
its new text section. We are basically starting the execution of a new program.
The fact that we started from a forked process is conveniently forgotten. This
is the Linux way.
data in its user-mode virtual address space and can also access data in the
kernel’s virtual address space. The next point to note is that the kernel’s virtual address space is kept separate. For example, on a 32-bit system the lower 3 GB of the virtual
address space is reserved for user programs and the upper 1 GB represents the
kernel’s virtual address space. All kernel-level data structures including the
kernel stacks are stored in this upper 1 GB. Unless a kernel thread is explicitly
accessing user space, it accesses data structures only in kernel space. No user
thread can access kernel virtual memory. It will be immediately stopped by
the TLB. The TLB will quickly realize that a user process wishes to access a
kernel page. This information is stored in each TLB entry. Hence, processes can
keep transitioning from user mode to kernel mode, and vice versa, repeatedly,
without revealing kernel data to the user, unless the information is the return
value of a system call.
The other type of “user threads” are not real threads. They are purely
user-level entities that are created, managed and destroyed in user space. This
means that a single process (recognized by the kernel) can create and manage
multiple user threads. The process could itself be multithreaded. Regardless of the implementation, we need to note that a single group of kernel-visible threads manages user threads that could be far more numerous. Consider the case of a single-threaded
process P that creates multiple user threads. It partitions its virtual address
space and assigns dedicated memory regions to each created user thread. Each
such user thread is given its own stack. Process P also creates a heap that is
shared between all user threads. We need to understand that the kernel still
perceives a single process P . If P is suspended, then all the user threads are also
suspended. This mechanism is clearly not as flexible as native threads that are
recognized by the kernel. In user space, it is hard to pause such threads, collect their context and restore the same context later. However, user-threading libraries have become quite mature. It is possible to simulate much of the kernel’s functionality, such as timer interrupts and context collection, using signal handlers, kernel-level timers and bespoke assembly routines. We shall use the term pure-user threads to refer to this type of thread, which is not recognized by the kernel.
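To see how a pure-user threading library can switch stacks entirely in user space, consider a minimal sketch built on the POSIX ucontext API; real libraries add a scheduler, timers delivered via signals, and so on.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, thr_ctx;

static void user_thread(void)
{
    printf("user thread: running on its own stack\n");
    /* returning switches back to uc_link (main_ctx) */
}

int main(void)
{
    getcontext(&thr_ctx);                        /* initialize the context */
    thr_ctx.uc_stack.ss_sp = malloc(STACK_SIZE); /* dedicated stack for this thread */
    thr_ctx.uc_stack.ss_size = STACK_SIZE;
    thr_ctx.uc_link = &main_ctx;                 /* where to resume when it returns */
    makecontext(&thr_ctx, user_thread, 0);

    printf("main: switching to the user thread\n");
    swapcontext(&main_ctx, &thr_ctx);            /* save main's context, run the thread */
    printf("main: back again\n");
    return 0;
}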
Let us now look at I/O threads and kernel threads. We need to understand
that Linux has a single task struct and all threads are just processes. We do
not have different task structs for different kinds of threads. A task struct
however has different fields that determine its behavior. Every task has a priority
and it can be a specialized task that only does kernel work. Let us look at such
variations.
I/O threads are reasonably low-priority threads that are dedicated to I/O
tasks. They can be in the kernel space or run exclusively in user space. Kernel
threads run with kernel permissions and are often very high-priority threads.
The PF KTHREAD bit is set in task struct.flags if a task is a kernel thread.
Kernel threads exclusively do kernel work and do not transition to user mode.
Linux defines analogous functions such as kthread create and kernel clone
to create and clone kernel threads, respectively. Kernel threads are primarily used for
implementing all kinds of bookkeeping tasks, timers, interrupt handlers and
device drivers.
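As an illustration, a kernel module might create such a thread with the kthread API roughly as follows (kthread_run, kthread_should_stop and kthread_stop are standard kernel helpers; the bookkeeping work itself is hypothetical).

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/module.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    /* PF_KTHREAD is set for this task; it never returns to user mode */
    while (!kthread_should_stop()) {
        /* ... perform some periodic bookkeeping ... */
        msleep(1000);
    }
    return 0;
}

static int __init demo_init(void)
{
    worker = kthread_run(worker_fn, NULL, "demo_worker");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(worker);    /* makes kthread_should_stop() return true */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");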
3. Segment registers
There are many other minor components of the hardware context in a large
and complex processor like an x86-64 machine. We have listed the main com-
ponents for the sake of readability. The key point that we need to note is that
this context needs to be correctly stored and subsequently restored.
Let us focus on the TLB now and understand the role it plays in the con-
text switch process. It stores the most frequently (or recently) used virtual-to-
physical mappings. There is a need to flush the TLB when the process changes,
because the new process will have a new virtual memory map. We do not want
it to use the mappings of the previous process. They will be incorrect and this
will also be a serious security hazard because now the new process can access
the memory space of the older process. Hence, once a process is swapped out,
at least no other user-level process should have access to its TLB contents.
An easy solution is to flush the TLB upon a context switch. However, as we
shall see later, there is a more optimized solution, which allows us to append
the pid number to each TLB entry. This does not require the system to flush
the TLB upon a context switch, which is a very expensive solution in terms of
performance. Every process should use its own mappings. Because of the pid
information that is present, a process cannot access the mappings of any other
process. This mechanism (enforced by hardware) reduces the number of TLB
misses. As a result, there is a net performance improvement.
Software Context
The page table, open file and network connections and the details of similar
resources that a process uses are a part of its software context. This information
is maintained in the process’s task struct. There is no need to store and restore
this information upon a context switch – it can always be retrieved from the
task struct.
The structure of the page table is quite interesting if we consider the space of
both user and kernel threads. The virtual address space of any process is typi-
cally split between user space addresses and kernel addresses. On x86 machines,
the kernel addresses are located at the higher part of the virtual address range.
The user space addresses are at the lower end of the virtual address range. Sec-
ond, note that all the kernel threads share their virtual address space. This
means that across processes, the mappings of kernel virtual addresses are iden-
tical. This situation is depicted in Figure 3.10. It helps to underscore the fact
that the virtual address spaces of all user processes are different. This means
that across user processes, the same virtual address maps to different physical
addresses unless they correspond to a shared memory channel. However, this is
not the case for the kernel region. Here, the same virtual address maps to the
same physical address regardless of the user process. Kernel threads that exclusively run in kernel mode do not use any user space virtual addresses.
They only use kernel space virtual addresses at the upper end of the virtual
address space. They also follow the same rule – all kernel space mappings are
identical across all processes.
Identical mappings
Figure 3.10 shows that a large part of the virtual address spaces of processes
have identical mappings. Hence, a part of the page table will also be common
across all processes due to such identical mappings. This “kernel portion” of
the page table will not change even if there is a transition from one process to
another, even though the page tables themselves may change. The mappings
that stand to change on a context switch are in the portion corresponding to
user space addresses.
Point 3.3.1
The page table needs to be changed only when the user space virtual
address mappings change. If there is a switch between kernel threads,
there is no need to change the page table because the kernel virtual
address space is the same for all kernel threads. There is no need to
change the page table even if there is a transition from user mode to
kernel mode. The kernel space mappings will remain the same.
• The virtual address space of any process is split between user and
kernel addresses.
calls the appropriate handler. If we consider the case of a timer interrupt, then
the reason for the kernel’s invocation is not very serious – it is a routine matter.
In this case, there is no need to create an additional kernel thread that is tasked
to continue the process of saving the context of the user-level thread that was
executing. As discussed earlier, we can reuse the same user-level thread that
was interrupted. Specifically, the same task struct can be used, and the user
thread can simply be run in “kernel mode”. Think of this as a new avatar of the
same thread, which has now ascended from the user plane to the kernel plane.
This saves a lot of resources as well as time; there is no need to initialize any
new data structure or create/resume any thread here.
The job of this newly converted kernel thread is to continue the process
of storing the hardware context. This means that there is a need to collect
the values of all the registers and store them somewhere. In general, in most
architectures, the kernel stack is used to store this information. We can pretty
much treat this as a soft switch. This is because the same thread is being reused.
Its status just gets changed – it temporarily becomes a kernel thread and starts
executing kernel code (not a part of the original binary though). Also, it now
uses its kernel stack. Recall that the user-level stack cannot be used in kernel
mode. This method is clearly performance-enhancing and is very lightweight in
character. Let us now answer two key questions.
Is there a need to flush the TLB?
There is only a need to flush the TLB when the mappings change. This will
only happen if the user-mode virtual address space changes (see Point 3.3.1).
There is no need to flush the TLB if there is a user→kernel or kernel→kernel
transition – the kernel part of the address space remains the same. Now, if we
append the pid to each TLB entry, there is no need to remove TLB entries if
the user space process changes.
Is there a need to change the page table?
The answer to this question is the same as the previous one. Whenever we
are transitioning from user mode to kernel mode, there is no need to change
the page table. The kernel space mappings are all that are needed in kernel
mode, and they are identical for all kernel threads. Similarly, while switching
between kernel threads, there is also no need. A need arises to switch the page
table when we are transitioning from kernel mode to user mode, and that too
not all the time. If we are switching back to the same user process that was
interrupted, then also there is no need because the same page table will be used
once again. We did not switch it while entering kernel mode. A need to switch
the page table arises if the scheduler decides to run some other user task. In
this case, it will have different user space mappings, and thus the page table
needs to be changed.
Point 3.3.2
Because the kernel’s virtual address space is the same for all kernel
threads, there is often no need to switch the page table upon a context
switch. For example, while switching from user mode to kernel mode,
there is no need to switch the page table. A need only arises when we
are running a new user task, where the user space mappings change.
Let us now come to the problem of switching between threads that belong
to the same thread group. This should, in principle, be a more lightweight
mechanism than switching between unrelated processes. There should be a way
to optimize this process from the point of view of performance and total effort.
The Linux kernel supports this notion.
Up till now we have maintained that each thread has its dedicated stack
and TLS region. Here, TLS stands for Thread Local Storage. It is a private
storage area for each thread. Given that we do not want to flush the TLB or
switch the page table, we can do something very interesting. We can mandate
all threads to actually use the same virtual address space like kernel threads.
This is a reasonable decision for all memory regions other than the stack and
the TLS region. Here, we can adopt the same solution as kernel threads and
pure-user threads (see Section 3.2.3). We simply use the same virtual address
space and assign different stack pointers to different stacks. This means that all
the stacks are stored in the same virtual address space. They are just stored in
different regions. We just have to ensure that the spacing between them is large
enough in the virtual address space to ensure that one stack does not overflow
and overwrite the contents of another stack. If this is done, then we can nicely
fit all the stacks in the same virtual address space. The same can be done for
TLS regions. On an x86 machine that supports segmentation, doing this is even
easier. We just set the value of the stack segment register to the starting address
of the stack – it is a function of the id of the currently executing thread in the
thread group. This design decision solves a lot of problems for us. There is no
need to frequently replace the contents of the CR3 register, which stores the
starting address of the page table. On x86 machines, any update to the CR3
register typically flushes the TLB also. Both are very expensive operations,
which in this case are fortunately avoided.
Point 3.3.4
In Linux, different threads in a thread group share the complete virtual
address space. The stack and TLS regions of the constituent threads are
stored at different points in this shared space.
There is however a need to store and restore the register state. This includes
the contents of all the general-purpose registers, privileged registers, the pro-
gram counter and the ALU flags. Finally, we need to set the current pointer
to the task struct of the new thread.
To summarize, this is a reasonably lightweight mechanism. Hence, many
kernels typically give preference to another thread from the same thread group
as opposed to an unrelated thread.
Let us now look in detail at the steps involved in saving the context after a
system call is made using the syscall instruction. The initial steps are performed
automatically by hardware, and the later steps are performed by the system call
handler. Note that during the process of saving the state, interrupts are often
disabled. This is because this is a very sensitive operation, and we do not
want to be interrupted in the middle. If we allow interruptions, then the state
will be partially saved and the rest of the state will get lost. Hence, to keep
things simple it is best to disable interrupts at the beginning of this process
and enable them when the context is fully saved. Of course, this does delay
interrupt processing a little bit; however, we can be sure that the context was
saved correctly. Let us now look at the steps.
1. The hardware stores the program counter (rip register) in the register
rcx and stores the flags register rflags in r11. Before making a system
call, it is assumed that the two general purpose registers rcx and r11 do
not contain any useful data.
4. Almost all x86 and x86-64 processors define a special segment in each CPU
known as the Task State Segment or TSS. The size of the TSS segment is
small, but it is used to store important information regarding the context
switch process. It was previously used to store the entire context of the
task. However, these days it is used to store a part of the overall hardware
context of a running task. On x86-64 machines, the stack pointer (rsp)
is stored on it. There is sadly no other choice. We cannot use the kernel
stack because for that we need to update the stack pointer – the old value
will get lost. We also cannot use a general-purpose register. Hence, a
separate memory region such as the TSS segment is necessary.
5. Finally, the stack of the current process can be set to the kernel stack.
6. We can now push the rest of the state to the kernel stack. This will include
the following:
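(The original list is not reproduced here. On x86-64 machines, the saved state corresponds roughly to the fields of struct pt_regs, shown below in simplified form; see arch/x86/include/asm/ptrace.h.)

struct pt_regs {
    /* general-purpose registers pushed by the entry code */
    unsigned long r15, r14, r13, r12, bp, bx;
    unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
    unsigned long orig_ax;   /* system call number / error code */
    /* pushed by the hardware or reconstructed by the entry code */
    unsigned long ip, cs, flags, sp, ss;
};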
To restore the state, we need to exactly follow the reverse sequence of steps.
Additional Context
Along with the conventional hardware context, there are additional parts of
the hardware context that need to be stored and restored. Because the size of
the kernel stack is limited, it is not possible to store a lot of information there.
Hence, a dedicated structure called a thread struct is defined to store all extra
and miscellaneous information. It is defined at the following link:
arch/x86/include/asm/processor.h.
Every thread has TLS regions (thread local storage). It stores variables
specific to a thread. The thread struct stores a list of such TLS regions
(starting address and size of each), the stack pointer (optionally), the segment
registers (ds,es,fs and gs), I/O permissions and the state of the floating-point
unit.
...
    if (!prev->mm) {
        /* prev is a kernel thread without its own mm; it was borrowing an
           active_mm, which will be dropped in finish_task_switch() */
        prev->active_mm = NULL;
    }
}
The switch to function accomplishes this task by executing the steps to save
the context in the reverse order (context restore process). The first step is to
extract all the information in the thread struct structures and restore them.
They are not very critical to the execution and thus can be restored first. Then
the thread local state and segment registers other than the code segment register
are restored. Finally, the current task pointer, a few of the registers and the
stack pointer are restored.
The function finish task switch completes the process. It updates the process
states of the prev and next tasks and also updates the timing information
associated with the respective tasks. This information is used by the scheduler.
Sometimes it can happen that the kernel uses more memory than the size of
its virtual address space. On 32-bit systems, the kernel can use only 1 GB.
However, there are times when it may need more memory. In this case, it is
necessary to temporarily map some pages to kernel memory (known as kmap in
Linux). These pages are typically unmapped in this function before returning
to the user process.
Finally, we are ready to start the new task! We set the values of the rest of the flags, registers, the code segment register and finally the instruction pointer.
Trivia 3.3.1
One will often find statements of this form in the kernel code:
if (likely(<some condition>)) { ... }
if (unlikely(<some condition>)) { ... }
These are hints to the compiler (and thereby to the CPU’s branch predictor). The term likely means that the condition is expected to be true most of the time, and unlikely means that it is expected to be false. The compiler uses these hints to lay out the code such that the common path executes faster, which is vital for good performance.
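Under the hood, these macros expand to GCC’s __builtin_expect; their definitions in include/linux/compiler.h look roughly as follows:

#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)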
Trivia 3.3.2
One often finds statements of the form:
static __latent_entropy struct task_struct *copy_process(...) { ... }
Here, we are using the value of the task struct* pointer as a source
of randomness. Many such random sources are combined in the kernel
to create a good random number source that can be used to generate
cryptographic keys.
15. The hardware context comprises the values of all the registers
(general-purpose and privileged), the next program counter and
the ALU flags. This context needs to be saved and later restored.
16. All the kernel threads share the kernel virtual address space.
This insight can be used to eliminate TLB flushes and page ta-
ble switches whenever there is a context switch to kernel mode or
there is a context switch between kernel threads.
(a) Split the virtual space between the user space and kernel
space.
(b) The mappings for all the kernel pages across all the processes
(user and kernel) are identical.
(c) There is no need to flush the TLB when there is a kernel to
user-mode transition and the interrupted user process is being
resumed once again.
(d) A need arises for a TLB flush and a page table switch only
when we are resuming a different user process.
Exercises
Ex. 2 — Why do we use the term “kernel thread” as opposed to “kernel pro-
cess”?
Ex. 3 — How does placing a limit on the kernel thread stack size make kernel
memory management easy?
Ex. 4 — If the kernel wants to access physical memory directly, how does it
do so using the conventional virtual memory mechanism?
Ex. 5 — Explain the design and operation of the kernel linked list structure
in detail.
Ex. 6 — Why cannot the kernel code use the same user-level stack and delete
its contents before a context switch?
Ex. 7 — What are the advantages of creating a child process with fork and exec, as compared to a hypothetical mechanism that can directly create a process?
Ex. 8 — Assume that there are some pages in a process such as code pages
that need to be read-only all the time. How do we ensure that this holds during
the forking process as well? How do we ensure that the copy-on-write mechanism
does not convert these pages to “non-read-only”?
Ex. 9 — What is the role of the TSS segment in the overall context switch
process?
* Ex. 14 — To save the context of a program, we need to read all of its reg-
isters, and store them in the kernel’s memory space. The role of the interrupt
handler is to do this by sequentially transferring the values of registers to ker-
nel memory. Sadly, the interrupt handler needs to access registers for its own
execution. We thus run the risk of inadvertently overwriting the context of the
original program, specifically the values that it saved in the registers. How do
we stop this from happening?
Ex. 16 — Consider a situation where a process exits, yet a few threads of that
process are still running. Will those threads continue to run? Explain briefly.
Ex. 17 — Why are idr trees used to store pid structures? Why can’t we use
BSTs, B-Trees, and hash tables? Why is it effective?
Ex. 18 — Which two trees does the idr tree combine? How and why?
d) How do DLLs support global and static variables? If the same DLL is
being used concurrently, wouldn’t this cause a problem?
Ex. 25 — How are the radix and augmented trees combined? What is the
need for combining them? Answer the latter question in the context of process
management.
Open-Ended Questions
In this chapter, we will delve into the details of system calls, interrupts, excep-
tions and signals. The first three are the only methods to invoke the OS. It is
important to bear in mind that the OS code normally lies dormant. It comes
into action only after three events of interest: system calls, interrupts and ex-
ceptions. In a generic context, all three of these events are often referred to as
interrupts. All three of them involve transferring control from the running process to
a dedicated handler. Note that sometimes specific distinctions are made such as
using the terms “hardware interrupts” and “software interrupts”. Hardware in-
terrupts refer to classical interrupts generated by I/O devices whereas software
interrupts refer to system calls and exceptions.
The classical method of making system calls on x86 machines is to invoke
the instruction int 0x80 that simply generates an interrupt with interrupt code
0x80. The generic interrupt processing mechanism is used to process the system
call. Modern machines have the syscall instruction, which is more direct and
specialized (as we have seen in Section 3.3.3), even though the basic mechanism
is still the same. Similarly, the x86 processor treats exceptions as a special type
of interrupt.
All hardware interrupts have their own interrupt codes – they are also known
as interrupt vectors. Similarly, all exceptions have their unique codes and so do
system calls. Whenever any such event of interest happens, the hardware first
determines its type. Subsequently, it indexes the appropriate table with the
code of the event of interest. For example, an interrupt vector is used to index
interrupt handler tables. Each entry of this table points to a function that is
supposed to handle the interrupt. Let us elaborate.
Consider a simple C program (Listing 4.1) that prints “Hello World” to the terminal. Recall that a library call encapsulates a system call. It
prepares the arguments for the system call, sets up the environment, makes the
system call and then appropriately processes the return values. The glibc library
on Linux contains all the relevant library code for the standard C library.
Listing 4.1: Example code with the printf library call
#include <stdio.h>

int main() {
    printf("Hello World\n");
}
Let us now understand this process in some detail. The signature of the
printf function is as follows: int printf(const char* format, ...). The
format string is of the form "The result is %d, %s". It is succeeded by a sequence of arguments, which replace the format specifiers ("%d" and "%s")
in the format string. The ellipses . . . indicate that the number of arguments is
variable.
A sequence of functions is called in the glibc code. The sequence is as follows:
printf → printf → vfprintf → printf positional → outstring → PUT.
Gradually the signature changes – it becomes more and more generic. This
ensures that other calls like fprintf that write to a file are all covered by the
same function as special cases. Note that Linux treats every device as a file
including the terminal. The terminal is a special kind of file, which is referred
to as stdout. The function vfprintf accepts a generic file as an argument,
which it can write to. This generic file can be a regular file in the file system
or the terminal (stdout). The signature of vfprintf is as follows:
int vfprintf(FILE *s, const CHAR_T *format, va_list ap,
             unsigned int mode_flags);
Note the generic file argument FILE *s, the format string, the list of ar-
guments and the flags that specify the nature of the I/O operation. Every
subsequent call generalizes the function further. Ultimately, the control reaches
the new do write function in the glibc code (fileops.c). It makes the write
system call, which finally transfers control to the OS. At this point, it is impor-
tant to digress and make a quick point about the generic principles underlying
library design.
any file including stdout. The printf positional function creates the string
that needs to be printed. It sends the output to the outstring function that
ultimately dispatches the string to the file. Ultimately, the write system call is
made that sends the string that needs to be printed along with other details to
the OS.
Attribute             Register
System call number    rax
Arg. 1                rdi
Arg. 2                rsi
Arg. 3                rdx
Arg. 4                r10
Arg. 5                r8
Arg. 6                r9
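To illustrate the convention in the table above, the write system call (number 1 on x86-64 Linux) can be issued directly from user space with GCC inline assembly. This is only a sketch; in practice the C library wrapper is used.

#include <stddef.h>

/* "a" = rax, "D" = rdi, "S" = rsi, "d" = rdx in GCC constraint syntax */
static long raw_write(int fd, const void *buf, size_t count)
{
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)                         /* return value comes back in rax */
                 : "a"(1L),                          /* system call number: 1 = write */
                   "D"((long)fd), "S"(buf), "d"(count)
                 : "rcx", "r11", "memory");          /* syscall clobbers rcx and r11 */
    return ret;
}

int main(void)
{
    raw_write(1, "Hello World\n", 12);               /* file descriptor 1 is stdout */
    return 0;
}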
The kernel maintains a table of system call handlers, as shown in Table 4.2. Given a system call number, the table lists the pointer to
the function that handles the specific type of system call. This function is then
subsequently called. For instance, the write system call ultimately gets handled
by the ksys write function, where all the arguments are processed, and the real
work is done.
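On the kernel side, the handler for write is declared with the SYSCALL_DEFINE family of macros; a sketch of the definition in fs/read_write.c (the generated sys_write stub simply forwards to ksys_write):

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
{
    return ksys_write(fd, buf, count);
}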
some value that is useful to the current thread. In this case also, the thread
that wishes to yield the CPU gets this flag set.
If this flag is set, then the scheduler needs to run and find the most worthy
process to run next. The scheduler uses very complex algorithms to decide this.
The scheduler treats the TIF NEED RESCHED flag as a coarse-grained hint.
Nevertheless, it makes its independent decision. It may decide to continue with
the same task, or it may decide to start a new task on the same core. This is
purely its prerogative.
The context restore mechanism follows the reverse sequence vis-à-vis the context switch process. Note that there are some entities such as segment registers that normally need not be stored but have to be restored. The reason is that they are not transient (ephemeral) in character. We do not expect them to change often, especially not due to the actions of the user task. Once they are set, they typically retain their values till the end of the execution. Their
values can be stored in the task struct. At the time of restoring a context, if
the task changes, we can read the respective values from the task struct and
set the values of the segment registers.
Finally, the kernel executes the sysret instruction, which sets the value of the PC
and completes the control transfer back to the user process. It also changes the
ring level or in other words effects a mode switch (from kernel mode to user
mode).
Figure: The IDT maps an interrupt vector to the address of the corresponding handler. Exceptions, system calls (via int 0x80) and hardware interrupts (each hardware device being identified by its IRQ) all index into the IDT; the idtr register holds its base address.
The hardware indexes the IDT with the interrupt vector and obtains the address of the interrupt handler, whose code is subsequently loaded. The
handler finishes the rest of the context switch process and begins to execute the
code to process the interrupt. Let us now understand the details of the different
types of handlers.
Let us now discuss interrupts and exceptions. Intel processors have APIC
(Advanced Programmable Interrupt Controller) chips (or circuits) that do the
job of liaising with hardware and generating interrupts. These dedicated cir-
cuits are sometimes known as just interrupt controllers. There are two kinds of
interrupt controllers on standard Intel machines: LAPIC (local APIC), a per-
CPU interrupt controller and the I/O APIC. There is only one I/O APIC for
the entire system. It manages all external I/O interrupts. Refer to Figure 4.3
for a pictorial explanation.
Figure 4.3: Each CPU has its own local APIC (LAPIC); a single I/O APIC, connected to the I/O controllers, routes external interrupts to the LAPICs.
4.2.1 APICs
Figure 4.4 represents the flow of actions. We need to distinguish between two
terms: interrupt request and interrupt number/vector. The interrupt number or interrupt vector is a unique identifier of the interrupt and is used to identify the
interrupt service routine that needs to run whenever the interrupt is generated.
The IDT is indexed by this number.
The interrupt request (IRQ), on the other hand, is a hardware signal that is sent to the CPU indicating that a certain hardware device needs to be serviced.
There are different IRQ lines (see Figure 4.4). For example, one line may be
for the keyboard, another one for the mouse, so on and so forth. In older
systems, the number/index of the IRQ line was the same as the interrupt vector.
However, with the advent of programmable interrupt controllers (read APICs),
this has been made more flexible. The mapping can be changed dynamically.
For example, we can program a new device such as a USB device to actually
act as a mouse. It will generate exactly the same interrupt vector. In this way,
it is possible to obfuscate a device and make it present itself as a different or
somewhat altered device to software.
Here again there is a small distinction between the LAPIC and I/O APIC.
The LAPIC directly generates interrupt vectors and sends them to the CPU.
Figure 4.4: IRQ lines feed the APIC, which sends an interrupt vector (not an IRQ) to the CPU. An IRQ is a kernel-level identifier for a hardware interrupt source, whereas an interrupt vector (INT) identifies any kind of interrupting event: interrupt, system call, exception, fault, etc.
The flow of actions (for the local APIC) is shown in Figure 4.4. (1) The first
step is to check if interrupts are enabled or disabled. Recall that we discussed
that often there are sensitive sections in the execution of the kernel where it
is a wise idea to disable interrupts such that no correctness problems are in-
troduced. Interrupts are typically not lost. They are queued in the hardware
queue in the respective APIC and processed in priority order when interrupts
are enabled back again. Of course, there is a possibility of overflows. This is a
rare situation but can happen. In this case interrupts will be lost. Note that
disabling and masking interrupts are two different concepts. Disabling is more
of a sledgehammer-like operation where all interrupts are temporarily disabled.
However, masking is a more fine-grained action where only certain interrupts are
disabled in the APIC. Akin to disabling, the interrupts are queued in the APIC
and presented to the CPU at a later point of time when they are unmasked.
(2) Let us assume that interrupts are not disabled. Then the APIC chooses the highest priority interrupt and finds the interrupt vector for it. It also needs the corresponding data from the device. (3) It buffers the interrupt vector and data, and then checks if the interrupt is masked or not. (4) If it is masked, then it is added to a queue as discussed, otherwise it is delivered to the CPU. (5) The
CPU needs to acknowledge that it has successfully received the interrupt and
only then does the APIC remove the interrupt from its internal queues. Let
us now understand the roles of the different interrupt controllers in some more
detail.
I/O APIC
In the full system, there is only one I/O APIC chip. It is typically not a part
of the CPU, instead it is a chip on the motherboard. It mainly contains a
redirection table. Its role is to receive interrupt requests from different devices,
process them and dispatch the interrupts to different LAPICs. It is essentially
an interrupt router. Most I/O APICs typically have 24 interrupt request lines.
Typically, each device is assigned its IRQ number – the lower the number, the higher the priority. A noteworthy mention is the timer interrupt, whose IRQ number
is typically 0.
Each LAPIC can receive an interrupt from the I/O APIC. It can also receive a
special kind of interrupt known as an inter-processor interrupt (IPI) from other
LAPICs. This type of interrupt is very important for kernel code. Assume that
a kernel thread is running on CPU 5 and the kernel decides to preempt the task
running on CPU 1. Currently, we are not aware of any method of doing so.
The kernel thread only has control over the current CPU, which is CPU 5. It
does not seem to have any control over what is happening on CPU 1. The IPI
mechanism, which is a hardware mechanism, is designed precisely to facilitate this. CPU 5, at the behest of the kernel thread running on it, can instruct
its LAPIC to send an IPI to the LAPIC of CPU 1. This will be delivered to
CPU 1, which will run a kernel thread on CPU 1. After doing the necessary
bookkeeping steps, this kernel thread will realize that it was brought in because
the kernel thread on CPU 5 wanted to replace the task running on CPU 1 with
some other task. In this manner, one kernel thread can exercise its control over
all CPUs. It does however need the IPI mechanism to achieve this. Often, the
timer chip is a part of the LAPIC. Depending upon the needs of the kernel, its
interrupt frequency can be configured or even changed dynamically. We have
already described the flow of actions in Figure 4.4.
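Kernel code rarely programs the LAPIC directly; it usually relies on helpers such as smp_call_function_single, which ride on the IPI mechanism described above. A hedged sketch of running a function on CPU 1 from another CPU:

#include <linux/printk.h>
#include <linux/smp.h>

static void poke_cpu(void *info)
{
    /* runs on the target CPU in interrupt context, delivered via an IPI */
    pr_info("IPI handled on CPU %d\n", smp_processor_id());
}

static void demo_send_ipi(void)
{
    /* ask CPU 1 to run poke_cpu(); the last argument (1) waits for completion */
    smp_call_function_single(1, poke_cpu, NULL, 1);
}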
Distribution of Interrupts
The next question that we need to address is how interrupts are distributed among the LAPICs. There are regular I/O interrupts, timer interrupts and
IPIs. We can either have a static distribution or a dynamic distribution. In
the static distribution, one specific core or a set of cores are assigned the role
of processing a given interrupt. Of course, there is no flexibility when it comes
to IPIs. Even in the case of timer interrupts, it is typically the case that each
LAPIC generates periodic timer interrupts to interrupt its local core. However,
this is not absolutely necessary, and some flexibility is provided. For instance,
instead of generating periodic interrupts, it can be programmed to generate an
interrupt at a specific point of time. In this case, this is a one-shot interrupt and
periodic interrupts are not generated. This behavior can change dynamically
because LAPICs are programmable.
In the dynamic scheme, it is possible to send the interrupt to the core that is
running the task with the least priority. This again requires hardware support.
Every core on an Intel machine has a task priority register, where the
kernel writes the priority of the current task that is executing on it. This
information is used by the I/O APIC to deliver the interrupt to the core that
is running the lowest-priority process. This is a very efficient scheme, because it
allows higher priority processes to run unhindered. If there are idle cores, then
the situation is even better. They can be used to process all the I/O interrupts
and sometimes even timer interrupts (if they can be rerouted to a different core).
4.2.2 IRQs
The file /proc/interrupts contains the details of all the IRQs and how they
are getting processed (refer to Figure 4.3). Note that the contents shown are specific to the author’s machine as of 2023.
The first column is the IRQ number. As we see, the timer interrupt is IRQ#
0. The next four columns show the count of timer interrupts received at each
CPU. Note that these counts are small. This is because any modern machine has a variety of timers; besides the low-resolution LAPIC timer, there are high-resolution timers, and in this case a high-resolution timer was used instead. Modern kernels prefer high-resolution timers
because they can dynamically configure the interrupt interval based on the
processes that are executing in the kernel. This interrupt is originally processed
by the I/O APIC. The term “2-edge” means that this is an edge-triggered
interrupt on IRQ line 2. Edge-triggered interrupts are activated when there is
a signal transition (0 → 1 or 1 → 0). The handler is the generic
function associated with the timer interrupt.
The “fasteoi” interrupts are level-triggered. Instead of being based on an
edge (a signal transition), they depend upon the level of the signal in the in-
terrupt request line. “eoi” stands for “End of Interrupt”. The line remains
asserted until the interrupt is acknowledged.
For every request that comes from an IRQ, an interrupt vector is generated.
Table 4.4 shows the range of interrupt vectors. NMIs (non-maskable interrupts
and exceptions) fall in the range 0-19. The interrupt numbers 20-31 are reserved
by Intel for later use. The range 32-127 corresponds to interrupts generated by
external sources (typically I/O devices). We are all familiar with interrupt num-
ber 128 (0x80 in hex) that is a software-generated interrupt corresponding to a
system call. Most modern machines have stopped using this mechanism because
they now have a faster method based on the syscall instruction. 239 is the local
APIC (LAPIC) timer interrupt. As we have argued, many IRQs can generate
this interrupt vector because there are many timers in modern systems with
different resolutions. Lastly, the range 251-253 corresponds to inter-processor
interrupts (IPIs). A disclaimer is due here. This is the interrupt vector range in
the author’s Intel i7-based system as of 2023. This in all likelihood may change
in the future. Hence, a request to the reader is to treat this data as an example.
Table 4.5 summarizes our discussion quite nicely. It shows the IRQ number,
interrupt vector and the hardware device. We see that IRQ 0 for the default
timer corresponds to interrupt vector 32. The keyboard, system clock, network
interface and USB ports have their IRQ numbers and corresponding interrupt
vector numbers. One advantage of separating the two concepts – IRQ and in-
terrupt vector – is clear from the case of timers. We can have a wide variety
of timers with different resolutions. However, they can be mapped to the same
interrupt vector. This will ensure that whenever an interrupt arrives from any
one of them, the timer interrupt handler can be invoked. The kernel can dynamically decide which timer to use depending on the requirements and load
on the system.
Given that HW IRQs are limited in number, it is possible that we may have
more devices than the number of IRQs. In this case, several devices have to
share the same IRQ number. We can do our best to dynamically manage the
IRQs such as deallocating the IRQ when a device is not in use or dynamically
allocating an IRQ when a device is accessed for the first time. In spite of that we
still may not have enough IRQs. Hence, there is a need to share an IRQ between
multiple devices. Whenever an interrupt is received from an IRQ, we need to
check which device generated it by running the handlers of all the connected devices that share the same IRQ. These handlers will query the
individual devices or inspect the data and find out. Ultimately, we will find a
device that is responsible for the interrupt. This is a slow but compulsory task.
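Drivers cooperate by passing the IRQF_SHARED flag to request_irq and by writing handlers that first check whether their own device actually raised the interrupt. The sketch below is hypothetical: the device structure and the status register at offset 0x04 are assumptions.

#include <linux/interrupt.h>
#include <linux/io.h>

struct my_device {                 /* hypothetical per-device state */
    void __iomem *regs;
    int irq;
};

static irqreturn_t my_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    /* check a (hypothetical) status register: did this device interrupt? */
    if (!(readl(dev->regs + 0x04) & 0x1))
        return IRQ_NONE;           /* not ours; the next shared handler is tried */

    /* ... acknowledge and process the interrupt ... */
    return IRQ_HANDLED;
}

static int my_probe(struct my_device *dev)
{
    /* IRQF_SHARED: this IRQ line may be shared with other devices */
    return request_irq(dev->irq, my_handler, IRQF_SHARED, "my_device", dev);
}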
essential elements of the program state such as the flags register and the PC.
There is some amount of complexity involved. Depending upon the nature of
the exception/interrupt, the program counter that should be stored can either
be the current program counter or the next one. If an I/O interrupt is received,
then without doubt we need to store the next PC. However, if there is a page
fault then we need to execute the same instruction once again. In this case,
the PC is set to the current PC. It is assumed that the interrupt processing
hardware is smart enough to figure this out. It needs to then store the next PC
(appropriately computed) and the flags to either known registers such as rcx
and r11 or on to the user’s stack. This part has to be automatically done prior
to starting the interrupt handler, which will be executed in software.
Listing 4.2 shows the important fields in struct irq desc. It is the nodal
data structure for all IRQ-related data. It stores all the information regarding
the hardware device, the interrupt vector, CPU affinities (which CPUs process
it), pointer to the handler, special flags and so on.
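(Listing 4.2 is not reproduced here. A rough sketch of the key fields, following include/linux/irqdesc.h, is shown below; the real structure has many more members.)

struct irq_desc {
    struct irq_data      irq_data;     /* HW IRQ number, interrupt chip, domain */
    irq_flow_handler_t   handle_irq;   /* high-level flow handler (edge, level, ...) */
    struct irqaction     *action;      /* chain of device handlers sharing this IRQ */
    unsigned int         status_use_accessors;   /* IRQ status flags */
    const char           *name;
    /* ... affinity masks, statistics, locks and many more fields ... */
};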
Akin to process namespaces, IRQs are subdivided into domains. This is
especially necessary given that modern processors have a lot of devices and
interrupt controllers. We can have a lot of IRQs, but at the end of the day, the
processor will use the interrupt vector (a simple number between 0-255). It still
needs to retain its meaning and be unique.
A solution similar to hierarchical namespaces is as follows: assign each in-
terrupt controller a domain. Within a domain, the IRQ numbers are unique.
Recall that we followed a similar logic in process namespaces – within a names-
pace pid numbers are unique. The IRQ number (like a pid) is in a certain sense
getting virtualized. Similar to a namespace’s IDR tree whose job was to map pid
numbers to struct pid data structures, we need a similar mapping structure
here per domain. It needs to map IRQ numbers to irq desc data structures.
This is known as reverse mapping (in this specific context). Such a mapping
mechanism allows us to quickly retrieve an irq desc data structure given an
IRQ number. Before that we need to add the interrupt controller to an IRQ
domain. Typically, the irq domain add function is used to realize this.
The IDT maps the interrupt vector to the address of the handler.
The initial IDT is set up by the BIOS. During the process of the kernel
booting up, it is sometimes necessary to process user inputs or other important
system events like a voltage or thermal emergency. Also in many cases prior
to the OS booting up, the bootloader shows up on the screen; it asks the user
about the kernel that she would like to boot. For all of this, we need a bare-bones IDT that is already set up. However, once the kernel boots, it needs
to reinitialize or overwrite it. For every single device and exception-generating
situation, entries need to be made. These will be custom entries and only the
kernel can make them because the BIOS would simply not be aware of them –
they are very kernel specific. Furthermore, the interrupt handlers will be in the
kernel’s address space and thus only the kernel will be aware of their locations.
In general, interrupt handlers are not kept in a memory region that can be
relocated or swapped out. The pages are locked and pinned in physical memory
(see Section 3.1.9).
The kernel maps the IDT to the idt table data structure. Each entry of this
table is indexed by the interrupt vector. Each entry points to the corresponding
interrupt handler. It basically contains two pieces of information: the value
of the code segment register and an offset within the code segment. This is
sufficient to load the interrupt handler. Even though this data structure is
set up by the kernel, it is actually looked up in hardware. There is a simple
mechanism to enable this. There is a special register called the IDTR register.
Similar to the CR3 register for the page table, it stores the base address of the
IDT. Thus, the processor knows where to find the IDT in physical memory.
The rest of the lookup can be done in hardware and interrupt handlers can be
automatically loaded by a hardware circuit. The OS need not be involved in
this process. Its job is to basically set up the table and let the hardware do the
rest.
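For reference, a simplified sketch of a 64-bit IDT entry (gate descriptor) is shown below. The real struct gate_struct in arch/x86/include/asm/desc_defs.h splits the bits field further into the IST index, gate type, DPL and present bit; u16 and u32 are the kernel’s fixed-width integer types.

struct gate_desc {
    u16 offset_low;       /* handler offset, bits 0-15 */
    u16 segment;          /* code segment selector of the handler */
    u16 bits;             /* IST index, gate type, DPL and present bit (simplified) */
    u16 offset_middle;    /* handler offset, bits 16-31 */
    u32 offset_high;      /* handler offset, bits 32-63 */
    u32 reserved;
} __attribute__((packed));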
The entry point to the IDT is shown in Listing 4.4. The vector irq array
is a table that uses the interrupt vector (vector) as an index to fetch the
corresponding irq desc data structure. This array is stored in the per-CPU
region, hence the this cpu read macro is used to access it. Once we obtain the
irq desc data structure, we can process the interrupt by calling the handle irq
function. Recall that the interrupt descriptor stores a pointer to the function
that is meant to handle the interrupt. The array regs contains the values of
all the CPU registers. This was populated in the process of saving the context
of the running process that was interrupted. Let us now look at an interrupt
handler, referred to as an IRQ handler in the parlance of the Linux kernel. The
specific interrupt handlers are called from the handle irq function.
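(Listing 4.4 is not reproduced here. The essence of the lookup, adapted as a sketch from arch/x86/kernel/irq.c, is shown below.)

/* sketch: map the interrupt vector to its descriptor using the per-CPU table */
struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);

if (likely(!IS_ERR_OR_NULL(desc)))
    handle_irq(desc, regs);    /* dispatch to the registered flow handler */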
4.2.6 Exceptions
The Intel processor on the author’s machine defines 24 types of exceptions. These are treated exactly the same way as interrupts.
Many of the exceptions are self-explanatory. However, some require additional explanation as well as justification. Let us consider the “Breakpoint”
exception. This is pretty much a user-added exception. While debugging a
program using a debugger like gdb, we normally want the execution to stop at
a given line of code. This point is known as a breakpoint. This is achieved as
follows. First, it is necessary to include detailed symbol and statement-level in-
formation while compiling the binary (known as debugging information). This
is achieved by adding the ‘-g’ flag to the gcc compilation process. This de-
bugging information that indicates which line of the code corresponds to which
program counter value, or which variable corresponds to which memory address
is typically stored in the DWARF format. A debugger extracts this information
and stores it in its internal hash tables.
When the programmer requests the debugger to set a breakpoint correspond-
ing to a given line of code, then the debugger finds the program counter that is
associated with that line of code and informs the hardware that it needs to stop
when it encounters that program counter. Every x86 processor has dedicated
debug registers (DR0 . . . DR3 and a few more), where this information can be
stored a priori. The processor uses this information to stop at a breakpoint.
At this point, it raises the Breakpoint exception, which the OS catches and
subsequently lets the debugger know about it. The debugger then lets the pro-
grammer know. Note that after the processor raises the Breakpoint exception,
the program that was being debugged remains effectively paused. It is possible
to analyze its state at this point of time. The state includes the values of all
the variables and the memory contents.
The other exceptions correspond to erroneous conditions that should nor-
mally not arise such as accessing an invalid opcode, device or address. An
important exception is the “Double fault”. It is an exception that arises while handling another exception.
Exception Handling
Let us now look at exception handling (also known as trap handling). For every
exception, we define a macro of the form (refer to Listing 4.5) –
We are declaring a macro for division errors. It is named exc divide error.
It is defined in Listing 4.6. The generic function do error trap handles all the
traps (to begin with). Along with details of the trap, it takes all the CPU
registers as an input.
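(Listings 4.5 and 4.6 are not reproduced here. In recent kernels, the declaration and the definition look roughly as follows — see arch/x86/include/asm/idtentry.h and arch/x86/kernel/traps.c; treat this as a sketch, since the macros have changed across kernel versions.)

/* declaration: associates the trap vector with the handler name */
DECLARE_IDTENTRY(X86_TRAP_DE, exc_divide_error);

/* definition: DEFINE_IDTENTRY generates the entry boilerplate */
DEFINE_IDTENTRY(exc_divide_error)
{
    do_error_trap(regs, 0, "divide error", X86_TRAP_DE, SIGFPE,
                  FPE_INTDIV, error_get_trap_addr(regs));
}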
There are several things that an exception handler can do. The various
options are shown in Figure 4.5.
The first option is clearly the most innocuous, which is to simply send a
signal to the process and not take any other kernel-level action. This can for
instance happen in the case of debugging, where the processor will generate an
exception upon the detection of a debug event. The OS will then be informed,
and the OS needs to send a signal to the debugging process. This is exactly
how a breakpoint or watchpoint work.
The second option is not an exclusive option – it can be clubbed with the
other options. The exception handler can additionally print messages to the
kernel logs using the built-in printk function. This is a kernel-specific print
function that writes to the logs. These logs are visible using either the dmesg
command or are typically found in the /var/log/messages file. Many times
understanding the reasons behind an exception is very important, particularly
when kernel code is being debugged.
The third option is meant to genuinely be an exceptional case. It is a dou-
ble fault – an exception within an exception handler. This is never supposed
to happen unless there is a serious bug in the kernel code. In this case, the
recommended course of action is to halt the system and restart the kernel. This event is also known as a kernel panic (see kernel/panic.c).
The fourth option is very useful. For example, assume that a program has
been compiled for a later version of a processor that provides a certain instruc-
tion that an earlier version does not. For instance, processor version 10 in the
processor family provides the cosine instruction, which version 9 does not. In
this case, it is possible to create a very easy patch in software such that code
that uses this instruction can still seamlessly run on a version 9 processor.
The idea is as follows. We allow the original code to run. When the CPU encounters an unknown instruction (in this case the cosine instruction), it
will generate an exception – illegal instruction. The kernel’s exception handler
can then analyze the nature of the exception and figure out that it was actually
the cosine operation that the instruction was trying to compute. However, that
instruction is not a part of the ISA of the current processor. In this case, it
is possible to use other existing instructions to compute the cosine of the argument and populate the destination register with
the result. The running program can be restarted at the exact point at which it
trapped. The destination register will have the correct result. It will not even
perceive the fact that it was running on a CPU that did not support the cosine
instruction. Hence, from the point of view of correctness, there is no issue.
Of course, there is a performance penalty – this is a much slower solution
as compared to having a dedicated instruction. However, the code now be-
comes completely portable. Had we not implemented this patching mechanism
via exceptions, the entire program would have been rendered useless. A small
performance penalty is a very small price to pay in this case.
The last option is known as the notify die mechanism, which implements
the classic observer pattern in software engineering.
Interested subsystems register a callback function (a custom exception handler) in a linked list associated with the exception. The callback
function (handler) will then be invoked along with some arguments that the
exception will produce. This basically means that we would like to associate
multiple handlers with an exception. The aim is to invoke them in a certain
sequence and allow all of them to process the exception as per their own internal
logic.
Each of these processes that register their interest are known as observers or
listeners. For example, if there is an error within the core (known as a Machine
Check Exception), then different handlers can be invoked. One of them can
look at the nature of the exception and try to deal with it by running a piece
of code to fix any errors that may have occurred. Another interested listener
can log the event. These two processes are clearly doing different things, which
was the original intention. We can clearly add more listeners to the chain, and do many other things.
The return values of the different handlers are quite relevant and important
here. This process is similar in character to the irqaction mechanism, where
we invoke all the interrupt handlers that share an IRQ line in sequence. The
return value indicates whether the interrupt was successfully handled or not.
In the case of shared interrupts, we would like the interrupt to be handled only once. However, in the
case of an exception, multiple handlers can be invoked, and they can perform
different kinds of processing. They may not enjoy a sense of exclusivity (as
in the case of interrupts). Let us elaborate on this point by looking at the
return values of exception handlers that use the notify die mechanism (shown
in Table 4.7). We can either continue traversing the chain of listeners/observers
after processing an event or stop calling any more functions. All the options
have been provided.
Value          Meaning
NOTIFY DONE    Do not care about this event. However, other functions in the chain can be invoked.
NOTIFY OK      Event successfully handled. Other functions in the chain can be invoked.
NOTIFY STOP    Do not call any more functions.
NOTIFY BAD     Something went wrong. Stop calling any more functions.

Table 4.7: Status values returned by exception handlers that have subscribed to the notify die mechanism. Source: include/linux/notifier.h
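A subsystem registers its interest by adding a notifier_block to the die chain. A hedged sketch (register_die_notifier and the NOTIFY return codes are standard kernel interfaces; the handler itself is hypothetical):

#include <linux/kdebug.h>
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/printk.h>

static int my_die_handler(struct notifier_block *nb,
                          unsigned long val, void *data)
{
    /* log the event and let the remaining listeners in the chain run */
    pr_info("die notifier invoked, event = %lu\n", val);
    return NOTIFY_DONE;
}

static struct notifier_block my_die_nb = {
    .notifier_call = my_die_handler,
    .priority      = 0,             /* higher-priority listeners are called first */
};

static int __init demo_init(void)
{
    return register_die_notifier(&my_die_nb);   /* hook into the notify die chain */
}

static void __exit demo_exit(void)
{
    unregister_die_notifier(&my_die_nb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");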
a more generic mechanism known as work queues. They can be used to ex-
ecute any generic function as a deferred function call. They run as regular
kernel threads in the kernel space. Work queues were conceived to be generic
mechanisms, whereas soft and threaded IRQs were always designed for more
specialized interrupt processing tasks.
A brief explanation of the terminology is necessary here. We shall refer to an
IRQ using capital letters. A softirq is however a Linux mechanism and thus will
be referred to with small letters or with a specialized font softirq (represents
a variable in code).
4.3.1 Softirqs
A regular interrupt’s top-half handler is known as a hard IRQ. It is bound by a
large number of rules and constraints regarding what it can and cannot do. A
softirq on the other hand is a bottom half handler. There are two ways that it
can be invoked (refer to Figure 4.6).
Figure 4.6: Two paths lead to do softirq: one starts from a hard IRQ (top half), and the other from system management work done by kernel threads that raise softirq requests directly.
The first method (on the left) starts with a regular I/O interrupt (hard IRQ). After basic interrupt processing, a softirq request is raised. This means that a work parcel is created that needs to be executed later by a softirq thread. It is important to call the function local_bh_enable after this such that the processing of bottom-half work, such as softirqs, is enabled. Then, at a later point of time, the function do_softirq is invoked, whose job is to check all the deferred work items and execute them one after the other using specialized high-priority threads.
There is another way of generating this type of work (the right path in the figure). It is not necessary that top-half interrupt handlers raise softirq requests. They can also be raised by regular kernel threads that want to defer some work for later processing. It is important to note that there may be more urgent needs in the system, and thus some kernel work cannot be done immediately. Hence, a deferred work item can be created and stored as a softirq request.
A dedicated kernel thread called ksoftirqd runs periodically and checks for pending softirq requests. Such threads are called daemons. Daemons are dedicated kernel threads that typically run periodically and check/process pending requests. ksoftirqd follows the same execution path and calls the function do_softirq, where it picks an item from a softirq queue and executes it.
Raising a softirq
Many kinds of interrupt handlers can raise softirq requests. They all invoke the raise_softirq function whenever they need to add a softirq request. Instead of using a software queue, there is a faster method to record this information: store a word in memory in the per-CPU region. Each bit of this word corresponds to a specific type of softirq. If a bit is set, it means that a softirq request of that type is pending on the corresponding CPU.
Here are examples of some types of softirqs (defined in include/linux/interrupt.h): HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, BLOCK_SOFTIRQ, SCHED_SOFTIRQ and HRTIMER_SOFTIRQ. As the names suggest, for different kinds of interrupts, we have different kinds of softirqs defined. Of course, the list is limited
and so is flexibility. However, the softirq mechanism was never meant to be very
generic in the first place. It was always meant to offload deferred work for a few
well-defined classes of interrupts and kernel tasks. It is not meant to be used
by device drivers.
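The following is a conceptual sketch of the per-CPU pending word, not the actual kernel code: the names pending_softirqs, raise_softirq_on_cpu and run_pending_softirqs are hypothetical, and the real implementation additionally disables interrupts/uses per-CPU accessors while touching this state.

#include <stdint.h>

#define NR_SOFTIRQS   10     /* e.g., HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, ... */
#define NR_CPUS       64

/* One pending word per CPU; bit i set => a softirq of type i is pending. */
static uint32_t pending_softirqs[NR_CPUS];

/* Hypothetical: record that softirq number 'nr' is pending on CPU 'cpu'. */
static inline void raise_softirq_on_cpu(int cpu, int nr)
{
        pending_softirqs[cpu] |= (1u << nr);
}

/* Hypothetical: what do_softirq conceptually does on its own CPU. */
static void run_pending_softirqs(int cpu, void (*actions[])(void))
{
        uint32_t pending = pending_softirqs[cpu];
        pending_softirqs[cpu] = 0;
        for (int nr = 0; nr < NR_SOFTIRQS; nr++)
                if (pending & (1u << nr))
                        actions[nr]();   /* run the handler registered for this softirq type */
}

The key point is that checking "is anything pending?" is a single word read, which is far cheaper than walking a software queue.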
that run HW IRQ or softirq tasks. On most Linux distributions, their real-time
priority is set to 50, which is clearly way more than all user-level threads and a
lot of low-priority kernel threads as well.
We can appreciate this much better by looking at struct irqaction again.
Refer to Listing 4.7.
Broad Overview
Let us provide a brief overview of how a work queue works (refer to Figure 4.7).
A work queue is typically associated with a certain class of tasks such as high-
priority tasks, batch jobs, bottom halves, etc. This is not a strict requirement; however, in terms of software engineering, it is a sensible design decision.
Each work queue contains a bunch of worker pool wrappers, each of which wraps a worker pool. Let us first understand what a worker pool is, and then we will discuss the need to wrap it (create additional code to manage it). A worker pool has three components: a set of inactive work items, a group of threads that process the work in the pool, and a linked list of work items that need to be processed (executed).
The main role of the worker pool is to store a list of work items that need to be completed at some point of time in the future. Consequently, it maintains a set of ready threads that can immediately be given a work item to process, which also guarantees some degree of timely completion. A work item contains a function pointer and the arguments of the function. A thread executes the function with
the arguments that are stored in the work item (referred to as a work_struct in the kernel code).
Figure 4.7: A work queue contains a set of wrappers; each wrapper wraps a worker pool, which holds a linked list of active work items (work_structs), a set of inactive items and a group of worker threads
It may appear that all that we need for creating such a worker pool is a
bunch of threads and a linked list of work items. However, there is a little bit
of additional complexity here. It is possible that a given worker pool may be
overwhelmed with work. For instance, we typically associate a worker pool with
a CPU or a group of CPUs. It is possible that a lot of work is being added to
it and thus the linked list of work items ends up becoming very long. Hence,
there is a need to limit the size of the work that is assigned to a worker pool.
We do not want to traverse long linked lists.
An ingenious solution to limit the size of the linked list is as follows. We
tag some work items as active and put them in the linked list of work items
and tag the rest of the work items as inactive. The latter are stored in another
data structure, which is specialized for storing inactive work items (meant to
be processed much later). The advantage that we derive here is that for the
regular operation of the worker pool, we deal with smaller data structures.
Given that now there is an explicit size limitation, whenever there is an
overflow in terms of adding additional work items, we can safely store them in
the set of inactive items. When we have processed a sizeable number of active
items, we can bring in work items from the inactive list into the active list. It is
the role of the wrapper of a worker pool to perform this activity. Hence, there
is a need to wrap it.
The worker pool along with its wrapper can be thought of as one cohesive
unit. Now, we may need many such wrapped worker pools because in a large
system we shall have a lot of CPUs, and we may want to associate a worker
pool with each CPU or a group of CPUs. This is an elegant way of partitioning
the work and also doing some load-balancing.
Let us now look at the kernel code that is involved in implementing a work
queue.
The relevant kernel code lives in kernel/workqueue.c (and its internal header); it defines the key data structures workqueue_struct, pool_workqueue and worker_pool.
There is not much to it. The member struct list_head entry indicates that this work_struct is a part of a linked list of work_structs. This is per se not a field that describes the operation that needs to be performed. The only two operational fields of importance are data (the data to be processed) and the function pointer (func). The data field can also be a pointer to an object that contains all the information about the arguments. A work_struct represents the basic unit of work.
The advantage of work queues is that they are usable by third-party code and device drivers as well. This is quite unlike threaded IRQs and softirqs, which are not meant to be used by device drivers. Any entity can create a struct work_struct and insert it into a work queue. The work is executed later, when the kernel has the bandwidth to do so.
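As a hedged illustration, a driver-like module could defer work as follows. The function and work-item names are made up; only the INIT_WORK/DECLARE_WORK and schedule_work interface comes from include/linux/workqueue.h.

#include <linux/workqueue.h>
#include <linux/printk.h>

/* This function runs later, in process context, inside a worker thread. */
static void my_deferred_fn(struct work_struct *work)
{
        pr_info("deferred work executed\n");
}

/* Statically define a work item bound to the function above. */
static DECLARE_WORK(my_work, my_deferred_fn);

/* From an interrupt handler, or any other context that cannot block:       */
/*     schedule_work(&my_work);    -- queue it on the system work queue     */

Unlike a softirq, nothing here requires touching a fixed enumeration of softirq types; any module can define and queue its own work items.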
The relationship between the pool of workers, the work queue and the worker pool is shown in Figure 4.9. The apex data structure that represents the entire work queue is struct workqueue_struct. It has a member of the wrapper class struct pool_workqueue, which wraps the worker pool.
Let us explain the notion of a wrapper class. The wrapper class wraps the worker pool (struct worker_pool): it intercepts every call to the worker pool, checks it and appropriately modifies it. Its job is to restrict the size of the pool and limit the amount of work that it does at any point of time. It is not desirable to overload a worker pool with a lot of work – it will perform inefficiently. An overloaded pool also indicates that either the kernel itself is doing a lot of work or that the work is not being distributed efficiently amongst the CPUs (different worker pools).
The standard approach used by this wrapper class is to maintain two lists:
active and inactive. If there is more work than what the queue can handle, we
put the additional work items in the inactive list. When the size of the pool
reduces, items can be moved from the inactive list to the list of active work
items.
In addition, each CPU has two work queues: one for low-priority tasks and
one for high-priority tasks.
Listing 4.10 shows the code of a signal handler. Here, the handler function is handler, which takes a single argument as input: the number of the signal. The function then executes like any other function. It can make library calls and also
call other functions. In this specific version of the handler, we are making an
exit call. This kills the thread that is executing the signal handler. However,
this is not strictly necessary.
Let us assume that we did not make the call to the exit library function. Then, one of the following could have happened. If the signal blocked other signals or interrupts, their respective handlers would be executed. If the signal was associated with process or thread termination (for example, SIGSEGV or SIGABRT), the respective thread (or thread group) would be terminated upon returning from the handler. If the thread was not meant to be terminated, it resumes executing from the point at which it was paused and the signal handler's execution began. From the thread's point of view, this is like a regular context switch.
Now let us look at the rest of the code. Refer to the main function. We need
to register the signal handler. This is done in Line 10. After that, we fork the
process. It is important to bear in mind that signal handling information is also
copied. In this case, the child process's signal handler will be the copy of the handler function in its address space. The child process prints that it is
the child and then goes into an infinite while loop.
The parent process on the other hand has more work to do. First, it waits
for the child to get fully initialized. There is no point in sending a signal to a
process that has not been fully initialized; otherwise, the signal will be ignored. The parent thus sleeps for 2 seconds, which is deemed to be enough. It then sends a signal to the child using the kill library call, which in turn makes the kill system call,
which is used to send signals to processes. In this case, it sends the SIGUSR1
signal. SIGUSR1 has no particular significance otherwise – it is meant to be
defined by user programs for their internal use.
When the parent process sends the signal, the child at that point of time is
stuck in an infinite loop. It subsequently wakes up and runs the signal handler.
The logic of the signal handler is quite clear – it prints the fact that it is the
child along with its process id and then makes the exit call. The parent in turn
waits for the child to exit, and then it collects the pid of the child process along
with its exit status. The WEXITSTATUS macro can be used to parse the exit value
(extract its lower 8 bits).
The output of the program will clearly show that the child was initially stuck in an infinite loop, that it then ran its signal handler after the parent sent the signal, and that it finally exited, at which point the waiting parent collected its exit status.
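The book's Listing 4.10 is not reproduced here; the following is a minimal user-space sketch that is consistent with the description above (calling printf from a handler is not async-signal-safe, but it keeps the example short).

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

void handler(int signum)
{
        printf("Child %d received signal %d\n", getpid(), signum);
        exit(1);                     /* terminate the thread running the handler */
}

int main(void)
{
        signal(SIGUSR1, handler);    /* register the signal handler */
        pid_t pid = fork();          /* the child inherits the handler */
        if (pid == 0) {
                printf("I am the child\n");
                while (1);           /* spin until the signal arrives */
        }
        sleep(2);                    /* let the child get fully initialized */
        kill(pid, SIGUSR1);          /* send SIGUSR1 to the child */

        int status;
        waitpid(pid, &status, 0);
        printf("Child %d exited with status %d\n", pid, WEXITSTATUS(status));
        return 0;
}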
Note that all the signal handling structures are in general shared among all the
threads in a thread group. A thread group (also referred to as a multithreaded
process) is a single unit insofar as signal handling is concerned. The kill com-
mand or system call can be used to send a signal to any other process from
either the command line or programmatically. Note that kill does not mean
killing the process, as the literal meaning would suggest. It should have been named send_signal instead; let us live with such anomalies. Using
the kill command on the command line is quite easy. The format is: kill
-signal pid.
Please refer to Table 4.8, which shows some of the most common signals used in the Linux operating system. Many of them can be handled and blocked. However, there are some, such as SIGSTOP and SIGKILL, that are never delivered to the process: the kernel directly stops or kills the process, respectively.
In this sense, the SIGKILL signal is meant for all the threads of a multi-
threaded process. But, as we can see, in general, a signal is meant to be handled
only by a single thread in a thread group. There are different ways of sending a
signal to a thread group. One of the simplest approaches is the kill system call
that can send any given signal to a thread group. One of the threads handles
the signal. There are many versions of this system call. For example, the tkill call can send a signal to a specific thread within a process, whereas the tgkill call takes care of a corner case. It is possible that the thread id specified in the tkill call is recycled: the thread completes and then a new thread is spawned with the same id. This can lead to a signal being sent to the wrong thread. To guard against this, the tgkill call takes an additional argument, the thread group id. It is unlikely that both the thread id and the thread group id will be recycled and still map to the same combination.
Regardless of the method that is used, it is clear that signals are sent to a thread group; they are not meant to be sent to a particular thread unless a call such as tgkill is used. Sometimes, however, there is an arithmetic exception in a thread, and the resulting signal must be delivered to that specific thread only. In this case, it is neither possible nor advisable to deliver it to another thread in the same thread group.
Furthermore, signals can be blocked as well as ignored. When a signal is blocked and such a signal arrives, it is queued. All such queued/pending signals are handled once they are unblocked. Here also there is a caveat: no two signals of the same type can be pending for a process at the same time. Moreover, when a signal handler executes, it blocks the corresponding signal.
There are several ways in which a signal can be handled.
The first option is to ignore the signal – it means that the signal is not
important, and no handler is registered for it. In this case, the signal can be
happily ignored. On the other hand, if the signal is important and can lead to
process termination, then the action that needs to be taken is to kill the process.
Examples of such signals are SIGKILL and SIGINT (refer to Table 4.8). There
can also be a case where the process is terminated, but an additional file called a core dump is created. It can be used by a debugger to inspect the state of the process at the point at which it was stopped because of the receipt of the signal. For instance, we can find the values of all the local variables, the stack's contents and the memory contents.
We have already seen the process stop and resume signals earlier. Here, the
stop action is associated with suspending a process indefinitely until the resum-
ing action is initiated. The former corresponds to the SIGSTOP and SIGTSTP
signals, whereas the latter corresponds to the SIGCONT signal. It is impor-
tant to understand that like SIGKILL, these signals are not sent to the process.
They are instead intercepted by the kernel and the process is either terminated
or stopped/resumed. SIGKILL and SIGSTOP in particular cannot be ignored,
handled or blocked.
Finally, the last method is to handle the signal by registering a handler. Note
that in many cases this may not be possible, especially if the signal arose because
of an exception. The same exception-causing instruction will execute after the
handler returns and again cause an exception. In such cases, process termination
or stopping that thread and spawning a new thread are good options. In some cases, if the circumstances behind an exception can be changed, then the signal handler can do that – for example, by remapping a memory page or changing the value of a variable. Making such changes in a signal handler is quite risky and is only meant for black-belt programmers.
Let us now look at the relevant kernel code. The apex data structure in signal handling is signal_struct (refer to Listing 4.11). The information about the signal handler is kept in struct sighand_struct. The two important fields that store the set of blocked/masked signals are blocked and real_blocked. They are of the type sigset_t, which is nothing but a bit vector: one bit for each signal. It is possible that a lot of signals have been blocked by the process because it is simply not interested in them. All of these signals are stored in the variable real_blocked. Now, during the execution of any signal handler, typically more signals are blocked, including the signal that is being handled. There is a need to add all of these additional signals to the set real_blocked. With these additional signals, the expanded set of signals is called blocked.
Note that blocked is thus a superset of real_blocked. These are all the signals that we do not want to handle while a signal handler is executing. After the handler finishes executing, the kernel sets blocked = real_blocked.
struct sigpending stores the list of pending/queued signals that have not
been handled by the process yet. We will discuss its intricacies later.
Finally, consider the last field, which is quite interesting. For a signal handler, we may either want to use the same stack as the thread that was interrupted or a different one. If we are using the same stack, then there is no problem. Otherwise, we can use a different stack in the thread's address space; in this case, its starting address and size need to be specified in this field. Using such an alternative stack, which is different from the real stack that the thread was using, creates no correctness problem: the original thread is stopped in any case, and thus the stack that the handler uses does not matter.
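As a hedged sketch, this is how a user program typically requests an alternative signal stack with the POSIX sigaltstack call and the SA_ONSTACK flag; the handler itself is illustrative.

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void on_sig(int sig)
{
        /* This handler runs on the alternative stack. */
        _exit(1);
}

int main(void)
{
        stack_t ss;
        ss.ss_sp    = malloc(SIGSTKSZ);   /* memory for the alternative stack */
        ss.ss_size  = SIGSTKSZ;
        ss.ss_flags = 0;
        sigaltstack(&ss, NULL);           /* tell the kernel about this stack */

        struct sigaction sa;
        sa.sa_handler = on_sig;
        sa.sa_flags   = SA_ONSTACK;       /* run the handler on the alternative stack */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        /* ... rest of the program ... */
        return 0;
}

An alternative stack is particularly useful when the fault itself is a stack overflow: the regular stack cannot be used to run the handler in that case.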
Listing 4.12 shows the important fields in the main signal-related structure, signal_struct. It mainly contains process-wide information such as the number of active threads in the thread group, a linked list of all the threads (in the thread group), a list of all the constituent threads that are waiting on the wait system call, the last thread that processed a signal, and the list of pending signals (shared across all the threads in a thread group).
Let us now look at the next data structure – the signal handler.
Listing 4.13 shows the wrapper of signal handlers of the entire multithreaded
process. It actually contains a lot of information in these few fields. Note that
this structure is shared by all the threads in the thread group.
The first field, count, maintains the number of task_structs that use this handler. The next field, signalfd_wqh, is a queue of waiting processes. At this
stage, it is fundamental to understand that there are two ways of sending a
signal to a process. We have already seen the first approach, which involves
calling the signal handler directly. This is a straightforward approach and uses
the traditional paradigm of using callback functions, where a callback function
is a function pointer that is registered with the caller. In this case, the caller
(invoker) is the signal handling subsystem of the OS.
It turns out that there is a second mechanism, which is not used that widely.
As compared to the default mechanism, which is asynchronous (signal handlers
can be run any time), this is a synchronous mechanism. In this case, signal
handling is a planned process. It is not the case that signals can arrive at any
point of time and then must be handled immediately. This notion is captured in the field signalfd_wqh. The idea is that the process registers a file descriptor with the OS – we refer to this as the signalfd file. Whenever a signal needs to be sent to the process, the OS writes the details of the signal to the signalfd file. Processes in this case typically wait for signals to arrive and hence need to be woken up; at their leisure, they can then check the contents of the file and process the signals accordingly.
Now, it is possible that multiple processes are waiting for something to be
written to the signalfd file. Hence, there is a need to create a queue of waiting
processes. This wait queue is the signalfd_wqh field.
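A hedged user-space sketch of this mechanism using the Linux-specific signalfd interface (sys/signalfd.h): the process blocks normal asynchronous delivery of the signal and instead reads it synchronously from the descriptor.

#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGUSR1);
        sigprocmask(SIG_BLOCK, &mask, NULL);   /* block normal asynchronous delivery */

        int sfd = signalfd(-1, &mask, 0);      /* descriptor that will carry SIGUSR1 */

        struct signalfd_siginfo si;
        read(sfd, &si, sizeof(si));            /* blocks until a signal arrives */
        printf("got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);
        return 0;
}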
However, the more common method of handling signals is the regular asynchronous mechanism. All that we need to store here is an array of 64 (_NSIG) signal handlers; 64 is the maximum number of signals that Linux on x86 supports. Each signal handler is wrapped using the k_sigaction structure. On most architectures, this simply wraps the sigaction structure, which we shall describe next.
struct sigaction
The important fields of struct sigaction are shown in Listing 4.14. The fields are reasonably self-explanatory. sa_handler is the function pointer that points into the thread's user space memory. flags stores the parameters that the kernel uses to handle the signal, such as whether a separate stack needs to be used or not. Finally, we have the set of masked signals.
struct sigpending
The final data structure that we need to define is the list of pending signals
(struct sigpending). This data structure is reasonably complicated. It uses
some of the tricky features of linked lists, which we have very nicely steered
away from up till now.
struct sigpending {
    struct list_head list;
    sigset_t signal;
};

struct sigqueue {
    struct list_head list;   /* pointing to its current position in the queue of sigqueues */
    kernel_siginfo_t info;   /* signal number, signal source, etc. */
};
Refer to Figure 4.10. The structure sigpending wraps a linked list that
contains all the pending signals. The name of the list is as simple as it can be,
list. The other field of interest is signal, which is simply a bit vector whose ith bit is set if the ith signal is pending for the process. This is why two signals of the same type can never be pending for the same process at the same time.
Each entry of the linked list is of type struct sigqueue. Recall from Appendix C that in Linux, nodes of different types can be part of the same linked list. Hence, in this case, the head of the linked list is a structure of type sigpending, whereas all the entries are of type sigqueue. As non-intuitive as this may seem, this is indeed possible with Linux's linked lists. Since each sigqueue structure is a part of a linked list, it is mandated to have an element of type struct list_head, which points to the previous and next entries, respectively. Each entry encapsulates the signal in the kernel_siginfo_t structure.
This structure contains the following fields: signal number, number of the
error or exceptional condition that led to the signal being raised, source of the
signal and the sending process’s pid (if relevant). This is all the information
that is needed to store the details of a signal that has been raised.
struct ucontext {
    unsigned long uc_flags;
    stack_t uc_stack;              /* user's stack pointer */
    struct sigcontext uc_mcontext; /* snapshot of all the registers and the process's state */
};
struct rt_sigframe keeps all the information required to store the context
of the thread that was signaled. The context per se is stored in the structure
struct ucontext. Along with some signal handling flags, it stores two vital
pieces of information: the pointer to the user thread’s stack and the snapshot
of all the user thread’s registers and its state. The stack pointer can be in
the same region of memory as the user thread’s stack or in a separate memory
region. Recall that it is possible to specify a separate address for storing the
signal handler’s stack.
The next argument is the signal information that contains the details of the
signal: its number, the relevant error code and the details of the source of the
signal.
The last argument is the most interesting. The question is where should the
signal handler return to? It cannot return to the point at which the original
thread stopped executing. This is because its context has not been restored
yet. Hence, we need to return to a special function that needs to do a host of
things such as restoring the user thread’s context. Hence, here is the “million
dollar” idea. Before launching the signal handler, we deliberately tweak the
return address to return to a custom function that can restore the user thread’s
context. Note that on x86 machines, the return address is stored on the stack.
All that we need to do is to change it to point to a specific function, which is the restore_rt function in the glibc standard library.
When the signal handler returns, it will thus start executing the restore_rt function. This function does some bookkeeping and then makes the all-important sigreturn system call, which transfers control back to the kernel. Only the kernel can restore the context of a process; this cannot be done in user space without hardware support. Hence, it is necessary to bring the kernel into the picture. The kernel's system call handler copies the context stored on the user process's stack to the kernel's address space using the copy_from_user function. The context collected from user space is then handed to the same subsystem that restores a context when a process is loaded on a core; it restores the user thread's context exactly where it stopped. The kernel populates all the registers of the user thread, including the PC and the stack pointer, and the thread resumes from exactly the same point at which it was paused to handle the signal.
To summarize, a signal handler is like a small process within a process. It is short-lived: it ceases to exist after the signal handling function finishes its execution, and the original thread then resumes.
Exercises
Ex. 4 — The way that we save the context in the case of interrupts and system
calls is slightly different. Explain the nature of the difference. Why is this the
case?
Ex. 5 — If we want to use the same assembly language routine to store the
context after both an interrupt and a system call, what kind of support is
required (in SW and/or HW)?
Ex. 10 — What is the philosophy behind having sets like blocked and real_blocked in signal-handling structures? Explain with examples.
Ex. 11 — How does the interrupt controller ensure real-time task execution
on an x86 system? It somehow needs to respect real-time process priorities.
Ex. 15 — What are the beneficial features of softirqs and threaded IRQs?
In this chapter, we will discuss two of the most important concepts in operating systems, namely synchronization and scheduling. The first deals with managing resources that are common to a bunch of processes or threads (shared between them). It is possible that there will be competition amongst the threads or processes to acquire the resource; this is also known as a race condition. Such data races can lead to errors. As a result, only one of the processes should access the shared resource at any point of time.
Once all such synchronizing conditions have been worked out, it is the role
of the operating system to ensure that all the computing resources namely the
cores and accelerators are optimally used. There should be no idleness or exces-
sive context switching. Therefore, it is important to design a proper scheduling
algorithm such that tasks can be efficiently mapped to the available compu-
tational resources. We shall see that there are a wide variety of scheduling
algorithms, constraints and possible scheduling goals. Given that there are such
a wide variety of practical use cases, situations and circumstances, there is no
one single universal scheduling algorithm that outperforms all the others. In
fact, we shall see that for different situations, different scheduling algorithms
perform very differently.
5.1 Synchronization
5.1.1 Introduction to Data Races
Consider the case of a multicore CPU. We want to do a very simple operation,
which is just to increment the value of the count variable that is stored in
memory. It is a regular variable and incrementing it should be easy. Listing 5.1
shows that it translates to three assembly-level instructions. We show C-like code without semicolons for the sake of readability. Note that each line corresponds to one line of assembly code (or one machine instruction) in this code snippet. count is a global variable that can be shared across threads. t1 corresponds to a register (private to each thread and core). The first instruction loads the variable count into a register, the second line increments the value in the register, and the third line stores the incremented value back into the memory location of count.
This code is very simple, but when we consider multiple threads, it turns
out to be quite erroneous because we can have several correctness problems.
Consider the scenario shown in Figure 5.1. Note again that we first load
the value into a register, then we increment the contents of the register and
finally save the contents of the register in the memory address corresponding to
the variable count. This makes a total of 3 instructions that are not executed
atomically – they execute at three different instants of time. Here, there is a possibility
of multiple threads trying to execute the same code snippet at the same point
of time and also update count concurrently. This situation is called a data race
(a more precise and detailed definition follows later).
count = 0
Thread 1 Thread 2
t1 = count t2 = count
t1 = t1 + 1 t2 = t2 + 1
count = t1 count = t2
Figure 5.1: Incrementing the count variable in parallel (two threads). The threads run on two different cores. t1 and t2 are thread-specific variables mapped to registers.
Before we proceed towards that and elaborate on how and why a data race
can be a problem, we need to list a couple of assumptions.
(1) The first assumption is that each basic statement in Listing 5.1 corre-
sponds to one line of assembly code, which is assumed to execute atomically.
This means that it appears to execute at a single instant of time.
(2) The second assumption here is that the delay between two instructions
can be indefinitely long (arbitrarily large). This could be because of hardware-
level delays or could be because there is a context switch and then the context
is restored after a long time. We cannot thus assume anything about the timing
of the instructions, especially the timing between consecutive instructions given
that there could be indefinite delays for the aforementioned reasons.
Now given these assumptions, let us look at the example shown in Figure 5.1
and one possible execution in Figure 5.2. Note that a parallel program can have
many possible executions. We are showing one of them, which is particularly
problematic. We see that the two threads read the value of the variable count (both read 0) at nearly the same time, increment their private register copies, and then both write count = 1, whereas the final value should have been 2.
Figure 5.2: An execution that leads to the wrong value of the count variable
Figure 5.4 shows the execution of the code snippet count++ by two threads.
Note the critical sections, the use of the lock and unlock calls. Given that the
critical section is protected with locks, there are no data races here. The final
value is correct: count = 2.
Thread 1                      Thread 2
lock()
t1 = count
t1 = t1 + 1
count = t1
unlock()
                              lock()
                              t2 = count
                              t2 = t2 + 1
                              count = t2
                              unlock()
Figure 5.4: Two threads incrementing count by wrapping the critical section
within a lock-unlock call pair
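A hedged pthread sketch of the same idea: the lock and unlock calls in the figure map naturally to pthread_mutex_lock and pthread_mutex_unlock.

#include <pthread.h>

long count = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg)
{
        pthread_mutex_lock(&count_lock);    /* enter the critical section */
        count++;                            /* the load-add-store sequence is now protected */
        pthread_mutex_unlock(&count_lock);  /* leave the critical section */
        return NULL;
}

If two threads run increment concurrently, the final value of count is always 2, because the critical sections cannot overlap.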
thread finds that the value has changed back to 0 (free), it tries to set it to 1
(test-and-set phase). In this case, it is inevitable that there will be a competition
or a race among the threads to acquire the lock (set the value in A to 1). Regular
reads or writes cannot be used to implement such locks.
It is important to use an atomic synchronizing instruction that almost all
the processors provide as of today. For instance, we can use the test-and-set
instruction that is available on most hardware. This instruction checks the value
of the variable stored in memory and if it is 0, it atomically sets it to 1 (appears
to happen instantaneously). If it is able to do so successfully (0 → 1), it returns
a 1, else it returns 0. This basically means that if two threads are trying to set
the value of a lock variable from 0 to 1, only one of them will be successful. The hardware guarantees this.
The test-and-set instruction returns 1 if it is successful, and it returns 0
if it fails (cannot set 0 → 1). Clearly we can extend the argument and observe
that if there are n threads that all want to convert the value of the lock variable
from 0 to 1, then only one of them will succeed. The thread that was successful
is deemed to have acquired the lock. For the rest of the threads that were
unsuccessful, they need to keep trying (iterating). This process is also known as busy waiting. A lock that involves busy waiting is also called a spin lock.
It is important to note that we are relying on a hardware instruction that
atomically sets the value in a memory location to another value and indicates
whether it was successful in doing so or not. There is a lot of theory around this
and there are also a lot of hardware primitives that play the role of atomic oper-
ations. Many of them fall in the class of read-modify-write (RMW) operations.
They read the value stored at a memory location, sometimes test if it satisfies
a certain property or not, and then they modify the contents of the memory
location accordingly. These RMW operations are typically used in implement-
ing locks. The standard method is to keep checking whether the lock variable
is free or not. The moment the lock is found to be free, threads compete to
acquire the lock using atomic instructions. Atomic instructions guarantee that
only one instruction is successful at a time. Once a thread acquires the lock, it
can proceed to safely access the critical section.
After executing the critical section, unlocking is quite simple. All that needs to be done is to set the value of the lock variable back to 0 (free). However, bear in mind that if one takes a computer architecture course,
one will realize that this is not that simple. This is because all the memory
operations that have been performed in the critical section should be visible to
all the threads running on other cores once the lock has been unlocked. This
normally does not happen as architectures and compilers tend to reorder in-
structions. Also, it is possible that the instructions in the critical section are
visible to other threads before a lock is fully acquired unless additional precau-
tions are taken. This is again an unintended consequence of the reordering that is done by compilers and machines for performance reasons. Such reordering needs to be kept in check.
This is why most atomic instructions either additionally act as fence instruc-
tions or a separate fence instruction is added by the library code to lock/unlock
functions.
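To make this concrete, here is a hedged sketch of a test-and-set style spin lock written with the GCC/Clang __atomic builtins; the acquire/release memory orders play the role of the fences discussed above. The type and function names are illustrative, and real kernels use more refined variants (e.g., test-and-test-and-set, backoff, queued locks).

/* 0 = free, 1 = taken */
typedef struct { int flag; } spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
        /* Atomically write 1 and obtain the old value; we own the lock only
         * if the old value was 0. The acquire order prevents the critical
         * section from being observed before the lock is held. */
        while (__atomic_exchange_n(&l->flag, 1, __ATOMIC_ACQUIRE) != 0)
                ;   /* busy wait (spin) */
}

static inline void spin_unlock(spinlock_t *l)
{
        /* The release order makes all writes of the critical section visible
         * to other threads before the lock is marked free. */
        __atomic_store_n(&l->flag, 0, __ATOMIC_RELEASE);
}

Note that this builtin returns the old value (0 on success), whereas the test-and-set instruction described in the text is defined to return 1 on success; the two conventions are equivalent in effect.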
the second update of the count variable. Therefore, we can say that there is a
happens-before relationship between the first and second updates to the count
variable. Note that this relationship is a property of a given execution. In a
different execution, a different happens-before relationship may be visible. A
happens-before relationship by definition is a transitive relationship.
The moment we do not have such happens-before relationships between ac-
cesses, they are deemed to be concurrent. Note that in our example, such
happens-before relationships are being enforced by the lock/unlock operations
and their inherent fences. Happens-before order: updates in the critical section
→ unlock operation → lock operation → reads/writes in the second critical
section (so on and so forth). Encapsulating critical sections within lock-unlock
pairs creates such happens-before relationships. Otherwise, we have data races.
Such data races are clearly undesirable as we saw in the case of count++.
Hence, concurrent and conflicting accesses to the same shared variable should
not be there. With data races, it is possible that we may have hard-to-detect
bugs in the program. Also, data races have a much deeper significance in terms
of the correctness of the execution of parallel programs. At this point we are not
in the position to appreciate all of this. All that can be said is that data-race-
free programs have a lot of nice and useful properties, which are very important
in ensuring the correctness of parallel programs. Hence, data races should be
avoided for a wide variety of reasons. Refer to the book by your author on
Advanced Computer Architecture [Sarangi, 2023] for a detailed explanation of
data races, and their implications and advantages.
Point 5.1.2
An astute reader may argue that there have to be data races in the code
to acquire the lock itself. However, those happen in a very controlled
manner, and they don’t pose a correctness problem. This part of the
code is heavily verified and is provably correct. The same cannot be said
about data races in regular programs.
Properly-Labeled Programs
Now, to avoid data races, it is important to create properly labeled programs.
In a properly labeled program, the same shared variable should be locked by
the same lock or the same set of locks. This will avoid concurrent accesses to
the same shared variable. For example, the situation shown in Figure 5.6 has
a data race on the variable C because it is not protected by the same lock in both cases. Hence, we may observe a data race because this program is
not properly labeled. This is why it is important that we ensure that the same
variable is protected by the same lock (could also be the same set of multiple
locks).
5.1.4 Deadlocks
Using locks sadly does not come for free; they can lead to a situation known as
deadlocks. A deadlock is defined as a situation where one thread is waiting on
another thread, that thread is waiting on another thread, so on and so forth –
Figure 5.6: A figure showing a situation with two critical sections. The first
is protected by lock X and the second is protected by lock Y . Address C is
common to both the critical sections. There may be a data race on address C.
we have a circular or cyclic wait. This basically means that in a deadlocked sit-
uation, no thread can make any progress. In Figure 5.7, we see such a situation
with locks.
Thread 1 Thread 2
Lock X Lock Y
Lock Y Lock X
It shows that one thread holds lock X, and it tries to acquire lock Y . On the
other hand, the second thread holds lock Y and tries to acquire lock X. There
is a clear deadlock situation here. It is not possible for any thread to make
progress because they are waiting on each other. This is happening because we
are using locks and a thread cannot make any progress unless it acquires the
lock that it is waiting for. A code with locks may thus lead to such kind of
deadlocks that are characterized by circular waits. Let us elaborate.
There are four conditions for a deadlock to happen. Therefore, if a deadlock is to be avoided or prevented, at least one of these conditions needs to be prevented/avoided. The conditions are as follows:
4. Circular wait: As we can see in Figure 5.7, all the threads are waiting
on each other and there is a circular or cyclic wait. A cyclic wait ensures
that no thread can make any progress.
forks to be put on the table. These have sadly been picked up from the table
by their respective neighbors. Clearly a circular wait has been created. Let
us look at the rest of the deadlock conditions, which are non-preemption and
hold-and-wait, respectively. Clearly mutual exclusion will always have to hold
because a fork cannot be shared between neighbors at the same moment of time.
Preemption – forcibly taking away a fork from the neighbor – seems to be
difficult because the neighbor can also do the same. Designing a protocol around
this idea seems to be difficult. Let us try to relax hold-and-wait. A philosopher
may give up after a certain point of time and put the fork that he has acquired
back on the table. Again creating a protocol around this appears to be difficult
because it is very easy to get into a livelock.
Hence, the simplest way of dealing with this situation is to try to avoid the
circular wait condition. In this case, we would like to introduce the notion of
asymmetry, where we can change the rules for just one of the philosophers. Let
us say that the default algorithm is that a philosopher picks the left fork first
and then the right one. We change the rule for one of the philosophers: he
acquires his right fork first and then the left one.
It is possible to show that a circular wait cannot form. Let us number the
philosophers from 1 to n. Assume that the nth philosopher is the one that has
the special privilege of picking up the forks in the reverse order (first right and
then left). In this case, we need to show that a cyclic wait can never form.
Assume that a cyclic wait has formed. It means that a philosopher (other
than the last one) has picked up the left fork and is waiting for the right fork to be
put on the table. This is the case for philosophers 1 to n − 1. Consider what is happening between philosophers n − 1 and n. The (n − 1)th philosopher
picks its left fork and waits for the right one. The fact that it is waiting basically
means that the nth philosopher has picked it up. This is his left fork. It means
that he has also picked up his right fork because he picks up the forks in the
reverse order. He first picks up his right fork and then his left one. This basically
means that the nth philosopher has acquired both the forks and is thus eating
his food. He is not waiting. We therefore do not have a deadlock situation over
here.
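A hedged pthread sketch of this asymmetric rule follows; the forks are modeled as mutexes, and the helper names (init_forks, pick_forks, put_forks) are illustrative.

#include <pthread.h>

#define N 5
static pthread_mutex_t fork_mtx[N];   /* one mutex per fork */

void init_forks(void)
{
        for (int i = 0; i < N; i++)
                pthread_mutex_init(&fork_mtx[i], NULL);
}

/* Philosopher i normally picks up the left fork (i) first and then the right
 * one ((i + 1) % N). The last philosopher reverses the order, which breaks
 * the circular wait. */
void pick_forks(int i)
{
        int left = i, right = (i + 1) % N;
        if (i == N - 1) {
                pthread_mutex_lock(&fork_mtx[right]);
                pthread_mutex_lock(&fork_mtx[left]);
        } else {
                pthread_mutex_lock(&fork_mtx[left]);
                pthread_mutex_lock(&fork_mtx[right]);
        }
}

void put_forks(int i)
{
        pthread_mutex_unlock(&fork_mtx[i]);
        pthread_mutex_unlock(&fork_mtx[(i + 1) % N]);
}

Equivalently, one can impose a global ordering on the forks and always acquire the lower-numbered fork first; breaking the symmetry in any such way rules out a cyclic wait.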
allow processes to deadlock and wait forever. However, the converse is not true.
Deadlock freedom does not imply starvation freedom because starvation freedom is a much stronger guarantee.
Another related situation is a livelock, where processes continuously take steps
and execute statements but do not make any tangible progress. This means
that even if processes continually change their state, they do not reach the final
end state – they continually cycle between interim states. Note that they are not
in a deadlock in the sense that they can still take some steps and keep changing
their state. However, the states do not converge to the final state, which would
indicate a desirable outcome.
For example, consider two people trying to cross each other in a narrow
corridor. A person can either be on the left side or on the right side of the
corridor. So it is possible that both are on the left side, and they see each other
face to face. Hence, they cannot cross each other. Then they decide to either
stay there or move to the right. It is possible that both of them move to the
right side at the same point of time, and they are again face to face. Again
they cannot cross each other. This process can continue indefinitely. In this
case, the two people can keep moving from left to right and back. However,
they are not making any progress because they are not able to cross each other.
This situation is a livelock, where threads move in terms of changing states, but
nothing useful gets ultimately done.
Listing 5.2: Code to create two pthreads and collect their return values
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
In the main function, two pthreads are created. The arguments to the
pthread create function are a pointer to the pthread structure, a pointer to
a pthread attribute structure that shall control its behavior (NULL in this ex-
ample), the function pointer that needs to be executed and a pointer to its sole
argument. If the function takes multiple arguments, then we need to put all of
them in a structure and pass a pointer to that structure.
The return value of the func function is quite interesting. It is a void *,
which is a generic pointer. In our example, it is a pointer to an integer that is
equal to 2 times the thread id. When a pthread function (like func) returns, akin to a signal handler, it returns to the address of a special routine; this routine does the job of cleaning up the state and tearing down the thread. Once the
thread finishes, the parent thread that spawned it can wait for it to finish using
the pthread join call.
This is similar to the wait call invoked by a parent process, when it waits
for a child to terminate in the regular fork-exec model. In the case of a regular
process, we collect the exit code of the child process. However, in the case
of pthreads, the pthread join call takes two arguments: the pthread, and the
address of a pointer variable (&result). The value filled in the address is exactly
the pointer that the pthread function returns. We can proceed to dereference
the pointer and extract the value that the function wanted to return.
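The body of the listing is not reproduced above; the following hedged sketch matches the description (func returns a pointer to an integer equal to twice the thread id).

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *func(void *arg)
{
        int id = *(int *)arg;
        int *ret = malloc(sizeof(int));
        *ret = 2 * id;                 /* value to hand back to the parent */
        return ret;                    /* becomes the result of pthread_join */
}

int main(void)
{
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        void *result;

        pthread_create(&t1, NULL, func, &id1);
        pthread_create(&t2, NULL, func, &id2);

        pthread_join(t1, &result);
        printf("thread 1 returned %d\n", *(int *)result);
        free(result);

        pthread_join(t2, &result);
        printf("thread 2 returned %d\n", *(int *)result);
        free(result);
        return 0;
}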
Given that we have now created a mechanism to create pthread functions
that can be made to run in parallel, let us implement a few concurrent algo-
rithms. Let us try to increment a count.
Let us now use the CAS method to increment count (code shown in List-
ing 5.6).
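Listing 5.6 itself is not shown here; a hedged sketch of a CAS-based increment using the GCC/Clang __atomic builtins looks as follows.

long count = 0;

void cas_increment(void)
{
        long old, desired;
        do {
                old = __atomic_load_n(&count, __ATOMIC_RELAXED);
                desired = old + 1;
                /* Try to replace 'old' with 'desired'; the exchange fails if
                 * count was changed by another thread in the meantime, in
                 * which case we simply retry with the fresh value. */
        } while (!__atomic_compare_exchange_n(&count, &old, desired,
                                              0 /* strong */,
                                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
}

There is no lock here: a thread that loses the race retries, but at least one competing thread always succeeds, which is precisely the lock-freedom property discussed later in this section.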
Hypothetical sequential order: {1} {3,1} {3} {} {2} {} {4} {5,4} {5}
Figure 5.9: A parallel execution and its equivalent sequential execution. Every
event has a distinct start time and end time. In this figure, we assume that we
know the completion time. We arrange all the events in ascending order of their
completion times in a hypothetical sequential order at the bottom. Each point
in the sequential order shows the contents of the queue after the respective
operation has completed. Note that the terminology enq: 3 means that we
enqueue 3, and similarly deq: 4 means that we dequeue 4.
The key question that needs to be answered is: where is this point of completion vis-à-vis the start and end points? If it always lies between them, then we can always claim that before a method call ends, it is deemed to have fully completed – its changes to the global state are visible to all the threads. This is a very strong correctness criterion for a parallel execution. We are, of course, assuming that the equivalent sequential execution is legal. This correctness criterion is known as linearizability.
Linearizability
Linearizability is the de facto criterion used to prove the correctness of con-
current data structures that are of a non-blocking nature. If all the executions
corresponding to a concurrent algorithm are linearizable, then the algorithm
itself is said to satisfy linearizability. In fact, the execution shown in Figure 5.9
is linearizable.
This notion of linearizability is summarized in Definition 5.1.1. Note that
the term “physical time” in the definition refers to real time that we read off a
wall clock. Later on, while discussing progress guarantees, we will see that the
notion of physical time has limited utility. We would alternatively prefer to use
the notion of a logical time instead. Nevertheless, let us stick to physical time
for the time being.
Definition 5.1.1 Linearizability
Now, let us address the last conundrum. Even if the completion times are not
known, which is often the case, as long as we can show that distinct completion
points appear to exist for each method (between its start and end), the execution
is deemed to be linearizable. Mere existence of completion points is what needs
to be shown. Whether the method actually completes at that point or not is
unimportant. This is why we keep using the word “appears” throughout the
definitions.
Figure 5.10: A write issued by the CPU first enters the write buffer and reaches the L1 cache only after a delay
Now consider the other case when the point of completion may be after the
end of a method. For obvious reasons, it cannot be before the start point of
a method. An example of such an execution, which is clearly atomic but not
linearizable, is a simple write operation in multicore processors (see Figure 5.10).
The write method returns when the processor has completed the write operation
and has written it to its write buffer. This is also when the write operation is removed from the pipeline. However, that does not mean that the write operation has
completed. It completes when it is visible to all the threads, which can happen
much later – when the write operation leaves the write buffer and is written to
a shared cache. This is thus a case when the completion time is beyond the end
time of the method. The word “beyond” is being used in the sense that it is
“after” the end time in terms of the real physical time.
We now enter a world of possibilities. Let us once again consider the simple
read and write operations that are issued by cores in a multicore system.
Sequential Consistency
Along with atomicity, SC mandates that in the equivalent sequential order of
events, methods invoked by the same thread appear in program order. The
program order is the order of instructions in the program that will be perceived
by a single-cycle processor, which will pick an instruction, execute it completely,
proceed to the next instruction, so on and so forth. SC is basically atomicity +
intra-thread program order.
Consider the following execution. Assume that x and y are initialized to 0.
They are global variables. t1 and t2 are local variables. They are stored in
registers (not shared across threads).
Thread 1 Thread 2
x=1 y=1
t1 = y t2 = x
Note that if we run this code many times on a multicore machine, we shall
see different outcomes. It is possible that Thread 1 executes first and completes
both of its instructions and then Thread 2 is scheduled on another core, or vice
versa, or their execution is interleaved. Regardless of the scheduling policy, we
will never observe the outcome t1 = t2 = 0 if the memory model is SC or linearizable. The reason is straightforward: all SC and linearizable executions respect the per-thread order of instructions. In this case, the first instruction
lock. Note that an acquire is weaker than a full fence, which also specifies the
ordering of operations before the fence (in program order). Similarly, the re-
lease operation corresponds to lock release. As per RC, the release operation
can complete only if all the operations before it have fully completed. Again,
this also makes sense, because when we release the lock, we want the rest of the
threads to see all the changes that have been made in the critical section.
Obstruction Freedom
It is called obstruction freedom, which basically says that in an n-thread system,
if we set any set of (n − 1) threads to sleep, then the only thread that is active
will be able to complete its execution in a bounded number of internal steps.
This means that we cannot use locks because if the thread that has acquired
the lock gets swapped out or goes to sleep, no other thread can complete the
operation.
Wait Freedom
Now, let us look at another progress guarantee, which is at the other end of
the spectrum. It is known as wait freedom. In this case, we avoid all forms
of starvation. Every thread completes the operation within a bounded number
of internal steps. So in this case, starvation is not possible. The code shown
in Listing 5.4 is an example of a wait-free algorithm because regardless of the
number of threads and the amount of contention, it completes within a bounded
number of internal steps. However, the code shown in Listing 5.6 is not a wait-
free algorithm. This is because there is no guarantee that the compare and swap
will be successful in a bounded number of attempts. Thus, we cannot guarantee
wait freedom. However, this code is obstruction free because if any set of (n − 1)
threads go to sleep, then the only thread that is active will succeed in the CAS
operation and ultimately complete the overall operation in a bounded number
of steps.
Lock Freedom
Given that we have now defined what an obstruction-free and a wait-free al-
gorithm is, we can now tackle the definition of lock freedom, which is slightly
more complicated. In this case, let us count the cumulative number of steps
that all the n threads in the system execute. We have already mentioned that
there is no correlation between the time it takes to complete an internal step
across the n threads. That notwithstanding, we can still take a system and count
the cumulative number of internal steps taken by all the threads together. Lock
freedom basically says that if this cumulative number is above a certain thresh-
old or a bound, then we can say for sure that at least one of the operations has
completed successfully. Note that in this case, we are saying that at least one
thread will make progress and there can be no deadlocks.
All the threads also cannot get stuck in a livelock. However, there can be
starvation because we are taking a system-wide view and not a thread-specific
view here. As long as one thread makes progress by completing operations, we
do not care about the rest of the threads. This was not the case in wait-free
algorithms. The code shown in Listing 5.6 is lock free, but it is not wait free.
The reason is that the compare and exchange has to be successful for at least one
of the threads and that thread will successfully move on to complete the entire
count increment operation. The rest of the threads will fail in that iteration.
However, that is not of a great concern here because at least one thread achieves
success.
It is important to note that every program that is wait free is also lock free.
This follows from the definition of lock freedom and wait freedom, respectively.
If we are saying that in less than k internal steps, every thread is guaranteed
to complete its operation, then in nk system-wide steps, at least one thread is
guaranteed to complete its operation. By the pigeonhole principle, at least one
thread must have taken k steps and completed its operation. Thus wait freedom
implies lock freedom.
Similarly, every program that is lock free is also obstruction free, which again follows very easily from the definitions. This is the case because we are saying that if the system as a whole takes a certain number of steps (let's say k′), then at least one thread successfully completes its operation. Now, if n − 1 threads in the system are quiescent, then only one thread is taking steps, and within k′ steps it has to complete its operation. Hence, the algorithm is obstruction free.
Wait free ⊂ Lock free ⊂ Obstruction free
Figure 5.11: Venn diagram showing the relationship between different progress
guarantees
However, the converse is not true in the sense that it is possible to find a
lock-free algorithm that is not wait free and an obstruction free algorithm that is
not lock free. This can be visualized in a Venn diagram as shown in Figure 5.11.
All of these algorithms cannot use locks. They are thus broadly known as non-
blocking algorithms even though they provide very different kinds of progress
guarantees.
An astute reader may ask why we do not use wait-free algorithms every time, because after all there are theoretical results showing that any algorithm can be converted into a provably correct parallel wait-free variant. This is true; however, wait-free algorithms tend to be very slow and are also very difficult to write and verify. Hence, in most practical cases, a lock-free implementation is much faster and is far easier to code and verify. In general, obstruction freedom is too weak as a progress guarantee; thus, it is hard to find a practical system that uses an obstruction-free algorithm. In most practical systems, lock-free algorithms are used, which optimally trade off performance, correctness and complexity.
There is a fine point here. Many authors replace the bounded property in the definitions with finite. The latter property is more theoretical and often does not gel well with practical implementations. Hence, we have decided not to use it in this book. We will continue with bounded steps, where the bound can be known in advance.
5.1.8 Semaphores
Let us now consider another synchronization primitive called a semaphore. We
can think of it as a generalization of a lock. It is a more flexible variant of a
lock, which admits more than two states. Recall that a lock has just two states:
locked and unlocked.
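As a hedged user-space illustration, POSIX semaphores expose this generalization directly: the semaphore is initialized to the number of available units of a resource, sem_wait decrements it (blocking when it reaches zero) and sem_post increments it. The function names around the calls are illustrative.

#include <semaphore.h>

sem_t slots;

void init_slots(void)    { sem_init(&slots, 0, 4); }  /* 4 units of the resource available */
void acquire_slot(void)  { sem_wait(&slots); }        /* take one unit; block if none is left */
void release_slot(void)  { sem_post(&slots); }        /* return one unit */

A semaphore initialized to 1 behaves like a lock (a binary semaphore); larger initial values allow up to that many threads to hold the resource concurrently.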
pthread_cond_t cond;
pthread_cond_init(&cond, NULL);
Clearly, a reader and writer cannot operate concurrently at the same point
of time without synchronization because of the possibility of data races.
We thus envision two smaller locks as a part of the locking mechanism: a
read lock and a write lock. The read lock allows multiple readers to operate in
parallel on a concurrent object, which means that we can invoke a read method
concurrently. We need a write lock that does not allow any other readers or
writers to work on the queue concurrently. It just allows one writer to change
the state of the queue.
void get_write_lock () {
    LOCK ( __rwlock ) ;
}
void release_write_lock () {
    UNLOCK ( __rwlock ) ;
}

void get_read_lock () {
    LOCK ( __rdlock ) ;
    if ( readers == 0) LOCK ( __rwlock ) ;
    readers ++;
    UNLOCK ( __rdlock ) ;
}
void release_read_lock () {
    LOCK ( __rdlock ) ;
    readers --;
    if ( readers == 0)
        UNLOCK ( __rwlock ) ;
    UNLOCK ( __rdlock ) ;
}
The code for the locks is shown in Listing 5.10. We are assuming two macros
LOCK and UNLOCK. They take a lock (mutex) as their argument, and invoke the
methods lock and unlock, respectively. We use two locks: __rwlock (for both
readers and writers) and __rdlock (only for readers). The __ prefix signifies
that these are internal locks within the reader-writer lock. These locks are
meant for implementing the logic of the reader-writer lock, which provides two
key functionalities: get or release a read lock (allow a process to only read),
and get or release a write lock (allow a process to read/write). Even though the
names appear similar, the internal locks are very different from the functionality
that the composite reader-writer lock provides, which is providing a read lock
(multiple readers) and a write lock (single writer only).
Let us first look at the code of a writer. There are two methods that it
can invoke: get_write_lock and release_write_lock. In this case, we need a
global lock that stops both reads and writes from proceeding. This is
why in the function get_write_lock, we wait on the lock __rwlock.
The read lock, on the other hand, is slightly more complicated. Refer to
the function get_read_lock in Listing 5.10. We use another mutex lock called
__rdlock. A reader waits to acquire it. The idea is to maintain a count of the
number of readers. Since there are concurrent updates to the readers variable,
it needs to be protected by the __rdlock mutex. After acquiring __rdlock, it
is possible that the acquiring process may find that a writer is active. We
explicitly check for this by checking whether the number of readers, readers,
is equal to 0. If it is equal to 0, then it means that no other readers are
active – a writer could be active. Otherwise, other readers are active, and a
writer cannot be active.
If readers == 0, we need to acquire __rwlock to stop writers. The rest of the
method is reasonably straightforward. We increment the number of readers and
finally release __rdlock such that other readers can proceed.
Releasing the read lock is also simple. We subtract 1 from the number of
readers after acquiring __rdlock. Now, if the number of readers becomes equal
to 0, then there is no reason to hold the global __rwlock. It needs to be released
such that writers can potentially get a chance to complete their operation.
A discerning reader will clearly see at this point that if readers are
active, then new readers can keep coming in, and a waiting write operation
may never get a chance. This means that there is a possibility of starvation.
Because readers may never reach 0, __rwlock may never be released by the
reader holding it. The locks themselves could be fair, but overall we cannot
guarantee fairness for writes. Hence, this version of the reader-writer lock's
design needs improvement. Starvation freedom is needed, especially for write
operations. Various solutions to this problem are proposed in the reference
[Herlihy and Shavit, 2012].
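One common remedy, sketched below under our own naming, is to add a third internal lock (called __turnstile here) that briefly serializes incoming requests: a waiting writer holds it, so newly arriving readers queue up behind the writer instead of overtaking it. This is only a sketch of one possible fix.

/* Hypothetical writer-starvation-free variant. __turnstile is a third
   internal mutex; LOCK/UNLOCK are the same macros as before. */
void get_write_lock () {
    LOCK ( __turnstile ) ;   /* block new readers while we wait */
    LOCK ( __rwlock ) ;
}
void release_write_lock () {
    UNLOCK ( __rwlock ) ;
    UNLOCK ( __turnstile ) ;
}
void get_read_lock () {
    LOCK ( __turnstile ) ;   /* wait here if a writer is waiting or active */
    UNLOCK ( __turnstile ) ;
    LOCK ( __rdlock ) ;
    if ( readers == 0) LOCK ( __rwlock ) ;
    readers ++;
    UNLOCK ( __rdlock ) ;
}
/* release_read_lock remains the same as before */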
the threads to finish so that it can collect all the partial sums and add them to
produce the final result (reduce phase). This is a rendezvous point insofar as all
the threads are concerned because all of them need to reach this point before
they can proceed to do other work. Such a point arises very commonly in a lot
of scientific kernels that involve linear algebra.
Hence, it is very important to optimize such operations, which are known
as barriers. Note that this barrier is different from a memory barrier (discussed
earlier), which is a fence operation. They just happen to share the same name
(unfortunately so). We can conceptually think of a barrier as a point that
stops threads from progressing unless all the threads that are a part of the
thread group associated with the barrier have reached it (see Figure 5.12). Almost
all programming languages, especially parallel programming languages, provide
support for barriers. In fact, supercomputers have special dedicated hardware
for barrier operations. They can be realized very quickly, often in less than a
few milliseconds.
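For example, POSIX threads expose a barrier primitive directly. The following minimal sketch (the thread count and the printed messages are illustrative) makes every thread wait at pthread_barrier_wait until all of them have arrived.

#include <pthread.h>
#include <stdio.h>
#define NTHREADS 4

pthread_barrier_t bar;

void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld: computed its partial sum\n", id);
    pthread_barrier_wait(&bar);        /* rendezvous point */
    printf("thread %ld: everyone has arrived, moving on\n", id);
    return NULL;
}

int main() {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}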
There is a more flexible version of a barrier known as a phaser (see Fig-
ure 5.13). It is somewhat uncommon, but many languages like Java define them
and in many cases they prove to be very useful. In this case, we define two
points in the code: Point 1 and Point 2. The rule is that no thread can cross
Point 2 unless all the threads have arrived at Point 1. Point 1 is a point in
the program, which in a certain sense precedes Point 2 or is before Point 2 in
program order. Often when we are pipelining computations, there is a need for
using phasers. We want some amount of work to be completed before some new
work can be assigned to all the threads. Essentially, we want all the threads to
complete the phase prior to Point 1, and enter the phase between Points 1 and
2, before a thread is allowed to enter the phase that succeeds Point 2.
5.2 Queues
Let us now see how to use all the synchronization primitives introduced in
Section 5.1.
One of the most important data structures in a complex software system like
an OS is a queue. All practical queues have a bounded size. Hence, we shall not
differentiate between a queue and a queue with a maximum or bounded size.
Typically, queues are used to communicate messages between different subsystems.
Figure 5.14: A bounded queue shared between producers and consumers
Using this restriction, it turns out that we can easily create a wait-free queue.
There is no need to use any locks – operations complete within bounded time.
Listing 5.11: A simple wait-free queue with one enqueuer and one dequeuer
# define BUFSIZE 10
# define INC ( x ) (( x +1) % BUFSIZE )
# define NUM 25
void nap () {
struct timespec rem ;
int ms = rand () % 100;
struct timespec req = {0 , ms * 1000 * 1000};
nanosleep (& req , & rem ) ;
}
return 0; /* success */
}
int deq () {
int cur_head = atomic_load (& head ) ;
int cur_tail = atomic_load (& tail ) ;
int new_head = INC ( cur_head ) ;
The main function creates two threads. The odd-numbered thread enqueues
by calling enqfunc, and the even-numbered thread dequeues by calling deqfunc.
These functions invoke the enq and deq functions, respectively, NUM times. Be-
tween iterations, the threads take a nap for a random duration.
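For concreteness, a minimal sketch of such a single-enqueuer/single-dequeuer queue is shown below. The variable names follow Listing 5.11, but the details may differ from the original implementation.

#include <stdatomic.h>
#define BUFSIZE 10
#define INC(x) (((x) + 1) % BUFSIZE)

int queue[BUFSIZE];
atomic_int head = 0, tail = 0;   /* head: next element to remove, tail: next free slot */

/* called only by the single enqueuer */
int enq(int val) {
    int cur_tail = atomic_load(&tail);
    int cur_head = atomic_load(&head);
    if (INC(cur_tail) == cur_head)
        return -1;                     /* queue is full */
    queue[cur_tail] = val;
    atomic_store(&tail, INC(cur_tail));
    return 0;                          /* success */
}

/* called only by the single dequeuer */
int deq(void) {
    int cur_head = atomic_load(&head);
    int cur_tail = atomic_load(&tail);
    if (cur_head == cur_tail)
        return -1;                     /* queue is empty */
    int val = queue[cur_head];
    atomic_store(&head, INC(cur_head));
    return val;
}

Note that neither function contains a loop, which is what makes the operations complete in a bounded number of steps.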
The exact proof of wait freedom can be found in textbooks on this topic
such as the book by Herlihy and Shavit [Herlihy and Shavit, 2012]. Given that
there are no loops, we don’t have a possibility of looping endlessly. Hence, the
enqueue and dequeue operations will complete in bounded time. The proof of
linearizability and correctness needs more understanding and thus is beyond the
scope of this book.
Note the use of atomics. They are a staple of modern programming languages
such as C11, C++20 and other recent languages. Along with atomic load and store
operations, the library provides many more functions such as atomic_fetch_add,
atomic_flag_test_and_set and atomic_compare_exchange_strong. Depending
upon the architecture and the function arguments, their implementations
come with different memory ordering guarantees (they embed different kinds of fences).
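As a small, self-contained illustration (our own example, not taken from any kernel code), the snippet below uses a few of these functions; the _explicit variants let the programmer pick the memory-ordering guarantee explicitly.

#include <stdatomic.h>
#include <stdio.h>

atomic_int counter = 0;
atomic_int flag = 0;

int try_claim(atomic_int *f) {
    int expected = 0;
    /* set *f to 1 only if it is still 0; acquire ordering on success */
    return atomic_compare_exchange_strong_explicit(
        f, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

int main() {
    atomic_fetch_add(&counter, 1);     /* atomic increment, returns the old value */
    printf("claimed: %d, counter: %d\n",
           try_claim(&flag), atomic_load(&counter));
    return 0;
}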
int deq () {
int val ;
do {
LOCK ( qlock ) ;
if ( tail == head ) val = -1;
else {
val = queue [ head ];
head = INC ( head ) ;
}
UNLOCK ( qlock ) ;
} while ( val == -1) ;
return val ;
}
int main () {
...
pthread_mutex_init (& qlock , NULL ) ;
...
pthread_mutex_destroy (& qlock ) ;
}
int main () {
sem_init (& qlock , 0 , 1) ;
...
sem_destroy (& qlock ) ;
}
Listing 5.14: A queue with semaphores that does not use busy waiting
# define WAIT ( x ) ( sem_wait (& x ) )
# define POST ( x ) ( sem_post (& x ) )
sem_t qlock , empty , full ;

int enq ( int val ) {
    WAIT ( empty ) ;
    WAIT ( qlock ) ;
    queue [ tail ] = val ; /* critical section: add the entry */
    tail = INC ( tail ) ;
    POST ( qlock ) ;
    POST ( full ) ;
    return 0; /* success */
}

int deq () {
    int val ;
    WAIT ( full ) ;
    WAIT ( qlock ) ;
    val = queue [ head ]; /* critical section: remove the entry */
    head = INC ( head ) ;
    POST ( qlock ) ;
    POST ( empty ) ;
    return val ;
}
int main () {
sem_init (& qlock , 0 , 1) ;
sem_init (& empty , 0 , BUFSIZE ) ;
sem_init (& full , 0 , 0) ;
...
sem_destroy (& qlock ) ;
sem_destroy (& empty ) ;
sem_destroy (& full ) ;
}
We use three semaphores here. We still use qlock, which is needed to pro-
tect the shared variables. Additionally, we use the semaphore empty that is
initialized to BUFSIZE (maximum size of the queue) and the full semaphore
that is initialized to 0. These will be used for waking up threads that are wait-
ing. We define the WAIT and POST macros that wrap sem wait and sem post,
respectively.
Consider the enq function. We first wait on the empty semaphore. There
need to be free entries available. Initially, we have BUFSIZE free entries. Every
time a thread waits on the semaphore, it decrements the number of free entries
by 1 until the count reaches 0. After that the thread waits. Then we enter
the critical section that is protected by the binary semaphore qlock. There is
no need to perform any check on whether the queue is full or not. We know
that it is not full because the thread successfully acquired the empty semaphore.
This means that at least one free entry is available in the array. After releasing
qlock, we signal the full semaphore. This indicates that an entry has been
added to the queue.
Let us now look at the deq function. It follows the reverse logic. We start
out by waiting on the full semaphore. There needs to be at least one entry
in the queue. Once this semaphore has been acquired, we are sure that there
is at least one entry in the queue, and it will remain there until it is dequeued
(a property of the semaphore). The critical section again need not have any
checks regarding whether the queue is empty or not. It is protected by the
qlock binary semaphore. Finally, we complete the function by signaling the
empty semaphore. The reason for this is that we are removing an entry from
the queue, or equivalently creating one additional free entry. Waiting enqueuers
will get signaled.
Note that there is no busy waiting. Threads either immediately acquire the
semaphore if the count is non-zero or are swapped out. They are put in a wait
queue inside the kernel. They thus do not monopolize CPU resources and more
useful work is done. We are also utilizing the natural strength of semaphores.
int peak () {
/* This is a read function */
get_read_lock () ;
int val = ( head == tail ) ? -1 : queue [ head ];
release_read_lock () ;
return val ;
}
int enq ( int val ) {
    WAIT ( empty ) ;
    get_write_lock () ;
    queue [ tail ] = val ; /* critical section: add the entry */
    tail = INC ( tail ) ;
    release_write_lock () ;
    POST ( full ) ;
    return 0; /* success */
}

int deq () {
    int val ;
    WAIT ( full ) ;
    get_write_lock () ;
    val = queue [ head ]; /* critical section: remove the entry */
    head = INC ( head ) ;
    release_write_lock () ;
    POST ( empty ) ;
    return val ;
}
The code of the enq and deq functions remains more or less the same. We
wait on and signal the same set of semaphores: empty and full. The only
difference is that we do not acquire a generic lock; we acquire the write lock
using the get_write_lock function. We simply use a different set of locks for
the peak function and the enq/deq functions. This allows multiple readers to
work in parallel.
large number of data structures, writing correct and efficient lock-free code is
very difficult, and writing wait-free code is even more difficult. Hence, a large
part of the kernel still uses regular spinlocks; however, they come with a twist.
Along with being regular spinlocks that rely on busy waiting, there are a
few additional restrictions. Unlike regular mutexes that are used in user space,
the thread holding the lock is not allowed to go to sleep or get swapped out.
Preemption is therefore disabled while the lock is held, and variants of the
kernel spinlock additionally disable interrupts in the critical section, which
further implies that such locks can also be used
in the interrupt context. A thread holding such a lock will complete in a finite
amount of time unless it is a part of a deadlock (discussed later). On a multicore
machine, it is possible that a thread may wait for the lock to be released by
a thread running on another core. Given that the lock holder cannot block or
sleep, this mechanism is definitely lock free. We are assuming that the lock
holder will complete the critical section in a finite amount of time. This will
indeed be the case given our restrictions on blocking interrupts and disallowing
preemption.
If we were to allow context switching after a spinlock has been acquired, then
we may have a deadlock situation. The new thread may have a higher priority.
To make matters worse, it may try to acquire the lock. Given that we shall have
busy waiting, it will continue to loop and wait for the lock to get freed. But the
lock may never get freed because the thread that is holding the lock may never
get a chance to run. The reason it may not get a chance to run is because it has
a lower priority than the thread that is waiting on the lock. Hence, kernel-level
spinlocks need these restrictions. It effectively locks the CPU. The lock-holding
thread does not migrate, nor does it allow any other thread to run until it has
finished executing the critical section and released the spinlock.
# define preempt_enable () \
do { \
    barrier () ; \
    if ( unlikely ( preempt_count_dec_and_test () ) ) \
        __preempt_schedule () ; \
} while (0)
The core idea is a preemption count variable. If the count is non-zero, then
it means that preemption is not allowed, whereas if the count is 0, it means
that preemption is allowed. If we want to disable preemption, all that we have
to do is increment the count and also insert a fence operation, which is also
known as a memory barrier. The reason for the barrier is to ensure that the code
in the critical section is not reordered and brought before the lock acquire. Note
that this is not the same kind of barrier that we discussed in the section on
barriers and phasers (Section 5.1.11); they just happen to share the same name.
Those are thread synchronization operations, whereas the memory barrier is akin
to a fence, which disables memory reordering. The preemption count is stored in
a per-CPU region of memory (accessible via a segment register). Accessing it
is a very fast operation and requires very few instructions.
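For reference, the kernel's matching macro that disables preemption is along the following lines (shown here in simplified form): it increments the preemption count and then inserts a barrier.

# define preempt_disable () \
do { \
    preempt_count_inc () ; /* increment the per-CPU preemption count */ \
    barrier () ;           /* keep the critical section from moving above this point */ \
} while (0)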
The code for enabling preemption is shown in Listing 5.16. In this case, we
do more or less the reverse. We have a fence operation to ensure that all the
pending memory operations (executed in the critical section) completely finish
and are visible to all the threads. After that, we decrement the count using an
atomic operation. If the count reaches zero, it means that now preemption is
allowed, so we call the schedule function. It finds a process to run on the core.
An astute reader will make out that this is like a semaphore, where if preemption
is disabled n times, it needs to be enabled n times for the task running on the
core to become preemptible.
The code for a spinlock is shown in Listing 5.17. We see that the spinlock
structure encapsulates an arch_spinlock_t lock and a dependency map (struct
lockdep_map). The raw_lock member is the actual spinlock. The dependency
map is used to check for deadlocks (we will discuss that later).
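A simplified sketch of the structure being described is as follows (debug-only fields guarded by other configuration options are omitted).

typedef struct raw_spinlock {
    arch_spinlock_t raw_lock;      /* the actual architecture-level spinlock */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map;    /* used by lockdep to detect deadlocks */
#endif
} raw_spinlock_t;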
Listing 5.18: Inner workings of a spinlock
source: include/asm-generic/spinlock.h
void arch_spin_lock ( arch_spinlock_t * lock ) {
    u32 val = atomic_fetch_add (1 << 16 , lock ) ;
    u16 ticket = val >> 16; /* upper 16 bits of lock */
    if ( ticket == ( u16 ) val ) /* ticket id == ticket next in line */
        return ;
    atomic_cond_read_acquire ( lock , ticket == ( u16 ) VAL ) ;
    smp_mb () ; /* barrier instruction */
}
Let us understand the design of the spinlock. Its code is shown in List-
ing 5.18. It is a classic ticket lock that has two components: a ticket, which
acts like a coupon, and the id of the next ticket (next). Every time that a thread
tries to acquire a lock, it gets a new ticket. It is deemed to have acquired the
lock when ticket == next.
Consider a typical bank where we go to meet a teller. We first get a coupon,
which in this case is the ticket. Then we wait for our coupon number to be
displayed. Once that happens, we can go to the counter at which a teller is
waiting for us. The idea here is quite similar. If you think about it, you will
conclude that this lock guarantees fairness. Starvation is not possible. The
way that this lock is designed in practice is quite interesting. Instead of using
multiple fields, a single 32-bit unsigned integer is used to store both the ticket
and the next field. We divide the 32-bit unsigned integer into two smaller
unsigned integers that are 16 bits wide. The upper 16 bits store the ticket id.
The lower 16 bits store the value of the next field.
When a thread arrives, it tries to get a ticket. This is achieved by adding
2^16 (1 << 16) to the lock variable. This basically increments the ticket stored in
the upper 16 bits by 1. The atomic fetch and add instruction is used to achieve
this. This instruction has a built-in memory barrier as well (more about this
later). Now, the original ticket can be extracted quite easily by right shifting
the value returned by the fetch and add instruction by 16 positions.
The next task is to extract the lower 16 bits (next field). This is the number
of the ticket that is the holder of the lock, which basically means that if the
current ticket is equal to the lower 16 bits, then we can go ahead and execute
the critical section. This is easy to do using a simple typecast operation. Here,
the type u16 refers to a 16-bit unsigned integer. Simply typecasting val to the
u16 type retrieves the lower 16 bits as an unsigned integer. This is all that
we need to do. Then, we need to compare this value with the thread’s ticket,
which is also a 16-bit unsigned integer. If both are equal, then the spinlock has
effectively been acquired and the method can return.
Now, assume that they are not equal. Then there is a need to wait – there is a
need to busy wait. This is where we call the macro atomic cond read acquire,
which requires two arguments: the lock value and the condition that needs to
be true. This condition checks whether the obtained ticket is equal to the next
field in the lock variable. It ends up calling the macro smp cond load relaxed,
whose code is shown next.
Listing 5.19: The code for the busy-wait loop
source: include/asm-generic/barrier.h
# define smp_cond_load_relaxed ( ptr , cond_expr ) ({ \
    typeof ( ptr ) __PTR = ( ptr ) ; \
    __unqual_scalar_typeof (* ptr ) VAL ; \
    for (;;) { \
        VAL = READ_ONCE (* __PTR ) ; \
        if ( cond_expr ) \
            break ; \
        cpu_relax () ; /* insert a delay */ \
    } \
    ( typeof (* ptr ) ) VAL ; \
})
The kernel code for the macro is shown in Listing 5.19. In this case, the
inputs are a pointer to the lock variable and an expression that needs to evaluate
to true. Then we have an infinite loop where we dereference the pointer and
fetch the current value of the lock. Next, we evaluate the conditional expression
(ticket == (u16)VAL). If the conditional expression evaluates to true, then it
means that the lock has been implicitly acquired. We can then break from the
infinite loop and resume the rest of the execution. Note that we cannot return
from a macro because a macro is just a piece of code that is copy-pasted by the
preprocessor with appropriate argument substitutions.
In case the conditional expression evaluates to false, then of course, there
is a need to keep iterating. But along with that, we would not like to contend
for the lock all the time. This would lead to a lot of cache line bouncing across
cores, which is detrimental to performance. We are unnecessarily increasing the
memory and on-chip network traffic. It is a better idea to wait for some time
and try again. This is where the function cpu_relax is used. It makes the
thread back off for some time.
Given that fairness is guaranteed, we will ultimately exit the infinite loop,
and we will come back to the main body of the arch_spin_lock function. In this
case, there is a need to introduce a memory barrier. Note that this is a generic
pattern: whenever we acquire a lock, there is a need to insert a
memory barrier after it. This ensures that prior to entering the critical section all
the reads and writes are fully completed and are visible to all the threads in the
SMP system. Moreover, no instruction in the critical section can complete before
the memory barrier has completed its operation. This ensures that changes
made in the critical section get reflected only after the lock has been acquired.
Listing 5.20: The code for unlocking a spinlock
source: include/asm-generic/spinlock.h
void arch_spin_unlock ( arch_spinlock_t * lock )
{
    u16 * ptr = ( u16 *) lock + IS_ENABLED ( CONFIG_CPU_BIG_ENDIAN ) ;
    u32 val = atomic_read ( lock ) ;
    smp_store_release ( ptr , ( u16 ) val + 1) ; /* store following release consistency semantics */
}
Let us now come to the unlock function. This is shown in Listing 5.20. It
is quite straightforward. The first task is to find the address of the next field.
This needs to be incremented to let the new owner of the lock know that it
can now proceed. There is a complication here. We need to see if the machine
is big endian or little endian. If it is a big endian machine, which basically
means that the lower 16 bits are actually stored in the higher addresses, then
a small correction to the address needs to be made. This logic is embedded in
the isenabled (Big endian) macro. In any case at the end of this statement,
the address of the next field is stored in the ptr variable. Next, we get the
value of the ticket from the lock variable, increment it by 1, and store it in the
address pointed to by ptr, which is nothing but the address of the next field.
Now if there is a thread whose ticket number is equal to the contents of the
next field, then it knows that it is the new owner of the lock. It can proceed
with completing the process of lock acquisition and start executing the critical
section. At the very end of the unlock function, we execute a store with release
semantics (smp_store_release), which basically ensures that all the writes made
in the critical section are visible to the rest of the threads once the lock has
been released. This completes the unlock process.
struct mutex {
atomic_long_t owner ;
raw_spinlock_t wait_lock ;
struct list_head wait_list ;
# ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map ;
# endif
};
The code of the kernel mutex is shown in Listing 5.22. Along with a spinlock
(wait lock), it contains a pointer to the owner of the mutex and a waiting list
of threads. Additionally, to prevent deadlocks it also has a pointer to a lock
dependency map. However, this field is optional – it depends on the compilation
parameters. Let us elaborate.
The owner field is a pointer to the task_struct of the owner. An as-
tute reader may wonder why it is an atomic_long_t and not a task_struct
*. Herein lies a small and neat trick. We wish to provide a fast-path mech-
anism to acquire the lock. We would like the owner field to contain the value
of the task struct pointer of the lock-holding thread, if the lock is currently
acquired and held by a thread. Otherwise, its value should be 0. This neat
trick will allow us to do a compare and exchange on the owner field in the hope
of acquiring the lock quickly. We try the fast path only once. To acquire the
lock, we will compare the value stored in owner with 0. If there is an equality
then we will store a pointer to the currently running thread’s task struct in
its place.
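A simplified sketch of this fast path is shown below; it follows the idea just described (the kernel's actual helper, __mutex_trylock_fast, is slightly more involved).

/* Fast path: atomically change owner from 0 (free) to a pointer to the
   current task. Returns true if the mutex was acquired. */
static inline bool mutex_trylock_fast(struct mutex *lock)
{
    unsigned long zero = 0UL;
    unsigned long curr = (unsigned long)current;   /* current task_struct */

    return atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr);
}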
Otherwise, we enter the slow path. In this case, the threads waiting to
acquire the lock are stored in wait_list, which is protected by the spinlock
wait_lock. This means that before enqueueing the current thread in wait_list,
we need to acquire the spinlock wait_lock first.
Listing 5.23 shows the code of the lock function (mutex_lock) in some more
detail. Its only argument is a pointer to the mutex. First, there is a need to
check whether this call is being made in the right context. For example, the
kernel defines an atomic context in which the code cannot be preempted. In this
context, sleeping is not allowed. Hence, if the mutex_lock call has been made
in this context, it is important to flag this event as an error and also print the
stack trace (the function call path leading to the current function).
Assume that the check passes and we are not in the atomic context. We then
first make an attempt to acquire the mutex via the fast path. If we are not
successful, we try to acquire the mutex via the slow path using the function
__mutex_lock_slowpath.
In the slow path, we again try to acquire the lock, and if that is not
possible, then the process goes to sleep. In general, the task is put in the
UNINTERRUPTIBLE state. This is because we don’t want to wake it up to
process signals. When the lock is released, all such sleeping processes are woken
up such that they can contend for the lock. The process that is successful in
acquiring the spinlock wait_lock adds itself to wait_list and goes to sleep. This
is done by setting its state (in general) to UNINTERRUPTIBLE.
Note that all of this happens while executing inside the kernel. Going to sleep
does not mean going to sleep immediately. It just means setting the status of
the task to either INTERRUPTIBLE or UNINTERRUPTIBLE. The task still runs. It needs to
subsequently invoke the scheduler such that the scheduler can find the most
eligible task to run on the core. Given that the status of the current task is set
to a sleep state, the scheduler will not choose it for execution.
The unlock process pretty much does the reverse. We first check if there are
waiting tasks in the wait list. If there are no waiting tasks, then the owner
field can directly be set to 0, and we can return. However, if there are waiting
tasks, then there is a need to do much more processing. We first have to acquire
the spinlock associated with the wait list (list of waiting processes). Then, we
remove the first entry and extract the task, next, from it. The task next needs
to be woken up in the near future such that it can access the critical section.
However, we are not done yet. We need to set the owner field to next such that
incoming threads know that the lock is acquired by some thread and is not free.
Finally, we release the spinlock and hand over the id of the woken up task next
to the scheduler.
Note that the kernel code can use many other kinds of locks. Their code is
available in the directory kernel/locking.
A notable example is a queue-based spinlock (the MCS lock, qspinlock in the
kernel code). It is in general known to be a very scalable lock that is quite fast.
It also minimizes cache line bouncing (movement of the cache line containing
the lock variable across cores). The idea is that we create a linked list of nodes
where the tail pointer points to the end of the linked list. We then add the
current node (wrapper of the current task) to the very end of this list. This
process requires two operations: make the current tail node point to the new
node (containing our task), and modify the tail pointer to point to the new
node. Both of these operations need to execute atomically – it needs to appear
that both of them executed at a single point of time, instantaneously. The MCS
lock is a classical lock and almost all texts on concurrent systems discuss its
design in great detail. Hence, we shall not delve further into it (see the reference
[Herlihy and Shavit, 2012]). It suffices to say that it uses careful lock-free
programming, and we do not perform busy waiting on a single shared location;
instead, a thread only busy waits on a Boolean variable declared within its own
node structure. When its
predecessor in the list releases the lock, it sets this variable to false, and the
current thread can then acquire the lock. This eliminates cache line bouncing
to a very large extent.
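To make the idea concrete, here is a user-level sketch of an MCS-style lock written with C11 atomics. It is purely illustrative – it is not the kernel's qspinlock code, and memory-ordering details are simplified.

#include <stdatomic.h>
#include <stdbool.h>

struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;                 /* the only flag this thread spins on */
};

static _Atomic(struct mcs_node *) tail = NULL;

void mcs_lock(struct mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* atomically append our node and obtain the previous tail */
    struct mcs_node *prev = atomic_exchange(&tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);  /* link behind the predecessor */
        while (atomic_load(&me->locked))
            ;                           /* spin only on our own node */
    }
}

void mcs_unlock(struct mcs_node *me) {
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expected = me;
        /* no successor yet: try to reset the tail to NULL */
        if (atomic_compare_exchange_strong(&tail, &expected, NULL))
            return;
        /* a successor is in the middle of enqueueing itself; wait for it */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false); /* hand the lock to the successor */
}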
There are a few more variants like the osq_lock (a variant of the MCS lock)
and the qrwlock (a reader-writer lock that gives priority to readers).
The kernel code has its version of semaphores (see Listing 5.24). It has a spin
lock (lock), which protects the semaphore variable count. Akin to user-level
semaphores, the kernel semaphore supports two methods that correspond to
wait and post, respectively. They are known as down (wait) and up (post/signal).
The kernel semaphore functions in exactly the same manner as the user-level
semaphore. After acquiring the lock, the count variable is either incremented
or decremented. However, if the count variable is already zero, then it is not
possible to decrement it and the current task needs to wait. This is the point
at which it is added to the list of waiting processes (wait list) and the task
state is set to UNINTERRUPTIBLE. Similar to the case of unlocking a spinlock,
here also, if the count becomes non-zero from zero, we pick a process
from the wait_list and set its task state to RUNNING. Given that all of this
is happening within the kernel, setting the task state is very easy. All of this
is very hard at the user level for obvious reasons. We need a system call for
everything. However, in the kernel, we do not have those restrictions and thus
these mechanisms are much faster.
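For reference, the structure described above looks roughly as follows (see include/linux/semaphore.h for the authoritative definition).

struct semaphore {
    raw_spinlock_t   lock;       /* protects count and wait_list */
    unsigned int     count;      /* number of available "permits" */
    struct list_head wait_list;  /* tasks sleeping in down() */
};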
In any unsafe state, it is possible that the thread gets preempted and an
interrupt handler runs. This interrupt handler may try to acquire the lock. Note
that any softirq-unsafe state is hardirq-unsafe as well. This is because hardirq
interrupt handlers have a higher priority as compared to softirq handlers.
We define hardirq-safe and hardirq-unsafe analogously. These states will be
used to flag potential deadlock-causing situations.
We next validate the chain of lock acquire calls that have been made. We check
for trivial deadlocks first (fairly common in practice): A → B → A. Such
trivial deadlocks are also known as lock inversions. Let us now use the states.
No path can contain a hardirq-unsafe lock followed by a hardirq-safe lock. Such
a pattern allows an interrupt handler that acquires the latter (hardirq-safe) lock
to interrupt the critical section associated with the former (hardirq-unsafe)
lock, which may lead to a lock inversion deadlock.
Let us now look at the general case in which we search for cyclic (circular)
waits. We need to create a global graph where each lock instance is a node, and
if the process holding lock A waits to acquire lock B, then there is an arrow
from A to B. If we have V nodes and E edges, then the time complexity is
O(V + E). This is quite slow. Note that we need to check for cycles before
acquiring every lock.
Let us use a simple caching-based technique. Consider a chain of lock acqui-
sitions, where the lock acquire calls can possibly be made by different threads.
Given that the same kind of code sequences tend to repeat in the kernel code,
we can cache a full sequence of lock acquisition calls. If the entire sequence is
devoid of cycles, then we can deem the corresponding execution to be deadlock
free. Hence, the brilliant idea here is as follows.
Figure 5.15: A hash table that stores an entry for every chain of lock acquisitions
Instead of checking for a deadlock on every lock acquire, we check for dead-
locks far more infrequently. We consider a long sequence (chain) of locks and
hash all of them. A hash table stores the “deadlock status” associated with
such chains (see Figure 5.15). It is indexed with the hash of the chain. If the
chain has been associated with a cyclic wait (deadlock) in the past, then the
hash table stores a 1, otherwise it stores a 0. This is a much faster mechanism
for checking for deadlocks and the overheads are quite limited. Note that if no
entry is found in the hash table, then either we keep building the chain and
try later, or we run a cycle detection algorithm immediately. This is a generic
mechanism that is used to validate spinlocks, mutexes and reader-writer locks.
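The following hypothetical sketch illustrates the caching idea; the data structure, the names and the helper run_cycle_detection are our own, and the kernel's lockdep code is considerably more sophisticated.

#define CHAIN_TABLE_SIZE 4096

extern int run_cycle_detection(void);   /* assumed helper: full graph-based check */

struct chain_entry {
    unsigned long hash;     /* hash of the sequence of acquired lock classes */
    int           deadlock; /* 1: a cycle was found earlier, 0: validated as safe */
    int           valid;    /* entry in use */
};

static struct chain_entry chain_table[CHAIN_TABLE_SIZE];

int check_chain(unsigned long chain_hash)
{
    struct chain_entry *e = &chain_table[chain_hash % CHAIN_TABLE_SIZE];
    if (e->valid && e->hash == chain_hash)
        return e->deadlock;             /* fast path: reuse the cached verdict */
    /* slow path: run full cycle detection, then cache the result */
    e->hash = chain_hash;
    e->deadlock = run_cycle_detection();
    e->valid = 1;
    return e->deadlock;
}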
The process of allocating and freeing objects is the most interesting. Allo-
cation is per se quite straightforward – we can use the regular malloc call. The
object can then be used by multiple threads. However, freeing the object is rel-
atively more difficult. This is because threads may have references to it. They
may try to access the fields of the object after it has been freed. We thus need
to free the allocated object only when no thread is holding a valid reference
to it or is holding a reference but promises never to use it in the future. In
C, it is always possible to arrive at the old address of an object using pointer
arithmetic. However, let us not consider such tricky situations because RCU
requires a certain amount of disciplined programming.
One may be tempted to use conventional reference counting, which is rather
slow and complicated in a concurrent, multiprocessor setting. A thread needs
to register itself with an object, and then it needs to deregister itself once
it is done using it. Registration and deregistration increment and decrement
the reference count, respectively. Any deallocation can happen only when the
reference count reaches zero. This is a complicated mechanism. The RCU
mechanism [McKenney, 2007] is far simpler.
It needs to have a few key features.
Note that we only focus on the free part because it is the most difficult.
Consider the example of a linked list (see Figure 5.16).
Figure 5.16: Readers may be reading the removed node; the writer synchronizes and reclaims the space; final state
In this case even though we delete a node from the linked list, other threads
may still have references to it. The threads holding a reference to the node will
not be aware that the node has been removed from the linked list. Hence, after
deletion from the linked list we still cannot free the associated object.
count requires busy waiting. We already know the problems with busy waiting,
such as cache line bouncing and doing useless work. This is precisely what we
would like to avoid via the synchronize_rcu call.
Writing is slightly different here – we create a copy of an object, modify it,
and assign the new pointer to a field in the encapsulating data structure. Note
that a pointer is referred to as RCU-protected when it can be assigned and
dereferenced with special RCU-based checks (we shall see later).
Listing 5.25: Example code that traverses a list within an RCU read context
source: include/linux/rcupdate.h
rcu_read_lock () ;
list_for_each_entry_rcu (p , head , list ) {
    t1 = p->a ;
    t2 = p->b ;
}
rcu_read_unlock () ;
Listing 5.26: Replace an item in a list and then wait till all the readers finish
list_replace_rcu (& p->list , & q->list ) ;
synchronize_rcu () ;
kfree ( p ) ;
Listing 5.26 shows a piece of code that waits till all the readers complete. In
this case, one of the threads calls the list replace rcu function that replaces
an element in the list. It is possible that there are multiple readers who currently
have a reference to the old element (p->list) and are currently reading it. We
need to wait for them to finish the read operation. The only assumption that
can be made here is that all of them are accessing the list in an RCU context –
the code is wrapped between the RCU read lock and read unlock calls.
The function synchronize_rcu makes the thread wait for all the readers to
complete. Once all the readers have completed, we can be sure that the old
pointer will not be read again. This is because the readers will check if the node
pointed to by the pointer is still a part of the linked list or not. This is not
enforced by RCU per se. Coders nevertheless have to observe such rules if they
want to use RCU correctly.
After this we can free the pointer p using the kfree call.
Let us now consider an example that uses the rcu_assign_pointer function
in the context of the list_replace_rcu function (see Listing 5.28). In
a doubly-linked list, we need to replace the old entry old with the new entry new.
We first set the next and prev pointers of new (make them the same as those of
old). Note that at this point, the node is not yet added to the list.
It is added when the next pointer of new->prev is set to new. This
is the key step that adds the new node to the list. This pointer assignment is
done in an RCU-protected manner (using rcu_assign_pointer) because it delinks
the earlier node from the list. There may be references to the earlier node that
are still held by readers. This is why this pointer assignment has to be done in
an RCU context. We need to wait for those readers to complete before old is
deallocated.
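For reference, the kernel's list_replace_rcu routine (include/linux/rculist.h) follows essentially these steps; a lightly simplified version is shown below.

static inline void list_replace_rcu(struct list_head *old,
                                    struct list_head *new)
{
    new->next = old->next;                 /* copy the links of old */
    new->prev = old->prev;
    /* publish new: readers traversing the list now see new instead of old */
    rcu_assign_pointer(list_next_rcu(new->prev), new);
    new->next->prev = new;
    old->prev = LIST_POISON2;              /* catch buggy later uses of old */
}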
Dereferencing a Pointer
At this point the object can be freed, and its space can be reclaimed. This
method is simple and slow.
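A reader dereferences an RCU-protected pointer using rcu_dereference, and only inside a read-side critical section. The snippet below shows this standard usage pattern; the pointer gptr and its type are hypothetical.

struct foo { int a; };
struct foo __rcu *gptr;                    /* an RCU-protected global pointer */

int read_a(void)
{
    int val = -1;
    rcu_read_lock();                       /* enter the read-side critical section */
    struct foo *p = rcu_dereference(gptr); /* safe, fence-aware dereference */
    if (p)
        val = p->a;
    rcu_read_unlock();                     /* leave the critical section */
    return val;                            /* p must not be used beyond this point */
}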
Figure 5.17: Removal and reclamation of an object (within the RCU context)
Let us now understand when the grace period (from the point of view of a
thread) ends and the period of quiescence starts. One of the following conditions
needs to be satisfied.
1. When a thread blocks: Given the restriction that blocking calls are not
allowed in an RCU read block, if the thread blocks, we can be sure that it
is in the quiescent state.
3. If the kernel enters an idle loop, then also we can be sure that the read
block is over.
Whenever any of these conditions is true, we set a bit that indicates that the
CPU is out of the RCU read block – it is in the quiescent state. The reason that
this enables better performance is as follows. There is no need to send costly
inter-processor interrupts to each CPU and wait for a task to execute. Instead,
we adopt a more proactive approach. The moment a thread leaves the read
block, the CPU enters the quiescent state and this fact is immediately recorded
by setting a corresponding per-CPU bit. Note the following: this action is off
the critical path and there is no shared counter.
Once all the CPUs enter a quiescent state, the grace period ends and the
object can be reclaimed. Hence, it is important to answer only one question
when a given CPU enters the quiescent state: Is this the last CPU to have
entered the quiescent state? If the answer is “Yes”, then we can go forward
and declare that the grace period has ended. The object can then be reclaimed.
This is because we can be sure that no thread holds a valid reference to the
object (see the answer to Question 5.3.5).
Question 5.3.1
What if there are more threads than CPUs? It is possible that all of
them hold references to an object. Why are we maintaining RCU state
at a CPU level?
Answer: We assume that whenever a thread accesses an object that
is RCU-protected, it accesses it only within an RCU context (within a
read block). Furthermore, a check is also made within the read block
that the object is still a part of the containing data structure. The thread
cannot access the object outside the RCU context. Now, once a thread enters
an RCU read block, it cannot be preempted until it has finished executing
the read block.
It is not possible for the thread to continue to hold a reference and use it.
This is because it can be used once again only within the RCU context,
and there it will be checked if the object is a part of its containing data
structure. If it has been removed, then the object’s reference has no
value.
For a similar reason, no other thread running on the CPU can access
the object once the object has been removed and the quiescent state has
been reached on the CPU. Even if another thread runs on the CPU, it
will not be able to access the same object because it will not find it in
the containing data structure.
Tree RCU
Let us now suggest an efficient method of managing the quiescent state of all
CPUs. The best way to do so is to maintain a tree. Trees have natural paral-
lelism; they avoid centralized state.
struct rcu_state is used to maintain quiescence information across the
cores. Whenever the grace period ends (all the CPUs have been quiescent at
least once), a callback function may be called. This will let the writer know
that the object can be safely reclaimed.
Figure: struct rcu_state at the root, with struct rcu_node structures organized as a tree; a callback function can be registered, which is called when the grace period ends
Preemptible RCU
Sadly, RCU stops preemption and migration when the control is in an RCU block
(read-side critical section). This can be detrimental to real-time programs as
they come with strict timing requirements and deadlines. For real-time versions
of Linux, there is a need to have a preemptible version of RCU where preemption
is allowed within an RCU read-side block. Even though doing this is a good
idea for real-time systems, it can lead to many complications.
In classical RCU, read-side critical sections had almost zero overhead. Even
on the write side, all that we had to do was to read the current data structure, make a
copy, make the changes (update) and add it to the encapsulating data structure
(such as a linked list or a tree). The only challenge was to wait for all the
outstanding readers to complete, which has been solved very effectively.
Here, there are many new complications. If there is a context switch in the
middle of a read block, then the read-side critical section gets “artificially length-
ened”. We can no longer use the earlier mechanisms for detecting quiescence. In
this case, whenever a process enters a read block, it needs to register itself, and
then it needs to deregister itself when it exits the read block. Registration and
deregistration can be implemented using counter increments and decrements,
respectively. The rcu read lock function needs to increment a counter and
the rcu read unlock function needs to decrement a counter. These counters
are now a part of a process’s context, not the CPU’s context (unlike classical
RCU). This is because we may have preemption and subsequent migration. It
is also possible for two concurrent threads to run on a CPU that access RCU-
protected data structures. Note that this was prohibited earlier. We waited
for a read block to completely finish before running any other thread. In this
case, two read blocks can run concurrently (owing to preemption). Once pre-
empted, threads can also migrate to other CPUs. Hence, counters can no longer
be per-CPU counters. State management thus becomes more complex.
We have thus enabled real-time execution and preemption at the cost of
making RCU slower and more complex.
5.4 Scheduling
Scheduling is one of the most important activities performed by an OS. It is a
major determinant of the overall system’s responsiveness and performance.
the same for all the jobs. Here again, there are two types of problems. In one
case, the jobs that will arrive in the future are known. In the other case, we
have no idea – jobs may arrive at any point of time.
Figure 5.19: Example of a set of jobs awaiting scheduling: J1, J2, J3 and J4 with processing times 3, 2, 4 and 1, respectively
Figure 5.19 shows an example where we have a bunch of jobs that need to
be scheduled. We assume that the time that a job needs to execute (processing
time) is known (shown in the figure).
In Figure 5.20, we introduce an objective function, which is the mean job
completion time. The completion time is the duration between the arrival time
and the time at which the job fully completes. This determines the responsive-
ness of the system. It is possible for a system to minimize the makespan yet
unnecessarily delay a lot of jobs, which shall lead to an adverse mean completion
time value.
Figure 5.20: Mean completion time µ = (Σi ti)/n; jobs J1, J2, J3 and J4 complete at times t1, t2, t3 and t4, respectively
We can thus observe that the problem of scheduling is a very fertile ground
for proposing and solving optimization problems. We can have a lot of con-
straints, settings and objective functions.
To summarize, we have said that in any scheduling problem, we have a list
of jobs. Each job has an arrival time, which may either be equal to 0 or some
other time instant. Next, we typically assume that we know how long a job shall
take to execute. Then in terms of constraints, we can either have preemptible
jobs or we can have non-preemptible jobs. The latter means that the entire
job needs to execute in one go without any other intervening jobs. Given these
constraints, there are a couple of objective functions that we can minimize. One
would be to minimize the makespan, which is basically the time from the start
of scheduling till the time it takes for the last job to finish execution. Another
objective function is the average completion time, where the completion time is
again defined as the time at which a job completes minus the time at which it
arrived (measure of the responsiveness).
For scheduling such a set of jobs, we have a lot of choices. We can use many
simple algorithms, which in some cases, can also be proven to be optimal. Let us
start with the random algorithm. It randomly picks a job and schedules it on a
free core. There is a lot of work that analyzes the performance of such algorithms
and many times such random choice-based algorithms perform quite well. In
the space of deterministic algorithms, the shortest job first (SJF) algorithm is
preferred. It schedules all the jobs in ascending order of their execution times.
It is a non-preemptible algorithm. We can prove that it minimizes the average
completion time.
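As a small illustration (our own example), the following program sorts a set of jobs by processing time and computes the resulting mean completion time; with the processing times of Figure 5.19 it reports a mean of 5.00.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;     /* ascending processing time */
}

int main() {
    int p[] = {3, 2, 4, 1};                       /* processing times of J1..J4 */
    int n = sizeof(p) / sizeof(p[0]);

    qsort(p, n, sizeof(int), cmp);                /* SJF order */

    int finish = 0, total = 0;
    for (int i = 0; i < n; i++) {
        finish += p[i];                           /* completion time of this job */
        total  += finish;
    }
    printf("mean completion time = %.2f\n", (double)total / n);
    return 0;
}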
KSW Model
Let us now introduce a more formal way of thinking and introduce the Karger-
Stein-Wein (KSW) model [Karger et al., 1999]. It provides an abstract or generic
framework for all scheduling problems. It essentially divides the space of prob-
lems into large classes and finds commonalities in between problems that belong
to the same class. Specifically, it requires three parameters: α, β and γ.
The first parameter α determines the machine environment. It specifies the
number of cores, the number of jobs, and the execution time of each job. The
second parameter
β specifies the constraints. For example, it specifies whether preemption is al-
lowed or not, whether the arrival times are all the same or are different, whether
the jobs have dependencies between them or whether there are job deadlines.
A dependency between a pair of jobs can exist in the sense that we can specify
that job J1 needs to complete before J2 . Note that in real-time systems, jobs
come with deadlines, which basically means that jobs have to finish before a
certain time. A deadline is thus one more type of constraint.
Finally, the last parameter is γ, which is the optimality criterion. We have
already discussed the mean completion time and makespan criteria.
We can also define a weighted completion time – a weighted mean of completion
times. Here a weight in a certain sense represents a job’s priority. Note that
the mean completion time metric is a special case of the weighted completion
time metric – all the weights are equal to 1. Let the completion time of job i
be Ci . The cumulative completion time is equivalent to the mean completion
time in this case because the number of jobs behaves like a constant. We can
represent this criterion as ΣCi . The makespan is represented as Cmax (maximum
completion time of all jobs).
We can consequently have a lot of scheduling algorithms for every scheduling
problem, which can be represented using the 3-tuple α | β | γ as per the KSW
formulation.
We will describe two kinds of algorithms in this book. The most popular
algorithms are quite simple, and are also provably optimal in some scenarios.
We will also introduce a bunch of settings where finding the optimal schedule is
an NP-complete problem [Cormen et al., 2009]. There are good approximation
algorithms for solving such problems.
Let us define the problem 1 || ΣCj in the KSW model. We are assuming
that there is a single core. The objective function is to minimize the sum of
completion times (Cj ). Note that minimizing the sum of completion times is
equivalent to minimizing the mean completion time because the number of tasks
is known a priori and is a constant.
Figure 5.21: Shortest job first scheduling: the jobs execute in the order J4, J2, J1, J3 with processing times 1, 2, 3 and 4, respectively
The claim is that the SJF (shortest job first) algorithm is optimal in this
case (example shown in Figure 5.21). Let us outline a standard approach for
proving that a scheduling algorithm is optimal with respect to the criterion that
is defined in the KSW problem. Here we are minimizing the mean completion
time.
Let the SJF algorithm be algorithm A. Assume that another algorithm A′, whose
schedule does not follow the SJF order, is optimal. Then there must be a pair of
jobs j and k in A′'s schedule such that j immediately precedes k and the
processing time (execution time) of j is greater than that of k, i.e., pj > pk.
Note that such a pair of jobs will not be found in the schedule produced by
algorithm A. Assume that job j starts at time t. Let us exchange jobs j and k,
with the rest of the schedule remaining the same. Let this new schedule be
produced by another algorithm A″.
Next, let us evaluate the contribution to the cumulative completion time by
jobs j and k in algorithm A′. It is (t + pj) + (t + pj + pk). Let us evaluate
the contribution of these two jobs in the schedule produced by A″. It is (t +
pk) + (t + pj + pk). Given that pj > pk, we can conclude that the schedule
produced by algorithm A′ has a higher cumulative completion time. This cannot
be the case because we have assumed A′ to be optimal. We have a contradiction
here because A″ appears to be better than A′, which violates our assumption.
Hence, A′ or any algorithm that violates the SJF order cannot be optimal.
Thus, algorithm A (SJF) is optimal.
Weighted Jobs
Let us now define the problem where weights are associated with jobs. It will
be 1 || Σ wj Cj in the KSW formulation. If ∀j, wj = 1, we have the classical
unweighted formulation for which SJF is optimal.
For the weighted version, let us schedule jobs in descending order of (wj /pj ).
Clearly, if all wj = 1, this algorithm is the same as SJF. We can use the same
exchange-based argument to prove that using (wj /pj ) as the job priority yields
an optimal schedule.
EDF Algorithm
Let us next look at the EDF (Earliest Deadline First) algorithm. It is one of
the most popular algorithms in real-time systems. Here, each job is associated
with a distinct non-zero arrival time and deadline. Let us define the lateness as
⟨completion time⟩ − ⟨deadline⟩, and the problem as follows: 1 | ri , dli , pmtn | Lmax .
We are still considering a single core machine. The constraints are on the
arrival time and deadline. The constraint ri represents the fact that job i is
associated with arrival time ri – it can start only after it has arrived (ri ). Jobs
can arrive at any point of time (dynamically). The dli constraint indicates
that job i has deadline dli associated with it – it needs to complete before it.
Preemption is allowed (pmtn). We wish to minimize the maximum lateness
(Lmax ). This means that we would like to ensure that jobs complete as soon as
possible, with respect to their deadline. Note that in this case, we care about
the maximum value of the lateness, not the mean value. This means that we
don’t want any single job to be delayed significantly.
The algorithm schedules the job whose deadline is the earliest. Assume that
a job is executing, and a new job arrives that has an earlier deadline. Then the
currently running job is swapped out, and the new job that now has the earliest
deadline executes.
If the set of jobs is schedulable, which means that it is possible to find
a schedule such that no job misses its deadline, then the EDF algorithm will
produce such a schedule. If they are not schedulable, then the EDF algorithm
will broadly minimize the time by which jobs miss their deadline (minimize
Lmax ).
The proof is on similar lines and uses exchange-based arguments (refer to
[Mall, 2009]).
SRTF Algorithm
Let us continue our journey and consider another problem: 1 | ri , pmtn | ΣCi .
Consider a single core machine where the jobs arrive at different times and
preemption is allowed. We aim to minimize the mean/cumulative completion
time.
In this case, the optimal algorithm is SRTF (shortest remaining time
first). For each job, we keep a tab on the time that is left for it to finish
execution. We sort this list in ascending order and choose the job that has the
shortest amount of time left. If a new job arrives, we compute its remaining time
and if that number happens to be the lowest, then we preempt the currently
running job and execute the newly arrived job.
We can prove that this algorithm minimizes the mean (cumulative) comple-
tion time using a similar exchange-based argument.
• 1 | ri | ΣCi : In this case, preemption is not allowed and jobs can arrive at
any point of time. There is much less flexibility in this problem setting.
This problem is provably NP-complete.
• 1 | ri | Lmax : This problem is similar to the former. Instead of the average
(cumulative) completion time, we have lateness as the objective function.
• 1 | ri , pmtn | Σwi Ci : This is a preemptible problem that is a variant
of 1 | ri , pmtn | ΣCi , which has an optimal solution – SRTF. The only
addendum is the notion of the weighted completion time. It turns out
that for generic weights, this problem becomes NP-complete.
We thus observe that making a small change to the problem renders it NP-
complete. This is how sensitive these scheduling problems are.
Practical Considerations
All the scheduling problems that we have seen assume that the job execution
(processing) time is known. This may be the case in really well-characterized
and constrained environments. However, in most practical settings, this is not
known.
Figure 5.22 shows a typical scenario. Any task typically cycles between two
bursts of activity: a CPU-bound burst and an I/O burst. The task typically
does a fair amount of CPU-based computation, and then makes a system call.
This initiates a burst where the task waits for some I/O operation to complete.
We enter an I/O bound phase in which the task typically does not actively
execute. We can, in principle, treat each CPU-bound burst as a separate job.
Each task thus yields a sequence of jobs that have their distinct arrival times.
The problem reduces to predicting the length of the next CPU burst.
We can use classical time-series methods to predict the length of the CPU
burst. We predict the length of the nth burst tn as a function of tn−1 , tn−2 . . . tn−k .
For example, tn could be computed as a weighted combination of the previous
burst lengths. Using such predictions, the algorithms listed in the previous
sections like EDF, SJF and SRTF can be used. At least some degree of
near-optimality can be achieved.
Let us consider the case when we have a poor prediction accuracy. We need
to then rely on simple, classical and intuitive methods.
Conventional Algorithms
We can always make a random choice, however, that is definitely not desirable
here. Something that is much more fair is a simple FIFO algorithm. To im-
plement it, we just need a queue of jobs. It guarantees the highest priority to
the job that arrived the earliest. A problem with this approach is the “convoy
effect”. A long-running job can unnecessarily delay a lot of smaller jobs. If we
had scheduled them first, the average completion time would have been much
lower.
We can alternatively opt for round-robin scheduling. We schedule a job for
one time quantum. After that we preempt the job and run another job for one
time quantum, so on and so forth. This is at least fairer to the smaller jobs.
They complete sooner.
There is thus clearly a trade-off between the priority of a task and system-
level fairness. If we boost the priority of a task, it may be unfair to other tasks
(refer to Figure 5.23).
Figure 5.23: The trade-off between priority and fairness
Queue-based Scheduling
A standard method of scheduling tasks that have different priorities is to use
a multilevel feedback queue as shown in Figure 5.24. Different queues in this
composite queue are associated with different priorities. We start with the
highest-priority queue and start scheduling tasks using any of the algorithms
that we have studied. If empty cores are still left, then we move down the
priority order of queues: schedule tasks from the second-highest priority queue,
third-highest priority queue and so on. Again note that we can use a different
scheduling algorithm for each queue. They are independent in that sense.
Depending upon the nature of the task and for how long it has been waiting,
tasks can migrate between queues. To provide fairness, tasks in low-priority
queues can be moved to high-priority queues. If a background task suddenly
comes into the foreground and becomes interactive, its priority needs to be
boosted, and the task needs to be moved to a higher priority queue. On the
other hand, if a task stays in the high-priority queues for a long time, we can
demote it to ensure fairness. Such movements ensure both high performance
and fairness.
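A minimal sketch of a multilevel feedback queue is shown below. It assumes three fixed priority levels, each backed by a simple linked list; the structure and function names are illustrative and do not come from any real kernel.

#include <stddef.h>

#define NUM_LEVELS 3

struct task {
    int id;
    struct task *next;
};

/* One queue per priority level; index 0 is the highest priority. */
static struct task *queues[NUM_LEVELS];

/* Pick the next task: walk the levels from the highest to the lowest
 * priority and dequeue the first task found (NULL if all are empty). */
struct task *mlfq_pick_next(void)
{
    for (int level = 0; level < NUM_LEVELS; level++) {
        if (queues[level] != NULL) {
            struct task *t = queues[level];
            queues[level] = t->next;
            return t;
        }
    }
    return NULL;
}

/* Demote a task that exhausted its time quantum at the given level;
 * promotion (for fairness or interactivity) would move it the other way. */
void mlfq_demote(struct task *t, int level)
{
    int new_level = (level + 1 < NUM_LEVELS) ? level + 1 : level;
    t->next = queues[new_level];       /* simplified: insert at the head */
    queues[new_level] = t;
}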
Let us now come to the issue of multicore scheduling. The big picture is shown in Figure 5.25. We have a global queue of tasks that typically contains newly created tasks or tasks that need to be migrated. A dispatcher module
sends the tasks to different per-CPU task queues. Theoretically, it is possible
to have different scheduling algorithms for different CPUs. However, this is not
a common pattern. Let us again look at the space of problems in the multicore
domain.
Bin Packing Problem: We have a finite number of bins, where each bin
has a fixed capacity S. There are n items. The size of the ith item is si .
We need to pack the items in bins without exceeding any bin’s capacity.
The objective is to minimize the number of bins used and to find an optimal mapping of items to bins.
List Scheduling
Let us consider one of the most popular non-preemptive scheduling algorithms
in this space known as list scheduling. We maintain a list of ready jobs. They
are sorted in descending order according to some priority scheme. The priority
here could be the user’s job priority or could be some combination of the arrival
time, deadline, and the time that the job has waited for execution. When a
CPU becomes free, it fetches the highest-priority task from the list. In case it is not possible to execute that job, the CPU walks down the list and finds
a job to execute. The only condition here is that we cannot return without a
job if the list is non-empty. Moreover, all the machines are considered to be
identical in terms of computational power.
Let us take a deeper look at the different kinds of priorities that we can use.
We can order the jobs in descending order of arrival time or job processing time.
We can also consider dependencies between jobs. In this case, it is important to
find the longest path in the graph (jobs are nodes and dependency relationships
are edges). The longest path is known as the critical path. The critical path often
determines the overall makespan of the schedule assuming we have adequate
compute resources. This is why in almost all scheduling problems, a lot of
emphasis is placed on the critical path. We always prefer scheduling jobs on
the critical path as opposed to jobs off the critical path. We can also consider
attributes associated with nodes in this graph. For example, we can set the
priority to be the out-degree (number of outgoing edges). If a job has a high out-degree, then it means that a lot of other jobs are dependent on it. Hence, if this job is scheduled, many other jobs will benefit – they will have one fewer dependency.
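The core loop of list scheduling can be sketched as follows. The array-based ready list, the priority field and the can_run() feasibility check are illustrative placeholders for whatever priority scheme and dependency structure is being used.

#include <stdbool.h>
#include <stddef.h>

struct job {
    int id;
    int priority;       /* higher value = higher priority */
    bool scheduled;
};

/* Illustrative feasibility check; a real system would verify that all
 * predecessors of the job in the dependency graph have completed. */
static bool can_run(const struct job *j) { (void)j; return true; }

/* Called when a CPU becomes free: walk the list (kept sorted in
 * descending order of priority) and return the first runnable job. */
struct job *list_schedule(struct job *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!list[i].scheduled && can_run(&list[i])) {
            list[i].scheduled = true;
            return &list[i];
        }
    }
    return NULL;        /* nothing left to schedule */
}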
It is possible to prove that list scheduling is often near-optimal in some cases
using theoretical arguments [Graham, 1969]. Consider the problem P || Cmax .
Let the makespan (Cmax ) produced by an optimal scheduling algorithm OPT
have a length C ∗ . Let us compute the ratio of the makespan produced by list
scheduling and C ∗ . Our claim is that regardless of the priority that is used, we
are never worse off by a factor of 2.
Theorem 5.4.1 (Makespan) For the problem P || Cmax, list scheduling (with any priority order) produces a makespan $C_{max}$ that satisfies $C_{max}/C^* \le 2 - 1/m$, where $C^*$ is the optimal makespan and $m$ is the number of CPUs.
Proof: Let there be n jobs and m CPUs. Let the execution times of the jobs
be $p_1, \ldots, p_n$, and let job k (with execution time $p_k$) be the last to complete. Assume it started
at time t. Then Cmax = t + pk .
Given that there is no idleness in list scheduling, we can conclude that till t
all the CPUs were 100% busy. This means that if we add all the work done by
all the CPUs till point t, it will be mt. This comprises the execution times of
a subset of jobs that does not include job k (one that completes the last). We
thus arrive at the following inequality.
$$
\begin{aligned}
\sum_{i \neq k} p_i &\ge mt \\
\Rightarrow \sum_{i} p_i - p_k &\ge mt \\
\Rightarrow t &\le \frac{1}{m}\sum_{i} p_i - \frac{p_k}{m} \qquad (5.2)\\
\Rightarrow t + p_k = C_{max} &\le \frac{\sum_i p_i}{m} - \frac{p_k}{m} + p_k \\
\Rightarrow C_{max} &\le \frac{\sum_i p_i}{m} + p_k\left(1 - \frac{1}{m}\right)
\end{aligned}
$$
Now, $C^* \ge p_k$ and $C^* \ge \frac{1}{m}\sum_i p_i$. These follow from the facts that a job cannot be split across CPUs (no preemption) and that the total work has to be shared by $m$ CPUs. We thus have,
$$
\begin{aligned}
C_{max} &\le \frac{\sum_i p_i}{m} + p_k\left(1 - \frac{1}{m}\right) \\
&\le C^* + C^*\left(1 - \frac{1}{m}\right) \qquad (5.3)\\
\Rightarrow \frac{C_{max}}{C^*} &\le 2 - \frac{1}{m}
\end{aligned}
$$
Let us look at the data structures used in the Banker’s algorithm (see Ta-
ble 5.2). There are n processes and m types of resources. The entry avlbl[i] stores the number of available copies of resource i.
In Algorithm 1, we first initialize the cur cnt array and set it equal to avlbl
(count of free resources). At the beginning, the request of no process is assumed
to be satisfied (allotted). Hence, we set the value of all the entries in the array
done to false.
Next, we need to find a process with id i such that it is not done yet (done[i]
== false) and its requirements stored in the need[i] array are element-wise less
than cur cnt. Let us define some terminology here before proceeding forward.
need[][] is a 2-D array. need[i] is a 1-D array that captures the resource
requirements for process i – it is the ith row in need[n][m] (row-column format).
For two 1-D arrays A and B of the same size, the expression A ≺ B means that
∀i, A[i] ≤ B[i] and ∃j, A[j] < B[j]. This means that each element of A is less
than or equal to the corresponding element of B. Furthermore, there is at least
one entry in A that is strictly less than the corresponding entry in B. If both
the arrays are element-wise identical, we write A = B. Now, if either of the cases is true – A ≺ B or A = B – we write A ⪯ B.
Let us now come back to need[i] ⪯ cur cnt. It means that the maximum remaining requirement of process i is at most the currently available count of resources (for all entries) – the request of process i can be satisfied.
If no such process is found, we jump to the last step. It is the safety check
step. However, if we are able to find such a process with id i, then we assume
that it will be able to execute. It will subsequently return all the resources that
it currently holds (acq[i]) back to the free pool of resources (cur cnt). Given
that we were able to satisfy the request for process i, we set done[i] equal to
true. We continue repeating this process till we can satisfy as many requests
of processes as we can.
Let us now come to the last step, where we perform the safety check. If
the requests of all the processes are satisfied, all the entries in the done array
will be equal to true. It means that we are in a safe state – all the requests
of processes can be satisfied. In other words, all the requests that are currently
pending can be safely accommodated. Otherwise, we are in an unsafe state. It
basically means that we have more requirements as compared to the number of
free resources. This situation indicates a potential deadlock.
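A compact C rendering of this safety check is shown below. The array names mirror the ones used in the text (avlbl, acq, need, cur cnt and done); the array sizes and the code itself are an illustrative sketch rather than the exact pseudocode of Algorithm 1.

#include <stdbool.h>
#include <string.h>

#define N 8   /* number of processes (illustrative) */
#define M 4   /* number of resource types (illustrative) */

int avlbl[M];       /* free copies of each resource */
int acq[N][M];      /* resources currently acquired by each process */
int need[N][M];     /* remaining declared requirements of each process */

/* Returns true if the system is in a safe state. */
bool is_safe(void)
{
    int cur_cnt[M];
    bool done[N] = { false };
    bool progress = true;

    memcpy(cur_cnt, avlbl, sizeof(cur_cnt));

    while (progress) {
        progress = false;
        for (int i = 0; i < N; i++) {
            if (done[i])
                continue;
            /* Can the remaining needs of process i be satisfied right now? */
            bool fits = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > cur_cnt[j])
                    fits = false;
            if (!fits)
                continue;
            /* Assume process i runs to completion and returns its resources. */
            for (int j = 0; j < M; j++)
                cur_cnt[j] += acq[i][j];
            done[i] = true;
            progress = true;
        }
    }

    /* Safety check: all processes must be marked done. */
    for (int i = 0; i < N; i++)
        if (!done[i])
            return false;   /* unsafe state: potential deadlock */
    return true;
}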
Let us now look at the resource request algorithm (Algorithm 2). We start
out with introducing a new array called req, which holds process i’s require-
ments. For example, if req[j] is equal to k, it means that process i needs k
copies of resource j.
Let us now move to the check phase. Consider the case where need[i] ≺ req,
which basically means that every entry of req is greater than or equal to the cor-
responding entry of need[i], and at least one entry is strictly greater than the
corresponding entry in need[i]. In this case, there are clearly more require-
ments than what was declared a priori (stored in the need[i] array). Such
requests cannot be satisfied. We need to return false. On the other hand, if
avlbl ≺ req, then it means that we need to wait for resource availability, which
may happen in the future. In this case, we are clearly not exceeding pre-declared
thresholds, as we were doing in the former case.
Next, let us make a dummy allocation once enough resources become avail-
able (allocate). The first step is to subtract req from avlbl. This basically
means that we satisfy the request for process i. The resources that it requires
are not free anymore. Then we add req to acq[i], which basically means that
the said resources have been acquired. We then proceed to subtract req from
need[i]. This is because at all points of time, max = acq + need.
After this dummy allocation, we check if the state is safe or not by invoking
Algorithm 1. If the state is not safe, then it means that the current resource
allocation request should not be allowed – it may lead to a deadlock.
Let us now understand the expression reqs[i] ⪯ cur cnt. This basically
means that for some process i, we can satisfy its request at that point of time.
We subsequently move to update, where we assume that i’s request has been
satisfied. Therefore, similar to the safety checking algorithm, we return the
resources that i had held. We thus add acq[i] to cur cnt. This process is done
now (done[i] ← true). We go back to the find procedure and keep iterating till
we can satisfy the requests of as many processes as possible. When this is not
possible anymore, we jump to deadlock check.
Now, if done[i] == true for all processes, then it means that we were able
to satisfy the requests of all processes. There cannot be a deadlock. However,
if this is not the case, then it means that there is a dependency between pro-
cesses because of the resources that they are holding. This indicates a potential
deadlock situation.
There are several ways of avoiding a deadlock. The first is that before every
resource/lock acquisition we check the request using Algorithm 2. We do not
acquire the resource if we are entering an unsafe state. If the algorithm is
more optimistic, and we have entered an unsafe state already, then we perform
a deadlock check, especially when the system does not appear to make any
progress. We kill one of the processes involved in a deadlock and release its
resources. We can choose one of the processes that has been waiting for a long
time or has a very low priority.
do {
        preempt_disable();
        __schedule(SM_NONE);
        sched_preempt_enable_no_resched();
} while (need_resched());
There are several ways in which the schedule function can be called. If
a task makes a blocking call to a mutex or semaphore, then there is a pos-
sibility that it may not acquire the mutex/semaphore. In this case, the task
needs to be put to sleep. The state will be set to either INTERRUPTIBLE or
UNINTERRUPTIBLE. Since the current task is going to sleep, there is a need
to invoke the schedule function such that another task can execute.
The second case is when a process returns after processing an interrupt or
system call. The kernel checks the TIF NEED RESCHED flag. If it is set to true,
then it means that there are waiting tasks and there is a need to schedule them.
On similar lines, if there is a timer interrupt, there may be a need to swap the
current task out and bring a new task in (preemption). Again, we need to call the schedule function to pick a new task to execute on the current core.
Every CPU has a runqueue where tasks are added. This is the main data
structure that manages all the tasks that are supposed to run on a CPU. The
apex data structure here is the runqueue (struct rq) (see kernel/sched/sched.h).
Linux defines different kinds of schedulers (refer to Table 5.3). Each sched-
uler uses a different algorithm to pick the next task that needs to run on a
core. The internal schedule function is a wrapper function on the individual
scheduler-specific function. There are many types of runqueues – one for each
type of scheduler.
Scheduling Classes
Let us introduce the notion of scheduling classes. A scheduling class represents
a class of jobs that need to be scheduled by a specific type of scheduler. Linux
defines a hierarchy of scheduling classes. This means that if there is a pending
job in a higher scheduling class, then we schedule it first before scheduling a job
in a lower scheduling class.
The classes are as follows in descending order of priority.
Stop Task This is the highest priority task. It stops everything and executes.
DL This is the deadline scheduling class that is used for real-time tasks. Every
task is associated with a deadline. Typically, audio and video encoders
create tasks in this class. This is because they need to finish their work
in a bounded amount of time. For 60-Hz video, the deadline is 16.66 ms.
RT These are regular real-time threads that are typically used for processing
interrupts (top or bottom halves), for example softIRQs.
Fair This is the default scheduler that the current version of the kernel uses
(v6.2). It ensures a degree of fairness among tasks where even the lowest
priority task gets some CPU time.
Idle This scheduler runs the idle process, which means it basically accounts for
the time in which the CPU is not executing anything – it is idle.
In Listing 5.33, we observe that most of the functions have the same broad
pattern. The key argument is the runqueue struct rq that is associated with
each CPU. It contains all the task structs scheduled to run on a given CPU. In
any scheduling operation, it is mandatory to provide a pointer to the runqueue
such that the scheduler can find a task among all the tasks in the runqueue to
execute on the core. We can then perform several operations on it such as en-
queueing or dequeueing a task: enqueue task and dequeue task, respectively.
The most important functions in any scheduler are the functions pick task
and pick next task – they select the next task to execute. These functions are
scheduler specific. Each type of scheduler maintains its own data structures and
has its own internal notion of priorities and fairness. Based on the scheduler’s
task selection algorithm an appropriate choice is made. The pick task function
is the fast path that finds the highest priority task (all tasks are assumed to
be separate), whereas the pick next task function is on the slow path. The
slow path incorporates some additional functionality, which can be explained
as follows. Linux has the notion of control groups (cgroups). These are groups
of processes that share scheduling resources. Linux ensures fairness across pro-
cesses and cgroups. In addition, it ensures fairness between processes in a
cgroup. cgroups further can be grouped into hierarchies. The pick next task
function ensures fairness while also considering cgroup information.
Let us consider a few more important functions. migrate task rq mi-
grates the task to another CPU – it performs the crucial job of load balancing.
update curr performs some bookkeeping for the current task – it updates its
runtime statistics. There are many other functions in this class such as func-
tions to yield the CPU, check for preemptibility, set CPU affinities and change
priorities.
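Internally, each scheduler exposes these operations to the core scheduler through a structure of function pointers (struct sched_class). The fragment below is a heavily abridged, illustrative sketch; the real definition in kernel/sched/sched.h has many more callbacks and somewhat different signatures.

/* Abridged, illustrative sketch of a scheduling-class interface. */
struct rq;                 /* per-CPU runqueue (opaque here) */
struct task_struct;        /* task descriptor (opaque here) */

struct sched_class_sketch {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct *(*pick_task)(struct rq *rq);       /* fast path */
    struct task_struct *(*pick_next_task)(struct rq *rq);  /* slow path (cgroup-aware) */

    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);  /* load balancing */
    void (*update_curr)(struct rq *rq);                    /* runtime bookkeeping */
};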
These scheduling classes are defined in the kernel/sched directory. Each
scheduling class has an associated scheduler, which is defined in a separate C
file (see Table 5.3).
Scheduler                          File
Stop task scheduler                stop_task.c
Deadline scheduler                 deadline.c
Real-time scheduler                rt.c
Completely fair scheduler (CFS)    fair.c
Idle-task scheduler                idle.c
The runqueue
Let us now take a deeper look at a runqueue (struct rq) in Listing 5.34. The
entire runqueue is protected by a single spinlock lock. It is used to lock all key
operations on the runqueue. Such a global lock that protects all the operations
on a data structure is known as a monitor lock.
The next few fields are basic CPU statistics. The field nr running is the
number of runnable processes in the runqueue. nr switches is the number of
process switches on the CPU and the field cpu is the CPU number.
The runqueue is actually a container of individual scheduler-specific run-
queues. It contains three fields that point to runqueues of different schedulers:
cfs, rt and dl. They correspond to the runqueues for the CFS, real-time and
deadline schedulers, respectively. We assume that in any system, at the min-
imum we will have three kinds of tasks: regular (handled by CFS), real-time
tasks and tasks that have a deadline associated with them. These scheduler
types are hardwired into the logic of the runqueue.
It holds pointers to the current task (curr), the idle task (idle) and the mm struct of the previously running task (prev mm). The task that is chosen to execute is stored in the field core pick.
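The fields discussed so far can be summarized in the following abridged view. The typedefs and dummy sub-structures are placeholders so that the sketch is self-contained; the real struct rq in kernel/sched/sched.h is far larger.

typedef unsigned long long u64;
typedef int raw_spinlock_t;         /* placeholder for the kernel's lock type */
struct task_struct;                 /* opaque here */
struct mm_struct;                   /* opaque here */
struct cfs_rq { int dummy; };       /* per-scheduler sub-runqueues (abridged) */
struct rt_rq  { int dummy; };
struct dl_rq  { int dummy; };

struct rq_sketch {
    raw_spinlock_t      lock;        /* monitor lock for the whole runqueue */

    unsigned int        nr_running;  /* number of runnable tasks */
    u64                 nr_switches; /* number of context switches */
    int                 cpu;         /* CPU number */

    struct cfs_rq       cfs;         /* CFS runqueue */
    struct rt_rq        rt;          /* real-time runqueue */
    struct dl_rq        dl;          /* deadline runqueue */

    struct task_struct *curr;        /* currently running task */
    struct task_struct *idle;        /* per-CPU idle task */
    struct mm_struct   *prev_mm;     /* mm of the previously running task */
    struct task_struct *core_pick;   /* task chosen to execute */
};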
Scheduling-related Statistics
/* Preferred CPU */
struct rb_node core_node;

/* statistics */
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 nr_migrations;

struct sched_avg avg;

/* runqueue */
struct cfs_rq *cfs_rq;
};
Notion of vruntimes
Equation 5.4 shows the relation between the actual runtime and the vruntime: the increment in the vruntime, δvruntime, is the actual runtime scaled by a weight-dependent factor. Let δ be equal to the time interval between the current time and the time at which the current task started executing. If δvruntime is equal to δ, then it means that we are not using a scaling factor. The scaling factor is equal to the weight associated with a nice value of 0 divided by the weight associated with the real nice value. We clearly expect the ratio to be less than 1 for high-priority tasks and greater than 1 for low-priority tasks.
$$\delta_{vruntime} = \delta \times \frac{weight(nice = 0)}{weight(nice)} \qquad (5.4)$$
Listing 5.37 shows the mapping between nice values and weights. The weight is 1024 for the nice value 0, which is the default. For every increase in the nice value by 1, the weight reduces by a factor of 1.25. For example, if the nice value is 5, the weight is 335 and δvruntime = 3.05δ. Clearly, we have an exponential decrease in the weight as we increase the nice value. For a nice value of n, the weight is roughly 1024 × (1.25)−n. The highest priority user task (nice value −20) has a weight equal to 88761 (86.7× the default). This means that it gets significantly more runtime as compared to a task that has the default priority.
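The scaling can be illustrated with a tiny C fragment. The three weights used in the comments are the ones quoted above (1024 for nice 0, 335 for nice 5 and 88761 for nice −20); the function is a simplified stand-in for the kernel's fixed-point calculation.

#include <stdint.h>

/* delta_exec: actual runtime (e.g., in nanoseconds); weight: task weight.
 * Returns the vruntime increment: delta_exec * weight(nice = 0) / weight. */
uint64_t vruntime_delta(uint64_t delta_exec, uint64_t weight)
{
    const uint64_t nice0_weight = 1024;
    return delta_exec * nice0_weight / weight;
}

/* Examples:
 *   vruntime_delta(1000000, 1024)  == 1000000   (nice 0: no scaling)
 *   vruntime_delta(1000000, 335)   ~= 3056716   (nice 5: ~3.05x faster growth)
 *   vruntime_delta(1000000, 88761) ~= 11536     (nice -20: ~86.7x slower growth)
 */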
Let us use the three mnemonics SP , N and G for the sake of readability.
Please refer to the code snippet shown in Listing 5.38. If the number of runnable tasks is more than N (the limit on the number of runnable tasks that can be considered in a scheduling period (SP)), then it means that the system is swamped with tasks. We clearly have more tasks than what we can run. This is a crisis situation, and also a rather unlikely one. The only option in this case is to increase the scheduling period by multiplying nr running with G (the minimum task execution time).
Let us consider the else part, which is the more likely case. In this case, we
set the scheduling period as SP .
Listing 5.38: Implementation of scheduling quanta in CFS
source : kernel/sched/fair.c
u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))
                return nr_running * sysctl_sched_min_granularity;
        else
                return sysctl_sched_latency;
}
Once the scheduling period has been set, we set the scheduling slice for
each task as shown in Equation 5.5 (assuming we have the normal case where
nr running ≤ N).
$$slice_i = SP \times \frac{weight(task_i)}{\sum_j weight(task_j)} \qquad (5.5)$$
We basically partition the scheduling period based on the weights of the con-
stituent tasks. Clearly, high-priority tasks get larger scheduling slices. However,
if we have the unlikely case where nr running > N, then each slice is equal to
G.
The scheduling algorithm works as follows. We find the task with the least
vruntime in the red-black tree. We allow it to run until it exhausts its scheduling
slice. This logic is shown in Listing 5.39. Here, if the CFS queue has more than one runnable task, we compute the time for which the task has already executed (ran). If
slice > ran, then we execute the task for slice − ran time units by setting
the timer accordingly, otherwise we reschedule the current task.
Listing 5.39: hrtick start fair
source : kernel/sched/fair.c
if (rq->cfs.h_nr_running > 1) {
        u64 slice = sched_slice(cfs_rq, se);
        u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
        s64 delta = slice - ran;

        if (delta < 0) {
                if (task_current(rq, p))
                        resched_curr(rq);
                return;
        }
        hrtick_start(rq, delta);
}
Clearly, once a task has exhausted its slice, its vruntime has increased and
its position needs to be adjusted in the RB tree. In any case, every time we
need to schedule a task, we find the task with the least vruntime in the RB tree
and check if it has exhausted its time slice or not. If it has, then we mark it as
a candidate for rescheduling (if there is spare time left in the current scheduling
period) and move to the next task in the RB tree with the second-smallest
vruntime. If that also has exhausted its scheduling slice or is not ready for some
reason, then we move to the third-smallest, and so on.
Once all tasks are done, we try to execute tasks that are rescheduled, and
then start the next scheduling period.
A few special cases remain: newly created tasks, tasks waking up after a long sleep and tasks getting migrated. They will start with a zero vruntime and shall continue to have the minimum vruntime for a long time. This has to be prevented – it is unfair to existing tasks. Also, when tasks move from a heavily-loaded CPU to a lightly-loaded CPU, they should not have an unfair advantage there. The following safeguards are in place.
1. If an old task is being restored or a new task is being added, then set se->vruntime += cfs_rq->min_vruntime. This maintains some degree of a level playing field and ensures that other existing tasks have a fair chance of getting scheduled.
2. Always ensure that all vruntimes monotonically increase (in the cfs rq and sched entity structures).
$$loadavg = u_0 + u_1 \times y + u_2 \times y^2 + \ldots \qquad (5.6)$$
This is a time-series sum with a decay term y. The decay rate is quite slow: $y^{32} = 0.5$, or in other words $y = 2^{-1/32}$. This is known as per-entity load tracking (PELT, kernel/sched/pelt.c), where the number of intervals for which we compute the load average is a configurable parameter.
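The decayed sum of Equation 5.6 can be sketched as follows. The code uses floating point for clarity; the kernel's PELT code works with fixed-point arithmetic and precomputed decay tables, so this is only an illustration of the formula.

#include <math.h>

/* u[0] is the most recent utilization sample, u[1] the one before, and so on.
 * Computes loadavg = u0 + u1*y + u2*y^2 + ..., with y = 2^(-1/32). */
double pelt_load(const double *u, int n)
{
    const double y = pow(2.0, -1.0 / 32.0);
    double load = 0.0;
    double decay = 1.0;

    for (int i = 0; i < n; i++) {
        load += u[i] * decay;
        decay *= y;
    }
    return load;
}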
Real-time Scheduler
The real-time scheduler has one queue for every real-time priority. In addition,
we have a bit vector– one bit for each real-time priority. The scheduler finds
the highest-priority non-empty queue. It starts picking tasks from that queue.
If there is a single task then that task executes. The scheduling is clearly not
fair. There is no notion of fairness across real-time priorities.
However, for tasks having the same real-time priority, there are two op-
tions: FIFO and round-robin (RR). In the real-time FIFO option, we break ties
between two equal-priority tasks based on when they arrived (first-in first-out
order). In the round-robin (RR) algorithm, we check if a task has exceeded its
allocated time slice. If it has, we put it at the end of the queue (associated
with the real-time priority). We find the next task in this queue and mark it
for execution.
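A sketch of this lookup is shown below. The bitmap scan is written out explicitly for clarity; the kernel instead uses an optimized find-first-set-bit primitive, and all names here are illustrative.

#include <stdint.h>
#include <stddef.h>

#define MAX_RT_PRIO 100

struct rt_task {
    int id;
    struct rt_task *next;
};

/* queue[p] holds the runnable tasks of real-time priority p (0 = highest).
 * Bit p of the bitmap is set if and only if queue[p] is non-empty. */
static struct rt_task *queue[MAX_RT_PRIO];
static uint64_t bitmap[2];           /* 128 bits, enough for 100 priorities */

struct rt_task *rt_pick_next(void)
{
    for (int p = 0; p < MAX_RT_PRIO; p++) {
        if (bitmap[p / 64] & (1ULL << (p % 64))) {
            struct rt_task *t = queue[p];
            queue[p] = t->next;                      /* FIFO within a priority */
            if (queue[p] == NULL)
                bitmap[p / 64] &= ~(1ULL << (p % 64));
            return t;
        }
    }
    return NULL;                                     /* no runnable RT task */
}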
$$
\begin{aligned}
W_i(t) &= \sum_{j=1}^{i} d_j \left\lceil \frac{t}{P_j} \right\rceil \\
Q_i(t) &= \frac{W_i(t)}{t} \qquad (5.7)\\
Q_i &= \min_{0 < t \le P_i} Q_i(t) \\
Q &= \max_{1 \le i \le n} Q_i
\end{aligned}
$$
We consider a time interval t, and find the number of periods of task j that are contained within it. We then multiply this number of periods with the execution time d_j of task j. This is the total CPU load contributed by the j th task. If we aggregate the loads of the first i tasks in the system (arranged in descending order of RMS priority), we get the cumulative CPU load Wi(t). Let us next compute Qi(t) = Wi(t)/t. It is the mean load of the first i tasks over the time interval t.
Next, let us minimize this quantity over the time interval t for a given i. Let this quantity be Qi. If Qi ≤ 1, then the ith task is schedulable; otherwise, it is not. Let us define Q = max(Qi). If Q ≤ 1, then it means that all the tasks are schedulable. It turns out that this is both a necessary and sufficient condition.
For obvious reasons, it is not as elegant and easy to compute as the Liu-Layland
bound. Nevertheless, this is a more exact expression and is often used to assess
schedules.
The iteration continues only when Ii(t) + di > t. In that case, we set the new value of t to be equal to the sum of the interference Ii(t) and the execution time of the ith task (di). We basically set t ← Ii(t) + di.
Before proceeding to the next iteration, it is necessary to perform a sanity
check. We need to check if t exceeds the deadline Di or not. If it exceeds the
deadline, clearly the ith task is not schedulable. We return false. If the deadline
has not been exceeded, then we can proceed to the next iteration.
We perform the same set of steps in the next iteration. We compute the new
value of the interference using the new value of t. Next, we add the execution
time di of the ith task to it. Now, if the sum is less than or equal to the value
of t, we are done. We can declare the ith task to be schedulable, subject to the
fact that t ≤ Di . Otherwise, we increment t and proceed to the next iteration.
Given the fact that in every iteration we increase t, we will either find task i to
be schedulable or t will ultimately exceed Di .
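The iteration described above can be written down compactly as follows. The arrays d[], P[] and D[] (execution times, periods and deadlines) and the function name are illustrative; the interference term is written here using a standard ceiling-based expression, analogous to the cumulative load Wi(t) above.

#include <math.h>
#include <stdbool.h>

/* Response-time test for task i; tasks 0 .. i-1 have higher RMS priority.
 * d[j]: execution time, P[j]: period, D[i]: deadline of task i. */
bool is_schedulable(int i, const double *d, const double *P, const double *D)
{
    double t = d[i];                    /* initial guess */

    while (true) {
        /* Interference from all higher-priority tasks over [0, t) */
        double I = 0.0;
        for (int j = 0; j < i; j++)
            I += ceil(t / P[j]) * d[j];

        if (I + d[i] <= t)
            return true;                /* fixed point found: task i is schedulable */
        t = I + d[i];                   /* otherwise increase t and iterate */
        if (t > D[i])
            return false;               /* deadline exceeded: not schedulable */
    }
}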
Let us first consider a simple setting where there are two tasks in the system.
The low-priority task happens to lock a resource first. When the high-priority task tries to access the resource, it blocks. However, in this case the blocking time is
predictable – it is the time that the low-priority task will take to finish using the
resource. After that the high-priority task is guaranteed to run. This represents
the simple case and is an example of bounded priority inversion.
Let us next consider the more complicated case. If a high-priority task is
blocked by a low-priority task, a medium priority task can run in its place.
This ends up blocking the low-priority task, which is holding the resource. If
such medium-priority tasks continue to run, the low-priority task may remain
blocked for a very long time. Here, the biggest loser is the high-priority task
because the time for which it will remain inactive is not known and dependent
on the behavior of many other tasks. Hence, this scenario is known as unbounded
priority inversion.
Next, assume that a task needs access to k resources, which it needs to
acquire sequentially (one after the other). It may undergo priority inversion
(bounded or unbounded) while trying to acquire each of these k resources. The
total amount of time that the high-priority task spends in the blocked state
may be prohibitive. This is an example of chain blocking, which needs to be
prevented. Assume that these are nested locks – the task acquires resources
without releasing previously held resources.
To summarize, the main issues that arise out of priority inversion related
phenomena are unbounded priority inversion and chain blocking. Together with the known issue of deadlocks, this gives us three scenarios, and we need to design protocols such that all three are prevented by design.
Definition 5.5.2 Unbounded Priority Inversion and Chain Blocking
Let the priority of the resource-holding task Thld be phld and that of the resource-requesting task Treq be preq. If phld < preq, we temporarily raise the priority of Thld to preq. However, if phld ≥ preq, nothing needs to be done.
Note that this is a temporary action. Once the contended resource is released,
the priority of Thld reverts to phld . Now, it is possible that phld may not be
the original priority of Thld because this itself may be a boosted priority that
Thld may have inherited because it held some other resource. We will not be
concerned about that and just revert the priority to the value that existed just
before the resource was acquired, which is phld in this case.
Note that a task can inherit priorities from different tasks in the interval of
time in which it holds a resource. Every time a task is blocked because it cannot
access a resource, it tries to make the resource-holding task inherit its priority
if its priority is greater than the priority of the resource-holding task.
Let us explain with an example. Assume that the real-time priority of the
low-priority task Tlow is 5. The priority of a medium-priority task Tmed is 10,
and the priority of the high-priority task Thigh is 15. These are all real-time
priorities: the higher the number, the greater the priority. Now assume that Tlow is the
first to acquire the resource. Next, Tmed tries to acquire the resource. Due to
priority inheritance, the priority of Tlow now becomes 10. Next, Thigh tries to
acquire the resource. The priority of Tlow ends up getting boosted again. It is
now set to 15. After releasing the resource, the priority of Tlow reverts back to
5.
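The boosting and restoration steps of this example can be captured in a short sketch. The structures and the set_priority() helper are illustrative; a real implementation (such as Linux rt-mutexes) is considerably more involved.

#include <stddef.h>

struct rt_task {
    int prio;          /* current (possibly boosted) priority */
    int saved_prio;    /* priority to restore when the resource is released */
};

struct resource {
    struct rt_task *holder;   /* NULL if the resource is free */
};

/* Illustrative helper: in a real kernel this would talk to the scheduler. */
static void set_priority(struct rt_task *t, int prio) { t->prio = prio; }

/* Successful acquisition: remember the priority to restore later. */
static void pip_acquire(struct resource *r, struct rt_task *t)
{
    t->saved_prio = t->prio;
    r->holder = t;
}

/* Requester blocks: the holder inherits the requester's priority if higher. */
static void pip_block_on(struct resource *r, struct rt_task *req)
{
    struct rt_task *hld = r->holder;
    if (hld && hld->prio < req->prio)
        set_priority(hld, req->prio);
    /* ... put req to sleep until r is released ... */
}

/* Release: revert to the priority held just before acquiring the resource. */
static void pip_release(struct resource *r, struct rt_task *hld)
{
    set_priority(hld, hld->saved_prio);
    r->holder = NULL;
    /* ... wake up one waiting task ... */
}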
This is a very effective idea, and it is simple to implement. This is why many
versions of Linux support the PIP protocol. Sadly, the PIP protocol still suffers from deadlocks and chain blocking. Let us explain.
The HLP protocol associates a ceiling, ceil(resource), with every resource; it is defined as the priority of the highest-priority task that can possibly acquire the resource (some time in the future).
Priority Inversion
Once a task acquires a resource, we raise its priority to ceil(resource) + 1. This
may be perceived to be unfair. In this case, we are raising the priority to an
absolute maximum, which may be much more than the priority of the high-
priority tasks that have blocked on the resource. Essentially, this is priority
inheritance on steroids!!!
This basically means that the moment a task acquires a resource, its pri-
ority gets boosted to a high value. It cannot be blocked any more by any
other task whose priority is less than or equal to ceil(resource). Note that akin
to the PIP protocol, unbounded priority inversion is not possible because no
medium-priority task can block the resource-holding task. In fact, any priority
inheritance or priority boosting protocol will avoid unbounded priority inver-
sion because the priority of the resource-holder becomes at least as large as
that of the resource-requester. It thus cannot be blocked by any intermediate-priority
process.
Deadlocks
Next, let us consider deadlocks. Let us consider the same example that we
considered in the case of the priority inheritance protocol. In this case Tlow
acquires R1 first. It is not possible for Thigh to run. This is because the priority
of Tlow will become at least phigh + 1. It is thus not possible for the scheduler to
choose Thigh . In fact, it is possible to easily prove that the moment a resource
is acquired, no other task that can possibly acquire the resource can run – all
their priorities are less than ceil(resource) + 1. Hence, resource contention is
not possible because the contending task can never execute after the resource
acquisition. If there is no contention, a deadlock is not possible.
For every resource, we can define a set of tasks that may possibly acquire
it. Let us refer to it as the resource’s request set. The moment one of them
acquires the resource, the priority of that task gets set to ceil(resource) + 1.
This means that no other task in the request set can start or resume execution
after the resource R has been acquired. This ensures that henceforth there will
be no contention because of R. No contention ⇒ No deadlock.
Chain Blocking
The key question that we need to answer is after task T has acquired a resource
R, can it get blocked when it tries to acquire more resources? Assume that T
has acquired R, and then it tries to acquire R′. If R′ is free, then there is no chain blocking. Now, assume that R′ is already acquired by task T′ – this leads to T getting blocked. Let this be the first instance of such a situation, where a task gets blocked while trying to acquire a resource after already acquiring another resource. The following relationships hold.
Let pri denote the instantaneous priority function. It is clear that T′ is not blocked (given our assumption). Now, given that T is running, its priority must be more than that of T′. Next, from the definition of the HLP protocol, we have pri(T′) > ceil(R′). Note that this relationship holds because the resource R′ has already been acquired by T′. The priority of T′ can be further boosted because T′ may have acquired other resources with higher ceilings. In any case, after resource acquisition, pri(T′) > ceil(R′) holds.
If we combine these relations, we have pri(T) > ceil(R′). Note that at this point of time, T has not acquired R′ yet. Given that the ceiling is defined as the maximum priority of any interested task, we must have pri(T) ≤ ceil(R′) (before R′ has been acquired by T). We thus have a contradiction.
Hence, we can conclude that it never shall be the case that a task that has
already acquired one resource is waiting to acquire another. There will thus be
no chain blocking. Let us now quickly prove a lemma about chain blocking and
deadlocks.
Lemma 1
If there is no chain blocking, there can be no deadlocks.
Proof: Deadlocks happen because a task holds on to one resource, and tries
to acquire another (hold and wait condition). Now, if this process is guaranteed
to happen without blocking (no chain blocking), then a hold-and-wait situation
will never happen. Given that hold-and-wait is one of the necessary conditions
for a deadlock, there will be no deadlocks.
Inheritance Blocking
This protocol, however, does create an additional issue, namely inheritance block-
ing. Assume that the priority of task T is 5 and the resource ceiling is 25. In
this case, once T acquires the resource, its priority becomes 26. This is very
high because 25 is a hypothetical maximum that may get realized very rarely.
Because of this action, all the tasks with priorities between 6 and 25 cannot run. They basically get blocked because T inherited the priority
26. The sad part is that there may be no other process that is interested in
acquiring the resource regardless of its priority. We still end up blocking a lot
of other processes.
Point 5.5.3
Inheritance blocking is the major issue in the HLP protocol. It does not
suffer from chain blocking, deadlocks or unbounded priority inversion.
The system ceiling or CSC is defined as the maximum of all the ceilings of
all the resources that are currently acquired by some task. PCP has a “resource
grant clause” and “inheritance clause”. The latter changes the priority of the
task.
Inheritance Clause The task holding a resource inherits the priority of the
blocked task, if its priority is lower.
Let us understand the resource grant clause in some further detail. Let
us call a resource that has set the CSC a critical resource. If a task T owns a
critical resource, then the resource grant clause allows it to acquire an additional
resource. There are two cases that arise after the resource has been acquired.
Either the existing critical resource continues to remain critical or the new
resource that is going to be acquired becomes critical. In both cases, T continues
to own the critical resource.
In the other subclause, a task can acquire a resource if it has a priority
greater than the CSC. It will clearly have the highest priority in the system. It
is also obvious that it hasn’t acquired any resource yet. Otherwise, its priority
would not have been greater than the CSC. Let us state a few lemmas without
proof. It is easy to prove them using the definition of CSC.
Lemma 2
The CSC is greater than or equal to the priority of any task that currently
holds a resource in the system.
Lemma 3
The moment a task whose priority is greater than the current CSC ac-
quires a resource, it sets the CSC and that resource becomes critical.
Lemma 4
After a task acquires a resource, no other task with a priority less than
or equal to the CSC at that point of time can acquire any resource until
the CSC is set to a lower value.
Proof: A resource is acquired when the task priority is either more than the
CSC or the CSC has already been set by a resource that the task has acquired
in the past. In either case, we can be sure that the task has acquired a critical
resource (see Lemma 3). Subsequently, no new task with a priority less than
CSC can acquire a resource. It will not pass any of the subclauses of the re-
source grant clause – it will not have any resource that has set the CSC and its
priority is also less than the CSC.
Next, note that in the PCP protocol we do not elevate the priority to very high levels, as we did in the HLP protocol. The priority inheritance mechanism is the same as in the PIP protocol. Hence, we can conclude that inheritance blocking is far more controlled.
Point 5.5.4
The PCP protocol does not suffer from deadlocks, chain blocking or unbounded priority inversion. The problem of inheritance blocking is
also significantly controlled.
Exercises
Ex. 1 — What are the four necessary conditions for a deadlock? Briefly ex-
plain each condition.
Ex. 2 — Assume a system with many short jobs with deterministic execution
times. Which scheduler should be used?
Do not use any locks (in any form). In your algorithm, there can be starvation;
however, no deadlocks. Provide the code for the push and pop methods. They
need to execute atomically. Note that in any real system there can be arbitrary
delays between consecutive instructions.
Ex. 4 — Explain why spinlocks are not appropriate for single-processor sys-
tems yet are often used in multiprocessor systems.
Ex. 6 — Solve the Dining Philosopher’s Problem using only semaphores. Use
three states for each philosopher: THINKING, HUNGRY and EATING.
Ex. 8 — Consider the kernel mutex. It has an owner field and a waiting queue.
A process is added to the waiting queue only if the owner field is populated
(mutex is busy). Otherwise, it can become the owner and grab the mutex.
However, it is possible that the process saw that the owner field is populated,
added itself to the waiting queue but by that time the owner field became empty
– the previous mutex owner left without informing the current process. There
is thus no process to wake it up now, and it may wait forever. Assume that
there is no dedicated thread to wake processes up. The current owner wakes up
one waiting process when it releases the mutex (if there is one).
Sadly, because of such race conditions, processes may wait forever. Design a
kernel-based mutex that does not have this problem. Consider all race condi-
tions. Assume that there can be indefinite delays between instructions. Try
to use atomic instructions and avoid large global locks. Assume that task ids
require 40 bits.
Ex. 9 — The Linux kernel has a policy that a process cannot hold a spinlock
while attempting to acquire a semaphore. Explain why this policy is in place.
Ex. 11 — Explain the spin lock mechanism in the Linux kernel (based on
ticket locks). In the case of a multithreaded program, how does the spin lock
mechanism create an order for acquiring the lock? Do we avoid starvation?
Ex. 13 — Why are memory barriers present in the code of the lock and
unlock functions?
Ex. 15 — What is the lost wakeup problem? Explain from a theoretical per-
spective with examples.
Ex. 16 — Does the Banker’s algorithm prevent starvation? Justify your an-
swer.
Ex. 21 — Show the pseudocode for registering and deregistering readers, and
the synchronize rcu function.
Ex. 23 — Why is it advisable to use RCU macros like rcu assign pointer and rcu dereference check? Why can we not read or write to the memory locations directly using simple assignment statements?
/* ... */
struct foo *p = kmalloc(sizeof(struct foo), GFP_KERNEL); /* kernel malloc */
p->a = 1;
p->b = 2;
p->c = 3;
Ex. 25 — Correct the following piece of code in the context of the RCU mech-
anism.
p = gp;
if (p != NULL) {
Ex. 27 — How can we modify the CFS scheduling policy to fairly allocate
processing time among all users instead of processes? Assume that we have a
single CPU and all the users have the same priority (they have an equal right
to the CPU regardless of the processes that they spawn). Each user may spawn
multiple processes, where each process will have its individual CFS priority
between 100 and 139. Do not consider the real-time or deadline scheduling
policies.
Ex. 28 — How does the Linux kernel respond if the current task has exceeded
its allotted time slice?
Ex. 29 — The process priorities vary exponentially with the nice values. Why
is this the case? Explain in the context of a mix of compute and I/O-bound
jobs where the nice values change over time.
** Ex. 33 — Prove that any algorithm that uses list scheduling will have a
competitive ratio (Clist /C ∗ ), which is less than or equal to (2 − 1/m). There
are m processors, Clist is the makespan produced by list scheduling and C ∗ is the optimal makespan.
Ex. 34 — For a system with periodic and preemptive jobs, what is the uti-
lization bound (maximum value of U till which the system remains schedulable)
for EDF?
Ex. 35 — Prove that in the PCP algorithm, once the first resource is acquired,
there can be no more priority inversions (provide a very short proof).
Chapter 6
The Memory System
In such memory management schemes, the physical memory is divided into contiguously allocated regions (see Figure 6.1). There are holes between the allocated regions. If a
new process is created, then its memory needs to be allocated within one of the
holes. Let us say that a process requires 100 KB and the size of the chosen hole is 150 KB; then we are leaving 50 KB free. We basically create a new hole that is 50
KB long. This phenomenon of having holes between regions and not using that
space is known as external fragmentation. On the other hand, leaving space
empty within a page in a regular virtual memory system is known as internal
fragmentation.
Definition 6.1.1 Fragmentation
Figure 6.1: Memory divided into allocated regions and holes; each allocated region is described by a base address and a limit
The next question that we need to answer is the following: if we are starting a new process and we know exactly the maximum amount of memory that it requires, then which hole do we select for allocating its memory? Clearly, the size of the hole needs to be at least the amount of requested memory. However, there could be multiple such holes, and we need to choose one of them. Our choice really matters because it determines the efficiency of the entire process.
It is very well possible that later on we may not be able to satisfy requests
primarily because we will not have holes of adequate size left. Hence, designing a proper heuristic that anticipates future requests is important. There are several such heuristics. Let us say that we need R
bytes.
Best Fit Choose the smallest hole that is at least as large as R.
Next Fit Start searching from the last allocation that was made and move
towards higher addresses (with wraparounds).
their performance quite suboptimal. It is also possible to prove that they are
optimal in some cases assuming some simple distribution of memory request
sizes in the future. In general, we do not know how much memory a process is
going to access. Hence, declaring the amount of memory that a process requires
upfront is quite difficult. Neither the compiler nor the user has this information. In today's complex programs, the amount of memory that is going to
be used is a very complicated function of the input, and it is thus not possible
to predict it beforehand. As a result, these schemes are seldom used as of
today. They are nevertheless still relevant for very small embedded devices that
cannot afford virtual memory. However, by and large, the base-limit scheme is
consigned to the museum of virtual memory schemes.
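For completeness, a best-fit search over a list of holes can be sketched as follows; the hole structure and the list layout are illustrative.

#include <stddef.h>

struct hole {
    size_t base;          /* starting address of the hole */
    size_t size;          /* size of the hole in bytes */
    struct hole *next;
};

/* Best fit: return the smallest hole whose size is at least R bytes,
 * or NULL if no hole is large enough. */
struct hole *best_fit(struct hole *list, size_t R)
{
    struct hole *best = NULL;

    for (struct hole *h = list; h != NULL; h = h->next) {
        if (h->size >= R && (best == NULL || h->size < best->size))
            best = h;
    }
    return best;
}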
The stack distance typically has a distribution that is similar to the one
shown in Figure 6.3. Note that we have deliberately not shown the units of the
x and y axes because the aim was to just show the shape of the figure and not
focus on specific values. We observe a classic heavy-tailed distribution where
small values are relatively infrequent. Then there is a peak followed by a very
heavy tail. The tail basically refers to the fact that we have non-trivially large
probabilities when we consider rather high values of the stack distance.
This curve can be interpreted as follows. Low values of the stack distance
are relatively rare. This is because we typically tend to access multiple streams
of data simultaneously. We are definitely accessing data as well as instructions.
This makes it two streams, but we could be accessing other streams as well. For
instance, we could be accessing multiple arrays or multiple structures stored
in memory in the same window of time. This is why consecutive accesses to
the same page, or the same region, are somewhat infrequent. Hence, extremely
low values of the stack distance are rarely seen. However, given that most
programs have a substantial amount of temporal locality, we see a peak in the
stack distance curve in the low to low-medium range of values – they are very
frequent. Almost all computer systems take advantage of such a pattern because
the stack distance curve roughly looks similar for cache accesses, page accesses,
hard disk regions, etc.
The heavy tail arises because programs tend to make a lot of random ac-
cesses, tend to change phases and also tend to access a lot of infrequently used
data. As a result, large stack distances are often seen. This explains the heavy
tail in the representative plot shown in Figure 6.3. There are a lot of distributions that have heavy tails. Most of the time, researchers model this curve using the log-normal distribution. This is because it has a heavy tail and is also easy to analyze mathematically.
Let us understand the significance of the stack distance. It is a measure of
temporal locality. The lower the average stack distance, the higher the temporal locality. It basically means that we keep accessing the same pages over and over again in the same window of time. Similarly, the higher the stack distance, the lower the temporal locality. This means that we tend to re-access the same page after a long period of time. Such patterns are unlikely to benefit from standard
architectural optimizations like caching. As discussed earlier, the log-normal
distribution is typically used to model the stack distance curve because it captures the fact that very low stack distances are rare, then there is a strong peak, and finally there is a heavy tail. This is easy to interpret and also easy to use as a theoretical tool. Furthermore, we can use it to perform some straightforward mathematical analyses as well as realize practical algorithms that rely on some form of caching or some other mechanism to leverage temporal locality.

Figure 6.3: Representative plot of the stack distance distribution (x-axis: stack distance, y-axis: probability)
Stack-based Algorithms
WS-Clock Algorithm
Let us now implement an approximation of the LRU policy. A simple such algorithm is the WS-Clock page replacement algorithm,
which is shown in Figure 6.4. Here WS stands for “working set”, which we shall
discuss later in Section 6.1.3.
Every physical page in memory is associated with an access bit. This bit is set to either 0 or 1 and is stored along with the corresponding page table entry. A
pointer like the minute hand of a clock points to a physical page; it is meant
to move through all the physical pages one after the other (in the list of pages)
until it wraps around.
If the access bit of the page pointed to by the pointer is equal to 1, then it is set to 0 when the pointer traverses it. There is no need to periodically scan all the pages and set their access bits to 0 – that would take a lot of time. Instead, in this algorithm, once there is a need for replacement, we check the access bit and if it is set to 1, we reset it to 0. However, if the access bit is equal to 0, then we select that page for replacement. For the time being, the process stops at that point. The next time, the pointer starts from the same point and keeps traversing the list of pages, wrapping around at the end.
This algorithm can approximately find the pages that are not recently used
and select one of them for eviction. It turns out that we can do better if we
differentiate between unmodified and modified pages in systems where the swap
space is inclusive – every page in memory has a copy in the swap space, which
could possibly be stale. The swap space in this case acts as a lower-level cache.
If both the bits are equal to 0, then they remain so, and we go ahead and select that page as a candidate for replacement. On the other hand, if they are equal to ⟨0, 1⟩, which means that the page has been modified and after that its access bit has been set to 0, then we perform a write-back and move forward. The final state in this case is set to ⟨0, 0⟩ because the data is not deemed to be modified anymore once it has been written back to the swap space. Note that every modified page in this case has to be written back to the swap space, whereas unmodified pages can be seamlessly evicted given that the swap space has a copy. As a result, we prioritize unmodified pages for eviction.
Next, let us consider the combination ⟨1, 0⟩. Here, the access bit is 1, so we set it to 0. The resulting combination of bits is now ⟨0, 0⟩; we move forward. We are basically giving the page a second chance in this case as well because it was accessed in the recent past.
Finally, if the combination of these 2 bits is ⟨1, 1⟩, then we perform the write-back, and reset the new state to ⟨1, 0⟩. This means that this is clearly a frequently used frame that gets written to, and thus it should not be evicted or downgraded (access bit set to 0).
This is per se a simple algorithm, which takes the differing overheads of
reads and writes into account. For writes, it gives a page a second chance in a
certain sense.
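The basic single-bit version of the algorithm can be sketched as shown below. The frame array, the bit fields and the circular pointer are illustrative; a real kernel reads the access and dirty bits from the page table entries.

#include <stdbool.h>

#define NUM_FRAMES 1024

struct frame {
    bool access;     /* set by hardware whenever the page is accessed */
    bool dirty;      /* set on writes; used by the two-bit variant above */
    int  page;       /* page currently mapped to this frame */
};

static struct frame frames[NUM_FRAMES];
static int hand;     /* the "clock hand": index of the next frame to inspect */

/* Returns the index of the frame selected for eviction. */
int clock_select_victim(void)
{
    while (true) {
        struct frame *f = &frames[hand];
        if (!f->access)
            return hand;                 /* not recently used: evict this frame
                                            (the next scan resumes from here) */
        f->access = false;               /* give the page a second chance */
        hand = (hand + 1) % NUM_FRAMES;  /* move the hand forward */
    }
}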
We need to understand that such LRU-approximating algorithms are quite heavyweight. They introduce artificial page access faults. Of course, these are not as onerous as full-blown page faults because they do not fetch data from the underlying storage device, which takes millions of cycles. Here, we only need to perform some bookkeeping and change the page access permissions. This is much faster than fetching the entire page from the hard disk or NVM drive. Such a fault is also known as a soft page fault. Soft page faults, however, still lead to an exception and require time to service. There is some degree of complexity involved in this mechanism. But at least we are able to approximate LRU to some extent.
FIFO Algorithm
The queue-based FIFO (first-in first-out) algorithm is one of the most popular
algorithms in this space, and it is quite easy to implement because it does not require any last-usage tracking or access bit tracking. All that we need is a simple queue in memory that orders the physical pages by the time at which they were brought into memory. The page that was brought in the earliest
is the replacement candidate. There is no run time overhead in maintaining or
updating this information. We do not spend any time in setting and resetting
access bits or in servicing page access faults. Note that this algorithm is not
stack based, and it does not follow the stack property. This is not a good thing
as we shall see shortly.
Even though this algorithm is simple, it suffers from a very interesting
anomaly known as Belady's anomaly [Belady et al., 1969]. Let us understand it better by looking at the two examples shown in Figures 6.5 and 6.6. In Figure 6.5, we show an access sequence of physical page ids (shown in square boxes). The memory can fit only four frames. If there is a page fault, we mark the entry with a cross; otherwise, we mark the box corresponding to the access
with a tick. The numbers at the bottom represent the contents of the FIFO
queue after considering the current access. After each access, the FIFO queue
is updated.
If the memory is full, then one of the physical pages (frames) in memory
needs to be removed. It is the page that is at the head of the FIFO queue –
the earliest page that was brought into memory. The reader should take some
time and understand how this algorithm works and mentally simulate it. She
needs to understand and appreciate how the FIFO information is maintained
and why this algorithm is not stack based.
Figure 6.5: FIFO page replacement with 4 frames for the access sequence 1 2 3 4 1 2 5 1 2 3 4 5

Figure 6.6: FIFO page replacement with 3 frames for the same access sequence
In this particular example shown in Figure 6.5, we see that we have a total
of 10 page faults. Surprisingly, if we reduce the number of physical frames in
memory to 3 (see Figure 6.6), we have a very counter-intuitive result. We would
ideally expect the number of page faults to increase because the memory size is
smaller. However, we observe an anomalous result. We have 9 page faults (one
page fault less than the larger memory with 4 frames) !!!
The reader needs to go through this example in great detail. She needs to
understand the reasons behind this anomaly. These anomalies are only seen in
algorithms that are not stack-based. Recall that in a stack-based algorithm,
we have the stack property – at all points of time the set of pages in a larger
memory are a superset of the pages that we would have in a smaller memory.
Hence, we cannot observe such an anomaly. Now, we may be tempted to believe
that this anomaly is actually limited to small discrepancies. This means that if
we reduce the size of the memory, maybe the size of the anomaly is quite small
(limited to a very few pages).
However, this presumption is sadly not true. It was shown in a classic paper
by Fornai et al. [Fornai and Iványi, 2010a, Fornai and Iványi, 2010b] that a
sequence always exists that can make the discrepancy arbitrarily large. In other
words, it is unbounded. This is why Belady's anomaly renders many of these non-stack-based algorithms ineffective. They perform very badly in the worst case. One may argue that such “bad” cases are pathological and rare. But in reality, such bad cases do occur to a limited extent. This significantly
reduces the performance of the system because page faults are associated with
massive overheads.
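The anomaly can be reproduced with a few lines of C. The program below simulates FIFO replacement on the access sequence used in Figures 6.5 and 6.6 and prints the number of page faults for 4 and 3 frames (10 and 9, respectively).

#include <stdio.h>

/* Simulate FIFO page replacement and return the number of page faults. */
int fifo_faults(const int *seq, int n, int nframes)
{
    int frames[8];            /* current frame contents */
    int head = 0, used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == seq[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (used < nframes) {
            frames[used++] = seq[i];          /* free frame available */
        } else {
            frames[head] = seq[i];            /* evict the oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int seq[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
    int n = sizeof(seq) / sizeof(seq[0]);

    printf("4 frames: %d faults\n", fifo_faults(seq, n, 4));  /* prints 10 */
    printf("3 frames: %d faults\n", fifo_faults(seq, n, 3));  /* prints 9  */
    return 0;
}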
Figure 6.7: Page fault rate versus the working set size
Thrashing
Consider a system with a lot of processes. If the space that is allocated to a
process is less than the size of its working set, then the process will suffer from
a high page fault rate. Most of its time will be spent in fetching its working set
and servicing page faults. The CPU performance counters will indicate that there is a low CPU utilization. The CPU utilization will be low primarily because most of the time is going in I/O: servicing page faults. Consequently, the kernel's load calculator will also observe that the CPU load is low. Recall that we had computed the CPU load in Section 5.4.6 (Equation 5.6) using a similar logic.
Given that the load average is below a certain threshold, the kernel will try
to spawn more processes to increase the average CPU utilization. This will
actually exacerbate the problem and make it even worse. Now the memory that
is available to a given process will further reduce.
Alternatively, the kernel may try to migrate processes to the current CPU
that is showing reduced activity. Of course, here we are assuming a non-uniform
memory access machine (NUMA machine), where a part of the physical memory
is “close” to the given CPU. This proximate memory will now be shared between
many more processes.
In both cases, we are increasing the pressure on memory. Processes will spend most of their time in fetching their working sets into memory – the system will thus become quite slow and unresponsive. This can continue and become a vicious cycle. In the extreme case, it will lead to a system crash because key kernel threads will not be able to finish their work on time.
This phenomenon is known as thrashing. Almost all modern operating sys-
tems have a lot of counters and methods to detect thrashing. The only practical
Kernel modules have traditionally enjoyed more or less unfettered access to the kernel's data structures; however, of late this is changing.
Modules are typically used to implement device drivers, file systems, and
cryptographic protocols/mechanisms. They help keep the core kernel code
small, modular and clean. Of course, security is a big concern while load-
ing kernel modules and thus module-specific safeguards are increasingly getting
more sophisticated – they ensure that modules have limited access to only the
functionalities that they need. With novel module signing methods, we can
ensure that only trusted modules are loaded. 1520 MB is a representative figure
for the size reserved for storing module-related code and data in kernel v6.2.
Note that this is not a standardized number; it can vary across Linux versions
and is also configurable.
struct mm_struct {
    ...
    /* Pointer to the page table. The CR3 register is set to this value. Type: u64 */
    pgd_t *pgd;
    ...
};
Figure 6.9: The high-level organization of the page table (57-bit address)
Figure 6.9 shows the mm struct structure that we have seen before. It specif-
ically highlights a single field, which stores the page table (pgd t *pgd). The
page table is also known as the page directory in Linux. There are two virtual
memory address sizes that are commonly supported: 48 bits and 57 bits. We
have chosen to describe the 57-bit address in Figure 6.9. We observe that there
are five levels in a page table. The highest level of the page table is known as the
page directory (PGD). Its starting address is stored in the CR3 control
register. CR3 stores the starting address of the page table (highest
level) and is specific to a given process. This means that when the process
changes, the contents of the CR3 register also need to change. It needs to point
to the page table of the new process. There is a need to also flush the TLB. This
is very expensive. Hence, various kinds of optimizations have been proposed.
We shall quickly see that the contents of the CR3 register do not change
when we make a process-to-kernel transition or in some cases in a kernel-to-
process transition as well. Here the term process refers to a user process. The
main reason for this is that changing the virtual memory context is associated
with a lot of performance overheads and thus there is a need to minimize such
events as much as possible.
The page directory is indexed using the top 9 bits of the virtual address
(bits 49-57). Then we have four more levels. For each level, the next 9 bits
(towards the LSB) are used to address the corresponding table. The reason
that we have a five-level page table here is that we have 57 virtual address
bits, and thus more page table levels are needed. Our aim is to
reduce the memory footprint of page tables as much as possible and properly
leverage the sparsity in the virtual address space. The details of all of these
tables are shown in Table 6.2. We observe that the last level entry is the page
table entry, which contains the mapping between the virtual page number and
the page frame number (or the number of the physical page) along with some
page protection information.
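To make the indexing concrete, the following user-space sketch (illustrative; the shift amounts mirror x86-64 five-level paging with 4 KB pages, and the variable names are our own) extracts the five 9-bit indices and the 12-bit page offset from a 57-bit virtual address.

#include <stdio.h>

#define PAGE_SHIFT 12
#define IDX_BITS    9
#define IDX_MASK  ((1UL << IDX_BITS) - 1)

int main(void) {
    unsigned long vaddr = 0x00ff123456789abcUL;  /* an arbitrary 57-bit address */

    unsigned long offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
    unsigned long pte_i  = (vaddr >> PAGE_SHIFT)        & IDX_MASK;  /* bits 13-21 */
    unsigned long pmd_i  = (vaddr >> (PAGE_SHIFT + 9))  & IDX_MASK;  /* bits 22-30 */
    unsigned long pud_i  = (vaddr >> (PAGE_SHIFT + 18)) & IDX_MASK;  /* bits 31-39 */
    unsigned long p4d_i  = (vaddr >> (PAGE_SHIFT + 27)) & IDX_MASK;  /* bits 40-48 */
    unsigned long pgd_i  = (vaddr >> (PAGE_SHIFT + 36)) & IDX_MASK;  /* bits 49-57 */

    printf("pgd=%lu p4d=%lu pud=%lu pmd=%lu pte=%lu offset=%lu\n",
           pgd_i, p4d_i, pud_i, pmd_i, pte_i, offset);
    return 0;
}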
Listing 6.1: The follow pte function (assume the entry exists)
source : mm/memory.c
int follow_pte(struct mm_struct *mm, unsigned long address,
               pte_t **ptepp, spinlock_t **ptlp)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep;

    /* Walk the five levels (error checks omitted; we assume the entry exists) */
    pgd = pgd_offset(mm, address);
    p4d = p4d_offset(pgd, address);
    pud = pud_offset(p4d, address);
    pmd = pmd_offset(pud, address);

    /* Map the PTE and acquire the page table spinlock via *ptlp */
    ptep = pte_offset_map_lock(mm, pmd, address, ptlp);

    *ptepp = ptep;
    return 0;
}
Listing 6.1 shows the code for traversing the page table (follow pte func-
tion) assuming that an entry exists. We first walk the top-level page directory,
and find a pointer to the next level table. Next, we traverse this table, find a
pointer to the next level, so on and so forth. Finally, we find the pointer to the
page table entry. However, in this case, we also pass a pointer to a spinlock.
It is locked prior to returning a pointer to the page table entry. This allows us
to make changes to the page table entry. It needs to be subsequently unlocked
after it has been used/modified by another function.
Let us now look slightly deeper into the code that looks up a table in the
5-level page table. A representative example for traversing the PUD table is
shown in Listing 6.2. Recall that the PUD table contains entries that point to
PMD tables. Let us thus traverse the PUD table. We find the index of the
PMD entry using the function pmd_index and add it to the base address of the
PMD table (obtained from the PUD entry). This gives us a pointer to the PMD
entry. Recall that each entry of the PUD table contains a pointer to a PMD
table (pmd_t *). Let us elaborate.
Listing 6.2: Accessing the page table at the PMD level
/* include/linux/pgtable.h */
pmd_t *pmd_offset(pud_t *pud, unsigned long address) {
    return pud_pgtable(*pud) + pmd_index(address);
}
First consider the pmd_index inline function that takes the virtual address
as input. We next need to extract bits 22-30. This is achieved by shifting the
address to the right by 21 positions (PMD_SHIFT) and then extracting the bottom
9 bits (using a bitwise AND operation). The function returns the entry number in
the PMD table. This is multiplied with the size of a PMD entry and then added to
the base address of the PMD table, which is obtained using the pud_pgtable
function. Note that the multiplication is implicit primarily because the return
type of the pud_pgtable function is pmd_t *.
Let us now look at the pud_pgtable function. It relies on the __va inline
function that takes a physical address as input and returns the corresponding
virtual address. The reverse is done by the __pa inline function (or macro). In
__va(x), we simply add the argument x to an address called PAGE_OFFSET. This is
not the offset within a page, as the name may suggest. It is the starting address
of the region of kernel memory where physical memory (including the page table
entries) is direct-mapped. The PAGE_OFFSET variable points to the starting point
of this region or some point within this region (depending upon the
architecture). Note the linear conversion between a physical and a virtual
address.
The inline pud_pgtable function invokes the __va function with an argument
that is constructed as follows. The pud_val macro returns the raw bits of the
PUD entry, which encode the physical address of the PMD table that it points to.
We compute a bitwise AND between this value and a constant that has all 1s
between bit positions 13 and 52 (rest 0s). The reason is that the maximum
physical address size is assumed to be 2^52 bytes in Linux. Furthermore, we are
aligning the address with a page boundary; hence, the first 12 bits (offset
within the page) are set to 0. This is the physical address of the PMD table,
which is assumed to be aligned with a page boundary. This physical address is
then converted to a virtual address using the __va function. We then add the
PMD index to it and find the virtual address of the PMD entry.
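The linear conversion can be summarized with the following simplified, user-space sketch. The constants and helper names here are illustrative (the real __va/__pa and pud_pgtable definitions in the kernel handle several additional cases); the sketch only shows the arithmetic described above.

#include <stdio.h>

typedef unsigned long pudval_t;
typedef struct { unsigned long pmd; } pmd_t;   /* stand-in type */

/* Illustrative constant: start of the direct-mapped region on x86-64. */
#define PAGE_OFFSET  0xffff888000000000UL

#define __va(paddr)  ((void *)((unsigned long)(paddr) + PAGE_OFFSET))
#define __pa(vaddr)  ((unsigned long)(vaddr) - PAGE_OFFSET)

/* Keep bits 13-52 of the PUD entry: the page-aligned physical address of the
 * PMD table (52-bit physical address space assumed). */
#define PHYS_MASK    0x000ffffffffff000UL

static pmd_t *pud_pgtable_sketch(pudval_t pudval) {
    return (pmd_t *)__va(pudval & PHYS_MASK);
}

int main(void) {
    pudval_t pud_entry = 0x000000012345d067UL;  /* hypothetical entry: address + flag bits */
    pmd_t *pmd_table = pud_pgtable_sketch(pud_entry);
    printf("PMD table physical = 0x%lx, virtual = %p\n",
           pud_entry & PHYS_MASK, (void *)pmd_table);
    return 0;
}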
struct page
struct page is defined in include/linux/mm types.h. It is a fairly complex data
structure that extensively relies on unions. Recall that a union in C is a data
type that can store multiple types of data in the same memory location. It is a
good data type to use if we want it to encapsulate many types of data, where
only one type is used at a time.
The page structure begins with a set of flags that indicate the status of the
page. They indicate whether the page is locked, modified, in the process of being
written back, active, already referenced or reserved for special purposes. Then
there is a union whose size can vary from 20 to 40 bytes depending upon the
configuration. We can store a bunch of things such as a pointer to the address
space (in the case of I/O devices), a pointer to a pool of pages, or a page
map (to map DMA pages or pages linked to an I/O device). Then we have a
reference count, which indicates the number of entities that are currently holding
a reference of the page. This includes regular processes, kernel components or
even external devices such as DMA controllers.
We need to ensure that before a page is recycled (returned to the pool of
pages), its reference count is equal to zero. It is important to note that the
page structure is used ubiquitously and for numerous purposes; hence, it needs
to have a very flexible structure. This is where a union with a large number of
options for storing diverse types of data turns out to be very useful.
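A heavily simplified sketch of the structure is shown below. The field names loosely follow include/linux/mm_types.h, but most union members and configuration-dependent fields are omitted, and the stand-in type definitions exist only to make the sketch self-contained.

/* Minimal stand-ins so that this sketch compiles outside the kernel. */
struct list_head { struct list_head *next, *prev; };
struct address_space;                      /* opaque here */
typedef struct { int counter; } atomic_t;  /* stand-in for the kernel's atomic_t */

/* A heavily simplified sketch of struct page. The real definition in
 * include/linux/mm_types.h has many more union members and configuration-
 * dependent fields. */
struct page_sketch {
    unsigned long flags;              /* locked, dirty, under writeback, active, ... */
    union {
        struct {                      /* file-backed and anonymous pages */
            struct list_head lru;     /* linkage into an LRU list */
            struct address_space *mapping;
            unsigned long index;      /* offset within the mapping */
        };
        struct {                      /* pages handed to a page pool / device */
            unsigned long pp_magic;
            void *pp;
        };
        /* ... several other variants (20-40 bytes, configuration dependent) ... */
    };
    atomic_t _refcount;               /* entities currently holding a reference */
};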
Folios
Let us now discuss folios [Corbet, 2022, Corbet, 2021]. A folio is a compound or
aggregate page that comprises two or more contiguous pages. The reason that
folios were introduced is because memories are very large as of today, and it
is very difficult to handle the millions of pages that they contain. The sheer
translation overhead and overhead for maintaining page-related metadata and
information is quite prohibitive. Hence, a need was felt to group consecutive
pages into larger units called folios. Specifically, a folio points to the first page
in a group of pages (compound page). Additionally, it stores the number of
pages that are a part of it.
The earliest avatars of folios were meant to be a contiguous set of virtual
pages, where the folio per se is identified by a pointer to the head page (first
page). It is a single entity insofar as the rest of the kernel code is concerned. This
in itself is a very useful concept because in a sense we are grouping contiguous
virtual memory pages based on some notion of application-level similarity.
Now if the first page of the folio is accessed, then in all likelihood the rest
of the pages will also be accessed very soon. Hence, it makes a lot of sense to
prefetch these pages to memory in anticipation of being used in the near future.
However, over the years the thinking has somewhat changed even though folios
are still in the process of being fully integrated into the kernel. Now most
interpretations try to also achieve contiguity in the physical address space as
well. This has a lot of advantages with respect to I/O, DMA accesses and
reduced translation overheads. Let us discuss another angle.
Almost all server-class machines as of today have support for huge pages,
which have sizes ranging from 2 MB to 1 GB. They reduce the pressure on the
TLB and page tables, and also increase the TLB hit rate. We maintain
a single entry for the entire huge page. Consider a 1 GB huge page. It can
hold 2^18 4 KB pages. If we store a single mapping for it, then we are basically
reducing the number of entries that we need to have in the TLB and page table
substantially. Of course, this requires hardware support and also may sometimes
be perceived to be wasteful in terms of memory. However, in today’s day and
age we have a lot of physical memory. For many applications this is a very
useful facility and the entire 1 GB region can be represented by a set of folios –
this simplifies its management significantly.
Furthermore, I/O and DMA devices do not use address translation. They
need to access physical memory directly, and thus they benefit by having a
large amount of physical memory allocated to them. It becomes very easy to
transfer a huge amount of data directly to/from physical memory if they have
a large contiguous allocation. Additionally, from the point of view of software
it also becomes much easier to interface with I/O devices and DMA controllers
because this entire memory region can be mapped to a folio. The concept
of a folio along with a concomitant hardware mechanism such as huge pages
enables us to perform such optimizations quite easily. We thus see the folio as
a multifaceted mechanism that enables prefetching and efficient management of
I/O and DMA device spaces.
Given that a folio is perceived to be a single entity, all usage and replacement-
related information (LRU stats) are maintained at the folio level. It basically
acts like a single page. It has its own permission bits as well as copy-on-write
status. Whenever a process is forked, the entire folio acts as a single unit like
a page and is copied in totality when there is a write to any constituent page.
LRU information and references are also tracked at the folio level.
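Conceptually, a folio can be thought of as a head page plus a page count. The following toy model (all names and numbers are our own; this is not the kernel's struct folio) illustrates how a single descriptor can stand for a 2 MB run of 512 contiguous 4 KB frames.

#include <stdio.h>

/* A toy model of a folio: the head page's frame number plus the number of
 * contiguous pages it covers (illustrative only, not the kernel structure). */
struct folio_sketch {
    unsigned long head_pfn;   /* frame number of the first (head) page */
    unsigned int  nr_pages;   /* number of contiguous 4 KB pages in the folio */
};

/* Does this folio cover a given page frame number? */
static int folio_contains(const struct folio_sketch *f, unsigned long pfn) {
    return pfn >= f->head_pfn && pfn < f->head_pfn + f->nr_pages;
}

int main(void) {
    /* One 2 MB folio: 512 contiguous 4 KB frames starting at frame 0x1000. */
    struct folio_sketch f = { .head_pfn = 0x1000, .nr_pages = 512 };

    printf("covers 0x1100? %d\n", folio_contains(&f, 0x1100));  /* 1 */
    printf("covers 0x2000? %d\n", folio_contains(&f, 0x2000));  /* 0 */
    return 0;
}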
Mapping the struct page to the Page Frame Number (and vice versa)
Let us now discuss how to map a page or folio structure to a page frame number
(pfn). There are several simple mapping mechanisms. Listing 6.3 shows the code
for extracting the pfn from a page table entry (pte pfn macro). We simply right
shift the address by 12 positions (PAGE SHIFT).
Listing 6.3: Converting the page frame number to the struct page and vice
versa
source : include/asm-generic/memory_model.h
#define pte_pfn(x)       phys_to_pfn(x.pte)
#define phys_to_pfn(p)   ((p) >> PAGE_SHIFT)
The next macro pfn to page has several variants. A simpler avatar of
this macro simply assumes a linear array of page structures. There are n such
structures, where n is the number of frames in memory. The code in Listing 6.3
shows a more complex variant where we divide this array into a bunch of sec-
tions. We figure out the section number from the pfn (page frame number), and
every section has a section-specific array. We find the base address of this array
and add the page frame number to it to find the starting address of the corre-
sponding struct page. The need for having sections will be discussed when we
introduce zones in physical memory (in Section 6.2.5).
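The section-based variant can be sketched as follows: locate the section that the pfn falls into, and then index that section's private array of page descriptors. This is a simplified user-space model of the idea (names, sizes and the stub struct page are ours), not the kernel macro itself.

#include <stdio.h>

/* Toy model of a sparse memory map: the frame space is divided into sections,
 * and each section has its own array of page descriptors. */
#define SECTION_SHIFT       15                  /* 2^15 frames per section */
#define FRAMES_PER_SECTION  (1UL << SECTION_SHIFT)
#define NR_SECTIONS         64

struct page { unsigned long flags; };           /* stub descriptor */

struct mem_section_sketch {
    struct page *mem_map;                       /* per-section page array */
};

static struct mem_section_sketch sections[NR_SECTIONS];

/* pfn -> struct page: pick the section, then index into its array. The array
 * pointer is stored so that mem_map + pfn lands on the right element, which
 * mirrors what the kernel's sparsemem pfn_to_page macro does. */
static struct page *pfn_to_page_sketch(unsigned long pfn) {
    struct mem_section_sketch *sec = &sections[pfn >> SECTION_SHIFT];
    return sec->mem_map + pfn;
}

int main(void) {
    static struct page pages0[FRAMES_PER_SECTION];
    /* Section 0 covers frames [0, 2^15); its base pfn is 0. */
    sections[0].mem_map = pages0;

    struct page *p = pfn_to_page_sketch(42);
    printf("page descriptor for pfn 42 is at %p\n", (void *)p);
    return 0;
}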
ASIDs
Intel x86 processors have the notion of the processor context ID (PCID), which
in software parlance is also known as the address space ID (ASID). We can take
some important user-level processes that are running on a CPU and assign them
a PCID each. Then their corresponding TLB entries will be tagged/annotated
with the PCID. Furthermore, every memory access will now be annotated with
the PCID (conceptually). Only those TLB entries will be considered that match
the given PCID. Intel CPUs typically provide 2^12 (= 4096) PCIDs. One of them
is reserved, hence practically 4095 PCIDs can be supported. There is no separate
register for it. Instead, the top 12 bits of the CR3 register are used to store the
current PCID.
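The layout can be illustrated with a tiny sketch (the helper name is ours; the bit positions follow the Intel manuals: with CR4.PCIDE enabled, bits 0-11 of CR3 hold the PCID and the higher bits hold the page-aligned physical address of the top-level page table).

#include <stdio.h>

/* Compose a CR3 value from the physical address of the top-level page table
 * and a 12-bit PCID (sketch; assumes CR4.PCIDE is enabled). */
static unsigned long make_cr3(unsigned long pgd_phys, unsigned int pcid) {
    return (pgd_phys & ~0xfffUL) | (pcid & 0xfff);
}

int main(void) {
    unsigned long cr3 = make_cr3(0x12345000UL, 5);
    printf("CR3 = 0x%lx (PCID = %lu)\n", cr3, cr3 & 0xfff);
    return 0;
}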
Now let us come to the Linux kernel. It supports the generic notion of ASIDs
(address space IDs), which are meant to be architecture independent. Note that
it is possible that an architecture does not even provide ASIDs.
In the specific case of Intel x86-64 architectures, an ASID is the same as a
PCID. This is how we align a software concept (ASID) with a hardware concept
(PCID). Given that the Linux kernel needs to run on a variety of machines and
all of them may not have support for so many PCIDs, it needs to be slightly
more conservative, and it needs to find a common denominator across all the
architectures that it is meant to run on. For the current kernel (v6.2), the
developers decided to support only 6 ASIDs, which they deemed to be enough.
This means that out of the 4095 available PCIDs, only 6 are used on an Intel CPU.
Let us now consider the case of multithreaded processes that run multiple
threads across different cores. They share the same virtual address space, and
it is important that if any TLB modification is made on one core, then the
modification is sent to the rest of the cores to ensure program consistency and
correctness. For instance, if a certain mapping is invalidated/removed, then
it needs to be removed from the page table, and it also needs to be removed
from the rest of the TLBs (on the rest of the cores). This requires us to send
many inter-processor interrupts (IPIs) to the rest of the cores such that they
can run the appropriate kernel handler and remove the TLB entry. As we would
have realized by now, this is an expensive operation. It may interrupt a lot of
high-priority tasks.
Consider a CPU that is currently executing another process. Given that
it is not affected by the invalidation of the mapping, it need not invalidate it
immediately. Instead, we can set the CPU state to the “lazy TLB mode”.
Point 6.2.1
Kernel threads do not have separate page tables. A common kernel page
table is appended to all user-level page tables. At a high level, there is a
pointer to the kernel page table from every user-level page table. Recall
that kernel and user virtual addresses differ only in their most significant
bit (MSB), and thus a pointer to the kernel-level page table needs to be
there at the highest level of the five-level composite page table.
Let us now do a case-by-case analysis. Assume that the kernel in the course
of execution tries to access the invalidated page – this will create a correctness
issue if the mapping is still there. Note that since we are in the lazy TLB
mode, the mapping is still valid in the TLB of the CPU on which the kernel
thread is executing. Hence, in theory, the kernel may access the user-level page
that is not valid at the moment. However, this cannot happen in the current
implementation of the kernel. This is because access to user-level pages does not
happen arbitrarily. Instead, such accesses happen via functions with well-defined
entry points in the kernel. Some examples of such functions are copy from user
and copy to user. At these points, special checks can be made to find out if the
pages that the kernel is trying to access are currently valid or not. If they are
not valid because another core has invalidated them, then an exception needs
to be thrown.
Next, assume that the kernel switches to another user process. In this case,
either we flush all the pages of the previous user process (solves the problem) or
if we are using ASIDs, then the pages remain but the current task’s ASID/PCID
changes. Now consider shared memory-based inter-process communication that
involves the invalidated page. This happens through well-defined entry points.
Here checks can be carried out – the invalidated page will thus not be accessed.
Finally, assume that the kernel switches back to a thread that belongs to
the same multithreaded user-level process. In this case, prior to doing so, the
kernel checks if the CPU is in the lazy TLB mode and if any TLB invalidations
have been deferred. If this is the case, then all such deferred invalidations are
completed immediately prior to switching from the kernel mode. This finishes
the work.
The sum total of this discussion is that we do not have to maintain TLB
consistency eagerly. There is no need to immediately interrupt all
the other threads running on the other CPUs and invalidate some of their TLB
entries. Instead, this can be done lazily and opportunistically as and when there
is sufficient computational bandwidth available – critical high-priority processes
need not be interrupted for this purpose.
Refer to Figure 6.10 that shows a NUMA machine where multiple chips
(group of CPUs) are connected over a shared interconnect. They are typically
organized into clusters of chips/CPUs and there is a notion of local memory
within a cluster, which is much faster than remote memory (present in another
cluster). We would thus like to keep all the data and code that is accessed within
a cluster to remain within the local memory. We need to minimize the number
of remote memory accesses as far as possible. This needs to be explicitly done to
guarantee the locality of data and ensure a lower average memory access time.
In the parlance of NUMA machines, each cluster of CPUs or chips is known as
a node. All the computing units (e.g. cores) within a node have roughly the
same access latency to local memory as well as remote memory. We thus need
to organize the physical address space hierarchically. The local memory needs to
be at the lowest level, and the next level should comprise pointers to remote memory.
Zones
Given that the physical address space is not flat, there is a need to partition
it. Linux refers to each partition as a zone [Rapoport, 2019]. The aim is to
partition the set of physical pages (frames) in the physical address space into
different nonoverlapping sets.
Each such set is referred to as a zone. They are treated separately and dif-
ferently. This concept can easily be extended to also encompass frames that
are stored on different kinds of memory devices. We need to understand that
in modern systems, we may have memories of different types. For instance,
we could have regular DRAM memory, flash/NVMe drives, plug-and-play USB
memory, and so on. This is an extension of the NUMA concept where we have
different kinds of physical memories, and they clearly have different characteris-
tics with respect to the latency, throughput and power consumption. Hence, it
makes a lot of sense to partition the frames across the devices and assign each
group of frames (within a memory device) to a zone. Each zone can then be
managed efficiently and appropriately (according to the device that it is associ-
ated with). Memory-mapped I/O and pages reserved for communicating with
the DMA controller can also be brought within the ambit of such zones.
Listing 6.4 shows the details of the enumeration type zone type. It lists the
different types of zones that are normally supported in a regular kernel.
The first is ZONE DMA, which is a memory area that is reserved for physical
pages that are meant to be accessed by the DMA controller. It is a good idea to
partition the memory and create an exclusive region for the DMA controller. It
can then access all the pages within its zone freely, and we can ensure that data
in this zone is not cached. Otherwise, we will have a complex sequence of cache
evictions to maintain consistency with the DMA device. Hence, partitioning the
set of physical frames helps us clearly mark a part of the memory that needs to
remain uncached as is normally the case with DMA pages. This makes DMA
operations fast and reduces the number of cache invalidations and writebacks
substantially.
Next, we have ZONE NORMAL, which is for regular kernel and user pages.
Sometimes we may have a peculiar situation where the size of the physical
memory actually exceeds the total size of the virtual address space. This can
happen on some older 32-bit processors and also on some embedded systems with
narrow address spaces. In such special cases, we would like to have a separate zone
of the physical memory that keeps all the pages that are currently not mapped
to virtual addresses. This zone is known as ZONE HIGHMEM.
User data pages, anonymous pages (stack and heap), regions of memory used
by large applications, and regions created to handle large file-based applications
can all benefit from placing their pages in contiguous zones of physical memory.
For example, if we want to design a database’s data structures, then it is a good
idea to create a large folio of pages that are contiguous in physical memory. The
database code can lay out its data structures accordingly. Contiguity in physical
addresses ensures better prefetching performance. A hardware prefetcher can
predict the next frame very accurately. The other benefit is a natural alignment
with huge pages, which leads to reduced TLB miss rates and miss penalties. To
create such large contiguous regions in physical memory, pages have to be freely
movable – they cannot be pinned to physical addresses. If they are movable, then
pages can dynamically be consolidated at runtime and large holes – contiguous
regions of free pages – can be created. These holes can be used for subsequent
allocations. It is possible for one process to play spoilsport by pinning a page.
Most often these are kernel processes. These actions militate against the creation
of large contiguous physical memory regions. Hence, it is a good idea to group
all movable pages and assign them to a separate zone where no page can be
pinned. Linux defines such a special zone called ZONE MOVABLE that comprises
pages that can be easily moved or reclaimed by the kernel.
The next zone pertains to novel memory devices that cannot be directly man-
aged by conventional memory management mechanisms. This includes parts of
the physical address space stored on nonvolatile memory devices (NVMs), mem-
ory on graphics cards, Intel’s Optane memory (persistent memory) and other
novel memory devices. A dedicated zone called ZONE DEVICE is thus created
to encompass all these physical pages that are stored on a device that is not
conventional DRAM.
Such unconventional devices have many peculiar features. For example, they
can be removed at any point of time without prior notice. This means that no
copy of pages stored in this zone should be kept in regular DRAM – they will
become inconsistent. Page caching is therefore not allowed. This zone also
allows DMA controllers to directly access device memory. The CPU need not
be involved in such DMA transfers. If a page is in ZONE DEVICE, we can safely
assume that the device that hosts the pages will manage them.
It plays an important role while managing nonvolatile memory (NVM) de-
vices because now the hardware can manage the pages in NVMs directly. They
are all mapped to this zone and there is a notion of isolation between device
pages and regular memory pages. The key idea here is that device pages need to
be treated differently in comparison to regular pages stored on DRAM because
of device-specific idiosyncrasies.
Point 6.2.2
NVM devices are increasingly being used to enhance the capacity of
the total available memory. We need to bear in mind that nonvolatile
memory devices lie, in terms of performance, between hard disks and
regular DRAM. The latency of a hard disk is in milliseconds,
whereas the latency of nonvolatile memory is typically in microseconds
or in the 100s of nanoseconds range. The DRAM memory on the other
hand has a sub 100-ns latency. The advantage of nonvolatile memories
is that even if the power is switched off, the contents still remain in the
device (persistence). The other advantage is that it also doubles up as a
storage device and there is no need to actually pay the penalty of page
faults when a new process starts or the system boots up. Given the
increasing use of nonvolatile memory in laptops, desktops and server-
class processors, it was incumbent upon Linux developers to create a
device-specific zone.
Listing 6.4: The zone_type enumeration (some configuration-dependent entries, such as ZONE_DMA32, are elided)
source : include/linux/mmzone.h
enum zone_type {
    /* Pages usable by legacy DMA devices */
    ZONE_DMA,
    /* Normal pages */
    ZONE_NORMAL,
    /* Pages not permanently mapped into the kernel's address space */
    ZONE_HIGHMEM,
    /* Pages that can be moved or reclaimed freely */
    ZONE_MOVABLE,
    /* Pages on device memory (NVM, GPU memory, etc.) */
    ZONE_DEVICE,
    __MAX_NR_ZONES
};
Sections
Recall that in Listing 6.3, we had talked about converting page frame numbers
to page structures and vice versa. We had discussed the details of a simple linear
layout of page structures and then a more complicated hierarchical layout that
divides the zones into sections.
It is necessary to take a second look at this concept now (refer to Figure 6.11).
To manage all the memory and that too efficiently, it is necessary to sometimes
divide it into sections and create a 2-level hierarchical structure. The first reason
is that we can efficiently manage the list of free frames within a section because
we use smaller data structures. Second, sometimes zones can be noncontiguous.
It is thus a good idea to break a noncontiguous zone into a set of sections,
where each section is a contiguous chunk of physical memory. Finally, sometimes
there may be intra-zone heterogeneity in the sense that different memory regions
within a zone may have slightly different latencies, or some part of the zone may
be considered to be volatile, especially if the device
tends to be frequently removed.
The primary problem is the following: given a physical page, we need to find the processes
that map to it. Given a physical page’s address, we have already seen how to
map it to a struct page in Section 6.2.3. A straightforward solution appears
to be that we store a list of processes (task structs) within each struct page.
This is a bad idea because there is no limit on the number of processes that
can map to a page. We thus need to store the list of processes in a linked list.
However, it is possible that a lot of space is wasted because numerous pages
may be shared in an identical manner. We will end up creating many copies of
the same linked list.
Moreover, many a time there is a requirement to apply a common policy to
a group of pages. For example, we may want all the pages to have a given access
permission. Given that we are proposing to treat each page independently, this
is bound to be quite difficult. Let us thus create a version 2 of this idea. Instead
of having a linked list associated with each struct page, let us add a pointer
to a vma (vm area structure) in each page. In this case, accessing the vma that
contains a page is quite easy. The common policy and access permissions can
be stored in the vma (as they currently are).
This does solve many of our problems with information redundancy and
grouping of pages. However, it introduces new problems. vmas tend to get
split and merged quite frequently. This can happen due to changing the ac-
cess permissions of memory regions (due to allocation or deallocation), moving
groups of pages between different regions of a program and enforcing different
policies to different address ranges within a vma. Examples of the last cat-
egory include different policies with regard to prefetching, swapping priorities
and making pages part of huge pages. Given this fluid nature of vmas, it is not
advisable to directly add a pointer to a vma to a struct page. If there is a
change, then we need to walk through the data structures corresponding to all
the constituent pages of a vma and make changes. This will take O(N) time.
We desire a solution that shall take O(1) time.
associated with two vmas across two processes (the parent and the child). The
relationship is as follows: 2 vma ↔ 1 anon vma (refer to Figure 6.13(a)).
Now, consider another case, where a parent process has forked a child process.
In this case, they have their separate vmas that point to
the same anon vma. This is the one that the shared pages also point to. Now,
assume a situation where the child process writes to a page that is shared with
the parent. This means that a new copy of the page has to be created for the
child due to the copy-on-write mechanism. This new page needs to point to
an anon vma, which clearly cannot be the one that the previously shared page
was pointing to. It needs to point to a new anon vma that corresponds to pages
exclusive to the child. There is an important question that needs to be answered
here. What happens to the child’s vma? Assume it had 1024 pages, and the
write access was made to the 500th page. Do we then split it into three parts:
0-499, 500, 501-1023? The first and last chunks of pages are unmodified up till
now. However, we made a modification in the middle, i.e., to the 500th page.
This page is now pointing to a different anon vma.
Figure 6.13: (a) Many-to-one mapping (b) One-to-many mapping
Splitting the vma is not a good idea. This is because a lot of pages in a
vma may see write accesses when they are in a copy-on-write (COW) mode. We
cannot keep on splitting the vma into smaller and smaller chunks. This is a lot
of work and will prohibitively increase the number of vma structures that need
to be maintained. Hence, as usual, the best solution is to do nothing, i.e., not
split the vma. Instead, we maintain the vma as it is but assume that all the
pages in its range may not be mapped to the same anon vma. We thus have the
following relationship: 1 vma ↔ 2 anon_vma (refer to Figure 6.13(b)). Recall
that we had earlier shown a case with the opposite relationship: 2 vma ↔ 1
anon_vma (Figure 6.13(a)).
Let us now summarize our learnings.
Point 6.3.1
This is what we have understood about the relationship between a vma
and anon vma.
• For a given virtual address region, every process has its own private
vma.
We thus observe that a complex relationship between the anon vma and vma
has developed at this point (refer to Figure 6.14 for an example). Maintaining
this information and minimizing runtime updates is not easy. There is a classical
time and space trade-off that we need to navigate here. If we want to minimize
time, we should increase space. Moreover, we desire a data structure that
captures the dynamic nature of the situation. The events of interest that we
have identified up till now are as follows: a fork operation, a write to a COW
page, splitting or merging a vma and killing a process.
Let us outline our requirements using a few 1-line principles.
1. Every anon vma should know which vma structures it is associated with.
2. Every vma should know which anon vma structures it is associated with.
3. The question that we need to answer now is whether these two structures
should point to each other directly or via a level of indirection. As we shall
see, the kernel opts for a level of indirection: the anon_vma_chain structure.
We shall show the C code of an anon vma after we describe another structure
known as the anon vma chain because both are quite intimately connected. It
is not possible to completely explain the former without explaining the latter.
Figure 6.15: The relationship between vma, anon vma and anon vma chain
(avc). A dashed arrow indicates that anon vma does not hold a direct pointer
to the avc, but holds a reference to a red-black tree that in turn has a pointer
to the avc.
We can think of the anon vma chain structure as a link between a vma and
an anon vma. We had aptly referred to it as a level of indirection. An advantage
of this structure is that we can link it to other anon vma chain nodes via the
same_vma list (refer to Figure 6.16). All of them correspond to the same vma.
They are thus stored in a regular doubly-linked list. We shall see later that for
faster access, it is also necessary to store anon vmas in a red-black tree. Hence,
an anon vma chain points to a red-black tree node.
Figure 6.16: anon vma chain nodes connected together as a list (all correspond-
ing to the same vma)
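For reference, the anon_vma_chain structure itself is quite small. The sketch below lists its fields as they appear in recent kernels (include/linux/rmap.h); the locking rules are compressed into short comments and may be simplified.

struct anon_vma_chain {
    struct vm_area_struct *vma;       /* the vma this link belongs to */
    struct anon_vma *anon_vma;        /* the anon_vma it connects the vma to */
    struct list_head same_vma;        /* list of all the avcs of the same vma */
    struct rb_node rb;                /* node in the anon_vma's red-black tree */
    unsigned long rb_subtree_last;    /* interval-tree bookkeeping */
};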
Now, let us look at the code for the anon vma in Listing 6.8.
Listing 6.8: anon vma
source : include/linux/rmap.h
struct anon_vma {
    /* Arrange all the anon_vmas corresponding to a vma hierarchically.
       The root node holds the lock for accessing the chain of anon_vmas. */
    struct anon_vma *root;
    struct anon_vma *parent;

    /* Read-write semaphore (meaningful in the root anon_vma) */
    struct rw_semaphore rwsem;

    /* Red-black (interval) tree of the anon_vma_chain nodes that point here */
    struct rb_root_cached rb_root;

    /* reference count and other bookkeeping fields omitted */
};
We don’t actually store any state. We just store pointers to other data
structures. All that we want is that all the pages in a virtual memory region with
similar access policies point to the same anon vma. Now, given the anon vma,
we need to quickly access all the vma structures that may contain the page. This
is where we use a red-black tree (rb root) that actually stores three pieces of
information: anon vma chain nodes (the value), and the start and end virtual
addresses of the associated vma (the key). The red-black tree can quickly be
used to find all the vma structures that contain a page.
Additionally, we organize all the anon vma structures as a tree, where each
node also has a direct pointer to the root. The root node stores a read-write
semaphore that is used to lock the list of anon vma chain nodes such that
changes can be made (add/delete/etc.).
Within the vma (struct vm_area_struct), the exclusive anon_vma is stored in a
field named anon_vma, and the list of anon_vma_chain nodes is stored in a field
that is its namesake: anon_vma_chain.
Figure 6.17: Updated relationship between vma, anon vma and avc
Next, it is possible that because of fork operations, many other vma struc-
tures (across processes) point to the same anon vma via avcs – the anon vma
that is associated with the vma of the parent process. Recall that we had cre-
ated a linked list of anon vma chain nodes (avcs) precisely for this purpose.
Figure 6.18 shows an example where an anon vma is associated with multiple
vmas across processes.
Figure 6.18: Example of a scenario with multiple processes where an anon vma
is associated with multiple vmas
Figure 6.19: Reverse map structures after a fork operation
Let us now consider the case of a fork operation. The reverse map (rmap)
structures are shown in Figure 6.19 for both the parent and child processes.
The parent process in this case has one vma and an associated anon vma. The
fork operation starts out by creating a copy of all the rmap structures of the
parent. The child thus gets an avc that links its vma to the parent’s anon vma.
This ensures that all shared pages point to the same anon vma and the structure
is accessible from both the child and parent processes.
The child process also has its own private anon vma that is pointed to by its
vma. This is for pages that are exclusive to it and not shared with its parent
process. Let us now look at the classical reverse mapping problem. Up till now,
we have discussed a one-way mapping from vmas to anon vmas. But we have
not discussed how we can locate all vmas of different processes given a page.
This is precisely the reverse mapping problem that we shall solve next.
Given a page frame number, we can locate its associated struct page, and
then using its mapping field, we can retrieve a pointer to the anon vma. The next
problem is to locate all the avcs that point to the anon vma. Every anon vma
stores a red-black tree that stores the list of avcs that point to it. Specifically,
each node of the red-black tree stores a pointer to an avc (the value) and the
range of virtual addresses it covers (the key used to search the tree).
The child process now has multiple avcs. They are connected to each other
and the child process’s vma using a doubly linked list.
Let us now consider the case of a shared page that points to the anon vma
of the parent process (refer to Figure 6.20). After a fork operation, this page is
stored in copy-on-write (COW) mode. Assume that the child process writes to
the page. In this case, a new copy of the page needs to be made and attached
to the child process. This is shown in Figure 6.21.
The new page now points to the private anon vma of the child process. It is
now the exclusive property of the child process.
Next, assume that the child process is forked. In this case, the rmap struc-
tures are replicated, and the new grandchild process is also given its private vma
and anon vma (refer to Figure 6.22). In this case, we create two new avcs. One
avc points to the anon vma of the child process and the other avc points to the
anon vma of the original parent process. We now have two red-black trees: one
Figure 6.20: A page pointing to the parent’s anon vma
Figure 6.21: A new page pointing to the child’s anon vma
corresponds to the parent process’s anon vma and the other corresponds to the
child process’s anon vma. The avcs of each process are also nicely linked using
a doubly linked list, which also includes its vma.
After repeated fork operations, it is possible that a lot of avcs and anon vmas
get created. This can lead to a storage space blowup. Modern kernels optimize
this. Consider the anon vma (+ associated avc) that is created for a child process
such that pages that are exclusive to the child can point to it. In some cases,
instead of doing this, an existing anon vma along with its avc can be reused.
This introduces an additional level of complexity; however, the space savings
justify this design decision to some extent.
Figure 6.22: The structure of the rmap structures after a second fork operation
(fork of the child)
• The algorithm divides pages into different generations based on the recency
of the last access. If a page is accessed, there is a fast algorithm to upgrade
it to the latest generation.
• The algorithm reclaims pages in the background by swapping them out to
the disk. It swaps pages that belong to the oldest generations.
• It ages pages intelligently, in a workload-dependent manner.
• It is tailored to running large workloads and integrates well with the notion
of folios.
Listing 6.10: The lruvec structure (simplified)
source : include/linux/mmzone.h
struct lruvec {
/* contains the physical memory layout of the NUMA
node */
struct pglist_data * pgdat ;
/* Number of refaults */
unsigned long refaults [ ANON_AND_FILE ];
/* LRU state */
struct lru_gen_struct lrugen ;
struct lru_gen_mm_state mm_state ;
};
Linux uses the lruvec structure to store LRU replacement related important
information. Its code is shown in Listing 6.10. The first key field is a pointer
to a pglist data structure that stores the details of the zones in the current
NUMA node (discussed in Section 6.2.5).
Next, we store the number of refaults for anonymous and file-backed pages. A
refault is a page access after it has been evicted. We clearly need to minimize the
number of refaults. If it is high, it means that the page replacement and eviction
algorithms are suboptimal – they evict pages that have a high probability of
being accessed in the near future.
The next two fields lrugen and mm state store important LRU-related state.
lrugen is of type lru gen struct (shown in Listing 6.11). mm state is of type
lru gen mm state (shown in Listing 6.12).
Listing 6.11: The lru_gen_struct structure (simplified)
source : include/linux/mmzone.h
struct lru_gen_struct {
    /* Maximum and minimum sequence numbers (generations) */
    unsigned long max_seq;
    unsigned long min_seq[ANON_AND_FILE];
    /* Birth time of each generation (in jiffies) */
    unsigned long timestamps[MAX_NR_GENS];
    /* 3D array of lists */
    struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
};
A lru gen struct structure stores a set of sequence numbers: maximum and
minimum (for anonymous and file-backed pages, resp.), an array of timestamps
(one per generation) and a 3D array of linked lists. This array will prove to
be very useful very soon. It is indexed by the generation number, type of page
(anonymous or file) and zone number. Each entry in this 3D array is a linked
list whose elements are struct pages. The idea is to link all the pages of the
same type that belong to the same generation in a single linked list. We can
traverse these lists to find pages to evict based on additional criteria.
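The first index of this array is derived from a generation's sequence number modulo MAX_NR_GENS, so the array behaves like a ring of generations. A short sketch of the helper (it mirrors the kernel's lru_gen_from_seq; MAX_NR_GENS is 4 in the kernel):

#define MAX_NR_GENS 4

/* Map a (monotonically increasing) generation sequence number to a slot in the
 * lists[][][] array; the array thus acts as a ring of MAX_NR_GENS generations. */
static inline int lru_gen_from_seq(unsigned long seq)
{
    return seq % MAX_NR_GENS;
}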
Let us next discuss the code of lru gen mm state (see Listing 6.12). This
structure stores the current state of a page walk – a traversal of all the pages
to find pages that should be evicted and written to secondary storage (swapped
out). At one point, multiple threads may be performing page walks (stored in
the nr walkers variable).
The field seq is the current sequence number that is being considered in
the page walk process. Each sequence number corresponds to a generation –
lower the sequence number, earlier the generation. The head and tail pointers
point to consecutive elements of a linked list of mm struct structures. In a
typical page walk, we traverse all the pages that satisfy certain criteria of a
given process (its associated mm struct), then we move to the next process (its
mm struct), and so on. This process is easily realized by storing a linked list of
mm struct structures. The tail pointer points to the mm struct structure that
was just processed (pages traversed). The head pointer points to the mm struct
that needs to be processed.
Finally, we use an array of Bloom filters to speed up the page walk process
(as we shall see later). Whenever the term Bloom filter comes up, the key
property to keep in mind is that a Bloom filter can never produce a false
negative, but it can produce a false positive.
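To see why false negatives are impossible while false positives are not, consider the following minimal Bloom filter sketch (two simple hash functions over a small bit array; all names and constants are our own and this is not the kernel's implementation).

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define BLOOM_BITS 1024

static unsigned char bits[BLOOM_BITS / 8];

/* Two cheap, roughly independent hash functions over the key. */
static unsigned int h1(unsigned long key) { return (key * 2654435761UL) % BLOOM_BITS; }
static unsigned int h2(unsigned long key) { return (key * 40503UL + 7) % BLOOM_BITS; }

static void bloom_setbit(unsigned int b) { bits[b / 8] |= 1u << (b % 8); }
static bool bloom_getbit(unsigned int b) { return bits[b / 8] & (1u << (b % 8)); }

/* Insert a key: set both of its bits. */
static void bloom_add(unsigned long key) { bloom_setbit(h1(key)); bloom_setbit(h2(key)); }

/* Query a key: if either bit is clear, the key was definitely never added
 * (no false negatives); if both are set, it was *probably* added (false
 * positives arise when other keys happen to set the same bits). */
static bool bloom_test(unsigned long key) { return bloom_getbit(h1(key)) && bloom_getbit(h2(key)); }

int main(void) {
    memset(bits, 0, sizeof(bits));
    bloom_add(0x7f001000UL);                             /* e.g., a PMD address */
    printf("present? %d\n", bloom_test(0x7f001000UL));   /* 1                    */
    printf("absent?  %d\n", bloom_test(0x7f203000UL));   /* almost surely 0      */
    return 0;
}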
facility. They can automatically set this flag. A simple scan of the page table
can yield the list of pages that have been recently accessed (after the last time
that the bits were cleared). If hardware support is not available, then there is
a need to mark the pages as inaccessible. This will lead to a page fault, when
the page is subsequently accessed. This is not a full-scale (hard) page fault,
where the contents of the page need to be read from an external storage device.
It is instead a soft page fault: after recording the fact that there was a page
fault, the page's access permissions are changed and it is made accessible once
again. We basically induce fake page faults deliberately in order to record page accesses.
can have plain old-fashioned eviction, or we can reclaim pages from specialized
page buffers. The sizes of the latter can be adjusted dynamically to release
pages and make them available to other processes.
Let us first understand whether we need to run aging or not in the first
place. The logic is shown in Listing 6.14. The aim is to maintain the following
relationship: min seq + MIN NR GENS == max seq. This means that we wish
to ideally maintain MIN NR GENS+1 sequence numbers (generations). The check
in Line 2 determines whether there are too few generations. If so, there is definitely a
need to run the aging algorithm. On similar lines, if the check in Line 7 is true,
then it means that there are too many generations. There is no need to run the
aging algorithm.
Next, let us consider the corner case when there is equality. First, let us
define what it means for a page to be young or old. A page is said to be young
if its associated sequence number is equal to max seq. This means that it belongs
to the latest generation. On similar lines, a page is said to be old if its sequence
number follows this relationship: seq + MIN NR GENS == max seq. Given that
we would ideally like to maintain the number of generations at MIN NR GENS+1,
we track two important pieces of information – the number of young and old
pages, respectively.
The first check young × MIN NR GENS > total ensures that if there are too
many young pages, there is a need to run aging. The reason is obvious. We
want to maintain a balance between the young and not-so-young pages. Let
us consider the next inequality: old × (MIN NR GENS + 2) < total. This clearly
says that if the number of old pages is lower than what is expected (too few),
then also we need to age. An astute reader may notice here that we add
an offset of 2, whereas we did not add such an offset in the case of young
pages. There is an interesting explanation here, which will help us appreciate
the nuances involved in designing practical systems.
As mentioned before, we wish to ideally maintain only MIN NR GENS+1 gen-
erations. There is a need to provide a small safety margin here for young pages
because we do not want to run aging very frequently. Hence, the multiplier is
set to MIN NR GENS. In the case of old pages, the safety margin works in the
reverse direction. We can allow it to go as low as total / (MIN NR GENS + 2).
This is because we do not want to age too frequently, and in this case aging
will cause old pages to get evicted. We would also like to reduce unnecessary
eviction. Hence, we set the safety margin differently in this case.
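The decision described above can be condensed into a short sketch. This is a simplified restatement of the logic in mm/vmscan.c, not the exact kernel code; MIN_NR_GENS is 2 in the kernel, and the variable names mirror the discussion.

#include <stdbool.h>

#define MIN_NR_GENS 2

/* Should the aging algorithm run? A simplified restatement of the checks
 * discussed above. */
static bool should_run_aging_sketch(unsigned long min_seq, unsigned long max_seq,
                                    unsigned long young, unsigned long old,
                                    unsigned long total)
{
    /* Too few generations: definitely age. */
    if (min_seq + MIN_NR_GENS > max_seq)
        return true;
    /* Too many generations: no need to age. */
    if (min_seq + MIN_NR_GENS < max_seq)
        return false;

    /* Exactly MIN_NR_GENS + 1 generations: look at the young/old balance. */
    if (young * MIN_NR_GENS > total)
        return true;                 /* too many young pages */
    if (old * (MIN_NR_GENS + 2) < total)
        return true;                 /* too few old pages    */

    return false;
}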
actually old. To eliminate this possibility, there is therefore a need to scan all
the PMD’s constituent pages and check if the pages were recently accessed or
not. The young/old status of the PMD can then be accurately determined.
This entails extra work. However, if a PMD address is not found, then it means
that it predominantly contains old pages for sure (the PMD is old). Given
that Bloom filters do not produce false negatives, we can skip such PMDs with
certainty because they are old.
When we walk through the page tables, the idea is to skip unaccessed pages.
To accelerate this process, we can skip full PMDs (512 pages) if they are not
found in the Bloom filter. For the rest of the young pages, we clear the accessed
bit and set the generation of the page or folio to max seq.
Both the arguments are correct in different settings. Hence, Linux pro-
vides both the options. If nothing is specified, then the first argument holds –
file-backed pages. However, if there is more information and the value of the
swappiness is in the range 1-200, then anonymous pages are slightly depriori-
tized as we shall see.
Let us next compare the minimum sequence numbers for both the types.
If the anonymous pages have a lower generation, then it means that they are
more aged, and thus should be evicted. However, if the reverse is the case –
file-backed pages have a lower generation (sequence number) – then we don’t
evict them outright (as per the second aforementioned argument).
Listing 6.15: The algorithm that chooses the type of the pages/folios to evict
source : mm/vmscan.c
if (!swappiness)
    type = LRU_GEN_FILE;
else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
    type = LRU_GEN_ANON;
else if (swappiness == 1)
    type = LRU_GEN_FILE;
else if (swappiness == 200)
    type = LRU_GEN_ANON;
else if (!(sc->gfp_mask & __GFP_IO))   /* I/O is not allowed here */
    type = LRU_GEN_FILE;
else
    type = get_type_to_scan(lruvec, swappiness, &tier);
least). Next, we define two ctrl pos variables: sp (set point) and pv (process
variable). The goal of any such algorithm based on control theory is to set the
process variable equal to the set point (a pre-determined state of the system).
We shall stick to this generic terminology because such an algorithm is valuable
in many scenarios, not just in finding the type of pages to evict. It tries to bring
two quantities closer in the real world. Hence, we would like to explain it in
general terms.
The parameter gain plays an important role here. For anonymous pages
it is defined as the swappiness (higher it is, more are the evicted anonymous
pages) and for file-backed pages it is 200-swappiness. The gain indirectly is
a measure of how aggressively we want to evict a given type of pages. If it
approaches 200, then we wish to evict anon pages, and if it approaches 1, we
wish to evict f ile pages. It quantifies the preference.
Next, we initialize the sp and pv variables. The set point is set equal to the
eviction statistics ⟨#refaults, #evictions, gain⟩ of anon pages. The process
variable is set to the eviction statistics of file pages. We now need to compare
sp and pv. Note that we are treating sp (anon) as the reference point here and
trying to ensure that, in some sense, pv (file) approaches sp (anon).
This will balance both of them, and we will be equally fair to both.
Point 6.3.2
In any control-theoretic algorithm, our main aim is to bring pv as close
to sp as possible. In this case also we wish to do so and in the process
ensure that both file and anon pages are treated fairly.
pv.refaulted / (pv.total × pv.gain) ≤ (sp.refaulted + α) / ((sp.total + β) × sp.gain)    (6.2)
Equation 6.1 is an idealized equation. In practice, Linux adds a couple
of constants to the numerator and the denominator to incorporate practical
considerations and maximize the performance of real workloads. Hence, the
exact version of the formula implemented in the Linux kernel v6.2 is shown in
Equation 6.2. The designers of Linux set α = 1 and β = 64. Note that these
constants are based on experimental results, and it is hard to explain them
logically.
Linux has another small trick. If the absolute number of file refaults is low
(< 64), then it chooses to evict file-backed pages. The reason is that most likely
the application either does not access a file or the file accessed is very small.
On the other hand, every application shall have a large amount of anonymous
memory comprising stack and heap sections. Even if it is a small application,
the code to initialize it is sizable. Hence, it is a good idea to evict anon
pages only if the file refault rate is above a certain threshold.
Now, we clearly have a winner. If the inequality in Equation 6.2 is true,
then pv is chosen (file), else sp is chosen (anon).
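In integer arithmetic, the comparison of Equation 6.2 is done by cross-multiplying rather than dividing. The sketch below is modeled on the kernel's positive_ctrl_err() logic (the structure and function names here are our own simplification); the constants 1 and 64 play the roles of α and β, and the small-refault shortcut described above appears as the first check.

#include <stdbool.h>

struct ctrl_pos {
    unsigned long refaulted;   /* refaults seen                     */
    unsigned long total;       /* evictions (plus protected folios) */
    int gain;                  /* swappiness or 200 - swappiness    */
};

/* Return true if pv (file) should be chosen for eviction rather than sp (anon),
 * i.e., the inequality of Equation 6.2 holds. Cross-multiplication avoids
 * division; the constants 1 and 64 correspond to alpha and beta. */
static bool prefer_pv_for_eviction(const struct ctrl_pos *sp, const struct ctrl_pos *pv)
{
    /* Very few file refaults: evict file pages anyway (see the text above). */
    if (pv->refaulted < 64)
        return true;

    return pv->refaulted * (sp->total + 64) * sp->gain <=
           (sp->refaulted + 1) * pv->total * pv->gain;
}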
higher tiers are more frequently referenced. We would ideally like to evict folios
in lower tiers. Folios in higher tiers may be more useful.
Assume that the type of folios chosen for eviction is T, and the other type
(not chosen for eviction) is T′. Clearly, the choice was made based on average
statistics. Let us now do a tier-wise analysis, and compare their normalized re-
fault rates by taking the gain into account tier-wise using Equation 6.2. Instead
of comparing average statistics, we perform the same comparison tier-wise. We
may find that till a certain tier k, folios of type T need to be evicted. However,
for tier k + 1 the reverse may be the case. It means that folios of type T′ need
to be evicted as per the logic in Equation 6.2. If no such k exists, then nothing
needs to be done. Let us consider the case, when we find such a value of k.
In this case, the folios in tiers [0, k] should be considered for eviction as long
as they are not pinned, being written back, involved in race conditions, etc.
However, the folios in tiers k + 1 and beyond, should be given a second chance
because they have seen more references. We already know that folios in tier
k + 1 should not be evicted because if we compare their statistics with those of
the corresponding folios in type T , it is clear that by Equation 6.2 they should
remain in memory. Instead of doing the same computation for the rest of the
folios, we can simply assume that they also need to be kept in memory for the
time being. This is done in the interest of time and there is a high probability of
this being a good decision. Note that higher-numbered tiers see exponentially
more references. Hence, we increment the generations of all folios in
the range [k + 1, MAX_TIERS]. They do not belong to the oldest generation
anymore. They enjoy a second chance. Once a folio is promoted to a new
generation, we can optionally clear its reference count. The philosophy here is
that the folio has already benefited once from its high reference count. It
should not benefit again; let it start afresh after getting promoted to the higher generation.
This is depicted pictorially in Figure 6.23.
Eviction of a Folio
Now, we finally have a list of folios that can be evicted. Note that some folios
did get rescued and were given a second chance, though.
The process of eviction can now be started folio after folio. Sometimes it
is necessary to insert short delays in this process, particularly if there are high
priority tasks that need to access storage devices or some pages in the folio are
being swapped out. This is not a one-shot process; it is punctuated with periods
of activity initiated by other processes.
Once a folio starts getting evicted, we can do some additional bookkeeping.
We can scan proximate (nearby) virtual addresses. The idea here is that pro-
grams tend to exhibit spatial locality. If one folio was found to be old, then
pages in the same vicinity should also be scrutinized. We may find many more
candidates that can possibly be evicted in the near future. For such candidate
pages, we can mark them to be old (clear the accessed bit) and also note down
PMD (Page Middle Directory) entries that mostly comprise young pages.
These can be added to a Bloom filter, which will prove to be very useful later
when we discuss the page walk process. We can also slightly reorganize the
folios here. If a folio is very large, it can be split into several smaller folios.
Point 6.3.3
Such additional bookkeeping actions that are piggybacked on regular
operations like a folio eviction are a common pattern in modern operating
systems. Instead of operating on large data structures like the page table
in one go, it is much better to slightly burden each operation with a small
amount of additional bookkeeping work. For example, folio eviction is
not on the critical path most of the time, and thus we can afford to do
some extra work.
Once the extra work of bookkeeping is done, the folio can be written back
to the storage device. This would involve clearing its kernel state, freeing all
the buffers that were storing its data (like the page cache for file-backed pages),
flushing the relevant entries in the TLB and finally writing the folio back.
contains mostly old pages. Given that there is no possibility of an error, we can
confidently skip scanning all the constituent page table entries that are covered
by the PMD entry. This will save us a lot of time.
Let us consider the other case, when we find a PMD address in the Bloom
filter. It is most likely dominated by young pages. The reason we use the term
“most likely” is that Bloom filters can lead to false positive outcomes. We scan
the pages in the PMD region – either all or a subset of them at a time based
on performance considerations. This process of looking around marks young
folios as old on the lines of classic clock-based page replacement algorithms.
Moreover, note that when a folio is marked, all its constituent pages are also
marked. At this point, we can do some additional things. If we find a PMD
region that mostly comprises young pages, then the PMD address can be added
to the Bloom filter. Furthermore, young folios in this region can be promoted to
the latest generation – their generation/sequence number can be set to max seq.
This is because they are themselves young, and they also lie in a region that
mostly comprises young pages. We can use spatial and temporal locality based
arguments to justify this choice.
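For concreteness, the following is a minimal, self-contained Bloom filter sketch in
ordinary user-space C (it is not the kernel's actual implementation, and the two
hash functions are arbitrary illustrative choices). A lookup that finds a cleared
bit is a guaranteed miss – there are no false negatives – whereas a lookup that
finds all bits set may occasionally be a false positive.

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS (1u << 15)          /* 32K bits of filter state */

static uint8_t bloom[BLOOM_BITS / 8];

/* Two arbitrary multiplicative hash functions (illustrative only). */
static uint32_t hash1(uint64_t key) { return (key * 0x9E3779B97F4A7C15ull) >> 49; }
static uint32_t hash2(uint64_t key) { return (key * 0xC2B2AE3D27D4EB4Full) >> 49; }

/* Record that this PMD region is dominated by young pages. */
static void bloom_add(uint64_t pmd_addr)
{
    uint32_t a = hash1(pmd_addr) % BLOOM_BITS, b = hash2(pmd_addr) % BLOOM_BITS;
    bloom[a / 8] |= 1u << (a % 8);
    bloom[b / 8] |= 1u << (b % 8);
}

/* May return true spuriously (false positive), but never returns false
 * for an address that was actually added. */
static bool bloom_maybe_contains(uint64_t pmd_addr)
{
    uint32_t a = hash1(pmd_addr) % BLOOM_BITS, b = hash2(pmd_addr) % BLOOM_BITS;
    return (bloom[a / 8] & (1u << (a % 8))) && (bloom[b / 8] & (1u << (b % 8)));
}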
6.3.3 Thrashing
Your author is pretty sure that everybody is guilty of the following performance
crime. The user boots the machine and tries to check her email. She finds
it to be very slow because the system is booting up and all the pages of the
email client are not in memory. She grows impatient, and tries to start the web
browser as well. Even that is slow. She grows even more impatient and tries to
write a document using MS Word. Things just keep getting slower. Ultimately,
she gives up and waits. After a minute or two, all the applications come up and
the system stabilizes. Sometimes if she is unlucky, the system crashes.
What exactly is happening here? Let us look at it from the point of view of
paging. Loading the pages for the first time into memory from a storage device
such as a hard disk or even a flash drive takes time. Storage is, after all, several
orders of magnitude slower than main memory. During this time, if another
application is started, its pages also start getting loaded. This reduces the bandwidth to
the storage device and both applications get slowed down. However, this is
not the only problem. If these are large programs, whose working set (refer
to Section 6.1.3) is close to the size of main memory, then they need to evict
each other’s pages. As a result, when we start a new application it evicts pages
of the applications that are already running. Then, when there is a context
switch, the existing applications stall because crucial pages from their working set
were evicted. They suffer from page faults. Their pages are then fetched from
the storage device. However, this has the same effect again. These pages displace the
pages of other applications. This cycle continues. This phenomenon is known
as thrashing. A system goes into thrashing when there are too many applications
running at the same time and most of them require a large amount of memory.
They end up evicting pages from each other’s working sets, which just increases
the page fault rate without any beneficial outcome.
It turns out that things can get even worse. The performance counters
detect that there is low CPU activity. This is because most of the time is
spent in servicing page faults. As a result, the scheduler tries to schedule even
more applications to increase the CPU load. This increases the thrashing even
further. This can lead to a vicious cycle, which is why thrashing needs to be
detected and avoided at all costs.
Linux has a pretty direct solution to stop thrashing. It tries to keep the
working set of an application in memory. This means that once a page is brought
in, it is not evicted very easily. The page that is brought in (upon a page fault)
is most likely a part of the working set. Hence, it makes little sense to evict
it. The MGLRU algorithm already ensures this to some extent. A page that is
brought into main memory has the latest generation. It takes time for it to age
and be a part of the oldest generation and become eligible for eviction. However,
when there are a lot of applications the code in Listing 6.14 can trigger the aging
process relatively quickly because we will just have a lot of young pages. This
is not a bad thing when there is no thrashing. We are basically weeding out old
pages. However, when thrashing sets in, such mechanisms can behave in erratic
ways.
There is thus a need for a master control. The eviction algorithm simply
does not allow a page to be evicted if it was brought into memory in the last N
ms. In most practical implementations, N = 1000. This means that every page
is kept in memory for at least 1 second. This ensures that evicting pages in the
working set of any process is difficult. Thrashing can be effectively prevented
in this manner.
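A rough sketch of this rule is shown below. This is not the kernel's actual code
(MGLRU exposes a comparable threshold, min_ttl_ms); the structure and field
names are made up purely for illustration.

#include <stdbool.h>

#define MIN_TTL_MS 1000   /* keep every folio resident for at least one second */

/* Hypothetical per-folio record; the real kernel derives this information
 * from the folio's generation and associated timestamps. */
struct folio_info {
    unsigned long birth_ms;   /* time at which the folio was brought into memory */
};

/* A folio becomes eligible for eviction only after its minimum residency. */
static bool folio_evictable(const struct folio_info *f, unsigned long now_ms)
{
    return (now_ms - f->birth_ms) >= MIN_TTL_MS;
}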
However, there is one problem here. Assume that an application is trying
to execute, but its pages cannot be loaded to memory because of the aforemen-
tioned rule. In this case, it may wait in the runqueue for a long time. This will
make it unresponsive. To prevent this, Linux simply denies it permission to run
and terminates it with an “Out of Memory” (OOM) error. It has a dedicated
utility called the OOM killer whose job is to terminate such applications. This
is a form of admission control where we limit the number of processes. Along
with persisting working set pages in memory for a longer duration, terminating
new processes effectively prevents thrashing.
Definition 6.3.1 Thrashing
A system is said to thrash when too many memory-hungry applications run at the
same time and keep evicting pages from each other's working sets. The page fault
rate shoots up, very little useful work gets done, and most of the time is spent
fetching pages from the storage device.
We had also discussed the organization of the kernel’s virtual address space
in Section 6.2.1. Here we saw many regions that are either not “paged”, or where
the address translation is a simple linear function. This implies that contiguity
in the virtual address implies contiguity in the physical address space as well.
We had argued that this is indeed a very desirable feature especially when we are
communicating with external I/O devices, DMA controllers and managing the
memory space associated with kernel-specific structures. Having some control
over the physical memory space was deemed to be a good thing.
On the flip side, this will take us back to the bad old days of managing a
large chunk of contiguous memory without the assistance of paging-based sys-
tems that totally delink the virtual and physical address spaces.
We may again start seeing a fair amount of external fragmentation. Notwith-
standing this concern, we also realize that in paging systems there is often a
need to allocate a large chunk of contiguous virtual addresses. This is quite
beneficial because prefetching-related optimizations are possible. In either case,
we are looking at the same problem, which is maintaining a large chunk of con-
tiguous memory while avoiding the obvious pitfalls: management of holes and
uncontrolled external fragmentation.
Recall that we discussed the base-limit scheme in Section 6.1.1. It was
solving a similar problem, albeit ineffectively. We had come across the problem
of holes, and it was very difficult to plug holes or solve the issues surrounding
external fragmentation. We had proposed a bunch of heuristics such as first-fit,
next-fit and so on; however, we could not come up with a very effective method
of managing the memory this way. It turns out that if we have a bit more
of regularity in the memory accesses, then we can use many other ingenious
mechanisms to manage the memory better without resorting to conventional
paging. We will discuss several such mechanisms in this section.
Figure 6.25: Freeing the 20 KB region allocated earlier
Let us now free the 20 KB region that was allocated earlier (see Figure 6.25).
In this case, we will have two 32 KB regions that are free and next to each other
(they are siblings in the tree). There is no reason to have two free regions at
the same level. Instead, we can get rid of them and just keep the parent, whose
size is 64 KB. We are essentially merging free regions (holes) and creating a
larger free region. In other words, we can say that if both the children of a
parent node are free (unallocated), they should be removed, and we should only
have the parent node that coalesces the full region. Let us now look at the
implementation. We refer to the region represented by each node in the buddy
tree as a block.
Implementation
Let us look at the implementation of the buddy allocator by revisiting the
free_area array in struct zone (refer to Section 6.2.5). We shall define the
order of a node in the buddy tree. The order of a leaf node that corresponds
to the smallest possible region – one page – is 0. Its parent has order 1. The
order keeps increasing by 1 till we reach the root. Let us now represent the tree
as an array of lists: one list per order. All the nodes of the tree (of the same
order) are stored one after the other (left to right) in an order-specific list. A
node represents an aggregate page, which stores a block of memory depending
upon the order. Thus, we can say that each linked list is a list of pages, where
each page is actually an aggregate page that may point to N contiguous 4 KB
pages, where N is a power of 2.
The buddy tree is thus represented by an array of linked lists – struct
free_area free_area[MAX_ORDER]. Refer to Listing 6.16, where each struct
free_area is a linked list of nodes (of the same order). The root's order is
limited to MAX_ORDER - 1. In each free_area structure, the member nr_free
refers to the number of free blocks (= the number of pages in the associated linked
list).
There is a subtle twist involved here. We actually have multiple linked
lists – one for each migration type. The Linux kernel classifies pages based on
their migration type: it is based on whether they can move, once they have
been allocated. One class of pages cannot move after allocation; then there are
pages that can freely move around physical memory, pages that can be
reclaimed, and pages reserved for specific purposes. These are different
examples of migration types. We maintain separate lists for different migration
types. It is as if their memory is managed separately.
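To make the layout concrete, here is a simplified rendering of these structures
(the actual definitions live in include/linux/mmzone.h and contain many more
fields; the numeric values below are purely illustrative).

/* Illustrative constants; the real values are configuration dependent. */
#define MAX_ORDER      11   /* orders range from 0 to MAX_ORDER - 1 */
#define MIGRATE_TYPES   6   /* unmovable, movable, reclaimable, ... */

struct list_head { struct list_head *next, *prev; };

struct free_area {
    struct list_head free_list[MIGRATE_TYPES]; /* free blocks, one list per migration type */
    unsigned long    nr_free;                  /* number of free blocks of this order */
};

struct zone {
    /* ... many other fields ... */
    struct free_area free_area[MAX_ORDER];     /* one free_area per order */
};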
Figure 6.26: Buddies within a zone. The type refers to the migration type
This can be visualized in Figure 6.26, where we see that a zone has a pointer
to a single free_area (for a given order), and this free_area structure has
pointers to many lists depending on the type of page migration that is
allowed. Each list contains a list of free blocks (aggregate pages). Effectively,
we are maintaining multiple buddy trees – one for each migration type.
An astute reader may ask how the buddy tree is being created – there are,
after all, no parent or child pointers. The tree is implicit. We will soon show that
the buddy and the parent of any block can be computed purely from the block's
page frame number using simple arithmetic.
Listing 6.18 shows the code for freeing an aggregate page (block in the buddy
system). In this case, we start from the block that we want to free and keep
proceeding towards the root. Given the page, we find the page frame number
of the buddy. If the buddy is not free, then find_buddy_page_pfn returns
NULL. Then, we exit the for loop and go to the label done_merging. If this is not
the case, we delete the buddy and coalesce the page with the buddy.
Let us explain this mathematically. Assume that pages with frame numbers
A and B are buddies of each other. Let the order be φ. Without loss of
generality, let us assume that A < B. Then we can say that B = A + 2^φ, where
φ = 0 for the lowest level (the unit here is pages). Now, if we want to combine
A and B and create one single block that is twice the block size of A and B,
then it needs to start at A and its size needs to be 2^(φ+1) pages.
Let us now remove the restriction that A < B. Let us just assume that they
are buddies of each other. We then have A = B ⊕ 2^φ. Here ⊕ stands for the XOR
operator. Then, if we coalesce them, the aggregate page corresponding to the
parent node needs to have its starting pfn (page frame number) at min(A, B).
This is the same as A & B, where & stands for the bitwise AND operation. This
is because A and B differ at a single bit: the (φ + 1)th bit (the LSB is bit #1). If we
compute the bitwise AND, then this bit gets set to 0, and we get the minimum of
the two pfns. Let us now compute min(A, B) − A. It can either be 0 or −2^φ,
where the order is φ.
We implement exactly the same logic in Listing 6.18, where A and B correspond to
pfn and buddy_pfn, respectively. The combined_pfn represents the minimum: the
starting pfn of the new aggregate page. The expression combined_pfn -
pfn is the same as min(A, B) − A. If A < B, it is equal to 0, which means
that the aggregate page (corresponding to the parent) starts at struct page* page.
However, if A > B, then it starts at page minus an offset. The offset should
be equal to A − B multiplied by the size of struct page. In this case, A − B
is equal to pfn - combined_pfn. The reason that this offset gets multiplied
by sizeof(struct page) is that when we do pointer arithmetic in C, any constant
that gets added to or subtracted from a pointer automatically gets multiplied by the
size of the structure (or data type) that the pointer is pointing to. In this case,
the pointer is pointing to data of type struct page. Hence, the negative offset
combined_pfn - pfn also gets multiplied by sizeof(struct page). This gives us
the starting address of the aggregate page (corresponding to the parent node).
done_merging:
	/* set the order of the new (coalesced) block */
	set_buddy_order(page, order);
	/* add it to the free list of that order (and migration type) */
	add_to_free_list(page, zone, order, migratetype);
}
Once we combine a page and its buddy, we increment the order and try to
combine the parent with its buddy, and so on. This process continues as long as we
keep finding free buddies. The moment we do not, we break from the loop and reach
the label done_merging.
Here we set the order of the merged (coalesced) page and add it to the free list
at the corresponding order. This completes the process of freeing a node in the
buddy tree.
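Since only the tail of Listing 6.18 appears above, the following is a hedged
reconstruction of the merge loop in the spirit of the kernel's __free_one_page()
function. It uses the kernel-internal helpers named in the listing; locking, sanity
checks and several details are elided.

static void free_block(struct zone *zone, struct page *page,
                       unsigned long pfn, unsigned int order, int migratetype)
{
	unsigned long buddy_pfn, combined_pfn;
	struct page *buddy;

	while (order < MAX_ORDER - 1) {
		/* find_buddy_page_pfn returns NULL if the buddy is not free */
		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
		if (!buddy)
			goto done_merging;

		/* detach the free buddy and coalesce the two blocks */
		del_page_from_free_list(buddy, zone, order);
		combined_pfn = buddy_pfn & pfn;        /* min of the two pfns */
		page = page + (combined_pfn - pfn);    /* struct page of the parent block */
		pfn = combined_pfn;
		order++;
	}

done_merging:
	set_buddy_order(page, order);
	add_to_free_list(page, zone, order, migratetype);
}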
The buddy system overlays a possibly unbalanced binary tree over a lin-
ear array of pages. Each node of the tree corresponds to a set of contigu-
ous pages (the number is a power of 2). The range of pages represented
by a node is equally split between its children (left-half and right-half).
This process continues recursively. The allocations are always made at
the leaf nodes that are also constrained to have a capacity of N pages,
where N is a power of 2. It is never the case that two children of the
same node are both free (unallocated). If that ever happens, we need to delete
them and make the parent a leaf node. Whenever an allocation is made in
a leaf node that exceeds the minimum page size, the allocated memory
always exceeds 50% of the capacity of that node (otherwise we would
have split that node).
Figure 6.27: Full, partial and free slabs, each storing a set of objects, managed by a kmem_cache_node within a dedicated memory region
Each CPU maintains an array cache whose entries are recently freed objects, which
can be quickly reused. This is a very fast way of allocating an object without
accessing other data structures to find which
object is free. Every object in this array is associated with a slab. Sadly, when
such an object is allocated or freed, the state in its encapsulating slab needs to
also be changed. We will see later that this particular overhead is not there in
the slub allocator.
Now, if there is a high demand for objects, then we may run out of free
objects in the per-CPU array cache. In such a case, we need to find a slab
that has a free object available.
It is very important to appreciate the relationship between a slab and the
slab cache at this point of time. The slab cache is a system-wide pool whose job
is to provide a free object and also take back an object after it has been used
(added back to the pool). A slab on the other hand is just a storage area for
storing a set of k objects: both active and inactive.
The slab cache maintains three kinds of slab lists – full, partial and free –
for each NUMA node. The full list contains only slabs that do not have any
free object. The partial list contains a set of partially full slabs and the free
list contains a set of slabs that do not even have a single allocated object. The
algorithm is to first query the list of partially full slabs and find a partially full
slab. Then in that slab, it is possible to find an object that has not been allo-
cated yet. The state of the object can then be initialized using an initialization
function whose pointer must be provided by the user of the slab cache. The
object is now ready for use.
However, if there are no partially full slabs, then one of the empty slabs
needs to be taken and converted to a partially full slab by allocating an object
within it.
We follow the reverse process when returning an object to the slab cache.
Specifically, we add it to the array cache, and set the state of the slab that
c Smruti R. Sarangi 308
the object is a part of. This can easily be found out by looking at the address
of the object and then doing a little bit of pointer math to find the nearest
slab boundary. If the slab was full, then now it is partially full. It needs to be
removed from the full list and added to the partially full list. If this was the
only allocated object in a partially full slab, then the slab is empty now.
We assume that a dedicated region in the kernel’s memory map is used to
store the slabs. Clearly all the slabs have to be in a contiguous region of the
memory such that we can do simple pointer arithmetic to find the encapsulating
slab. The memory region corresponding to the slabs and the slab cache can be
allocated in bulk using the high-level buddy allocator.
This is a nice, flexible and rather elaborate way of managing physical memory
for storing objects of only a particular type. A criticism of this approach is that
there are too many lists, and we frequently need to move slabs from one list to
the other.
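As a usage example, kernel code typically interacts with a slab cache through the
kmem_cache API. The sketch below is illustrative only; my_obj, my_obj_ctor and
my_module_init are made-up names.

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/errno.h>

struct my_obj {
	int id;
	struct list_head link;
};

static struct kmem_cache *my_cache;

/* Object constructor: invoked when a slab's objects are first set up. */
static void my_obj_ctor(void *p)
{
	struct my_obj *o = p;

	o->id = -1;
	INIT_LIST_HEAD(&o->link);
}

static int __init my_module_init(void)
{
	struct my_obj *o;

	/* Create a cache of fixed-size my_obj objects. */
	my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
				     0, SLAB_HWCACHE_ALIGN, my_obj_ctor);
	if (!my_cache)
		return -ENOMEM;

	o = kmem_cache_alloc(my_cache, GFP_KERNEL);  /* grab a free object */
	if (o)
		kmem_cache_free(my_cache, o);        /* return it to the pool */
	return 0;
}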
Figure 6.28: The slub allocator – the kmem_cache holds a per-CPU kmem_cache_cpu (with a pointer to the current per-CPU slab and a freelist of free objects) and per-NUMA-node kmem_cache_node structures that track partial slabs; each slab records its object size, in-use count, constructor and freelist. Empty slabs are returned to the memory system and full slabs are simply forgotten.
We reuse the same slab structure that was used for designing the slab allocator. We
specifically make use of the inuse field to find the number of objects that are
currently being used and the freelist. Note that we have compressed the slab
part in Figure 6.28 and just summarized it. This is because it has been shown
in its full glory in Figure 6.27.
Here also every slab has a pointer to the slab cache (kmem_cache). However,
the slab cache is architected differently. Every CPU in this case is given a private
slab that is stored in its per-CPU region. We do not have a separate set of free
objects for quick allocation. It is necessary to prioritize regularity for achieving
better performance. Instead of having an array of recently-freed objects, a slab
is the basic/atomic unit here. From the point of view of memory space usage
and sheer simplicity, this is a good idea.
There are performance benefits because there is more per-CPU space, and
it is quite easy to manage it. Recall that in the case of the slab allocator, we
had to also go and modify the state of the slabs that encapsulated the allocated
objects. Here we maintain state at only one place, and we never separate an
object from its slab. All the changes are confined to a slab and there is no need
to go and make changes at different places. We just deal in terms of slabs and
assign them to the CPUs and slab caches at will. Given that a slab is never split
into its constituent objects, their high-level management is quite straightforward.
If the per-CPU slab becomes full, all that we need to do in this case is simply
forget about it and find a new free slab to assign to the CPU. In this case, we
do not maintain a list of fully free and full slabs. We just forget about them.
We only maintain a list of partially full slabs, and query this list of partially full
slabs, when we do not find enough objects in the per-CPU slab. The algorithm
is the same. We find a partially full slab and allocate a free object. If the
partially full slab becomes full, then we remove it from the list and forget about
it. This makes the slab cache much smaller and more memory efficient. Let us
now see where pointer math is used. Recall that the slub allocator heavily relies
on pointer arithmetic.
Note that we maintain a list of neither full slabs nor empty slabs. Instead,
we choose to just forget about them. Now if an object is deallocated, we need
to return it back to the pool. From the object’s address, we can figure out that
it was a part of a slab. This is because slabs are stored in a dedicated memory
region. Hence, the address is sufficient to figure out that the object is a part of
a slab, and we can also find the starting address of the slab by computing the
nearest “slab boundary”. We can also figure out that the object is a part of a
full slab because the slab is not present in the slab cache. Now that the object
is being returned to the pool, a full slab becomes partially full. We can then
add it to the list of partially full slabs in the slab cache.
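A minimal sketch of this pointer math is shown below. It assumes, purely for
illustration, that every slab is a naturally aligned region of SLAB_BYTES bytes
carved out of the dedicated slab area; the real kernel instead consults the
metadata of the enclosing physical page.

#include <stdint.h>

#define SLAB_BYTES (8 * 4096)   /* hypothetical: an order-3 slab */

/* Mask off the low-order bits of the object's address to obtain the start
 * of its (assumed naturally aligned) slab. */
static inline uintptr_t slab_base(const void *object)
{
	return (uintptr_t)object & ~((uintptr_t)SLAB_BYTES - 1);
}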
Exercises
Ex. 4 — Let us say that we want to switch between user-mode processes with-
out flushing the TLB or splitting the virtual address space among user processes.
How can we achieve this with minimal hardware support?
* Ex. 5 — We often transfer data between user programs and the kernel. For
example, if we want to write to a device, we first store our data in a character
array, and transfer a pointer to the array to the kernel. In a simple implemen-
tation, the kernel first copies data from the user space to the kernel space, and
then proceeds to write the data to the device. Instead of having two copies of the
same data, can we have a single copy? This will lead to a more high-performance
implementation. How do we do it, without compromising on security?
Now, consider the reverse problem, where we need to read a device. Here also,
the kernel first reads data from the device, and then transfers data to the user’s
memory space. How do we optimize this, and manage with only a single copy
of data?
Ex. 6 — Prove the optimality of the optimal page replacement algorithm.
Ex. 9 — When and how is the MRU page replacement policy better than the
LRU page replacement policy?
Ex. 10 — What is the reason for setting the page size to 4 KB? What happens
if the page size is higher or lower? List the pros and cons.
311 c Smruti R. Sarangi
Ex. 11 — Consider a memory that can hold only 3 frames. We have a choice
of two page-replacement algorithms: LRU and LFU.
a) Show a page access sequence where LRU performs better than LFU.
b) Show a page access sequence where LFU performs better than LRU.
Explain the insights as well.
Ex. 13 — What are the causes of thrashing? How can we prevent it?
Ex. 14 — What is the page walking process used for in the MG-LRU algo-
rithm? Answer in the context of the lru_gen_mm_state structure.
Ex. 15 — How is a Bloom filter used to reduce the overhead of page walking?
Ex. 16 — What is the need to deliberately mark actively used pages as “non-
accessible”?
Ex. 17 — What is the swappiness variable used for, and how is it normally
interpreted? When would you prefer evicting FILE pages as opposed to ANON
pages, and vice versa? Explain with use cases.
Ex. 18 — Let us say that you want to “page” the page table. In general, the
page table is stored in memory, and it is not removed or swapped out – it is
basically pinned to memory at a pre-specified set of addresses. However, now let
us assume that we are using a lot of storage space to store page tables, and we
would like to page the page tables such that parts of them, that are not being
used very frequently, can be swapped out. Use concepts from folios, extents and
inodes to create such a swappable page table.
Ex. 19 — How is reverse mapping done for ANON and FILE pages?
Ex. 20 — How many anon_vma structures is an anon_vma_chain connected
to?
Ex. 21 — Why do we need separate anon_vma_chain structures for shared
COW pages and private pages?
Ex. 22 — Given a page, what is the algorithm for finding the pfn number of
its buddy page, and the pfn number of its parent?
Ex. 23 — What are the possible advantages of handing over full slabs to the
baseline memory allocation system in the SLUB allocator?
Ex. 24 — Compare the pros and cons of all the kernel-level memory alloca-
tors.
Chapter 7
The I/O System, Storage Devices
and Device Drivers
There are three key functions of an OS: process manager, memory manager and
device manager. We have already discussed the role of the OS for the former
two functionalities in earlier chapters. Let us now come to the role of the OS in
managing devices, especially storage devices. As a matter of fact, most low-level
programmers actually work in the space of writing
device drivers for I/O devices. Core kernel developers in comparison are much
fewer, mainly because 70% of the overall kernel code is accounted for by device
drivers. This is expected mainly because a modern OS supports a very large
number of devices and each device pretty much needs its own custom driver.
Of course with the advent of USB technology, some of that is changing in the
sense that it is possible for a single USB driver to handle multiple devices.
For example, a generic keyboard driver can take care of a large number of
USB-based keyboards. Nevertheless, given the sheer diversity of devices, driver
development still accounts for the majority of “OS work”.
In the space of devices, storage devices such as hard disks and flash/NVM
drives have a very special place. They are clearly the most important citizens
in the device world. Other devices such as keyboards, mice and web cameras
are nonetheless important, but they are clearly not in the same league as stor-
age devices. The reasons are simple. Storage devices are often needed for a
computing system to function. Such a device stores all of its data when the
system is powered off (provides nonvolatile storage). It plays a vital role in the
boot process, and also stores the swap space, which is a key component of the
overall virtual memory system. Hence, any text on devices and drivers always
has a dedicated set of sections that particularly look at storage devices and the
methods of interfacing with them.
Linux distinguishes between two kinds of devices: block and character. Block
devices read and write a large block of data at a time. For example, storage
devices are block devices that often read and write 512-byte chunks of data in
one go. On the other hand, character devices read and write a single character
or a few characters at a time. Examples of character devices are keyboards
and mice. For interfacing with character devices, a device driver is sufficient;
it can be connected to the terminal or the window manager. This provides the
user a method to interact with the underlying OS and applications.
We shall see that for managing block devices, we need to create a file system
that typically has a tree-structured organization. The internal nodes of this file
system are directories (folders in Windows). The leaf nodes are the individual
files. The file is defined as a set of bytes that has a specific structure based on
the type of data it contains. For instance, we can have image files, audio files,
document files, etc. A directory or a folder on the other hand has a fixed tabular
structure that just stores the pointers to every constituent file or directory within
it. Linux generalizes the concept of a file: for it, everything is a file, including
directories, devices, regular files and processes. This allows us to interact with all
kinds of entities within Linux using regular file-handling mechanisms.
This chapter has four parts: basics of the I/O system, details of storage
devices, structure of character and block device drivers and the design of file
systems.
Figure 7.1: The processor is connected via the front-side bus to the North Bridge chip, which in turn connects to the memory modules and to the graphics processor (over the PCI Express bus)
The South Bridge chip connects to slower peripherals such as USB devices, audio
devices and disk drives. Its role is quite important in the sense
that it needs to interface with numerous controllers corresponding to a diverse
set of buses. Note that we need additional chips corresponding to each kind of
bus, which is why when we look at the picture of a motherboard, we see many
chips. Each chip is customized for a given bus (set of devices). These chips
are together known as the chipset. They are a basic necessity in a large system
with a lot of peripherals. Without a chipset, we will not be able to connect to
external devices, notably I/O and storage devices. Over the last decade, the
North Bridge functionality has moved on-chip. Connections to the GPU and
the memory modules are also more direct in the sense that they are directly
connected to the CPUs via either dedicated buses or memory controllers.
However, the South Bridge functionality has remained as an off-chip entity in
many general purpose processors on server-class machines. It is nowadays (as of
2024) referred to as the Platform Controller Hub (PCH). Modern motherboards
still have a lot of chips including the PCH primarily because there are limits to
the functionality that can be added on the CPU chip. Let us elaborate.
I/O controller chips sometimes need to be placed close to the corresponding
I/O ports to maintain signal integrity. For example, the PCI-X controller and
the network card (on the PCI-X bus) are in close proximity to the Ethernet
port. The same is the case for USB devices and audio inputs/outputs.
Figure 7.2: Flow of actions in the kernel: application → kernel → device driver
→ CPU → I/O device (and back)
Figure 7.2 shows the flow of actions when an application interacts with an
I/O device. The application makes a request to the kernel via a system call.
This request is forwarded to the corresponding device driver, which is the only
subsystem in the kernel that can interact with the I/O device. The device driver
issues specialized instructions to initiate a connection with the I/O device. A
request gets sent to the I/O device via a set of chips that are a part of the
chipset; they route the request to the I/O device. The South
Bridge chip is one of them. Depending upon the request type, read or write, an
appropriate response is sent back. In the case of a read, it is a chunk of data
and in the case of a write, it is an acknowledgment.
The response follows the reverse path. Here there are several options. If it
was a synchronous request, then the processor waits for the response. Once it
is received, the response (or a pointer to it) is put in a register, which is visible
to the device driver code. However, given that I/O devices can take a long time
to respond, a synchronous mechanism is not always the best. Instead, an asyn-
chronous mechanism is preferred where an interrupt is raised when the response
is ready. The CPU that handles the interrupt fetches the data associated with
the response from the I/O system.
This response is then sent to the interrupt handler, which forwards it to the
device driver. The device driver processes the response. After processing the
received data, a part of it can be sent back to the application via other kernel
subsystems.
Figure 7.3: Layers of the I/O protocol stack – the protocol layer (the device-driver level transmission protocol), the network layer (routing messages to the right I/O device), the data link layer and the physical layer
In the physical layer, when a long monotonous run of 0s or 1s is sent, artificial transitions are inserted into
the data for easier clock recovery.
The data link layer has a similar functionality as the corresponding layer in
networks. It performs the key tasks of error correction and framing (chunking data
into fixed-size sets of bytes).
Finally, the protocol layer is concerned with the high-level data transfer
protocol. There are many methods of transferring data such as interrupts,
polling and DMA. Interrupts are a convenient mechanism. Whenever there
is any new data at an I/O device, it simply raises an interrupt. Interrupt
processing has its overheads.
On the other hand polling can be used where a thread continuously polls
(reads the value) an I/O register to see if new data has arrived. If there is new
data, then the I/O register stores a logical 1. The reading thread can reset this
value and read the corresponding data from the I/O device. Polling is a good
idea if there is frequent data transfer. We do not have to pay the overhead of
interrupts. We are always guaranteed to read or write some data. On the other
hand, the interrupt-based mechanism is useful when data transfer is infrequent.
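A minimal polling sketch is shown below. The status and data pointers are
assumed to be memory-mapped device registers (hypothetical); a status value
of 1 indicates that a new datum is ready.

#include <stdint.h>

static uint32_t poll_and_read(volatile uint32_t *status, volatile uint32_t *data)
{
	while (*status == 0)
		;                        /* spin: keep reading the status register */
	uint32_t value = *data;      /* fetch the newly arrived datum */
	*status = 0;                 /* reset the flag so the device can post more data */
	return value;
}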
The last method is outsourcing the entire process to a DMA (Direct Memory
Access) controller. It performs the full I/O access (read or write) on its own
and raises an interrupt when the overall operation has completed. This is useful
for reading/ writing large chunks of data.
Figure 7.4: I/O ports
Instruction             Semantics
in r1, ⟨i/o port⟩       r1 ← contents of ⟨i/o port⟩
out r1, ⟨i/o port⟩      contents of ⟨i/o port⟩ ← r1
This mechanism is known as port-mapped I/O. An I/O request contains the address of the I/O port. We
can use the in and out instructions to read the contents of an I/O port or
write to it, respectively. The pipeline of a processor sends an I/O request to
the North Bridge chip, which in turn forwards it to the South Bridge chip. The
latter forwards the request to the destination – the target I/O device. This uses
the routing resources available in the chipset. This pretty much works like a
conventional network. Every chip in the chipset maintains a small routing table;
it knows how to forward the request given a target I/O device. The response
follows the reverse path, which is towards the CPU.
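For illustration, the sketch below wraps the x86 in and out instructions using
GCC-style inline assembly. Such code is privileged and is shown only to make the
instruction semantics concrete; it is not portable beyond x86.

#include <stdint.h>

/* Read one byte from an I/O port. */
static inline uint8_t inb(uint16_t port)
{
	uint8_t value;
	__asm__ volatile("inb %1, %0" : "=a"(value) : "Nd"(port));
	return value;
}

/* Write one byte to an I/O port. */
static inline void outb(uint16_t port, uint8_t value)
{
	__asm__ volatile("outb %0, %1" : : "a"(value), "Nd"(port));
}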
This is a simple mechanism that has its share of problems. The first is that
it has very high overheads. An I/O port is 8 to 32 bits wide, which means
that we can only read or write 1 to 4 bytes of data at a time. This basically
means that if we want to access a high-bandwidth device such as a scanner or
a printer, a lot of I/O instructions need to be issued. This puts a lot of load on
the CPU’s pipeline and prevents the system from doing any other useful work.
We need to also note that such I/O instructions are expensive instructions in
the sense that they need to be executed sequentially. They have built-in fences
(memory barriers). They do not allow reordering. I/O instructions permanently
change the system state, and thus no other instruction – I/O or regular memory
read/write – can be reordered with respect to them.
Along with bandwidth limitations and performance overheads, using such
instructions makes the code less portable across architectures. Even if the code is
migrated to another machine, it is not guaranteed to work because the addresses
of the I/O ports assigned to a given device may vary. The assignment of I/O
port numbers to devices is a complicated process. For devices that are integrated
into the motherboard, the port numbers are assigned at the manufacturing time.
For other devices that are inserted into expansion slots, PCI-express buses, etc.,
the assignment is done at boot time by the BIOS. Many modern systems can
modify the assignments after booting. This is why there can be a lot of variance
in the port numbers across machines, even of the same type.
Now, if we try to port the code to a different kind of machine, for example,
if we try to port the code from an Intel machine to an ARM machine, then
pretty much nothing will work. ARM has a very different I/O port architecture.
Note that the in and out assembly instructions are not supported on ARM
machines. At the code level, we thus desire an architecture-independent solution
for accessing I/O devices. This will allow the kernel or device driver code to be
portable to a large extent. The modifications to the code required to port it to
a new architecture will be quite limited.
Note that the I/O address space is only 64 KB using this mechanism. Often
there is a need for much more space. Imagine we are printing a 100 MB file; we
would need a fair amount of buffering capacity on the port controller. This is
why many modern port controllers include some amount of on-device memory.
It is possible to write to the memory in the port controller directly using conven-
tional instructions or DMA-based mechanisms. GPUs are prominent examples
in this space. They have their own memory, and the CPU can write to it. Many modern
devices have started to include such on-device memory. USB 3.0, for example,
has about 250 KB of buffer space on its controllers.
Ever since its introduction, the hard disk has been the dominant storage technology
in all kinds of computing devices, starting from laptops to desktops to servers.
After 2015, it started to get
challenged in a big way by other technologies that rely on nonvolatile memory.
However, hard disks are still extremely popular as of 2024, given their scalability
and cost advantages.
Figure 7.5: The clock signal and the corresponding data bits (0 1 1 1 0 1 0 1 0 0 0) read off the recording surface
We need to understand that the hard disk's read/write head moves very quickly
over the recording surface. At periodic intervals, it needs to read the bits stored
on the recording surface. Note that the magnetic field is typically not directly
measured; instead, the change in the magnetic field is noted. It is much easier to
do so. Given that a changing magnetic field induces a current across terminals
on the disk head, this can be detected very easily electronically. Let us say this
happens at the negative edge of the clock. We need perfect synchronization
here. This means that whenever the clock has a negative edge, that is exactly
when a magnetic field transition should be happening. We can afford to have
a very accurate clock, but placing magnets, which are physical devices, so
accurately on the recording surface is difficult. There will be some variation in
the production process. Hence, there is a need to periodically resynchronize the
clock with the magnetic field transitions recorded by the head while traversing
over the recording surface. Some minor adjustments are continuously required.
If there are frequent 0 → 1 and 1 → 0 transitions in the stored data, then such
resynchronization can be done.
However, it is possible that the data has a long sequence of 0s and 1s. In
this case, it is often necessary to introduce dummy transitions for the purpose
of synchronization. In the light of this discussion, let us try to understand the
NRZI protocol. A value equal to 0 maintains the voltage value, whereas a
value equal to 1 flips the voltage. If the voltage is high, it becomes low, and
vice versa. A logical 1 thus represents a voltage transition, whereas a logical 0
simply maintains the value of the voltage. It is true that there are transitions in
this protocol whenever there is a logical 1; however, there could still be a long
run of 0s. This is where it is necessary to introduce a few dummy transitions.
The dummy data is discarded later.
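A minimal sketch of NRZI encoding is shown below: a logical 1 flips the output
level, while a logical 0 keeps it unchanged. The insertion of dummy transitions
for long runs of 0s is deliberately omitted.

#include <stdint.h>
#include <stddef.h>

/* 'bits' holds logical bits (0/1); 'levels' receives the corresponding
 * voltage levels (0 = low, 1 = high). */
void nrzi_encode(const uint8_t *bits, uint8_t *levels, size_t n)
{
	uint8_t level = 0;              /* assume the line starts at the low level */

	for (size_t i = 0; i < n; i++) {
		if (bits[i])
			level ^= 1;     /* logical 1: flip the level (a transition) */
		levels[i] = level;      /* logical 0: keep the level unchanged */
	}
}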
Figure 7.6: Arrangement of tiny magnets on a hard disk’s recording surface
Figure 7.7: A platter divided into concentric tracks; each track is further divided into sectors
Let us now understand how these small magnets are arranged on a circular
disk that is known as a platter. As we can see in Figure 7.7, the platter is divided
into concentric rings that contain such tiny magnets. Each such ring is called a
track. It is further divided into multiple sectors. Each sector typically has the
same size: 512 bytes. In practice, a few more bytes are stored for the sake of
error correction. To maximize the storage density, we would like each individual
magnet to be as small as possible. However, there is a trade-off here. If the
magnets are very small, then the EMF that will be induced will be very small
and will become hard to detect. As a result, there are technological limitations
on the storage density.
Hence, it is a wise idea to store different numbers of sectors per track. The
number of sectors that we store per track depends on the latter’s circumference.
The tracks towards the periphery shall have more sectors and tracks towards the
center will have fewer sectors. Modern hard disks are actually slightly smarter.
They divide the set of tracks into contiguous sets of rings called zones. Each zone
has the same number of sectors per track. The advantage of this mechanism is
that the electronic circuits get slightly simplified given that the platters rotate
at a constant angular velocity. Within a zone, we can assume that the same
number of sectors pass below the head each second.
Definition 7.2.1 Key Elements of a Hard Disk
• Platter: a circular disk, coated with tiny magnets, that stores data on its recording surfaces.
• Spindle: the central shaft on which the platters are mounted; it is rotated at a constant angular velocity by the spindle motor.
• Head: the read/write element that senses (or sets) the magnetic state of the recording surface.
• Arm: the rotating assembly that positions the heads over the desired track.
The structure of a hard disk is shown in Figures 7.8 and 7.9. As we can
see, there is a set of platters with a spindle passing through their centers.
Figure 7.9: Internals of a hard disk – platters mounted on a spindle (driven by the spindle motor), read/write heads on an arm positioned by an actuator, the drive electronics and the bus interface
The spindle itself is controlled by a spindle motor that rotates the platters at a
constant angular velocity. There are disk heads on top of each of the recording
surfaces. These heads are connected to a common rotating arm. Each disk head
can read as well as write data. Reading data involves sensing whether there is
a change in the voltage levels or not (presence or absence of an induced EMF).
Writing data involves setting a magnetic field using a small electromagnet. This
aligns the magnet on the platter with the externally induced magnetic field. We
have a sizable amount of electronics to accurately sense the changes in the
magnetic field, perform error correction, and transmit the bytes that are read
back to the processor via a bus.
Let us now understand how a given sector is accessed. Every sector has
a physical address. Given the physical address, the disk controller knows the
platter on which it is located. A platter can have two recording surfaces: one
on the top and one on the bottom. The appropriate head needs to be activated,
and it needs to be positioned at the beginning of the corresponding sector and
track. This involves first positioning the head on the correct track, which will
happen via rotating the disk arm. The time required for this is known as the
seek time. Once the disk head is on the right track, it needs to wait for the sector
to come underneath it. Given the fact that the platter rotates at a constant
angular velocity, this duration can be computed quite accurately. This duration
is known as the rotational latency. Subsequently, the data is read, error checking
is done, and after appropriately framing the data, it is sent back to the CPU
via a bus. This is known as the transfer latency. The formula for the overall
disk access time is shown in Equation 7.1.

Disk access time = seek time + rotational latency + transfer time        (7.1)

• The seek time is the time required to position the head on the correct
track (by rotating the disk arm).
• The rotational latency is the time that the head needs to wait for
the beginning of the desired sector to come below it after it has been
positioned on the right track.
• The transfer time is the time it takes to transfer the sector to
the CPU. This time includes the time to perform error checking,
framing and sending the data over the bus.
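To get a feel for the magnitudes, consider purely illustrative numbers: an average
seek time of 9 ms, a 7200 RPM disk (one rotation every 60/7200 s ≈ 8.33 ms, so an
average rotational latency of roughly 8.33/2 ≈ 4.17 ms), and a transfer time of a
few tens of microseconds for a single 512-byte sector (say 0.05 ms). The overall
access time is then approximately 9 + 4.17 + 0.05 ≈ 13.2 ms – the mechanical
components clearly dominate.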
Given that in a hard disk, there are mechanical parts and also the head
needs to physically move, there is a high chance of wear and tear. Hence, disk
drives have limited reliability. They mostly tend to have mechanical failures.
To provide a degree of failure resilience, the disk can maintain a set of spare
sectors. Whenever there is a fault in a sector, which will basically translate to
an unrecoverable error, one of the spare sectors can be used to replace this “bad
sector”.
There are many optimizations possible here. We will discuss many of these
when we introduce file systems. The main idea here is to store a file in such a
way on a storage device that it can be transferred to memory very quickly. This
means that the file system designer has to have some idea of the physical layout
of the disk and the way in which physical addresses are assigned to logical
addresses. If some of this logic is known, then the seek time, as well as the
rotational latency can be reduced substantially. For instance, in a large file all
the data sectors can be placed one after the other on the same track. Then
they can be placed in corresponding tracks (same distance from the center)
in the rest of the recording surfaces such that the seek time is close to zero.
This will ensure that transitioning between recording surfaces will not involve
a movement of the head in the radial direction.
All the tracks that are vertically above each other have almost the same
distance from the center. We typically refer to a collection of such tracks using
the term cylinder. The key idea here is that we need to preserve locality and
thus ensure that all the bytes in a file can quickly be read or written one after
the other. Once a cylinder fills up, the head can move to the adjacent cylinder
(next concentric track), so on and so forth.
7.2.2 RAID
Hard drives are relatively flimsy and have reliability issues. This is primarily
because they rely on mechanical parts, which are subject to wear and tear. They
thus tend to fail. As a result, it is difficult to create large storage arrays that
comprise hard disks. We need to somehow make large storage arrays resilient to
disk failures. There is a need to have some built-in redundancy in the system.
The concept of RAID (Redundant Array of Inexpensive Disks) was proposed
to solve such problems. Here, the basic idea is to have additional disks that
store redundant data. In case a disk fails, other disks can be used to recover
the data. The secondary objective of RAID-based solutions is to also enhance
the bandwidth given that we have many disks that can be used in parallel. If
we consider the space spanned by these two aims – reliability and performance – we
can design many RAID solutions that cater to different kinds of users. The user
can choose the best solution based on her requirements.
RAID 0
RAID 0 (shown in Figure 7.10) stripes the blocks across the disks: consecutive
blocks are placed on different disks (B1 on Disk 1, B2 on Disk 2, B3 on Disk 1,
and so on). This enhances the read and write bandwidth because multiple disks
can be accessed in parallel; however, it provides no redundancy – if one disk fails,
the data is lost.

Disk 1   Disk 2
B1       B2
B3       B4
B5       B6
B7       B8
B9       B10

Figure 7.10: RAID 0
RAID 1
On the other hand, RAID 1 (shown in Figure 7.11) enhances the reliability.
Here the same block is stored across the two disks. For example, block B1 is
stored on both the disks: Disk 1 and 2. If one of the disks fails, then the other
disk can be used to service all the reads and writes (without interruption). Later
on, if we decide to replace the failed disk then the other disk that is intact can
provide all the data to initialize the new disk.
Disk 1   Disk 2
B1       B1
B2       B2
B3       B3
B4       B4
B5       B5
Figure 7.11: RAID 1
This strategy does indeed enhance the reliability by providing a spare disk.
However, the price that is incurred is that for every write operation, we actually
need to write the same copy of the block to both the disks. Reads are still fast
because we can choose one of the disks for reading. We especially choose the
one that is lightly loaded to service the read. This is sadly not possible in the
case of write operations.
RAID 2, 3 and 4
Disk 1   Disk 2   Disk 3   Disk 4   Parity disk
B1       B2       B3       B4       P1
B5       B6       B7       B8       P2
B9       B10      B11      B12      P3
B13      B14      B15      B16      P4
B17      B18      B19      B20      P5

Figure 7.12: RAID 2/3/4 – four data disks and a dedicated parity disk
We clearly have some issues with RAID 1 because it does not enhance the
bandwidth of write operations. In fact, in this case we need to write the same
data to multiple disks. Hence, a series of solutions have been proposed to
ameliorate this issue. They are named RAID 2, 3 and 4, respectively. All of
them belong to the same family of solutions (refer to Figure 7.12).
In the figure we see an array of five disks: four store regular data and one
stores parities. Recall that the parity of n bits is just their XOR. If one of the
bits is lost, we can use the parity to recover the lost bit. The same can be done
at the level of 512-byte blocks as well. If one block is lost due to a disk failure, it
can be recovered with the help of the parity block. As we can see in the figure,
the parity block P1 is equal to B1 ⊕ B2 ⊕ B3 ⊕ B4, where ⊕ stands for the
XOR operation. Assume that the disk with B2 fails. We can always compute
B2 as B1 ⊕ P1 ⊕ B3 ⊕ B4.
Let us instead focus on some differences across the RAID levels: 2, 3 and 4.
RAID 2 stores data at the level of a single bit. This means that its block size is
just a single bit, and all the parities are computed at the bit level. This design
offers bit-level parallelism, where we can read different bit streams in parallel
and later on fuse them to recreate the data. Such a design is hardly useful,
unless we are looking at bit-level storage, which is very rare in practice.
RAID level 3 increases the block size to a single byte. This allows us to read
or write to different bytes in parallel. In this case, Disk i stores all the bytes at
locations 4n + i. Given a large file, we can read its constituent bytes in parallel,
and then interleave the byte streams to create the file in memory. However, this
reconstruction process is bound to be slow and tedious. Hence, this design is
neither very efficient nor very widely used.
Finally, let us consider RAID 4, where the block size is equal to a conven-
tional block size (512 bytes). This is typically the size of a sector in a hard disk
and thus reconstructing data at the level of blocks is much easier and much more
intuitive. Furthermore, it is also possible to read multiple files in parallel given
that their blocks are distributed across the disks. Such designs offer a high level
of parallelism and if the blocks are smartly distributed across the disks, then a
theoretical bandwidth improvement of 4× is possible in this case.
There is sadly a problem with these RAID designs. The issue is that there
is a single parity disk. Whenever, we are reading something, we do not have
to compute the parity because we assume that if the disk is alive, then the
block that is read is correct. Of course, we are relying on block-level error
checking, and we are consequently assuming that they are sufficient to attest
the correctness of the block’s contents. Sadly, in this case writing data is much
more onerous. Let us first consider a naive solution.
We may be tempted to argue that to write to any block, it is necessary to
read the rest of the blocks from the other disks and compute the new value of
the parity. It turns out that there is no need to actually do this; we can instead
rely on an interesting property of the XOR function. Assume that block B1 is
being overwritten with a new value B1′. Note the following:

P1 = B1 ⊕ B2 ⊕ B3 ⊕ B4
P1′ = P1 ⊕ B1′ ⊕ B1 = B1′ ⊕ B2 ⊕ B3 ⊕ B4        (7.2)
The new parity P1′ is thus equal to B1′ ⊕ B2 ⊕ B3 ⊕ B4. We thus have a neat
optimization here; it is not necessary to read the rest of the disks. Nevertheless,
there is still a problem. For every write operation, the parity disk has to be read,
and it has to be written to. This makes the parity disk a point of contention –
it will slow down the system because of requests queuing up. Moreover, it will
also see a lot of traffic, and thus it will wear out faster. This will cause many
reliability problems, and the parity disk will most likely fail the first. Hence,
there is a need to distribute the parity blocks across these disks. This is precisely
the novelty of RAID 5.
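Before moving on, the small-write optimization of Equation 7.2 can be expressed
directly in code. The sketch below recomputes a parity block byte by byte; the
same XOR computation, run over the surviving blocks, is also what reconstructs
a lost block.

#include <stdint.h>
#include <stddef.h>

/* new_parity = old_parity XOR old_data XOR new_data, computed byte by byte
 * over one block. */
void raid_update_parity(uint8_t *parity, const uint8_t *old_data,
                        const uint8_t *new_data, size_t block_size)
{
	for (size_t i = 0; i < block_size; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}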
RAID 5
Figure 7.13 shows a set of disks with distributed parity, where there is no single
disk dedicated to exclusively storing parity blocks. We observe that for the first
set of blocks, the parity block is in Disk 5. Then for the next set, the parity
block P2 is stored in Disk 1, so on and so forth. Here the block size is typically
equal to the block size of RAID 4, which is normally the disk sector size, i.e.,
512 bytes. The advantage here is that there is no single disk that is a point
of contention. The design otherwise has the rest of the advantages of RAID 4,
which are basically the ability to support parallel read accesses and optimized
write accesses. The only disks that one needs to access while writing are as
follows: the disk that is being written to and the parity disk.
Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
B1       B2       B3       B4       P1
P2       B5       B6       B7       B8
B9       P3       B10      B11      B12
B13      B14      P4       B15      B16
B17      B18      B19      P5       B20

Figure 7.13: RAID 5 – the parity blocks are distributed across the disks
RAID 6
Let us now ask a more difficult question, “What if there are two disk failures?”
Having a single parity block will not solve the problem. We need at least two
parity blocks. The mathematics to recover the contents of the blocks is also
much more complex.
Without getting into the intricate mathematical details, it suffices to say
that we have two parity blocks for every set of blocks, and these blocks are dis-
tributed across all the disks such that there is no point of contention. Figure 7.14
pictorially describes the scheme.
7.2.3 SSDs
Let us next discuss another genre of storage devices that rely on semiconductor
technologies. The technology that is used here is known as flash. This technol-
ogy is used to create SSDs (solid state drives). Such storage technologies do
not use magnets to store bits, and they also do not have any mechanical parts.
Hence, they are both faster and often more reliable as well. Sadly, they have
their share of failure mechanisms, and thus they are not as reliable as we think
they perhaps are. Nevertheless, we can confidently say that they are immune
Disk 1   Disk 2   Disk 3   Disk 4   Disk 5   Disk 6
B1       B2       B3       B4       P1A      P1B
P2B      B5       B6       B7       B8       P2A
P3A      P3B      B9       B10      B11      B12
B13      P4A      P4B      B14      B15      B16
B17      B18      P5A      P5B      B19      B20

Figure 7.14: RAID 6 – two parity blocks per set of blocks, distributed across the disks
Basic Operation
Let us understand at a high level how they store a bit. Figure 7.15 shows a novel
device that is known as a floating gate transistor. It looks like a normal NMOS
transistor with its dedicated source and drain terminals and a gate connected
to an external terminal (known as the control gate). Here, the interesting point
to note is that there are actually two gates stacked on top of each other. They
are separated by an insulating silicon dioxide layer.
Let us focus on the gate that is sandwiched between the control gate and the
transistor’s channel. It is known as the floating gate. If we apply a very strong
positive voltage, then electrons will get sucked into the floating gate because of
the strong positive potential and when the potential is removed, many of the
electrons will actually stay back. When they stay back in this manner, the cell
is said to be programmed. We assume that at this point it stores a logical 0. If
we wish to reset the cell, then there is a need to actually push the electrons back
into the transistor’s substrate and clear the floating gate. This will necessitate
the application of a strong negative voltage at the control gate terminal, which
will push the electrons back into the transistor’s body. In this case, the floating
gate transistor or the flash cell are said to be reset. The cell stores a logical 1
in this state.
Let us now look at the process of reading the value stored in such a
memory cell. When the cell is programmed, its threshold voltage rises. It
becomes equal to VT+ , which is higher than the normal threshold voltage VT .
Hence, to read the value in the cell we set the gate voltage equal to a value that
is between VT and VT+ . If it is not programmed, then the voltage will be higher
than the threshold voltage and the cell will conduct current, otherwise it will
be in the cutoff state and will not conduct current. This is known as enabling
the cell (or the floating gate transistor).
Figure 7.15: Floating gate transistor
P/E Cycles
Let us now discuss a very fascinating aspect of such flash-based devices. These
devices provide read-write access at the level of pages, not bytes – we can only
read or write a full page (512-4096 bytes) at a time. We cannot access data at
a smaller granularity. As we have seen, the storage of data within such devices
is reasonably complicated. We have fairly large flash cells and reading them
requires some work. Hence, it is a much better idea to read a large number of
bytes in one go such that a lot of the overheads can be amortized. Enabling
these cells and the associated circuits has time overheads, which
necessitates page-level accesses. Hence, reading or writing small chunks of data,
let’s say a few bytes at a time, is not possible. We would like to emphasize here
that even though the term “page” is being used, it is totally different from a
page in virtual memory. They just happen to share the same name.
Let us now look at writes. In general, almost all such devices have a DRAM-
backed cache that accumulates/coalesces writes. A write is propagated to the
array of flash cells either periodically, when there is an eviction from the cache, or
when the device is being ejected. In all cases effecting a write is difficult mainly
because there is no way of directly writing to a flash cell that has already been
programmed. We need to first erase it or rather deprogram it. In fact, given
that we only perform page-level writes, the entire page has to be erased. Recall
that this process involves applying a very strong negative voltage to the control
gate to push the electrons in the floating gates back into the substrate.
Sadly, in practice, it is far easier to do this at the level of a group of pages, because then we can afford to have a single strong driver circuit to push the electrons back. We can successfully generate a strong enough potential to reset or deprogram the state of a large number of flash cells. In line with this philosophy, flash-based SSD devices create blocks that contain 32-128
pages. We can erase data only at the level of a block. After that, we can write
to a page, which basically would mean programming all the cells that store a
logical 0 and leaving/ignoring all the cells that store a logical 1. One may ask a
relevant question here, “What happens to the rest of the pages in a block that
are not written to?” Let us keep reading to find the answer.
We thus have a program-erase (P/E) cycle. We read or write at the granu-
larity of pages, but erase at the level of blocks (of pages). To rewrite a page, it
is necessary to erase it first. This is because the 0 → 1 transition is not possible
without an erase operation. The crux of the issue is that we cannot write to a cell that already stores a 0. This means that every page is first written to (programmed), then erased, then programmed again, and so on. This sequence constitutes the program-erase (P/E) cycle.
Let us understand in detail what happens to the data when we wish to
perform a page rewrite operation. Whenever we wish to write to a page, we
actually need to do a couple of things. The first is that we need to find another
empty (not programmed) block. Next, we copy the contents of the current block
to the location of the empty block. We omit the page that we wish to write to. This will involve many read-write operations. Subsequently, we write the modified version of the page. Note that the actual physical location of this page has now changed. It is being written to a different location, because the block that it was a part of is going to be erased, and all the other pages that were in its block have already been copied to their new locations in a new block. They are a part of a different physical block now, even though they are a part of the same logical block. This answers the question with regard to what
happens with the rest of the pages in the block.
There is therefore a need to have a table that maps a logical block to its
corresponding physical block. This is because, in designs like this, the physical
locations of the blocks are changed on every write. Whenever a block is copied
to a new address, we update the corresponding mapping. This is done by the
Flash Translation Layer (FTL) – typically firmware stored in the SSD itself. The
mapping table is also stored on the SSD. It is modified very carefully because we
don’t want any inconsistencies here. It is seldom the case that the OS maintains
this table. This is because most flash devices do not give access to the OS at
this low a level. There are experimental devices known as raw flash devices
that allow OS designers to implement the FTL in the OS and subsequently
evaluate different mapping algorithms. However, this is a rarity. In practice,
even something as simple as a pen drive has its own translation layer.
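To make the mapping concrete, the following is a minimal sketch of the kind of logical-to-physical block table that an FTL maintains. All names and sizes here (l2p_map, NUM_BLOCKS, the specific block numbers) are purely illustrative and do not correspond to any real firmware.

#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 1024   /* hypothetical number of logical blocks */

/* One entry per logical block: which physical block currently holds its data */
static uint32_t l2p_map[NUM_BLOCKS];

/* Called after a rewrite copies a logical block to a new physical block */
static void ftl_remap(uint32_t logical_blk, uint32_t new_physical_blk)
{
    l2p_map[logical_blk] = new_physical_blk;
}

/* Translate a logical block number to the physical block that stores it */
static uint32_t ftl_translate(uint32_t logical_blk)
{
    return l2p_map[logical_blk];
}

int main(void)
{
    /* start with an identity mapping */
    for (uint32_t i = 0; i < NUM_BLOCKS; i++)
        l2p_map[i] = i;

    /* suppose a rewrite of logical block 7 moved its data to physical block 900 */
    ftl_remap(7, 900);
    printf("logical block 7 -> physical block %u\n", (unsigned)ftl_translate(7));
    return 0;
}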
Reliability Issues
Let us now discuss some reliability issues. Unfortunately, a flash device as of
today can only endure a finite number of P/E cycles per physical block. The maximum number of P/E cycles is sadly not much – it is in the range of 50-150k cycles as of 2024. Beyond this point, the thin oxide layer breaks down, and the floating
gate does not remain usable anymore. Hence, there is a need to ensure that
all the blocks wear out almost at the same rate or in other words, they endure
the same number of P/E cycles. Any flash device maintains a counter for each
block. Whenever there is a P/E cycle, this counter is incremented. The idea is
to ensure that all such counts are roughly similar across all the blocks.
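The sketch below illustrates one simple wear-leveling policy under these assumptions: when a fresh block is needed for a write, pick the free block with the smallest P/E count. The names (pe_count, pick_block_for_write) are hypothetical; real devices use considerably more elaborate schemes.

#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 8   /* small hypothetical device */

/* Per-block erase (P/E) counters maintained by the device */
static uint32_t pe_count[NUM_BLOCKS];

/* Pick the free block with the smallest P/E count so that wear is spread evenly.
   'is_free' marks blocks that are currently erased and available. */
static int pick_block_for_write(const int is_free[NUM_BLOCKS])
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (is_free[i] && (best == -1 || pe_count[i] < pe_count[best]))
            best = i;
    }
    if (best != -1)
        pe_count[best]++;   /* the block will undergo one more P/E cycle */
    return best;            /* -1 means no free block: garbage collection is needed */
}

int main(void)
{
    int is_free[NUM_BLOCKS] = {1, 1, 0, 1, 0, 1, 1, 1};
    pe_count[0] = 40; pe_count[1] = 10; pe_count[3] = 25;
    pe_count[5] = 5;  pe_count[6] = 30; pe_count[7] = 12;

    printf("chosen block: %d\n", pick_block_for_write(is_free));  /* picks block 5 */
    return 0;
}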
Performance Considerations
Let us now take a high-level view and go over what we discussed. We introduced
a flash cell, which is a piece of nonvolatile memory that retains its values when
powered off, quite unlike conventional DRAM memory. Nonvolatile memories
essentially play the role of storage devices. They are clearly much faster than
hard disks, and they are slower than DRAM memory. However, this does not
come for free. There are concomitant performance and reliability problems that require both OS support and device-level features such as wear leveling and block swapping to minimize read disturbance.
Modern SSD devices take care of a lot of this within the confines of the de-
vice itself. Nevertheless, operating system support is required, especially when
we have systems with large flash arrays. It is necessary to equally distribute
requests across the individual SSD memories. This requires novel data layout
and partitioning techniques. Furthermore, we wish to minimize the write am-
plification. This is the ratio of the number of physical writes to the number
of logical writes. Writing some data to flash memory may involve many P/E
cycles and block movements. All of them increase the write amplification. This
is why there is a need to minimize all such extraneous writes that are made to
the SSD drive.
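As a quick illustration of the definition, the sketch below computes the write amplification ratio for a hypothetical case in which a single logical page write forces an entire 64-page block to be rewritten. The numbers are made up for illustration only.

#include <stdio.h>

/* Write amplification = physical writes performed by the device /
                         logical writes issued by the host */
static double write_amplification(unsigned long physical_writes,
                                  unsigned long logical_writes)
{
    return (double)physical_writes / (double)logical_writes;
}

int main(void)
{
    /* e.g., rewriting one 4 KB page forced a 64-page block to be copied */
    printf("write amplification = %.1f\n", write_amplification(64, 1));
    return 0;
}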
Most modern SSD disk arrays incorporate many performance optimizations.
They do not immediately erase a block that has served as a temporary block and
is not required anymore. They simply mark it as invalid, and it is later erased
or rather garbage collected. This is done to increase performance. Moreover,
the OS can inform the SSD disk that a given block is not going to be used in
the future. It can then be marked as invalid and can be erased later. Depending
upon its P/E count, it can be either used to store a regular block, or it can even
act as a temporary block that is useful during a block swap operation. The
OS also plays a key role in creating snapshots of file systems stored on SSD
devices. These snapshots can be used as a backup solution. Later on, if there
is a system crash, then a valid image of the system can be recovered from the
stored snapshot.
Up till now, we have used SSD drives as storage devices (as hard disk re-
placements). However, they can be used as regular main memory as well. Of
course, they will be much slower. Nevertheless, they can be used for capacity
enhancement. There are two configurations in which SSD drives are used: ver-
tical and horizontal. The SSD drive can be used in the horizontal configuration
to just increase the size of the usable main memory. The OS needs to place
physical pages intelligently across the DRAM and SSD devices to ensure opti-
mal performance. The other configuration is the vertical configuration, where
the SSD drive is between the main memory and the hard disk. It acts like a
cache for the hard disk – a faster storage device that stores a subset of the data
stored on the hard disk. In this case also, the role of the OS is crucial.
of files. All that the program needs to do is write to these pages, which is much
faster.
This sounds like a good idea; however, it seems to be hard to implement.
How does the program know if a given file offset is present in memory (in a
mapped page), or is present in the underlying storage device? Furthermore,
the page that a file address is mapped to, might change over the course of
time. Thankfully, there are two easy solutions to this problem. Let us consider
memory-mapped I/O, where file addresses are directly mapped to virtual mem-
ory addresses. In its quintessential form, the TLB is supposed to identify that
these are actually I/O addresses, and redirect the request to the I/O (storage)
device that stores the file. This is something that we do not want in this case.
Instead, we can map the memory-mapped virtual addresses to the physical ad-
dresses of pages, which are stored in the page cache. In this case, the target
of memory-mapped I/O is another set of pages located in memory itself. These pages are a part of the page cache. This optimization does not change the programmer’s view, and programs can run unmodified, albeit much faster. Memory mapping files is of course not a very scalable solution and does not work for large files.
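The following user-space sketch shows what memory-mapped file I/O looks like from the programmer’s point of view: once the file is mapped with mmap, its bytes can be accessed as ordinary memory, and the accesses are ultimately served from the page cache. The file name a.txt is just an example.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("a.txt", O_RDONLY);             /* illustrative file name */
    if (fd < 0) { perror("open"); exit(1); }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { perror("fstat"); exit(1); }
    if (sb.st_size == 0) { fprintf(stderr, "empty file\n"); exit(1); }

    /* map the whole file; reads are served from the page cache */
    char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(1); }

    /* treat the file's contents as ordinary memory: count the lines */
    long lines = 0;
    for (off_t i = 0; i < sb.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(data, sb.st_size);
    close(fd);
    return 0;
}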
The other solution is when we use I/O-mapped I/O. This means that I/O is
performed on files using read and write system calls, and the respective requests
go to the I/O system. They are ultimately routed to the storage device. Of
course, the programmer does not work with the low-level details. She simply
invokes library calls and specifies the number of bytes that need to be read or
written, including their contents (in case of writes). The library calls curate
the inputs and make the appropriate system calls. After transferring data to
the right kernel buffers, the system call handling code invokes the device driver routines that finally issue the I/O instructions. This is a long and slow process.
Modern kernels optimize this process. They hold off on issuing I/O instructions
and instead effect the reads and writes on pages in the page cache. This is
a much faster process and happens without the knowledge of the executing
program.
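In contrast with the memory-mapped approach, the sketch below uses the read and write system calls directly (bypassing the C library’s buffered FILE interface) to copy one file to another. The file names are illustrative; in practice the kernel typically services these calls from the page cache and writes the data back to the device later.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    int src = open("a.txt", O_RDONLY);                        /* illustrative names */
    int dst = open("b.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); exit(1); }

    /* each read()/write() is a system call on the opened file descriptors */
    while ((n = read(src, buf, sizeof(buf))) > 0) {
        if (write(dst, buf, n) != n) { perror("write"); exit(1); }
    }

    close(src);
    close(dst);
    return 0;
}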
Let us now look at the data structures used to manage devices in Linux in detail.
Devices of the same type are assigned the same major number, and thus share the same driver. However, individual devices are assigned different minor numbers.
For a long time, the adoption of Linux was somewhat subdued, primarily because of limited device support. Over the years, the situation has changed, which is why we see a disproportionately large fraction of driver code in the overall code base. As Linux gets more popular, we will see more driver code entering the
codebase. Note that the set of included drivers is not exhaustive. There are
still a lot of devices whose drivers are not bundled with the operating system
distribution. The drivers have to be downloaded separately. Many times there
are licensing issues, and there is also a need to reduce the overall size of the OS
install package.
Let us ask an important question at this stage. Given that the Linux kernel is
a large monolithic piece of code, should we include the code of all the drivers also
in the kernel image? There are many reasons why this should not be done. The first reason is that the image size will become very large. It may exhaust the available memory space, leaving little memory for applications. The second is that very few of the bundled drivers will actually be used – a single system will never be connected to 200 different types of printers, even though the drivers for all of these printers need to be bundled along with the OS code. Drivers are bundled in the first place because when someone connects a printer, the expectation is that things will immediately work and the required driver will get auto-loaded.
In general, if it is a common device, there should be no need to go to the web
and download the corresponding driver. This would be a very inefficient process.
Hence, it is a good idea to bundle the driver along with the OS code. However,
bundling the code does not imply that the compiled version of it should be
present in the kernel image all the time. Very few devices are connected to a
machine at runtime. Only the images of the corresponding drivers should be
present in memory.
Modules
Recall that we had a very similar discussion in the context of libraries in Ap-
pendix B. We had argued that there is no necessity to include all the library
code in a process’s image. This is because very few library functions are used
in a single execution. Hence, we preferred dynamic loading of libraries and cre-
ated shared objects. It turns out that something very similar can be done here.
Instead of statically linking all the drivers, the recommended method is to cre-
ate a kernel module, which is nothing but a dynamically linked library/shared
object in the context of the kernel. All the device drivers should preferably be
loaded as modules. At run time they can be loaded on demand. This basically
means that the moment a device is connected, we find the driver code corre-
sponding to it. It is loaded to memory on-demand the same way that we load
a DLL. The advantages are obvious: efficiency and reduced memory footprint.
To load a module, the kernel provides the insmod utility that can be invoked
by the superuser – one who has administrative access. The kernel can also trigger this action automatically, especially when a new device is connected. There is a dedicated utility called modprobe that is tasked with managing and loading modules (including their dependencies).
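As an aside, a minimal module looks roughly as follows. This is a sketch, not code taken from the kernel sources: it merely prints a message when it is loaded and unloaded, whereas real driver modules register themselves with the appropriate subsystem in their init function.

/* hello_module.c -- a minimal sketch of a loadable kernel module */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init hello_init(void)
{
    pr_info("hello module loaded\n");     /* runs at insmod/modprobe time */
    return 0;
}

static void __exit hello_exit(void)
{
    pr_info("hello module unloaded\n");   /* runs at rmmod time */
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module");

Such a module is typically compiled out of tree against the installed kernel headers (with a one-line kbuild Makefile of the form obj-m := hello_module.o) and then loaded with insmod or modprobe.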
The role of a module-loading utility is specifically as follows:
1. Locate the compiled code and data of the module, and map its pages to the kernel's address space.

2. Concomitantly, increase the kernel's runtime code size and memory footprint.

4. The symbols exported by the module are added to the global kernel symbol table.
(Figure: the kernel symbol table together with per-module symbol tables; each entry maps a symbol to the address of a function or global variable.)
(Figure: the main device-related data structures – a generic device (struct device), a block device, the bus type and the request queue.)
At the heart of the device management subsystem are two types of objects, namely generic devices (struct device) and block devices.
A device is a generic construct that can represent both character and block
devices. It points to a device driver and a bus. A bus is an abstraction for a
shared hardware interconnect that connects many kinds of devices. Examples of
such buses are USB buses and PCI Express (PCIe) buses. A bus has associated
data structures and in some cases even device drivers. Many times it is necessary
to query all the devices connected to the bus and find the device that needs to
be serviced. Hence, the generic device has two specific bindings: one with device
drivers and one with buses.
Next, let us come to the structure of a block device. It points to many other important subsystems and data structures. First, it points to a file system (discussed in detail in Section 7.6) – a mechanism to manage the full set of files
on a storage device. This includes reading and writing the files, managing the
metadata and listing them. Given that every block device stores blocks, it is
conceptually similar to a hard disk. We associate it with a gendisk structure,
which represents a generalized disk. We need to appreciate the historical sig-
nificance of this choice of the word “gendisk”. In the good old days, hard disks were pretty much the only block devices around. However, later on many other kinds of block devices such as SSD drives, USB storage devices and SD cards came along. Nevertheless, the gendisk structure still remained. It is a convenient way of
abstracting all of such block devices. Both a block device and the gendisk are
associated with a request queue. It is an array of requests that is associated
with a dedicated I/O scheduler. A struct request is a linked list of memory
regions that need to be accessed while servicing an I/O request.
/* generic parameters */
/* function pointers */
int (* probe ) ( struct device * dev ) ;
void (* sync_state ) ( struct device * dev ) ;
int (* remove ) ( struct device * dev ) ;
void (* shutdown ) ( struct device * dev ) ;
int (* suspend ) ( struct device * dev , pm_message_t state ) ;
int (* resume ) ( struct device * dev ) ;
};
The generic structure of a device driver is shown in Listing 7.2. struct device driver stores the name of the device, the type of the bus that it is connected to, and the module that corresponds to the device driver. The module is referred to as the owner; it is its job to run the code of the device driver.
The key task of the kernel is to match a device with its corresponding driver. Every device driver maintains an identifier of type of device id. It encompasses a name, a type and other compatibility information. This can be matched with the name and the type of the device.
Next, we have a bunch of function pointers, which are the callback functions. They are called by other subsystems of the kernel when there is a change in the state. For example, when a device is inserted, the probe function is called. When there is a need to synchronize the state of the device’s configuration between the in-memory buffers and the device, the sync state function is called. The remove, shutdown, suspend and resume calls retain their usual meanings.
The core philosophy here is that these functions are common to all kinds of devices. It is the job of every device driver to provide implementations for these functions. Creating such a structure of function pointers is a standard design technique – it is similar to virtual functions in C++ and abstract methods in Java.
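The sketch below illustrates this design pattern in plain C with made-up structure names (toy_device, toy_driver_ops). It is not kernel code, but it mirrors how the kernel invokes a driver through a table of function pointers without knowing which concrete driver it is calling.

#include <stdio.h>

struct toy_device;

/* The "base class": a table of function pointers that each concrete driver fills in */
struct toy_driver_ops {
    int  (*probe)(struct toy_device *dev);
    void (*shutdown)(struct toy_device *dev);
};

struct toy_device {
    const char *name;
    const struct toy_driver_ops *ops;   /* acts like a virtual function table */
};

/* A concrete driver provides its own implementations ... */
static int my_probe(struct toy_device *dev)
{
    printf("probing %s\n", dev->name);
    return 0;
}

static void my_shutdown(struct toy_device *dev)
{
    printf("shutting down %s\n", dev->name);
}

static const struct toy_driver_ops my_ops = {
    .probe    = my_probe,
    .shutdown = my_shutdown,
};

int main(void)
{
    struct toy_device dev = { .name = "toydev0", .ops = &my_ops };

    /* The "kernel" calls the driver through the function pointers,
       without knowing which concrete driver it is talking to. */
    dev.ops->probe(&dev);
    dev.ops->shutdown(&dev);
    return 0;
}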
A Generic Device
The code of struct device is shown in Listing 7.3. Every device contains a ⟨major, minor⟩ number pair (devt) and an unsigned 32-bit id.
Devices are arranged as a tree. Every device thus has a parent. It addi-
tionally has a pointer to the bus and its associated device driver. Note that a
device driver does not point to a device because it can be associated with many
devices. Hence, the device is given as an argument to the functions defined
in the device driver. However, every device needs to maintain a pointer to its
associated device driver because it is associated with only a single one.
Every block device has a physical location. There is a generic way of describ-
ing a physical location at which the block device is connected. It is specified
using struct device physical location. Note the challenges in designing
such a data structure. Linux is designed for all kinds of devices: wearables,
mobile phones, laptops, desktops and large servers. There needs to be a device-
independent way for specifying where a device is connected. The kernel defines
a location panel (id of the surface on the housing), which can take generic values
such as top, left, bottom, etc. A panel represents a generic region of a device.
On each panel, the horizontal and vertical positions are specified. These are
coarse-grained positions: (top, center, bottom) and (left, center, right). We
additionally store two bits. One bit indicates whether the device is connected
to a docking station and the second bit indicates whether the device is located
on the lid of the laptop.
Block devices often read and write large blocks of data in one go. Port-
mapped I/O and memory-mapped I/O often turn out to be quite slow and
unwieldy in such cases. DMA-based I/O is much faster in this case. Hence,
every block I/O device is associated with a DMA region. Further, it points to
a linked list of DMA pools. Each DMA pool points to a set of buffers that can
be used for DMA transfers. These are buffers in kernel memory and managed
by a slab cache (refer to Section 6.4.2).
The code of a block device is shown in Listing 7.4. It is like a derived class
where the base class is a device. Given that C does not allow inheritance, the
next best option is to add a pointer to the base class (device in this case) in the
definition of struct block device. Along with this pointer, we also store the ⟨major, minor⟩ numbers in the device type (devt) field.
Every block device is divided into a set of sectors. However, it can be divided
into several smaller devices that are virtual. Consider a hard disk, which is
a block device. It can be divided into multiple partitions. For example, in
Windows they can be C:, D:, E:, etc. In Linux, popular partitions hold the swap space, /boot and the root directory ‘/’. Each such partition is a virtual disk. It
represents a contiguous range of sectors. Hence, we store the starting sector
number and the number of sectors.
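A partition can thus be reduced to a (starting sector, number of sectors) pair. The sketch below, with illustrative names, shows how a partition-relative sector number is translated into an absolute sector number on the underlying device.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical partition descriptor: a contiguous range of sectors */
struct partition {
    uint64_t start_sector;   /* first sector of the partition on the device */
    uint64_t nr_sectors;     /* number of sectors in the partition */
};

/* Translate a partition-relative sector number into an absolute sector number.
   Returns -1 on an out-of-range access. */
static int64_t to_absolute_sector(const struct partition *p, uint64_t rel_sector)
{
    if (rel_sector >= p->nr_sectors)
        return -1;
    return (int64_t)(p->start_sector + rel_sector);
}

int main(void)
{
    struct partition boot = { .start_sector = 2048, .nr_sectors = 1048576 };
    printf("%lld\n", (long long)to_absolute_sector(&boot, 100));   /* prints 2148 */
    return 0;
}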
For historical reasons, a block device is always associated with a generic
disk. This is because the most popular block devices in the early days were
hard disks. This decision has persisted even though there are many more types
of block devices these days such as SSD drives, NVM memories, USB storage
devices, SD cards and optical drives. Nevertheless, a block device structure has
a pointer to a struct gendisk.
/* table of partitions */
struct block_device * part0 ;
struct xarray part_tbl ; /* partition table */
The gendisk stores a table of partitions (each entry is a pointer to a struct block device). Recall that we had associated a block device structure with each partition.
The next important data structure is a pointer to a structure called block
device operations. It contains a set of function pointers that are associated
with functions that implement specific functionalities. There are standard functions to open a device, release it (close it), submit an I/O request, check its status, check pending events, set the disk as read-only and free the memory associated with the disk.
Let us now discuss the request queue that is a part of the gendisk structure.
It contains all the requests that need to be serviced.
The code of struct request queue is shown in Listing 7.6. It stores a small amount of current state information – for instance, the last request that was merged (last merge).
The key structure is a queue of requests – an elevator queue. Let us explain
the significance of the elevator here. We need to understand how I/O requests
are scheduled in the context of storage devices, notably hard disks. We wish
to minimize the seek time (refer to Section 7.2.1). The model is that at any point of time, a storage device will have multiple pending requests. They need to be scheduled in such a way that the disk head moves the least per request.
One efficient algorithm is to schedule I/O requests the same way an elevator
schedules its stops. We will discuss more about this algorithm in the section on
I/O scheduling algorithms.
The two other important data structures that we need to store are I/O
request queues. The first class of queues are per-CPU software request queues.
They store pending requests for the I/O device. It is important to note that
these are waiting requests that have not been scheduled to execute on the storage
device yet. Once they are scheduled, they are sent to a per-device request queue
that sends requests directly to the underlying hardware. These per-CPU queues
are lockless queues, which are optimized for speed and efficiency. Given that multiple CPUs do not access the same per-CPU queue at the same time, there is no need for locks.
(Figure: the request queue with its per-CPU software (SW) queues and per-device hardware (HW) queues; each struct request points to a struct bio that holds the physical address ranges and the function called to complete the request.)
Each request stores a few parameters: the total length of the data (data length), the starting sector number, the deadline and
a timeout value (if any). The fact that block I/O requests can have a deadline
associated with them is important. This means that they can be treated as soft
real time tasks.
Listing 7.7: struct request
source : include/linux/blk − mq.h
struct request {
/* Back pointers */
struct request_queue * q ;
struct blk_mq_ctx * mq_ctx ;
struct blk_mq_hw_ctx * mq_hctx ;
struct block_device * part ;
/* Parameters */
unsigned int __data_len ;
sector_t sector ;
unsigned int deadline , timeout ;
There are two more fields of interest. The first is a function pointer (end io)
that is invoked to complete the request. This is device-specific and is imple-
mented by its driver code. The other is a generic data structure that has more
details about the I/O request (struct bio).
Its structure is shown in Listing 7.8. It has a pointer to the block device and
an array of memory address ranges (struct bio vec). Each entry is a 3-tuple:
physical page number, length of the data and starting offset. It points to a
memory region that either needs to be read or written to. A bio vec structure
is a list of many such entries. We can think of it as a sequence of memory
regions, where each single chunk is contiguous. The entire region represented
by bio vec however may not be contiguous. Moreover, it is possible to merge
multiple bio structs or bio vec vectors to create a larger I/O request. This
is often required because many storage devices such as SSDs, disks and NVMs
prefer long sequential accesses.
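The sketch below models a bio vec-like array with a simplified, hypothetical structure (toy_bio_vec): each entry is a (page number, length, offset) tuple, and the total size of the I/O request is simply the sum of the chunk lengths.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A simplified analogue of struct bio_vec: one contiguous chunk of memory
   described by (physical page number, length, offset within the page). */
struct toy_bio_vec {
    uint64_t page_nr;   /* physical page number */
    uint32_t len;       /* length of the chunk in bytes */
    uint32_t offset;    /* starting offset within the page */
};

/* Total number of bytes covered by an I/O request made of several chunks.
   The chunks themselves need not be contiguous in memory. */
static size_t total_io_size(const struct toy_bio_vec *vecs, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += vecs[i].len;
    return total;
}

int main(void)
{
    struct toy_bio_vec vecs[] = {
        { .page_nr = 10, .offset = 512, .len = 3584 },
        { .page_nr = 42, .offset = 0,   .len = 4096 },
        { .page_nr = 43, .offset = 0,   .len = 1024 },
    };
    printf("request covers %zu bytes\n", total_io_size(vecs, 3));
    return 0;
}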
In the classical elevator algorithm, the disk head starts with the pending request that is the closest to the center (innermost) and stops at the request that is the farthest from the center (outermost). This minimizes back and forth movement of the disk head, and also ensures fairness. After reaching the outermost request, the disk head then moves towards the innermost track, servicing requests on the way. An elevator processes requests in the same manner.
Figure 7.22: Example of the elevator algorithm, where fairness is being com-
promised. Fairness would require Request B to be scheduled before Request
A because it arrived earlier. If we start servicing requests on the reverse path
(outer to inner) then all the requests in the vicinity (nearby tracks) of A will
get serviced first. Note that requests in the vicinity of A got two back-to-back
chances: one when the head was moving towards the outermost track and one
when it reversed its direction.
There are many variants of this basic algorithm. We can quickly observe
that fairness is slightly being compromised here. Assume the disk head is on
the track corresponding to the outermost request. At that point of time, a
new request arrives. It is possible for the disk head to immediately reverse its
direction and process the new request that has just arrived. This will happen
if it is deemed to be in the outermost track (after the earlier request has been
processed). This situation is shown in Figure 7.22.
It is possible to make the algorithm fairer by directly moving to the innermost request after servicing the outermost request. In the reverse direction (outer to inner), no requests are serviced. It is a direct movement of the head, which is a relatively fast operation. These classes of algorithms are very simple and are not used in modern operating systems.
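The following sketch captures the essence of elevator (SCAN) ordering for a set of pending track numbers: sweep outwards from the current head position servicing requests on the way, then reverse direction and service the remaining ones. The track numbers are made up, and real schedulers of course operate on request structures rather than plain integers.

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Print the order in which pending requests (track numbers) are serviced */
static void scan_schedule(int head, int *tracks, int n)
{
    qsort(tracks, n, sizeof(int), cmp_int);

    /* sweep outwards (increasing track numbers) */
    for (int i = 0; i < n; i++)
        if (tracks[i] >= head)
            printf("service track %d\n", tracks[i]);

    /* reverse and sweep back inwards */
    for (int i = n - 1; i >= 0; i--)
        if (tracks[i] < head)
            printf("service track %d\n", tracks[i]);
}

int main(void)
{
    int pending[] = { 95, 180, 34, 119, 11, 123, 62, 64 };
    scan_schedule(50, pending, 8);
    return 0;
}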
Linux uses three I/O scheduling algorithms: Deadline, BFQ and Kyber.
Deadline Scheduler
The Deadline scheduler stores requests in two queues. The first queue (sorted queue) stores requests in the order of their block address, which is roughly the same as the order of sectors. The reason for such a storage structure is to minimize the seek time by servicing requests in sector order. The second queue is a FIFO queue in which requests are tagged with deadlines; if a request has waited too long, it is serviced out of turn so that it does not starve.
BFQ Scheduler
The BFQ (Budget Fair Queuing) scheduler is similar to the CFS scheduler for
processes (see Section 5.4.6). The same way that CFS apportions the processing
time between jobs, BFQ creates sector slices and gives every process the freedom
to access a certain number of sectors in the sector slice. The main focus here is
fairness across processes. Latency and throughput are secondary considerations.
Kyber Scheduler
This scheduler was introduced by Facebook (Meta). It is a simple scheduler
that creates two buckets: high-priority reads and low-priority writes. Each
type of request has a target latency. Kyber dynamically adjusts the number of
allowed in-flight requests such that all operations complete within their latency
thresholds.
General Principles
In general, I/O schedulers and libraries perform a combination of three opera-
tions: delay, merge and reorder. Sometimes it is desirable to delay requests a bit so that a sufficiently large set of pending requests accumulates; optimizations are much easier to apply on such a set. One of the common optimiza-
tions is to merge requests. For example, reads and writes to the same block
can be easily merged, and redundant requests can be eliminated. Accesses to
adjacent and contiguous memory regions can be combined. This will minimize
the seek time and rotational delay.
Furthermore, requests can be reordered. We have already seen examples of
reordering reads and writes. This is done to service reads quickly because they
are often on the critical path. We can also distinguish between synchronous writes (the issuing process waits for them to complete) and asynchronous writes. The former should have a higher priority because a process is actively waiting for them to complete.
Other reasons for reordering could factor in the current position of the disk head
and deadlines.
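The sketch below shows the simplest kind of merge – a back merge – on a made-up request structure: if one request ends exactly where another begins and both are of the same type, they can be combined into a single, longer request.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A hypothetical pending request: a contiguous run of sectors */
struct toy_request {
    uint64_t sector;       /* starting sector */
    uint32_t nr_sectors;   /* length in sectors */
    bool     is_write;
};

/* Merge 'b' into 'a' if 'b' starts exactly where 'a' ends and both are of the
   same type. Returns true if the merge happened (a is extended, b is absorbed). */
static bool try_back_merge(struct toy_request *a, const struct toy_request *b)
{
    if (a->is_write != b->is_write)
        return false;
    if (a->sector + a->nr_sectors != b->sector)
        return false;
    a->nr_sectors += b->nr_sectors;
    return true;
}

int main(void)
{
    struct toy_request a = { .sector = 1000, .nr_sectors = 8, .is_write = false };
    struct toy_request b = { .sector = 1008, .nr_sectors = 8, .is_write = false };

    if (try_back_merge(&a, &b))
        printf("merged request: sector %llu, %u sectors\n",
               (unsigned long long)a.sector, a.nr_sectors);
    return 0;
}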
In this case, the structure that represents the driver is mspro block driver.
It contains pointers to the probe, initialize, remove, suspend and resume
functions.
The initialization function mspro block probe initializes the memory card.
It sends it instructions to initialize its state and prepare itself for subsequent
read/write operations. Next, it creates an entry in the sysfs file system, which
is a special file system that exposes attributes of kernel objects such as devices
to users. Files in the sysfs file system can be accessed by user-level applica-
tions to find the status of devices. In some cases, superusers can also write
to these files, which allows them to control the behavior of the corresponding
devices. Subsequently, the initialization function initializes the various block
device-related structures: gendisk, block device, request queue, etc.
Typically, a device driver is written for a family of devices. For a specific
device in the family, either a generic (core) function can be used or a specific
function can be implemented. Let us consider one such specific function for the
Realtek USB memory card. It uses the basic code in the memstick directory but
defines its own function for reading/writing data. Let us explain the operation
of the function rtsx usb ms handle req.
It maintains a queue of outstanding requests. It uses the basic memstick code
to fetch the next request. There are three types of requests: read, write and bulk
transfer. For reading and writing, the code creates generic USB commands and
passes them on to a low-level USB driver. Its job is to send the raw commands
to the device. For a bulk transfer, the driver sets up a one-way pipe with
the low-level driver, which ensures that the data is directly transferred to a
memory buffer, which the USB device can access. The low-level commands can
be written to the USB command registers by the USB driver.
Point 7.4.1
The argument is a struct urb, which is a generic data structure that holds
information pertaining to USB requests and responses. It holds details about
the USB endpoint (device address), status of the transfer, pointer to a memory
buffer that holds the data and the type of the transfer. The details of the keys
that were pressed are present in this memory buffer. Specifically, the following
pieces of information are present for keyboards.
3. The character corresponding to the key that was pressed (internal code or
ASCII code).
4. Report whether Num Lock or Caps Lock have been processed.
Linux-like operating systems define the concept of an inode (see Figure 7.24). It stores the metadata associated with a file such as its ownership information, size, permissions, etc. (the file’s name is stored in the directory entry), and also has a pointer to the block mapping table. If the
user wishes to read the 1046th byte of a file, all that she needs to do is compute
the block number and pass the file’s inode to a generic function. The outcome
is the address on the storage device.
Now a directory is also in a certain sense a regular file that stores data. It is
thus also represented by an inode. Since an inode acts like a metadata storage
unit in the world of files, it does not care what it is actually representing. It can
represent either regular files or directories or even devices. While representing
data (files and directories), it simply stores pointers to all the constituent blocks
without caring about their semantics. A block is treated as just a collection of
bytes. A directory’s structure is also simple. It stores a table that is indexed
by the name of the file/directory. The columns store the metadata information.
One of the most important fields is a pointer to the inode of the entry. This is
the elegance of the design. The inode is a generic structure that can point to
any type of file including network sockets, devices and inter-process pipes – it
does not matter.
Let us explain with an example. In Linux, the default file system’s base direc-
tory is /. Every file has a path. Consider the path /home/srsarangi/ab.txt.
Assume that an editor wants to open the file. It needs to access its data blocks
and thus needs a pointer to its inode. The open system call locates the inode
and provides a handle to it that can be used by user programs. Assume that
the location of the inode of the / (or root) directory is known. It is inode #2
in the ext4 file system. The kernel code reads the contents of the / directory
and locates the inode of the home subdirectory in the table of file names. This
process continues recursively until the inode of ab.txt is located. Once it is
identified, there is a need to remember this information. The inode is wrapped
in a file handle, which is returned to the process. For subsequent accesses such
as reading and writing to the file, all that the kernel needs is the file handle. It can easily extract the inode and process the request. There is no need to recursively traverse the tree of directories again.
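The sketch below mimics this recursive lookup on a toy, in-memory directory tree. The structure names are invented and bear no relation to real ext4 code: the path is split into components, and each component is looked up in the current directory to obtain the next inode.

#include <stdio.h>
#include <string.h>

/* A toy in-memory "file system": each node has a name, an inode number and children */
struct toy_inode {
    const char *name;
    int inode_nr;
    struct toy_inode *children;
    int nr_children;
};

/* Walk the path one component at a time, starting from the root inode */
static const struct toy_inode *lookup(const struct toy_inode *root, const char *path)
{
    char buf[256];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    const struct toy_inode *cur = root;
    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        const struct toy_inode *next = NULL;
        for (int i = 0; i < cur->nr_children; i++)
            if (strcmp(cur->children[i].name, comp) == 0)
                next = &cur->children[i];
        if (next == NULL)
            return NULL;   /* component not found */
        cur = next;
    }
    return cur;
}

int main(void)
{
    struct toy_inode ab   = { "ab.txt", 11, NULL, 0 };
    struct toy_inode user = { "srsarangi", 7, &ab, 1 };
    struct toy_inode home = { "home", 5, &user, 1 };
    struct toy_inode root = { "/", 2, &home, 1 };

    const struct toy_inode *f = lookup(&root, "/home/srsarangi/ab.txt");
    if (f)
        printf("found inode #%d\n", f->inode_nr);
    return 0;
}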
Let us now look at the file system in its entirety. It clearly needs to store all
the constituent inodes. Let us look at the rest of its components.
Recall that in hard disks and similar block devices, a single physical device
can be partitioned into multiple logical devices or logical disks. This is done
for effective management of the storage space, and also for security purposes –
we may want to keep all the operating system related files in one partition and
store all the user data in another partition. Some of these partitions may be
bootable. Bootable partitions typically store information related to booting the
kernel in the first sector (Sector 0), which is known as the boot block. The BIOS
can then load the kernel.
Most partitions just store a regular file system and are not bootable. For
example, D: and E: are partitions on Windows systems (refer to Figure 7.23).
On Linux, /usr and /home may be mounted on different partitions. In general, a
partition has only one file system. However, there are exceptions. For example,
swap on Linux (swap space) does not have a file system mounted on it. There
are file systems that span multiple partitions, and there are systems where
multiple file systems are mounted on the same partition. However, these are
very specialized systems. The metadata of most file systems is stored in Block
1, regardless of whether they are bootable or not. This block is known as the
superblock. It contains the following pieces of information: file system type
and size, attributes such as the block size or maximum file length, number of
inodes and blocks, timestamps and additional data. Some other important data
structures include inode tables, and a bitmap of free inodes and disk blocks. For
implementing such data structures, we can use bitmaps that can be accelerated
with augmented trees.
as a root directory. The key question is how do we access the files stored
in the new file system? Any file or directory has a path that is of the form
/dir1/dir2/.../filename in Linux. In Windows / is replaced with \. The
baseline is that all files need to be accessible via a string of this form, which is
known as the path of the file. It is an absolute path because it starts from the
root directory. A path can also be relative, where the location is specified with
respect to the current directory. Here the parent directory is specified with the
special symbol “..”. The first thing that the library functions do is convert all
relative paths to absolute paths. Hence, the key question still remains. How do
we specify paths across file systems?
(Figure: a file system mounted at the mount point /home/srsarangi/doc; the file /books/osbook.pdf inside it is accessed via the combined path /home/srsarangi/doc/books/osbook.pdf.)
Once path traversal reaches a mount point, the request is handed over to the mounted file system. It will be tasked with retrieving the file /videos/foo.mpg relative to its root. Things can get more interesting. A mounted file system
can mount another file system, so on and so forth. The algorithm for traversing
the file system remains the same. Recursive traversal involves first identifying
the file system from the file path, and then invoking its functions to locate the
appropriate inode.
Finally, the umount command can be used to unmount a file system. Its files will not be accessible anymore.
(Figure: symbolic links vs hard links – a symbolic link is a separate file whose contents store the path of the target file, whereas a hard link is a directory entry that points to the same inode as the target.)
A separate file is created with its own inode (file type ‘l’). Its contents
contain the path of the target. Hence, resolving such a link is straightforward.
The kernel reads the path contained in the symbolic link file, and then uses the regular path resolution mechanism to locate the target file’s inode.
Hard Links
Hence, hard links were introduced. The same ln command can be used to create
hard links as follows (refer to Figure 7.26).
Listing 7.11: Creating a hard link
ln path_to_target path_to_link
A hard link is a directory entry that points to the same inode as the target file. In this case, both the hard link and the original directory entry point to the same inode. If the file is modified via one name, then the changes are visible via the other. However, deleting the target file does not lead to the deletion of the hard link. Hence, the hard link still remains valid. In effect, a reference count is maintained with each inode. The inode is deleted when all the directory entries and hard links that
point to it are deleted. Another interesting property is that if the target file
is moved (within the same file system) or renamed, the hard link still remains
valid. However, if the target file is deleted and a new file with the same name
is created in the same directory, the inode changes and the hard link does not
remain valid.
There are nonetheless some limitations with hard links. They cannot be
used across file systems, and normally cannot link to directories. The latter could create infinite loops because a child could then link to an ancestor directory.
Figure 7.27: File systems supported by the Linux virtual file system (VFS).
Figure 7.27 shows a conceptual view of the virtual file system, where a single file system unifies many different types of file systems. We wish to use a single interface to
access and work with all the files in the VFS regardless of how they are stored
or which underlying file system they belong to. Finally, given Linux’s histori-
cal ties to Unix, we would like the VFS’s interface to be similar to that of the
classical Unix file system (UFS). Given our observations, let us list down the
requirements of a virtual file system in Point 7.6.1.
Point 7.6.1
when the file system is mounted. Then they can be added to a software cache.
Any subsequent access will find the pseudostructure in the cache.
Next, we store the size of the file in terms of the number of blocks (i blocks)
and the exact size of the file in bytes (i size).
We shall study in Chapter 8 that permissions are very important in Linux
from a security perspective. Hence, it is important to store ownership and
permission information. The field i mode stores the type of the file. Linux
supports several file types namely a regular file, directory, character device,
block device, FIFO pipe, symbolic link and socket. Recall the everything-is-a-file assumption. The file system treats all such diverse entities as files. Hence, it becomes necessary to store their type as well. The field i uid stores the id of the user who owns the file. In Linux, every user belongs to one or more groups. Furthermore, resources such as files are associated with a group. This is indicated by the field i gid (group id). Group members get some additional access rights as compared to users who are not a part of the group. Some additional fields include access times, modification times and file locking-related state.
The next field i op is crucial to implementing VFS. It is a pointer to an inode
operations structure that contains a list of function pointers. These function
pointers point to generic file operations such as open, close, read, write, flush
(move to kernel buffers), sync (move to disk), seek and mmap (memory map).
Note that each file system has its own custom implementations of such functions.
The function pointers point to the relevant function (defined in the codebase of
the underlying file system).
Given that the inode in VFS is meant to be a generic structure, we cannot
store more fields. Many of them may not be relevant to all file systems. For
example, we cannot store a mapping table because inodes may correspond to
devices or sockets that do not store blocks on storage devices. Hence, it is a
good idea to have a pointer to data that is used by the underlying file system.
The pointer i private is useful for this purpose. It is of type void *, which
means that it can point to any kind of data structure. Often file systems set it
to custom data structures. Many times they define other kinds of encapsulating
data structures that have a pointer to the VFS inode and file system-specific
custom data structures. i private can also point to a device that corresponds
to the file. It is truly generic in character.
Point 7.6.2
An inode is conceptually a two-part structure. The first part is a VFS
inode (shown in Listing 7.12), which stores generic information about
a file. The second part is a file system-specific inode that may store a
mapping structure, especially in the case of regular files and directories.
Directory Entry
Address Space
Let us now discuss the page cache. This is a very useful data structure
especially for file-backed pages. I/O operations are slow; hence, it should not be
necessary to access the I/O devices all the time. It is a far wiser idea to maintain
an in-memory page cache that can service reads and writes quickly. This, however, creates a consistency problem. If the system is powered off abruptly, then there is a risk of updates getting lost. Thankfully, in modern systems, this behavior can
be controlled and regulated to a large extent. It is possible to specify policies.
For example, we can specify that when a file is closed, all of its cached data
needs to be written back immediately. The close operation will be deemed
to be successful only after an acknowledgement is received indicating that all
the modified data has been successfully written back. There are other methods
as well. Linux supports explicit sync (synchronization) calls, kernel daemons
that periodically sync data to the underlying disk, and write-back operations
triggered when the memory pressure increases.
struct address space is an important part of the page cache (refer to List-
ing 7.14). It stores a mapping (i pages) from an inode to its cached memory
pages (stored as a radix tree). The second map (i mmap) maps the inode to a list of vma structures (stored as a red-black tree). The need to maintain all the virtual memory regions (vma structures) that have cached pages arises from the fact that there is a need to quickly check whether a given virtual address is cached or not. It
additionally contains a list of pointers to functions that implement regular oper-
ations such as reading or writing pages (struct address space operations).
Finally, each address space stores some private data, which is used by the
functions that work on it.
Point 7.6.4
This is a standard pattern that we have been observing for a while now.
Whenever we want to define a high-level base class in C, there is a need
to create an auxiliary structure with function pointers. These pointers
are assigned to real functions by (conceptually) derived classes. In an
object-oriented language, there would have been no reason to do so. We
could have simply defined a virtual base class and then derived classes
could have overridden its functions. However, in the case of the kernel,
which is written in C, the same functionality needs to be created using
a dedicated structure that stores function pointers. The pointers are
assigned to different sets of functions based on the derived class. In this
case, the derived class is the actual file system. Ext4 will assign them to
functions that are specific to it, and other file systems such as exFAT or
ReiserFS will do the same.
The role of the vma structures needs to be further clarified. A file can be mapped
to the address spaces of multiple processes. For each process, we will have
a separate vma region. Recall that a vma region is process-specific. The key
problem is to map a vma region to a contiguous region of a file. For example,
if the vma region’s start and end addresses are A and B (resp.), we need some
record of the fact that the starting address corresponds to the file offset P and
the ending address corresponds to file offset Q (note: B − A = Q − P ). Each
vma structure stores two fields that help us maintain this information. The first is vm file (a pointer to the file) and the second is vm pgoff. It is the offset within the file, measured in pages, that corresponds to the starting address of the vma region. The page offset within the file can be calculated from the address X using the
following equation.
offset = (X − vm start)/PAGE SIZE + vm pgoff        (7.3)
Here, PAGE SIZE is 4 KB and vm start is the starting address of the vma.
Finally, note that we can reverse map file blocks using this data structure as
well.
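The computation in Equation 7.3 can be expressed directly in code. The sketch below uses illustrative structure and field names that merely mirror vm_start and vm_pgoff.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* A stripped-down view of the fields used in Equation 7.3 */
struct toy_vma {
    uintptr_t vm_start;   /* starting virtual address of the mapping */
    uint64_t  vm_pgoff;   /* file offset (in pages) of that starting address */
};

/* Page offset within the file that backs the address X */
static uint64_t file_page_offset(const struct toy_vma *vma, uintptr_t x)
{
    return (x - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff;
}

int main(void)
{
    struct toy_vma vma = { .vm_start = 0x700000000000UL, .vm_pgoff = 16 };
    uintptr_t x = 0x700000000000UL + 3 * PAGE_SIZE + 123;
    printf("file page offset = %llu\n",
           (unsigned long long)file_page_offset(&vma, x));   /* prints 19 */
    return 0;
}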
(Figure: the classic inode block-mapping scheme – 12 pointers to directly mapped blocks (optimized for small files), plus pointers to single-indirect, double-indirect and triple-indirect blocks.)
The basic idea is similar to the concept of folios – long contiguous sequences of
pages in physical and virtual memory. In this case, we define an extent to be a
contiguous region of addresses on a storage device. Such a region can be fairly
large. Its size can vary from 4 KB to 128 MB. The advantage of large contiguous
chunks is that there is no need to repeatedly query a mapping structure for
addresses that lie within it. Furthermore, allocation and deallocation are easy. A large region can be allocated in one go. The flip side is that we may end up
creating holes as was the case with the base-limit scheme in memory allocation
(see Section 6.1.1). In this case, holes don’t pose a big issue because extents
can be of variable sizes. We can always cover up holes with extents of different
sizes. However, the key idea is that we wish to allocate large chunks of data as
extents, and simultaneously try to reduce the number of extents. This reduces
the amount of metadata required to save information related to extents.
The organization of extents is shown in Figure 7.29. In this case, the struc-
ture of the ext4 inode is different. It can store up to four extents. Each extent
points to a contiguous region on the disk. However, if there are more than 5 ex-
tents, then there is a need to organize them as a tree (as shown in Figure 7.29).
The tree can at the most have 5 levels. Let us elaborate.
Figure 7.29: Organization of the ext4 extent tree – the first 12 bytes of i_block[] in the ext4 inode hold an ext4 extent header; ext4 extent idx index nodes point to further blocks, and ext4 extent leaf entries point to contiguous regions on the disk.
There is no need to define a separate ext4 inode for the extent-based filesys-
tem. The ext4 inode defines 15 block pointers: 12 for direct block pointers,
1 for the single-indirect block, 1 for the double-indirect block and 1 for the
triple-indirect block. Each such pointer is 4 bytes long. Hence, the total storage
required in the ext4 inode structure is 60 bytes.
The great idea here is to repurpose these 60 bytes to store information related
to extents. There is no need to define a separate data structure. The first
12 bytes are used to store the extent header (struct ext4 extent header).
The structure is directly stored in these 12 bytes (not its pointer). An ext4
header stores important information about the extent tree: number of entries,
the depth of the tree, etc. If the depth is zero, then there is no extent tree.
We just use the remaining 48 (60-12) bytes to directly store extents (struct
ext4 extent). Here also the structures are directly stored, not their pointers.
Each ext4 extent requires 12 bytes. We can thus store four extents in this
case.
The code of an ext4 extent is shown in Listing 7.15. It maps a set of
contiguous logical blocks (within a file) to contiguous physical blocks (on the
disk). The structure stores the first logical block, the number of blocks and
the 48-bit address of the starting physical block. We store the 48 bits using
two fields: one 16-bit field and one 32-bit field. An extent basically maps a set of contiguous logical blocks to the same number of contiguous physical blocks. The size of an extent is naturally limited to 2^15 (32k) blocks. If each block is 4 KB, then an extent can map 32k × 4 KB = 128 MB.
Listing 7.15: struct ext4 extent
source : fs/ext4/ext4 extents.h
struct ext4_extent {
__le32 ee_block ; /* first logical block */
__le16 ee_len ; /* number of blocks */
__le16 ee_start_hi ; /* high 16 bits ( phy . block ) */
__le32 ee_start_lo ; /* low 32 bits ( phy . block ) */
};
Now, consider the case when we need to store more than 4 extents. In this
case, there is a need to create an extent tree. Each internal node in the extent
tree is represented by the structure struct ext4 extent idx (extent index).
It stores the starting logical block number and a pointer to the physical block that holds the next level of the tree. The next level of the tree is a block (typically 4 KB). Out of the 4096 bytes, 12 bytes are required for the extent
header and 4 bytes for storing some more metadata at the end of the block.
This leaves us with 4080 bytes, which can be used to store 340 12-byte data
structures. These could either be extents or extent index structures. We are
thus creating a 340-ary tree, which is massive. Now, note that we can have at most a 5-level tree. The maximum file size is thus extremely large; many file systems limit it to 16 TB. Let us compute the maximum size of the entire file system. The total number of addressable physical blocks is 2^48. If each block is 4 KB, then the maximum file system size (known as the volume size) is 2^60 bytes, which is 1 EB (exabyte). We can thus quickly conclude that an extent-based file system is far more scalable than an indirect block-based file system.
Directory Structure
As discussed earlier, it is the job of the ext4 file system to define the internal
structure of the directory entries. VFS simply stores structures to implement
the external interface.
Listing 7.16 shows the structure of a directory entry in the ext4 file system.
The name of the structure is ext4 dir entry 2. It stores the inode number,
length of the directory entry, length of the name of the file, the type of the file
and the name of the file. It basically establishes a connection between the file
name and the inode number. In this context, the most important operation is a
lookup operation. The input is the name of a file, and the output is a pointer to
the inode (or alternatively its unique number). This is a straightforward search
problem in the directory. We need to design an appropriate data structure for
storing the directory entries (e.g., ext4 dir entry 2 in the case of ext4). Let us start by looking at some naive solutions. Trivia 7.6.2 discusses the space of possible solutions.
Trivia 7.6.2
• We can simply store the entries in an unsorted linear list. A lookup will require roughly n/2 comparisons on average, where n is the total number of files stored in the directory. This is clearly slow and not scalable.
• The next solution is a sorted list that requires O(log(n)) compar-
isons. This is a great data structure if files are not being added or
removed. However, if the contents of a directory change, then we
need to continuously re-sort the list, which is seldom feasible.
• A hash table has roughly O(1) search complexity. It does not
require continuous maintenance. However, it also has scalability
problems. There could be a high degree of aliasing (multiple keys
map to the same bucket). This will require constant hash table
resizing.
• Traditionally, red-black trees and B-trees have been used to solve
such problems. They scale well with the number of files in a direc-
tory.
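As a concrete illustration of the hash-based option discussed in the trivia above, the sketch below implements a toy directory as a fixed array of hash buckets, each holding a short chain of (name, inode number) entries. All names and sizes are invented; real designs such as ext4’s HTree are far more sophisticated.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_BUCKETS 16
#define CHAIN_LEN  4

struct dir_entry {
    char name[32];
    int  inode_nr;   /* 0 means the slot is empty */
};

static struct dir_entry buckets[NR_BUCKETS][CHAIN_LEN];

static unsigned hash_name(const char *name)
{
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % NR_BUCKETS;
}

static int dir_insert(const char *name, int inode_nr)
{
    struct dir_entry *chain = buckets[hash_name(name)];
    for (int i = 0; i < CHAIN_LEN; i++) {
        if (chain[i].inode_nr == 0) {
            strncpy(chain[i].name, name, sizeof(chain[i].name) - 1);
            chain[i].inode_nr = inode_nr;
            return 0;
        }
    }
    return -1;   /* bucket full: a real design would resize or split */
}

static int dir_lookup(const char *name)
{
    struct dir_entry *chain = buckets[hash_name(name)];
    for (int i = 0; i < CHAIN_LEN; i++)
        if (chain[i].inode_nr != 0 && strcmp(chain[i].name, name) == 0)
            return chain[i].inode_nr;
    return -1;
}

int main(void)
{
    dir_insert("ab.txt", 11);
    dir_insert("notes.md", 12);
    printf("ab.txt -> inode #%d\n", dir_lookup("ab.txt"));
    return 0;
}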
Figure 7.30: Layout of a file (“abc”) in the FAT file system – each FAT table entry points to the next entry of the file’s chain and to the corresponding cluster on the storage device.
The basic concept is quite simple. We have a long table of entries (the FAT
table). This is the primary data structure in the overall design. Each entry has
two pointers: a pointer to the next entry in the FAT table (can be null) and a
pointer to a cluster stored on the disk. A cluster is defined as a set of sectors
(on the disk). It is the smallest unit of storage in this file system. We can think
of a file as a linked list of entries in the FAT table, where each entry additionally
points to a cluster on the disk (or some storage device). Let us elaborate.
Regular Files
Consider a file “abc”. It is stored in the FAT file system (refer to Figure 7.30).
Let us assume that the size of the file is three clusters. We can number the
clusters 1, 2 and 3, respectively. Cluster 1 is the first cluster of the file, as shown in the figure. The first FAT table entry of the file has a pointer to this cluster. Note that this pointer is a disk address. Given that
this entry is a part of a linked list, it contains a pointer to the next entry (2nd
entry). This entry is designed similarly. It has a pointer to the second cluster
of the file. Along with it, it also points to the next node on the linked list (3rd
entry). Entry number 3 is the last element on the linked list. Its next pointer
is null. It contains a pointer to the third cluster.
The structure is thus quite simple and straightforward. The FAT table just
stores a lot of linked lists. Each linked list corresponds to a file. In this case a
file represents both a regular file and a directory. A directory is also represented
as a regular file, where the data blocks have a special format.
Almost everybody would agree that the FAT table distinguishes itself on the
basis of its simplicity. All that we need to do is divide the total storage space
into a set of clusters. We can maintain a bitmap for all the clusters, where the bit corresponding to a cluster is 1 if the cluster is free and 0 if it is busy.
Any regular file or directory is a sequence of clusters and thus can easily be
represented by a linked list.
Even though the idea seems quite appealing, linked lists have their share of
problems. They do not allow random access. This means that given a logical
address of a file block, we cannot find its physical address in O(1) time. There
is a need to traverse the linked list, which requires O(N) time. Recall that the ext4 file system allowed us to quickly find the physical address of a file block – with indirect blocks or extents, only a small, bounded number of lookups is needed. This is something that we sacrifice with a FAT table. If we have purely sequential accesses, then this limitation does not pose a major problem.
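The linear traversal can be seen directly in the sketch below, which walks a toy FAT chain (all entry and cluster numbers are made up) to find the cluster that stores the n-th logical block of a file.

#include <stdint.h>
#include <stdio.h>

#define FAT_EOF 0xFFFFFFFFu   /* hypothetical end-of-chain marker */
#define NR_ENTRIES 16

/* Each FAT entry: index of the next entry in the chain, and the cluster
   (disk address) that this entry corresponds to */
struct fat_entry {
    uint32_t next;      /* next entry in the chain, or FAT_EOF */
    uint32_t cluster;   /* cluster number on the storage device */
};

static struct fat_entry fat[NR_ENTRIES];

/* Find the cluster that stores the n-th (0-based) logical block of a file whose
   chain starts at 'first'. This is an O(n) walk -- the drawback discussed above. */
static int64_t nth_cluster(uint32_t first, uint32_t n)
{
    uint32_t cur = first;
    for (uint32_t i = 0; i < n; i++) {
        if (fat[cur].next == FAT_EOF)
            return -1;   /* the file is shorter than n+1 clusters */
        cur = fat[cur].next;
    }
    return fat[cur].cluster;
}

int main(void)
{
    /* file "abc": chain 1 -> 2 -> 3, mapping to clusters 700, 53 and 421 */
    fat[1] = (struct fat_entry){ .next = 2,       .cluster = 700 };
    fat[2] = (struct fat_entry){ .next = 3,       .cluster = 53  };
    fat[3] = (struct fat_entry){ .next = FAT_EOF, .cluster = 421 };

    printf("3rd cluster of the file is at %lld\n", (long long)nth_cluster(1, 2));
    return 0;
}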
Directories
Figure 7.31: Storing files and directories in the FAT file system
Both ext4 and exFAT treat a directory as a regular file to a large extent. It is
just a collection of blocks (clusters in the case of exFAT). The “data” associated
with a directory has a special format. As shown in Figure 7.31, a directory is
a table with several columns. The first column is the file name, which is a
unique identifier of the file. Modern file systems such as exFAT support long file names. Comparing such long file names can be time-consuming. In the interest of efficiency, it is a better idea to hash a file name to a 32 or 64-bit number. Locating a file then primarily involves a simple 32 or 64-bit hash comparison, which is an efficient solution.
The next set of columns stores the file’s attributes, which include the file’s status (read-only, hidden, etc.), the file length and the creation/modification times. The last
column is a pointer to the first entry in the FAT table. This part is crucial. It
ties a directory entry to the starting cluster of a file via the FAT table. The
directory entry does not point to the cluster directly. Instead, it points to the
first entry of the file in the FAT table. This entry has two pointers: one points
to the first cluster of the file and the other points to the next entry of the linked
list.
Due to the simplicity of such file systems, they have found wide use in
portable storage media and embedded devices.
Phase      Action
Pre-write  Discard the journal entry
Write      Replay the journal entry
Cleanup    Finish the cleanup process

Table 7.4: Actions that are taken when the system crashes in different phases of a write operation
Assume that the system crashes in the pre-write phase. This can be detected
from its journal entry. The journal entry would be incomplete. We assume that
it is possible to find out whether a journal entry is fully written to the journal
or not. This is possible using a dedicated footer section at the end of the entry.
Additionally, we can have an error checking code to verify the integrity of the
entry. If the entry is not fully written, it can simply be discarded.
If the journal entry is fully written, then the next stage commences where
a set of blocks on the storage device are written to. This is typically the most
time-consuming process. At the end of the write operation, the file system driver
updates the journal entry to indicate that the write operation is over. Now
assume that the system crashes before this update is made. After a restart, this
fact can easily be discovered. The journal entry will be completely written, but
there will be no record of the fact that the write operation has been fully completed.
The entire write operation can be re-done (replayed). Given the idempotence
of writes, there are no correctness issues.
Finally, assume that the write operation is fully done but before cleaning up
the journal, the system crashes. When the system restarts it can clearly observe
that the write operation has been completed, yet the journal entry is still there.
It is easy to finish the remaining bookkeeping and mark the entry for removal.
Either it can be removed immediately or it can be removed later by a dedicated
kernel thread.
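The three cases can be summarized with a small sketch of the recovery routine that runs after a restart. The structure and the helper functions (get_state, discard_entry, replay_entry and cleanup_entry) are hypothetical placeholders used only for illustration; they are not actual kernel code.

enum entry_state { INCOMPLETE, FULLY_WRITTEN, WRITE_DONE };

struct journal_entry;                                  /* opaque in this sketch */
enum entry_state get_state(struct journal_entry *e);   /* inspects the footer and the checksum */
void discard_entry(struct journal_entry *e);
void replay_entry(struct journal_entry *e);
void cleanup_entry(struct journal_entry *e);

void recover(struct journal_entry *e)
{
    switch (get_state(e)) {
    case INCOMPLETE:            /* crash in the pre-write phase */
        discard_entry(e);
        break;
    case FULLY_WRITTEN:         /* crash during the write phase */
        replay_entry(e);        /* writes are idempotent: simply redo them */
        cleanup_entry(e);
        break;
    case WRITE_DONE:            /* crash before or during cleanup */
        cleanup_entry(e);       /* only the bookkeeping is left */
        break;
    }
}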
Example 7.6.1
#include <stdio.h>
#include <stdlib.h>

int main() {
    int c;   /* fgetc returns an int so that EOF can be distinguished */
    FILE *src_file, *dst_file;
    src_file = fopen("a.txt", "r");        /* open the source file in read-only mode */
    if (src_file == NULL) {
        printf("Could not open a.txt\n");
        exit(1);
    }
    dst_file = fopen("b.txt", "w");        /* open the destination file for writing */
    if (dst_file == NULL) {
        fclose(src_file);
        printf("Could not open b.txt\n");
        exit(1);
    }
    while ((c = fgetc(src_file)) != EOF)   /* copy character by character */
        fputc(c, dst_file);
    fclose(src_file);
    fclose(dst_file);
    return 0;
}
On similar lines, we open the file “b.txt” for writing. In this case, the mode
is “w”, which means that we wish to write to the file. The corresponding mode
for opening the source file (“a.txt”) was “r” because we opened it in read-only
mode. Subsequently, we keep reading the source file character by character and
keep writing them to the destination file. If the character read is equal to EOF
(end of file), then it means that the end of the file has been reached and there
are no more valid characters left. The C library call to read characters is fgetc
and the library call to write a character is fputc. It is important to note that
both these library calls take the FILE handle (structure) as an argument that
identifies the file that was opened earlier (fputc additionally takes the character
to be written). Here, it is important
to note that a file cannot be accessed without opening it first. This is because
opening a file creates some state in the kernel that is subsequently required while
accessing it. We are already aware of the changes that are made such as adding
a new entry to the systemwide open file table, per-process open file table, etc.
Finally, we close both the files using the fclose library calls. They clean up
the state in the kernel. They remove the corresponding entries from the per-
process file table. The entries from the systemwide table are removed only if
there is no other process that has simultaneously opened these files. Otherwise,
we retain the entries in the systemwide open file table.
Let us consider the next example (Example 7.6.2) that opens a file, maps it
to memory and counts the number of ’a’s in the file. We proceed similarly. We
open the file “a.txt”, and assign it to a file handle file. In this case, we need
to also retrieve the integer file descriptor because there are many calls that need
it. This is easily achieved using the fileno function.
Example 7.6.2
Open a file "a.txt", and count the number of 'a's in the file.
Answer:
Listing 7.18: Count the number of 'a's in a file
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>

int main() {
    FILE *file;
    int fd;
    char *buf;
    struct stat info;
    int i, size, count = 0;
    file = fopen("a.txt", "r");    /* open the file */
    fd = fileno(file);             /* retrieve the integer file descriptor */
    fstat(fd, &info);              /* get the file's attributes, notably its size */
    size = info.st_size;
    buf = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);   /* map the file */
    for (i = 0; i < size; i++)     /* scan the mapped file byte by byte */
        if (buf[i] == 'a') count++;
    printf("%d\n", count);
    munmap(buf, size);
    fclose(file);
    return 0;
}
7.6.10 Pipes
Let us now look at a special kind of file known as a pipe. A pipe functions as a
producer-consumer queue. Even though modern pipes have support for multiple
producers and consumers, a typical pipe has a process that writes data at one
end, and another process that reads data from the other end. There is built-in
synchronization. This is a fairly convenient method of transferring data across
processes. There are two kinds of pipes: named and anonymous. We shall look
at both of them next.
Anonymous Pipes
An anonymous pipe is a pair of file descriptors. One file descriptor is used to
write, and the other is used to read. This means that the writing process has one
file descriptor, which it uses to write to the pipe. The reading process has one
more file descriptor, which it uses to read. A pipe is a buffered channel, which
means that if the reader is inactive, the pipe buffers the data that has not been
read. Once the data is read, it is removed from the pipe. Example 7.6.3 shows
an example.
Example 7.6.3
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
    pid_t pid;
    int pipefd[2];
    char msg_sent[] = "I love my OS book";
    char msg_rcvd[30];
    pipe(pipefd);       /* pipefd[0] is the read end, pipefd[1] is the write end */
    pid = fork();
    if (pid > 0) {      /* parent process: the writer */
        close(pipefd[0]);
        write(pipefd[1], msg_sent, strlen(msg_sent) + 1);
    } else {            /* child process: the reader */
        close(pipefd[1]);
        read(pipefd[0], msg_rcvd, sizeof(msg_rcvd));
        printf("%s\n", msg_rcvd);
    }
    return 0;
}
As we can see in the example, the pipe library call (and system call) creates
a pair of file descriptors. It fills in a 2-element array of file descriptors: index 0 is the
read end, and index 1 is the write end. In the example, the array of file descriptors is
passed to both the parent and the child process. Given that the parent needs
to write data, it closes the read end (pipefd[0]). Note that instead of using
fclose, we use close that takes a file descriptor as input. In general, the
library calls with a prefix of ‘f’ are at a high level and have lower flexibility. On
the other hand, calls such as open, close, read and write directly wrap the
corresponding system calls and are at a much lower level.
The parent process quickly closes the file descriptor that it does not need
(read end). It writes the string msg_sent to the pipe. The child process is the
reader. It does something similar – it closes the write end. It reads the message
from the pipe, and then prints it.
Named Pipes
$ mkfifo mypipe                           # create a named pipe
$ echo "I love my OS course" > mypipe     # write to the pipe (from another shell)
$ file mypipe
mypipe: fifo (named pipe)
$ ls -al mypipe                           # note the 'p' in the file type
prw-rw-rw- 1 DELL None 0 Apr 23 09:38 mypipe
$ tail -f mypipe                          # wait till the pipe is written to
I love my OS course

Figure 7.32: Creating and using a named pipe from the shell
Figure 7.32 shows a method for using named pipes. In this case the mkfifo
command is used to create a pipe file called mypipe. Its details can be listed
with the file command. The output shows that it is a named pipe, which is
akin to a producer-consumer FIFO queue. A directory listing shows the file to
be of type ‘p’. Given that the file mypipe is now a valid file in the file system, a
process running on a different shell can simply write to it. In this case, we are
writing the string “I love my OS course” to the pipe by redirecting the output
stream to the pipe. The ‘>’ symbol redirects the output to the pipe. The other
reading process can now read the message from the pipe by using the tail shell
command. We see the same message being printed.
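The same thing can be done programmatically. Below is a small sketch in C of the reader side: it creates the named pipe (if it does not already exist) and blocks until some other process writes to it. The path and the buffer size are arbitrary choices for illustration.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main() {
    char buf[64];
    mkfifo("mypipe", 0666);                  /* create the named pipe (the call fails harmlessly if it exists) */
    int fd = open("mypipe", O_RDONLY);       /* blocks until a writer opens the other end */
    int n = read(fd, buf, sizeof(buf) - 1);  /* read whatever the writer sent */
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);
    }
    close(fd);
    return 0;
}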
Exercises
Ex. 2 — Why are modern buses like USB designed as serial buses?
Ex. 4 — Give an example where RAID 3 (striping at the byte level) is the
preferred approach.
Ex. 5 — What is the advantage of a storage device that rotates with a con-
stant linear velocity?
** Ex. 6 — RAID 0 stripes data – stores odd numbered blocks in disk 0 and
even numbered blocks in disk 1. RAID 1 creates a mirror image of the data (disk
0 and disk 1 have the same contents). Consider RAID 10 (first mirror and then
stripe), and RAID 01 (first stripe and then mirror). Both the configurations
will have four hard disks divided into groups of two disks. Each group is called
a first-level RAID group. We are essentially making a second-level RAID group
out of two first-level RAID groups. Now, answer the following questions:
a) Does RAID 01 offer the same performance as RAID 10?
b) What about their reliability? Is it the same? You need to make an implicit
assumption here, which is that it is highly unlikely that both the disks
belonging to the same first-level RAID group will fail simultaneously.
Ex. 7 — The motor in hard disks rotates at a constant angular velocity. What
problems does this cause? How should they be solved?
Ex. 8 — We often use bit vectors to store the list of free blocks in file systems.
Can we optimize the bit vectors and reduce the amount of storage?
Ex. 9 — What is the difference between the contents of a directory, and the
contents of a file?
Ex. 10 — Describe the advantages and disadvantages of memory-mapped I/O
and port-mapped I/O.
Ex. 14 — How does memory-mapped I/O work in the case of hard disks? We
need to perform reads, writes and check the status of the disk. How does the
processor know that a given address is actually an I/O address, and how is this
communicated to software? Are these operations synchronous or asynchronous?
What is the advantage of this method over a design that uses regular I/O ports?
Explain your answers.
Ex. 16 — FAT file systems find it hard to support seek operations. How can
a FAT file system be modified to support such operations more efficiently?
Ex. 20 — Most flash devices have a small DRAM cache, which is used to
reduce the number of PE-cycles and the degree of read disturbance. Assume
that the DRAM cache is managed by software. Suggest a data structure that
can be created on the DRAM cache to manage flash reads and writes such that
we minimize the #PE-cycles and read disturbance.
Ex. 22 — Answer the following questions with respect to devices and device
drivers:
a) Why do we have both software and hardware request queues in struct request_queue?
b) Why do device drivers deliberately delay requests?
c) Why should we not simply pull out (remove) a USB key without ejecting it first?
d) What can be done to ensure that even if a user forcefully removes a USB
key, its FAT file system is not corrupted?
Ex. 25 — Suggest an algorithm for periodically draining the page cache (sync-
ing it with the underlying storage device). What happens if the sync frequency
is very high or very low?
** Ex. 27 — Design a file system for a system like Twitter/X. Assume that
each tweet (small piece of text) is stored as a small file. The file size is limited
to 256 bytes. Given a tweet, a user would like to take a look at the replies to
the tweet, which are themselves tweets. Furthermore, it is possible that a tweet
may be retweeted (posted again) many times. The “retweet” (new post) will
be visible to a user's friends. Note that there are no circular dependencies. It
is never the case that: (1) A tweets, (2) B sees it because B is A’s friend, (3)
B retweets the same message, and (4) A gets to see the retweet. Design a file
system that is suitable for this purpose.
Ex. 28 — Consider a large directory in the exFAT file system. Assume that
its contents span several blocks. How is the directory (represented as a file)
stored in the FAT table? What does each row in a directory’s data block look
like? How do we create a new file and allocate space to it in this filesystem?
For the last part, explain the data structures that we need to maintain. Justify
the design.
Chapter 8
Virtualization and Security
Exercises
Ex. 2 — Describe the trap-and-emulate method. How does it work for inter-
rupts, privileged instructions and system calls?
** Ex. 3 — Most proprietary software packages use a license server to verify whether
the user has sufficient credentials to run the software. Think of a "license server" as an
external server. The client sends its id and IP address (which cannot be spoofed) along
with some more information. After several rounds of communication, the server
sends a token that the client can use to run the application only once. The next
time we run the application, a fresh token is required. Design a cryptographic
protocol that is immune to changing the system time on the client machine,
replay attacks, and man-in-the-middle attacks. Assume that the binary of the
program cannot be changed.
** Ex. 5 — Let us design an operating system that supports record and re-
play. We first run the operating system in record mode, where it executes a host
of applications that interact with I/O devices, the hard disk, and the network.
A small module inside the operating system records all the events of interest.
Let us call this the record phase.
After the record phase terminates, later on, we can run a replay phase. In this
case, we shall run the operating system and all the constituent processes exactly
the same way as they were running in the record phase. The OS and all the
processes will show exactly the same behavior, and also produce exactly the
same outputs in the same order. To an outsider both the executions will be
indistinguishable. Such systems are typically used for debugging and testing,
where it is necessary to exactly reproduce the execution of an entire system.
Your answer should at least address the following points:
a) What do we do about the time? It is clear that we have to use some notion
of a logical time in the replay phase.
b) How do we deliver I/O messages from the network or hard disk, and inter-
rupts with exactly the same content, and exactly at the same times?
c) What about non-determinism in the memory system such as TLB misses,
and page faults?
d) How do we handle inherently non-deterministic instructions such as reading
the current time and generating a random number?
Ex. 7 — How does the VMM keep track of updates to the guest OS’s page
tables in shadow and nested paging?
Ex. 9 — If there is a context switch in the guest OS, how does the VMM
get to know the id (or something equivalent) of the new process (one that is
being swapped in)? Even if the VMM is not able to find the pid of the new
process being run by the guest OS, it should have some information available
with it such that it can locate the page table and other bookkeeping information
corresponding to the new process.
In this book, we have concerned ourselves only with the Linux kernel and that
too in the context of the x86-64 (64-bit) ISA. This section will thus provide a
brief introduction to this ISA. It is not meant to be a definitive reference. For a
deeper explanation, please refer to the textbook on basic computer architecture
by your author [Sarangi, 2021].
The x86-64 architecture is the 64-bit successor of the 32-bit x86 architecture,
which in turn succeeded the 16-bit and 8-bit versions. It is the default architec-
ture of all Intel and AMD processors as of 2023. This CISC ISA has grown more
complicated with the passage of time. From its early 8-bit origins, the development
of these processors passed through several milestones. The 16-bit version arrived in
1978, and the 32-bit version arrived with the Intel 80386 that was released in
1985. Intel and AMD introduced the x86-64 ISA in 2003. The ISA has become
increasingly complex over the years and hundreds of new instructions have been
added since then, particularly vector extensions (a single instruction can work
on a full vector of data).
A.1 Registers
ax = ah : al      bx = bh : bl      cx = ch : cl      dx = dh : dl
Figure A.1: The 16-bit registers ax, bx, cx and dx, each split into a high byte (ah, bh, ch, dh) and a low byte (al, bl, cl, dl)
In the 16-bit avatar of the ISA, these registers were simply extended to 16 bits. Their
names changed though, for instance a became ax, b became bx, and so on. As
shown in Figure A.1, the original 8-bit registers continued to be accessible for
backward compatibility. Each 16-bit register was split into a high and low part.
The lower 8 bits (LSBs) are addressable using the specifier al (low) and the
upper 8 bits (bits 9-16) are addressable using the register ah (high).
A few more registers are present in the 16-bit ISA. There is a stack pointer
sp (top of the stack), a frame pointer bp (beginning of the activation block for
the current function), and two index registers for performing computations in a
loop via a single instruction (si and di). In the 32-bit variant, a prefix ‘e’ was
added. ax became eax, so on and so forth. Furthermore, in the 64-bit variant
the prefix ‘e’ was replaced with the prefix ‘r’. Along with these registers, 8 new
registers were added – r8 to r15. This is shown in Figure A.2. Note that even
in the 64-bit variant of the ISA, known as x86-64, the 8, 16 and 32-bit registers
are accessible. It is just that these registers exist virtually (as a part of larger
registers).
64 bits    32 bits    16 bits
rax        eax        ax
rbx        ebx        bx
rcx        ecx        cx
rdx        edx        dx
rsp        esp        sp
rbp        ebp        bp
rsi        esi        si
rdi        edi        di
r8, r9, ..., r15 (64 bits only)
Figure A.2: The registers in the x86-64 ISA
Note that unlike newer RISC ISAs, the program counter is not directly
accessible. It is known as the instruction pointer in the x86 ISA, which is not
visible to the programmer. Along with the program counter, there is also a
flags register that becomes rflags in x86-64. It stores all the ALU flags. For
example, it stores the result of the last compare instruction. Subsequent branch
instructions use the result of this compare instruction for deciding the outcome
of conditional branches (refer to Figure A.3).
64 bits    32 bits    16 bits
rip        eip        ip
Figure A.3: The instruction pointer and its 32-bit and 16-bit variants
There are a couple of fields in the rflags register that are commonly used.
Each field typically requires 1 bit of storage and has a designated bit position
in the 64-bit register rflags. If the corresponding bit position is set to 1, then
it means that the corresponding flag is set otherwise it is unset (flag is false).
OF is the integer overflow flag, CF is the carry flag (generated in an addition),
the ZF flag is set when the last comparison resulted in an equality, and the
SF sign flag is set when the last operation that could set a flag resulted in a
negative result. Note that a comparison operation is basically implemented as
a subtraction operation. If the two operands are equal, then the comparison
results in an equality (the zero flag is set); otherwise, if the first operand is less
than the second operand, the result is negative and the sign flag is set to 1.
Figure: The x87 floating point register stack with eight registers (st0 to st7); st0 is the top of the stack
Early x86 processors with floating point capabilities used this stack-based model.
This basic programming model has remained in the x86 ISA because backward
compatibility is a necessary requirement. With the advent of fast hardware and
compiler technology, this has not proved to be a very strong impediment.
The basic mov operation moves the first operand to the second operand.
The first operand is the source and the second operand is the destination in
this format. Each instruction admits a suffix (or modifier), which specifies the
number of bits that we want it to operate on. The ‘q’ modifier means that
we wish to operate on 64 bits, whereas the ‘l’ modifier indicates that we wish
to operate on 32-bit values. In the instruction movq $3, %rax, we move the
number 3 (prefixed with a ‘$’) to the register rax. Note that all registers are
prefixed with a percentage (‘%’) symbol. Similarly, the next instruction movq
$4, %rbx moves the number 4 to the register rbx. The third instruction addq
%rbx, %rax adds the contents of register rbx to the contents of register rax,
and stores the result in rax. Note that in this case, the second operand %rax
is both a source and a destination. The final instruction stores the contents of
rax (that was just computed) to memory. In this case, the memory address is
computed by adding the base address that is stored in the stack pointer (%rsp)
with the offset 8. The movq instruction moves data between registers as well as
between a register and a memory location. It thus works as both a load and a
store. Note that we cannot transfer data from one memory location to another
memory location using a single instruction. In other words, it is not possible to
have two memory operands in an instruction.
Let us look at the code for computing the factorial of the number 10 in
Listing A.1. In this case, we use the 32-bit version of the ISA.
The last step in the backend of the compiler is code generation. The low-level
IR is converted to actual machine code. It is important for the compiler to know
the exact semantics of instructions on the target machine. Many times there are
complex corner cases where we have floating point flags and other rarely used
instructions involved. They have their own set of idiosyncrasies. Needless to
say, any compiler needs to be aware of them, and it needs to use the appropriate
set of instructions such that the code executes as efficiently as possible. We need
to guarantee 100% correctness. Furthermore, many compilers as of 2023 allow
the user to specify the compilation priorities. For instance, some programmers
may be looking at reducing the code size and for them performance may not
be that great a priority. Whereas, for other programmers, performance may be
the topmost priority. Almost all modern compilers are designed to handle such
concerns and generate code accordingly.
Object Files
Let us now take a look at Figure B.1. It shows the different phases of the overall
compilation process. Let us look at the first phase, which is converting a C file
to a .o file. The .o file is also known as an object file, which represents the
compiler output obtained after compiling a single C file. It contains machine
code corresponding to the high-level C code along with other information. It is
of course possible that a set of symbols (variables and functions) do not have
their addresses set correctly in the .o file because they were not known at the
time of compilation. All such symbols are identified and placed in a relocation
table within the .o file. The linking process or the linker is then tasked with
taking all the individual .o files and combining them into one large binary file,
which can be executed by the user. This binary file has all the symbols’ addresses
defined (we will relax this assumption later). Note that we shall refer to the
final executable as the program binary or simply as the executable.
Figure B.1: The phases of the compilation process (e.g., x.c is compiled with gcc -c x.c to produce the object file x.o)
Let us further delve into the problem of specifying function signatures, which
will ensure that we can at least compile a single C source code file correctly and
create the corresponding object file. Subsequently, the linker can combine all
the object files and create the program’s binary or executable.
A function may be invoked after the point at which it is defined. In this case, there is no need to actually declare the signature of the
function – the definition serves the purpose of also declaring the signature of the
function. However, in the reverse case, a declaration is necessary. For example,
let us say that the function is invoked in Line 19 and its code (definition) starts
at Line 300. There is a need to declare the signature of the function before Line
19. This is because when the relevant compilation pass processes Line 19, it
will already be armed with the signature of the function, and it can generate
the corresponding code for invoking the function correctly.
We need to do something similar for functions defined in other files in a large
multifile project. Of course, dealing with so many signatures and specifying
them in every source code file is a very cumbersome process. In fact, we also
have to specify the signature of global variable definitions (their types) and even
enumerations, structures and classes. Hence, it makes a lot of sense to have a
dedicated file to just store these signatures. There can be a pre-compilation
phase where the contents of this file are copy-pasted into source code files (C or
C++ files).
A header file or a .h file precisely does this. It contains a large number of
signatures of variables, functions, structs, enumerations and classes. All that
a C file needs to do is simply include the header file. Here the term include
means that a pre-compilation pass needs to copy the contents of the header file
into the C file that is including it. This is a very easy and convenient mechanism
for providing a bunch of signatures to a C file. For instance, there could be a set
of C files that provide cryptographic services. All of them could share a common
header file via which they export the signatures of the functions that they define
to other modules in a large software project. Other C files need to include this
header file and call the relevant functions defined in it to obtain cryptographic
services. The header file thus facilitates a logical grouping of variable, function
and structure/class declarations. It is much easier for programmers to include
a single header file that provides a cohesive set of declarations as opposed to
manually adding declarations at the beginning of every C file.
Header files have other interesting uses as well. Sometimes, it is easier to
simply go through a header file to figure out the set of functions that a set of C
files provides to the rest of the world. It is a great place for code browsing.
Barring a few exceptions, header files never contain function definitions or
any other form of source code. Their role is not to have regular C statements.
This is the role of source code files such as .c and .cpp files. Header files are
reserved only for signatures that aid in the process of compilation. For the
curious reader, it is important to mention that a notable exception to this rule
is C++ templates. A template is basically a class definition that takes another
class or structure as an argument and generates code based on the type of the
class that is passed to it at compile time.
Now, let us look at a set of examples to understand how header files are
meant to be used.
Example
Listing B.1 shows the code for the header file factorial.h. First, we check
if a preprocessor variable FACTORIAL_H is already defined. If it is already defined,
it means that the header file has already been included. This can happen for
a variety of reasons. It is possible that some other header file has included
factorial.h, and that header file has been included in a C file. Given that
the contents of factorial.h are already present in the C file, there is no need
to include it again explicitly. This is ensured using preprocessor variables. In
this case, if FACTORIAL_H has not been defined, then we define the function's
signature: int factorial(int);. This basically says that it takes a single
integer variable as input and the return value is an integer.
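Putting this description together, factorial.h would look roughly like the following sketch (the exact formatting in the original listing may differ):

#ifndef FACTORIAL_H
#define FACTORIAL_H

int factorial(int);   /* takes an integer, returns its factorial as an integer */

#endif /* FACTORIAL_H */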
Listing B.2: factorial.c
# include " factorial . h "
Listing B.2 shows the code of the factorial.c file. Note the way in which
we are including the header file. It is being included by specifying its name
in between double quotes. This normally means that the header file should
be there in the same directory as the C file (factorial.c). We can also use the
traditional way of including a header file between the ’<’ and ’>’ characters. In
this case, the directory containing the header file should be there in the include
path. The “include path” is a set of directories in which the C compiler searches
for header files. The directories are searched in ascending order of preference
based on their order in the include path. There is always an option of adding
an additional directory to the include path by using the ‘-I’ compilation flag in
gcc. Any directory that succeeds the ‘-I’ flag is made a part of the include path
and the compiler searches that directory as well for the presence of the header
file. Now, when the compiler compiles factorial.c, it can create factorial.o
(the corresponding object file). This object file contains the compiled version
of the factorial function.
Let us now try to write the file that will use the factorial function. Let us
name it prog.c. Its code is shown in listing B.3.
Listing B.3: prog.c
# include < stdio .h >
# include " factorial . h "
int main () {
printf ( " % d \ n " , factorial (3) ) ;
}
All that the programmer needs to do is include the factorial.h header file
and simply call the factorial function. The compiler knows how to generate
the code for prog.c and create the corresponding object file prog.o. Given
that we have two object files now – prog.o and factorial.o – we need to link
them together and create a single binary that can be executed. This is the job
of the linker that we shall see next. Before we look at the linker in detail, an
important point that needs to be understood here is that we are separating the
signature from the implementation. The signature was specified in factorial.h
that allowed prog.c to be compiled without knowing how exactly the factorial
function is implemented. The signature had enough information for the compiler
to compile prog.c.
In this mechanism, the programmer can happily change the implementation
as long as the signature is the same. The rest of the world will not be affected,
and they can continue to use the same function as if nothing has changed. This
allows multiple teams of programmers to work independently as long as they
agree on the signatures of functions that their respective modules export.
B.2 Linker
The role of the linker is to combine all the object files and create a single exe-
cutable. Any project in C/C++ or other languages typically comprises multiple
source files (.c and .cpp). Moreover, a source file may use functions defined in
the standard library. The standard library is a set of object files that defines
functions that many programs typically use such as printf and scanf. The
final executable needs to link these library files (collections of object files) as
well.
Definition B.2.1 Standard C Library
There are two ways of linking: static and dynamic. Static linking is a simple
approach where we just combine all the .o files and create a single executable.
This is an inefficient method as we shall quickly see. This is why dynamic
linking is used where all the .o files are not necessarily combined into a single
executable at the time of linking.
Figure B.3: Compiling the code in the factorial program and linking the components (factorial.h declares the factorial function and is included, via #include, by both factorial.c and prog.c)
Each object file contains some text (program code), read-only constants and
global variables that may or may not be initialized. Along with that it references
variables and functions that are defined in other object files. All the symbols
that an object file exports to the world are defined in the symbol table and all
the symbols that an object file needs from other object files are listed in the
relocation table. The linker thus operates in two passes.
Pass 1: It scans through all the object files and concatenates all the text sec-
tions (instructions), global variables, function definitions and constant def-
inition sections. It also makes a list of all the symbols that have been
defined in the object files. This allows the linker to compute the final
sizes of all the sections: text, data (initialized global/static variables), bss
(block starting symbol – uninitialized global/static variables) and con-
stants. All the program code and variable definitions are concatenated
and the final addresses of all the variables and functions are computed.
The concatenated code is however incomplete. The addresses of all the
relocated variables and functions (defined in other object files) are set to
zero (undefined).
Pass 2: In this stage, the addresses of all the relocated variables and functions
are set to their real values. We know the address of each variable at the
end of Pass 1. In the second pass, the linker replaces the zero-valued
addresses of relocated variables and functions with the actual addresses
computed in the first pass.
With static linking, we also lose a chance to reuse code pages that are required by multiple processes. For
instance, almost all processes share a few library functions defined in the stan-
dard C library. As a result, we would not like to replicate the code pages of
library functions – this would lead to a significant wastage of memory space.
Hence, we would like to share them across processes saving a lot of runtime
memory.
To summarize, if we use such statically linked binaries where the entire code
is packaged within a single executable, such code reuse options are not available
to us. Hence, we need a better solution. This solution is known as dynamic
linking.
test.c:
#include <stdio.h>
int main() {
    int a = 4;
    printf("%d", a);
}

$ gcc -static test.c      # add the code of the entire library to a.out
$ ldd a.out               # check if all the functions are bundled or not
        not a dynamic executable
$ du -h a.out
892K    a.out             # the binary is large because the entire library is included in a.out

Figure B.4: Creating and inspecting a statically linked executable
Figure B.5: The working of a stub function. The first time a library function such as printf is called, the stub locates it in a library and copies/maps it into the address space of the process; subsequent calls use the function's address directly.
The alternative is to have only one copy of the shared library code in physical memory and simply map
regions of the virtual address space of each process to the physical addresses
corresponding to the library code. This also minimizes the memory footprint
and allows as much of runtime code reuse as possible. Of course, there is a very
minor performance penalty. Whenever a library function is accessed for the first
time, it is necessary to first search for the library and then find the address of
the function within it. Searching for a library proceeds in a manner similar to
searching for header files.
During the process of compilation, a small note is made about which function
is available in which library. Now if the executable is transferred to another
machine and run there or even run on the same machine, it is necessary to
locate the library at runtime. The stub function calls a function named dlopen.
When invoked for the first time for a given library function, its job is to locate the
library. Akin to the way that we search for a header file, there is a search order.
We first search for the library in the current directory. If it is not found, we
check the directories in the LD LIBRARY PATH environment variable. Then
we search known locations in the system such as /lib and /usr/lib. The search
order is very important because often there are multiple copies of a library, and
we want the program to fetch the correct copy.
Each library defines a symbol table that lists the symbols that it exports to
the rest of the world. This is how we can find the addresses of the functions
that are present within the library and copy them to the memory space of the
process that dynamically links the library. The code can also be copied to a
shared location and then mapped to the virtual address space of any process
that wishes to use the code. This is a very efficient method and as of today, this
is the de facto standard. Almost all software programs use the shared library
based dynamic linking mechanism to reduce their code size and ensure that they
remain portable across systems.
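While the discussion above describes dlopen being invoked by the loader's stub code, a program can also call it explicitly. The following is a minimal sketch that loads the standard math library at runtime and looks up the cos function; depending on the glibc version, the program may additionally need to be linked with -ldl.

#include <stdio.h>
#include <dlfcn.h>

int main() {
    /* locate and load the library at runtime using the search order described above */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    /* look up the address of the cos function in the library's symbol table */
    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    printf("%f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}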
Many times, when we are not sure if the target system has a shared library
or not, the software package can either bundle the shared library along with
the executable or the target system can install the shared library first. This is
very common in Linux-based systems, where shared libraries are bundled into
packages. Whenever a piece of software (itself distributed as a package) is installed,
the system checks for its dependencies. If a package is dependent on other packages, then
it means that those packages provide some shared libraries that are required.
Hence, it is necessary to install them first. Moreover, these packages could have
dependencies with other packages that also need to be installed. We thus need
to compute the backward slice of a package and install the missing packages.
This is typically done by the package manager in Ubuntu or RedHat Linux.
It is important to note that the notion of shared libraries and dynamic
linking is there in all operating systems, not just Linux. For example, it is
there in Windows where it is known as a DLL (dynamically linked library).
Conceptually, a shared library on Linux (.so file) and a DLL in Windows (.dll
file) are the same.
Figure B.6 shows the method to generate a shared object or shared library
in Linux. In this case, we want to generate a shared library that contains the
code for the factorial function. Hence, we first compile the factorial.c file
to generate the object file (factorial.o) using the ‘-c’ gcc option. Then we
create a library out of the object file using the archive or ar command. The
extension of the archive is ’.a’. This is a static library that can only be statically
linked like a regular .o file.
The next part shows us how to generate a dynamic library. First, we need
to compile the factorial.c file in a way that is position independent – the
starting address of the code does not matter. This allows us to place the
code at any location in the virtual address space of a process. All the ad-
dresses are relative to a base address. In the next line, we generate a shared
object from the factorial.o object file using the ‘-shared’ flag. This generates
libfactorial.so. Next, we compile and link prog.c with the dynamic library
that we just created (libfactorial.so). This part is tricky. We need to do
two separate things.
Consider the command gcc -L. prog.c -lfactorial. We use the ‘-L’ flag
to indicate that the library will be found in the current directory. Then, we
specify the name of the C file, and finally we specify the library using the ‘-l’
flag. Note that there is no space in this case between ‘-l’ and factorial. The
compiler searches for libfactorial.so in the current directory because of the
-L and -l flags.
In this case, running the executable a.out is not very straightforward. We
need to specify the location at which the factorial library will be found given
that it is not placed in a standard location that the runtime (library loader)
usually checks, such as /lib or /usr/lib. We thus add the current directory
(output of the pwd command) to the LD_LIBRARY_PATH environment variable.
After that we can seamlessly execute the dynamically linked executable – it will
know where to find the shared library (libfactorial.so).
Readers are welcome to check the size of dynamically linked executables.
Recall the roughly 1 MB sized executable that we produced post static linking
(see Figure B.4); its size reduces to roughly 12 KB with dynamic linking!
Let us finish this round of discussion by describing the final structure of
the executable. After static or dynamic linking, Linux produces a shared object
file or executable in the ELF format.
B.3 Loader
The loader is the component of the operating system whose job is to execute a
program. When we execute a program in a terminal window, a new process is
spawned that runs the code of the loader. The loader reads the executable file
from the file system and lays it out in main memory. It needs to parse the ELF
executable to realize this.
It creates space for all the sections, loads the constants into memory and
allocates regions for the stack, heap and data/bss sections (static and global
variables). Additionally, it also copies all the instructions into memory. If
they are already present in the memory system, then instead of creating a
new copy, we can simply map the instructions to the virtual memory of the
new process. If there is a need for dynamic linking, then all the information
regarding dynamically linked symbols is stored in the relocation table and the
dynamic section in the process’s memory image. The loader also initializes the
jump tables.
Next, it initializes the execution environment such as setting the state of all
the environment variables, copying the command line arguments to variables
accessible to the process and setting up exception handlers. Sometimes for
security reasons, we wish to randomize the starting addresses of the stack and
heap such that it is hard for an attacker to guess runtime addresses. This can
be achieved using address space layout randomization (ASLR).
The kernel needs a generic mechanism to traverse such linked lists, as well as to easily create a linked list out of any kind of structure.
This is a very important software engineering problem that the early developers
of the kernel faced. Given that the kernel is written in C, a novel solution had
to be created.
Linux's solution is quite ingenious. It heavily relies on C macros, which are a
unique strength of C. We would advise the reader to go through this topic before
proceeding further. Macros are very useful, yet they can be quite hard to understand.
This is where we will use the magic of macros. We will use two macros to
solve this problem as shown in Listing C.2.
Listing C.2: The list_entry and container_of macros
source: include/linux/list.h and include/linux/container_of.h (resp.)

#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* simplified: the kernel version additionally contains a static type check */
#define container_of(ptr, type, member) ({              \
        void *__mptr = (void *)(ptr);                   \
        ((type *)(__mptr - offsetof(type, member))); })
Consider the structures shown in Listing C.3. In the case of struct abc,
the value of offsetof(abc, list) is 4. This is because we are assuming the
size of an integer is four bytes. The integer x is stored in the first four addresses
of struct abc. Hence, the offset of the list member is 4 here. On the same
lines, we can argue that the offset of the member list in struct def is 8.
This is because the size of an integer and that of a float are 4 bytes each.
Hence, (__mptr - offsetof(type, member)) provides the starting address of
the structure that is the linked list node. To summarize, given a pointer to the
list member embedded within an object, the container_of macro returns the
starting address of the encapsulating object (the linked list node).
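Putting the numbers together, a minimal sketch of two structures consistent with this discussion is shown below. Here, struct list_head is the kernel's embedded list node; the field name y and the assumption that there is no alignment padding (so that the offsets are exactly 4 and 8) are simplifications made for illustration.

struct abc {
    int x;                   /* bytes 0-3 */
    struct list_head list;   /* offsetof(struct abc, list) == 4 */
};

struct def {
    int x;                   /* bytes 0-3 */
    float y;                 /* bytes 4-7 */
    struct list_head list;   /* offsetof(struct def, list) == 8 */
};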
It is important to note that this computation happens at build time. The
preprocessor expands the macro into ordinary C code, and the compiler, which
knows the layouts of all the structures that are used, evaluates offsetof to a
constant. Computing the starting address of the linked list node (encapsulating
object) is then easy: the macro inserts a simple piece of code into the program
that performs a single subtraction.
A macro is quite different from a regular function. Its job is to generate
custom code that is subsequently compiled. In this case, an example piece of
code that will be generated will look like this: (struct Node *)(__mptr - 8). Here,
we are assuming that the structure is struct Node and the offset of the list
member within it is 8. At runtime, it is quite easy to compute this given a
pointer (ptr) to a struct list_head.
Listing C.4: Example of code that uses the list_entry macro

struct abc *current = ...;
struct abc *next = list_entry(current->list.next, struct abc, list);
Listing C.4 shows a code snippet that uses the list_entry macro where
struct abc is the linked list node. The list_entry macro is simply a syn-
onym of container_of – their signatures are identical. The current node that
we are considering is current. To find the next node (the one after
current), which is again of type struct abc, all that we need to do is invoke
the list_entry macro. In this case, the pointer (ptr) is current->list.next.
This is a pointer to the struct list_head object in the next node. From this
pointer, we need to find the starting address of the encapsulating abc structure.
The type is struct abc and the member is list. The list_entry macro inter-
nally calls offsetof, which returns an integer. This integer is subtracted from
the starting address of the struct list_head member in the next node. The
final result is a pointer to the encapsulating object.
This is a very fast and generic mechanism for traversing linked lists in Linux.
It is independent of the type of the encapsulating object. These primitives can
also be used to add and remove nodes from the linked list. We can extend this
discussion to create a linked list that has different kinds of encapsulating objects.
Theoretically, this is possible as long as we know the type of the encapsulating
object for each struct list_head on the list.
Listing C.5: The hlist_head and hlist_node structures

struct hlist_head {
    struct hlist_node *first;
};

struct hlist_node {
    struct hlist_node *next, **pprev;
};
Let us now describe singly-linked lists that are frequently used in kernel code.
Here the explicit aim is a one-way traversal of the linked list. An example is a
hash table where we resolve collisions by chaining entries that hash to the same
entry. Linux uses the struct hlist_head structure (shown in Listing C.5). It
points to a node that is represented using struct hlist_node.
This data structure has a next pointer to another hlist_node. Sadly, this
information is not enough if we wish to delete the hlist_node from the linked
list. We need a pointer to the previous entry as well. This is where a small
optimization is possible, and a few instructions can be saved. We actually
store a pointer to the next member of the previous node in the linked list.
This information is stored in the field pprev; its type is struct hlist_node **.
The advantage of this is that while deleting the current node, we can directly
set the previous node's next pointer to a new value via pprev. We cannot easily
do much else with it (such as traversing the list backwards), which is the
explicit intention here. The conventional solution in this case is to store a
pointer to the previous hlist_node. Any delete method needs to first fetch this
pointer, compute the address of its next member, and then reassign the pointer
to a different value. The advantage of the pprev pointer is that we save on the
instruction that computes the address of the next pointer of the previous node.
Such data structures that are primarily designed to be singly-linked lists are
often very performance efficient. Their encapsulating objects are accessed in
exactly the same way as the doubly-linked list struct list head.
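The following sketch shows how deletion uses the pprev field; it mirrors the logic of the kernel's __hlist_del helper in a simplified form (without the debugging/poisoning steps found in the actual code).

static inline void hlist_del_node(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;            /* make the predecessor's next pointer (or the head) skip n */
    if (next)
        next->pprev = pprev;  /* fix the back pointer of the successor */
}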
The maximum depth of any leaf is at most twice the minimum depth.
This is quite easy to prove. As we have mentioned, the black depth of all
the leaves is the same. Furthermore, we have also mentioned that a red node
can never have a red child. Assume that in any path from the root to a leaf,
there are r red nodes and b black nodes. We know that b is a constant for all
paths from the root. Furthermore, on any such path, every red node is followed
by a black node, because a red node can never have a red child (note that all
leaves or sentinel nodes are black). Hence, r ≤ b. The total depth of
any leaf is r + b ≤ 2b. Given that the minimum depth is at least b, the maximum
depth is at most twice the minimum depth.
This vital property ensures that all search operations always complete in
O(log(n)) time. Note that a search operation in an RB tree operates in exactly
the same manner as a regular binary search tree. Insert and delete operations
also complete in O(log(n)) time. They are however not very simple because we
need to ensure that the black depth of all the leaves always stays the same, and
a red parent never has a red child.
This requires a sequence of recolorings and rotations. However, we can prove
that at the end, all the properties hold and the overall height of the tree is always
O(log(n)).
C.3 B-Tree
A B-tree is a generalization of a binary search tree. It is a k-ary tree and is
self-balancing. In this case, a node can have more than two children; quite
unlike a red-black tree. This is also a balanced tree and all of its operations
are realizable in logarithmic time. The methods of traversing the tree are very
similar to traversing a classical binary search tree. It is typically used in systems
that store a lot of data and quickly accessing a given datum or a contiguous
subset of the data is essential. Hence, databases and file systems tend to use
B-trees quite extensively.
Let us start with the definition of a B-tree of order m. It stores a set of keys,
which can optionally point to values. The external interface is similar to a hash
table.
3. It is important that the tree does not remain sparse. Hence, every internal
node needs to have at least ⌈m/2⌉ children (alternatively, ⌈m/2⌉ − 1 keys).
Figure: An example B-tree. The internal nodes store the keys 2, 4, 8, 10, 16 and 25, and the leaves store the keys 1, 3, 5, 7, 9, 11, 14, 17 and 26.
In the figure, one of the internal nodes stores two keys – 8 and 10 – and points to
three leaf nodes. Finally, the rightmost child of the root only stores keys that are greater than 12.
It is easy to observe that traversing a B-tree is similar to traversing a regular
BST (binary search tree). It has O(log_m n) levels. At each level, finding the
pointer to the correct subtree takes O(log m) time, assuming that we perform a
binary search over the keys stored in the node. The total time complexity is
thus O(log_m n × log m), which is O(log n).
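A sketch of the search procedure is shown below. The node layout (MAX_KEYS, the key/value/child arrays) is an assumption made for illustration; real implementations pick the fan-out so that a node fills a disk block or a cache block.

#define MAX_KEYS 15   /* m - 1 keys for an order m = 16 tree (arbitrary choice) */

struct btree_node {
    int nkeys;                                  /* number of keys currently stored */
    long keys[MAX_KEYS];                        /* keys in sorted order */
    void *values[MAX_KEYS];                     /* values associated with the keys */
    struct btree_node *children[MAX_KEYS + 1];  /* all NULL in a leaf node */
};

void *btree_search(const struct btree_node *node, long key)
{
    while (node != NULL) {
        /* binary search for the first key >= key: O(log m) work per level */
        int lo = 0, hi = node->nkeys;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (node->keys[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo < node->nkeys && node->keys[lo] == key)
            return node->values[lo];            /* found the key */
        node = node->children[lo];              /* descend into the appropriate subtree */
    }
    return NULL;                                /* the key is not present */
}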
C.3.3 B+ Tree
The B+ tree is a variant of the classical B-tree. In the case of a B-tree, internal
nodes can store both keys and values, however in the case of a B+ tree, internal
nodes can only store keys. All the values (or pointers to them) are stored in the
leaf nodes. Furthermore, all the leaf nodes are connected to each other using
a linked list, which allows for very efficient range queries. It is also possible to
do a sequential search in the linked list and locate data with proximate keys
quickly.
A balanced binary search tree (BST) has roughly log_2 n levels, whereas a
B-tree and its variants have log_m n levels. They thus have fewer levels mainly
because more information is stored in each internal node. This is where the
design can be made cache efficient. An internal node can be designed in such a
way that its contents fit within a cache block or maybe a few cache blocks. The
node’s contents fully occupy a cache block and no other information is stored in
each cache block. The advantages of these schemes are thus plenty. We end up
fetching fewer cache blocks to traverse the tree as compared to a BST. This is
because a cache block fetch is more productive. There is much more information
in a block in a B-tree. Fetching fewer cache blocks is a good idea. Statistically,
there will be fewer cache misses and the chances of having long memory-related
stalls will be much lower.
It is important to understand that a 64 or 128-byte cache block is the atomic
unit of transfer in the memory system. There is no point in fetching 64 bytes
yet using only 25% of it as is the case in a BST that simply stores two pointers
in each node: one to the left child and one to the right child.
There are other advantages as well. We ideally do not want the data of
two different tree nodes to be stored in the same cache block. In this case,
if different threads are accessing different nodes in the tree and making write
accesses, there is a chance that they may actually be accessing the same cache
block. This will happen in the case of a BST and will not happen with a B-
tree. Due to such conflicting accesses, there will be a lot of misses due to cache
coherence in a BST. The cache block will keep bouncing between cores. Such
misses are known as false sharing misses. Note that the same data is not being
shared across threads. The data is different, yet they are resident in the same
cache block. This problem does not afflict a B-tree and its variants.
Along with reduced false sharing, it is easy to handle true sharing misses as
well. In this case, two threads might be trying to modify the same tree node. It
is possible to lock a node quite easily. A small part of the corresponding cache
block can be reserved to store a multi-bit lock variable. This makes acquiring a
“node lock” very easy.
For a combination of all these factors, B-trees and B+ trees are preferred as
compared to different flavors of balanced binary search trees.
Figure C.2: A radix tree that stores the strings travel, tryst, truck, tread, tram, tractor, trust, trim, trick and try (the common prefix "tr" is stored at the root)
A radix tree stores a set of keys very efficiently. Each key is represented
as a string (see Figure C.2). The task is to store all the keys in a single data
structure, and it is possible to query the data structure and find if it contains a
given string (key) or not. Here, values can be stored at both the leaf nodes and
internal nodes.
The algorithm works on the basis of common prefixes. The path from the
root to a node encodes the prefix. Consider two keys “travel” and “truck”.
In this case, we store the common prefix “tr” at the root node and add two
children to the root node: ‘a’ and ‘u’, respectively. We proceed similarly and
continue to create common prefix nodes across keys. Consider two more keys
“tram” and “tractor”. In this case, after we traverse the path with the prefix
“tra”, we create two leaf nodes “ctor” and “m”. If we were to now add a new
key “trams”, then we would need to create a new child “s” with the parent as
the erstwhile leaf node labeled “m”. In this case, both “tram” and “trams”
would be valid keys. Hence, there is a need to annotate every internal node
with an extra bit to indicate that the path leading from the root to that node
corresponds to a valid key. We can then associate a value with any node –
internal or leaf – whose path from the root corresponds to a valid key.
The advantage of such a structure is that we can store a lot of keys very
efficiently and the time it takes to traverse it is proportional to the number
of letters within the key. Of course, this structure works well when the keys
share reasonably long prefixes. Otherwise, the tree structure will not form, and
we will simply have a lot of separate paths. Hence, whenever there is a fair
amount of overlap in the prefixes, a radix tree should be used. It is important
to understand that the lookup time complexity is independent of the number of
keys – it is theoretically only dependent on the number of letters (digits) within
a key.
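A lookup can be sketched as follows. The node layout (edge labels stored in the child nodes, a fixed maximum fan-out) is an assumption made purely for illustration; it is not how the kernel's radix tree, which indexes integer keys, is organized.

#include <stdbool.h>
#include <string.h>

#define MAX_CHILDREN 16                /* arbitrary fan-out for this sketch */

struct radix_node {
    const char *label;                 /* the string on the edge leading to this node */
    bool is_key;                       /* the path from the root is a valid key */
    int nchildren;
    struct radix_node *children[MAX_CHILDREN];
};

bool radix_lookup(const struct radix_node *node, const char *key)
{
    while (node != NULL) {
        size_t len = strlen(node->label);
        if (strncmp(key, node->label, len) != 0)
            return false;              /* the edge label does not match the key */
        key += len;                    /* consume the matched prefix */
        if (*key == '\0')
            return node->is_key;       /* the whole key was matched: is it marked valid? */

        const struct radix_node *next = NULL;
        for (int i = 0; i < node->nchildren; i++)
            if (node->children[i]->label[0] == *key)
                next = node->children[i];   /* at most one child starts with this letter */
        node = next;
    }
    return false;
}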
Insertion and deletion are easy. We need to first perform a lookup operation
and find the point at which the non-matching part of the current key needs to
be added. There is a need to add a new node that branches out of an existing
node. Deletion follows the reverse process. We locate the key first, delete the
node that stores the suffix of the string that is unique to the key and then
possibly merge nodes.
There is a popular data structure known as a trie, which is a prefix tree
like a radix tree with one important difference: in a trie, we proceed letter by
letter. This means that each edge corresponds to a single letter. Consider a
system with two keys “tractor” and “tram”. In this case, we will have the root
node, an edge corresponding to ‘t’, then an edge corresponding to ‘r’, an edge
corresponding to ‘a’, so on and so forth. There is no point in having a node
with a single child. We can compress this information to create a more efficient
data structure, which is precisely a radix tree. In a radix tree, we can have
multi-letter edges. In this case, we can have an edge labeled “tra” (fuse all
single-child nodes).
This kind of tree-based structure is also very useful for representing the information
stored in a bit vector: each leaf corresponds to a bit, and an internal node stores a 1
if at least one leaf in its subtree stores a 1. Setting a bit is easy – we set the
corresponding leaf to 1 and then set all of its ancestors to 1. Clearing a bit is more
tricky. We need to traverse the tree towards the root; however, we cannot blindly
convert 1s to 0s. Whenever we reach a node on the path from a leaf to the root,
we need to take a look at the contents of the other child and decide accordingly.
If the other child contains a 1, then the process terminates right there. This is
because the parent node is the root of a subtree that contains a 1 (via the other
child). If the other child contains a 0, then the parent’s value needs to be set to
0 as well. This process terminates when we reach the root.
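The clearing logic can be sketched as follows, assuming (purely for illustration) that the tree is a perfect binary tree stored as an array in heap order, where node i has children 2i and 2i+1, and node 1 is the root.

/* tree[i] is 1 if at least one leaf in the subtree rooted at node i stores a 1 */
void clear_bit(char *tree, int leaf)
{
    int node = leaf;
    tree[node] = 0;                       /* clear the leaf itself */
    while (node > 1) {
        int sibling = (node % 2 == 0) ? node + 1 : node - 1;
        if (tree[sibling] == 1)
            break;                        /* the parent still covers a 1 via the sibling */
        node = node / 2;                  /* move up to the parent */
        tree[node] = 0;                   /* no 1s remain below the parent */
    }
}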
Figure: A Bloom filter. A key is mapped to k different bit positions in an array of m bits using k hash functions (H1, H2, H3 and H4 in the figure), and all k positions are set to 1. Initially, all the bits are 0.
We cannot delete a key by simply resetting its k bits, because other keys may map
to some of the same bits; all of those keys would effectively get removed as well,
which is something that we clearly do not want. One option is to store a counter
at each entry instead of a bit. When a
key is added to the set, we just increment all the associated counters. This is
fine as long as we do not have overflows. One of the important reasons for opting
for a Bloom filter is its simplicity and compactness. This advantage will be lost
if we start storing large counters in each entry. With this approach removing a
key is very easy – we just decrement the associated counters. Nevertheless, the
overheads can be sizeable and the benefits of compactness will be lost. Hence,
counters are normally not used in Bloom filters.
The other issue is that bits get flipped in only one direction, 0 to 1. They
never get flipped back because we do not do anything when an element is re-
moved. As a result, the Bloom filter becomes full of 1s with the passage of time.
There is thus a need to periodically reset the bits.
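To make the basic operations concrete, here is a minimal Bloom filter sketch. The sizes and the seeded multiplicative hash function are arbitrary choices for illustration, not what any particular system uses.

#include <stdint.h>
#include <stdbool.h>

#define M 1024                     /* number of bits in the filter */
#define K 4                        /* number of hash functions */

static uint8_t bits[M / 8];        /* the bit array, initially all zeros */

static uint32_t hash(const char *key, uint32_t seed)
{
    uint32_t h = seed;
    while (*key)
        h = h * 31 + (uint8_t)(*key++);
    return h % M;                  /* a bit position in [0, M) */
}

void bloom_add(const char *key)
{
    for (uint32_t i = 0; i < K; i++) {
        uint32_t pos = hash(key, i + 1);
        bits[pos / 8] |= (uint8_t)(1u << (pos % 8));   /* set the k bit positions to 1 */
    }
}

bool bloom_maybe_contains(const char *key)
{
    for (uint32_t i = 0; i < K; i++) {
        uint32_t pos = hash(key, i + 1);
        if (!(bits[pos / 8] & (1u << (pos % 8))))
            return false;          /* definitely not in the set */
    }
    return true;                   /* possibly in the set (false positives are possible) */
}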
Bibliography
[Belady et al., 1969] Belady, L. A., Nelson, R. A., and Shedler, G. S. (1969).
An anomaly in space-time characteristics of certain programs running in a
paging machine. Communications of the ACM, 12(6):349–353.
[Corbet, 2010] Corbet, J. (2010). The case of the overly anonymous anon_vma.
Online. Available at: https://lwn.net/Articles/383162/.
[Corbet, 2014] Corbet, J. (2014). Locking and pinning. Online. Available at:
https://lwn.net/Articles/600502/.
[Cormen et al., 2009] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein,
C. (2009). Introduction to Algorithms. MIT Press, third edition.
[Fornai and Iványi, 2010a] Fornai, P. and Iványi, A. (2010a). Fifo anomaly is
unbounded. Acta Univ. Sapientiae, 2(1):80–89.
[Fornai and Iványi, 2010b] Fornai, P. and Iványi, A. (2010b). Fifo anomaly is
unbounded. arXiv preprint arXiv:1003.1336.
[Herlihy and Shavit, 2012] Herlihy, M. and Shavit, N. (2012). The Art of Mul-
tiprocessor Programming. Elsevier.
[Karger et al., 1999] Karger, D. R., Stein, C., and Wein, J. (1999). Scheduling
algorithms. Algorithms and theory of computation handbook, 1:20–20.
[Lameter and Kumar, 2014] Lameter, C. and Kumar, P. (2014). this_cpu operations. Online. Available at: https://docs.kernel.org/core-api/this_cpu_ops.html.
[Mall, 2009] Mall, R. (2009). Real-time systems: theory and practice. Pearson
Education India.
Zombie Task, 71
Zone, 324
Zones
sections, 269