Futex VM
Abstract—As discovered in our previous benchmark work, a small number of workloads in the PARSEC benchmark suite suffer from heavy performance loss in a virtual execution environment, and the major loss exhibits a fairly strong connection with thread synchronization operations. This paper examines one workload of this kind that makes heavy use of thread synchronization operations and shows the performance impact of the futex system call on applications in a virtual execution environment. By measuring how often subroutines run at particular time costs, the most timing-decisive parts of the application in a virtual execution environment are located. A suitable thread scheduling measure for a virtual execution environment is also proposed.

Keywords: performance; futex; virtual machine; gang-scheduling

I. INTRODUCTION

A virtual machine is implemented by pretending to be the real hardware underneath it. Applications may be totally unaware that they are running in a virtual execution environment. As virtualization technologies mature, applications behave increasingly as they do in a bare-metal environment. With modern hardware support, instructions in recent virtual machines may be executed directly on the physical hardware under the supervision of the virtual machine monitor (VMM), so most instructions and operations execute at nearly native speed. Only a few operations still suffer from heavy performance loss, and the thread synchronization operation, namely the futex system call, is one of them.

The futex system call tends to be inefficient on a virtual machine and thus slows down applications that make heavy use of it; that is what we discovered in our previous benchmark. Meanwhile, workloads exploiting thread-level parallelism are not uncommon in high-performance computing, and they may be plagued by heavy performance degradation in a virtual execution environment because of the futex system call. Although a considerable number of measures have been applied to tackle various virtualization overheads, an effective solution that reduces the overhead of the futex system call in a virtual execution environment has not been found yet. Since a wide range of HPC applications could potentially benefit from improving the virtualization layer with respect to the futex system call, it is worth the effort to seek the root cause and a feasible measure for better performance.

By nature the futex system call is simple: only the manipulation of a variable known as a lock in user space and of a wait queue in kernel space is involved. However, considering that in a multi-threaded execution environment multiple threads may compete for critical hardware or software resources, the current state of the execution environment can be highly unpredictable. It is fairly hard to estimate how long a given lock will be held by one thread, how many threads are waiting on a condition variable, how fierce the competition may be at a given moment, and so on. Even on a physical machine the futex system call can be expensive in certain cases, not to mention that the virtual execution environment is much more complex in its implementation and likely to incur more overhead.

Unlike in a native execution environment, program execution in a virtual one involves cooperation between the guest kernel and the VMM. Overheads are almost inevitable with the control flow being constantly transferred from one of these software layers to another. It is therefore understandable that operations in a virtual execution environment suffer more or less performance overhead compared with the native one, and for certain kinds of operations the situation may be even worse. As different operations require different kinds and amounts of resources, the performance overhead varies from one kind to another. To investigate the reasons for the performance gap, the differences between the virtual and the physical hardware should be considered.

We examined thread synchronization, namely the futex system call, in a particular kind of workload from the PARSEC benchmark suite, and propose a scheduling measure that may alleviate the performance loss. The rest of this paper is organized as follows: In Section 2 the related work is surveyed. A brief introduction to POSIX threads and futex-related operations is presented in Section 3. In Section 4 we study in detail the performance impact of the futex system call from a number of aspects, including changes in system resource utilization and run-time behavior, and host and guest reactions under the pressure of the workload with and without futex. A discussion of the possible reasons is given in Section 5 from the thread scheduling model point of view, where corresponding measures are also proposed. Finally, in Section 6 we conclude and outline future work.
[Figure 1: dedup executions with and without futex on KVM guest machine. Panels: (a) Resource utilization of dedup with futex; (b) Behaviour of dedup with futex; (c) Resource utilization of dedup without futex; (d) Behaviour of dedup without futex. Plotted metrics: CPU and MEM (percentage of hardware utilization), BLOCK IN, BLOCK OUT, INTERRUPT, CONTEXT SWITCH.]
II. RELATED WORKS

The main research on virtual-machine thread scheduling and synchronization is surveyed below. D. Arteaga et al. [1] showed that on a multi-core, multi-threading microprocessor, threads in a virtual machine actually undergo scheduling at two granularity levels: first by the VMM, which acts as a process scheduler, and then by the thread scheduler on the hardware. The book [2] addressed the synchronization problem among virtual processors on multi-core virtual machine platforms. To reduce the latency of communication and synchronization among threads, "gang scheduling" is suggested to synchronize the scheduling of virtual processors that are mapped onto different physical processors. In [3], the performance loss on four kinds of virtual machines was examined. With a detailed analysis of the run-time behavior and the frequency of system calls, the futex and storage-I/O system calls were identified as the key factors that incur the major performance overhead in virtual execution environments. M. Danish et al. [4] describe the scheduling infrastructure of the Quest operating system, which is based on the concept of a VCPU: Quest schedules VCPUs on physical processors, while software threads are scheduled on VCPUs. U. Drepper [5] made a detailed analysis of the performance and implementation of the futex system call in a normal execution environment and described thread synchronization on top of it. Earlier, a similar effort was made by H. Franke et al. [6] based on an early Linux kernel, which also covered the connection between futex and the POSIX thread synchronization mechanism.

Among these efforts, the contribution most closely related to ours was made by N. A. Dadhania [7], who implemented an experimental gang-scheduling mechanism in the Linux kernel meant to address the limitations of KVM on Power, especially the lock-holder-preemption problem, by reducing the lock-acquisition time. This provides an important hint for dealing with the overhead experienced in our practice.

III. BRIEF INTRODUCTION TO POSIX THREADS AND FUTEX-RELATED OPERATIONS

The POSIX thread library is the standard C threading API on Linux. Because the POSIX thread specification requires sharing of almost all resources, POSIX thread mutexes are implemented on top of futex-related operations and variables in recent versions of Linux, and they are part of the POSIX standard library [8].
[Figure 2: dedup executions with and without futex on the physical machine (CPU and MEM utilization in percentage of hardware utilization, BLOCK IN, BLOCK OUT, INTERRUPT and CONTEXT SWITCH over time).]
The term futex (short for "fast userspace mutex") denotes a Linux construct that can be used to implement basic locking, or as a building block for higher-level locking abstractions such as semaphores, POSIX mutexes, and condition variables [9]. A key component of the futex mechanism is a wait queue attached to an aligned variable in user space. This aligned variable serves as the control variable mentioned above: when multiple threads or processes operate on it, they compare its value against a previously specified expected value to decide whether trapping into the kernel is still needed for a further operation such as waking up waiting threads or processes, or removing the current waiter from the wait queue, depending on the result of the comparison. A properly programmed futex-based lock should trap as rarely as possible, ideally only when the lock is contended. Since most operations do not require arbitration between processes, the futex implementation is efficient.
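To make the control-variable-plus-wait-queue idea concrete, the following is a minimal C sketch of a futex-based lock in the spirit of the examples analyzed in [5]. The identifiers (sys_futex, futex_lock, futex_unlock) are ours and illustrative only; production mutexes such as glibc's handle many more corner cases.

    /* Lock states: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <stdatomic.h>
    #include <unistd.h>

    static long sys_futex(atomic_int *uaddr, int op, int val)
    {
        /* glibc offers no wrapper, so the raw system call is used. */
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Compare-and-swap helper that returns the value observed in *p. */
    static int cmpxchg(atomic_int *p, int expected, int desired)
    {
        atomic_compare_exchange_strong(p, &expected, desired);
        return expected;
    }

    void futex_lock(atomic_int *f)
    {
        int c = cmpxchg(f, 0, 1);          /* fast path: 0 -> 1, no kernel entry */
        if (c == 0)
            return;
        do {
            /* Mark the lock contended (1 -> 2) and sleep; the kernel
             * atomically re-checks *f == 2 before putting us to sleep. */
            if (c == 2 || cmpxchg(f, 1, 2) != 0)
                sys_futex(f, FUTEX_WAIT, 2);
        } while ((c = cmpxchg(f, 0, 2)) != 0);
    }

    void futex_unlock(atomic_int *f)
    {
        /* Trap into the kernel only if there may be waiters. */
        if (atomic_fetch_sub(f, 1) != 1) {
            atomic_store(f, 0);
            sys_futex(f, FUTEX_WAKE, 1);
        }
    }

The uncontended path stays entirely in user space; only the contended cases reach the kernel, which is exactly where the virtualization overhead discussed below is paid.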
IV. BENCHMARK OBSERVATION

To demonstrate the impact of the futex system call in different cases, we also compiled dedup without the pthread-related operations, thereby closing the door to futex; naturally, no pipeline is then available for data processing any more. We compared the performance and run-time behavior in the following aspects: 1. performance and run-time behavior of the workload with and without the futex system call; 2. host and guest reactions upon heavy futex invocation in various cases; 3. time spent in futex in both the virtual and the native case. The general configuration of our test machine is listed in Table I.

A. Comparison between workloads with and without futex

As depicted in Figure 1, a great contrast between the runs with and without the futex system call can be witnessed in both performance and run-time behavior. While memory utilization remains nearly the same, the workload now runs on only a single core and its load is distributed more evenly over the whole run time. Interrupts and context switches are also reduced significantly. As the total run time shrinks by almost a half, the performance loss is likewise reduced to a great extent.
Table I: Host machine configuration

  Bogomips            5226.48
  TLB size            1024 4K pages
  Memory              DDR2 8192 MB
  Hard disk           250 GB
  Operating System    Debian Squeeze 6.0.2, Linux

[Figure 4: Time spent in futex in both virtual and native cases (native vs. virtual, for T = 1, 2, 4, 8 threads).]

Similar effects are observed on the physical machine, as suggested by Figure 2, with regard to processor core and memory utilization and the reduced interrupt and context-switch counts. However, the large cut in time cost seen on virtual machines is not observed there.
B. Host and guest reactions upon heavy invocation of futex

Because the KVM guest is a fully virtualized execution environment and thus isolated from the host, applications running in it are expected to behave just as they do on the host. Without the help of a dedicated counter or tracer it is difficult to observe what exactly happens under the hood when futex is called in the guest. As a remedy we compare the reactions of the guest and the host under the pressure of workloads with and without futex. Figure 3 illustrates this in the same aspects as before, by collecting statistics of qemu, the userspace part of the KVM guest, on the host. Just as in the virtual case, contention over the processor among threads becomes fiercer as the number of threads increases.

Figures 3(a), 3(c), 3(e) and 3(g) show the variation of processor and memory utilization observed on the host as dedup was executed in the different cases on the guest. The physical processor cores get more and more overloaded by the threads of qemu, most likely by the threads running as the virtual processors of the guest. Contention over the physical processor cores also worsens as dedup is started with more threads. Furthermore, as Figure 3(a) suggests, part of the contention is contributed by qemu itself, since no futex-related operations are issued by dedup on the guest side any more.

In addition, measurements of storage traffic, interrupts and context switches are presented in Figures 3(b), 3(d), 3(f) and 3(h). These metrics generally keep pace with the occurrence of contention among the threads the workload forked.

Finally, the time spent in futex in both the virtual and the native case is illustrated by Figure 4. In the virtual case futex may consume several times as much time as in the native case. Besides, the time cost of futex in the native case rises steadily, almost proportionally to the number of threads. In contrast, the time cost of futex in a virtual machine is highly unpredictable and not obviously related to the number of threads, as a result of the complexity added by virtualization.
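Statistics of this kind can be sampled on the host directly from the proc filesystem. The sketch below is only one possible way to do it, not necessarily the tooling behind Figure 3: it reads the accumulated user and system CPU time of a process such as qemu from /proc/<pid>/stat and its context-switch counters from /proc/<pid>/status.

    /* Illustrative only: sample CPU time and context switches of a PID
     * (e.g. the qemu process backing a KVM guest) from /proc. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* utime and stime are fields 14 and 15 of /proc/<pid>/stat (clock ticks). */
    static void read_cpu_ticks(int pid, unsigned long *utime, unsigned long *stime)
    {
        char path[64], buf[4096];
        snprintf(path, sizeof path, "/proc/%d/stat", pid);
        FILE *fp = fopen(path, "r");
        if (!fp || !fgets(buf, sizeof buf, fp)) { if (fp) fclose(fp); return; }
        fclose(fp);
        char *p = strrchr(buf, ')');          /* skip past "(comm)" */
        if (!p) return;
        p += 2;                               /* p now points at field 3 (state) */
        int field = 3;
        for (char *tok = strtok(p, " "); tok; tok = strtok(NULL, " "), field++) {
            if (field == 14) *utime = strtoul(tok, NULL, 10);
            if (field == 15) { *stime = strtoul(tok, NULL, 10); break; }
        }
    }

    /* Voluntary/involuntary context switches from /proc/<pid>/status. */
    static void read_ctxt_switches(int pid, unsigned long *vol, unsigned long *invol)
    {
        char path[64], line[256];
        snprintf(path, sizeof path, "/proc/%d/status", pid);
        FILE *fp = fopen(path, "r");
        if (!fp) return;
        while (fgets(line, sizeof line, fp)) {
            sscanf(line, "voluntary_ctxt_switches: %lu", vol);
            sscanf(line, "nonvoluntary_ctxt_switches: %lu", invol);
        }
        fclose(fp);
    }

Sampling these values periodically and taking differences yields time series of CPU usage and context switches comparable to those plotted in Figure 3.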
V. ANALYSIS OF THE PERFORMANCE OVERHEAD

A. Thread synchronization part of the algorithm in dedup

As futex serves as the building block of the mutual exclusion operations in the kernel, on which most programming libraries, including the POSIX NPTL package, depend, the futex system call is introduced into dedup through its use of pthreads. The central part of this workload, shown in Figure 5, creates a five-stage pipeline, from which a model is summarized in Figure 6.

Since threads of neighboring stages may operate on the same data structures, locking and synchronization must be applied to guarantee the validity of the data. How efficient these mutual exclusion operations are has an immediate impact on the overall performance and throughput of the pipeline. However, because the resolution of the kernel timer is not high enough to measure the time consumed by an individual futex operation, other approaches have to be used instead.

In our experiments we therefore measure the time consumed by program sections at a higher granularity level, each of which tends to include multiple futex operations. The execution time of each subroutine is measured in both the virtual and the native case, except for subroutines that are called only once during the execution and are thus of little significance for our purpose; the results are shown in Tables II and III.

Based on the data in the two tables, the subroutines that behave differently in the two kinds of execution environment are identified, of which sub_Compress and sub_ChunkProcess are the most unusual. They contain the basic thread locking and unlocking operations on the mutexes and locks of many instances of the queue and hash_table types, and they constitute the major operations of the pipeline stages Compress and ChunkProcess.
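As an illustration of this coarse-grained measurement, the following sketch wraps a program section with clock_gettime() and accumulates a histogram of its execution times in 0.01 s bins, mirroring the layout of Tables II and III. The harness is an assumed setup for illustration, not necessarily the exact instrumentation used for the measurements.

    /* Illustrative timing harness: histogram of a code section's run time
     * in 0.01 s bins, matching the column layout of Tables II and III. */
    #include <stdio.h>
    #include <time.h>

    #define NBINS 8                        /* bins 0.00 s .. 0.06 s and above */
    static unsigned long hist[NBINS];

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static void record(double start, double end)
    {
        int bin = (int)((end - start) / 0.01);   /* 0.01 s resolution */
        if (bin >= NBINS) bin = NBINS - 1;
        hist[bin]++;
    }

    /* Usage sketch:
     *   double t0 = now_sec();
     *   sub_Compress(...);                 // section being measured
     *   record(t0, now_sec());
     */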
[Figure 3: Host reaction upon dedup with and without futex execution in KVM guest. Panels: (a) Host resource utilization without futex; (b) Host behavior without futex; (c) Host resource utilization in 1-thread case; (d) Host behaviour in 1-thread case; (e) Host resource utilization in 2-thread case; (f) Host behaviour in 2-thread case; (g) Host resource utilization in 4-thread case; (h) Host behaviour in 4-thread case. Plotted metrics: CPU, MEM, BLOCK IN, BLOCK OUT, INTERRUPT, CONTEXT SWITCH.]
    DataProcess(&data_process_args);
    /* DataProcess stage thread */
    pthread_create(&threads_process, NULL, DataProcess, &data_process_args);
    /* worker threads for the FindAllAnchors stage */
    for (i = 0; i < conf->nthreads; i++) {
        pthread_create(&threads_anchor[i], NULL, FindAllAnchors, &anchor_thread_args[i]);
    }
    ......
    /* worker threads for the ChunkProcess stage */
    for (i = 0; i < conf->nthreads; i++) {
        pthread_create(&threads_chunk[i], NULL, ChunkProcess, &chunk_thread_args[i]);
    }
    ......
    /* worker threads for the Compress stage */
    for (i = 0; i < conf->nthreads; i++) {
        pthread_create(&threads_compress[i], NULL, Compress, &compress_thread_args[i]);
    }
    /* SendBlock stage thread */
    if (!conf->preloading) {
        pthread_create(&threads_send, NULL, SendBlock, &send_block_args);
    }
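The stage threads created above hand data to one another through shared queues guarded by pthread mutexes and condition variables, and it is these operations that reach futex under contention. The following is a minimal sketch of such a hand-off queue; the type and function names are illustrative and not dedup's actual queue implementation.

    /* Minimal sketch of a bounded hand-off queue between two pipeline
     * stages. Initialise lock/conditions with PTHREAD_MUTEX_INITIALIZER
     * and PTHREAD_COND_INITIALIZER (or the *_init functions). */
    #include <pthread.h>

    #define QCAP 64

    typedef struct {
        void *items[QCAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } stage_queue_t;

    void queue_push(stage_queue_t *q, void *chunk)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QCAP)               /* producer waits when full */
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = chunk;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    void *queue_pop(stage_queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)                  /* consumer waits when empty */
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *chunk = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return chunk;
    }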
Table II: Occurrence counts of subroutine execution times (columns give the time cost in seconds)

  Threads  Subroutine        0.00    0.01   0.02  0.03  0.04  0.05  0.06  Total
  t=1      sub_Compress      165563  2791   0     0     0     0     0     168354
  t=1      sub_ChunkProcess  367888  2062   0     0     0     0     0     369950
  t=1      FindAllAnchor     1       0      0     0     0     0     0     1
  t=1      Sum               533452  4853   0     0     0     0     0     538305
  t=2      sub_Compress      165239  3115   0     0     0     0     0     168354
  t=2      sub_ChunkProcess  367482  2465   2     1     0     0     0     369950
  t=2      FindAllAnchor     2       0      0     0     0     0     0     2
  t=2      Sum               532723  5580   14    0     0     0     0     538306
  t=4      sub_Compress      164968  3383   3     0     0     0     0     168354
  t=4      sub_ChunkProcess  367133  2808   7     2     0     0     0     369950
  t=4      FindAllAnchor     3       1      0     0     0     0     0     4
  t=4      Sum               532104  6192   10    2     0     0     0     538308
  t=8      sub_Compress      164650  3667   23    1     9     4     0     168354
  t=8      sub_ChunkProcess  366342  3547   51    4     4     2     0     369950
  t=8      FindAllAnchor     8       0      0     0     0     0     0     8
  t=8      Sum               526762  7214   74    5     13    6     0     538312
Table III: Occurrence counts of subroutine execution times (columns give the time cost in seconds)

  Threads  Subroutine        0.00    0.01   0.02  0.03  0.04  0.05  0.06  Total
  t=1      sub_Compress      164933  3418   3     0     0     0     0     168354
  t=1      sub_ChunkProcess  365556  4380   14    0     0     0     0     369950
  t=1      FindAllAnchor     1       0      0     0     0     0     0     1
  t=1      Sum               530490  7798   17    0     0     0     0     538305
  t=2      sub_Compress      163019  5306   29    0     0     0     0     168354
  t=2      sub_ChunkProcess  365438  4497   15    0     0     0     0     369950
  t=2      FindAllAnchor     0       2      0     0     0     0     0     2
  t=2      Sum               528457  9805   44    0     0     0     0     538306
  t=4      sub_Compress      161158  7084   110   2     0     0     0     168354
  t=4      sub_ChunkProcess  365756  4185   9     0     0     0     0     369950
  t=4      FindAllAnchor     4       0      0     0     0     0     0     4
  t=4      Sum               526918  11269  119   0     0     0     0     538308
  t=8      sub_Compress      159238  8816   279   9     8     2     1     168353
  t=8      sub_ChunkProcess  365443  4478   27    1     0     1     0     369950
  t=8      FindAllAnchor     7       1      0     0     0     0     0     8
  t=8      Sum               524688  13295  306   10    8     3     1     538311
In these cases the time they may cost varies from 0.00 s up to 0.06 s, and the costlier cases always occur much more frequently in the virtual than in the native execution environment. The overheads of numerous small fractions then add up to the total performance overhead, which soon becomes non-negligible compared with that of the native case.

[Figure 7: A simplified thread scheduling model on a virtual machine. Labels in the figure: threads T0-T5, "Threads on guest", "Threads on host", "Guest OS Scheduler on VCPU 0", "Guest OS Scheduler on VCPU 1", "Host OS Scheduler on a physical CPU".]

B. Implementation of thread synchronization in the virtual case
Being aware of the inefficient program sections on virtual machines, it is necessary to seek the possible root causes in the implementation of thread synchronization on a virtual machine. Since the virtual processors are implemented by kernel threads, which are scheduled and executed on the physical processor cores like any other thread, their nature differs greatly from that of their physical counterparts as soon as thread synchronization across virtual processor cores is concerned.

Figure 7 depicts a much simplified scheduling model of the threads on a virtual machine. A thread in a virtual execution environment undergoes scheduling at different levels: it is first scheduled by the guest OS onto a virtual processor core, and then executed whenever that virtual processor is scheduled by the VMM or the host OS. Userspace threads on a guest are mapped by the guest OS kernel onto the virtual processors, which themselves may already have been mapped by the host OS kernel or the VMM onto physical processor cores. When two or more different userspace threads, e.g. T1, T3 and T5, are mapped onto different virtual processors, they are most likely to be scheduled at different moments on the physical processors. While T1 is active on VCPU 0 at a given moment, T3 on VCPU 1 and T5 on VCPU 2 may still be standing in the wait queues for their turn on the physical processors. In such a case, T1 has to wait longer: only after the VCPU on which the desired thread resides regains its time slice does the lock get a chance to be released. Unlike in the native case, with multiple levels of scheduling the threads are more restricted and find it harder to talk to each other.
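One crude way to mitigate the multi-level scheduling effect just described on an unmodified host, shown here only to illustrate the direction of a remedy rather than the gang-scheduling mechanism of [7], is to pin each VCPU thread of the guest to its own dedicated physical core, so that a VCPU holding a lock is not descheduled while its peers wait in futex. The thread IDs are assumed to be known, for example from the QEMU monitor or /proc/<qemu-pid>/task.

    /* Illustrative only: pin each qemu VCPU thread to a dedicated physical
     * core with sched_setaffinity(), a crude approximation of scheduling
     * all concerned VCPUs synchronously. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    int pin_vcpu_threads(const pid_t *vcpu_tids, int nvcpus, int first_core)
    {
        for (int i = 0; i < nvcpus; i++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(first_core + i, &set);        /* one core per VCPU */
            if (sched_setaffinity(vcpu_tids[i], sizeof set, &set) != 0) {
                perror("sched_setaffinity");
                return -1;
            }
        }
        return 0;
    }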
VI. CONCLUSION AND OUTLOOK

We have shown the performance impact of futex in a virtual execution environment based on benchmark results with a particular workload from the PARSEC benchmark suite. From the difference between the basic implementations of thread synchronization in the virtual and the native case, we infer that asynchronous VCPU scheduling is very likely the main reason for the great performance loss of futex in the virtual execution environment. As a remedy, a possible solution is to schedule all the concerned VCPUs synchronously, so that all threads taking part in communication on the virtual processors are guaranteed to be running exactly at the time of synchronization. In the future we will implement such a scheduling strategy on a virtual machine such as KVM to prove the assumption proposed here.

ACKNOWLEDGMENT

The PARSEC group provides a benchmark suite with various representative workloads from the real world; our work is made possible by their efforts. Besides, invaluable comments and remarks from our anonymous reviewers also helped us to improve the work. We would like to extend our sincere gratitude to them.

REFERENCES

[1] D. Arteaga, M. Zhao, C. Liu, P. Thanarungroj and L. Weng, Cooperative Virtual Machine Scheduling on Multi-core Multi-threading Systems — A Feasibility Study. Florida International University, Miami, FL, Dec. 2010.
[2] Intel Corporation and Parallel Processing Research Institute of Fudan University, System Virtualization: Principles and Implementation (系统虚拟化：原理与实现). Tsinghua University Press, Beijing, China, 2009, pp. 127-128.
[3] Y. Zhang, R. Oertel and W. Rehm, Performance Loss on Virtual Machines. Studentensymposium Informatik Chemnitz 2012, Tagungsband zum 1. Studentensymposium Chemnitz, 4. Jul. 2012.
[4] M. Danish, Y. Li and R. West, Virtual-CPU Scheduling in the Quest Operating System. Computer Science Department, Boston University, Boston, MA 02215, USA, Jan. 2011.
[5] U. Drepper, Futexes Are Tricky. Dec. 2005.
[6] H. Franke and R. Russell, Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux. Proceedings of the Ottawa Linux Symposium 2002, Ottawa, Ontario, Canada, Jun. 2002.
[7] Gang scheduling in CFS, https://lkml.org/lkml/2011/12/19/44.
[8] U. Drepper and I. Molnar, The Native POSIX Thread Library for Linux. Red Hat, Inc., Feb. 2005.
[9] Futex, http://en.wikipedia.org/wiki/Futex.