
yugr/uInit

What is this?

This document contains instructions for setting up systems with Ubuntu distros for stable benchmarking.

Modern distributions (e.g. Ubuntu or Debian) run a lot of daemons and services which eat precious CPU time (each context switch costs 2-3K cycles), pollute caches (D$, I$, TLB, BTB, etc.) and steal DRAM/L3 bandwidth; this is collectively called "OS jitter". On top of that, hyperthreading and dynamic voltage and frequency scaling (DVFS) add to the jitter, so in practice you can see up to 5% noise in benchmark runs (e.g. SPEC2000). This prevents reliable performance comparisons, since a typical compiler optimization for a general-purpose CPU may yield only around 2-3% improvement.

Note that recommendations in this guide are for obtaining stable, but not necessarily the fastest or even "realistic", measurements. Also they are most applicable to SPEC-like (userspace, CPU-bound) benchmarks.

I also don't cover multithreading-specific jitter (e.g. caused by false sharing or priority inversion).

Ways to reduce noise

Obviously you should run benchmarks on real hardware (not a cloud server, virtual machine, Docker container or WSL).

For remote servers be sure to use a non-standard SSH port and disable password authentication to avoid SSH scanning overheads (see e.g. this for details). Try to avoid tools like fail2ban because they may introduce additional noise.

To obtain more or less stable measurements (std. deviation less than 0.5%), you'll also need to

  • disable non-deterministic HW features in BIOS
  • reserve CPU cores for benchmarking (so that OS never touches them)
  • boot in non-GUI mode
  • disable ASLR and other funky OS features which hurt stability
  • execute your benchmark in special high-perf mode (increased priority, etc.)
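
To check whether a setup actually meets the 0.5% target, the relative standard deviation of repeated timings can be computed with a small helper. This is only a sketch: `rel_stddev` is a hypothetical name (not part of any tool mentioned in this guide) and it assumes one positive timing per line on stdin.

```shell
# Print the relative sample standard deviation (in percent) of the
# numbers read from stdin, one per line.
rel_stddev() {
  awk '{ x[NR] = $1; sum += $1 }
       END {
         if (NR < 2) { print "need at least 2 samples" > "/dev/stderr"; exit 1 }
         mean = sum / NR
         # Two-pass formula: sum of squared deviations from the mean
         for (i = 1; i <= NR; i++) ss += (x[i] - mean) * (x[i] - mean)
         printf "%.2f\n", 100 * sqrt(ss / (NR - 1)) / mean
       }'
}
```

Feed it the wall-clock times of, say, 10 to 20 runs of the same binary; a result above 0.5 suggests that some of the steps above are still missing.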

Basics

This section contains some really basic (but sadly often overlooked) recommendations.

If a benchmark prints a lot of output to stdout/stderr, redirect it to a file (or /dev/null).

Before measuring time, do several warmup runs to populate the OS file cache and L1i.

Fix the seeds of any random numbers in benchmarks that may influence program flow.

Try to measure all compared versions of a benchmark on the same day (differences in ambient temperature may trigger slightly different hardware frequency scaling).

Finally, avoid running other programs in parallel with the benchmark (including other benchmarks on separate cores, top/htop, etc.). In particular, disable cron jobs.
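
The first two recommendations (output redirection and warmup runs) can be combined in a small wrapper. This is a sketch under assumptions: `timed_run` is a hypothetical helper, and `date +%s%N` requires GNU date.

```shell
# Usage: timed_run WARMUPS COMMAND [ARGS...]
# Runs COMMAND WARMUPS times without timing it, then once more with
# timing, and prints the elapsed wall-clock time in microseconds.
# All output of COMMAND is discarded.
timed_run() (
  warmups=$1
  shift
  i=0
  while [ "$i" -lt "$warmups" ]; do
    "$@" >/dev/null 2>&1
    i=$((i + 1))
  done
  start=$(date +%s%N)   # nanoseconds since the epoch (GNU date only)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000 ))
)
```

For example, `timed_run 3 ./bench input.dat` (with a hypothetical `./bench`) does three warmup runs and reports the duration of the fourth. Note that the measurement includes process startup and the cost of the two `date` calls, so it is only suitable for benchmarks that run for seconds rather than milliseconds.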

BIOS settings

Disable frequency scaling (sometimes called "power save mode", "turbo mode", "perf boost", etc.), hyperthreading, HW prefetching, Turbo Boost, Intel Speed Shift and SpeedStep, C-states, etc. in BIOS settings. Note that disabling them in the kernel does not work for all kernel versions, so BIOS is preferred.

Run in non-GUI mode

To boot into non-GUI mode on systemd systems, see https://linuxconfig.org/how-to-disable-enable-gui-in-ubuntu-22-04-jammy-jellyfish-linux-desktop (on other systems use sudo init 1).

Disable unnecessary services

Even in non-GUI multi-user mode, modern distros run a lot of services which may distort your measurements.

For systemd distros a list of active services can be obtained via

$ systemctl list-units --all

(look for anything active or waiting).

You can identify services which actually cause problems on your system by running

$ script -qefc 'top -cbd3' > top.log
# Wait for several hours
$ grep -A2 CPU top.log | awk '{$1=$2=$3=$4=$5=$6=$7=$8=$10=$11=""; print $0}' | grep -v 'CPU\|^$\|\<script \|\<top \|^ *0\.0' | sed 's/^ *//'

and then disable them via sudo systemctl mask ....

On a typical Ubuntu desktop I suggest disabling at least

$ UNITS=$(systemctl list-units --all --plain | awk '/^(apt-daily|dpkg|unattended-upgrades|update-notifier|cron|anacron|avahi|fwupd|man-db|snapd|irqbalance|cups|systemd-oomd|udisks2|logrotate|motd-news)/{print $1}')
$ sudo systemctl mask $UNITS

(note that simply stopping them is not enough).

Also check timers

$ systemctl list-timers --all

and disable the ones you don't need.

On RAID servers, disable mdcheck_start and mdcheck_continue, which run rarely but have extremely high overhead.

Reserve cores for benchmarking

Add to /etc/default/grub:

# rcu_nocbs requires kernel built with CONFIG_RCU_NOCB_CPU and
# nohz_full requires CONFIG_NO_HZ_FULL
# (default Ubuntu kernels seem to have both)
GRUB_CMDLINE_LINUX_DEFAULT="nohz_full=8-15 kthread_cpus=0-7 rcu_nocbs=8-15 irqaffinity=0-7 isolcpus=nohz,managed_irq,8-15"

(this assumes that cores 0-7 are left to the system and cores 8-15 are reserved for benchmarks).

Then run

$ sudo update-grub

and reboot.
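
After the reboot it is worth confirming that the kernel really received the new parameters. The helper below is a sketch; `cmdline_param` is a hypothetical name, and the optional second argument exists only so that the function can be tried on a saved copy of the command line.

```shell
# Usage: cmdline_param NAME [FILE]
# Prints the value of NAME=... from the kernel command line
# (FILE defaults to /proc/cmdline). Prints nothing if NAME is absent.
# NAME is used as a regular expression, so "." matches any character.
cmdline_param() (
  name=$1
  file=${2:-/proc/cmdline}
  tr ' ' '\n' < "$file" | sed -n "s/^$name=//p"
)
```

With the settings above, `cmdline_param isolcpus` should print `nohz,managed_irq,8-15`. On reasonably recent kernels the resulting CPU sets can also be read from /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/nohz_full.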

You can now use taskset 0xff00 ... to run benchmarks on reserved cores.
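
The mask 0xff00 has bits 8 to 15 set, one bit per reserved core. For other core ranges the mask can be derived mechanically. A minimal sketch follows; `cpulist_to_mask` is a hypothetical helper, and it only handles CPU numbers below 63 because it relies on 64-bit shell arithmetic.

```shell
# Usage: cpulist_to_mask LIST
# Converts a CPU list such as "8-15" or "0,2-3" into a hexadecimal
# affinity mask suitable for taskset.
cpulist_to_mask() (
  mask=0
  IFS=,                      # split the list on commas (scoped to this subshell)
  for range in $1; do
    lo=${range%-*}           # "8-15" -> 8,  "3" -> 3
    hi=${range#*-}           # "8-15" -> 15, "3" -> 3
    c=$lo
    while [ "$c" -le "$hi" ]; do
      mask=$(( mask | (1 << c) ))
      c=$((c + 1))
    done
  done
  printf '0x%x\n' "$mask"
)
```

Alternatively, `taskset -c 8-15 ...` accepts the CPU list directly and avoids the conversion.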

When selecting cores to reserve, use lstopo to check that reserved cores do not share L2/L3 with unreserved ones.

WARNING: if a benchmark runs more threads than there are isolated cores (which often happens with Rust programs because they rely on std::thread::available_parallelism which ignores affinity settings) NOISE INCREASES 10x (reproduced on Ubuntu 22 kernel 6.8.0-60-generic).

Additional isolation of cores can be achieved by adding the domain flag to the isolcpus setting above, but I do not recommend this because the kernel will then schedule all benchmark threads on a single core (reproduced on Ubuntu 22 and 24).
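
One way to catch this situation before a run is to compare the intended thread count with the number of CPUs the current process is allowed to use. The sketch below relies on coreutils `nproc`, which honors the affinity mask (and the OMP_NUM_THREADS variable, if set); `fits_affinity` is a hypothetical name.

```shell
# Usage: fits_affinity THREADS
# Succeeds if THREADS does not exceed the number of CPUs available
# to the current process, as reported by nproc.
fits_affinity() {
  [ "$1" -le "$(nproc)" ]
}
```

The check has to run under the same affinity as the benchmark, e.g. inside a script started via `taskset 0xff00 ...`, where `nproc` would report 8 for the layout above.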

This can be worked around by using SCHED_FIFO or SCHED_RR policies.

Use benchmark-friendly scheduling policy

SCHED_FIFO scheduling policy may result in fewer interrupts when running benchmarks.

To enable it, add permissions to change scheduling policy for ordinary users:

$ sudo setcap cap_sys_nice=ep /usr/bin/chrt

You can then use chrt -f 1 ... to enable SCHED_FIFO.

Fix frequency

Disable C-states by adding to /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="... intel_idle.max_cstate=0 processor.max_cstate=0"

running

$ sudo update-grub

and rebooting.

Also disable turbo P-states (Turbo Boost) on systems using the intel_pstate driver via

$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

Finally set scaling governor to performance via

# Give CPU startup routines time to settle
$ sleep 120
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

and pin the operating frequency below peak so that it is not reduced further under heavy load:

# Set XXX to 70-80% of peak (may need to match one of scaling_available_frequencies)
$ echo XXX | sudo tee /sys/devices/system/cpu/*/cpufreq/scaling_min_freq /sys/devices/system/cpu/*/cpufreq/scaling_max_freq

Note that in some situations frequency scaling may still be enforced at hardware level to avoid overheating the CPU (as e.g. in the infamous AVX-512 case).
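
Since these sysfs settings are lost on reboot, it can help to verify them before each benchmarking session. The function below is a sketch: `check_freq` is a hypothetical name, and its optional argument (an alternative directory tree) is only there to allow dry runs without touching the real /sys.

```shell
# Usage: check_freq [CPU_ROOT]
# Lists every CPU whose governor is not "performance" or whose
# scaling_min_freq and scaling_max_freq differ. Exits with status 1
# if any such CPU is found. CPU_ROOT defaults to /sys/devices/system/cpu.
check_freq() (
  root=${1:-/sys/devices/system/cpu}
  status=0
  for d in "$root"/cpu[0-9]*/cpufreq; do
    [ -d "$d" ] || continue            # no cpufreq support for this CPU
    gov=$(cat "$d/scaling_governor")
    min=$(cat "$d/scaling_min_freq")
    max=$(cat "$d/scaling_max_freq")
    if [ "$gov" != performance ] || [ "$min" != "$max" ]; then
      echo "${d%/cpufreq}: governor=$gov min=$min max=$max"
      status=1
    fi
  done
  exit "$status"
)
```

Running `check_freq && echo OK` prints the offending CPUs, or just OK when all of them are pinned. It cannot detect the hardware-enforced throttling mentioned above.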

Increase priority

Add to /etc/security/limits.conf:

USERNAME soft nice -20
USERNAME hard nice -20

and run benchmarks under nice -n -20 ....

Disable ASLR

Run benchmarks under setarch -R ....
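
A quick way to see whether ASLR is really off is to look at the stack mapping of a trivial process: with randomization disabled, the address range stays the same from run to run. The filter below is a sketch; `stack_range` is a hypothetical helper that expects the contents of a /proc/PID/maps file on stdin.

```shell
# Print the address range of the [stack] mapping from a
# /proc/PID/maps listing read on stdin.
stack_range() {
  awk '$NF == "[stack]" { print $1 }'
}

# Example (run it twice and compare the two lines):
#   setarch "$(uname -m)" -R cat /proc/self/maps | stack_range
```

The example passes the architecture explicitly because older util-linux versions of setarch require it. Some sandboxed environments may refuse to disable randomization, in which case setarch fails with an error.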

Fix start address of stack

On POSIX systems the memory segment used for the stack is first filled with environment variables, then padded to an OS-specific alignment, and finally the aligned address is used as the start of the program stack.

Because of this, addresses of the program's local variables may change depending on environment variables even if you've disabled ASLR. Some variables, like $PWD or $_, may vary across benchmark invocations and influence results (5% fluctuations are not uncommon for microbenchmarks).

It is thus strongly recommended to run benchmarks that do not rely on the environment under env -i (hyperfine runs benchmarks with randomized environment for this reason).
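
Many programs still need at least PATH, so in practice a fixed minimal environment may be more convenient than a completely empty one. The wrapper below is a sketch: `clean_run` is a hypothetical name and the PATH value is an assumption to adjust for your system.

```shell
# Usage: clean_run COMMAND [ARGS...]
# Runs COMMAND with an environment that contains only a fixed PATH,
# so that the size of the environment block is identical across runs.
clean_run() {
  env -i PATH=/usr/bin:/bin "$@"
}
```

Running `clean_run env` shows exactly what the benchmark would see: a single PATH line.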

Fix code layout

Due to intricacies of modern CPU frontends (also here) some benchmarks may be sensitive to particular code layout (i.e. offsets of functions and basic blocks) produced by compiler and linker. This layout can vary due to unrelated changes (e.g. changing order in which object files are passed to linker or changes in register allocations which have different encoding sizes on x86) and complicate performance comparisons.

Although not a full solution, it's recommended to compile C/C++ code with -falign-functions=64 (-Z min-function-alignment=64 for Rust). Loops may also need to be aligned to fetchline size (via -falign-loops=N).
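
Whether the alignment flag took effect can be checked roughly from the symbol table. The filter below is a sketch that reads `nm` output on stdin and reports text symbols whose address is not a multiple of 64; `unaligned_funcs` is a hypothetical name. Symbols coming from precompiled startup code or libraries were not built with your flags, so some hits are expected.

```shell
# Print names of text-section symbols (types T and t) from nm output
# on stdin whose address is not 64-byte aligned.
unaligned_funcs() {
  awk '($2 == "T" || $2 == "t") {
         # The last two hex digits are the address modulo 256;
         # multiples of 64 end in 00, 40, 80 or c0.
         low = tolower(substr($1, length($1) - 1))
         if (low != "00" && low != "40" && low != "80" && low != "c0")
           print $3
       }'
}

# Example (./bench is a placeholder for your binary):
#   nm ./bench | unaligned_funcs
```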

Running SPEC benchmarks

Taskset, nice, etc. can be set in monitor_specrun_wrapper in SPEC config:

monitor_specrun_wrapper = chrt -f 1 taskset 0xff00 nice -n -20 setarch -R \$command

Further work

The instructions above make it possible to achieve <0.5% noise, which is usually enough in practice.

If you want to lower this further, here are some suggestions (I haven't tried them myself though):

  • custom init script

  • OSNOISE tracer

  • use transparent Huge Pages (to reduce TLB pressure)

  • turn off network via

    # Ubuntu (note that change is permanent)
    $ nmcli networking off
    
    # Didn't work for me
    $ sudo /etc/init.d/networking stop
    $ systemctl stop networking.service
    
  • boot to single-user mode

    • this would also disable networking
  • run from ramdisks

  • examine various platform settings in https://www.spec.org/cpu2006/flags/

  • experiment with performance-related BIOS settings

  • use numactl to control NUMA affinity

  • disable thread migration

  • disable returning memory to system in Glibc (to avoid e.g. TLB shootdowns)

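    • export MALLOC_MMAP_MAX_=0 MALLOC_ARENA_MAX=1 MALLOC_TRIM_THRESHOLD_=-1 (the environment counterparts of the M_MMAP_MAX, M_ARENA_MAX and M_TRIM_THRESHOLD mallopt parameters)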
  • disable irqbalancer (IRQBALANCE_BANNED_CPULIST)

  • disable watchdogs

    • nmi_watchdog=0 nowatchdog nosoftlockup
  • disable kernel hardening (pti=off, etc.)

Additional readings
