X86-64 Architecture With a
Broadwell Accent
Registers
● Lower 32 bits of %rax is %eax; lower 16 bits is %ax, lower 8 bits is %al
● Same trick for vector registers: %xmm0 is the lower 128 bits of %ymm0 (SSE->AVX)
AT&T syntax
● Registers are preceded by % signs. The rightmost operand is the destination
● mov %rax, %rdx ; copy the value from %rax to
%rdx – misnamed instruction, cpy would have
been better!
● $ objdump -DC ./bin/live_engine | less
● -D = disassemble, -C = demangle function names
Some instructions
● addq %rax, %rbx
● subq, sbbq (subtract with borrow)
● cmp %rax, %rbx
● and
● test
● negb, negl
Control Flow
● jmp
● jle
● jg
● jc
● Conditional moves work the same way!
● cmovle
● cmovg
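A minimal C++ sketch (not from the deck) of how the jumps and conditional moves above show up in practice. pick_branchy and pick_cmov are made-up names; with -O2, compilers often turn the second one into cmp + cmov rather than a jump, though exact codegen varies:

    #include <cstdint>

    // Branchy select: typically cmp + jle/jg around a move.
    int64_t pick_branchy(int64_t a, int64_t b) {
        if (a > b) return a;
        return b;
    }

    // Ternary select: with -O2 this frequently becomes cmp + cmovg/cmovle,
    // i.e. no branch at all (depends on compiler and flags).
    int64_t pick_cmov(int64_t a, int64_t b) {
        return a > b ? a : b;
    }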
The Pipeline
● Branch prediction comes first; when it's wrong, the pipeline is flushed
● Branch Target Buffer – to predict branch targets
● Two-level adaptive predictor
● Globally shared per core – also a loop predictor
Aside: Is this %rdx re-use slow?
● movq (%rbp), %rdx
● movq %rdx, (dest)
● movq (%r8), %rdx
● movq %rdx, (other_dest)
● What if line 3 were mov src2, %eax?
● What if it were movb src2, %al?
● xor %eax, %eax ; zeros all of %rax
● xor %ax, %ax; leaves upper 48 bits as they
were
Instructions
● what are jmp, jl, jc?
● lea == calculate (an address?) and put the
calculation in a register - nothing loaded from
memory. Another misnamed instruction...
● Fast multiply by 5 (C++ version below):
● lea (%rax,%rax,4), %rbx ; %rbx = %rax + %rax*4 = 5*%rax
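Hedged C++ version of the trick – times_five is a made-up name; with -O2 both gcc and clang usually emit a single lea for it:

    #include <cstdint>

    // Usually compiles to: lea (%rdi,%rdi,4), %rax
    // i.e. %rax = %rdi + %rdi*4 = 5 * %rdi -- no multiplier needed.
    uint64_t times_five(uint64_t x) {
        return x * 5;
    }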
CISC? Or really RISC?
● addq %rax, (%rdi) ; this is CISC
● split into micro ops – which are RISC
● mov (%rdi), %register
● add %rax, %register
● mov %register, (%rdi)
Pipeline
Branch Prediction - work out what's coming next
Fetch - work out where the instructions are
Decode - decode them into micro ops – CISC becomes RISC
Rename - 100s of physical registers on chip vs 16 architectural; renaming removes false dependencies
Reorder Buffer Read - snoops output of completing micro-ops, fetches
inputs
Reservation Station - queue micro ops until inputs ready
Execution - micro ops with inputs go to an execution port! shit gets done yo.
Reorder Buffer Write - execution results written
Retire - results to register file, micro ops removed from reorder buffer
Execution ports
● We have 8 execution ports; this diagram has 6, but it's close
Branch misprediction is expensive
● What to do?
● LIKELY() & UNLIKELY() macros
● Branchless implementations
− Replace jumps with conditional moves – but watch out for dependency chains
− xor and bithacking
http://graphics.stanford.edu/~seander/bithacks.html
− uint64_t max(uint64_t a, uint64_t b) {
−     uint64_t mask = -(!(b < a));
−     return a ^ ((a ^ b) & mask);
− }
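Self-contained version with a couple of asserts, just to convince ourselves the mask trick picks the larger value (branchless_max is a made-up name for the same code):

    #include <cassert>
    #include <cstdint>

    // mask is all-ones when b >= a (return b), all-zeros when b < a (return a).
    uint64_t branchless_max(uint64_t a, uint64_t b) {
        uint64_t mask = -(uint64_t)(!(b < a));
        return a ^ ((a ^ b) & mask);
    }

    int main() {
        assert(branchless_max(3, 7) == 7);
        assert(branchless_max(7, 3) == 7);
        assert(branchless_max(5, 5) == 5);
    }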
What assembly does that
branchless max in C produce?
● branchless_demo.cpp – what instructions look
odd?
System V ABI
● How to call a function on anything that isn't Windows
● %rdi, %rsi, %rdx, %rcx, %r8, %r9 integer
function args in order
● %xmm0 – %xmm7 for floating point args
● Integer return value in %rax
● Floating point return value in %xmm0
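A hypothetical signature just to map arguments onto those registers (abi_demo is made up; the register assignment is the standard System V one):

    #include <cstdint>

    // a -> %rdi, b -> %rsi, c -> %rdx (integer args in order),
    // x -> %xmm0 (first floating point arg),
    // integer return value comes back in %rax.
    uint64_t abi_demo(uint64_t a, uint64_t b, uint64_t c, double x) {
        return a + b + c + static_cast<uint64_t>(x);
    }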
Let's see that max() again
● %rdi is first argument
● %rsi is the second
● %rax is the return value
● Looking at it, can we do better still?
g++ -S
Hardcode a breakpoint
● if(some_complex_condition && is_met) {
● asm volatile("int3");
● }
● must run in gdb!
Stack
● %rbp is the stack frame base pointer
● %rsp is the stack pointer
● push %rax
● is the same as:
● sub $8, %rsp
● mov %rax, (%rsp)
● Stack starts at a high address; pushing onto the stack moves the stack pointer to a lower address
● Can often tell immediately by inspection of an address whether it lives on the stack
Function calls
● callq 0x1326bd(%rip)
● pushes the address of the instruction after the
call on the stack. Then jumps to the address.
● push(%rip + call instruction size)
● jmp function address
● retq pops that pushed address off the stack and
jumps (back) to it
● Implication: Any stack corruption will hurt, badly!
Function calls
● Label:
● push %rbp
● mov %rsp, %rbp
● … do stuff here, push values, use stack for
temp storage
● … end of function
● leaveq ; this instruction is really mov %rbp, %rsp & pop %rbp
What about system calls?
● System V ABI - %rdi, %rsi, %rdx, %rcx, %r8,
%r9
● Kernel %rdi, %rsi, %rdx, %r10, %r8,
%r9
● Pretty similar
● Load the registers with arguments
● Put syscall number in %rax
● syscall instruction
● Return value in %rax
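A sketch of a raw write(2) on x86-64 Linux with gcc inline asm – raw_write is a made-up helper; syscall number 1 is write, and the syscall instruction itself clobbers %rcx and %r11:

    #include <cstddef>

    long raw_write(int fd, const void* buf, size_t len) {
        long ret;
        // %rax = syscall number (1 = write), args in %rdi/%rsi/%rdx,
        // return value (bytes written or -errno) back in %rax.
        asm volatile("syscall"
                     : "=a"(ret)
                     : "a"(1L), "D"(static_cast<long>(fd)), "S"(buf), "d"(len)
                     : "rcx", "r11", "memory");
        return ret;
    }

    int main() {
        raw_write(1, "hello from a raw syscall\n", 25);
    }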
Full horror show...
Member Function
● struct Foo
● {
● uint64_t some_member_func(uint64_t bar)
● {
● return member + bar;
● }
● uint64_t member;
● };
C++ with no virtuals – syntactic
sugar over C
● struct Foo
● {
● uint64_t member;
● };
● uint64_t some_member_func(Foo* this,
uint64_t bar)
● {
● return this->member + bar;
● }
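Call-site view of that sugar (self-contained, reusing the Foo from the slide): for a non-virtual member function the object's address is just the hidden first argument, so under the System V ABI it travels in %rdi and bar in %rsi:

    #include <cstdint>

    struct Foo {
        uint64_t member;
        uint64_t some_member_func(uint64_t bar) { return member + bar; }
    };

    int main() {
        Foo f{41};
        // Effectively some_member_func(&f, 1): &f in %rdi, 1 in %rsi.
        return static_cast<int>(f.some_member_func(1));
    }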
More instructions
● callq - push next instruction address after callq
& jmp to function address
● push
● pop
● setcc family: sete, setne, setge, etc.
● shl - logical shift left
● shr - logical shift right
● sar - arithmetic shift right; shifts in copies of the sign bit
● sal - same as shl
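How shr vs sar fall out of C++ (made-up helpers): shifting an unsigned value right is the logical shift, shifting a signed value right is the arithmetic one on gcc and clang (and guaranteed since C++20):

    #include <cstdint>

    uint64_t shift_logical(uint64_t x)   { return x >> 3; }  // typically shr
    int64_t  shift_arithmetic(int64_t x) { return x >> 3; }  // typically sar, sign bit preserved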
Memory Latency
● Main memory 240 cycles
● L3 at best 66 cycles on Broadwell
● L2 ~10 cycles
● L1 ~1 cycle
● An instruction that has to go to L2 for its data takes roughly 5 times as long as one whose data is in L1 (~11 cycles vs ~2 cycles, given the instruction itself takes 1 cycle).
● One that goes to L3 takes roughly 30 times as long.
Cache
● Cache line size is 64 bytes. Size of 2 ymm
registers. 16 floats, 8 doubles or 8 int64_t
● L1 is 8-way, with 64 sets. I.e. an associative array with 64 buckets, each holding at most 8 cache lines
● Bits 6-11 of the address are the key (set index) – see the sketch after this list
● Microbenchmarks usually run with the cache hot. The same data may not be hot in the context of the whole program in production.
● Data oriented programming – buzzwords up the
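A sketch of the set-index arithmetic implied by those numbers (l1_set_index is a made-up helper, assuming 64-byte lines and 64 sets):

    #include <cstdint>

    uint64_t l1_set_index(uint64_t addr) {
        // Low 6 bits = offset within the 64-byte line,
        // next 6 bits (bits 6-11) pick one of the 64 sets.
        return (addr >> 6) & 63;
    }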
TLB
● Translation Lookaside Buffer
● Looks up the pagetable, inserts a mapping from
virtual to physical memory
● Associative array, virtual to physical address
● Program never sees a physical address!
● Limited shared resource – hardware support to
make TLB fills and pagetable lookups really fast
● But always: “Doing no work finishes faster than
doing work really fast...”
SIMD – aka “vectorisation”
● Where there is one there are many…
● Parallel execution
How floating point instructions look
● addXX
● subXX
● mulXX
● divXX
● movXX
● maxXX
● minXX
● XX = ss (scalar single), sd (scalar double), ps (packed single), pd (packed double)
Intrinsics
● Gives the compiler a chance and is less painful
than .s assembly or inline assembly
● gcc
● __builtin_popcountll()
● __builtin_ffsll()
● __builtin_clzll()
● __builtin_ctzll()
● __builtin_prefetch()
● https://gcc.gnu.org/onlinedocs/gcc/Other-Builtin
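A quick hedged demo of a few of these builtins (the expected values in the comments assume a 64-bit unsigned long long):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t v = 40;   // binary 101000
        std::printf("popcount = %d\n", __builtin_popcountll(v)); // 2 set bits
        std::printf("ctz      = %d\n", __builtin_ctzll(v));      // 3 trailing zeros
        std::printf("clz      = %d\n", __builtin_clzll(v));      // 58 leading zeros
    }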
More Intrinsics
● https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2,FMA
● Available wherever C++ compilers are sold…
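A minimal AVX sketch using intrinsics from that guide (add_arrays is a made-up example; compile with -mavx or -march=native):

    #include <immintrin.h>

    // Add 8 floats per iteration with 256-bit registers; scalar loop mops up the tail.
    void add_arrays(const float* a, const float* b, float* out, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i) out[i] = a[i] + b[i];
    }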
Struct of Arrays
● struct bad { uint64_t a; float b; double c;};
● bad bad_array[1024];
● struct good { uint64_t a[1024]; float b[1024];
double c[1024]; };
● makes SIMD, graphics card & FPGA optimisations possible
● don't load unused struct members into the cache – see the sketch below
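A hedged sketch of the payoff (Good and sum_a are made-up names mirroring the struct above): summing only a walks one contiguous array, so every cache line fetched is full of useful values, b and c never pollute the cache, and the loop is easy for the compiler to auto-vectorise:

    #include <cstdint>

    constexpr int N = 1024;

    struct Good {
        uint64_t a[N];
        float    b[N];
        double   c[N];
    };

    uint64_t sum_a(const Good& g) {
        uint64_t total = 0;
        for (int i = 0; i < N; ++i) total += g.a[i];  // touches only g.a
        return total;
    }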