Asm Presentation
Uploaded by billzheng888

X86-64 Architecture With a Broadwell Accent
Registers
● The lower 32 bits of %rax are %eax, the lower 16 bits are %ax, and the lowest 8 bits are %al
● The same trick applies to vector registers: SSE (%xmm) -> AVX (%ymm)

AT&T syntax
● Registers are preceded by % signs. The rightmost operand is the destination
● mov %rax, %rdx ; copy the value from %rax to %rdx – misnamed instruction, cpy would have been better!
● $ objdump -DC ./bin/live_engine | less
● -D = disassemble, -C = demangle function names
Some instructions
● addq %rax, %rbx
● subq, sbbq (subtract with borrow)
● cmp %rax, %rbx ; sets flags, stores nothing
● and
● test ; like and, but only sets flags
● negb, negl
Control Flow
● jmp
● jle
● jg
● jc
● Conditional moves work the same way!
● cmovle
● cmovg
The Pipeline
● Branch Prediction comes first; when it is wrong, the pipeline is dumped
● Branch Target Buffer – to predict branches
● Two-level adaptive predictor
● Globally shared per core – also a loop predictor
Aside: Is this %rdx re-use slow?
● movq (%rbp), %rdx
● movq %rdx, (dest)
● movq (%r8), %rdx
● movq %rdx, (other_dest)
● What about if line 3 were mov src2, %eax?
● What about if it were movb src2, %al?
● xor %eax, %eax ; zeros all of %rax
● xor %ax, %ax ; leaves the upper 48 bits as they were
Instructions
● What are jmp, jl, jc?
● lea == calculate an address and put the calculation in a register – nothing is loaded from memory. Another misnamed instruction...
● Fast multiply by 5:
● lea (%rax, %rax, 4), %rbx
CISC? Or really RISC?
● addq %rax, (%rdi) ; this is CISC
● It is split into micro-ops – which are RISC:
● mov (%rdi), %register
● add %rax, %register
● mov %register, (%rdi)
Pipeline
Branch Prediction - work out what's coming next
Fetch - work out where the instructions are
Decode - decode them into micro-ops – CISC becomes RISC
Rename - 100s of physical registers on chip but only 16 architectural ones; renaming removes false dependencies
Reorder Buffer Read - snoops the output of completing micro-ops, fetches inputs
Reservation Station - queue micro-ops until their inputs are ready
Execution - micro-ops with inputs go to an execution port! shit gets done yo.
Reorder Buffer Write - execution results written
Retire - results go to the register file, micro-ops are removed from the reorder buffer
Execution ports
● We have 8 execution ports; this diagram has 6 but it’s close
Branch misprediction is expensive
● What to do?
● LIKELY() & UNLIKELY() macros
● Branchless implementations
− Replace jumps with conditional moves – but beware dependency chains
− xor and bit-hacking
http://graphics.stanford.edu/~seander/bithacks.html
− uint64_t max(uint64_t a, uint64_t b) {
−     uint64_t mask = -(!(b < a));
−     return a ^ ((a ^ b) & mask);
− }
What assembly does that
branchless max in C produce?
● branchless_demo.cpp – what instructions look
odd?
System V ABI
● How to call a function on anything that isn’t Windows
● %rdi, %rsi, %rdx, %rcx, %r8, %r9 take integer function args, in order
● %xmm0 – %xmm7 for floating point args
● Integer return value in %rax
● Floating point return value in %xmm0
Let's see that max() again
● %rdi is first argument
● %rsi is the second
● %rax is the return value
● Looking at it can we do better still?
g++ -S
Hardcode a breakpoint

● if(some_complex_condition && is_met) {
●     asm volatile("int3");
● }
● must run under gdb – outside a debugger the resulting SIGTRAP will kill the process!
Stack
● %rbp is the stack frame base pointer
● %rsp is the stack pointer
● push %rax
● is the same as:
● sub $8, %rsp
● mov %rax, (%rsp)
● The stack starts at a high address; pushing onto the stack reduces the stack pointer address
● Can tell immediately by inspection of an…
Function calls
● callq 0x1326bd(%rip)
● pushes the address of the instruction after the call on the stack, then jumps to the target address:
● push (%rip + call instruction size)
● jmp function address
● retq pops that pushed address off the stack and jumps (back) to it
● Implication: any stack corruption will hurt, badly!
Function calls
● Label:
● push %rbp
● mov %rsp, %rbp
● … do stuff here, push values, use the stack for temp storage
● … end of function
● leaveq ; this instruction is really mov %rbp, %rsp & pop %rbp
What about system calls?
● System V ABI: %rdi, %rsi, %rdx, %rcx, %r8, %r9
● Kernel: %rdi, %rsi, %rdx, %r10, %r8, %r9
● Pretty similar – only %rcx is replaced by %r10 (syscall itself clobbers %rcx)
● Load the registers with arguments
● Put the syscall number in %rax
● syscall instruction
● Return value in %rax
Full horror show...

Member Function
● struct Foo
● {
● uint64_t some_member_func(uint64_t bar)
● {
● return member + bar;
● }
● uint64_t member;
● };
C++ with no virtuals – syntactic
sugar over C
● struct Foo
● {
● uint64_t member;
● };
● uint64_t some_member_func(Foo* this,
uint64_t bar)
● {
● return this->member + bar;
● }
More instructions
● callq - push the address of the next instruction after the callq & jmp to the function address
● push
● pop
● setne, sete, setge etc. - set a byte from a condition flag
● shl - logical shift left
● shr - logical shift right; zeros shifted in
● sar - arithmetic shift right; the sign bit is replicated
● sal
Memory Latency
● Main memory ~240 cycles
● L3 at best 66 cycles on Broadwell
● L2 ~10 cycles
● L1 ~1 cycle
● An instruction that has to go to L2 for data takes 5 times as long as one that has its data in L1 (~11 cycles vs 2 cycles, given the instruction itself takes 1 cycle).
● Going to L3 takes 30 times as long.
Cache
● Cache line size is 64 bytes: the size of 2 ymm registers, 16 floats, 8 doubles or 8 int64_t
● L1 is 8-way with 64 sets, i.e. an associative array with 64 buckets, each holding at most 8 cache lines
● Bits 6-11 of the address are the set index
● Microbenchmarks usually have the cache hot. It may not be in the context of the whole program in production.
● Data oriented programming – buzzwords up the…
TLB
● Translation Lookaside Buffer
● Looks up the pagetable, inserts a mapping from
virtual to physical memory
● Associative array, virtual to physical address
● Program never sees a physical address!
● Limited shared resource – hardware support to
make TLB fills and pagetable lookups really fast
● But always: “Doing no work finishes faster than
doing work really fast...”
SIMD – aka “vectorisation”
● Where there is one there are many…
● Parallel execution
How floating point instructions look
● addXX
● subXX
● mulXX
● divXX
● movXX
● maxXX
● minXX

● XX = ss (scalar single), sd (scalar double), ps (packed single), pd (packed double)
Intrinsics
● Gives the compiler a chance and is less painful
than .s assembly or inline assembly
● gcc
● __builtin_popcountll()
● __builtin_ffsll()
● __builtin_clzll()
● __builtin_ctzll()
● __builtin_prefetch()
● https://gcc.gnu.org/onlinedocs/gcc/Other-Builtin
More Intrinsics
● https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2,FMA
● Available wherever C++ compilers are sold…
Struct of Arrays
● struct bad { uint64_t a; float b; double c; };
● bad bad_array[1024];

● struct good { uint64_t a[1024]; float b[1024]; double c[1024]; };

● makes simd, graphics card & fpga optimisations possible
● don’t load unused struct members into cache
