Subroutines
Readings and Exercises
• P & H: Section 2.8
• ARMv8 Instruction Set Overview
▪ Section 5.2.2, 5.2.3
• ARM Procedure Call Standard
▪ Section 5
Objectives
At the end of this section, you will be able to
1. Define closed and open routines
2. Further understand the structure of the stack
frame (activation record)
3. Understand the role of the LR and FP registers
4. Pass parameters by value and by reference
5. Return values from subroutines
Introduction
• Subroutines allow code re-use using varying
argument values
• Two types:
▪ Open (inline)
• Code is inserted inline wherever the subroutine is invoked
▪ Usually using a macro preprocessor
• Arguments are passed in/out using registers
• Efficient, since overhead of branching and returning is
avoided
• Suitable only for fairly short routines
Open (Inline) Subroutines
• Are usually implemented using macros
• Eg: cube function
define(cube, `mul $2, $1, $1
mul $2, $1, $2')
.global main
main: stp x29, x30, [sp, -16]!
. . .
mov x19, 8
cube(x19, x20)
. . .
Open (Inline) Subroutines (cont’d)
▪ m4 expands this to:
.global main
main: stp x29, x30, [sp, -16]!
. . .
mov x19, 8
mul x20, x19, x19
mul x20, x19, x20
. . .
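For comparison, the same open-routine idea can be sketched in C, where the C preprocessor plays the role of m4. The names CUBE and cube_of_eight are hypothetical and chosen only for this illustration; the variables are named after the registers in the assembly example.

```c
/* Open (inline) routine in C: the preprocessor pastes the body at each
   point of use, so no branch or return is executed.
   CUBE is a hypothetical name used only for this sketch. */
#define CUBE(x) ((x) * (x) * (x))

/* Mirrors the assembly example: x19 = 8, then x20 = cube(x19). */
long cube_of_eight(void)
{
    long x19 = 8;
    long x20 = CUBE(x19);   /* expands to ((x19) * (x19) * (x19)) */
    return x20;
}
```

As with the m4 version, every use of the macro adds another copy of the multiply code to the program.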
Introduction (cont’d)
• Closed
▪ Machine code for the routine appears only once in
RAM
• Leads to more compact machine code than with open
routines
▪ When invoked, control “jumps” to the first instruction
of the routine
• i.e. PC is loaded with the address of the first instruction
▪ When finished, control returns to the next instruction
in the calling code
• i.e. PC is loaded with the return address
▪ Arguments are placed in registers or on the stack
▪ Slower than open routines, because of call/return
overhead
Closed Routine Calls
int main() {
int i,j;
…
i = A();
j *=i;
}
int A(){
…
return k*l+2;
}
• Before the call: PC points to the call i = A();
• During the call: PC points into A(), and LR holds the address of the instruction that follows the call
• After the return: PC has been loaded from LR, so execution resumes in main() just after the call
Introduction (cont’d)
• Subroutines should not change the state of the
machine for the calling code
▪ When invoked, a subroutine should save any registers
it uses on the stack
▪ When it returns, it should restore the original values
of the registers
• Arguments to subroutines are considered local
variables
▪ The subroutine may change their value
Parameter Passing
• Using registers (by value)
▪ Register holds data
• Quick
• Limited to register size and number of registers
• Using registers (by reference)
▪ Register holds the address of data
• Not limited to register size
• Using the stack
▪ Standard compiler method
Stack Parameters
• Push parameters onto the stack
• Call subroutine
• Subroutine accesses parameters on the stack
• Subroutine may return a value on the stack
Closed Subroutines
• General form:
label: stp x29, x30, [sp, -alloc]!
mov x29, sp
. . . custom code
ldp x29, x30, [sp], alloc
ret
▪ label: names the subroutine
▪ alloc: number of bytes to allocate for the subroutine’s
stack frame
• SP must be quadword (16-byte) aligned
• Minimum of 16 bytes
Subroutine Linkage
• A subroutine may be invoked using the branch
and link instruction: bl
▪ Form: bl subroutine_label
▪ Stores the return address into the link register: x30
• Return address is PC + 4, which points to the instruction
immediately following bl
▪ Transfers control to the address specified by the label
• Loads the PC register with the address of the subroutine’s
first instruction
Subroutine Linkage (cont’d)
• Use the ret instruction to return from a subroutine
back to the calling code
▪ Transfers control to the address stored in the link
register (x30)
• i.e. jumps to the instruction immediately following the
original bl in calling code
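The effect of bl and ret on the PC and LR can be modelled in a few lines of C. This is a toy sketch, not a simulator: the Cpu type and the function names bl and ret are assumptions made for illustration, and only the two registers involved in linkage are represented.

```c
#include <stdint.h>

/* Toy model of the two registers involved in subroutine linkage.
   The type and field names are assumptions made for this sketch. */
typedef struct {
    uint64_t pc;   /* address of the instruction being executed */
    uint64_t lr;   /* x30, the link register */
} Cpu;

/* bl target: store the return address in LR, then jump to target. */
void bl(Cpu *cpu, uint64_t target)
{
    cpu->lr = cpu->pc + 4;   /* each A64 instruction is 4 bytes long */
    cpu->pc = target;
}

/* ret: jump to the address held in LR. */
void ret(Cpu *cpu)
{
    cpu->pc = cpu->lr;
}
```

For example, a bl executed at address 0x1000 leaves 0x1004 in LR, and the matching ret loads 0x1004 back into the PC.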
Calling and Returning
• Calling code:
bl fact // calling: control transfers to fact
mov x20, x0 // returning: execution resumes here
• Subroutine code:
fact:
stp …
…
ret
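Calling and Returning
int main() {
int i,j;
…
i = A(); // BL A
// STR w0, [fp, i_s] (LR holds the address of this instruction)
j *=i;
}
• PC points to BL A when the call is made; when A() returns, PC is loaded from LR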
Subroutine Linkage (cont’d)
• Eg: C code
int main()
{
. . .
func1();
. . .
}
void func1()
{
. . .
func2();
. . .
}
void func2()
{
. . .
}
Subroutine Linkage (cont’d)
▪ Assembly code:
main: stp x29, x30, [sp, -16]!
mov x29, sp
. . .
bl func1
. . .
ldp x29, x30, [sp], 16
ret
func1: stp x29, x30, [sp, -16]!
mov x29, sp
. . .
bl func2
. . .
ldp x29, x30, [sp], 16
ret
Subroutine Linkage (cont’d)
func2: stp x29, x30, [sp, -16]!
mov x29, sp
. . .
ldp x29, x30, [sp], 16
ret
▪ The stp instructions create a frame record in each
function’s stack frame
• Safely stores the LR (x30), in case it is changed by a bl in
the body of the function
▪ Is restored by the ldp instruction, just before the ret
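Why the LR must be saved can be seen in a small C model of nested calls. This is a simplified sketch under stated assumptions: the Machine type, the helper names, and the addresses 0x1000, 0x2000 and 0x3000 are all hypothetical, and the stack is modelled as an array of 8-byte slots indexed by slot number rather than by byte address.

```c
#include <stdint.h>

/* Hypothetical toy machine: three registers and a small stack. */
typedef struct {
    uint64_t pc, lr, fp;
    uint64_t stack[16];
    int sp;                      /* slot index into stack[]; grows downward */
} Machine;

void call(Machine *m, uint64_t target) { m->lr = m->pc + 4; m->pc = target; } /* bl  */
void do_return(Machine *m)             { m->pc = m->lr; }                     /* ret */

/* Models: stp x29, x30, [sp, -16]!  followed by  mov x29, sp */
void push_frame_record(Machine *m)
{
    m->sp -= 2;
    m->stack[m->sp]     = m->fp;
    m->stack[m->sp + 1] = m->lr;
    m->fp = (uint64_t)m->sp;
}

/* Models: ldp x29, x30, [sp], 16 */
void pop_frame_record(Machine *m)
{
    m->fp = m->stack[m->sp];
    m->lr = m->stack[m->sp + 1];
    m->sp += 2;
}

/* main (at 0x1000) calls func1 (at 0x2000); func1 calls the leaf func2
   (at 0x3000) from address 0x2008. Returns the pc reached when func1
   executes its ret. */
uint64_t run_nested(int save_lr)
{
    Machine m = { 0x1000, 0, 0, {0}, 16 };
    call(&m, 0x2000);                    /* main:  bl func1, LR = 0x1004 */
    if (save_lr) push_frame_record(&m);  /* func1 prologue               */
    m.pc = 0x2008;
    call(&m, 0x3000);                    /* func1: bl func2, LR = 0x200c */
    do_return(&m);                       /* func2: ret, pc = 0x200c      */
    if (save_lr) pop_frame_record(&m);   /* func1 epilogue               */
    do_return(&m);                       /* func1: ret                   */
    return m.pc;
}
```

With the frame record, func1 returns to 0x1004 in main. Without it, LR still holds 0x200c from the inner bl, so func1's ret jumps back into func1 itself.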
• On entry to the subroutine:
SP → Stack
• After stp x29, x30, [sp, -16]!
SP → X29/FP
X30/LR
Old Stack
• After mov x29, sp
FP, SP → X29/FP
X30/LR
Old Stack
Subroutine Linkage (cont’d)
• The FP and the stored FP values in the frame
records form a linked list
▪ Eg: the stack while func2() is executing (lowest address first; each * is a saved FP that points to the caller's frame record)
free memory
SP, FP → * (func2() stack frame)
func1() return address
* (func1() stack frame)
main() return address
* (main() stack frame)
OS return address
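Because the saved FP values form a linked list, the chain of frame records can be walked like any other list. The sketch below assumes a FrameRecord type that matches the two values stored by stp x29, x30, and assumes that the chain ends with a null saved FP; both the type name and the termination rule are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The two 8-byte values stored by  stp x29, x30, [sp, -alloc]!
   The type and field names are assumptions made for this sketch. */
typedef struct FrameRecord {
    struct FrameRecord *saved_fp;   /* caller's frame record (NULL ends the chain) */
    uint64_t            saved_lr;   /* return address into the caller */
} FrameRecord;

/* Follow the saved-FP links and count the frame records on the chain. */
size_t count_frames(const FrameRecord *fp)
{
    size_t n = 0;
    while (fp != NULL) {
        n++;
        fp = fp->saved_fp;
    }
    return n;
}
```

Debuggers use the same walk to produce a backtrace: each saved LR identifies the caller of the frame it is stored in.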
Revised Stack Frame Structure
(lowest address first)
SP → More Local Vars
FP → X29/FP
X30/LR
Saved GPRs
Local Variables & Parameters
Saving and Restoring Registers
• A called function must save/restore the state of
the calling code
▪ If it uses any of the registers x19 – x28, it must save
their data to the stack at the beginning of the function
• Are “callee-saved registers”
▪ The function must restore the data in these registers
just before it returns
Example
x19_size = 8
alloc = -(16 + x19_size) & -16
dealloc = -alloc
x19_save = 16
func2: stp x29, x30, [sp, alloc]!
mov x29, sp
str x19, [x29, x19_save] // save x19
. . .
mov x19, 13 // use x19
. . .
ldr x19, [x29, x19_save] // restore x19
ldp x29, x30, [sp], dealloc
ret
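The alloc expression rounds the frame size so that SP stays 16-byte aligned. A minimal C sketch of the same arithmetic follows; the function name frame_alloc is hypothetical, and the code assumes two's-complement integers, which is what the assembler expression relies on as well.

```c
/* Bytes to add to SP when the frame is created: 16 for the frame record
   plus extra_bytes for saved registers and locals, rounded so that SP
   stays 16-byte aligned. The result is negative, matching the assembler
   expression  alloc = -(16 + size) & -16.
   Assumes two's-complement integers; frame_alloc is a hypothetical name. */
long frame_alloc(long extra_bytes)
{
    return -(16 + extra_bytes) & -16;
}
```

For the example above, frame_alloc(8) gives -32: the 24 bytes actually needed are rounded up to the next multiple of 16. ANDing a negative number with -16 clears its low four bits, which rounds it away from zero.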
• The stack after stp x29, x30, [sp, alloc]! then mov x29, sp then str x19, [x29, x19_save]:
FP, SP → X29/FP
X30/LR
FP+16 (x19_save) → x19
Old Stack
Caller-Saved Registers
• The callee can also use registers x9–x15
▪ By convention, these registers are not saved/restored
by the called function
• Thus are only safe to use in calling code between function
calls
▪ The calling code can save these registers to the stack,
if it is necessary to preserve their value over a
function call
• Are “caller-saved registers”
Register Summary
Screenshot from: https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers
Registers Summary
▪ x0 – x7: used to pass arguments into a procedure, and
return results
▪ x8: indirect result location register
▪ x9 – x15: temporary general-purpose registers, caller-saved
▪ x16, x17: intra-procedure-call temporary registers
(IP0, IP1)
▪ x18: platform register
▪ x19 – x28: general-purpose registers, callee-saved
▪ x29: frame pointer (FP) register
▪ x30: procedure link register (LR)
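The summary can be restated as a small lookup function in C. This is only a sketch of the list above: the enum constants and function names are hypothetical and do not come from any ARM header.

```c
/* Roles of x0 - x30, following the summary above.
   All names here are hypothetical, chosen for this sketch. */
typedef enum {
    ROLE_ARG_RESULT,        /* x0 - x7   */
    ROLE_INDIRECT_RESULT,   /* x8        */
    ROLE_CALLER_SAVED,      /* x9 - x15  */
    ROLE_INTRA_CALL_TEMP,   /* x16, x17  */
    ROLE_PLATFORM,          /* x18       */
    ROLE_CALLEE_SAVED,      /* x19 - x28 */
    ROLE_FRAME_POINTER,     /* x29       */
    ROLE_LINK_REGISTER,     /* x30       */
    ROLE_INVALID
} RegRole;

/* Map a register number n (as in xn) to its role. */
RegRole reg_role(int n)
{
    if (n < 0 || n > 30) return ROLE_INVALID;
    if (n <= 7)  return ROLE_ARG_RESULT;
    if (n == 8)  return ROLE_INDIRECT_RESULT;
    if (n <= 15) return ROLE_CALLER_SAVED;
    if (n <= 17) return ROLE_INTRA_CALL_TEMP;
    if (n == 18) return ROLE_PLATFORM;
    if (n <= 28) return ROLE_CALLEE_SAVED;
    if (n == 29) return ROLE_FRAME_POINTER;
    return ROLE_LINK_REGISTER;
}

/* Nonzero if a subroutine must save xn before using it as a scratch register. */
int callee_must_save_before_use(int n)
{
    return reg_role(n) == ROLE_CALLEE_SAVED;
}
```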
Arguments to Subroutines
• 8 or fewer arguments can be passed into a
function using registers x0 – x7
▪ ints, short ints, and chars use w0 – w7
▪ long ints use x0 – x7
• Eg: C code
void sum(int a, int b)
{
register int i;
i = a + b;
. . .
}
Example: Register – Value
int main()
{
sum(3, 4);
. . .
}
▪ Assembly code:
define(i_r, w9)
sum: stp x29, x30, [sp, -16]!
mov x29, sp
add i_r, w0, w1 // i = a + b (w0 = 1st argument, w1 = 2nd argument)
. . .
ldp x29, x30, [sp], 16
ret
Example: Register – Value (cont’d)
main: stp x29, x30, [sp, -16]!
mov x29, sp
mov w0, 3 // set up 1st arg
mov w1, 4 // set up 2nd arg
bl sum
. . .
ldp x29, x30, [sp], 16
ret
▪ Note that the subroutine is free to overwrite registers
x0 – x7 as it executes
• These registers are not preserved over a function call
Pointer Arguments
• In calling code, the address of a variable is
passed to the subroutine
▪ Implies that the variable must be in RAM, not in a
register
• The called subroutine dereferences the address, to
manipulate the variable being pointed to
▪ Usually with a ldr or str instruction
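The difference between passing a value and passing an address can be seen in a short C sketch. The two function names are hypothetical and exist only for this comparison.

```c
/* By value: the subroutine receives a copy (in w0), so the caller's
   variable is not affected. Hypothetical name, for illustration only. */
void add_one_by_value(int x)
{
    x = x + 1;      /* changes only the local copy */
}

/* By reference: the subroutine receives the variable's address (in x0)
   and dereferences it, which corresponds to an ldr and a str. */
void add_one_by_reference(int *x)
{
    *x = *x + 1;    /* changes the caller's variable in memory */
}
```

After int a = 5; the call add_one_by_value(a) leaves a at 5, while add_one_by_reference(&a) changes it to 6. The second call only works because a has an address, which is why the variable must live in RAM.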
Example: Register – Reference
• Eg: C code
int main()
{
int a = 5, b = 7;
swap(&a, &b);
. . .
}
void swap(int *x, int *y)
{
register int temp;
temp = *x;
*x = *y;
*y = temp;
}
Example: Register – Reference (cont’d)
▪ Assembly code:
a_size = 4
b_size = 4
alloc = -(16 + a_size + b_size) & -16
dealloc = -alloc
a_s = 16
b_s = 20
. . .
main: stp x29, x30, [sp, alloc]!
mov x29, sp
mov w19, 5 // init a to 5
str w19, [x29, a_s]
mov w20, 7 // init b to 7
str w20, [x29, b_s]
add x0, x29, a_s // set up 1st arg
add x1, x29, b_s // set up 2nd arg
bl swap
. . .
Example: Register – Reference (cont’d)
define(temp_r, w9)
swap: stp x29, x30, [sp, -16]!
mov x29, sp
ldr temp_r, [x0] // temp = *x
ldr w10, [x1] // w10 = *y
str w10, [x0] // *x = w10
str temp_r, [x1] // *y = temp
ldp x29, x30, [sp], 16
ret
argc and argv
• When main() is called, argc is passed in w0 (the low half of x0)
• The address of argv[] is passed in x1
Local Variables
• register int temp causes temp to be
mapped to a register
• In general, a declaration: int temp
requires allocation of needed memory on the
stack
• These are stack variables
Local Variables - Example
void xyz()
{
int x;
long y;
. . .
}
Requires:
stp x29, x30, [sp, -32]!
Local Variable Access
• Stack variables are addressed by adding an offset
to the base address in FP
▪ Eg: FP + 16, FP + 24, FP + 32
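One way to derive such offsets is to lay the variables out after the 16-byte frame record, aligning each to its own size. The C sketch below does this for a function with an int (4 bytes) and a long (8 bytes). The names align_up, XyzLayout and layout_xyz are hypothetical, and the natural-alignment rule is an assumption of this sketch rather than something the slides require.

```c
/* Round offset up to the next multiple of align (a power of two). */
long align_up(long offset, long align)
{
    return (offset + align - 1) & -align;
}

/* Offsets from FP for  int x; long y;  plus the total allocation.
   Hypothetical type, used only for this sketch. */
typedef struct {
    long x_s;     /* offset of x from FP */
    long y_s;     /* offset of y from FP */
    long alloc;   /* negative frame size used in the stp */
} XyzLayout;

XyzLayout layout_xyz(void)
{
    XyzLayout l;
    long offset = 16;              /* frame record occupies FP+0 .. FP+15 */

    l.x_s = align_up(offset, 4);   /* int x: 4 bytes  */
    offset = l.x_s + 4;

    l.y_s = align_up(offset, 8);   /* long y: 8 bytes */
    offset = l.y_s + 8;

    l.alloc = -align_up(offset, 16);
    return l;
}
```

Under these assumptions x sits at FP + 16, y at FP + 24, and the frame needs 32 bytes in total.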
Returning with Integers
• A function returns:
▪ long ints in x0
▪ ints, short ints, and chars in w0
• Eg: Cube function in C
int cube(int x); // function prototype
int main()
{
register int result;
result = cube(3);
. . .
}
Returning with Integers (cont’d)
int cube(int x)
{
return x * x * x;
}
▪ Assembly code:
define(result_r, w19)
. . .
main: stp x29, x30, [sp, -16]!
mov x29, sp
mov w0, 3 // set up 1st arg
bl cube
mov result_r, w0 // result returned in w0
. . .
Returning with Integers (cont’d)
cube: stp x29, x30, [sp, -16]!
mov x29, sp
mul w9, w0, w0 // w9 = x * x
mul w0, w9, w0 // w0 = x * x * x
ldp x29, x30, [sp], 16
ret // result is in w0
Returning with Structures
• In C, a function may return a struct by value
▪ Eg: struct mystruct {
long int i;
long int j;
};
struct mystruct init()
{
struct mystruct lvar;
lvar.i = 0;
lvar.j = 0;
return lvar;
}
int main()
{
struct mystruct b;
b = init();
. . .
}
Returning with Structures (cont’d)
• Usually a struct is too big to return in x0 or w0
▪ Thus, another return mechanism is needed
• The calling code provides memory on the stack to
store the returned result
▪ The address of this memory is put into x8 prior to the
function call
• x8 is the “indirect result location register”
▪ The called subroutine writes to memory at this
address, using x8 as a pointer to it
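In C terms, this mechanism behaves as if the function took a hidden pointer to the caller's result variable. The sketch below makes that pointer explicit; the function name init_via_pointer is hypothetical, and the explicit parameter stands in for the role that x8 plays.

```c
struct mystruct {
    long int i;
    long int j;
};

/* Explicit-pointer version of  struct mystruct init():
   result plays the role of the address passed in x8.
   init_via_pointer is a hypothetical name used for this sketch. */
void init_via_pointer(struct mystruct *result)
{
    struct mystruct lvar;
    lvar.i = 0;
    lvar.j = 0;
    *result = lvar;     /* copy lvar into the caller's memory */
}
```

The caller writes struct mystruct b; init_via_pointer(&b); which has the same effect as b = init();.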
Returning with Structures (cont’d)
▪ Equivalent assembly code:
mystruct_i = 0
mystruct_j = 8
b_size = 16
alloc = -(16 + b_size) & -16
dealloc = -alloc
b_s = 16
. . .
main: stp x29, x30, [sp, alloc]!
mov x29, sp
add x8, x29, b_s // calculate address of b
bl init // call init()
. . . // result is in b
Returning with Structures (cont’d)
define(lvar_base_r, x9)
lvar_size = 16
alloc = -(16 + lvar_size) & -16
dealloc = -alloc
lvar_s = 16
init: stp x29, x30, [sp, alloc]!
mov x29, sp
// Calculate lvar struct base address
add lvar_base_r, x29, lvar_s
str xzr, [lvar_base_r, mystruct_i] // lvar.i = 0
str xzr, [lvar_base_r, mystruct_j] // lvar.j = 0
// Copy lvar to b, using the address in x8 (set in main)
ldr x10, [lvar_base_r, mystruct_i]
str x10, [x8, mystruct_i] // b.i = lvar.i
ldr x10, [lvar_base_r, mystruct_j]
str x10, [x8, mystruct_j] // b.j = lvar.j
ldp x29, x30, [sp], dealloc
ret
Optimizing Leaf Subroutines
• Leaf subroutines do not call any other subroutines
▪ i.e. are leaf nodes on a tree structure diagram
• A frame record need not be pushed onto the stack
▪ Since the routine does not do a bl, LR won’t change
▪ Since the routine does not call a subroutine, FP won’t
change
▪ Thus can eliminate the usual stp/ldp instructions
Optimizing Leaf Subroutines (cont’d)
• If one uses only the registers x0 – x7 and x9 –
x15, then no stack frame needs to be pushed at all
▪ No need to save/restore registers
▪ No stack variables are used
• Eg: optimized cube function
cube: mul w9, w0, w0 // w9 = x * x
mul w0, w9, w0 // w0 = x * x * x
ret // result is in w0
Subroutines with 9 or More Arguments
• Arguments beyond the 8th are passed on the stack
▪ The calling code allocates memory at the top of the
stack, and writes the “spilled” argument values there
• By convention, each argument is allocated 8 bytes
▪ The callee reads this memory using the appropriate
offset
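The offsets involved can be worked out with a little arithmetic, sketched here in C. The function names are hypothetical. The sketch assumes that the caller writes the 9th argument at the top of the stack, and that the callee's frame, created by its stp, sits directly below the spilled arguments.

```c
/* Byte offset of argument number n (1-based, n >= 9) from the caller's SP
   at the time of the bl. Each spilled argument is allocated 8 bytes. */
long spilled_offset_from_sp(int n)
{
    return (long)(n - 9) * 8;
}

/* The same argument as seen by the callee after
   stp x29, x30, [sp, -frame]!  and  mov x29, sp :
   the callee's frame lies between its FP and the spilled arguments. */
long spilled_offset_from_fp(int n, long callee_frame_bytes)
{
    return callee_frame_bytes + spilled_offset_from_sp(n);
}

/* Bytes the caller must allocate for the spilled arguments, rounded up
   so that SP stays 16-byte aligned. */
long spilled_mem_size(int total_args)
{
    long spilled = total_args > 8 ? (long)(total_args - 8) * 8 : 0;
    return (spilled + 15) & -16;
}
```

For a 10-argument call into a callee with a 16-byte frame, the caller allocates 16 bytes, and the callee finds argument 9 at FP + 16 and argument 10 at FP + 24.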
Subroutines with 9 or More Arguments (cont’d)
// Function prototype
int sum(int a1, int a2, int a3, int a4, int a5,
int a6, int a7, int a8, int a9, int a10);
int main()
{
register int result;
result = sum(10, 20, 30, 40, 50, 60, 70, 80, 90, 100);
}
int sum(int a1, int a2, int a3, int a4, int a5,
int a6, int a7, int a8, int a9, int a10)
{
return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10;
}
Subroutines with 9 or More Arguments (cont’d)
▪ Assembly code:
define(result_r, w19)
spilled_mem_size = 16
alloc = -spilled_mem_size & -16
dealloc = -alloc
.global main
main: stp x29, x30, [sp, -16]!
mov x29, sp
// Set up first 8 args
mov w0, 10
mov w1, 20
mov w2, 30
mov w3, 40
mov w4, 50
Subroutines with 9 or More Arguments (cont’d)
mov w5, 60
mov w6, 70
mov w7, 80
// Allocate memory for args 9 and 10
add sp, sp, alloc
// Write spilled arguments to top of stack
mov w9, 90
str w9, [sp, 0] // Set up arg 9
mov w9, 100
str w9, [sp, 8] // Set up arg 10
bl sum // call sum function
mov result_r, w0 // result is in w0
// Deallocate memory for spilled arguments
add sp, sp, dealloc
. . .
Subroutines with 9 or More Arguments (cont’d)
arg9_s = 16
arg10_s = 24
sum: stp x29, x30, [sp, -16]!
mov x29, sp
// Add first 8 arguments
add w0, w0, w1
add w0, w0, w2
add w0, w0, w3
add w0, w0, w4
add w0, w0, w5
add w0, w0, w6
add w0, w0, w7
// Add 9th and 10th args
ldr w9, [x29, arg9_s]
add w0, w0, w9
ldr w9, [x29, arg10_s]
add w0, w0, w9 // result in w0
ldp x29, x30, [sp], 16
ret
Subroutines with 9 or More Arguments (cont’d)
▪ When in sum(), the stack appears as (lowest address first):
free memory
SP, FP → Frame record (stack frame for sum)
fp + arg9_s → arg 9 (stack frame for main starts here)
pad bytes
fp + arg10_s → arg 10
pad bytes
Frame record