EC340 COA
Performance
• Speed up of execution
– Response time: how long it takes to do a task
– Throughput: total work done per unit time
• Instruction set, Hardware
• Software – OS, Compilers
• CPU time = user CPU time + system CPU time
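The distinction between response (elapsed) time and CPU time can be observed directly. The following is a minimal sketch, assuming Python's standard `os.times()` and `time.perf_counter()`; the helper `measure` and the workload `busy_work` are hypothetical names introduced only for illustration.

```python
import os
import time


def measure(task):
    """Run task() once and report elapsed, user CPU, and system CPU time in seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    task()
    cpu_end = os.times()
    wall_end = time.perf_counter()

    user = cpu_end.user - cpu_start.user
    system = cpu_end.system - cpu_start.system
    return {
        "elapsed": wall_end - wall_start,  # response time seen by the user
        "user": user,                      # time spent in the program itself
        "system": system,                  # time spent in the OS on its behalf
        "cpu": user + system,              # CPU time = user + system
    }


def busy_work():
    """Hypothetical CPU-bound workload."""
    return sum(i * i for i in range(200_000))


result = measure(busy_work)
print(result)
```

Elapsed time is normally at least as large as CPU time for a single-threaded task, because it also includes I/O waits and time spent running other processes.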
Page 22 COA August 2024
Understanding performance
• Algorithm
– Determines number of operations executed
• Programming language, compiler,
architecture
– Determine number of machine instructions
executed per operation
• Processor and memory system
– Determine how fast instructions are executed
• I/O system (including OS)
– Determines how fast I/O operations are executed
Dept of E&C, NITK Surathkal 1
Understanding performance
• How programs are translated into the machine
language
– And how the hardware executes them
• The hardware/software interface
• What determines performance of a program
• How hardware designers improve performance
– parallel processing
• How to improve energy efficiency
Seven great ideas
• Use abstraction to simplify design
• Make the common case fast
• Performance via parallelism
• Performance via pipelining
• Performance via prediction
• Hierarchy of memories
• Dependability via redundancy
CPU performance
• Performance = 1/Execution Time
• CPU time = CPU clocks x clock cycle time (tc)
• CPU clocks = Instruction count x clocks/instruction
• CPU time = IC x CPI/clock frequency (fc)
– CPI – average clocks per instruction
• Determined by CPU hardware
• If different instructions have different CPI
– Average CPI affected by instruction mix
• compare two different implementations of the same ISA
– IC – Instruction count
• Determined by program, ISA and compiler
– ISA – Instruction set architecture
Example
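Computer   tc       CPI   CPU time
A          250 ps   2     IC × 2 × 250 ps = 500 ps × IC
B          500 ps   1.2   IC × 1.2 × 500 ps = 600 ps × IC
Relative performance:
$$\frac{\text{Perf}_A}{\text{Perf}_B} = \frac{\text{CPU time}_B}{\text{CPU time}_A} = \frac{600\,\text{ps} \times IC}{500\,\text{ps} \times IC} = 1.2$$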
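The A/B comparison can be checked numerically. This is a small sketch under the assumption that both computers run the same program; the instruction count `ic` is a hypothetical value and cancels out of the ratio.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


ic = 1_000_000  # hypothetical instruction count, identical for A and B

time_a = cpu_time(ic, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time(ic, cpi=1.2, cycle_time_s=500e-12)

# Performance = 1 / execution time, so Perf_A / Perf_B = time_B / time_A.
relative_perf = time_b / time_a
print(f"A is {relative_perf:.2f} times faster than B")
```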
If different instruction classes take different numbers of cycles
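$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$
Weighted average CPI
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$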
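The two sums above translate directly into code. The sketch below uses a hypothetical instruction mix of three classes; the class CPIs and counts are illustrative values, not data from the slides.

```python
def clock_cycles(classes):
    """Total cycles: sum of CPI_i x instruction count_i over all classes."""
    return sum(cpi * count for cpi, count in classes)


def weighted_cpi(classes):
    """Average CPI: total clock cycles divided by total instruction count."""
    total_instructions = sum(count for _, count in classes)
    return clock_cycles(classes) / total_instructions


# Hypothetical mix: (CPI, instruction count) per instruction class.
mix = [(1, 2_000), (2, 1_000), (3, 2_000)]

print(clock_cycles(mix))  # total cycles for the mix
print(weighted_cpi(mix))  # average cycles per instruction
```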
Power
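$$\text{Power} \propto C \times V^2 \times f_c$$
– C: capacitive load, V: supply voltage, fc: clock frequency
– Trend annotations from the figure: power ×30, supply voltage 5 V → <1 V, clock frequency ×1000
• Dynamic Power
• Leakage
Courtesy- H&P, Computer Organisation, 6e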
Reducing power
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
$$\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$$
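The same ratio can be computed for any combination of scaling factors. This is a minimal sketch assuming dynamic power is proportional to C × V² × f; the function name `dynamic_power_ratio` is hypothetical.

```python
def dynamic_power_ratio(c_scale, v_scale, f_scale):
    """Return P_new / P_old when capacitive load, voltage, and frequency
    are each multiplied by the given scale factor (power ~ C * V^2 * f)."""
    return c_scale * v_scale ** 2 * f_scale


# New CPU: 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
ratio = dynamic_power_ratio(c_scale=0.85, v_scale=0.85, f_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")
```

Voltage enters squared, which is why lowering the supply voltage was the most effective lever until it could not be reduced further.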
◼ The power wall
◼ We can’t reduce voltage further
◼ We can’t remove more heat
◼ How else can we improve performance?
Processor performance
Constrained by power, instruction-level parallelism, memory latency
Courtesy- H&P, Computer Organisation, 6e
Multicore processors
• Requires explicitly parallel programming
– Compare with instruction-level parallelism, where the hardware executes multiple instructions at once and this is hidden from the programmer
• Programming for performance
– Scheduling
– Load balancing
– Optimizing communication and synchronization
SPEC CPU Benchmark
• Programs used to measure performance
– Supposedly typical of actual workload
• Standard Performance Evaluation Corporation (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2017
– Elapsed time to execute a selection of programs
– Negligible I/O, so focuses on CPU performance
– Normalize relative to reference machine
– Integer (10) and floating-point (13)
– Summarize as geometric mean of performance ratios
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
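The normalization and geometric-mean steps can be sketched as follows. The reference and measured times here are hypothetical placeholders, not SPEC results; each ratio is taken as reference time divided by measured time, so a larger value means a faster machine.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Performance ratio of one benchmark relative to the reference machine."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios):
    """n-th root of the product of n ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds.
timings = [(1000.0, 500.0), (1600.0, 200.0), (900.0, 225.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in timings]
score = geometric_mean(ratios)
print(ratios, score)
```

The geometric mean is used because the ranking it produces does not depend on which machine is chosen as the reference.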
SPECspeed 2017 Integer benchmarks on a
1.8 GHz Intel Xeon E5-2650L
Courtesy- H&P, Computer Organisation, 6e
SPEC power benchmark
• Power consumption of server at different
workload levels
– Performance: ssj_ops/sec
– Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
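A short sketch of the overall metric follows. The eleven measurement points are hypothetical values, assumed here to correspond to target loads from 100% down to 0% in steps of 10%; they are not measured data.

```python
def overall_ssj_ops_per_watt(ssj_ops, power_watts):
    """Sum of ssj_ops over all load levels divided by sum of power over the same levels."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(power_watts)


# Hypothetical readings for load levels 100%, 90%, ..., 10%, 0%.
ssj_ops = [1000 * level for level in range(10, -1, -1)]
power_watts = [100 + 10 * level for level in range(10, -1, -1)]

result = overall_ssj_ops_per_watt(ssj_ops, power_watts)
print(f"{result:.1f} ssj_ops per Watt")
```

Note that this is a ratio of sums, not an average of the per-level ratios, so power drawn at low load still counts against the overall score.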
SPECpower_ssj2008 for Xeon E5-2650L
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
Courtesy- H&P, Computer Organisation, 6e
Amdahl’s Law
• Pitfall: improving an aspect of a computer and expecting a
proportional improvement in overall performance
• Corollary: make the common case fast
$$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$$
◼ Example: multiply accounts for 80s/100s
◼ How much improvement in multiply performance to
get 5× overall?
$$20 = \frac{80}{n} + 20$$
◼ Can’t be done!
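The multiply example can be worked through in code. This is a minimal sketch; the helper names `improved_time` and `required_factor` are hypothetical, and `required_factor` returns `None` when no finite improvement factor can reach the target.

```python
def improved_time(t_affected, t_unaffected, factor):
    """Amdahl's Law: T_improved = T_affected / factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def required_factor(t_affected, t_unaffected, target_time):
    """Improvement factor needed to reach target_time, or None if unreachable."""
    remaining = target_time - t_unaffected
    if remaining <= 0:
        return None  # the unaffected part alone already uses up the target
    return t_affected / remaining


# Multiply takes 80 s of a 100 s run; a 5x overall speedup means 20 s total.
print(required_factor(80, 20, target_time=100 / 5))

# A 2.5x overall speedup (40 s total) is reachable.
print(required_factor(80, 20, target_time=40))
```

Because the unaffected 20 s already equals the 20 s target, 80/n would have to be zero, which no finite n achieves.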
Example
• Consider three different processors P1, P2, and P3 executing the
same instruction set. P1 has a 3 GHz clock rate and a CPI of
1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0
GHz clock rate and has a CPI of 2.2.
– Which processor has the highest performance expressed in instructions per
second?
Processor   Instructions/sec
P1          3 × 10^9 / 1.5 = 2 × 10^9
P2          2.5 × 10^9 / 1.0 = 2.5 × 10^9
P3          4 × 10^9 / 2.2 ≈ 1.8 × 10^9
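The comparison reduces to clock rate divided by CPI for each processor. A small sketch using the values given in the example:

```python
def instructions_per_second(clock_hz, cpi):
    """Instruction rate = clock rate / cycles per instruction."""
    return clock_hz / cpi


# (clock rate in Hz, CPI) as stated in the example.
processors = {"P1": (3.0e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4.0e9, 2.2)}

rates = {name: instructions_per_second(clock, cpi)
         for name, (clock, cpi) in processors.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.2e} instructions/sec")
print("Highest:", best)
```

P3 has the fastest clock but the lowest instruction rate, which shows that clock frequency alone does not determine performance.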
Example
• Consider two different implementations of the same ISA. The
instructions can be divided into four classes according to their CPI
(class A, B, C, and D). P1 has a clock rate of 2.5 GHz and CPIs of 1, 2,
3, and 3; P2 has a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
Given a program with a dynamic instruction count of 1.0E6 instructions
divided into classes as follows: 10% class A, 20% class B, 50% class C,
and 20% class D, which implementation is faster? What is the global
CPI for each implementation?
– Time = No. instr. × CPI / clock rate
Processor   Total Time        CPI
P1          10.4 × 10^-4 s    2.6
P2          6.67 × 10^-4 s    2
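The table entries can be reproduced from the problem statement. The sketch below weights each class CPI by its share of the instruction mix and then applies Time = instruction count × CPI / clock rate; the dictionary layout is just one convenient way to hold the given numbers.

```python
INSTRUCTION_COUNT = 1.0e6
MIX = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}  # fraction of instructions per class

PROCESSORS = {
    "P1": {"clock_hz": 2.5e9, "cpi": {"A": 1, "B": 2, "C": 3, "D": 3}},
    "P2": {"clock_hz": 3.0e9, "cpi": {"A": 2, "B": 2, "C": 2, "D": 2}},
}


def global_cpi(cpi_by_class):
    """Weighted average CPI over the instruction mix."""
    return sum(MIX[cls] * cpi_by_class[cls] for cls in MIX)


def total_time(name):
    """Execution time in seconds for the named implementation."""
    proc = PROCESSORS[name]
    return INSTRUCTION_COUNT * global_cpi(proc["cpi"]) / proc["clock_hz"]


for name, proc in PROCESSORS.items():
    print(f"{name}: CPI = {global_cpi(proc['cpi']):.2f}, time = {total_time(name):.3e} s")

faster = min(PROCESSORS, key=total_time)
print("Faster implementation:", faster)
```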
Exercise
• A processor has CPIs of 1, 12, and 5, respectively, for arithmetic,
load/store, and branch instructions. Assume that
– On a single processor a program requires the execution of 2.56E9
arithmetic instructions, 1.28E9 load/store instructions, and 256
million branch instructions.
– Each processor has a 2 GHz clock frequency.
– As the program is parallelized to run over multiple cores, the
number of arithmetic and load/store instructions per processor is
divided by 0.7 x p (where p is the number of processors) but the
number of branch instructions per processor remains the same.
• Find the total execution time for this program on 1, 2, 4, and 8
processors, and show the relative speedup of the 2, 4, and 8
processor result relative to the single processor result.
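One way to set up this exercise is sketched below. It rests on an assumption that the exercise leaves implicit: the 0.7 × p divisor is applied only when p > 1, and the single-processor run uses the instruction counts as given. The constant and function names are hypothetical.

```python
CLOCK_HZ = 2.0e9
CPI = {"arith": 1, "load_store": 12, "branch": 5}
COUNT = {"arith": 2.56e9, "load_store": 1.28e9, "branch": 256e6}


def execution_time(p):
    """Per-processor execution time in seconds on p processors.

    Assumption: arithmetic and load/store counts are divided by 0.7 * p
    only for p > 1; branch counts per processor never change.
    """
    scale = 1.0 if p == 1 else 0.7 * p
    cycles = (
        COUNT["arith"] / scale * CPI["arith"]
        + COUNT["load_store"] / scale * CPI["load_store"]
        + COUNT["branch"] * CPI["branch"]
    )
    return cycles / CLOCK_HZ


baseline = execution_time(1)
for p in (1, 2, 4, 8):
    t = execution_time(p)
    print(f"p = {p}: time = {t:.2f} s, speedup = {baseline / t:.2f}")
```

The branch term does not shrink with p, so the speedup grows more slowly than the processor count.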
Exercise
• A computer spends 30 percent of its time accessing memory, 20
percent performing multiplications, and 50 percent executing
other instructions. As a computer architect, you have to choose
between improving either the memory, multiplication hardware,
or execution of non-multiplication instructions. There is only
space on the chip for one improvement, and each of the
improvements will improve its associated part of the
computation by a factor of 2.
– Without performing any calculations, which improvement would
you expect to give the largest performance increase, and why?
– What speedup would making each of the three changes give?
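The second question can be checked with Amdahl's Law written in terms of time fractions. This is a sketch only; the function name `amdahl_speedup` is hypothetical, and the fractions are the ones stated in the exercise.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Fraction of total time spent in each part, as stated in the exercise.
parts = {"memory": 0.30, "multiplication": 0.20, "other instructions": 0.50}

speedups = {name: amdahl_speedup(fraction, factor=2.0)
            for name, fraction in parts.items()}

for name, s in speedups.items():
    print(f"Improve {name}: overall speedup = {s:.3f}")
```

The part that accounts for the largest share of the time yields the largest overall gain, which is the "make the common case fast" idea applied to a concrete choice.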
MIPS as performance benchmark
• MIPS: Millions of Instructions Per Second
– Doesn’t account for
• Differences in ISAs between computers
• Differences in complexity between instructions
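$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$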
◼ CPI varies between programs on a given CPU
Summary
• Cost/performance is improving
– Due to underlying technology development
• Hierarchical layers of abstraction
– In both hardware and software
• Instruction set architecture
– The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
– Use parallelism to improve performance