Performance
Topics:
• performance metrics
• CPU performance equation
• benchmarks and benchmarking
• reporting averages
• Amdahl's Law
• Little's Law
• concepts: balance, tradeoffs, bursty behavior (average and peak performance)
Performance Metrics
latency: response time, execution time
• good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
• = (1 / latency) when there is NO overlap
• > (1 / latency) when there is overlap
  – in real processors there is always overlap (e.g., pipelining)
• good metric for a fixed amount of time (maximize work)
comparing performance
• A is N times faster than B iff perf(A)/perf(B) = time(B)/time(A) = N
• A is X% faster than B iff perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
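These definitions apply mechanically; here is a minimal Python sketch (the function names are mine, not from the lecture):

    # speedup.py -- comparing machine A against machine B by execution time
    def speedup(time_b, time_a):
        """A is N times faster than B, where N = time(B) / time(A)."""
        return time_b / time_a

    def percent_faster(time_b, time_a):
        """A is X% faster than B, where time(B)/time(A) = 1 + X/100."""
        return (speedup(time_b, time_a) - 1.0) * 100.0

    print(speedup(10.0, 8.0))         # 1.25 -> A is 1.25 times faster
    print(percent_faster(10.0, 8.0))  # 25.0 -> A is 25% faster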
Performance Metric I: MIPS
MIPS (millions of instructions per second)
• (instruction count / execution time in seconds) × 10⁻⁶
• but instruction count is not a reliable indicator of work
  – Prob #1: work per instruction varies (FP multiply >> register move)
  – Prob #2: instruction sets aren't equal (3 Pentium instrs != 3 Alpha instrs)
• may vary inversely with actual performance
• particularly bad metric for multicore chips
Performance Metric II: MFLOPS
MFLOPS (millions of floating-point operations per second)
• (FP ops / execution time in seconds) × 10⁻⁶
• like MIPS, but counts only FP operations
  + FP ops have the longest latencies anyway (addresses problem #1)
  + FP ops are the same across machines (addresses problem #2)
• may have been valid in 1980 (most programs were FP)
• most programs today are integer, i.e., light on FP
  – load from memory takes longer than FP divide (problem #1 returns)
  – Cray doesn't implement divide; Motorola has SQRT, SIN, COS (problem #2 returns)
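Both metrics are simple ratios; a sketch with made-up measurements for illustration:

    # rates.py -- MIPS and MFLOPS from counts and execution time
    insts, fp_ops, time_s = 4e9, 5e8, 2.0   # invented measurements
    mips   = insts  / time_s / 1e6          # millions of instructions/sec
    mflops = fp_ops / time_s / 1e6          # millions of FP operations/sec
    print(mips, mflops)                     # 2000.0 250.0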
CPU Performance Equation
processor performance = seconds / program
• separate into three components (for a single core):
  seconds / program = (instructions / program) × (cycles / instruction) × (seconds / cycle)
• instructions / program: dynamic instruction count
  – mostly determined by program, compiler, ISA
• cycles / instruction: CPI
  – mostly determined by ISA and CPU/memory organization
• seconds / cycle: cycle time, clock period, 1 / clock frequency
  – mostly determined by technology and CPU organization
uses of the CPU performance equation
• high-level performance comparisons
• back-of-the-envelope calculations
• helping architects think about compilers and technology
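The three-way product drops straight into code; a minimal sketch (names are mine):

    # cpu_time.py -- the CPU performance equation as a function
    def cpu_time(insts, cpi, clock_hz):
        """seconds/program = (insts/program) x (cycles/inst) x (seconds/cycle)."""
        return insts * cpi / clock_hz

    # example: 1e9 dynamic instructions, CPI of 1.5, 2 GHz clock
    print(cpu_time(1e9, 1.5, 2e9))  # 0.75 seconds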
CPU Performance Comparison
famous example: RISC Wars (RISC vs. CISC)
• assume
  – instructions / program: CISC = P, RISC = 2P
  – CPI: CISC = 8, RISC = 2
  – T = clock period for CISC and RISC (assume they are equal)
• CISC time = P × 8 × T = 8PT
• RISC time = 2P × 2 × T = 4PT
• RISC time = CISC CPU time / 2
the truth is much, much, much more complex
• actual data from IBM AS/400 (CISC → RISC in 1995):
  – CISC time = P × 7 × T = 7PT
  – RISC time = 3.1P × 3 × (T/3.1) = 3PT (+1 technology generation)
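Plugging the RISC-Wars assumptions into the cpu_time sketch above (P = 1e9 instructions and T = 1 ns are illustrative values I chose, not from the slides):

    # illustrative values: P = 1e9 instructions, T = 1 ns (1 GHz clock)
    P, clock_hz = 1e9, 1e9
    cisc = cpu_time(insts=P,     cpi=8, clock_hz=clock_hz)  # 8PT = 8.0 s
    risc = cpu_time(insts=2 * P, cpi=2, clock_hz=clock_hz)  # 4PT = 4.0 s
    print(cisc / risc)  # 2.0 -> RISC in half the time, under these assumptions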
Actually Measuring Performance
how are execution time & CPI actually measured?
• execution time: time (Unix command): wall-clock, CPU, system time
• CPI = (CPU time × clock frequency) / # instructions
• more useful? a CPI breakdown (compute, memory stall, etc.)
  – so we know what the performance problems are (what to fix)
measuring the CPI breakdown
• hardware event counters (built into the core)
  – calculate CPI using instruction frequencies and event costs
• cycle-level microarchitecture simulator (e.g., SimpleScalar)
  + measure exactly what you want
  + model the microarchitecture faithfully (at least the parts of interest)
  • method of choice for many architects (yours, too!)
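A minimal sketch of both calculations; the event names and counts below are invented for illustration:

    # cpi.py -- overall CPI from a run, then a per-event breakdown
    cpu_time_s, clock_hz, insts = 2.0, 2e9, 1e9   # made-up measurements
    print(cpu_time_s * clock_hz / insts)          # 4.0 cycles per instruction

    # hypothetical event-counter totals (cycles attributed to each cause)
    cycles_by_cause = {"compute": 1.5e9, "memory stall": 2.0e9, "branch stall": 0.5e9}
    for cause, cycles in cycles_by_cause.items():
        print(cause, cycles / insts)              # CPI contribution per cause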
Benchmarks and Benchmarking
program as the unit of work
• millions of them, many different kinds: which to use?
benchmarks
• standard programs for measuring/comparing performance
  + represent programs people care about
  + repeatable!!
benchmarking process
1. define the workload
2. extract benchmarks from the workload
3. execute the benchmarks on candidate machines
4. project performance on the new machine
5. run the workload on the new machine and compare
6. not close enough? → repeat
Benchmarks: Toys, Kernels, Synthetics
toy benchmarks: little programs that no one really runs
• e.g., Fibonacci, 8 queens
– little value: what real programs do these represent?
– scary fact: used to prove the value of RISC in the early '80s
kernels: important (frequently executed) pieces of real programs
• e.g., Livermore loops, Linpack (inner product)
+ good for focusing on individual features, but not the big picture
– over-emphasize the target feature (for better or worse)
synthetic benchmarks: programs made up just for benchmarking
• e.g., Whetstone, Dhrystone
– essentially toy kernels++: which programs do these represent?
Benchmarks: Real Programs
real programs
+ the only accurate way to characterize performance
– require considerable work (porting)
Standard Performance Evaluation Corporation (SPEC)
• http://www.spec.org
• collects, standardizes, and distributes benchmark suites
• a consortium made up of industry leaders
• SPEC CPU (CPU-intensive benchmarks)
  – SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
• other SPEC suites: SPECjvm, SPECmail, SPECweb, SPEComp
• other benchmark suite examples: TPC-C, TPC-H for databases
SPEC CPU2006
12 integer programs (C, C++)
• gcc (compiler), perl (interpreter), hmmer (Markov chain)
• bzip2 (compression), go (AI), sjeng (AI)
• libquantum (physics), h264ref (video)
• omnetpp (simulation), astar (path-finding algorithms)
• xalancbmk (XML processing), mcf (network optimization)
17 floating-point programs (C, C++, Fortran)
• fluid dynamics: bwaves, leslie3d, lbm
• quantum chemistry: gamess, tonto
• physics: milc, zeusmp, cactusADM
• gromacs (biochemistry), namd (molecular dynamics), dealII (finite element analysis)
• soplex (linear programming), povray (ray tracing)
• calculix (mechanics), GemsFDTD (computational E&M)
• wrf (weather), sphinx3 (speech recognition)
Benchmarking Pitfalls
• benchmark properties mismatched with the features studied
  – e.g., using SPEC for large-cache studies
• careless scaling
  – using only the first few million instructions (the initialization phase)
  – reducing program data size
• choosing performance from the wrong application space
  – e.g., in a real-time environment, choosing gcc
• using old benchmarks
  – "benchmark specials": benchmark-specific optimizations
benchmarks must be continuously maintained and updated!
Reporting Average Performance
averages: one of the things architects frequently get wrong
+ pay attention now and you won't get them wrong
important things about averages (i.e., means)
• ideally proportional to execution time (the ultimate metric)
  – Arithmetic Mean (AM) for times
  – Harmonic Mean (HM) for rates (IPCs)
  – Geometric Mean (GM) for ratios (speedups)
• there is no such thing as "the average program"
• use averages only when absolutely necessary
What Does The Mean Mean?
arithmetic mean (AM): average execution times of N programs
• AM = (Σ i=1..N time(i)) / N
harmonic mean (HM): average IPCs of N programs
• the arithmetic mean cannot be used for rates (e.g., IPCs)
  – 30 MPH for 1 mile + 90 MPH for 1 mile != 60 MPH average (it is 45 MPH)
• HM = N / (Σ i=1..N 1/rate(i))
geometric mean (GM): average speedups of N programs
• GM = (Π i=1..N speedup(i))^(1/N)
what if programs run at different frequencies within the workload?
• weighting
• weighted AM = (Σ i=1..N w(i) × time(i)) / (Σ i=1..N w(i))
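The three means in a few lines of Python, checked against the MPH example:

    # means.py -- AM for times, HM for rates, GM for speedups
    from math import prod

    def am(times):                    # arithmetic mean of execution times
        return sum(times) / len(times)

    def hm(rates):                    # harmonic mean of rates (e.g., IPCs)
        return len(rates) / sum(1.0 / r for r in rates)

    def gm(speedups):                 # geometric mean of ratios (speedups)
        return prod(speedups) ** (1.0 / len(speedups))

    # the MPH example: 1 mile at 30 MPH, then 1 mile at 90 MPH
    print(am([30, 90]))               # 60.0 -- NOT the average speed
    print(hm([30, 90]))               # 45.0 -- the actual average speed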
GM Weirdness
what about averaging ratios (speedups)?
• AM and HM answers change depending on which machine is the base

               machine A   machine B   speedup B/A           speedup A/B
  Program1         1          10           0.1                  10
  Program2      1000         100          10                     0.1
  AM                                   (0.1+10)/2 = 5.05     (10+0.1)/2 = 5.05
                                       "B is 5.05× faster!"  "A is 5.05× faster!"
  HM                                   2/(1/0.1+1/10) = 0.198  2/(1/10+1/0.1) = 0.198
                                       "A is 5.05× faster!"  "B is 5.05× faster!"
  GM                                   √(0.1×10) = 1          √(10×0.1) = 1

• the geometric mean of ratios is not proportional to total time!
  – by total execution time (1001 vs. 110), B is 9.1 times faster
  – GM says they are equal
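The same table in a few lines of Python, reusing am and gm from the means sketch above:

    # gm_weirdness.py -- averaging speedup ratios in both directions
    a = [1.0, 1000.0]   # program times on machine A
    b = [10.0, 100.0]   # program times on machine B

    b_over_a = [ta / tb for ta, tb in zip(a, b)]   # [0.1, 10.0]
    a_over_b = [tb / ta for ta, tb in zip(a, b)]   # [10.0, 0.1]

    print(am(b_over_a), am(a_over_b))   # 5.05 5.05 -- each base "wins"
    print(gm(b_over_a), gm(a_over_b))   # 1.0 1.0   -- base-independent, but...
    print(sum(a) / sum(b))              # 9.1 -- by total time, B is 9.1x faster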
Amdahl's Law
"Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
• let an optimization speed up fraction f of a program by factor s
• speedup = old / ((1−f) × old + (f/s) × old) = 1 / (1 − f + f/s)
• f = 95%, s = 1.1: 1 / [(1−0.95) + 0.95/1.1] = 1.094
• f = 5%, s = 10:   1 / [(1−0.05) + 0.05/10] = 1.047
• f = 5%, s = ∞:    1 / [(1−0.05) + 0.05/∞] = 1.052
• f = 95%, s = ∞:   1 / [(1−0.95) + 0.95/∞] = 20
make the common case fast, but...
...the uncommon case eventually limits performance
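The law is one line of Python; a minimal sketch reproducing the four cases above:

    # amdahl.py -- speedup when fraction f of execution is sped up by factor s
    def amdahl_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    inf = float("inf")
    print(amdahl_speedup(0.95, 1.1))  # ~1.094
    print(amdahl_speedup(0.05, 10))   # ~1.047
    print(amdahl_speedup(0.05, inf))  # ~1.053
    print(amdahl_speedup(0.95, inf))  # ~20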
Little's Law
key relationship between latency and bandwidth:
• average number in system = arrival rate × mean holding time
• "possibly the most useful equation I know"
  – useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
• we drink (and buy) an average of 2 bottles per week
• on average, we want to age the wine for 5 years
• bottles in cellar = 2 bottles/week × 52 weeks/year × 5 years = 520 bottles
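The wine-cellar example as a sketch (the function name is mine):

    # little.py -- Little's Law: items in system = arrival rate x holding time
    def in_system(arrival_rate, holding_time):
        return arrival_rate * holding_time

    # 2 bottles/week = 104 bottles/year, held for 5 years
    print(in_system(2 * 52, 5))  # 520 bottles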
System Balance
each system component produces & consumes data
• make sure data supply and demand are balanced
• X demand >= X supply → computation is X-bound
  – e.g., memory-bound, CPU-bound, I/O-bound
• goal: be bound everywhere at once (why?)
• X can be bandwidth or latency
  – X is bandwidth → buy more bandwidth
  – X is latency → much tougher problem
Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." – David Clark
well...
• can convert some latency problems into bandwidth problems
  – then solve those with money
• the famous bandwidth/latency tradeoff
• architecture is the art of making tradeoffs
Bursty Behavior
Q: to sustain 2 IPC, how many instructions must the processor be able to
• fetch per cycle?
• execute per cycle?
• complete per cycle?
A: NOT 2 (more than 2)
• dependences will cause stalls (under-utilization)
• if the desired performance is X, peak performance must be > X
programs don't always obey average behavior
• can't design a processor to handle only average behavior