Performance
Topics:
• performance metrics
• CPU performance equation
• benchmarks and benchmarking
• reporting averages
• Amdahl's Law
• Little's Law
• concepts: balance, tradeoffs, bursty behavior (average and peak performance)
Performance Metrics
latency: response time, execution time
• good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
• = (1 / latency) when there is NO overlap
• > (1 / latency) when there is overlap
  – in real processors there is always overlap (e.g., pipelining)
• good metric for a fixed amount of time (maximize work)
comparing performance
• A is N times faster than B iff perf(A)/perf(B) = time(B)/time(A) = N
• A is X% faster than B iff perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
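These definitions apply mechanically; here is a minimal Python sketch (the function names are mine, not from the lecture):

    # speedup.py -- comparing machine A against machine B by execution time
    def speedup(time_b, time_a):
        """A is N times faster than B, where N = time(B) / time(A)."""
        return time_b / time_a

    def percent_faster(time_b, time_a):
        """A is X% faster than B, where time(B)/time(A) = 1 + X/100."""
        return (speedup(time_b, time_a) - 1.0) * 100.0

    print(speedup(10.0, 8.0))         # 1.25 -> A is 1.25 times faster
    print(percent_faster(10.0, 8.0))  # 25.0 -> A is 25% faster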
Performance Metric I: MIPS
MIPS (millions of instructions per second)
• (instruction count / execution time in seconds) × 10⁻⁶
• but instruction count is not a reliable indicator of work
  – Prob #1: work per instruction varies (FP multiply >> register move)
  – Prob #2: instruction sets aren't equal (3 Pentium instrs != 3 Alpha instrs)
• may vary inversely with actual performance
• particularly bad metric for multicore chips
Performance Metric II: MFLOPS
MFLOPS (millions of floating-point operations per second)
• (FP ops / execution time in seconds) × 10⁻⁶
• like MIPS, but counts only FP operations
  + FP ops have the longest latencies anyway (addresses problem #1)
  + FP ops are the same across machines (addresses problem #2)
• may have been valid in 1980 (most programs were FP)
• most programs today are integer, i.e., light on FP
  – load from memory takes longer than FP divide (problem #1 returns)
  – Cray doesn't implement divide; Motorola has SQRT, SIN, COS (problem #2 returns)
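Both metrics are simple ratios; a sketch with made-up measurements for illustration:

    # rates.py -- MIPS and MFLOPS from counts and execution time
    insts, fp_ops, time_s = 4e9, 5e8, 2.0   # invented measurements
    mips   = insts  / time_s / 1e6          # millions of instructions/sec
    mflops = fp_ops / time_s / 1e6          # millions of FP operations/sec
    print(mips, mflops)                     # 2000.0 250.0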
CPU Performance Equation
processor performance = seconds / program
• separate into three components (for a single core):
  seconds / program = (instructions / program) × (cycles / instruction) × (seconds / cycle)
• instructions / program: dynamic instruction count
  – mostly determined by program, compiler, ISA
• cycles / instruction: CPI
  – mostly determined by ISA and CPU/memory organization
• seconds / cycle: cycle time, clock period, 1 / clock frequency
  – mostly determined by technology and CPU organization
uses of the CPU performance equation
• high-level performance comparisons
• back-of-the-envelope calculations
• helping architects think about compilers and technology
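The three-way product drops straight into code; a minimal sketch (names are mine):

    # cpu_time.py -- the CPU performance equation as a function
    def cpu_time(insts, cpi, clock_hz):
        """seconds/program = (insts/program) x (cycles/inst) x (seconds/cycle)."""
        return insts * cpi / clock_hz

    # example: 1e9 dynamic instructions, CPI of 1.5, 2 GHz clock
    print(cpu_time(1e9, 1.5, 2e9))  # 0.75 seconds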
CPU Performance Comparison
famous example: RISC Wars (RISC vs. CISC)
• assume
  – instructions / program: CISC = P, RISC = 2P
  – CPI: CISC = 8, RISC = 2
  – T = clock period for CISC and RISC (assume they are equal)
• CISC time = P × 8 × T = 8PT
• RISC time = 2P × 2 × T = 4PT
• RISC time = CISC CPU time / 2
the truth is much, much, much more complex
• actual data from IBM AS/400 (CISC → RISC in 1995):
  – CISC time = P × 7 × T = 7PT
  – RISC time = 3.1P × 3 × (T/3.1) = 3PT (+1 technology generation)
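Plugging the RISC-Wars assumptions into the cpu_time sketch above (P = 1e9 instructions and T = 1 ns are illustrative values I chose, not from the slides):

    # illustrative values: P = 1e9 instructions, T = 1 ns (1 GHz clock)
    P, clock_hz = 1e9, 1e9
    cisc = cpu_time(insts=P,     cpi=8, clock_hz=clock_hz)  # 8PT = 8.0 s
    risc = cpu_time(insts=2 * P, cpi=2, clock_hz=clock_hz)  # 4PT = 4.0 s
    print(cisc / risc)  # 2.0 -> RISC in half the time, under these assumptions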
Actually Measuring Performance
how are execution time & CPI actually measured?
• execution time: time (Unix command): wall-clock, CPU, system time
• CPI = (CPU time × clock frequency) / # instructions
• more useful? a CPI breakdown (compute, memory stall, etc.)
  – so we know what the performance problems are (what to fix)
measuring the CPI breakdown
• hardware event counters (built into the core)
  – calculate CPI using instruction frequencies and event costs
• cycle-level microarchitecture simulator (e.g., SimpleScalar)
  + measure exactly what you want
  + model the microarchitecture faithfully (at least the parts of interest)
  • method of choice for many architects (yours, too!)
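A minimal sketch of both calculations; the event names and counts below are invented for illustration:

    # cpi.py -- overall CPI from a run, then a per-event breakdown
    cpu_time_s, clock_hz, insts = 2.0, 2e9, 1e9   # made-up measurements
    print(cpu_time_s * clock_hz / insts)          # 4.0 cycles per instruction

    # hypothetical event-counter totals (cycles attributed to each cause)
    cycles_by_cause = {"compute": 1.5e9, "memory stall": 2.0e9, "branch stall": 0.5e9}
    for cause, cycles in cycles_by_cause.items():
        print(cause, cycles / insts)              # CPI contribution per cause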
Benchmarks and Benchmarking
program as the unit of work
• millions of them, many different kinds: which to use?
benchmarks
• standard programs for measuring/comparing performance
  + represent programs people care about
  + repeatable!!
benchmarking process
1. define the workload
2. extract benchmarks from the workload
3. execute the benchmarks on candidate machines
4. project performance on the new machine
5. run the workload on the new machine and compare
6. not close enough? → repeat
Benchmarks: Toys, Kernels, Synthetics
toy benchmarks: little programs that no one really runs
• e.g., Fibonacci, 8 queens
– little value: what real programs do these represent?
– scary fact: used to prove the value of RISC in the early '80s
kernels: important (frequently executed) pieces of real programs
• e.g., Livermore loops, Linpack (inner product)
+ good for focusing on individual features, but not the big picture
– over-emphasize the target feature (for better or worse)
synthetic benchmarks: programs made up just for benchmarking
• e.g., Whetstone, Dhrystone
– essentially toy kernels++: which programs do these represent?
Benchmarks: Real Programs
real programs
+ the only accurate way to characterize performance
– require considerable work (porting)
Standard Performance Evaluation Corporation (SPEC)
• http://www.spec.org
• collects, standardizes, and distributes benchmark suites
• a consortium made up of industry leaders
• SPEC CPU (CPU-intensive benchmarks)
  – SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
• other SPEC suites: SPECjvm, SPECmail, SPECweb, SPEComp
• other benchmark suite examples: TPC-C, TPC-H for databases
SPEC CPU2006
12 integer programs (C, C++)
• gcc (compiler), perl (interpreter), hmmer (Markov chain)
• bzip2 (compression), go (AI), sjeng (AI)
• libquantum (physics), h264ref (video)
• omnetpp (simulation), astar (path-finding algorithms)
• xalancbmk (XML processing), mcf (network optimization)
17 floating-point programs (C, C++, Fortran)
• fluid dynamics: bwaves, leslie3d, lbm
• quantum chemistry: gamess, tonto
• physics: milc, zeusmp, cactusADM
• gromacs (biochemistry), namd (molecular dynamics), dealII (finite element analysis)
• soplex (linear programming), povray (ray tracing)
• calculix (mechanics), GemsFDTD (computational E&M)
• wrf (weather), sphinx3 (speech recognition)
Benchmarking Pitfalls
• benchmark properties mismatched with the features studied
  – e.g., using SPEC for large-cache studies
• careless scaling
  – using only the first few million instructions (the initialization phase)
  – reducing program data size
• choosing performance from the wrong application space
  – e.g., in a real-time environment, choosing gcc
• using old benchmarks
  – "benchmark specials": benchmark-specific optimizations
benchmarks must be continuously maintained and updated!
Reporting Average Performance
averages: one of the things architects frequently get wrong
+ pay attention now and you won't get them wrong
important things about averages (i.e., means)
• ideally proportional to execution time (the ultimate metric)
  – Arithmetic Mean (AM) for times
  – Harmonic Mean (HM) for rates (IPCs)
  – Geometric Mean (GM) for ratios (speedups)
• there is no such thing as "the average program"
• use averages only when absolutely necessary
What Does The Mean Mean?
arithmetic mean (AM): average execution times of N programs
• AM = (Σ i=1..N time(i)) / N
harmonic mean (HM): average IPCs of N programs
• the arithmetic mean cannot be used for rates (e.g., IPCs)
  – 30 MPH for 1 mile + 90 MPH for 1 mile != 60 MPH average (it is 45 MPH)
• HM = N / (Σ i=1..N 1/rate(i))
geometric mean (GM): average speedups of N programs
• GM = (Π i=1..N speedup(i))^(1/N)
what if programs run at different frequencies within the workload?
• weighting
• weighted AM = (Σ i=1..N w(i) × time(i)) / (Σ i=1..N w(i))
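The three means in a few lines of Python, checked against the MPH example:

    # means.py -- AM for times, HM for rates, GM for speedups
    from math import prod

    def am(times):                    # arithmetic mean of execution times
        return sum(times) / len(times)

    def hm(rates):                    # harmonic mean of rates (e.g., IPCs)
        return len(rates) / sum(1.0 / r for r in rates)

    def gm(speedups):                 # geometric mean of ratios (speedups)
        return prod(speedups) ** (1.0 / len(speedups))

    # the MPH example: 1 mile at 30 MPH, then 1 mile at 90 MPH
    print(am([30, 90]))               # 60.0 -- NOT the average speed
    print(hm([30, 90]))               # 45.0 -- the actual average speed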
GM Weirdness
what about averaging ratios (speedups)?
• AM and HM answers change depending on which machine is the base

               machine A   machine B   speedup B/A           speedup A/B
  Program1         1          10           0.1                  10
  Program2      1000         100          10                     0.1
  AM                                   (0.1+10)/2 = 5.05     (10+0.1)/2 = 5.05
                                       "B is 5.05× faster!"  "A is 5.05× faster!"
  HM                                   2/(1/0.1+1/10) = 0.198  2/(1/10+1/0.1) = 0.198
                                       "A is 5.05× faster!"  "B is 5.05× faster!"
  GM                                   √(0.1×10) = 1          √(10×0.1) = 1

• the geometric mean of ratios is not proportional to total time!
  – by total execution time (1001 vs. 110), B is 9.1 times faster
  – GM says they are equal
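The same table in a few lines of Python, reusing am and gm from the means sketch above:

    # gm_weirdness.py -- averaging speedup ratios in both directions
    a = [1.0, 1000.0]   # program times on machine A
    b = [10.0, 100.0]   # program times on machine B

    b_over_a = [ta / tb for ta, tb in zip(a, b)]   # [0.1, 10.0]
    a_over_b = [tb / ta for ta, tb in zip(a, b)]   # [10.0, 0.1]

    print(am(b_over_a), am(a_over_b))   # 5.05 5.05 -- each base "wins"
    print(gm(b_over_a), gm(a_over_b))   # 1.0 1.0   -- base-independent, but...
    print(sum(a) / sum(b))              # 9.1 -- by total time, B is 9.1x faster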
Amdahl's Law
"Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
• let an optimization speed up fraction f of a program by factor s
• speedup = old / ((1−f) × old + (f/s) × old) = 1 / (1 − f + f/s)
• f = 95%, s = 1.1: 1 / [(1−0.95) + 0.95/1.1] = 1.094
• f = 5%, s = 10:   1 / [(1−0.05) + 0.05/10] = 1.047
• f = 5%, s = ∞:    1 / [(1−0.05) + 0.05/∞] = 1.052
• f = 95%, s = ∞:   1 / [(1−0.95) + 0.95/∞] = 20
make the common case fast, but...
...the uncommon case eventually limits performance
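The law is one line of Python; a minimal sketch reproducing the four cases above:

    # amdahl.py -- speedup when fraction f of execution is sped up by factor s
    def amdahl_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    inf = float("inf")
    print(amdahl_speedup(0.95, 1.1))  # ~1.094
    print(amdahl_speedup(0.05, 10))   # ~1.047
    print(amdahl_speedup(0.05, inf))  # ~1.053
    print(amdahl_speedup(0.95, inf))  # ~20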
Little's Law
key relationship between latency and bandwidth:
• average number in system = arrival rate × mean holding time
• "possibly the most useful equation I know"
  – useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
• we drink (and buy) an average of 2 bottles per week
• on average, we want to age the wine for 5 years
• bottles in cellar = 2 bottles/week × 52 weeks/year × 5 years = 520 bottles
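The wine-cellar example as a sketch (the function name is mine):

    # little.py -- Little's Law: items in system = arrival rate x holding time
    def in_system(arrival_rate, holding_time):
        return arrival_rate * holding_time

    # 2 bottles/week = 104 bottles/year, held for 5 years
    print(in_system(2 * 52, 5))  # 520 bottles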
System Balance
each system component produces & consumes data
• make sure data supply and demand are balanced
• X demand >= X supply → computation is X-bound
  – e.g., memory-bound, CPU-bound, I/O-bound
• goal: be bound everywhere at once (why?)
• X can be bandwidth or latency
  – X is bandwidth → buy more bandwidth
  – X is latency → much tougher problem
Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." – David Clark
well...
• can convert some latency problems into bandwidth problems
  – then solve those with money
• the famous bandwidth/latency tradeoff
• architecture is the art of making tradeoffs
Bursty Behavior
Q: to sustain 2 IPC, how many instructions must the processor be able to
• fetch per cycle?
• execute per cycle?
• complete per cycle?
A: NOT 2 (more than 2)
• dependences will cause stalls (under-utilization)
• if the desired performance is X, peak performance must be > X
programs don't always obey average behavior
• can't design a processor to handle only average behavior