Lecture 3: Evaluating Computer Architectures
Announcements
- Reminder: Homework 1 due Thursday 2/2
Last Time technology back ground
Computer elements
Circuits and timing
Virtuous cycle of the past and future?
Today
What is computer performance?
What programs do I care about?
Performance equations
Amdahls Law
UTCS CS352 Lecture 3
1
Software & Hardware: The Virtuous Cycle?
Faster Single
Processor
Frequency Scaling Larger, More
Capable Software
Managed Languages
?
More Cores Scalable Software
Scalable Apps +
Multi/Many Core
Scalable Runtime
UTCS CS352
Lecture 3 2
1
Performance Hype
AMD Performance Preview: Taking Phenom II to 4.2 GHz
Intel Core i78 processing threads They are the best
desktop processor family on the planet.
With 8 cores, each supporting 4 threads, the UltraSPARC T1 processor
executes 32 simultaneous threads within a design consuming only 72 watts of power.
improves throughput by up to 41x
speed up by 10-25% in many cases
about 2x in two cases
more than 10x in two small benchmarks
speedups of 1.2x to 6.4x on a variety of benchmarks
can reduce garbage collection time by 50% to 75%
demonstrating high efficiency and scalability
our prototype has usable performance
speedups. are very significant (up to 54-fold)
sometimes more than twice as fast
our . is better or almost as good as . across the board
UTCS CS352 Lecture 3
3
What Does this Graph Mean?
Performance Trends on SPEC Int 2000
UTCS CS352 Lecture 3
4
2
Computer Performance Evaluation
Metric = something we measure
Goal: Evaluate how good/bad a design is
Examples
Clock rate of computer
Power consumed by a program
Execution time for a program
Number of programs executed per second
Cycles per program instruction
How should we compare two computer systems?
UTCS CS352 Lecture 3
5
Tradeoff: latency vs. throughput
Pizza delivery
Do you want your pizza hot?
Or do you want your pizza to be inexpensive?
Two different delivery strategies for pizza company!
This course focuses primarily on latency (hot pizza)
Latency = execution time for a single task
Throughput = number of tasks per unit time
UTCS CS352 Lecture 3
6
3
Two notions of performance
Throughput
Plane DC to Paris Speed Passengers
(pmph)
Boeing 747 6.5 hours 610 mph 470 286,700
Concorde 3 hours 1350 mph 132 178,200
Which has plane higher performance?
Time to do the task (Execution Time)
execution time, response time, latency
Tasks per day, hour, week, sec, ns. .. (Performance)
throughput, bandwidth
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
7
Definitions
Performance is in units of things-per-second
bigger is better
Response time of a system Y running program Z
performance (Y) = 1
execution time (Z on Y)
Throughput of system Y running many programs
performance (Y) = number of programs
unit time
" System X is n times faster than Y" means
n = performance(X)
performance(Y)
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
8
4
Definitions
Performance is in units of things-per-second
bigger is better
Response time of a system Y running program Z
performance (Y) = 1
execution time (Z on Y)
Throughput of system Y running many programs
performance (Y) = number of programs
unit time
" System X is n times faster than Y" means
n = performance(X)
performance(Y)
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
9
Which Programs Should I Measure?
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
10
5
Which Programs Should I Measure?
Pros Cons
Actual Target Workload
Full Application Benchmarks
Small Kernel
Benchmarks
Microbenchmarks
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
11
Which Programs Should I Measure?
Pros Cons
very specific
representative Actual Target Workload non-portable
difficult to run, or
measure
hard to identify cause
portable
widely used Full Application Benchmarks less representative
improvements
useful in reality
Small Kernel
easy to run, early in easy to fool
Benchmarks
design cycle
identify peak peak may be a long
capability and Microbenchmarks way from application
potential bottlenecks performance
UTCS CS352 Slide courtesy of D. Patterson Lecture 3
12
6
Brief History of Benchmarking
Early days (1960s) Real Applications (1989-now)
Single instruction execution SPEC CPU C/Fortran
time Scientific, Irregular
Average instruction time 89, 92, 95, 00, 07, ??
[Gibson 1970]
TPC C: Transaction Processing
Pure MIPS (1/AIT)
SPECWeb
WinBench: Desktop
Simple programs(early 70s)
Graphics C/C++
Synthetic benchmarks
Quake III, Doom 3
(Whetstone, etc.)
MediaBench
Kernels (Livermore Loops)
Java: SPECJVM98
Relative Performance (late 70s) Problem: Programming Language
Parallel?, Java, C#, JavaScript??
VAX 11/780 1-MIPS
DaCapo Java Benchmarks 06, 09
but was it?
Parsec: Parallel C/C++, 2008
MFLOPs
UTCS CS352 Lecture 3
13
How to Compromise a Comparison:
C programs running on two architectures
UTCS CS352 Lecture 3
14
7
The compiler reorganized the code!
Change the memory system performance
Matrix multiply cache blocking
After
Before
UTCS CS352 Lecture 3
15
There are lies, damn lies, and statistics
Desraeli
UTCS CS352 Lecture 3
16
8
benchmarks
There are lies, damn lies, and statistics
Desraeli
UTCS CS352 Lecture 3
17
Benchmarking Java Programs
Lets consider the performance of the DaCapo
Java Benchmarks
What do we need to think about when comparing
two computers running Java programs?
http://dacapo.anu.edu.au/regression/perf
/2006-10-MR2.html
UTCS CS352 Lecture 3
18
9
Pay Attention to Benchmarks & System
Benchmarks measure the Benchmark timings are
whole system sensitive
application alignment in cache
compiler, VM, memory location of data on disk
management values of data
operating system
architecture Danger of inbreeding or
implementation positive feedback
Popular benchmarks often if you make an operation
reflect yesterdays fast (slow) it will be used
programs more (less) often
what about the programs therefore you make it
people are running today? faster (slower)
need to design for and so on, and so on
tomorrows problems the optimized NOP
UTCS CS352 Lecture 3
19
Performance Summary so Far
Key concepts
Throughput and Latency
Best benchmarks are real programs
DaCapo, Spec, TPC, Doom3
Pitfalls
Whole system measurement
Workload may not match users
Compiler, VM, memory management
Next
Amdahls Law
UTCS CS352 Lecture 3
20
10
Improving Performance: Fundamentals
Suppose we have a machine with two instructions
Instruction A executes in 100 cycles
Instruction B executes in 2 cycles
We want better performance.
Which instruction do we improve?
UTCS CS352 Lecture 3
21
Speedup
Make a change to an architecture
Measure how much faster/slower it is
UTCS CS352 Lecture 3
22
11
Speedup when we know details about the change
Performance improvements depend on:
how good is enhancement (factor S)
how often is it used (fraction p)
Speedup due to enhancement E:
ExTime w/out E Perf w/ E
Speedup(E) = =
ExTime w/ E Perf w/out E
UTCS CS352 Lecture 3
23
Amdahls Law: Example
FP instructions improved by 2x
But.only 10% of instructions are FP
0.1
ExTimenew = ExTimeold 0.9 + = 0.95 ExTimeold
2
Amdahls Law: Speedup bounded by
UTCS CS352 Lecture 3
24
12
How Does Amdahls Law Apply to Multicore?
Given N cores what is our ideal speedup?
UTCS CS352 Lecture 3
25
How Does Amdahls Law Apply to Multicore?
Given N cores what is our ideal speedup?
ExTimenew = ExTimeold /N
Say 90% of the code is parallel and N = 16?
UTCS CS352 Lecture 3
26
13
How Does Amdahls Law Apply to Multicore?
Given N cores what is our ideal speedup?
ExTimenew = ExTimeold /N
Say 90% of the code is parallel and N = 16?
p
ExTimenew = ExTimeold (1 p) +
N
0.9
ExTimenew = ExTimeold 0.1+ = 0.15625 ExTimeold
16
1
Speeduptotal = = 6.2
0.15625
UTCS CS352 Lecture 3
27
How Does Amdahls Law Apply to Multicore?
UTCS CS352 Lecture 3
28
14
Performance Summary so Far
Amdahls law: Pay attention to what are you speeding up.
Next Time
More on Performance
Cycles per Instruction
Means
Start: Instruction Set Architectures (ISA)
Read: P&H 2.1 2.5
Turn in your homework at the beginning of class
UTCS CS352 Lecture 3
29
15