Multi-Core Computer Architecture
Lecture 1D
 Performance Evaluation Methods
                  John Jose
             Associate Professor
 Department of Computer Science & Engineering
    Indian Institute of Technology Guwahati
             Measuring Performance
❖ When can we say one computer / architecture / design is
  better than another?
   ❖ Desktop PC – execution time of a program
   ❖ Server – transactions per unit time
❖ When can we say X is n times faster than Y?
   ❖ Execution time_Y / Execution time_X = n
   ❖ Throughput_X / Throughput_Y = n
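For instance (with made-up numbers): if a program takes 15 s on Y and 5 s on X,
then Execution time_Y / Execution time_X = 15 / 5 = 3, so X is 3 times faster than Y.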
            Measuring Performance
❖ Typical performance metrics:
     ❖Response time
     ❖Throughput
     ❖CPU time
     ❖Wall clock time
     ❖Speedup
❖ Benchmarks
     ❖Toy programs (e.g. sorting, matrix multiply)
     ❖Synthetic benchmarks (e.g. Dhrystone)
     ❖Benchmark suites (e.g. SPEC06, SPLASH)
Benchmark Suite
Benchmark Based Evaluation
SPEC Ratio
      Reference for SPEC 2006:
      Sun Ultra Enterprise 2 workstation with
      a 296-MHz UltraSPARC II processor
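A minimal Python sketch of how a SPEC score is formed: each benchmark's SPECratio is the
reference machine's run time divided by the measured machine's run time, and the overall
score is the geometric mean of the ratios. The benchmark names and times below are
hypothetical placeholders, not real SPEC results.

from math import prod

# Hypothetical run times in seconds (not real SPEC data).
ref_times  = {"bench_a": 9000.0, "bench_b": 12000.0}   # reference machine times
meas_times = {"bench_a":  500.0, "bench_b":  1200.0}   # machine under test

ratios = {b: ref_times[b] / meas_times[b] for b in ref_times}   # SPECratio per benchmark
score  = prod(ratios.values()) ** (1.0 / len(ratios))           # geometric mean = overall score
print(ratios, score)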
                         Amdahl's Law
❖ Amdahl’s Law defines the speedup that can be gained by improving
   some portion of a computer.
❖ The performance improvement to be gained from using some faster
   mode of execution is limited by the fraction of the time the faster mode
   can be used.
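In equation form: if the enhancement can be used for a fraction f of the original
execution time and speeds that portion up by a factor s, then
   Execution time_new = Execution time_old × ((1 − f) + f / s)
   Speedup_overall = 1 / ((1 − f) + f / s)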
            Amdahl's Law- Illustration
Example: Suppose we want to enhance the floating-point operations
of a processor by introducing a new, advanced FPU. The new FPU is
10 times faster on floating-point computations than the original
processor. Assuming floating-point operations account for 40% of
the program's execution time, what is the overall speedup gained
by incorporating the enhancement?
Solution:
Fraction enhanced = 0.4
Speedup enhanced = 10
Speedup overall = 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56
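The same calculation as a small Python sketch (the helper name amdahl_speedup is just an
illustrative choice):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))   # ~1.5625, matching the example above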
   Amdahl's Law for Parallel Processing
Illustration: a program of 500 time units, of which 300 units (three blocks
of 100) are inherently serial and 200 units (two blocks of 100) can be split
evenly across processors.

Processors      1          2          4          ∞
Work            500        500        500        500
Time            500        400        350        ≈300
Speedup         1x         1.25x      ~1.4x      ~1.7x
How much speedup can you achieve?
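These numbers follow directly from Amdahl's law: the parallelizable fraction is
f = 200 / 500 = 0.4, so Speedup(N) = 1 / ((1 − f) + f / N), giving 1.25 for N = 2,
≈1.43 for N = 4, and 1 / 0.6 ≈ 1.67 as N → ∞. No matter how many processors are
added, the serial 60% caps the speedup below 1.67x.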
                       Design Example
A common transformation required in graphics processors is square
root. Implementations of floating-point (FP) square root vary
significantly in performance, especially among processors designed
for graphics. Suppose FP square root (FPSQR) is responsible for
20% of the execution time of a critical graphics benchmark.
         One proposal is to enhance the FPSQR hardware and
speed up this operation by a factor of 10. The other alternative is
just to try to make all FP instructions in the graphics processor run
faster by a factor of 1.6; FP instructions are responsible for half of
the execution time for the application. Compare these two design
alternatives using Amdahl's Law.
                        Design Example
Case A: FPSQR hardware optimization
  Speedup_A = 1 / ((1 − 0.2) + 0.2 / 10) = 1 / 0.82 ≈ 1.22

Case B: FP instructions optimization
  Speedup_B = 1 / ((1 − 0.5) + 0.5 / 1.6) = 1 / 0.8125 ≈ 1.23

Improving all FP instructions gives a slightly higher overall speedup,
so Case B is (marginally) the better design choice.
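A quick cross-check of both alternatives in Python (a sketch using the standard Amdahl
formula, with the fractions and speedups taken from the problem statement):

def amdahl_speedup(f, s):
    # Overall speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - f) + f / s)

case_a = amdahl_speedup(0.2, 10)    # FPSQR hardware 10x faster -> ~1.22
case_b = amdahl_speedup(0.5, 1.6)   # all FP instructions 1.6x  -> ~1.23
print(case_a, case_b)               # speeding up all FP wins slightly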
         Principles of Computer Design
❖ All processors are driven by a clock.
❖ Expressed as a clock rate (in GHz) or a clock period (in ns)
❖ CPU Time = CPU clock cycles x clock cycle time
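Since CPU clock cycles = Instruction count (IC) × average cycles per instruction (CPI),
this can also be written as CPU Time = IC × CPI × Clock cycle time = IC × CPI / Clock rate.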
        Principles of Computer Design
❖ Clock cycle time – determined by hardware technology
❖ CPI (cycles per instruction) – determined by organization and the ISA
❖ IC (instruction count) – determined by the ISA and compiler technology
          Principles of Computer Design
❖ Different instruction types have different CPIs
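In that case, CPU clock cycles = Σ_i (IC_i × CPI_i), so the effective CPI is the
instruction-mix-weighted average Σ_i (IC_i / IC) × CPI_i; the next example applies
this directly.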
        Example: Basic Performance Analysis
Consider two programs A and B that solve a given problem. A is scheduled
to run on a processor P1 operating at 1 GHz, and B is scheduled to run on
a processor P2 running at 1.4 GHz. A has a total of 10000 instructions, of
which 20% are branch instructions, 40% are load/store instructions, and the
rest are ALU instructions. B is composed of 25% branch instructions, and the
number of load/store instructions in B is twice the number of ALU
instructions. The total instruction count of B is 12000. In both P1 and P2,
branch instructions have an average CPI of 5 and ALU instructions have an
average CPI of 1.5. The two architectures differ in the CPI of load/store
instructions: 2 for P1 and 3 for P2. Which mapping (A on P1 or B on P2)
solves the problem faster, and by how much?
         Example: Basic Performance Analysis
A on P1 (1 GHz → CCT = 1 ns):
  IC = 10000
  Fraction  BR : L/S : ALU = 20 : 40 : 40
  CPI       BR : L/S : ALU = 5 : 2 : 1.5
  (a) CPI_A_P1 = 0.2×5 + 0.4×2 + 0.4×1.5 = 2.4
      ExT = 2.4 × 10000 × 1 ns = 24000 ns

B on P2 (1.4 GHz → CCT ≈ 0.714 ns):
  IC = 12000
  Fraction  BR : L/S : ALU = 25 : 50 : 25
  CPI       BR : L/S : ALU = 5 : 3 : 1.5
  (b) CPI_B_P2 = 0.25×5 + 0.5×3 + 0.25×1.5 = 3.125
      ExT = 3.125 × 12000 × 0.714 ns ≈ 26775 ns
Hence A on P1 is faster, by a factor of 26775 / 24000 ≈ 1.12.
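The same calculation as a short Python sketch of the CPU performance equation
(instruction mixes and CPIs are taken from the problem statement; the slide's 0.714 ns
is 1/1.4 GHz rounded, which is why its answer is 26775 ns rather than ~26786 ns):

def exec_time_ns(ic, mix, cpi, cct_ns):
    # CPU time = IC x effective CPI x clock cycle time
    eff_cpi = sum(mix[c] * cpi[c] for c in mix)
    return ic * eff_cpi * cct_ns

a_on_p1 = exec_time_ns(10000, {"br": 0.2,  "ls": 0.4, "alu": 0.4},
                       {"br": 5, "ls": 2, "alu": 1.5}, 1.0)       # 24000 ns
b_on_p2 = exec_time_ns(12000, {"br": 0.25, "ls": 0.5, "alu": 0.25},
                       {"br": 5, "ls": 3, "alu": 1.5}, 1 / 1.4)   # ~26786 ns
print(a_on_p1, b_on_p2, b_on_p2 / a_on_p1)                        # ratio ~1.12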
                   Example: Amdahl's Law
A company is releasing two new versions (beta and gamma) of its basic
processor architecture, named alpha. Beta and gamma are designed by making
modifications to three major components (X, Y, and Z) of alpha. It was
observed that for a program A, the fractions of the total execution time
spent on these three components, X, Y, and Z, are 40%, 30%, and 20%,
respectively. Beta speeds up X and Z by 2 times but slows down Y by 1.3
times, whereas gamma speeds up X, Y, and Z by 1.2, 1.3, and 1.4 times,
respectively.
(a) How much faster is gamma than alpha for running A?
(b) Is beta or gamma faster for running A? Find the speedup factor.
Given: f_X = 0.4, f_Y = 0.3, f_Z = 0.2
Beta:  N_X = 2,   N_Y = 1/1.3, N_Z = 2
Gamma: N_X = 1.2, N_Y = 1.3,   N_Z = 1.4
                 Example: Amdahl's Law
(a) Time_gamma / Time_alpha = 0.1 + 0.4/1.2 + 0.3/1.3 + 0.2/1.4 ≈ 0.807,
    so gamma is 1 / 0.807 ≈ 1.239 times faster than alpha.
(b) Time_beta / Time_alpha = 0.1 + 0.4/2 + 0.3×1.3 + 0.2/2 = 0.79,
    so beta is 1 / 0.79 ≈ 1.267 times faster than alpha.
    Beta is faster than gamma: 1.267 / 1.239 ≈ 1.022 times.
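A sketch of the multi-component form of Amdahl's law used above (component fractions
and per-component speedups come from the problem; the remaining 10% of execution time
is untouched, and a slowdown is simply a speedup less than 1):

def overall_speedup(fractions, speedups):
    # New time = untouched fraction + sum over components of (f_i / s_i)
    untouched = 1.0 - sum(fractions.values())
    new_time = untouched + sum(fractions[c] / speedups[c] for c in fractions)
    return 1.0 / new_time

frac  = {"X": 0.4, "Y": 0.3, "Z": 0.2}
beta  = overall_speedup(frac, {"X": 2.0, "Y": 1 / 1.3, "Z": 2.0})   # ~1.266
gamma = overall_speedup(frac, {"X": 1.2, "Y": 1.3,     "Z": 1.4})   # ~1.239
print(gamma, beta, beta / gamma)                                    # beta/gamma ~1.022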
      johnjose@iitg.ac.in
http://www.iitg.ac.in/johnjose/