
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 1
Fundamentals of Quantitative
Design and Analysis

Copyright © 2012, Elsevier Inc. All rights reserved. 1


Contents
1. Introduction
2. Classes of computers
3. Trends in computer architecture
4. Parallelism
5. Power and energy
6. Chip fabrication costs
7. Benchmarks
8. Principles of computer design
9. Fallacies and pitfalls
10. Evolution of supercomputers
11. Problem solving



Introduction
Computer technology

• Performance improvements:
  - Improvements in semiconductor technology
    - Feature size, clock speed
  - Improvements in computer architectures
    - Enabled by HLL compilers, UNIX
    - Lead to RISC architectures

• Together have enabled:
  - Lightweight computers
  - Productivity-based managed/interpreted programming languages


Introduction
Single processor performance
[Figure: growth in single-processor performance over time; annotations: RISC, move to multi-processor]


Introduction
Current trends in architecture

• Cannot continue to leverage instruction-level parallelism (ILP)
  - The rapid improvement in single-processor performance ended around 2003. Why?

• New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)

• The new models for performance require explicit restructuring of the application. No more free lunch for application developers!


Classes of Computers
Classes of computers
1. Personal Mobile Device (PMD)
   a. e.g. smart phones, tablet computers
   b. Emphasis on energy efficiency and real-time performance
2. Desktop Computers
   a. Emphasis on price-performance
3. Servers
   a. Emphasis on availability, scalability, throughput
4. Clusters / Warehouse Scale Computers (WSCs)
   a. Used for “Software as a Service (SaaS)”
   b. Emphasis on availability and price-performance
   c. Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
5. Embedded Computers
   a. Emphasis: price


                         Personal mobile   Desktop         Server                Cluster / WSC            Embedded
                         device
Price of system ($)      100 - 1,000       300 - 2,500     5,000 - 10,000,000    100,000 - 200,000,000    10 - 100,000
Price of processor ($)   10 - 100          50 - 500        200 - 2,000           50 - 250                 0.01 - 100
Critical system          Cost, energy,     Cost, energy,   Throughput,           Price-performance,       Price, energy,
design issues            performance       performance,    availability,         energy proportionality   performance
                                           graphics        scalability, energy
Classes of Computers
Parallelism
• Application parallelism:
  - Data-Level Parallelism (DLP)
  - Task-Level Parallelism (TLP)

• Architectural parallelism exploits application parallelism:
  - Instruction-Level Parallelism (ILP) → pipelining, speculative execution.
  - Vector architectures/Graphics Processing Units (GPUs) exploit DLP in SIMD architectures.
  - Thread-Level Parallelism → exploits DLP or TLP among interacting threads.
  - Request-Level Parallelism → exploits parallelism among decoupled tasks.
Classes of Computers
Michael Flynn’s taxonomy
• SISD - Single Instruction stream, Single Data stream

• SIMD - Single Instruction stream, Multiple Data streams
  - Vector architectures
  - Multimedia extensions
  - Graphics processor units

• MIMD - Multiple Instruction streams, Multiple Data streams
  - Tightly-coupled MIMD
  - Loosely-coupled MIMD

• MISD - Multiple Instruction streams, Single Data stream
  - No commercial implementation


Defining Computer Architecture
Defining computer architecture
• Old view of computer architecture:
  - Instruction Set Architecture (ISA) design
  - Decisions regarding:
    - registers, memory addressing,
    - addressing modes,
    - instruction operands,
    - available operations,
    - control flow instructions,
    - instruction encoding.

• Real computer architecture:
  - Specific requirements of the target machine
  - Design to maximize performance within constraints: cost, power, and availability
  - Includes ISA, microarchitecture, hardware


MIPS instruction format
• R-instructions → all data values are in registers
    OPCODE rd,rs,rt      Example: add $s1, $s2, $s3
    rd: destination register
    rs, rt: source registers
• I-instructions → operate on an immediate value and a register value. Immediate values may be at most 16 bits long.
    OPCODE rs,rt,Imm
• J-instructions → used to transfer control
    OPCODE label
• FR-instructions → similar to R-instructions, but operating on floating-point values
    OPCODE fmt,fs,ft,fd,funct
• FI-instructions → similar to I-instructions, but operating on floating-point values
    OPCODE fmt,ft,Imm
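
As an illustration of the R format, the sketch below packs the example instruction add $s1, $s2, $s3 into a 32-bit word. It assumes the standard MIPS32 R-format field layout (opcode, rs, rt, rd, shamt, funct with widths 6/5/5/5/5/6) and the conventional register numbers $s1 = 17, $s2 = 18, $s3 = 19; the helper name encode_r_type is hypothetical and not part of the slides.

```python
# Sketch: pack the six fields of a MIPS32 R-format instruction into a 32-bit word.
# The field widths (6/5/5/5/5/6) are the standard MIPS32 layout; encode_r_type
# is an illustrative helper name, not something defined in the slides.

def encode_r_type(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Return the 32-bit encoding laid out as opcode | rs | rt | rd | shamt | funct."""
    fields = [(opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        # Shift the bits accumulated so far to the left and append this field.
        word = (word << width) | value
    return word


# add $s1, $s2, $s3: rd = $s1 (17), rs = $s2 (18), rt = $s3 (19), funct = 0x20
ADD_S1_S2_S3 = encode_r_type(opcode=0, rs=18, rt=19, rd=17, shamt=0, funct=0x20)
print(f"{ADD_S1_S2_S3:#010x}")  # prints 0x02538820
```

Note that the assembly operand order (rd, rs, rt) differs from the order of the fields in the encoded word (rs, rt, rd).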



Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits long. The R format is for integer
register-to-register operations, such as DADDU, DSUBU, and so on. The I format is for data transfers, branches, and
immediate instructions, such as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating-point
operations, and the FI format for floating-point branches.

Computer implementation

• Organization / microarchitecture → high-level aspects of computer design, including:
  - Memory system
  - Memory interconnect
  - CPU
• Hardware → detailed logic design and the packaging technology.
Trends in Technology
Technology improvement rate per year
• Integrated circuit
  - Transistor density: 35% (Moore’s law)
  - Die size: 10-20%
  - Integration overall: 40-55%

• DRAM capacity: 25-40% (slowing)

• Flash capacity: 50-60%
  - 15-20X cheaper/bit than DRAM

• Magnetic disk: 40%
  - 15-25X cheaper/bit than Flash
  - 300-500X cheaper/bit than DRAM
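
To relate these annual rates to doubling times, a quantity growing at rate r per year doubles every ln 2 / ln(1 + r) years. The sketch below applies that identity to the rates listed above; the function name doubling_time_years is illustrative.

```python
import math

# Sketch: convert a compound annual improvement rate into a doubling time.
# The rates come from the slide; doubling_time_years is an illustrative name.

def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a compound annual growth rate (0.35 means 35%)."""
    if annual_rate <= 0:
        raise ValueError("annual_rate must be positive")
    return math.log(2) / math.log(1 + annual_rate)


for label, rate in [("Transistor density", 0.35),
                    ("DRAM capacity (upper bound)", 0.40),
                    ("Flash capacity (lower bound)", 0.50),
                    ("Magnetic disk", 0.40)]:
    print(f"{label}: {rate:.0%}/year doubles in about {doubling_time_years(rate):.2f} years")
```

For example, 35% per year corresponds to a doubling time of roughly 2.3 years.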



Flash memory

• Flash memory: electronic non-volatile storage medium that can be electrically erased and reprogrammed.
• NAND flash memory
  - May be written and read in blocks (or pages), which are generally much smaller than the entire device.
  - Used in main memory, memory cards, USB flash drives, and solid-state drives for general storage and transfer of data.
• NOR flash memory
  - Allows a single machine word (byte) to be written (to an erased location) or read independently.
  - Allows true random access and therefore direct code execution.

DRAM – dynamic random-access memory
• Stores each bit in a separate capacitor within an integrated circuit. The capacitor can be either charged or discharged; these two states are taken to represent the two values of a bit, 0 and 1.
• Dynamic, as opposed to SRAM (static RAM): it needs to be periodically refreshed because capacitors leak charge.
• Structural simplicity: only one transistor and one capacitor are required per bit, compared to four or six transistors in SRAM. This allows DRAM to reach very high densities.
• Unlike flash memory, DRAM is volatile memory, since it loses its data quickly when power is removed.

Trends in Technology
Evolution of bandwidth and latency

• Bandwidth or throughput → total work done in a given time
  - improvement for processors → 10,000 - 25,000 times
  - improvement for memory and disks → 300 - 1,200 times

• Latency or response time → time between start and completion of an operation
  - improvement for processors → 30 - 80 times
  - improvement for memory and disks → 6 - 8 times

• Processors have improved at a much faster rate than memory and disks!


Trends in Technology
Bandwidth and latency

Log-log plot of bandwidth and latency milestones



Trends in Technology
Feature size of transistors and wires

• Feature size → minimum size of a transistor or wire in the x or y dimension
  - 10.0 microns in 1971
  - 0.032 microns (32 nm) in 2011

• Transistor performance improves linearly as feature size shrinks.
  - Wire delay does not improve with feature size!
  - Integration density scales quadratically


Moore’s Law

• The number of transistors in a dense integrated circuit doubles approximately every two years; a doubling period of 18 months is also often quoted.
• Named after Gordon E. Moore, co-founder of Intel Corporation, who described the trend in a 1965 paper.
• His prediction has proven to be accurate, and the law is now used in the semiconductor industry to guide long-term planning and to set targets for research and development.


Feature size of transistors and wires (cont’d)

[Figure source: Nature 479, 310–316 (17 November 2011)]


Application: questions related to Moore’s law
(a) Based on Moore’s law, the number of transistors on a chip in 2015 should be how many times the number in 2005?
(b) In the 1990s the increase in clock rate mirrored this trend. Had the clock rate continued to climb at the same rate, how fast would it be in 2015?
(c) At the current rate of increase, what are the clock rates projected to be in 2015?
(d) What has limited the growth of the clock rate, and what are architects doing with the extra transistors to increase performance?
(e) The rate of growth of DRAM capacity has also slowed down. For 20 years it increased by 60%/year. It dropped to 40%/year and is now in the range of 25-40%/year. If this trend continues, what will this rate be in 2020?

Answers
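
For question (a), the arithmetic can be sketched as follows. The result depends on which growth assumption is used, so the sketch evaluates three: doubling every two years, doubling every 18 months, and the 35%/year transistor-density rate quoted earlier in the deck. The function names are illustrative, and this is a rough calculation rather than the course's official answer.

```python
# Sketch for question (a): ratio of transistor counts in 2015 versus 2005.
# The doubling periods and the 35%/year rate are the figures quoted in the
# slides; the function names are illustrative.

def growth_from_doubling(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` when the quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


def growth_from_annual_rate(years: float, annual_rate: float) -> float:
    """Growth factor after `years` at a compound annual rate (0.35 means 35%)."""
    return (1.0 + annual_rate) ** years


span = 2015 - 2005
two_year = growth_from_doubling(span, 2.0)          # 2**5 = 32
eighteen_month = growth_from_doubling(span, 1.5)    # about 102
density_rate = growth_from_annual_rate(span, 0.35)  # about 20

print(f"Doubling every 2 years:    {two_year:.0f}x")
print(f"Doubling every 18 months:  {eighteen_month:.0f}x")
print(f"35% per year:              {density_rate:.0f}x")
```

The spread between roughly 20x and 100x shows how sensitive a ten-year projection is to the assumed doubling period.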

Trends in Power and Energy
Power and energy
• Problem: Get power in, get power out

• Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as target for power supply and cooling system
  - Lower than peak power, higher than average power consumption

• Clock rate can be reduced dynamically to limit power consumption

• Energy per task is often a better measurement


Trends in Power and Energy
Dynamic energy and power
• Dynamic energy: energy to switch the transistor state
  - Transistor switches from 0 → 1 or 1 → 0
  - ½ x Capacitive load x Voltage²

• Dynamic power: power to switch the transistor state
  - ½ x Capacitive load x Voltage² x Frequency switched

• Reducing clock rate reduces power, not energy
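
The two formulas above can be checked numerically. The sketch below uses hypothetical capacitance, voltage, and frequency values (chosen only for illustration) to show that halving the clock rate halves dynamic power but leaves the energy per transition unchanged, while lowering voltage reduces both.

```python
# Sketch of the dynamic energy and power formulas from the slide.
# The numeric values below are hypothetical and serve only as an illustration.

def dynamic_energy(capacitive_load: float, voltage: float) -> float:
    """Energy per 0->1 or 1->0 transition: 1/2 x C x V^2 (joules)."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Power when switching at `frequency`: 1/2 x C x V^2 x f (watts)."""
    return dynamic_energy(capacitive_load, voltage) * frequency


C, V, F = 1e-9, 1.0, 3.0e9  # assumed: 1 nF effective load, 1.0 V, 3 GHz

# Halving the clock rate: power halves, energy per transition is unchanged.
power_ratio_half_clock = dynamic_power(C, V, F / 2) / dynamic_power(C, V, F)
energy_ratio_half_clock = dynamic_energy(C, V) / dynamic_energy(C, V)

# Reducing both voltage and frequency by 15%: energy scales with 0.85^2,
# power with 0.85^3.
energy_ratio_dvfs = dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V)
power_ratio_dvfs = dynamic_power(C, 0.85 * V, 0.85 * F) / dynamic_power(C, V, F)

print(f"Half clock:  power x{power_ratio_half_clock:.2f}, energy x{energy_ratio_half_clock:.2f}")
print(f"15% DVFS:    power x{power_ratio_dvfs:.2f}, energy x{energy_ratio_dvfs:.2f}")
```

Because voltage enters squared, scaling voltage together with frequency is what makes DVFS effective for energy, not just power.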



Trends in Power and Energy
Processor power consumption

• Intel 80386 consumed ~2 W
• 3.3 GHz Intel Core i7 consumes 130 W

• Heat must be dissipated from a 1.5 x 1.5 cm chip
• This is the limit of what can be cooled by air

• Dramatic increase in power consumption!


Trends in Power and Energy
Techniques for reducing power

• Do nothing well
• Dynamic Voltage-Frequency Scaling (DVFS)
• Low power state for DRAM, disks
• Over-clocking, turning off cores


Trends in Power and Energy
Power consumption
• Static power consumption:
  - I x V (Static current x Voltage)
  - Scales with number of transistors
• Power gating → technique used in integrated circuit design to reduce power consumption by shutting off the electric current to blocks of the circuit that are not in use.
• Clock gating → technique used in many synchronous circuits to reduce dynamic power dissipation. It adds logic that disables portions of the circuitry so that their flip-flops do not have to switch states. Since switching states consumes power, the switching power of the gated portion goes to zero and only leakage currents are incurred.


Trends in Cost
Trends in cost

• Cost driven down by yield learning curve
  - Yield → the ratio of the number of products that can be sold to the number of products that can be manufactured.
• The estimated typical cost of a modern 300 mm (12 inch) wafer, 0.13 µm process fabrication plant is $2-4 billion. The typical number of processing steps for a modern integrated circuit is more than 150, and the typical production cycle time is over 6 weeks. Individual wafers cost multiple thousands of dollars. Given such huge investments, consistent high yield is necessary for faster time to profit.
• DRAM: price closely tracks cost
• Microprocessors: price depends on volume
  - 10% less for each doubling of volume


Trends in Cost
Integrated circuit cost
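• Integrated circuit cost:
    Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
    Cost of die = Cost of wafer / (Dies per wafer x Die yield)
    Dies per wafer = π x (Wafer diameter / 2)² / Die area - π x Wafer diameter / sqrt(2 x Die area)

• Bose-Einstein formula:
    Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N
  - Defects per unit area = 0.016-0.057 defects per square cm (2010)
  - N = process-complexity factor = 11.5-15.5 (40 nm, 2010)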


Intel i7 microprocessor

Left - floor plan of Core i7; Right - floor plan of second core

• QPI → Quick Path Interconnect


This 300 mm wafer contains 280 full Sandy Bridge dies, each 20.7 by 10.5 mm in a 32 nm
process. (Sandy Bridge is Intel’s successor to Nehalem used in the Core i7.) At 216 mm2, the
formula for dies per wafer estimates 282. (Courtesy Intel.)
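
The estimate of 282 in the caption can be reproduced with the usual dies-per-wafer approximation: wafer area divided by die area, minus a correction for partial dies along the wafer edge. The sketch below assumes that formula; the function name dies_per_wafer is illustrative.

```python
import math

# Sketch: dies-per-wafer estimate, assuming the standard approximation
#   pi * (d/2)^2 / die_area  -  pi * d / sqrt(2 * die_area)
# where d is the wafer diameter. dies_per_wafer is an illustrative name.

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Estimated number of whole dies on a circular wafer."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # Dies lost along the circumference, where only partial dies fit.
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return wafer_area / die_area_mm2 - edge_loss


sandy_bridge_estimate = dies_per_wafer(wafer_diameter_mm=300, die_area_mm2=216)
print(round(sandy_bridge_estimate))  # prints 282
```

The estimate is close to the 280 full dies counted on the actual wafer.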



Case study – chip fabrication costs

               Die size   Estimated defect   Manufacturing   Transistors
               (mm²)      rate (per cm²)     size (nm)       (millions)
IBM Power 5    389        0.3                130             276
Sun Niagara    380        0.75               90              279
AMD Opteron    199        0.75               90              233

Problem

• a. What is the yield for IBM Power 5?
• b. Why does IBM Power 5 have a lower defect rate?
• Notes: We assumed that the wafer yield is 100%, i.e., no wafers are bad.
• N is the process-complexity factor. For the 40 nm process it is in the range 11.5-15.5. For the 130 nm process we took N = 4.
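
A minimal sketch of part (a), assuming the Bose-Einstein form of the die-yield formula, Die yield = Wafer yield / (1 + Defects per unit area x Die area)^N, together with the notes above (wafer yield 100%, N = 4). The die area and defect rate are the IBM Power 5 values from the case-study table; the function name die_yield is illustrative, and other yield models would give a different number.

```python
# Sketch: die yield under the Bose-Einstein model,
#   yield = wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n
# Inputs are the IBM Power 5 figures from the case study; die_yield is an
# illustrative name, and the result depends on the yield model assumed.

def die_yield(defects_per_cm2: float, die_area_cm2: float, n: float,
              wafer_yield: float = 1.0) -> float:
    """Fraction of dies on a wafer expected to be defect-free."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


power5_area_cm2 = 389 / 100.0  # 389 mm^2 converted to cm^2
power5_yield = die_yield(defects_per_cm2=0.3, die_area_cm2=power5_area_cm2, n=4)
print(f"IBM Power 5 die yield: {power5_yield:.3f}")
```

Under these assumptions the yield comes out at about 4.5%, which illustrates how strongly a large die and a high exponent N penalize yield in this model.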

More questions
• A new facility uses a fabrication process identical to the one used for the Power 5 and produces two chips from 300 mm wafers:
  - Woods: 150 mm²; the profit is $20 per defect-free chip.
  - Markon: 250 mm²; the profit is $25 per defect-free chip.
• How much profit can be made for (a) Woods; (b) Markon?
• (c) Which chip should be produced at the new facility?
• (d) If the demand is 50,000 Woods and 25,000 Markon chips/month and you can fabricate 150 wafers/month, how many wafers should be made for each chip?

Dependability
Dependability

• Module reliability
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - Mean time between failures (MTBF) = MTTF + MTTR
  - Availability = MTTF / MTBF
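
The definitions above translate directly into a short calculation. In the sketch below, the MTTF and MTTR values are hypothetical and the function name availability is illustrative.

```python
# Sketch: availability from MTTF and MTTR, using MTBF = MTTF + MTTR.
# The input values are hypothetical and chosen only for illustration.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the module is in service: MTTF / (MTTF + MTTR)."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


# Assumed example: a module that fails on average every 1,000,000 hours
# and takes 24 hours to repair.
example = availability(mttf_hours=1_000_000, mttr_hours=24)
print(f"Availability: {example:.6f}")
```

Shortening the repair time raises availability even when the failure rate stays the same.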



Measuring Performance
Measuring performance
• Typical performance metrics:
  - Response time
  - Throughput

• Speedup of X relative to Y
  - Execution time_Y / Execution time_X

• Execution time
  - Wall clock time: includes all system overheads
  - CPU time: only computation time

• Benchmarks
  - Kernels (e.g. matrix multiply)
  - Toy programs (e.g. sorting)
  - Synthetic benchmarks (e.g. Dhrystone)
  - Benchmark suites (e.g. SPEC06fp, TPC-C)


Evolution of benchmarks over time

• Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++.
• For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran.


• SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases.
• The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006.
• Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.

Figure 1.19 Power-performance of the three servers in Figure 1.18. Ssj_ops/watt values are on
the left axis, with the three columns associated with it, and watts are on the right axis, with
the three lines associated with it. The horizontal axis shows the target workload, as it varies
from 100% to Active Idle. The Intel-based R710 has the best ssj_ops/watt at each workload
level, and it also consumes the lowest power at each level.


Figure 1.20 Percentage of peak performance for four programs on four multiprocessors scaled
to 64 processors. The Earth Simulator and X1 are vector processors (see Chapter 4 and
Appendix G). Not only did they deliver a higher fraction of peak performance, but they also had
the highest peak performance and the lowest clock rates. Except for the Paratec program, the
Power 4 and Itanium 2 systems delivered between 5% and 10% of their peak. From Oliker et al.
[2004].

Principles
Principles of computer design
• Take Advantage of Parallelism
  - e.g. multiple processors, disks, memory banks, pipelining, multiple functional units

• Principle of Locality
  - Reuse of data and instructions

• Focus on the Common Case
  - Amdahl’s Law
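
Amdahl’s Law is only named on this slide. As a reminder of what it quantifies, the sketch below uses its standard form, Speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits from an enhancement and s is the speedup of that fraction. The numeric values are hypothetical and the function name is illustrative.

```python
# Sketch of Amdahl's Law in its standard form:
#   overall speedup = 1 / ((1 - f) + f / s)
# f: fraction of original execution time that is enhanced (0..1)
# s: speedup of the enhanced fraction
# The example values are hypothetical.

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Making 40% of the execution time 10x faster gives well under 2x overall.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10)
print(f"Overall speedup: {overall:.4f}")  # prints Overall speedup: 1.5625
```

The overall speedup can never exceed 1 / (1 - f), which is why the common case deserves the optimization effort.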



Principles
The processor performance equation

CPU time = Instruction count x Cycles per instruction (CPI) x Clock cycle time


Principles
Different instruction types have different CPIs

CPU time = (sum over instruction types i of IC_i x CPI_i) x Clock cycle time
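
When instruction types have different CPIs, the total cycle count is the sum of each type's instruction count times its CPI, and CPU time is that total times the clock cycle time. The sketch below works through this with a hypothetical instruction mix and clock rate; all names and numbers are illustrative.

```python
# Sketch: CPU time from a per-type instruction mix.
#   cycles   = sum(IC_i * CPI_i)
#   CPU time = cycles * clock cycle time = cycles / clock rate
# The instruction mix, CPIs, and clock rate below are hypothetical.

def total_cycles(mix: dict[str, tuple[float, float]]) -> float:
    """Sum IC_i * CPI_i over instruction types; mix maps name -> (count, CPI)."""
    return sum(count * cpi for count, cpi in mix.values())


def cpu_time_seconds(mix: dict[str, tuple[float, float]], clock_rate_hz: float) -> float:
    """CPU time = total clock cycles / clock rate."""
    return total_cycles(mix) / clock_rate_hz


mix = {
    "alu":        (0.5e9, 1.0),  # 50% of 1e9 instructions, CPI 1
    "load_store": (0.3e9, 2.0),  # 30%, CPI 2
    "branch":     (0.2e9, 3.0),  # 20%, CPI 3
}
instruction_count = sum(count for count, _ in mix.values())
average_cpi = total_cycles(mix) / instruction_count
time_s = cpu_time_seconds(mix, clock_rate_hz=2.0e9)

print(f"Average CPI: {average_cpi:.2f}")
print(f"CPU time:    {time_s:.2f} s")
```

With this mix the average CPI is 1.7, so one billion instructions at 2 GHz take 0.85 s.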



Fallacies
• Multiprocessors are a silver bullet → the idea is to improve performance by replacing a high-clock-rate single core with multiple lower-clock-rate, efficient cores, but the burden is now on application developers to exploit parallelism.

• Increasing performance improves energy efficiency.

• Benchmarks remain valid indefinitely → almost 70% of the original kernels in SPEC2000 or earlier releases were dropped.

• Accuracy of reported MTTF → the MTTF of disks as currently reported is almost 140 years!

• Peak performance tracks observed performance → the fraction of peak performance achieved by different programs on the same processor varies widely.

Pitfalls

• Ignoring Amdahl’s law
  - Optimizing a feature before measuring its usage.
• Dependability depends on the weakest link
• Fault detection can lower availability
  - Some errors, e.g., an error in the branch predictor, could lower the performance but not the availability.

Supercomputers of the late 1960s - IBM 360/91

• Launched in January 1968. Installed at NASA Ames.
• Primary memory: up to 6 MB, interleaved 16 ways.
• Secondary memory: 300 MB (two IBM 2301 drums and two IBM 2314 disks).
• The CPU had five highly autonomous execution units:
  - processor storage,
  - storage bus control,
  - instruction processor,
  - fixed-point processor, and
  - floating-point processor.
• Only four floating-point registers.
• Tomasulo’s algorithm for register renaming, introduced in the 360/91, is used in many modern processors to exploit Instruction-Level Parallelism (ILP).

Supercomputers of late 1960s – CDC 7600
• Designed by Seymour Cray.
• RISC architecture with a 15-bit instruction word containing a six-bit operation code. Only 64 machine codes; no fixed-point arithmetic in the central processor.
• Pipelined execution: 10-word instruction stack. All addresses in the stack are fetched, without waiting for the instruction field to be processed.
• Ten 60-bit read registers and ten 60-bit write registers, each with an address register.
• Clock rate 36.4 MHz (27.5 ns clock cycle). Could deliver about 10 MFLOPS on hand-compiled code, with a peak of 36 MFLOPS.
• 65 Kword primary memory; up to 512 Kword secondary memory.
• Cooled by liquid freon.

Massively parallel systems of the 90s

• Touchstone Delta: prototype developed by Intel in 1990
  - Installed at Caltech for the Concurrent Supercomputer Consortium
  - MIMD architecture with hypercube interconnect; wormhole routing.
  - A node: i860 RISC chip, 60 MFLOPS peak, with 8-16 Mbytes of memory.
  - Peak performance: 32 GFLOPS for a configuration of 484 nodes.
  - LINPACK rating = 13.9 GFLOPS; SLALOM benchmark = 5750 patches.
  - Significantly above the Moore curve
• The Paragon
  - Production version of the Touchstone Delta
  - Up to 4,000 nodes
  - A lightweight kernel called SUNMOS, developed at Sandia National Laboratories, ran on the Paragon's compute processors