Server: dependability, scalability, efficient throughput. Embedded: strict resource constraints.
Average Selling Price (ASP): component cost + direct cost + indirect cost.
List price: not the ASP. Stores add to the ASP to get their cut; the ASP is typically 50% to 75% of list price.
Classes of computers
SISD: one instruction stream operating on one data stream per clock cycle
MISD: multiple instruction streams operating on a single data set per clock cycle
SIMD: a single instruction stream executed over multiple data streams per clock cycle
ISA: Instruction Set Architecture, the interface between hardware and software
Interface design: 1. Provides convenient functionality to higher levels. 2. Permits an efficient implementation at lower levels.
Provide at least 16 GPRs plus separate floating-point registers.
Caller-saving: the caller saves any registers that it wants to use after the call, then invokes the callee.
Callee-saving: invoke first, then the callee saves the registers.
RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)
The MIPS Architecture
1. A simple load-store instruction set
2. Designed for pipelining efficiency, including a fixed instruction set encoding
3. Efficiency as a compiler target
Pipeline
Goal: make a fast CPU. How: take advantage of parallelism; decompose instruction execution into steps.
Pipeline - an implementation technique whereby different instructions are overlapped in execution at the same time.
Make the CPU fast -> decrease CPU time -> improve throughput; improve the efficiency of the resources.
Pipeline: 1. Many stages. 2. Each stage carries out a different part. 3. Stages cooperate at a synchronized clock. 4. Exploit parallelism.
Latency: the amount of time between when an instruction is issued and when it completes.
Throughput: the throughput of a CPU pipeline is the number of instructions completed per second.
Ideal speedup is equal to the number of pipeline stages.
Compared with a single-clock (single-cycle) implementation, pipelining decreases the cycle time. Compared with a multiple-clock (multi-cycle) implementation, pipelining decreases CPI.
Performance issues: 1. Latency. 2. Imbalance among stages. 3. Overhead. 4. Pipeline hazards. 5. Time to "fill" and "drain".
MIPS 5-stage pipeline
Hazard
A hazard is a condition that prevents an instruction in the pipeline from executing its next scheduled pipeline stage.
Hazards can always be resolved by stalling.
1. The stall delays all instructions issued after the stalled instruction, while instructions issued earlier go on proceeding through the pipeline.
2. No new instructions are fetched during a stall.
Exceptions and Interrupts
Exceptions are exceptional events that disrupt the normal flow of a program.
The pipeline must be safely shut down when an exception occurs and then restarted at the offending instruction.
Precise Exceptions vs. Imprecise Exceptions
Case of a multi-cycle implementation:
Latency ---- the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
Initiation interval ---- the number of cycles that must elapse between issuing instructions to the same unit.
Supports multiple outstanding FP operations, so structural hazards can occur.
Solve the write-port conflict: detect and insert stalls by serializing the writes.
1. Track the use of the write port in the ID stage and stall an instruction before it issues.
2. Stall a conflicting instruction when it tries to enter the MEM or WB stage.
Examples: Interrupts - user hitting the keyboard, disk drive asking for attention, arrival of a network packet. Exceptions - divide by zero, overflow, page fault.
Handling exceptions
1. Ignore the problem (imprecise exceptions).
2. Buffer the results and delay commitment. History file - saves the original values of the registers. Future file - stores the newer values of the registers.
3. Keep enough information for the trap handler to create a precise sequence for the exception.
4. Allow instruction issue only if it is known that all previous instructions will complete without causing an exception.
Computer Architecture is the science and art of selecting and inter-connecting hardware components to create computers that meet functional, performance, cost and power goals.
1. Instruction set design. 2. Organization. 3. Hardware.
MTTF: mean time to failure. MTTR: mean time to repair. FIT = 1/MTTF, the failure rate. MTBF = MTTF + MTTR, mean time between failures. Module availability = MTTF/MTBF.
Measuring and Reporting Performance
Execution time (latency), throughput, MIPS.
Criteria of performance evaluation differ among users and designers.
Response time = elapsed time.
CPU time - measures the designer's perception of CPU speed. Further divided into user CPU time and system CPU time.
Throughput - measures the administrator's perception of system performance.
If you improve response time, you usually improve throughput. You can also improve throughput without improving response time.
MIPS is only meaningful when comparing machines with the same instruction set.
SPEC - the System Performance Evaluation Cooperative.
Total execution time: a consistent summary measure.
Fraction enhanced: the fraction of computation time in the original machine that can be converted to take advantage of the enhancement. Speedup = [(1 - ΣFEi) + Σ(FEi/Si)]^-1.
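A quick numeric check of the speedup expression above; a minimal C sketch in which the two enhancement fractions and their speedups are made-up example values, not figures from these notes.

```c
/* Sketch of the generalized Amdahl's Law speedup from the notes:
   Speedup = 1 / ((1 - sum(FEi)) + sum(FEi / Si)). */
#include <stdio.h>

double amdahl_speedup(const double fe[], const double s[], int n) {
    double enhanced = 0.0, scaled = 0.0;
    for (int i = 0; i < n; i++) {
        enhanced += fe[i];        /* fraction of time each enhancement applies to */
        scaled   += fe[i] / s[i]; /* that fraction after speedup Si */
    }
    return 1.0 / ((1.0 - enhanced) + scaled);
}

int main(void) {
    double fe[] = { 0.4, 0.2 };  /* fractions enhanced (hypothetical) */
    double s[]  = { 10.0, 2.0 }; /* speedups of each enhanced fraction */
    printf("overall speedup = %.2f\n", amdahl_speedup(fe, s, 2));
    return 0;
}
```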
Instruction Set Design
The type of internal storage in the CPU: stack, accumulator, GPR architecture.
Three general types of GPR (General-Purpose Register) architectures:
1. Register-Register ------ Load/Store: ALU operations use register operands only. Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32.
2. Register-Memory: IBM 360/370, Intel 80x86, TI TMS320C54x, Motorola 68000.
3. Memory-Memory: VAX.
Metrics            Reg-Reg    Reg-Mem    Mem-Mem
Code density       Lowest     Higher     Highest
Instr. count       Largest    Large      Small
Instr. complexity  Simplest   Complex    Most complex
Instr. length      Fixed      Variable   Large variation
Encoding           Fixed      Hybrid     Variable
CPI                Small      Middle     Large variation
Memory addressing: bit, byte, word.
Designing instruction sets for pipelining
1. Avoid variable instruction lengths and running times whenever possible.
2. Avoid sophisticated addressing modes.
3. Don't allow self-modifying code.
4. Avoid implicitly setting condition codes in instructions.
Structural hazard
Occurs when two or more instructions want to use the same hardware resource in the same cycle.
Overcome by replicating hardware resources:
1. Multiple accesses to the register file -- double bump / dual-port the register file.
2. Multiple accesses to memory -- split IM and DM / multiple memory ports / instruction buffer.
3. Some functional units are not fully pipelined. 4. Some functional units are not pipelined at all. These are allowed in order to reduce the cost and latency of the unit.
Data hazard
Occurs when the pipeline changes the order of read/write accesses to operands compared with the order in sequential execution.
Dependences between "nearby" instructions:
RAW (read after write) - true dependence: B reads before A writes.
WAW (write after write) - output dependence: B writes before A writes.
WAR (write after read) - anti-dependence: B writes before A reads.
1. Insert NOPs by the compiler.
2. Add a hardware interlock: extra hardware to detect stall situations and extra hardware to push bubbles through the pipe.
3. Forwarding (bypass, short-circuiting) -- the forwarding paths "point backwards", from EX/MEM.ALUOUT, MEM/WB.ALUOUT and MEM/WB.LMD.
Load stall -- even with forwarding, a load followed immediately by a use of its result stalls one cycle; avoid it by compiler reordering.
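The load-stall and forwarding notes above can be condensed into a small check. The sketch below is a simplified model, not the full MIPS interlock: the struct fields (is_load, rd, rs, rt) and the one-instruction distance are assumptions for illustration.

```c
/* Decide whether the instruction immediately after a producer needs a
   one-cycle load stall, assuming full EX/MEM and MEM/WB forwarding. */
#include <stdio.h>
#include <stdbool.h>

struct instr { bool is_load; int rd; int rs; int rt; };

/* True when 'next' must stall one cycle: the producer is a load whose
   result (MEM/WB.LMD) is only available at the end of MEM, too late to
   forward into the EX stage of the very next instruction. */
bool load_use_stall(struct instr prod, struct instr next) {
    return prod.is_load && (next.rs == prod.rd || next.rt == prod.rd);
}

int main(void) {
    struct instr lw  = { true, /*rd=*/8, 0, 0 };   /* lw  $8, 0($x)   */
    struct instr add = { false, 10, /*rs=*/8, 9 }; /* add $10, $8, $9 */
    printf("stall needed: %s\n", load_use_stall(lw, add) ? "yes" : "no");
    return 0;
}
```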
Addressing modes: 1. Support at least these three addressing modes: register indirect, displacement, immediate. 2. The size of the address for displacement mode should be at least 12-16 bits. 3. The size of the immediate field should be at least 8-16 bits.
Little endian: Intel. Big endian: IBM, Motorola.
MAKE THE COMMON CASE FAST
Instructions for control flow
1. The most commonly used instructions should be considered first: load, store, add, sub, move R-R, and, shift, compare equal, compare not equal, branch, etc.
2. Conditional branches: nearly 100% of displacements fit within 2^7, so PC-relative branches need displacement fields of at least 8 bits.
3. PC-relative conditional branches dominate the control instructions.
4. Jump-and-link instruction for procedure call.
5. Register-indirect jump for procedure return.
Three areas in which current high-level languages allocate data:
1. Stack: local variables; scalars (single variables).
2. Global data area: global variables, constants; arrays.
3. Heap: dynamic objects; accessed with pointers.
Control hazard
Cause: the branch condition and the branch target PC are not available in time to fetch an instruction on the next clock; the next PC takes time to compute.
1. Freeze or flush the pipeline: hold or delete any instruction after the branch until the branch destination is known. The penalty is fixed.
2. Predict-not-taken (predict-untaken).
3. Predict-taken: only useful when the target is known before the branch outcome; no advantage at all for the MIPS 5-stage pipeline.
4. Delayed branch: the instruction in the delay slot always executes, whether the branch is taken or not taken.
Branch prediction: static branch prediction and dynamic branch prediction.
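The notes above only name dynamic branch prediction; one common concrete scheme is a table of 2-bit saturating counters. The sketch below is illustrative only -- the table size, PC indexing, and the loop-branch test data are assumptions, not something specified in these notes.

```c
/* 2-bit saturating-counter branch predictor (hedged sketch). */
#include <stdio.h>

#define BHT_SIZE 1024               /* direct-mapped branch history table */
static unsigned char bht[BHT_SIZE]; /* 0,1 = predict not taken; 2,3 = predict taken */

int predict(unsigned pc) { return bht[(pc >> 2) % BHT_SIZE] >= 2; }

void update(unsigned pc, int taken) {
    unsigned char *c = &bht[(pc >> 2) % BHT_SIZE];
    if (taken  && *c < 3) (*c)++;   /* move toward "taken" */
    if (!taken && *c > 0) (*c)--;   /* move toward "not taken" */
}

int main(void) {
    /* A loop-closing branch taken 9 times, then not taken once. */
    int correct = 0;
    for (int i = 0; i < 10; i++) {
        int taken = (i < 9);
        correct += (predict(0x400100) == taken);
        update(0x400100, taken);
    }
    printf("correct predictions: %d/10\n", correct);
    return 0;
}
```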
Memory Hierarchy Design
Takes advantage of the principle of locality.
Cache -- small, fast storage used to improve the average access time to slow memory.
Block placement: fully associative, set associative, direct mapped. Sets with n blocks -- n-way set associative.
Block identification: the address is split into a tag, an index and a block offset: TAG | index | offset.
Block replacement -- random, LRU, FIFO.
Write strategy
When data is written into the cache:
1. Write-through cache: the data is also written to memory. Cached data can always be discarded - the most up-to-date data is in memory. Cache control bit: only a valid bit. Memory always has the latest data.
2. Write-back cache: the data is NOT written to memory. Cached data can't just be discarded - it may have to be written back to memory. Cache control bits: both valid and dirty bits. Needs much lower memory bandwidth, since data is often overwritten multiple times.
Write-through advantages: read misses don't result in writes, the memory hierarchy is consistent, and it is simple to implement.
Write-back advantages: writes occur at the speed of the cache, and main-memory bandwidth is smaller when multiple writes occur to the same block.
Write stall --- when the CPU must wait for writes to complete during write-through.
Write buffers -- a small cache that can hold a few values waiting to go to main memory. To avoid stalling on writes, many CPUs use a write buffer. The buffer helps when writes are clustered. It does not entirely eliminate stalls, since the buffer can fill if a burst of writes is larger than the buffer.
Write misses -- if a miss occurs on a write (the block is not present), there are two options:
1. Write allocate - the block is loaded into the cache on a miss before anything else occurs.
2. Write around - the block is only written to main memory; it is not stored in the cache.
In general, write-back caches use write-allocate, and write-through caches use write-around.
Cache performance
CPIexec includes ALU and memory instructions.
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
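As a worked instance of the CPU execution time formula just above, the following C sketch plugs in invented workload numbers (instruction count, CPIexec, miss statistics, clock cycle); none of them comes from these notes.

```c
#include <stdio.h>

int main(void) {
    double IC           = 1e9;    /* instructions */
    double CPI_exec     = 1.2;    /* includes ALU and memory instructions */
    double mem_per_inst = 1.3;    /* memory accesses per instruction */
    double miss_rate    = 0.03;
    double miss_penalty = 100.0;  /* cycles */
    double cycle_time   = 0.5e-9; /* seconds */

    double cpu_cycles   = IC * CPI_exec;
    double stall_cycles = IC * mem_per_inst * miss_rate * miss_penalty;
    double cpu_time     = (cpu_cycles + stall_cycles) * cycle_time;
    printf("CPU time = %.3f s (stalls add %.0f%%)\n",
           cpu_time, 100.0 * stall_cycles / cpu_cycles);
    return 0;
}
```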
Summary
Average Memory Access Time
AMAT = Hit time + (Miss rate × Miss penalty)
AMAT = %Inst × (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + %Data × (HitTime_Data + MissRate_Data × MissPenalty_Data)
CPU time = IC × (AluOps/Inst × CPI_AluOps + MemAcc/Inst × AMAT) × Clock cycle time
Compulsory -- cold-start / first-reference misses.
Capacity -- the cache can't contain all the blocks needed.
Conflict -- in set-associative or direct-mapped caches, too many blocks map to a set. Also called collision / interference misses.
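A minimal C sketch plugging made-up numbers into the AMAT and CPU-time formulas above; none of the rates corresponds to a real machine.

```c
#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 60.0;
    double amat = hit_time + miss_rate * miss_penalty;   /* cycles */

    double IC = 1e9;                  /* instruction count */
    double alu_per_inst = 0.6, cpi_alu = 1.0;
    double mem_per_inst = 0.4;        /* memory accesses per instruction */
    double cc = 1e-9;                 /* clock cycle time, seconds */

    double cpu_time = IC * (alu_per_inst * cpi_alu + mem_per_inst * amat) * cc;
    printf("AMAT = %.1f cycles, CPU time = %.2f s\n", amat, cpu_time);
    return 0;
}
```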
How to improve CPU time
1. Reduce the miss penalty -- 5 techniques
a. Multilevel caches
Add a second-level cache between main memory and a small, fast first-level cache.
AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2(local) × Miss penalty_L2)   (worked in the sketch after this list)
Reduce the miss rate of the second-level cache.
b. Critical word first
Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.
c. Read miss before write miss
With a write buffer, writes can be delayed to come after reads. Write-through -- check write-buffer contents before the read; write-back -- copy the dirty block to a write buffer, then do the read.
d. Merging write buffers
Merge multiword writes to one entry. Also reduces stalls due to the write buffer being full.
e. Victim caches
Small fully-associative cache. Hold a few of the most recently replaced blocks. Check on a miss.
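Worked instance of the two-level AMAT formula from item 1.a above; all hit times, miss rates (the L2 figure is a local miss rate) and the memory penalty are invented numbers.

```c
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  mr_l1 = 0.04;   /* per L1 access */
    double hit_l2 = 10.0, mr_l2 = 0.25;   /* local miss rate of L2 */
    double mp_l2  = 100.0;                /* main-memory penalty, cycles */

    double amat = hit_l1 + mr_l1 * (hit_l2 + mr_l2 * mp_l2);
    printf("AMAT = %.2f cycles\n", amat); /* 1 + 0.04*(10+25) = 2.4 */
    return 0;
}
```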
2. Reduce the miss rate -- 5 techniques (assume the total cache size is not changed)
a. Larger block size -- decreases the compulsory miss rate (spatial locality). May increase the miss penalty. Certainly increases conflict misses.
b. Larger cache size -- longer hit time and higher cost. Old rule of thumb: 2 × the size => 25% cut in miss rate.
c. Higher associativity -- decreases conflict misses to improve the miss rate. 2:1 rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2. Eight-way set associative ≈ fully associative.
d. Way prediction and pseudo-associativity
Reduces conflict misses and maintains the hit speed of a direct-mapped cache. (Column associative.)
e. Compiler optimizations
Reorder instructions and data: 1) Merging arrays, 2) Loop interchange, 3) Loop fusion, 4) Blocking (loop interchange is sketched after this list).
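Loop interchange, named in item 2.e, can be shown directly; the array size below is arbitrary, and the point is only that the interchanged nest walks the row-major array with stride-1 accesses, exploiting spatial locality in the cache.

```c
#include <stdio.h>

#define N 512
static double x[N][N];

void column_major_walk(void) {   /* before: stride-N accesses, poor locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

void row_major_walk(void) {      /* after interchange: stride-1 accesses */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

int main(void) {
    column_major_walk();
    row_major_walk();
    printf("x[0][0] = %.1f\n", x[0][0]);
    return 0;
}
```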
3. Reduce the MP and MR via parallelism -- 3 techniques
a. Non-blocking caches (reduce MP)
Continue to supply hits while processing read misses. Used in conjunction with out-of-order execution.
b. Hardware prefetching (reduce MR)
Get data from memory before it is actually needed. Relies on having extra memory bandwidth.
c. Compiler prefetching (reduce MR)
Binding prefetch: requests load directly into a register. Non-binding prefetch: loads into the cache.
4. Reduce the time to hit in the cache -- 4 techniques
a. Small and simple caches
Use small, direct-mapped caches.
b. Avoiding address translation
Translation Lookaside Buffer (TLB). Index with the physical portion of the address -- start the tag access in parallel with translation. Dealing with aliases -- page coloring.
c. Pipelined cache access
d. Trace caches
Find a dynamic sequence of instructions to load into a cache block; cache blocks contain dynamic traces of the executed instructions.
Main memory
Higher bandwidth. Baseline timing: 4 cycles to send the address, 56 cycles to access a word, 4 cycles to send a word.
1. Wider main memory: double or quadruple the width of the cache and the memory. With doubled width, a four-word block takes 2 × (4 + 56 + 4) = 128 cycles.
2. Simple interleaved memory: memory chips are organized in banks to read or write multiple words at a time; a four-word block takes 4 + 56 + 4 × 4 = 76 cycles. Number of banks ≥ number of clock cycles to access a word in a bank.
3. Independent memory banks: an independent memory controller is present for every bank. Avoiding memory-bank conflicts -- use a prime number of memory banks, e.g. 2^n - 1.
Memory Technology
Access time ----- the time between when a read is requested and when the desired word arrives. Cycle time ----- the minimum time between requests to memory.
Random Access Memory
1. SRAM -- caches; no refresh.
2. DRAM -- mainstream main memory; refreshed periodically; the address is divided into 2 halves delivered with RAS and CAS (row/column access strobes).
Embedded processors: Read-Only Memory and Flash memory.
Virtual Memory
Provides the illusion of a very large memory.
Virtual memory space - what the program "sees". Physical memory space - what the program runs in.
Pages - fixed-size blocks; segments - variable-size blocks.
1. Place block: fully associative strategy, anywhere.
2. Find block: segmentation - the offset is added to the segment's physical address; paging - the offset is simply concatenated to the page's physical address.
3. Replace block: the LRU block - minimize page faults.
4. Write strategy: always write back, with a dirty bit.
Pseudo-associative cache
AMATx = HTx + MRx × MPx
HTx = HTy + HRx × C1, where HRx is the rate of hits found in the alternative (second-probe) location
HRx = HRz - HRy = (1 - MRz) - (1 - MRy) = MRy - MRz
HTx = HTy + (MRy - MRz) × C1
MRx = MRz; MPx = MPy = MPz
AMATx = HTy + (MRy - MRz) × C1 + MRz × MPy
x = pseudo-associative, y = 1-way set associative (direct-mapped), z = 2-way set associative; C1 = the clock cycles used on a pseudo-associative hit.
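A numeric sketch of the pseudo-associative AMAT expression above (x = pseudo-associative, y = direct-mapped, z = 2-way); the miss rates, the pseudo-hit cost C1 and the miss penalty are made-up values.

```c
#include <stdio.h>

int main(void) {
    double ht_direct = 1.0;    /* hit time of the direct-mapped cache (HTy) */
    double mr_1way   = 0.05;   /* miss rate, direct-mapped (MRy) */
    double mr_2way   = 0.04;   /* miss rate, 2-way set associative (MRz) */
    double c1        = 2.0;    /* extra cycles on a pseudo (second-probe) hit */
    double mp        = 100.0;  /* miss penalty (MPy) */

    double amat = ht_direct + (mr_1way - mr_2way) * c1 + mr_2way * mp;
    printf("pseudo-associative AMAT = %.2f cycles\n", amat); /* 1 + 0.01*2 + 4 = 5.02 */
    return 0;
}
```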