The ARM10 Family of Advanced
Microprocessor Cores
Stephen Hill
ARM Austin Design Center
THE ARCHITECTURE FOR THE DIGITAL WORLD
TM
Hot Chips 13 1
Agenda
1. Design overview
2. Microarchitecture
o ARM10
o Memory System
o Interrupt response
3. Power
o Dynamic power
o Power down modes
4. VFP10
5. ETM10
6. Summary
ARM1020E Overview
Max frequency: 400MHz
0.9V, worst case
TSMC 0.13 µm LV
MIPS/MHz: 1.25
500 MIPS @ 400MHz
Dhrystone 2.1
Active power consumption: 0.51mA/MIPS
Room Temp / Typical / 1.1V
Average when running Dhrystone 2.1
Area
ARM1022E (2x16KB): 6.9 mm²
ARM1020E (2x32KB): 10.3 mm²
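The headline figures on this slide are related by simple arithmetic. The sketch below reproduces the 500 MIPS figure from 1.25 MIPS/MHz at 400 MHz and then estimates active current and power. It assumes the 0.51 mA/MIPS figure (quoted at typical conditions, 1.1 V) can be applied at the 400 MHz operating point, which the slide does not state explicitly; the function names are illustrative only.

```python
def dhrystone_mips(mips_per_mhz: float, freq_mhz: float) -> float:
    """Throughput in Dhrystone MIPS at a given clock frequency."""
    return mips_per_mhz * freq_mhz


def active_current_ma(ma_per_mips: float, mips: float) -> float:
    """Average active supply current in mA."""
    return ma_per_mips * mips


def active_power_mw(current_ma: float, supply_v: float) -> float:
    """Average active power in mW (mA times V)."""
    return current_ma * supply_v


mips = dhrystone_mips(1.25, 400.0)        # slide figures: 1.25 MIPS/MHz, 400 MHz
current = active_current_ma(0.51, mips)   # slide figure: 0.51 mA/MIPS
# Assumption: the 1.1 V typical supply applies at this operating point.
power = active_power_mw(current, 1.1)

print(f"{mips:.0f} MIPS, about {current:.0f} mA, about {power:.0f} mW")
```

Under these assumptions the core draws roughly a quarter of an amp at full speed, which is the scale the later power-breakdown slides divide up.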
ARM10200 System Overview
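[Block diagram: the ARM10200 contains the ARM1020E, VFP10, and ETM10, together with Bridge and SDRAM CTL blocks. Within the ARM1020E, the ARM10E core sits between an I-side MMU with a 16 or 32k Instruction Cache and a D-side MMU with a 16 or 32k Data Cache, plus a Write Buffer. Separate I-side and D-side BIUs drive the I-side and D-side AHB buses, each with 64-bit data.]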
ARM10E Microarchitecture
64-bit instruction and data interfaces
Static branch prediction with branch folding
Parallel load/store pipeline
Dedicated machine for LDM/STM execution; all but
the first cycle of these instructions are hidden if no
dependencies are encountered
Parallel execution of multi-cycle
coprocessor operations
Multiply 16 bits per cycle
1-3 cycle throughput and 2-4 cycle latency
No data-dependent MUL cycle counts
ARM7 Pipeline versus ARM10
ARM7TDMI (3 stages)
o FETCH: Instruction Fetch
o DECODE: Thumb→ARM decompress, ARM decode, Reg Select
o EXECUTE: Register Read, Shift, ALU, Reg Write
ARM10 (6 stages)
o FETCH: Branch Predictor, Instruction Address Generator, Instruction Fetch
o ISSUE: ARM or Thumb Instruction Decode, Coprocessor Instruction Issue
o DECODE: Register Read + Result Forward + Scoreboard
o EXECUTE: Data + Branch Address Generator, Shift + ALU, Multiply
o MEMORY: Data Cache Interface, Coprocessor Data Interface, Multiply Add
o WRITE: Reg Write
ARM9 Pipeline versus ARM10
ARM9TDMI (5 stages)
o FETCH: Instruction Fetch
o DECODE: ARM or Thumb Inst Decode, Reg Decode, Reg Read
o EXECUTE: Shift + ALU
o MEMORY: Memory Access
o WRITE: Reg Write
ARM10 (6 stages)
o FETCH: Branch Predictor, Instruction Address Generator, Instruction Fetch
o ISSUE: ARM or Thumb Instruction Decode, Coprocessor Instruction Issue
o DECODE: Register Read + Result Forward + Scoreboard
o EXECUTE: Data + Branch Address Generator, Shift + ALU, Multiply
o MEMORY: Data Cache Interface, Coprocessor Data Interface, Multiply Add
o WRITE: Reg Write
ARM1020E Memory System
Instruction & Data Caches
32Kbyte Instruction and Data caches
Virtually addressed, 64-way set-associative,
32-byte lines, 64-bit R/W
Configurable for Write Through or Write Back operation
Lockable by line (1/64 of the cache)
MMUs
Two fully associative (I and D) 64-entry TLBs
Lockable by entry
Support for software loadable TLBs
Write Buffer
Eight 64-bit entries, plus 32-byte cache line castout buffer
AHB Bus Interface
64-bit wide data transfers, split transactions
Multi-layer AHB support (separate I and D-side system interfaces)
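The cache parameters above fix the rest of the geometry. A minimal sketch, assuming 32-bit virtual addresses and a conventional tag/index/offset split (the slides do not describe the actual address decode); `cache_geometry` and `split_address` are hypothetical helper names.

```python
def cache_geometry(cache_bytes: int, ways: int, line_bytes: int,
                   address_bits: int = 32) -> dict:
    """Derive set count and address field widths for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2, valid for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "way_bytes": sets * line_bytes,         # bytes held by one way
    }


def split_address(addr: int, geom: dict) -> tuple:
    """Split a (virtual) address into tag, set index, and byte offset."""
    offset = addr & ((1 << geom["offset_bits"]) - 1)
    index = (addr >> geom["offset_bits"]) & ((1 << geom["index_bits"]) - 1)
    tag = addr >> (geom["offset_bits"] + geom["index_bits"])
    return tag, index, offset


# Slide figures: 32 Kbyte cache, 64 ways, 32-byte lines.
geom = cache_geometry(32 * 1024, 64, 32)
print(geom)
print(split_address(0x12345678, geom))
```

With these numbers the cache has 16 sets, and one way holds 512 bytes, which is 1/64 of the cache and appears to correspond to the lock granule quoted on the slide.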
ARM1020E Memory System
Performance features:
Critical word first
Non-blocking data cache
Hit-under-miss (H-U-M)
Data cache streaming (forwarding) from
linefills
Data cache store merging into linefills
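Critical word first can be illustrated with the line and bus sizes quoted earlier: a 32-byte line over a 64-bit interface takes four beats, and the beat containing the missed address is fetched first. The sketch below assumes a wrapping order after the critical beat; the slides do not say which order the ARM1020E actually uses, so treat the ordering as an assumption.

```python
LINE_BYTES = 32   # cache line size from the slides
BEAT_BYTES = 8    # 64-bit data interface from the slides


def linefill_beat_order(miss_addr: int) -> list:
    """Return beat addresses for a linefill, critical beat first.

    Assumption: the remaining beats follow in wrapping order within the line.
    """
    beats = LINE_BYTES // BEAT_BYTES
    line_base = miss_addr & ~(LINE_BYTES - 1)
    first = (miss_addr % LINE_BYTES) // BEAT_BYTES
    return [line_base + ((first + i) % beats) * BEAT_BYTES
            for i in range(beats)]


order = linefill_beat_order(0x1014)
print([hex(a) for a in order])
```

With streaming from linefills, the load that missed can be satisfied as soon as the first beat in this list arrives rather than after the whole line is filled.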
ARM1020E Interrupts
Interrupts taken in Execute stage
Fast interrupt mode:
First load miss stops further memory ops but not other instructions
Limit write buffer depth
Recommended measures for fast interrupt response:
Lock handler code into Caches and TLB
Set data cache to write-through (no cast outs)
Limit LDM length to 9 registers (spans only 2 cache lines)
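The 9-register limit can be checked with a few lines of arithmetic. The sketch assumes 4-byte registers, a word-aligned LDM base address, and the 32-byte line size quoted for the caches; the helper names are illustrative.

```python
LINE_BYTES = 32   # cache line size from the slides
WORD_BYTES = 4    # assumption: each LDM register transfer is one 32-bit word


def lines_spanned(start_addr: int, num_regs: int) -> int:
    """Number of cache lines touched by an LDM of num_regs registers."""
    first_line = start_addr // LINE_BYTES
    last_byte = start_addr + num_regs * WORD_BYTES - 1
    return last_byte // LINE_BYTES - first_line + 1


def worst_case_lines(num_regs: int) -> int:
    """Worst case over every word-aligned starting offset within a line."""
    return max(lines_spanned(offset, num_regs)
               for offset in range(0, LINE_BYTES, WORD_BYTES))


for regs in (8, 9, 10, 16):
    print(regs, "registers ->", worst_case_lines(regs), "lines worst case")
```

Nine registers (36 bytes) never touch more than two lines, while ten registers can touch three, so nine is the longest LDM that bounds the number of possible line misses at two.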
ARM1020E Interrupts
Worst case Interrupt response time to enter handler (G:H Clock 1:1)
Worst case of outstanding memory operations (LDM just started)
3 table walks needed (unless TLB locked down)
Write buffer full, bus not granted by default
CYCLES     Fast Interrupt   Max Regs   Write through   TLB locked
(approx)   mode             in LDM     D cache         down
171        no               16         no              no
148        yes              16         no              no
129        yes              9          no              no
63         yes              16         yes             yes
48         yes              9          yes             yes
ARM1020E Dynamic Power
ARM1020E, Dhrystone 2.1 (PowerMill simulation)
ARM10E Core: 32%
Data Cache: 24%
Instruction Cache: 22%
Clocks: 6%
Instruction MMU: 5%
Data MMU: 5%
BIU: 5%
Buffers: 1%
ARM10E Dynamic Power
ARM10E Core, Dhrystone 2.1 (PowerMill simulation)
Decode and Sequencer: 33%
Prefetch: 15%
Other: 12%
LSU: 10%
Clock: 8%
Execute: 7%
RegBank: 5%
ETM Interface: 3%
Forwarding: 3%
Multiplier: 2%
CP Interface: 2%
Power Down
Mode                     State                                                  Cycle delay to resume
RUN (Fast/Normal/Slow)   CPU executing; active power                            >10^0 cycles
STANDBY                  CPU clock stopped; wake on interrupt or debug event    >10^1 cycles
DORMANT                  CPU state lost; cache state preserved                  >10^2 cycles
SHUTDOWN                 CPU & cache state lost; all core power removed         >10^4 cycles
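The trade-off on this slide is between how much state a mode preserves and how long it takes to resume. The sketch below reads the resume delays as powers of ten (more than 10^0, 10^1, 10^2, and 10^4 cycles for RUN, STANDBY, DORMANT, and SHUTDOWN), which is an interpretation of the figure, and picks the deepest mode that fits a latency budget. The selection policy and all names are hypothetical; the slide gives only lower bounds, so a real policy would need measured wake-up latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    name: str
    cpu_state_kept: bool
    cache_state_kept: bool
    resume_lower_bound: int   # resume takes MORE than this many cycles


# Ordered from shallowest to deepest; delays are lower bounds only.
MODES = [
    PowerMode("RUN", True, True, 10**0),
    PowerMode("STANDBY", True, True, 10**1),
    PowerMode("DORMANT", False, True, 10**2),
    PowerMode("SHUTDOWN", False, False, 10**4),
]


def deepest_mode(max_resume_cycles: int, need_cache: bool = False) -> PowerMode:
    """Pick the deepest mode whose resume lower bound is under the budget.

    Hypothetical policy: a mode is eligible only if the budget strictly
    exceeds its lower bound, and only if it keeps the cache when required.
    """
    chosen = MODES[0]
    for mode in MODES[1:]:
        fits = max_resume_cycles > mode.resume_lower_bound
        keeps_cache = mode.cache_state_kept or not need_cache
        if fits and keeps_cache:
            chosen = mode
    return chosen


print(deepest_mode(50).name)
print(deepest_mode(1_000_000, need_cache=True).name)
```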
Power Down
[Block diagram: example system with a Power Management Controller on the ARM High Performance Bus. Each block is wrapped in its own PM Layer: RAM, ROM, DSP Core, Interrupt Controller, Debug Hardware, Clocks, PLL, and two "CORE and SCC" blocks, each with its own Cache.]
VFP10
Fully IEEE 754 compliant (with SW support)
Performance:
236 MFLOPS Linpack (SAxPY) @ 400MHz
400M FIR Taps (800 Peak MFLOPS) @ 400MHz
Functions supported in hardware
Multiply, add, multiply-add, subtract, multiply-subtract,
negate, negate multiply, negate multiply-add, negate
multiply-subtract, absolute value, compare, convert,
divide, and square root
Most IEEE 754 exceptions handled in
hardware
RunFast mode
No trapping enabled (Denormals flush to +0)
NaN fractions not propagated (not typical)
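The FIR figure under Performance is consistent with one multiply-add per cycle counted as two floating-point operations: 400M taps per second at 400 MHz gives 800 peak MFLOPS. A minimal sketch of that accounting, assuming one multiply-accumulate per tap; the function names and the `macs_per_cycle` parameter are illustrative, not taken from the slides.

```python
def fir_output(samples, coeffs):
    """Compute one FIR output sample and count the multiply-accumulates used."""
    acc = 0.0
    macs = 0
    for x, c in zip(samples, coeffs):
        acc += x * c      # one tap = one multiply plus one add
        macs += 1
    return acc, macs


def peak_mflops(freq_mhz: float, macs_per_cycle: float = 1.0,
                flops_per_mac: int = 2) -> float:
    """Peak MFLOPS if macs_per_cycle multiply-adds issue every cycle."""
    return freq_mhz * macs_per_cycle * flops_per_mac


y, taps = fir_output([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(y, taps)
print(peak_mflops(400.0))        # peak rate at 400 MHz
print(236.0 / peak_mflops(400.0))  # Linpack figure as a fraction of peak
```

On these numbers the quoted Linpack result is a little under 30% of the peak rate, which reflects load/store and dependency overhead that a pure FIR inner loop avoids.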
VFP10
7 Stage pipeline
Fetch - Issue - Decode - Execute (E1) - E2 - E3 - E4/WB
32 Single precision / 16 Double precision registers
High performance short vector operations
Register banks operate as hardware circular queues and can
be addressed as short vectors (up to 8 values)
Separate divide/square root unit
Supports load/store, and arithmetic operation in parallel with
divide/square root operation
Separate load/store unit
Load/store operations may be done in parallel with data
processing operations
64-bit unidirectional data interfaces
Area: ~1.6 mm² in 0.13 µm
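The circular-queue behaviour of the register banks can be sketched as modular indexing. The example assumes single-precision banks of 8 registers and a register stride parameter, as in the VFP architecture generally; neither the bank size nor the stride is stated on this slide, and `vector_registers` is a hypothetical helper.

```python
BANK_SIZE = 8   # assumption: 8 single-precision registers per bank


def vector_registers(base: int, length: int, stride: int = 1,
                     bank_size: int = BANK_SIZE) -> list:
    """Register numbers used by a short vector starting at register `base`.

    Indexing wraps around inside the bank, so the bank behaves as a
    circular queue.
    """
    if not 1 <= length <= bank_size:
        raise ValueError("short vectors hold between 1 and bank_size values")
    bank_start = (base // bank_size) * bank_size
    first = base % bank_size
    return [bank_start + (first + i * stride) % bank_size
            for i in range(length)]


print(vector_registers(14, 4))   # wraps from the end of the bank to its start
print(vector_registers(8, 8))    # a full 8-value vector fills the bank
```

Because the index wraps, a sliding window over a filter or matrix row can advance its base register each iteration without copying values.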
ETM10
Embedded Trace Macrocell
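[Block diagram: within the SOC, the ETM10 monitors the ARM1020E's Data, Control, and Address signals and drives the Trace Port, which connects to an external Trace Port Analyzer. The EmbeddedICE Logic and the ETM10 are reached through the TAP, which connects over the JTAG Port to Multi-ICE.]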
ETM10
Full real-time instruction and data tracing
Monitors the core’s internal buses
Zero performance overhead
Supports high frequency trace with demux-port
Configurable synthesis for optimum:
area
features
pin count
Programmed non-intrusively through JTAG
ARM1020E Family Summary
ARM1020E:
500 DMIPS @ 400 MHz
0.51 mA/MIPS
10.3 mm² / 6.9 mm²
VFP10:
236 MFLOPS @ 400 MHz
IEEE 754 Compatible
ETM10:
Full speed, real time embedded trace
[Die photo (ARM10200r1) with labeled regions: VFP10, Integer Unit, IMMU, DMMU, I-Cache, D-Cache]