
Cortex A8 Processor

Richard Grisenthwaite
ARM Ltd

1
Evolution of the ARM Architecture
Original ARM architecture:
32 bit RISC architecture
16 Registers
1 being the Program counter
Conditional execution on all instructions
Load/Store Multiple operations
Good for Code Density
Shifts available on data processing and address generation
Original architecture had 26 bit address space
Augmented by a 32 bit address space early in the evolution

Thumb instruction set was the next big step

ARMv4T architecture (ARM7TDMI)
Introduced a 16 bit instruction set alongside the 32 bit instruction set
Switching ISA as part of branch or exception
Not a full instruction set – ARM still essential

2
Evolution of the Architecture (2)
ARMv5TEJ (ARM926EJ-S) introduced:
Better interworking between ARM and Thumb
DSP focussed additional instructions
Jazelle-DBX for Java byte code interpretation in hardware

ARMv6 (ARM1136JF-S) introduced:

Media processing – SIMD within the integer datapath
Enhanced exception handling
Overhaul of the memory system architecture

ARMv7 rolls in a number of substantive changes:

Thumb-2*
TrustZone*
Jazelle-RCT
Neon
ARMv7 is split into 3 profiles
* - Introduced initially as extensions to ARMv6

3
Thumb-2
Combined 32 and 16 bit instruction set
Instructions can be freely mixed
16 bit instructions include the original Thumb instruction set
Complete compatibility with Thumb binaries
Some new 16 bit instructions for key code size wins
Virtually all instructions available in ARM ISA available in Thumb-2
Some minor cleaning up of system management instructions
In principle can stand-alone as a complete ISA
Unified assembly language for ARM and Thumb-2
Assembly can be targeted to either ISA
Conditional execution made available via IT instruction:
ARM = 20 bytes
CMP r3, #1
EOREQ r1, r1, #0x4000
EOREQ r1, r1, #2
MOVNE r3, #0
MOVEQ r3, #1
Thumb-2 = 14 bytes
CMP r3, #1 ; 2 bytes
ITTET EQ ; 2 bytes
MOVWEQ r3, #0x4002 ; 4 bytes
EOREQ r1, r3 ; 2 bytes
MOVNE r3, #0 ; 2 bytes
MOVEQ r3, #1 ; 2 bytes
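
As a rough illustration of how the IT instruction assigns conditions, the sketch below expands an IT mnemonic into the condition each following instruction must carry. It assumes EQ as the base condition, which the EQ/NE suffixes in the Thumb-2 listing imply. The function name `it_conditions` and the size lists are illustrative only and are not part of any ARM tool.

```python
# Inverse of each ARM condition code, used by the "E" (else) slots of an IT block.
INVERSE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def it_conditions(mnemonic, first_cond):
    """Return the condition required on each instruction of an IT block.

    mnemonic is "IT" followed by up to three T (then) or E (else) letters.
    The first instruction always takes first_cond.
    """
    slots = mnemonic[2:]
    if not mnemonic.startswith("IT") or len(slots) > 3:
        raise ValueError("expected IT followed by at most three T/E letters")
    conds = [first_cond]
    for slot in slots:
        if slot == "T":
            conds.append(first_cond)
        elif slot == "E":
            conds.append(INVERSE[first_cond])
        else:
            raise ValueError("slot letters must be T or E")
    return conds


# Instruction sizes in bytes, taken from the listing above.
ARM_SIZES = [4, 4, 4, 4, 4]            # five 32-bit ARM instructions
THUMB2_SIZES = [2, 2, 4, 2, 2, 2]      # CMP, ITTET, MOVWEQ, EOREQ, MOVNE, MOVEQ

print(it_conditions("ITTET", "EQ"))        # ['EQ', 'EQ', 'NE', 'EQ']
print(sum(ARM_SIZES), sum(THUMB2_SIZES))   # 20 14
```

The four conditions line up with MOVWEQ, EOREQ, MOVNE and MOVEQ in the listing, and the byte totals reproduce the 20 versus 14 bytes quoted on the slide.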

4
Thumb-2 (2)
“Thumb Code density at ARM Performance”
In principle this could be achieved with ARM and Thumb previously
Much of the code running is not performance critical
With code knowledge, can compile non-critical code to Thumb
Much simpler with Thumb-2

[Figure: performance versus code density, comparing 100% ARM code, 100% Thumb code, a random ARM/Thumb mix, a ‘profiled’ mix, and Thumb-2]

Expect to see growing emphasis on Thumb-2 in the future

ARM still totally committed to ARM ISA compatibility
The ARM instruction set is still completely supported
No plans to “downgrade” the ARM ISA in the applications space
5
TrustZone
Architectural extensions to introduce a “Security” state
Orthogonal to User/Privileged split
Effectively two virtual CPUs separated by a new mode
Monitor mode the gatekeeper for switching CPUs
Some hardware registers duplicated to aid switching
Memory tagged as secure and non-secure by the system
Only the secure CPU can access the secure memory & peripherals
System can include secure and non-secure peripherals

6
ARM NEON™ Technology Overview
64/128-bit Hybrid SIMD architecture
Independent Register file with 2 aliased views:
32 x 64-bit registers (D0-D31)
16 x 128-bit registers (Q0-Q15)
[Figure: register file diagram with D0 and D1 aliased to Q0, D2 and D3 aliased to Q1, through D30 and D31 aliased to Q15; example views D0.U8 (64-bit) and Q0.F32 (128-bit)]
Integer and SP Floating-point processing
8, 16, 32, 64-bit Integers
Single-precision Floating-point
Encoded in ARM and Thumb-2

2 to 4x performance improvement over ARMv6 SIMD

Accelerates audio, video, and 3D-graphics
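
The two aliased views of the register file can be sketched as one block of storage read at two widths. This is a minimal model, not a description of the hardware: the class `NeonRegisterFile` and its methods are hypothetical names, and little-endian packing is used so that D(2n) forms the low half of Qn.

```python
class NeonRegisterFile:
    """Toy model: 256 bytes of storage seen as 32 D or 16 Q registers."""

    def __init__(self):
        self.storage = bytearray(32 * 8)  # 32 registers of 64 bits each

    def write_d(self, n, value):
        if not 0 <= n <= 31:
            raise IndexError("D register index must be 0..31")
        self.storage[8 * n:8 * n + 8] = value.to_bytes(8, "little")

    def read_d(self, n):
        if not 0 <= n <= 31:
            raise IndexError("D register index must be 0..31")
        return int.from_bytes(self.storage[8 * n:8 * n + 8], "little")

    def read_q(self, n):
        # Qn overlays D(2n) in its low 64 bits and D(2n+1) in its high 64 bits.
        if not 0 <= n <= 15:
            raise IndexError("Q register index must be 0..15")
        return int.from_bytes(self.storage[16 * n:16 * n + 16], "little")


regs = NeonRegisterFile()
regs.write_d(0, 0x1111)
regs.write_d(1, 0x2222)
print(hex(regs.read_q(0)))  # 0x22220000000000001111
```

Writing D0 and D1 and then reading Q0 returns both values in one 128-bit quantity, which is the aliasing the slide describes.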

7
NEON SIMD Structure Load/Store
Native support for structures
e.g. Complex Numbers, Pixels, Co-ordinates
Memory treated as an Array of Structures (AoS)

Load: VLD3.16 {D0,D1,D2}, [R0]!
Transfer four 3 x 16-bit structures
Memory at R0: x0, y0, z0, x1, y1, z1, ...
Registers: D0 = x3 x2 x1 x0, D1 = y3 y2 y1 y0, D2 = z3 z2 z1 z0
Store: VST3.16 {D0,D1,D2}, [R0]!
Eliminates ‘shuffling’ overhead
Optimised memory access as single transfer
Data arranged for efficient SIMD processing
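
The de-interleaving performed by a structure load can be sketched in a few lines. The model below treats memory as a flat list of 16-bit elements and uses hypothetical helper names `vld3_16` and `vst3_16`; it mirrors the data movement of VLD3.16 and VST3.16 with post-increment, not their timing or encoding.

```python
def vld3_16(memory, base):
    """De-interleave four {x, y, z} structures starting at element index base.

    Returns (d0, d1, d2, new_base); lane 0 is listed first in each register.
    new_base models the "!" writeback: 12 elements, i.e. 24 bytes, further on.
    """
    d0 = [memory[base + 3 * i] for i in range(4)]      # x0..x3
    d1 = [memory[base + 3 * i + 1] for i in range(4)]  # y0..y3
    d2 = [memory[base + 3 * i + 2] for i in range(4)]  # z0..z3
    return d0, d1, d2, base + 12


def vst3_16(memory, base, d0, d1, d2):
    """Interleave three 4-lane registers back into array-of-structures order."""
    for i in range(4):
        memory[base + 3 * i] = d0[i]
        memory[base + 3 * i + 1] = d1[i]
        memory[base + 3 * i + 2] = d2[i]
    return base + 12


pixels = ["x0", "y0", "z0", "x1", "y1", "z1",
          "x2", "y2", "z2", "x3", "y3", "z3"]
d0, d1, d2, next_base = vld3_16(pixels, 0)
print(d0)  # ['x0', 'x1', 'x2', 'x3']
print(d1)  # ['y0', 'y1', 'y2', 'y3']
print(d2)  # ['z0', 'z1', 'z2', 'z3']
```

After the load each register holds one field from all four structures, so a single SIMD operation can process, for example, all four x values without any shuffling.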

8
NEON Vectorising Compiler Target

NEON provides a consistent algorithm mapping

Apply narrowing analysis
Vectorize over loop iterations
Enabled by architectural model
Orthogonal instruction framework
Few inter-lane operations
Fused Data Type conversion
[Figure: a loop body of LD, MUL, SHR and ST operations vectorized across loop iterations into parallel lanes]

NEON designed in conjunction with compiler technology

Ensure architecture optimised for this compiled mode
Benefits of CSE, unrolling, scheduling, register allocation
Portable solutions by avoiding hand coding or intrinsics

9
Jazelle-RCT: Runtime Compilation Target
Beneficial to Java and a wide range of emerging languages
Microsoft .NET MSIL, Perl, Python etc
Enables high performance in smallest memory footprint
Optimal balance between speed and code density with run-time compilers
Low cost and low power
Less than 8K gates and small memory footprint result in lower power
Complementary to DBX on mid-tier devices for optimum Java performance and efficiency
Broad industry adoption
Sun Microsystems, Aplix and Esmertec are early adopters
Builds on success of DBX technology

10
Cortex-A8 Processor Highlights
First implementation of the ARMv7 instruction-set architecture, including the Advanced SIMD media instructions (NEON™)

In-order, dual-issue, superscalar microprocessor core

13-stage main integer pipeline
10-stage NEON media pipeline
dedicated L2 cache with 9-cycle latency
branch prediction based on global history

Key metrics
delivers 2000 DMIPS for next-generation consumer applications
average IPC of 0.9 across multiple benchmark suites
includes EEMBC, SpecInt95, Mediabench, and partner-provided applications
achieves 1GHz when fabricated in high-performance technologies
consumes less than 300mW in low-power devices
less than 4mm2 at 65nm, excluding NEON, L2 cache, and Embedded Trace

11
ARM Cortex-A8: why Superscalar?
In-order instruction issue
less complex than out-of-order
fewer structures means lower power
less need for custom design
can maintain high IPC with
fully symmetric ALU pipelines
all critical forwarding paths supported
dual-issue of dependent instruction pairs
Static scheduling with instruction replay on memory stall
low-power consumption due to early availability of gate enables
fire-and-forget instruction issue removes critical paths from the design

Net result
high-frequency design with out-of-order performance, but in-order clock frequency and power consumption

Average IPC of 0.9 across 150+ ARM and industry benchmarks

12
Full Cortex-A8 Pipeline Diagram
13-Stage Integer Pipeline 10-Stage NEON Pipeline

13
Control Flow

Dynamic branch predictor components
512-entry 2-way BTB
4K-entry GHB indexed by branch history and PC
8-entry return stack

Branch resolution
all branches are resolved in single stage
Maintains speculative and non-speculative versions of branch history and return stack

Branch prediction maintains 95% accuracy over a wide codebase
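
A table indexed by branch history and PC can be sketched as a gshare-style predictor. Only the 4K-entry size comes from the slide. The XOR hash, the 12-bit history length, the 2-bit saturating counters and the class name `GlobalHistoryPredictor` are assumptions made for illustration; the actual Cortex-A8 indexing function is not given here.

```python
class GlobalHistoryPredictor:
    """Toy gshare-style direction predictor with a 4K-entry counter table."""

    ENTRIES = 4096  # table size taken from the slide

    def __init__(self):
        self.counters = [1] * self.ENTRIES  # 2-bit counters, start weakly not-taken
        self.history = 0                    # global taken/not-taken history bits

    def _index(self, pc):
        # Assumed hash: XOR the word-aligned PC with the history, fold to table size.
        return ((pc >> 2) ^ self.history) % self.ENTRIES

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history, keeping 12 bits.
        self.history = ((self.history << 1) | int(taken)) % self.ENTRIES


predictor = GlobalHistoryPredictor()
for _ in range(20):
    predictor.update(0x8000, True)   # train on an always-taken branch
print(predictor.predict(0x8000))     # True
```

Because the index mixes in history, the same branch can map to different counters depending on the path that led to it, which is what lets a global-history scheme learn correlated branches.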

14
Instruction Decode
[Figure: instruction decode pipeline. Decode stages D0-D4 feed execute stages E0-E5, ending in integer register writeback. Blocks shown: early decode, decode/sequencer, decode queue with read/write, scoreboard and issue logic, register file ID remap, and the pending and replay queue, issuing to the ALU pipes, MUL pipe 0, and LS pipe 0 or 1.]
Replay penalty = 9 cycles

Instruction decode highlights

pending queue reduces Fetch stalls and increases pairing opportunities
replay queue keeps instructions for reissue on memory system stall
scoreboard predicts register availability using static scheduling techniques
cross-checks in D3 allow issue of dependent instruction pairs

15
Instruction Execution

Execution pipeline highlights

2 symmetric ALU pipelines: Shift/ALU/SAT
Load/store pipe used by instructions in either pipeline
Multiply instructions are tied to pipe 0
All key forwarding paths supported
Static scheduling allows for extensive clock gating

16
Memory System on Cortex-A8
Harvard Level 1 Caches – both 32KByte 4 way set associative
VIPT Instruction cache; VIPT Data cache with alias detection
Level 1 Data cache is blocking
Non-NEON reads that miss the cache cause replay of subsequent instructions
Reduces complexity in later pipeline stages
Good for power and clock frequency
Neon data not allocated to L1 (but will read/update in L1 if necessary)
Unified Level 2 Cache
PIPT, 8 way set associative
Fully pipelined and non-blocking
Up to 9 memory transactions in flight
Streams to the Neon processing unit; up to 16GByte/s bandwidth
64 or 128 bit AMBA AXI interconnect to memory
Split transaction burst based protocol
Supports multiple outstanding memory transactions to minimise memory latencies
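
As a quick sanity check on the streaming figure, the quoted 16 GByte/s can be converted to a per-cycle width. The sketch assumes the 1 GHz clock quoted for high-performance processes and decimal gigabytes; whether the peak bandwidth is specified at exactly that clock is an assumption. The function name `bytes_per_cycle` is illustrative.

```python
def bytes_per_cycle(bandwidth_bytes_per_s, clock_hz):
    """Convert a bandwidth into bytes moved per clock cycle."""
    return bandwidth_bytes_per_s // clock_hz


L2_TO_NEON_BANDWIDTH = 16 * 10**9  # 16 GByte/s, decimal gigabytes assumed
CORE_CLOCK_HZ = 10**9              # 1 GHz, assumed operating point

per_cycle = bytes_per_cycle(L2_TO_NEON_BANDWIDTH, CORE_CLOCK_HZ)
print(per_cycle, "bytes/cycle =", per_cycle * 8, "bits/cycle")
# 16 bytes/cycle = 128 bits/cycle
```

Under these assumptions the figure corresponds to one 128-bit transfer per cycle, which agrees with the 128 bits/cycle quoted for the NEON interface.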

17
Memory System

LS pipeline
32k 4-way set associative data cache
Address hash array used to predict cache way
Saves power and improves timing
load data forwarding in E3 to all critical sources
one-cycle load-use penalty for ALU
store data not required until E3

BIU pipeline
9-cycle minimum access latency to L2 cache
L2 built using standard compiled RAMs (64k-2MB configurable size)
64/128bit AXI L3 bus interface supports up to 9 outstanding transactions

18
NEON Interfaces

Skewed late in pipeline, past the retire point
reduces interface complexity
exception handling not required
decoupling queues from integer machine
removes load-use penalty
negative impact on NEON -> ARM transfers
nonblocking ARM register file helps hide latency

Streaming to and from L2 memory system
up to 8 outstanding transactions
can receive 128 bits/cycle
can receive data from L1 or L2 memory system
independent NEON store buffer

19
NEON Media Engine Unit

Instruction issue
static scheduling with fire-and-forget issue
1 LS + 1 NINT/NFP can issue each cycle

Execution pipelines
all pipelines are 64-bit SIMD
floating-point MAC executed using both FADD and FMUL pipelines

20
Cortex-A8 NEON Technology
Accelerating standardization of media processing for next generation mobile and consumer products
The ideal software target to run rapidly evolving downloadable media players such as Windows Media Player 10 and Real Player
[Figure: relative performance (1x to 4x) of ARMv5, ARMv6 and NEON on MPEG-4, GSM-AMR and MP3 decoder workloads]

Video, 30fps VGA decode
MPEG-4 including de-ring and de-block filters, yuv2rgb (1): 275MHz
H.264 (estimated) (4): 350MHz

GSM-AMR, worst case (3): 13MHz

MP3 decode, 320kbps 48kHz, worst case (2): 9.4MHz

1) MPEG-4 Simple Profile @ 30fps 512kbps, 133MHz SDRAM 10-1-1-1-1-1-1-1 memory, includes deblocking and deringing filters
2) MP3 Decoder @ 320kbps 48kHz (worst case), 133MHz SDRAM 10-1-1-1-1-1-1-1 memory
3) GSM-AMR (worst case), 3 cycle per word memory
4) H.264 Decoder Baseline profile

21
CoreSight Debug and Trace
Hardware Debug and Trace are key components
Valued by the people who use the systems!
ARM's CoreSight moves to a system-centric debug philosophy
SoCs are not just the core any more
Multiple sources of trace data – cores, buses, software instrumentation
Multiple debug components – cores, bus watchers etc
Cross-triggering of debug events to multiple cores
System identification of components in the SoC essential to debug
Topology identification methodology as well
CoreSight is a debug and trace focussed system architecture
Debug components part of a debug memory space
Standardised interface to JTAG or Serial-Wire Debug
Open standards to encourage 3rd party adoption
Cortex-A8 incorporates CoreSight compliant interfaces

22
Implementation Strategy: Motivation
Why use a semicustom design flow?
required to achieve project frequency, area, and power targets

Why not deliver a hard macrocell?

too many restrictions on circuit and layout optimizations possible
design porting does not scale well with increases in design size and complexity

The goal:
provide our partners with an alternative method of IP delivery that
achieves Cortex-A8 power, area, and frequency targets
minimizes the additional effort required from the silicon partner

23
ARM Cortex-A8 Processor Summary
Industry-leading performance and power efficiency
Greater than 2000 DMIPS for demanding tethered applications
Less than 300mW for low power mobile applications
More than 7 major new technology innovations:
NEON, Jazelle-RCT, Thumb-2, TrustZone, AMBA AXI, CoreSight, IEM
Supported end-to-end by ARM Technology
RealView ARCHITECT ESL Models – Artisan AdvantageCE Libraries
Industry momentum fueling wide adoption
5 licensees, 1/3 of the Top 15 WW Semiconductor Vendors *

* Source: Gartner Dataquest (March 2005)

24
Questions?
