EC6009 ADVANCED COMPUTER ARCHITECTURE
UNIT III
DATA-LEVEL PARALLELISM
Vector architecture – SIMD extensions – Graphics Processing units – Loop level parallelism.
DATA-LEVEL PARALLELISM IN:
Vector Architectures
SIMD Extensions
GPU Architectures
VECTOR ARCHITECTURE
Vector architectures grab sets of data elements scattered about memory, place them into large,
sequential register files, operate on the data in those register files, and then disperse the results back into
memory. A single instruction operates on vectors of data, which results in dozens of register–register
operations on independent data elements.
(i) VMIPS (Vector MIPS)
VMIPS is based on the Cray-1 processor and is shown in the figure below:
ARCHITECTURE OF VMIPS
The primary components of the instruction set architecture of VMIPS are the following:
Vector registers :- Each vector register is a fixed-length bank holding a single vector.
VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The register
file must provide enough ports to feed all the vector functional units: at least 16 read ports and 8 write ports,
connected to the functional unit inputs and outputs by a pair of crossbar switches.
Vector functional units:- VMIPS has five functional units. Each unit is fully pipelined,
and it can start a new operation on every clock cycle.
Vector load/store unit :- The vector memory unit loads or stores a vector to or from
memory. The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the
vector registers and memory with a bandwidth of one word per clock cycle.
A set of scalar registers:- Scalar registers can also provide data as input to the vector
functional units. These are the normal 32 general-purpose registers and 32 floating-point registers of MIPS.
(ii) VMIPS VECTOR INSTRUCTIONS
Some of the instructions of VMIPS are shown below:
In VMIPS, vector operations use the same names as scalar MIPS instructions, but with
the letters “VV” or “VS” appended.
ADDVV.D is an addition of two double-precision vectors; it takes a pair of vector registers as its input.
ADDVS.D adds a double-precision scalar to each element of a vector; it takes a vector register and a scalar
register as its input.
The names LV and SV denote vector load and vector store, and they load or store an
entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other
operand, which is a MIPS general-purpose register, is the starting address of the vector in memory.
There are two additional special-purpose registers: the vector-length register (VLR) and
the vector-mask register (VMR). The VLR is used when the natural vector length is not 64; the VMR is used
when loops involve IF statements.
We can best understand a vector processor by looking at a vector loop for VMIPS:
Consider the SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark:
Y=a×X+Y
SAXPY stands for single-precision a × X plus Y.
DAXPY stands for double-precision a × X plus Y.
Let us assume that the number of elements, or length, of a vector register (64) matches the length of the
vector operation.
Example: Show the code for MIPS and VMIPS for the DAXPY loop. Assume that the starting addresses of X
and Y are in Rx and Ry, respectively.
MIPS code for the DAXPY loop (requires 578 instructions to execute):
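The original listing here was a figure that did not survive; the sketch below follows the standard textbook
(Hennessy–Patterson) MIPS version, assuming a is the scalar in memory, R4 holds the last address to load,
and the 64-element vectors occupy 512 bytes:

        L.D     F0,a          ;load scalar a
        DADDIU  R4,Rx,#512    ;last address to load
Loop:   L.D     F2,0(Rx)      ;load X[i]
        MUL.D   F2,F2,F0      ;a × X[i]
        L.D     F4,0(Ry)      ;load Y[i]
        ADD.D   F4,F4,F2      ;a × X[i] + Y[i]
        S.D     F4,0(Ry)      ;store into Y[i]
        DADDIU  Rx,Rx,#8      ;increment index to X
        DADDIU  Ry,Ry,#8      ;increment index to Y
        DSUBU   R20,R4,Rx     ;compute bound
        BNEZ    R20,Loop      ;check if done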
VMIPS code for the DAXPY loop (requires 6 instructions to execute):
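Again reconstructing the missing figure, the corresponding VMIPS sequence (per the standard textbook
example) is:

        L.D      F0,a        ;load scalar a
        LV       V1,Rx       ;load vector X
        MULVS.D  V2,V1,F0    ;vector-scalar multiply
        LV       V3,Ry       ;load vector Y
        ADDVV.D  V4,V2,V3    ;add the two vectors
        SV       V4,Ry       ;store the sum in Y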
The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth,
executing only 6 instructions versus 578 for MIPS. This reduction occurs because the vector operations work
on 64 elements in the vector register.
(iii) VECTOR EXECUTION TIME:
The execution time of a sequence of vector operations primarily depends on three factors:
(1) the length of the operand vectors, (2) structural hazards among the operations, and (3) the
data dependences.
The execution time (in clock cycles) of a single vector instruction is approximately the
vector length.
Convoy : the set of vector instructions that could potentially execute together. We can
estimate the performance of a section of code by counting the number of convoys. The instructions in a
convoy must not contain any structural hazards, and a convoy of instructions must complete execution before
any other instructions (scalar or vector) can begin execution.
Chaining allows sequences with read-after-write dependence hazards to execute in the same
convoy. Chaining allows a vector operation to start as soon as the individual elements of its vector source
operand become available: the results from the first functional unit in the chain are “forwarded” to the
second functional unit.
Chime : A timing metric to estimate the time for a convoy is called a chime. It is simply
the unit of time taken to execute one convoy. Thus, a vector sequence that consists of m convoys executes in
m chimes; for a vector length of n, for VMIPS this is approximately m × n clock cycles.
The convoys for the VMIPS DAXPY code are shown below:
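The original figure is reconstructed here from the six-instruction VMIPS sequence above; chaining lets each
load feed the arithmetic operation that depends on it in the same convoy:

Convoy 1:  LV   MULVS.D
Convoy 2:  LV   ADDVV.D
Convoy 3:  SV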
Each convoy executes 64 elements and takes 1 chime, or 64 clock cycles.
Hence, for 3 convoys, the sequence takes 3 chimes, or 64 × 3 = 192 clock cycles.
(iv) OPTIMIZATIONS OF THE VECTOR ARCHITECTURE
A. Multiple Lanes: Beyond One Element per Clock Cycle
B. Vector-Length Registers: Handling Loops Not Equal to 64
C. Vector Mask Registers: Handling IF Statements in Vector Loops
D. Memory Banks: Supplying Bandwidth for Vector Load/Store Units
E. Stride: Handling Multidimensional Arrays in Vector Architectures
F. Gather-Scatter: Handling Sparse Matrices in Vector Architectures
A. Multiple Lanes: Beyond One Element per Clock Cycle
Multiple functional units can be used to improve the performance of a single vector instruction.
Example : C = A + B (A, B and C are vectors)
The vector processor shown below (Fig. (a)) has a single add pipeline and completes only
one addition per cycle; hence, it takes 64 clock cycles to complete a chime.
The vector processor shown below (Fig. (b)) has four add pipelines and can complete four
additions per cycle. The elements within a single vector add instruction are interleaved across the four
pipelines. The set of elements that move through the pipelines together is termed an element group. For a
four-lane architecture, the clock cycles per chime are reduced to 16.
Fig. (a): Single lane. Fig. (b): Multiple lanes (4 lanes).
The figure below shows the structure of a vector functional unit containing four lanes.
The vector register storage is divided across the lanes. Each lane holds every fourth
element of each vector register. The figure shows three vector functional units: an FP add, an FP multiply,
and a load-store unit. Each of the vector arithmetic units contains four execution pipelines, one per lane, to
complete a single vector instruction.
Adding multiple lanes is a popular technique to improve vector performance, as it requires
little increase in control complexity and no changes to existing machine code. It also allows designers to
trade off die area, clock rate, voltage, and energy without sacrificing peak performance: if the clock rate of a
vector processor is halved, doubling the number of lanes retains the same potential performance.
B. Vector-Length Registers: Handling Loops Not Equal to 64
The vector length is determined by the number of elements in each vector register. This
length, which is 64 for VMIPS, is unlikely to match the real vector length in a program.
A single piece of code may require different vector lengths. For example, consider this code:
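The referenced code was a figure; it is presumably the DAXPY loop with a run-time bound n:

for (i = 0; i < n; i = i + 1)
    Y[i] = a * X[i] + Y[i];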
The size of all the vector operations depends on n, which may not even be known until run time!
The solution to these problems is to create a vector-length register (VLR). The VLR
controls the length of any vector operation, including a vector load or store.
The value in the VLR, however, cannot be greater than the length of the vector registers.
This maximum length of a vector register is called the maximum vector length (MVL); the MVL determines
the number of data elements in a vector of an architecture.
If the value of n is not known at compile time and may be greater than the MVL, setting
the VLR once is not enough. When the vector is longer than the maximum length, a technique called strip
mining is used.
Strip mining is the generation of code such that each vector operation is done for a size
less than or equal to the MVL. For a loop of length n, find VL = n % MVL (modulo) and
j = n / MVL (truncating integer division). The first strip is executed with VL elements; the remaining strips
are executed n / MVL times with MVL elements each.
The strip-mined version of the DAXPY loop in C is shown below:
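The listing was a figure; a sketch of the standard strip-mined form follows, with VL and MVL treated as
ordinary variables:

low = 0;
VL = (n % MVL);                           /* find the odd-size piece using modulo */
for (j = 0; j <= (n / MVL); j = j + 1) {  /* outer loop */
    for (i = low; i < (low + VL); i = i + 1)  /* runs for length VL */
        Y[i] = a * X[i] + Y[i];           /* main operation */
    low = low + VL;                       /* start of next vector */
    VL = MVL;                             /* reset the length to the maximum */
}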
A vector of arbitrary length processed with strip mining
C. Vector Mask Registers: Handling IF Statements in Vector Loops
Programs that contain IF statements in loops cannot be run in vector mode using the
previous techniques because the IF statements introduce control dependences into a loop.
Consider the following loop written in C:
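The loop in question was shown as a figure; it is presumably the textbook example:

for (i = 0; i < 64; i = i + 1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];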
This loop cannot normally be vectorized because of the conditional execution of the body. If the inner
loop could be run only for the iterations in which X[i] ≠ 0, then the subtraction could be vectorized. This is
achieved through vector-mask control.
Mask registers provide conditional execution of each element operation in a vector
instruction. Vector-mask control uses a Boolean vector to control the execution of a vector instruction. Each
entry in the VMR is 0 or 1. When the vector-mask register is enabled, any vector instruction executed
operates only on the vector elements whose corresponding entries in the vector-mask register are 1; the
entries in the destination vector register that correspond to a 0 in the mask register are unaffected by the
vector operation. The mask is set using compare instructions such as SNEVS.D, which sets VM(i) to 1
wherever the vector element is not equal to the scalar.
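A sketch of the vectorized loop under mask control (following the textbook VMIPS sequence; Rx and Ry are
assumed to hold the addresses of X and Y):

        LV       V1,Rx       ;load vector X into V1
        LV       V2,Ry       ;load vector Y
        L.D      F0,#0       ;load FP zero into F0
        SNEVS.D  V1,F0       ;sets VM(i) to 1 if V1(i) != F0
        SUBVV.D  V1,V1,V2    ;subtract under vector mask
        SV       V1,Rx       ;store the result in X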
D. Memory Banks: Supplying Bandwidth for Vector Load/Store Units
With simple memory interleaving, a single word is loaded from or stored to a memory bank in one
clock cycle. Most vector processors instead use memory banks, which allow multiple independent accesses
rather than simple memory interleaving, for three reasons:
(i) Many vector computers support multiple loads or stores per clock, and the memory
bank cycle time is usually several times larger than the processor cycle time. To support simultaneous
accesses from multiple loads or stores, the memory system needs multiple banks and to be able to control the
addresses to the banks independently.
(ii) Most vector processors support the ability to load or store data words that are not
sequential.
(iii) Most vector computers support multiple processors sharing the same memory system.
Hence, each processor will be generating its own independent stream of addresses.
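A worked illustration (the numbers follow the textbook's Cray T90 example and are assumptions here): with
32 processors, each generating 4 loads and 2 stores per clock, the memory system must sustain 32 × 6 = 192
accesses per clock. With an SRAM cycle time of 15 ns and a processor cycle of 2.167 ns, each bank is busy
for about 7 processor clocks per access, so roughly 192 × 7 = 1344 independent banks are needed to avoid
stalling.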
E. Stride: Handling Multidimensional Arrays in Vector Architectures
The position in memory of adjacent elements in a vector may not be sequential. The distance separating
elements that are to be gathered into a single register is called the stride.
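For example, in the matrix-multiply loop sketched below (the array dimension of 100 is assumed for
illustration), the inner loop accesses D down a column, so successive elements are 100 doubles, or 800 bytes,
apart; 800 bytes is therefore the stride for D:

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 100; j = j + 1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k = k + 1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

VMIPS handles such accesses with strided loads and stores, LVWS and SVWS, which take the stride from a
general-purpose register.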
F. Gather-Scatter: Handling Sparse Matrices in Vector Architectures
In a sparse matrix, the elements of a vector are usually stored in some compacted form and accessed
indirectly. Assuming a simplified sparse structure, we might see code that looks like this:
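The code was shown as a figure; it is presumably the textbook sparse sum:

for (i = 0; i < n; i = i + 1)
    A[K[i]] = A[K[i]] + C[M[i]];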
K and M are used to designate the nonzero elements of A and C. This code implements a sparse vector sum
on the arrays A and C.
The primary mechanism for supporting sparse matrices is gather-scatter operations using index vectors. The
goal of such operations is to support moving between a compressed representation (i.e., zeros are not
included) and normal representation (i.e., the zeros are included) of a sparse matrix.
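A sketch of the VMIPS version using the indexed load (gather) LVI and indexed store (scatter) SVI,
assuming Ra, Rc, Rk, and Rm hold the starting addresses of A, C, K, and M:

        LV       Vk,Rk         ;load K
        LVI      Va,(Ra+Vk)    ;gather: load A[K[i]]
        LV       Vm,Rm         ;load M
        LVI      Vc,(Rc+Vm)    ;gather: load C[M[i]]
        ADDVV.D  Va,Va,Vc      ;add them
        SVI      (Ra+Vk),Va    ;scatter: store A[K[i]]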
SIMD INSTRUCTION SET EXTENSIONS FOR MULTIMEDIA
SIMD Multimedia Extensions started with the simple observation that many media
applications operate on narrower data types than the 32-bit processors were optimized for.
Many graphics systems used 8 bits to represent each of the three primary colors plus 8
bits for transparency. Depending on the application, audio samples are usually represented with 8 or 16 bits.
By partitioning the carry chains within, say, a 256-bit adder, a processor could perform simultaneous
operations on short vectors of thirty-two 8-bit operands, sixteen 16-bit operands, eight 32-bit operands, or
four 64-bit operands.
Like vector instructions, a SIMD instruction specifies the same operation on vectors of
data. Vector machines have large register files: the VMIPS vector registers can hold as many as sixty-four
64-bit elements in each of 8 vector registers. SIMD instructions, in contrast, tend to specify fewer operands
and hence use much smaller register files.
Limitations, compared to vector instructions:
Number of data operands encoded into op code is small
No sophisticated addressing modes (strided, scatter-gather)
No mask registers
Advantages over vector architecture:
Cost little to add to the standard ALU and are easy to implement
Require little extra state, making context switches easy
Require little extra memory bandwidth
No virtual-memory problems of cross-page access and page faults
Implementations:
o Intel MMX (1996)
Eight 8-bit integer ops or four 16-bit integer ops
o Streaming SIMD Extensions (SSE) (1999)
Eight 16-bit integer ops
Four 32-bit integer/fp ops or two 64-bit integer/fp ops
o Advanced Vector Extensions (2010)
Four 64-bit integer/fp ops
EC6009 – ADVANCED COMPUTER ARCHITECTURE
SUBJECT NOTES – AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
9
o Operands must be consecutive and aligned memory locations
o Generally designed to accelerate carefully written libraries rather than for compilers
Example:
To give an idea of what multimedia instructions look like, assume we added 256-bit SIMD multimedia
instructions to MIPS. We concentrate on floating point in this example. We add the suffix “4D” on
instructions that operate on four double-precision operands at once. Like vector architectures, you can think
of a SIMD processor as having lanes, four in this case. This example
shows MIPS SIMD code for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and
Ry, respectively. Underline the changes to the MIPS code for SIMD.
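The listing was a figure; the sketch below follows the standard textbook SIMD MIPS version (the .4D
opcodes are the hypothetical four-double extensions introduced above):

        L.D     F0,a          ;load scalar a
        MOV     F1,F0         ;copy a into F1 for SIMD MUL
        MOV     F2,F0         ;copy a into F2 for SIMD MUL
        MOV     F3,F0         ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512    ;last address to load
Loop:   L.4D    F4,0(Rx)      ;load X[i] ... X[i+3]
        MUL.4D  F4,F4,F0      ;a × X[i] ... a × X[i+3]
        L.4D    F8,0(Ry)      ;load Y[i] ... Y[i+3]
        ADD.4D  F8,F8,F4      ;a × X[i] + Y[i] ... a × X[i+3] + Y[i+3]
        S.4D    F8,0(Ry)      ;store into Y[i] ... Y[i+3]
        DADDIU  Rx,Rx,#32     ;increment index to X
        DADDIU  Ry,Ry,#32     ;increment index to Y
        DSUBU   R20,R4,Rx     ;compute bound
        BNEZ    R20,Loop      ;check if done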
The changes were :
Replacing every MIPS double-precision instruction with its 4D equivalent
Increasing the increment from 8 to 32
Changing the registers from F2 and F4 to F4 and F8 to get enough space in the register file for four
sequential double-precision operands.
Each SIMD lane must have its own copy of the scalar a; hence, the value of F0 was copied into
registers F1, F2, and F3. (Real SIMD instruction extensions have an instruction to broadcast a value to
all the registers in a group.) Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
SIMD MIPS executes 149 instructions versus 578 for MIPS.
GRAPHICS PROCESSING UNITS
Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s),
including high-performance floating-point units.
They provided workstation-like graphics for PCs.
Users could configure the graphics pipeline, but not really program it.
Over time, more programmability was added (2001-2005):
E.g., the new language Cg for writing small programs run on each vertex or each pixel,
and Windows DirectX variants.
The model was massively parallel (millions of vertices or pixels per frame) but very constrained.
GPUs have virtually every type of parallelism that can be captured by the programming environment:
multithreading, MIMD, SIMD, and even instruction-level parallelism.
A. Features of GPUs:
Boards have their own memory
Usually between 1 and 2 GB; it can go up to 6 GB on high-end models
Much faster than the processor’s memory
Higher bandwidth than the processor’s memory
GPUs exhibit massive parallelism: they support thousands of threads running at the same time.
B. Architecture of GPU:
In general, the architectures of state-of-the-art GPUs are kept secret, but some general details are
published.
For example, NVIDIA publishes documentation about its processors and provides a full
development tool chain for using its GPUs.
GPUs devote most of their transistors to computational units, and very few of them for control and
cache.
GPUs are best suited for applications with simple control flow and high arithmetic intensity.
The terminology used in NVIDIA’s GPUs is:
Grid – a vectorizable loop
Thread Block – a group of threads processing a portion of the loop
(CUDA) Thread – a thread that processes one iteration of the loop
Warp – a thread of SIMD instructions
Global Memory – DRAM available to all threads
Local Memory – memory private to the thread
Shared Memory – memory accessible to all threads of a streaming multiprocessor
Streaming Multiprocessor – multithreaded SIMD processor
Warp Scheduler – SIMD thread scheduler
Thread Processor – SIMD lane
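To connect this terminology to code, here is a minimal CUDA sketch of the DAXPY loop (following the
well-known textbook example): the launch creates a grid of thread blocks, and each CUDA thread computes
one loop iteration.

// Kernel: each CUDA thread handles one element of the loop
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final block
        y[i] = a * x[i] + y[i];
}

// Host code: launch a grid of (n + 255)/256 thread blocks,
// each containing 256 CUDA threads
daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);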
The block diagram of NVIDIA GPU is shown below:
Each SIMD thread (warp) has its own PC.
The SIMD thread scheduler uses a scoreboard to dispatch threads whose operands are ready; there are
no data dependences between threads of SIMD instructions.
The thread block scheduler assigns thread blocks to SIMD processors; within each processor, the warp
scheduler picks which SIMD thread to run next.
Within each SIMD processor there are 32 SIMD lanes. They are wide and shallow compared to
vector processors.
Each SIMD processor has 32,768 registers, divided across the lanes. Each SIMD thread is limited to 64
registers, so a SIMD thread has up to 64 vector registers of 32 32-bit elements, or 32 vector registers of
32 64-bit elements.
Vector Processors vs. GPUs:
Similarities to vector machines:
Work well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files
Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor