EC6009 ADVANCED COMPUTER ARCHITECTURE
UNIT III
DATA-LEVEL PARALLELISM
Vector architecture – SIMD extensions – Graphics Processing units – Loop level parallelism.
DATA-LEVEL PARALLELISM IN:
Vector Architectures
SIMD Extensions
GPU Architectures
VECTOR ARCHITECTURE
Vector architectures grab sets of data elements scattered about memory, place them into large,
sequential register files, operate on the data in those register files, and then disperse the results back into
memory. A single instruction operates on vectors of data, which results in dozens of register–register
operations on independent data elements.
(i) VMIPS (Vector MIPS)
VMIPS is based on the Cray-1 processor and is shown in the figure below:
ARCHITECTURE OF VMIPS
The primary components of the instruction set architecture of VMIPS are the following:
Vector registers :- Each vector register is a fixed-length bank holding a single vector.
VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The register
file must provide enough ports to feed all the vector functional units: at least 16 read ports and 8 write ports,
connected to the functional unit inputs and outputs by a pair of crossbar switches.
Vector functional units:- VMIPS has five functional units. Each unit is fully pipelined,
and it can start a new operation on every clock cycle.
Vector load/store unit :- The vector memory unit loads or stores a vector to or from
memory. The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the
vector registers and memory with a bandwidth of one word per clock cycle.
A set of scalar registers:- Scalar registers can also provide data as input to the vector
functional units. These are the normal 32 general-purpose registers and 32 floating-point registers of MIPS.
(ii) VMIPS VECTOR INSTRUCTIONS
Some of the instructions of VMIPS are shown below:
In VMIPS, vector operations use the same names as scalar MIPS instructions, but with
the letters “VV” or “VS” appended.
ADDVV.D is an addition of two double-precision vectors; it takes a pair of vector registers as its input.
ADDVS.D adds a double-precision scalar to each element of a vector; it takes a vector register and a scalar
register as its input.
The names LV and SV denote vector load and vector store, and they load or store an
entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other
operand, which is a MIPS general-purpose register, is the starting address of the vector in memory.
There are two additional special-purpose registers: the vector-length register (VLR) and
the vector-mask register (VMR). The VLR is used when the natural vector length is not 64; the VMR is used
when loops involve IF statements.
We can best understand a vector processor by looking at a vector loop for VMIPS:
Consider the SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark:
Y=a×X+Y
SAXPY stands for single-precision a × X plus Y.
DAXPY stands for double-precision a × X plus Y.
Let us assume that the number of elements, or length, of a vector register (64) matches the length of the
vector operation.
Example: Show the code for MIPS and VMIPS for the DAXPY loop. Assume that the starting addresses of X
and Y are in Rx and Ry, respectively.
MIPS code for the DAXPY loop (requires 578 instructions to execute):
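The original listing here was a figure that did not survive; the sketch below follows the standard textbook
(Hennessy–Patterson) MIPS version, assuming a is the scalar in memory, R4 holds the last address to load,
and the 64-element vectors occupy 512 bytes:

        L.D     F0,a          ;load scalar a
        DADDIU  R4,Rx,#512    ;last address to load
Loop:   L.D     F2,0(Rx)      ;load X[i]
        MUL.D   F2,F2,F0      ;a × X[i]
        L.D     F4,0(Ry)      ;load Y[i]
        ADD.D   F4,F4,F2      ;a × X[i] + Y[i]
        S.D     F4,0(Ry)      ;store into Y[i]
        DADDIU  Rx,Rx,#8      ;increment index to X
        DADDIU  Ry,Ry,#8      ;increment index to Y
        DSUBU   R20,R4,Rx     ;compute bound
        BNEZ    R20,Loop      ;check if done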
VMIPS code for the DAXPY loop (requires 6 instructions to execute):
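Again reconstructing the missing figure, the corresponding VMIPS sequence (per the standard textbook
example) is:

        L.D      F0,a        ;load scalar a
        LV       V1,Rx       ;load vector X
        MULVS.D  V2,V1,F0    ;vector-scalar multiply
        LV       V3,Ry       ;load vector Y
        ADDVV.D  V4,V2,V3    ;add the two vectors
        SV       V4,Ry       ;store the sum in Y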
The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth,
executing only 6 instructions versus 578 for MIPS. This reduction occurs because the vector operations work
on 64 elements in the vector register.
(iii) VECTOR EXECUTION TIME:
The execution time of a sequence of vector operations primarily depends on three factors:
(1) the length of the operand vectors, (2) structural hazards among the operations, and (3) the
data dependences.
The execution time (in clock cycles) of a single vector instruction is approximately the
vector length.
Convoy : the set of vector instructions that could potentially execute together. We can
estimate the performance of a section of code by counting the number of convoys. The instructions in a
convoy must not contain any structural hazards, and a convoy of instructions must complete execution before
any other instructions (scalar or vector) can begin execution.
Chaining allows sequences with read-after-write dependence hazards to execute in the same
convoy. Chaining allows a vector operation to start as soon as the individual elements of its vector source
operand become available: the results from the first functional unit in the chain are “forwarded” to the
second functional unit.
Chime : A timing metric to estimate the time for a convoy is called a chime. It is simply
the unit of time taken to execute one convoy. Thus, a vector sequence that consists of m convoys executes in
m chimes; for a vector length of n, for VMIPS this is approximately m × n clock cycles.
The convoys for the VMIPS DAXPY code are shown below:
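The original figure is reconstructed here from the six-instruction VMIPS sequence above; chaining lets each
load feed the arithmetic operation that depends on it in the same convoy:

Convoy 1:  LV   MULVS.D
Convoy 2:  LV   ADDVV.D
Convoy 3:  SV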
Each convoy executes 64 elements and takes 1 chime, or 64 clock cycles.
Hence, for 3 convoys, the sequence takes 3 chimes, or 64 × 3 = 192 clock cycles.
(iv) OPTIMIZATIONS OF THE VECTOR ARCHITECTURE
A. Multiple Lanes: Beyond One Element per Clock Cycle
B. Vector-Length Registers: Handling Loops Not Equal to 64
C. Vector Mask Registers: Handling IF Statements in Vector Loops
D. Memory Banks: Supplying Bandwidth for Vector Load/Store Units
E. Stride: Handling Multidimensional Arrays in Vector Architectures
F. Gather-Scatter: Handling Sparse Matrices in Vector Architectures
A. Multiple Lanes: Beyond One Element per Clock Cycle
Multiple functional units can be used to improve the performance of a single vector instruction.
Example : C = A + B (A, B and C are vectors)
The vector processor shown below (Fig. (a)) has a single add pipeline and completes only
one addition per cycle; hence, it takes 64 clock cycles to complete a chime.
The vector processor shown below (Fig. (b)) has four add pipelines and can complete four
additions per cycle. The elements within a single vector add instruction are interleaved across the four
pipelines. The set of elements that move through the pipelines together is termed an element group. For a
four-lane architecture, the clock cycles per chime are reduced to 16.
Fig. (a): Single lane. Fig. (b): Multiple lanes (4 lanes).
The figure below shows the structure of a vector functional unit containing four lanes.
The vector register storage is divided across the lanes. Each lane holds every fourth
element of each vector register. The figure shows three vector functional units: an FP add, an FP multiply,
and a load-store unit. Each of the vector arithmetic units contains four execution pipelines, one per lane, to
complete a single vector instruction.
Adding multiple lanes is a popular technique to improve vector performance, as it requires
little increase in control complexity and no changes to existing machine code. It also allows designers to
trade off die area, clock rate, voltage, and energy without sacrificing peak performance: if the clock rate of a
vector processor is halved, doubling the number of lanes retains the same potential performance.
B. Vector-Length Registers: Handling Loops Not Equal to 64
The vector length is determined by the number of elements in each vector register. This
length, which is 64 for VMIPS, is unlikely to match the real vector length in a program.
A single piece of code may require different vector lengths. For example, consider this code:
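The referenced code was a figure; it is presumably the DAXPY loop with a run-time bound n:

for (i = 0; i < n; i = i + 1)
    Y[i] = a * X[i] + Y[i];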
The size of all the vector operations depends on n, which may not even be known until run time!
The solution to these problems is to create a vector-length register (VLR). The VLR
controls the length of any vector operation, including a vector load or store.
The value in the VLR, however, cannot be greater than the length of the vector registers.
This maximum length of a vector register is called the maximum vector length (MVL); the MVL determines
the number of data elements in a vector of an architecture.
If the value of n is not known at compile time and may be greater than the MVL, setting
the VLR once is not enough. When the vector is longer than the maximum length, a technique called strip
mining is used.
Strip mining is the generation of code such that each vector operation is done for a size
less than or equal to the MVL. For a loop of length n, find VL = n % MVL (modulo) and
j = n / MVL (truncating integer division). The first strip is executed with VL elements; the remaining strips
are executed n / MVL times with MVL elements each.
The strip-mined version of the DAXPY loop in C is shown below:
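The listing was a figure; a sketch of the standard strip-mined form follows, with VL and MVL treated as
ordinary variables:

low = 0;
VL = (n % MVL);                           /* find the odd-size piece using modulo */
for (j = 0; j <= (n / MVL); j = j + 1) {  /* outer loop */
    for (i = low; i < (low + VL); i = i + 1)  /* runs for length VL */
        Y[i] = a * X[i] + Y[i];           /* main operation */
    low = low + VL;                       /* start of next vector */
    VL = MVL;                             /* reset the length to the maximum */
}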
A vector of arbitrary length processed with strip mining
C. Vector Mask Registers: Handling IF Statements in Vector Loops
Programs that contain IF statements in loops cannot be run in vector mode using the
previous techniques because the IF statements introduce control dependences into a loop.
Consider the following loop written in C:
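The loop in question was shown as a figure; it is presumably the textbook example:

for (i = 0; i < 64; i = i + 1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];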
This loop cannot normally be vectorized because of the conditional execution of the body. If the inner
loop could be run only for the iterations in which X[i] ≠ 0, then the subtraction could be vectorized. This is
achieved through vector-mask control.
Mask registers provide conditional execution of each element operation in a vector
instruction. Vector-mask control uses a Boolean vector to control the execution of a vector instruction. Each
entry in the VMR is 0 or 1. When the vector-mask register is enabled, any vector instruction executed
operates only on the vector elements whose corresponding entries in the vector-mask register are 1; the
entries in the destination vector register that correspond to a 0 in the mask register are unaffected by the
vector operation. The mask is set using compare instructions such as SNEVS.D, which sets VM(i) to 1
wherever the vector element is not equal to the scalar.
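A sketch of the vectorized loop under mask control (following the textbook VMIPS sequence; Rx and Ry are
assumed to hold the addresses of X and Y):

        LV       V1,Rx       ;load vector X into V1
        LV       V2,Ry       ;load vector Y
        L.D      F0,#0       ;load FP zero into F0
        SNEVS.D  V1,F0       ;sets VM(i) to 1 if V1(i) != F0
        SUBVV.D  V1,V1,V2    ;subtract under vector mask
        SV       V1,Rx       ;store the result in X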
D. Memory Banks: Supplying Bandwidth for Vector Load/Store Units
With simple memory interleaving, a single word is loaded from or stored to a memory bank in one
clock cycle. Most vector processors instead use memory banks, which allow multiple independent accesses
rather than simple memory interleaving, for three reasons:
(i) Many vector computers support multiple loads or stores per clock, and the memory
bank cycle time is usually several times larger than the processor cycle time. To support simultaneous
accesses from multiple loads or stores, the memory system needs multiple banks and to be able to control the
addresses to the banks independently.
(ii) Most vector processors support the ability to load or store data words that are not
sequential.
(iii) Most vector computers support multiple processors sharing the same memory system.
Hence, each processor will be generating its own independent stream of addresses.
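A worked illustration (the numbers follow the textbook's Cray T90 example and are assumptions here): with
32 processors, each generating 4 loads and 2 stores per clock, the memory system must sustain 32 × 6 = 192
accesses per clock. With an SRAM cycle time of 15 ns and a processor cycle of 2.167 ns, each bank is busy
for about 7 processor clocks per access, so roughly 192 × 7 = 1344 independent banks are needed to avoid
stalling.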
E. Stride: Handling Multidimensional Arrays in Vector Architectures
The position in memory of adjacent elements in a vector may not be sequential. The distance separating
elements that are to be gathered into a single register is called the stride.
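For example, in the matrix-multiply loop sketched below (the array dimension of 100 is assumed for
illustration), the inner loop accesses D down a column, so successive elements are 100 doubles, or 800 bytes,
apart; 800 bytes is therefore the stride for D:

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 100; j = j + 1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k = k + 1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

VMIPS handles such accesses with strided loads and stores, LVWS and SVWS, which take the stride from a
general-purpose register.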
F. Gather-Scatter: Handling Sparse Matrices in Vector Architectures
In a sparse matrix, the elements of a vector are usually stored in some compacted form and accessed
indirectly. Assuming a simplified sparse structure, we might see code that looks like this:
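The code was shown as a figure; it is presumably the textbook sparse sum:

for (i = 0; i < n; i = i + 1)
    A[K[i]] = A[K[i]] + C[M[i]];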
K and M are used to designate the nonzero elements of A and C. This code implements a sparse vector sum
on the arrays A and C.
The primary mechanism for supporting sparse matrices is gather-scatter operations using index vectors. The
goal of such operations is to support moving between a compressed representation (i.e., zeros are not
included) and normal representation (i.e., the zeros are included) of a sparse matrix.
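A sketch of the VMIPS version using the indexed load (gather) LVI and indexed store (scatter) SVI,
assuming Ra, Rc, Rk, and Rm hold the starting addresses of A, C, K, and M:

        LV       Vk,Rk         ;load K
        LVI      Va,(Ra+Vk)    ;gather: load A[K[i]]
        LV       Vm,Rm         ;load M
        LVI      Vc,(Rc+Vm)    ;gather: load C[M[i]]
        ADDVV.D  Va,Va,Vc      ;add them
        SVI      (Ra+Vk),Va    ;scatter: store A[K[i]]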
SIMD INSTRUCTION SET EXTENSIONS FOR MULTIMEDIA
SIMD Multimedia Extensions started with the simple observation that many media
applications operate on narrower data types than the 32-bit processors were optimized for.
Many graphics systems used 8 bits to represent each of the three primary colors plus 8
bits for transparency. Depending on the application, audio samples are usually represented with 8 or 16 bits.
By partitioning the carry chains within, say, a 256-bit adder, a processor could perform simultaneous
operations on short vectors of thirty-two 8-bit operands, sixteen 16-bit operands, eight 32-bit operands, or
four 64-bit operands.
Like vector instructions, a SIMD instruction specifies the same operation on vectors of
data. Vector machines have large register files: the VMIPS vector registers can hold as many as sixty-four
64-bit elements in each of 8 vector registers. SIMD instructions, in contrast, tend to specify fewer operands
and hence use much smaller register files.
Limitations, compared to vector instructions:
Number of data operands encoded into op code is small
No sophisticated addressing modes (strided, scatter-gather)
No mask registers
Advantages over vector architecture:
Cost little to add to the standard ALU and are easy to implement
Require little extra state, making context switches easy
Require little extra memory bandwidth
No virtual-memory problems of cross-page access and page faults
Implementations:
o Intel MMX (1996)
Eight 8-bit integer ops or four 16-bit integer ops
o Streaming SIMD Extensions (SSE) (1999)
Eight 16-bit integer ops
Four 32-bit integer/fp ops or two 64-bit integer/fp ops
o Advanced Vector Extensions (2010)
Four 64-bit integer/fp ops
EC6009 – ADVANCED COMPUTER ARCHITECTURE
SUBJECT NOTES – AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
9
o Operands must be consecutive and aligned memory locations
o Generally designed to accelerate carefully written libraries rather than for compilers
Example:
To give an idea of what multimedia instructions look like, assume we added 256-bit SIMD multimedia
instructions to MIPS. We concentrate on floating point in this example. We add the suffix “4D” on
instructions that operate on four double-precision operands at once. Like vector architectures, you can think
of a SIMD processor as having lanes, four in this case. This example
shows MIPS SIMD code for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and
Ry, respectively. Underline the changes to the MIPS code for SIMD.
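The listing was a figure; the sketch below follows the standard textbook SIMD MIPS version (the .4D
opcodes are the hypothetical four-double extensions introduced above):

        L.D     F0,a          ;load scalar a
        MOV     F1,F0         ;copy a into F1 for SIMD MUL
        MOV     F2,F0         ;copy a into F2 for SIMD MUL
        MOV     F3,F0         ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512    ;last address to load
Loop:   L.4D    F4,0(Rx)      ;load X[i] ... X[i+3]
        MUL.4D  F4,F4,F0      ;a × X[i] ... a × X[i+3]
        L.4D    F8,0(Ry)      ;load Y[i] ... Y[i+3]
        ADD.4D  F8,F8,F4      ;a × X[i] + Y[i] ... a × X[i+3] + Y[i+3]
        S.4D    F8,0(Ry)      ;store into Y[i] ... Y[i+3]
        DADDIU  Rx,Rx,#32     ;increment index to X
        DADDIU  Ry,Ry,#32     ;increment index to Y
        DSUBU   R20,R4,Rx     ;compute bound
        BNEZ    R20,Loop      ;check if done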
The changes were :
Replacing every MIPS double-precision instruction with its 4D equivalent
Increasing the increment from 8 to 32
Changing the registers from F2 and F4 to F4 and F8 to get enough space in the register file for four
sequential double-precision operands.
Each SIMD lane must have its own copy of the scalar a; hence, the value of F0 was copied into
registers F1, F2, and F3. (Real SIMD instruction extensions have an instruction to broadcast a value to
all the registers in a group.) Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
SIMD MIPS executes 149 instructions versus 578 for MIPS.
GRAPHICS PROCESSING UNITS
Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s),
including high-performance floating-point units.
They provided workstation-like graphics for PCs.
Users could configure the graphics pipeline, but not really program it.
Over time, more programmability was added (2001-2005):
E.g., the new language Cg for writing small programs run on each vertex or each pixel,
and Windows DirectX variants.
The model was massively parallel (millions of vertices or pixels per frame) but very constrained.
GPUs have virtually every type of parallelism that can be captured by the programming environment:
multithreading, MIMD, SIMD, and even instruction-level parallelism.
A. Features of GPUs:
Boards have their own memory
Usually between 1 and 2 GB; it can go up to 6 GB on high-end models
Much faster than the processor’s memory
Higher bandwidth than the processor’s memory
GPUs exhibit massive parallelism: they support thousands of threads running at the same time.
B. Architecture of GPU:
In general, the architectures of state-of-the-art GPUs are kept secret, but some general details are
published.
For example, NVIDIA publishes documentation about its processors and provides a full
development tool chain for using its GPUs.
GPUs devote most of their transistors to computational units, and very few of them for control and
cache.
GPUs are best suited for applications with simple control flow and high arithmetic intensity.
The terminology used in NVIDIA’s GPUs is:
Grid – a vectorizable loop
Thread Block – a group of threads processing a portion of the loop
(CUDA) Thread – a thread that processes one iteration of the loop
Warp – a thread of SIMD instructions
Global Memory – DRAM available to all threads
Local Memory – memory private to the thread
Shared Memory – memory accessible to all threads of a streaming multiprocessor
Streaming Multiprocessor – multithreaded SIMD processor
Warp Scheduler – SIMD thread scheduler
Thread Processor – SIMD lane
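To connect this terminology to code, here is a minimal CUDA sketch of the DAXPY loop (following the
well-known textbook example): the launch creates a grid of thread blocks, and each CUDA thread computes
one loop iteration.

// Kernel: each CUDA thread handles one element of the loop
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final block
        y[i] = a * x[i] + y[i];
}

// Host code: launch a grid of (n + 255)/256 thread blocks,
// each containing 256 CUDA threads
daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);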
The block diagram of NVIDIA GPU is shown below:
Each SIMD thread (warp) has its own PC.
The SIMD thread scheduler uses a scoreboard to dispatch threads whose operands are ready; there are
no data dependences between threads of SIMD instructions.
The thread block scheduler assigns thread blocks to SIMD processors; within each processor, the warp
scheduler picks which SIMD thread to run next.
Within each SIMD processor there are 32 SIMD lanes. They are wide and shallow compared to
vector processors.
Each SIMD processor has 32,768 registers, divided across the lanes. Each SIMD thread is limited to 64
registers, so a SIMD thread has up to 64 vector registers of 32 32-bit elements, or 32 vector registers of
32 64-bit elements.
Vector Processors vs. GPUs:
Similarities to vector machines:
Work well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files
Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor