
Advanced Computer Architecture

Detailed notes, structured unit by unit according to the syllabus.

Unit I: Theory of Parallelism

This unit lays the groundwork for understanding parallel computing by introducing fundamental models, properties, and challenges.

Parallel Computer Models

Parallel computing involves using multiple processing elements simultaneously to solve a problem. The primary models classify how these elements handle instructions and data.

 The State of Computing: Historically, performance gains came from increasing clock speeds (making one processor faster). We've hit physical limits (the "power wall"), so modern performance gains now come from parallelism—using multiple, simpler processing cores.

 Flynn's Taxonomy: This is the most common classification of computer architectures, based on the number of instruction streams and data streams.

o SISD (Single Instruction, Single Data): A traditional uniprocessor computer (e.g., a simple microcontroller). One instruction operates on one piece of data at a time.

o SIMD (Single Instruction, Multiple Data): A single instruction is broadcast to multiple processing elements, each operating on its own data. This is excellent for data parallelism, like applying a brightness filter to all pixels in an image simultaneously. Vector processors are a key example.

o MISD (Multiple Instruction, Single Data): Multiple instructions operate on a single stream of data. This is very rare in practice but could be used for fault tolerance (e.g., multiple systems performing different checks on the same data).

o MIMD (Multiple Instruction, Multiple Data): Multiple processors execute different instructions on different data. This is the most common type of parallel computer today, including multi-core CPUs.

 Multiprocessors (Shared Memory): All processors share a single global memory space. Communication is fast and implicit through memory reads/writes.

 Multicomputers (Distributed Memory): Each processor has its own private memory. Communication is explicit, requiring messages to be sent between processors over a network.

 PRAM and VLSI Models: These are theoretical models used to analyze algorithms without being tied to specific hardware.

o PRAM (Parallel Random Access Machine): A shared-memory MIMD model where N processors are connected to a global memory. It's used to study the logical structure of parallel algorithms and their potential speedup, ignoring communication costs.

o VLSI (Very Large Scale Integration) Model: Models the physical limitations of a chip, considering factors like the area of the components and the length of wires. It helps analyze the trade-offs between computation and communication costs on an actual silicon chip.

Program and Network Properties

 Conditions of Parallelism (Bernstein's Conditions): For two program segments, P1 and P2, to be executed in parallel, they must be independent. This is determined by analyzing their input and output variables. Let I(Pi) be the set of input variables and O(Pi) be the set of output variables for process Pi. P1 and P2 can execute in parallel if:

1. I(P1) ∩ O(P2) = ∅ (P1's input is not P2's output)

2. I(P2) ∩ O(P1) = ∅ (P2's input is not P1's output)

3. O(P1) ∩ O(P2) = ∅ (They don't write to the same output variable)
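As a concrete illustration, the sketch below (hypothetical variable names) shows two segments that satisfy all three conditions and can run in parallel, plus a third segment that violates them:

#include <stdio.h>

int main(void) {
    int a = 2, b = 3, c = 5, d = 7;
    int x, y;

    /* P1: x = a + b   ->  I(P1) = {a, b},  O(P1) = {x} */
    /* P2: y = c * d   ->  I(P2) = {c, d},  O(P2) = {y} */
    /* All three intersections are empty, so P1 and P2 may run in parallel. */
    x = a + b;
    y = c * d;

    /* P3: a = x - 1   ->  I(P3) = {x},  O(P3) = {a}                        */
    /* I(P3) ∩ O(P1) = {x} ≠ ∅, so P3 must wait for P1 (a flow dependence). */
    a = x - 1;

    printf("x=%d y=%d a=%d\n", x, y, a);
    return 0;
}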

 Program Partitioning and Scheduling: This is the challenge of breaking down a large program into smaller tasks (partitioning) and assigning those tasks to different processors (scheduling) to minimize overall execution time and maximize resource utilization.

 System Interconnect Architectures: This is the "fabric" or network that connects processors and memory modules. The topology of this network critically affects performance, cost, and scalability.

o Shared Bus: Simple and cheap, but doesn't scale well as many processors will contend for the single bus.

o Crossbar Switch: Provides a direct connection between any processor and any memory module. Offers maximum bandwidth but is very expensive and complex to build for many processors.

o Multistage Networks: A compromise between bus and crossbar (e.g., Omega Network).

o Direct Topologies: Used in multicomputers where nodes are connected directly. Examples include Ring, Mesh, Torus, and Hypercube.

Unit II: Principles of Scalable Performance

This unit focuses on how to measure, analyze, and achieve performance improvements through parallelism.

Performance Metrics and Measures

 Speedup: The primary goal of parallel processing. It measures how much faster a parallel algorithm is compared to its best sequential counterpart.

o Speedup(n) = T_sequential / T_parallel(n), where n is the number of processors.

 Efficiency: Measures how well the processors are being utilized. Ideal efficiency is 1 (or 100%), meaning perfect speedup.

o Efficiency(n) = Speedup(n) / n

 Scalability: The ability of a system to maintain performance gains as more processors are added.
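Both metrics follow directly from measured run times; a minimal sketch in C with made-up example timings:

#include <stdio.h>

/* Speedup(n) = T_sequential / T_parallel(n); Efficiency(n) = Speedup(n) / n */
int main(void) {
    double t_seq = 120.0;   /* assumed measured sequential run time (s) */
    double t_par = 18.0;    /* assumed measured parallel run time (s)   */
    int n = 8;              /* number of processors                     */

    double speedup = t_seq / t_par;
    double efficiency = speedup / n;

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}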

Speedup Performance Laws

 Amdahl's Law: Describes the theoretical maximum speedup for a fixed-size problem. It states that the speedup is limited by the portion of the program that is inherently sequential (cannot be parallelized). If s is the fraction of the program that is sequential, the maximum speedup is:

o Speedup(n) ≤ 1 / (s + (1 − s)/n)

o This gives a pessimistic view, suggesting that if even 10% of a program is sequential (s = 0.1), you can never achieve more than a 10x speedup, regardless of how many processors you use.
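A quick numeric check of Amdahl's bound, using the s = 0.1 example above (illustrative sketch only):

#include <stdio.h>

/* Amdahl's Law: upper bound on speedup for a fixed-size problem. */
double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.1;  /* sequential fraction from the example above */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  speedup <= %.2f\n", n, amdahl_speedup(s, n));
    /* The bound approaches 1/s = 10 no matter how large n grows. */
    return 0;
}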

 Gustafson's Law: Provides a more optimistic view for problems where the size of the parallel part grows with the number of processors (scaled workload). It computes the speedup for a problem that is scaled to take the same amount of time on n processors as the original problem did on one.

o Speedup(n) = s + (1 − s) × n

o This suggests that for large problems, the impact of the sequential portion becomes less significant, and near-linear speedups are achievable.
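The same numeric check for Gustafson's scaled speedup, with the same s = 0.1 (illustrative sketch only):

#include <stdio.h>

/* Gustafson's Law: scaled speedup when the parallel workload grows with n. */
double gustafson_speedup(double s, int n) {
    return s + (1.0 - s) * n;
}

int main(void) {
    double s = 0.1;  /* same sequential fraction as in the Amdahl example */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  scaled speedup = %.1f\n", n, gustafson_speedup(s, n));
    /* The scaled speedup keeps growing almost linearly with n. */
    return 0;
}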

Hardware Technologies

 Processors and Memory Hierarchy: Performance is not just about processor speed; it's heavily dependent on memory access time. The memory hierarchy is designed to bridge the speed gap between fast processors and slow main memory.

o Registers -> L1 Cache -> L2 Cache -> L3 Cache -> Main Memory (DRAM) -> Storage (SSD/HDD)

o Caches store frequently used data closer to the processor, reducing latency. Effective use of the cache is critical for high performance.
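As an illustration of why cache use matters, the two loops below sum the same matrix. The first walks memory contiguously (row-major, as C stores it) and typically runs far faster than the second on cached hardware. This is only a sketch with an arbitrarily chosen matrix size:

#include <stdio.h>
#define N 1024

static double m[N][N];  /* stored row-major in C, zero-initialized */

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: consecutive accesses reuse the same cache lines. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Cache-unfriendly: stride-N accesses miss far more often. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("sum = %f\n", sum);
    return 0;
}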

 Advanced Processor Technology:

o Superscalar Processors: These processors can execute more than one instruction per clock cycle by having multiple parallel execution units (e.g., multiple ALUs). This is a form of Instruction-Level Parallelism (ILP).

o Vector Processors: These processors are designed to operate on entire arrays or "vectors" of data with a single instruction. This is a form of Data-Level Parallelism (DLP) and is highly efficient for scientific and multimedia applications.

Unit III: Shared-Memory and Pipelining


This unit delves into the architecture of shared-memory systems and the
fundamental performance technique of pipelining.

Shared-Memory Organizations

In these systems, all processors access a single, global address space. The
key challenge is ensuring all processors have a consistent view of this
memory.

 Consistency Models: These are rules that define what it means for
memory to be "correct" in a parallel system. They specify the order
in which memory operations appear to execute.

o Sequential Consistency: The most intuitive model. The result of any execution is the same as if the operations of all processors were executed in some single sequential order, and the operations of each individual processor appear in the order specified by its program. It's easy to reason about but can be slow to implement.

o Weak Consistency Models: Relax the rules of sequential consistency to improve performance. They only guarantee consistency at certain "synchronization points" in the program. Between these points, a processor might not see updates made by another processor immediately.
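The classic "store buffering" litmus test makes the difference concrete. The sketch below assumes C11 atomics and POSIX threads are available; under sequential consistency the outcome r1 = 0 and r2 = 0 is impossible, while a weaker ordering (relaxed atomics here) allows it and it can be observed on real hardware:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);  /* write x */
    r1 = atomic_load_explicit(&y, memory_order_relaxed); /* read y  */
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);  /* write y */
    r2 = atomic_load_explicit(&x, memory_order_relaxed); /* read x  */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r1 = 0 and r2 = 0 together signal a weakly ordered execution. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}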

Pipelining Techniques

Pipelining is an implementation technique where multiple instructions are overlapped in execution. It's like an assembly line 🏭 for instructions. A typical instruction cycle (Fetch, Decode, Execute, Write-back) is broken down into stages.

 Linear Pipeline Processors: A pipeline with a straight-line flow of stages, with no feedback loops. Instruction pipelines are typically linear.

 Non-Linear Pipeline Processors: A pipeline that contains feedback or feed-forward connections, allowing the data flow to branch. Useful for certain function evaluations.

 Instruction Pipeline Design: Focuses on fetching and executing a stream of instructions. The goal is to have one instruction completing every clock cycle. This can be disrupted by hazards:

o Structural Hazards: Two instructions need the same hardware resource at the same time.

o Data Hazards: An instruction depends on the result of a previous instruction that is still in the pipeline (e.g., ADD R1, R2, R3 followed by SUB R4, R1, R5).

o Control Hazards: The pipeline makes a wrong decision on a branch instruction and has to flush and restart.
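Written as C for clarity (plain variables stand in for the registers), the data-hazard example above looks like this; the second statement cannot read r1 until the first has produced it, so the pipeline must forward the result or stall:

#include <stdio.h>

int main(void) {
    int r2 = 4, r3 = 6, r5 = 1;

    int r1 = r2 + r3;   /* ADD R1, R2, R3  (produces r1)           */
    int r4 = r1 - r5;   /* SUB R4, R1, R5  (reads r1 immediately)  */
    /* RAW (read-after-write) dependence: without forwarding, the SUB
       would read a stale r1 while the ADD is still in the pipeline. */

    printf("r1=%d r4=%d\n", r1, r4);
    return 0;
}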

 Arithmetic Pipeline Design: Used to speed up complex arithmetic operations like floating-point multiplication by breaking them down into stages.

 Superscalar Pipeline Design: A superscalar processor has multiple instruction pipelines. It fetches and decodes multiple instructions at once and dispatches them to different execution units (pipelines), achieving a throughput of more than one instruction per cycle.

Unit IV: Parallel and Scalable Architectures

This unit revisits parallel architectures with a focus on scalability and the
mechanisms required to make them work.

Multiprocessors and Cache Coherence

 Multiprocessor System Interconnects: As discussed in Unit I, the choice of interconnect (bus, crossbar, multistage network) is critical for shared-memory multiprocessors, as it dictates the bandwidth and latency of memory access.

 Cache Coherence Problem: In a multiprocessor system with individual caches, a copy of a data block may exist in multiple caches. If one processor modifies its copy, the other caches become stale, leading to an inconsistent view of memory. This must be prevented.

 Cache Coherence and Synchronization Mechanisms: Protocols that ensure an orderly, consistent view of shared memory.

o Snooping Protocols: Used in bus-based systems. Each cache "snoops" on the bus to monitor memory transactions. If it sees an update to a data block it holds, it can either invalidate its copy or update it (e.g., MSI, MESI protocols).

o Directory-Based Protocols: Used in large-scale systems where a bus is not feasible. A central "directory" keeps track of which processors are sharing which data blocks. When a block is modified, the directory sends targeted messages only to the caches that hold a copy, which is more scalable than broadcasting on a bus.
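A minimal sketch of the idea behind an invalidate-based snooping protocol, assuming a deliberately simplified MESI state machine for one cache line (real protocols also handle write-backs, exclusive-on-miss upgrades, and races on the bus):

#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_READX } event_t;

/* Next state of one cache's copy of a line, given a local or bus event. */
mesi_t next_state(mesi_t s, event_t e) {
    switch (e) {
    case PR_READ:   /* local processor read: a miss fetches a shared copy  */
        return (s == INVALID) ? SHARED : s;
    case PR_WRITE:  /* local processor write: gain ownership of the line   */
        return MODIFIED;
    case BUS_READ:  /* another cache reads the line: downgrade to shared   */
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case BUS_READX: /* another cache writes the line: our copy is stale    */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, PR_READ);   /* INVALID  -> SHARED                 */
    s = next_state(s, PR_WRITE);  /* SHARED   -> MODIFIED               */
    s = next_state(s, BUS_READX); /* MODIFIED -> INVALID (other writer) */
    printf("final state: %d\n", s);
    return 0;
}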

Multicomputers and Message Passing

 Three Generations of Multicomputers: Early multicomputers used simple mesh/hypercube interconnects. Later generations developed more sophisticated routing techniques and integrated message-passing hardware to reduce communication latency.

 Message-Passing Mechanisms: Since there is no shared memory, processors communicate by explicitly sending and receiving messages. A message contains the data being sent along with header information (source, destination, etc.). The performance of a multicomputer is heavily dependent on the efficiency of this mechanism, which includes:

o Communication Latency: The total time to send a message from source to destination.

o Routing: The algorithm used to determine the path a message takes through the network.

o Flow Control: Mechanisms to prevent network congestion.
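MPI is the de facto standard API for this style of explicit communication (it is also mentioned in the summary notes further below). A minimal sketch, assuming an MPI installation (compile with mpicc, run with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit send: data plus destination and tag form the message "header". */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}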

Unit V: Vector Processing and SIMD Computers

This unit focuses on architectures designed for massive data-level parallelism.

Vector Processing Principles

Instead of operating on single data items (scalars), vector processors operate on entire arrays (vectors) at once.

 Vector Instruction: A single instruction like VADD C, A, B performs an element-wise addition of two vectors, A and B, and stores the result in vector C. This single instruction might perform 64 additions, for example. This drastically reduces the instruction fetch and decode overhead compared to a scalar loop.
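For comparison, here is a scalar C loop that does the same work as one VADD; the pragma is only a hint to a vectorizing compiler, and the vector length of 64 is the example figure from the text:

#include <stdio.h>
#define N 64

int main(void) {
    float A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    /* Scalar view: N separate add instructions, each fetched and decoded. */
    /* Vector view: conceptually one instruction, VADD C, A, B.            */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];

    printf("C[10] = %f\n", C[10]);
    return 0;
}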

 Multivector Multiprocessors: These are systems that combine the features of multiprocessors and vector processors. They consist of multiple vector processors, often connected to a shared memory system, to tackle extremely large-scale scientific problems.
 Compound Vector Processing: A technique where a vector
processor can chain multiple vector operations together (e.g., the
output of a vector multiply is fed directly into a vector add) without
waiting for the first operation to complete. This is essentially
pipelining for vector operations.
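As a sketch, chaining is what lets a vector machine evaluate a fused expression like the one below without waiting for the whole multiply to finish; the C loop only illustrates the data flow, not the hardware:

#include <stdio.h>
#define N 64

int main(void) {
    float a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 1.5f; c[i] = 3.0f; }

    /* Unchained view: compute all products first, then all sums.          */
    /* Chained view: each product flows into the adder as soon as it exits */
    /* the multiply pipeline, so the two operations overlap in time.       */
    for (int i = 0; i < N; i++)
        d[i] = a[i] * b[i] + c[i];

    printf("d[5] = %f\n", d[5]);
    return 0;
}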

SIMD Computer Organizations

Vector processing is a highly efficient form of SIMD (Single Instruction, Multiple Data). Another form is the processor array.

 Processor Array: A large number of simple processing elements (PEs) are arranged in a grid or other topology. A central control unit broadcasts a single instruction to all PEs, and each PE executes it on its own local data.

 The Connection Machine CM-5: A landmark massively parallel computer from the early 1990s. It was a hybrid machine that supported both SIMD and MIMD programming models. It consisted of thousands of processing nodes connected by a sophisticated "fat-tree" network, demonstrating the principles of massive parallelism and scalable interconnects.

---------------------------

UNIT I: Theory of Parallelism & System Architecture

Theory of Parallelism

 Parallelism means performing multiple computations simultaneously to solve a problem faster.

 Benefits: Increased speed, efficiency, ability to solve large problems.

 Levels of Parallelism: Bit-level, instruction-level, data-level, and task-level.

Parallel Computer Models

 Data-parallel models: All processors perform the same operation on different data (e.g., SIMD).

 Task graph models: Different processors perform different tasks; represented through a dependency graph.

 Pipeline models: Tasks are divided into stages processed in sequence but overlapped ("assembly line").

The State of Computing


 Transition from serial to parallel architectures, driven by the need for high performance in scientific, industrial, and data-intensive applications.

Multiprocessors and Multicomputers

 Multiprocessors: All CPUs share a common memory (tightly coupled). Often SMP (Symmetric Multiprocessing).

 Multicomputers: Each processor has its own memory and nodes are networked (loosely coupled).

 Architectures: UMA, NUMA, and COMA (described below).

Diagram: Typical multiprocessor bus interconnect—multiple CPUs connected to a single, shared bus; multicomputer—nodes interconnected through a network.

Multivector and SIMD Computers

 Multivector: Multiple vector processors working in parallel.

 SIMD: Single instruction operates on multiple data streams at once (suitable for scientific computing).

PRAM and VLSI Models

 PRAM (Parallel Random Access Machine): Multiple processors synchronously access shared memory; fundamental for parallel algorithm analysis.

 Types: EREW (exclusive read, exclusive write), CREW (concurrent read, exclusive write), CRCW (concurrent read, concurrent write).

 VLSI Model: Considers algorithms in terms of chip area and processing time, important for hardware implementation.

Architectural Development Tracks, Program/Network Properties

 Focus on increasing concurrency and on interconnect topologies (mesh, hypercube, ring, etc.).

Conditions of Parallelism, Program Partitioning and Scheduling

 Programs must be partitioned so tasks can run in parallel with minimal dependency/synchronization.

 Scheduling: Assigning tasks to processors for optimal resource use.

Program Flow Mechanisms

 Control Flow: Sequence of execution dictated by control structures.

 Data Flow: Sequence dictated by availability of data.

 Demand-Driven Flow: Computation is triggered by the need for results.

System Interconnect Architectures

 Bus: Shared path.

 Crossbar Switch: Every input can connect to every output.

 Multistage networks (e.g., Omega, Butterfly): Layered switching.

UNIT II: Scalable Performance & Advanced Processors

Principles of Scalable Performance

 Scalability: System's ability to maintain efficiency as workload increases.

 Metrics: Speedup (ratio of serial runtime to parallel runtime), Efficiency, Throughput, Latency.

 Amdahl’s Law: Speedup limited by serial portion.

 Gustafson's Law: Realistic speedup with increasing problem size.

Performance Metrics and Measures

 Speed, capacity, cost, energy consumption; trade-offs assessed for specific applications.

 Run time: Serial vs. Parallel.

Parallel Processing Applications

 Widely applied in scientific simulations, graphics, data analytics, machine learning.

Scalability Analysis and Approaches

 Techniques: Load balancing, minimizing communication cost, reducing synchronization.

Hardware Technologies, Memory Hierarchy

 Fast, small registers → caches → RAM → disk/SSD → network storage.

 Modern hardware leverages hierarchical memory for faster access patterns.

Advanced Processor Technology: Superscalar and Vector Processors


 Superscalar: Executes multiple instructions per cycle by having
several execution units.

 Vector Processors: Operate on entire vectors (arrays) of data in parallel, boosting performance for math-intensive tasks.

Diagram: Block diagram of a superscalar processor: multiple fetch, decode, and execution units that process different instruction types (integer/floating point) concurrently.

UNIT III: Shared Memory & Pipelining

Shared-Memory Organizations

 UMA: Uniform memory access for all processors.

 NUMA: Access time depends on memory location.

 COMA: All main memory acts as cache.

Diagram: UMA vs. NUMA vs. COMA showing how memory is connected to
the CPUs and possible bottlenecks.

Consistency Models

 Sequential consistency: Results are as if executed sequentially.

 Weak consistency: Allows more parallelism by relaxing ordering constraints.

Pipelining & Superscalar Techniques

 Linear Pipeline: Instructions processed in fixed stages, each in sequence.

 Non-Linear Pipeline: Multiple pipelines or paths based on instruction type or data.

Diagram: Classic pipeline (fetch → decode → execute → memory → write-back) and superscalar pipeline with parallel paths.

Instruction Pipeline Design

 Divides instruction processing into multiple sequential stages, allowing multiple instructions in various stages.

Arithmetic Pipeline Design

 Specialized for arithmetic operations (e.g., floating-point addition).

Superscalar Pipeline Design

 Incorporates hazard detection, out-of-order execution, register renaming, etc. for handling multiple simultaneous instructions.
UNIT IV: Parallel and Scalable Architectures

Multiprocessors and Multicomputers

 System Interconnects: Bus-based, switch-based (crossbar, multistage), point-to-point.

 Cache Coherence: Needed in systems with shared memory to keep CPU caches synchronized. Mechanisms include directory-based or snoopy protocols.

Synchronization Mechanisms

 Techniques to ensure correct sequencing (barriers, locks, semaphores).
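A minimal sketch of two of these primitives using POSIX threads (assuming pthread barriers are available): a lock serializes updates to a shared counter, and a barrier makes every thread wait until all have arrived:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static int counter = 0;

void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* lock: one thread updates at a time  */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier); /* barrier: wait until all threads arrive */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %d\n", counter);  /* always NTHREADS */
    pthread_barrier_destroy(&barrier);
    return 0;
}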

Three Generations of Multicomputers

1. Early—custom interconnect, simple communication.

2. Mid—scalable networks, software support.

3. Advanced—commodity components, high-speed interconnects.

Message Passing Mechanisms

 Standard in multicomputer (distributed memory) systems, with APIs (like MPI).

Multivector and SIMD Computers

 SIMD: Suitable for repetitive data operations.

 Multivector: High throughput for structured computations.

UNIT V: Vector & SIMD Processing

Vector Processing Principles

 Operates on vectors (arrays) as single units.

 Useful in scientific, graphics, and AI workloads—one instruction acts on many data items.

Multivector Multiprocessors

 Multiple vector units allow more concurrency.

Compound Vector Processing

 Multiple independent vector pipelining units.

SIMD Computer Organizations


 Examples include Connection Machine CM-5, which used thousands
of simple processors for data-level parallelism.

Diagram: SIMD architecture block diagram showing central control and multiple simple processors performing the same operation.

The Connection Machine CM-5

 Advanced SIMD computer with thousands of nodes connected via a fat-tree network for scalable performance.


