UNIT-2
1. Discuss Parallel Processing application with Scalability Analysis and
Approaches.
Ans: Parallel processing refers to the simultaneous execution of multiple
processes or threads to solve complex computational problems efficiently. It is
widely used in applications such as scientific simulations, data mining, image
processing, and machine learning, where the workload is divided among
multiple processors to achieve faster results. To evaluate the effectiveness of
parallel processing, scalability analysis plays a crucial role in determining how
well a system performs as the workload or resources increase.
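Where the text above says the workload is divided among multiple processors, the following minimal Python sketch shows one way that division can look in practice; the data size, worker count, and chunking scheme are assumptions made purely for illustration.

# Minimal sketch: splitting one large summation across worker processes.
# The workload size and process count here are illustrative assumptions.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker independently reduces its own slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))               # the full workload
    n_workers = 4                               # assumed machine size n
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
    chunks[-1].extend(data[n_workers * size:])  # remainder goes to the last worker

    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # same result as sum(data), computed in parallel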
Applications of Parallel Processing
1. Scientific Research:
o Simulations of natural phenomena (e.g., weather forecasting,
earthquake modeling).
o High-performance computations in physics, chemistry, and biology.
2. Big Data Analytics:
o Parallel frameworks like Hadoop and Spark process large datasets by
distributing computations across clusters of machines.
3. Image and Video Processing:
o Rendering 3D graphics, video compression, and object detection
tasks in real-time.
4. Artificial Intelligence and Machine Learning:
o Training deep neural networks requires splitting computations across
GPUs or distributed systems.
5. Real-Time Systems:
o Parallel processing in robotics, autonomous vehicles, and financial
trading systems enables immediate response to dynamic
environments.
Scalability Analysis in Parallel Processing
Scalability is the ability of a system to maintain efficiency as the number of
processors or workload increases. A scalable parallel system effectively balances
computational and communication demands, ensuring optimal performance
with increasing resources.
Scalability Metrics: The basic metrics affecting the scalability of a computer system for a given application are:
Machine size (n)—the number of processors employed in a parallel
computer system. A large machine size implies more resources and more
computing power.
Clock rate (f)—the clock rate determines the basic machine cycle. We
build a machine with components (processors, memory, bus or
network, etc.) driven by a clock which can scale up with better technology.
Problem size (s)—the amount of computational workload or the number
of data points used to solve a given problem. The problem size is directly
proportional to the sequential execution time T(s, 1) for a uniprocessor
system.
CPU time (T)—the actual CPU time (in seconds) elapsed in executing a
given program on a parallel machine with n processors collectively. This is
the parallel execution time, denoted as T(s,n) and is a function of both s
and n.
I/O demand (d)—the input/output demand in moving the program, data,
and results associated with a given application run. The I/O operations may
overlap with the CPU operations in a multi-programmed environment.
Memory capacity (m)—the amount of main memory (in bytes or words)
used in a program execution. The memory demand is affected by the
problem size, the program size, the algorithms, and the data structures
used.
Computer cost (c)—the total cost of hardware and software resources
required to carry out the execution of a program.
Programming overhead (p)—the development overhead associated with
an application program. Programming overhead may slow down software
productivity and thus implies a high cost. Unless otherwise stated, both
computer cost and programming cost are ignored in our scalability
analysis.
Approaches to Scalability
1. Speedup:
o Measures how much faster a parallel system executes a task
compared to a single processor.
o Defined as S(s,n) = T(s,1) / [T(s,n) + h(s,n)], where T(s,1) is the sequential execution time, T(s,n) is the parallel execution time on n processors, and h(s,n) is the total system overhead (a short numerical sketch follows this list).
2. Efficiency:
o Describes how well the processors are utilized.
o Defined as E(s,n)=S(s,n)/n. In an ideal scenario, efficiency is 1,
indicating perfect scalability.
3. Programming Overhead:
o Includes the development efforts required to adapt algorithms for
parallel systems. Scalability approaches seek to minimize this to
enhance productivity.
4. System Design:
o Efficient architectures with optimized interconnection networks and
minimal communication overhead are essential for scalable systems.
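As a quick illustration of the speedup and efficiency definitions above, the sketch below plugs in made-up timing and overhead figures (they are assumptions, not measurements) and prints S and E for a few machine sizes.

# Toy calculation of speedup S(s,n) = T(s,1) / (T(s,n) + h(s,n))
# and efficiency E(s,n) = S(s,n) / n. All numbers are illustrative.

def speedup(t_seq, t_par, overhead=0.0):
    return t_seq / (t_par + overhead)

def efficiency(t_seq, t_par, n, overhead=0.0):
    return speedup(t_seq, t_par, overhead) / n

t_seq = 100.0                  # T(s,1): sequential execution time (seconds)
for n in (2, 4, 8, 16):
    t_par = t_seq / n          # idealized parallel execution time
    h = 0.5 * n                # assumed overhead growing with machine size
    s = speedup(t_seq, t_par, h)
    e = efficiency(t_seq, t_par, n, h)
    print(f"n={n:2d}  S={s:5.2f}  E={e:4.2f}")
# Efficiency drops as n grows because the overhead term grows with the machine size.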
2Q. Explain Amdahl’s law
Ans: The performance gain that can be obtained by improving some portion of a
computer can be calculated using Amdahl’s law.
Statement:
Amdahl’s law states that “the performance improvement to be gained from
using some faster mode of execution is limited by the fraction of the time the
faster mode can be used.” Amdahl’s law defines the speedup that can be gained
by using a particular feature.
Speedup: Speedup is the ratio that tells us how much faster a task will run using the computer with the enhancement as opposed to the original computer.
Amdahl’s law for a fixed workload deals with computation of a fixed workload, i.e., a fixed problem size.
The speedup factor in Amdahl’s law therefore depends on this fixed problem size, and hence its formula is computed for a fixed load.
Consider both cases of DOP < n and DOP ≥ n.
Let Q(n) be the lumped sum of all system overheads on an n-processor system. Then the fixed-load speedup is
S(n) = T(1) / [T(n) + Q(n)]
A further special case arises when the computer operates either in purely sequential mode (with DOP = 1) or in perfectly parallel mode (with DOP = n); in such a case Wi is assumed to be zero for every i ≠ 1 and i ≠ n, and the speedup becomes
S(n) = (W1 + Wn) / (W1 + Wn/n + Q(n))
Amdahl’s law implies that the sequential portion of the program, W1, does not change with respect to the machine size n. Consider a normalized situation in which W1 + Wn = α + (1 – α) = 1, with α = W1 and (1 – α) = Wn. Ignoring the overhead Q(n), the above equation becomes
S(n) = n / (1 + (n – 1)α)
where α represents the percentage of a program that must be executed sequentially and (1 – α) corresponds to the portion of the code that can be executed in parallel.
Amdahl’s law is illustrated in the figures below.
When the number of processors increases, the load on each processor decreases. However, the total amount of work (workload) W1 + Wn is kept constant, as shown in Fig. (a).
In Fig. (b), the total execution time decreases because Tn = Wn / n. Eventually, the sequential part will dominate the performance, because Tn → 0 as n becomes very large while T1 is kept unchanged (a small numerical check follows below).
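To see the saturation effect numerically, here is a minimal sketch of Amdahl’s fixed-workload formula S(n) = n / (1 + (n – 1)α); the sequential fraction α = 0.05 is an assumed example value, not one from the text.

# Amdahl's law for a fixed workload: S(n) = n / (1 + (n - 1) * alpha).
# alpha is the sequential fraction; the values below are assumed for illustration.

def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

alpha = 0.05   # 5% of the program must run sequentially (assumed)
for n in (1, 2, 4, 16, 64, 1024):
    print(f"n={n:5d}  S={amdahl_speedup(n, alpha):6.2f}")
# As n grows, the speedup saturates near 1/alpha = 20, showing how the
# sequential part eventually dominates performance.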
3Q. Explain in detail about CISC and RISC Processors.
(or)
Explain the differences between RISC and CISC architecture.
Ans: Instruction Set Architecture (ISA) defines primitive commands for
programming a machine. The complexity arises from instruction formats,
addressing modes, data formats, registers, opcodes, and flow control
mechanisms.
Initially, computers used simple instruction sets due to high hardware costs.
Over time, two schools of thought emerged: CISC (Complex Instruction Set
Computer) and RISC (Reduced Instruction Set Computer).
CISC (Complex Instruction Set Computer)
CISC processors are designed with a complex set of instructions that can
execute multi-step operations or addressing modes within a single
instruction.
The goal is to reduce the number of instructions per program by making
individual instructions more powerful.
Earlier CISC processors relied on microprogrammed control, requiring ROM for control memory, which slowed execution.
Modern CISC processors use hardwired control for improved speed.
Key Characteristics:
1. Complex Instructions:
o Each instruction can execute several low-level operations (like
memory access, arithmetic operations, etc.) in a single command.
o Example: An instruction might combine fetching data from memory
and performing an arithmetic operation.
2. Variable Instruction Length:
o Instructions are of different sizes, which makes decoding complex.
3. Hardware Complexity:
o Because instructions are complex, the control unit of the processor
requires more transistors, leading to complex hardware.
4. Fewer Instructions Per Program:
o With its rich set of instructions, a program written for a CISC
processor may require fewer instructions compared to RISC.
5. Performance:
o The execution of individual instructions is slower due to their
complexity.
o Example CISC processors: Intel x86, AMD processors.
6. Memory Usage:
o Programs tend to be smaller since fewer instructions are used.
RISC (Reduced Instruction Set Computer)
RISC processors use a small and optimized set of instructions that execute
in a single clock cycle.
The focus is on simplicity and speed, relying on software to handle
complex tasks by breaking them into smaller instructions.
Originally focused on integer operations but now support both integer and
floating-point operations.
Modern designs may include multiple functional units (integer and
floating-point) and pipeline capabilities.
CISC designs, by contrast, tend to be underpipelined compared to RISC scalar processors.
Key Characteristics:
1. Simple and Uniform Instructions:
o Instructions are simple and fixed in size, making decoding faster and
easier.
2. Single-Cycle Execution:
o Most instructions complete in one clock cycle, improving execution
speed.
3. Hardware Simplicity:
o The simpler instruction set reduces hardware complexity, making the
design more efficient and cheaper.
4. More Instructions Per Program:
o Since the instruction set is simple, complex tasks require multiple
instructions, leading to longer programs.
5. Performance:
o Faster execution of individual instructions and overall programs due
to optimized pipelining.
o Example RISC processors: ARM processors (used in most
smartphones), MIPS, PowerPC.
6. Memory Usage:
o Programs require more memory since more instructions are needed to achieve the same functionality as CISC (a toy comparison is sketched below).
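The instruction-count contrast between the two styles can be made concrete with a toy example: the same high-level operation written once as a single CISC-like instruction and once as an equivalent RISC-like load/store sequence. The mnemonics below are invented for illustration and do not correspond to any real instruction set.

# Hypothetical instruction sequences for C = A + B, where A, B, C live in memory.
# The mnemonics below are invented for illustration only.

cisc_program = [
    "ADD  [C], [A], [B]",   # one complex instruction: two memory reads, an add, a write
]

risc_program = [
    "LOAD  R1, [A]",        # simple fixed-length instructions,
    "LOAD  R2, [B]",        # each doing one thing and (ideally)
    "ADD   R3, R1, R2",     # completing in one clock cycle
    "STORE [C], R3",
]

print("CISC instruction count:", len(cisc_program))   # fewer, more complex instructions
print("RISC instruction count:", len(risc_program))   # more, simpler instructions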
Q4. Explain in detail about superscalar processors
Ans: Superscalar Processors
Superscalar processors are designed to execute multiple instructions in
parallel during a single clock cycle, which improves performance by
exploiting instruction-level parallelism.
Only independent instructions can run in parallel. The degree of
parallelism depends on the type of program being executed. Without loop
optimizations (like loop unrolling), programs typically don't benefit much
from issuing more than three instructions per cycle.
Pipelining in Superscalar processor:
The fundamental structure of a superscalar pipeline is shown below
A superscalar processor pipeline executes multiple instructions per cycle (e.g., 3
instructions for a 3-issue pipeline). Stages include: Fetch → Decode → Execute
→ Write-back.
Efficiency depends on minimizing data dependencies and resource conflicts;
stalls occur if instructions can't execute in parallel.
Instruction cache fetches multiple instructions, with actual execution limited by
dependencies.
Optimizing compilers help maximize parallelism and reduce stalls.
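A rough sketch of the issue behaviour just described: an in-order model that issues up to two instructions per cycle and stalls when the next instruction depends on one issued in the same cycle. The instruction list, register names, and issue width are assumptions for the example only.

# Toy in-order multi-issue model: up to ISSUE_WIDTH independent instructions per cycle.
# Each instruction is (name, destination register, source registers). All values are illustrative.
# Results produced in earlier cycles are assumed to be available (forwarded) already.
ISSUE_WIDTH = 2

program = [
    ("i1", "r1", ["r5"]),
    ("i2", "r2", ["r1"]),    # depends on i1 -> cannot issue in the same cycle as i1
    ("i3", "r3", ["r6"]),
    ("i4", "r4", ["r3"]),    # depends on i3
]

cycle, pc = 0, 0
while pc < len(program):
    cycle += 1
    issued, written = [], set()
    while pc < len(program) and len(issued) < ISSUE_WIDTH:
        name, dst, srcs = program[pc]
        if any(s in written for s in srcs):
            break                      # data dependency on an instruction issued this cycle
        issued.append(name)
        written.add(dst)
        pc += 1
    print(f"cycle {cycle}: issued {issued}")

print(f"effective CPI = {cycle / len(program):.2f}")   # < 1 when parallelism is found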
Key Features
1. Instruction Issue:
o A superscalar processor can issue multiple instructions (2–5 typically)
per cycle.
o Not all cycles achieve full utilization, as dependencies or conflicts
between instructions can cause some pipelines to wait.
2. Pipeline Structure:
o Similar to scalar processors, superscalar pipelines consist of stages:
Fetch → Decode → Execute → Write-back.
o Each stage can handle multiple instructions simultaneously,
depending on instruction availability and independence.
3. Functional Units:
o Contains multiple functional units (e.g., integer and floating-point
units) to handle various instruction types simultaneously.
o All units can theoretically work together if there are no conflicts or
dependencies.
4. Optimizing Compiler: Plays a crucial role in arranging instructions to
maximize parallelism and reduce stalling.
Q5. Briefly describe virtual memory models
Ans: Virtual memory extends physical memory using disk storage, allowing
execution of large programs by dynamically loading active parts into physical
memory.
Expands physical memory using auxiliary storage (e.g., disks).
Only active portions of programs reside in physical memory; inactive parts are
swapped dynamically.
Enables multi-programming, time-sharing systems, and larger program
execution than physical memory allows.
Address Space:
Each word in the physical memory is identified by a unique physical address. All memory words in the main memory form a physical address space. Virtual addresses are those used by the machine instructions making up an executable program.
The virtual addresses must be translated into physical addresses at run time; a system of translation tables and mapping functions is used in this process.
Address Mapping:
Address mapping translates a virtual address into a physical address during program execution. If the data for a virtual address is in physical memory, the mapping returns the corresponding physical address and signals a memory hit. If not, it signals a memory miss, prompting the system to load the required data into memory.
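A minimal sketch of the hit/miss logic described above, using a Python dictionary as a stand-in for the translation table; the page size and the page-table entries are assumed values for illustration only.

# Toy address translation: virtual address -> (page number, offset) -> physical address.
# Page size and page-table contents are assumed purely for illustration.
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 4: 12}          # virtual page number -> physical frame number

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in page_table:                 # memory hit: page is resident
        return page_table[vpn] * PAGE_SIZE + offset
    # memory miss (page fault): the OS would load the page from disk,
    # update the page table, and retry the translation.
    raise LookupError(f"page fault for virtual page {vpn}")

print(hex(translate(0x1023)))             # hit: virtual page 1 -> frame 3
try:
    translate(2 * PAGE_SIZE)              # miss: virtual page 2 is not resident
except LookupError as e:
    print(e)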
The two virtual memory models are:
(a)Private Virtual Memory:
Each processor has a private virtual address space divided into pages.
Benefits: Smaller address space, page protection, no locking.
Drawback: Synonym problem (different virtual addresses pointing to the
same physical page).
(b)Shared Virtual Memory:
A single, globally shared virtual address space.
Benefits: Unique addresses, larger address spaces, shared resources.
Drawback: Requires locking for protected access; segmentation adds
complexity.
6Q. Hierarchical Memory Technology
Ans: Storage devices such as registers, caches, main memory, disk devices, and
backup storage are often organized as a hierarchy as depicted in Fig.
The memory technology and storage organization at each level are characterized
by five parameters: the access time (ti), memory size (si), cost per byte (ci),
transfer bandwidth (bi), and unit of transfer (xi).
Access time (ti) refers to the round-trip time from the CPU to the ith-level memory.
Memory size (si) is the number of bytes or words in level i.
Cost per byte (ci): the cost of the ith-level memory is estimated by the product ci·si.
Bandwidth (bi) refers to the rate at which information is transferred between adjacent levels.
Unit of transfer (xi) refers to the grain size for data transfer between levels i and i + 1.
Memory devices at a lower level are faster to access, smaller in size, and more
expensive per byte, having higher bandwidth and using a smaller unit of transfer
as compared with those at a higher level.
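The five parameters can be tabulated per level. The sketch below stores rough, assumed order-of-magnitude figures (not values from the text) and derives the per-level cost ci·si, just to show how the parameters trade off against one another.

# Each level: (access time t_i in ns, size s_i in bytes, cost c_i in $/byte).
# All figures are rough, assumed orders of magnitude for illustration only.
hierarchy = {
    "registers":   (0.5,   512,         10.0),
    "cache":       (2.0,   8 * 2**20,   1e-4),
    "main memory": (50.0,  32 * 2**30,  5e-9),
    "disk":        (5e6,   2 * 2**40,   2e-11),
}

for level, (t, s, c) in hierarchy.items():
    print(f"{level:12s} t={t:>10} ns  size={s:>14} B  level cost = ${c * s:,.2f}")
# Moving down the table, access time and size grow while cost per byte falls,
# which is exactly the trade-off the hierarchy is designed around.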
Registers and Caches The registers are parts of the processor; multi-level caches are built either on the processor chip or on the processor board. Register
transfer operations are directly controlled by the processor after instructions are
decoded. Register transfer is conducted at processor speed, in one clock cycle.
The cache is controlled by the MMU and is programmer-transparent.
Main Memory The main memory is sometimes called the primary memory of a
computer system. It is usually much larger than the cache and often
implemented by the most cost-effective RAM chips. The main memory is
managed by an MMU in cooperation with the operating system.
Disk Drives and Backup Storage The disk storage is considered the highest level
of on-line memory. It holds the system programs such as the OS and compilers,
and user programs and their data sets. Optical disks and magnetic tape units are
off-line memory for use as archival and backup storage. They hold copies of
present and past user programs and processed results and files.
Peripheral Technology Besides disk drives and backup storage, peripheral
devices include printers, plotters, terminals, monitors, graphics displays, optical
scanners, image digitizers, output microfilm devices, etc. Some I/O devices are
tied to special-purpose or multimedia applications.
Q6. Explain Advanced processor technology
Ans: Architectural families of modern processors are introduced here. Major
processor families to be studied include the CISC, RISC, superscalar, VLIW,
superpipelined, vector, and symbolic processors. Scalar and vector processors
are for numerical computations. Symbolic processors have been developed for
AI applications.
Design Space of Processors
Processor families can be mapped onto a coordinate space of clock rate versus cycles per instruction (CPI), as illustrated in Fig.
As implementation technology evolves rapidly, the clock rates of various processors have moved from low to higher speeds toward the right of the design space (i.e., increasing clock rates), and processor manufacturers have been trying to lower the CPI (the number of cycles taken to execute an instruction) using innovative hardware approaches.
Two main categories of processors are:
o CISC
o RISC
Under both CISC and RISC categories, products designed for multi-core chips, embedded applications, or for low cost and/or low power consumption tend to have lower clock speeds. High-performance processors must necessarily be designed to operate at high clock speeds. The category of vector processors has been marked VP; vector processing features may be associated with CISC or RISC main processors.
Design space of CISC, RISC, Superscalar and VLIW processors
CISC: High CPI (1–20), complex instructions, high clock rates, positioned in
the upper design space.
RISC: Low CPI (1–2), simple instructions with pipelining, balanced
performance and efficiency.
Superscalar: Extends RISC with parallel execution, achieving lower CPI by
issuing multiple instructions per cycle.
VLIW: Uses long instruction words to schedule multiple operations,
achieving the lowest CPI but with higher cost and complexity.
The effective CPI of a processor used in a supercomputer should be very low,
positioned at the lower right corner of the design space. However, the cost and
power consumption increase appreciably if processor design is restricted to the
lower right corner.
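The pull toward the lower right corner can be quantified with the standard relation T = Ic × CPI / f (execution time = instruction count × cycles per instruction × cycle time). The sketch below compares two hypothetical processors; all of the figures are assumed for illustration.

# CPU time T = Ic * CPI * (1 / f). The instruction counts, CPI values, and
# clock rates below are assumed, hypothetical figures for comparison only.

def cpu_time(instruction_count, cpi, clock_hz):
    return instruction_count * cpi / clock_hz

ic = 1_000_000_000                            # 10^9 instructions in the program
cisc_like = cpu_time(0.7 * ic, 4.0, 2.0e9)    # fewer instructions, higher CPI
risc_like = cpu_time(ic, 1.3, 2.0e9)          # more instructions, lower CPI

print(f"CISC-like: {cisc_like:.3f} s")
print(f"RISC-like: {risc_like:.3f} s")
# Lowering CPI (or raising the clock rate) reduces T, which is why the design
# space pushes toward the lower right corner described above.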
Instruction Pipelines
The execution cycle of a typical instruction includes four phases: fetch, decode,
execute, and write-hack. These instruction phases are often executed by an
instruction pipelines demonstrated in Fig.a
These four phases are frequently performed in a pipeline, or ―assembly line‖
manner, as illustrated on the figure below.
Pipeline Definitions
Instruction pipeline cycle – the time required for each phase to complete its
operation (assuming equal delay in all phases)
Instruction issue latency – the time (in cycles) required between the issuing of
two adjacent instructions
Instruction issue rate – the number of instructions issued per cycle (the degree
of a superscalar)
Simple operation latency – the delay (after the previous instruction)
associated with the completion of a simple operation (e.g. integer add) as
compared with that of a complex operation (e.g. divide).
Resource conflicts – when two or more instructions demand use of the same
functional unit(s) at the same time.
Pipelined Processors
Case 1 : Execution in base scalar processor –
A base scalar processor, as shown in Fig. a and below:
o issues one instruction per cycle
o has a one-cycle latency for a simple operation
o has a one-cycle latency between instruction issues
o can be fully utilized if instructions can enter the pipeline at a rate of one per cycle
CASE 2: If the instruction issue latency is two cycles per instruction, the pipeline is underutilized, as demonstrated in Fig. b and below:
Pipeline underutilization – e.g., an issue latency of 2 between two instructions; the effective CPI is 2.
CASE 3: Poor pipeline utilization – Fig. c and below:
Here the pipeline cycle time is doubled by combining pipeline stages: the fetch and decode phases are combined into one pipeline stage, and execute and write-back are combined into another stage. This also results in poor pipeline utilization, since combining two pipeline stages into one halves the clock rate of the pipeline.
The effective CPI rating is 1 for the ideal pipeline in Fig. a, and 2 for the case in Fig. b. In Fig. c, the clock rate of the pipeline has been lowered by one-half. Underpipelined systems will have higher CPI ratings, lower clock rates, or both.
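A small numerical sketch of the three cases, using an assumed instruction count and base cycle time: Case 2 loses performance through a higher CPI, Case 3 through a halved clock rate.

# Effective CPI and total time for the three pipeline cases above.
# N instructions, base clock cycle of 1 ns; all figures are illustrative.
N = 1000
base_cycle_ns = 1.0

cases = {
    "Case 1: base scalar (issue latency 1)": (1.0, base_cycle_ns),
    "Case 2: issue latency 2":               (2.0, base_cycle_ns),
    "Case 3: two stages combined":           (1.0, 2 * base_cycle_ns),  # clock halved
}

for name, (cpi, cycle_ns) in cases.items():
    total_ns = N * cpi * cycle_ns
    print(f"{name:40s} CPI={cpi:.1f}  cycle={cycle_ns:.1f} ns  time={total_ns:.0f} ns")
# Case 2 doubles the time through a higher CPI; Case 3 doubles it through a
# slower clock, matching the statement that underpipelined systems have higher
# CPI ratings, lower clock rates, or both.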
Data path architecture and control unit of a scalar processor
External Connections: Main memory, I/O controllers, etc., connect to the
external bus.
Control Unit Role: Generates control signals for the fetch, decode, ALU
operation, memory access, and write result phases of instruction
execution.
Control Logic:
o Hardwired Logic: Used in modern RISC processors for faster and
simpler control.
o Microcoded Logic: More common in older CISC processors for
handling complex instruction sets.
Modern Trends: Even modern CISC processors integrate techniques
originally developed for RISC processors to enhance performance.
Instruction-Set Architectures
The instruction set of a computer defines the basic commands that a
programmer can use to program the machine.
Complexity comes from factors such as:
o Instruction Formats: How instructions are structured.
o Data Formats: How data is represented and processed.
o Addressing Modes: Ways to locate data in memory.
o General-Purpose Registers: Small storage areas for quick data access.
o Opcode Specifications: The unique codes for each instruction.
o Flow Control: Instructions for decision-making and controlling the
program’s execution flow.
The ISA is broadly classified into two types: CISC and RISC.
Scalar Processors
CISC Scalar Processor:
Executes scalar data.
Executes integer and fixed-point operations.
Modern scalar processors include both an integer unit and a floating-point unit, and may even have multiple such units.
Based on a complex instruction set, a CISC scalar processor can also use
pipelined design.
The processor may be underpipelined due to data dependences among instructions, resource conflicts, branch penalties, and logic hazards.
RISC Scalar Processor:
Generic RISC processors are called scalar RISC because they are designed to
issue one instruction per cycle, similar to the base scalar processor
Simpler design: RISC gains power by pushing some less frequently used operations into software.
Relies more heavily on a good compiler than a CISC processor does.
Instruction-level parallelism is exploited by pipelining.
RISC pipeline – 5 stages
Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID =
Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write
back). The vertical axis is successive instructions; the horizontal axis is time. So in
the green column, the earliest instruction is in WB stage, and the latest
instruction is undergoing instruction fetch.
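Since the original figure cannot be reproduced here, the short sketch below prints the same kind of staggered IF–ID–EX–MEM–WB chart, with successive instructions on the vertical axis and clock cycles on the horizontal axis; the number of instructions shown is an arbitrary choice.

# Print a staggered five-stage RISC pipeline chart: rows are instructions,
# columns are clock cycles. Five instructions are shown purely for illustration.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
NUM_INSTRUCTIONS = 5

total_cycles = NUM_INSTRUCTIONS - 1 + len(STAGES)
header = "     " + "".join(f"{c:>5}" for c in range(1, total_cycles + 1))
print(header)
for i in range(NUM_INSTRUCTIONS):
    row = [""] * total_cycles
    for s, stage in enumerate(STAGES):
        row[i + s] = stage                # instruction i enters stage s at cycle i+s+1
    print(f"inst{i+1}" + "".join(f"{cell:>5}" for cell in row))
# In any single column (one clock cycle), the oldest instruction is in WB while
# the newest is still being fetched, exactly as described above.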