Com - 612 Exam

The document discusses parallel algorithms for summing the integers in a list, detailing a method that divides the workload among multiple processors to compute partial sums and then combines them for the final result. It also explains the theoretical speedup of k-stage pipelined processors over non-pipelined processors, outlines the architecture and functionality of array processors, and highlights the requirements for effective multiprocessor operating systems. Additionally, it covers the P-RAM model and its assumptions, as well as the differences between multiprocessor and single-processor systems.

Long Questions (16 marks)

1. Describe a parallel algorithm to find the sum of all the integers in a list of integers. Explain
how you would divide the work among multiple processors and how the results would be
combined.
The goal is to divide the work among multiple processors so that they concurrently
compute partial sums of the array, and then combine these partial sums to produce the final
result. The process involves divide-and-conquer and reduction. The general algorithm for this
procedure is as follows:
Input: Array A of size N, number of processors P
Step 1: Divide the array into P parts
    chunk_size = N / P
Step 2: Assign each processor a chunk and compute its partial sum
    for i = 0 to P-1 in parallel:
        start_index = i * chunk_size
        end_index = (i + 1) * chunk_size - 1
        sum_i = sum(A[start_index : end_index])   // each processor computes a partial sum
Step 3: Gather all partial sums into an array S
    S = [sum_0, sum_1, ..., sum_(P-1)]
Step 4: Perform a sequential pass to compute the final sum
    total_sum = sum(S)
Output: total_sum
Assume that we have an array A of 20 integers and that 4 processors are available:
A = [3, 7, 1, 5, 9, 2, 8, 4, 6, 10, 15, 12, 18, 11, 20, 17, 13, 14, 19, 16]
To find the sum of all 20 integers in an array (A[20]) using a parallel algorithm, we can
divide the work among multiple processors as follows:
- Divide the Work: Split the array of 20 integers into subsets based on the number of
  processors available. Each processor will handle a portion of the array.
- Calculate Partial Sums: Each processor computes the sum of its assigned subset of
  integers. For example, if you have 4 processors, the array A can be divided into 4 parts,
  each processor computing the sum of 5 integers.
- Combine Partial Results: After each processor has computed its partial sum, gather
  these partial sums into a list.
- Sequential Summation: Finally, sum up the list of partial sums sequentially to get the
  total sum of all 20 integers.
(1) Divide the Work:
- Processor 1: Handles A[0] to A[4]
- Processor 2: Handles A[5] to A[9]
- Processor 3: Handles A[10] to A[14]
- Processor 4: Handles A[15] to A[19]
(2) Calculate Partial Sums:
- Processor 1 computes: sum1 = 3 + 7 + 1 + 5 + 9 = 25
- Processor 2 computes: sum2 = 2 + 8 + 4 + 6 + 10 = 30
- Processor 3 computes: sum3 = 15 + 12 + 18 + 11 + 20 = 76
- Processor 4 computes: sum4 = 17 + 13 + 14 + 19 + 16 = 79
(3) Combine Partial Results:
- Gather the partial sums into an array: S = [25, 30, 76, 79]
(4) Sequential Summation:
- Sum up the elements of array S sequentially:
  25 + 30 = 55
  55 + 76 = 131
  131 + 79 = 210
By dividing the work among processors, each processor independently computes a
smaller sum, which reduces the overall computational time compared to a single processor
handling all computations sequentially.
After obtaining the partial sums from each processor, combining them into a single list is
straightforward and efficient.
The final step, summing up the list of partial sums sequentially, ensures that all
processors' results are integrated to give the total sum of the original array. This approach
optimizes the computation by leveraging parallel processing capabilities to handle larger data
sets efficiently.
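A minimal runnable sketch of this approach is shown below, assuming Python's multiprocessing module; the names partial_sum, chunks, and S are illustrative only, and the code assumes N is divisible by P, as in the example. It reproduces the partial sums [25, 30, 76, 79] and the total 210.

from multiprocessing import Pool

A = [3, 7, 1, 5, 9, 2, 8, 4, 6, 10, 15, 12, 18, 11, 20, 17, 13, 14, 19, 16]
P = 4                                        # number of worker processes

def partial_sum(chunk):
    return sum(chunk)                        # each worker sums only its own chunk

if __name__ == "__main__":
    chunk_size = len(A) // P                 # assumes len(A) is divisible by P
    chunks = [A[i * chunk_size:(i + 1) * chunk_size] for i in range(P)]
    with Pool(P) as pool:
        S = pool.map(partial_sum, chunks)    # S = [25, 30, 76, 79]
    print(sum(S))                            # 210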

Parallel algorithms are used to solve real-world problems.


Real-World Applications:
(1) Database Sorting: In large databases, where the dataset is enormous, parallel quicksort can
be used to quickly sort data.
(2) Distributed Systems: In cloud-based systems, where data is distributed across multiple
machines, parallel quicksort allows sorting tasks to be distributed among available processors
to achieve faster data retrieval and analysis.
(3) Scientific Computing: In simulations and numerical analysis, large-scale matrix multiplications
are often required. Parallel matrix multiplication helps accelerate these simulations.
(4) Machine Learning: Many machine learning algorithms, such as training neural networks,
require operations involving large matrices (e.g., weight matrices, input-output matrices).
Parallel matrix multiplication speeds up these operations, allowing faster model training.
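As a rough sketch of how such a parallel matrix multiplication might be organized (illustrative only; it uses Python with NumPy and multiprocessing, and the row-block decomposition and all names here are chosen for demonstration rather than taken from any particular library), the rows of the left matrix can be split across workers in the same spirit as the chunked summation above:

import numpy as np
from multiprocessing import Pool

def row_block_product(args):
    # each worker multiplies its block of rows of A by the full matrix B
    a_block, b = args
    return a_block @ b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((400, 300))
    B = rng.random((300, 200))
    blocks = np.array_split(A, 4)                       # one row block per worker
    with Pool(4) as pool:
        parts = pool.map(row_block_product, [(blk, B) for blk in blocks])
    C = np.vstack(parts)                                # reassemble the product
    print(np.allclose(C, A @ B))                        # True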

2. Prove that a processor with a k-stage pipeline can be at most k times faster than a non-pipelined
processor.
Pipelined processor: Instructions are broken down into stages, and multiple instructions
are processed concurrently, each in a different stage.
Non-pipelined processor: Instructions are executed one at a time, completing all stages
before the next instruction begins.
Step 1: Execution Time in a Non-Pipelined Processor
In a non-pipelined processor, each instruction is executed sequentially, meaning the next
instruction starts only after the current one completes.
- Let T be the time taken to execute one instruction.
- For n instructions, the total execution time is:
Tnon-pipelined = nT
Step 2: Execution Time in a k-Stage Pipelined Processor
A k-stage pipeline splits the execution of each instruction into k stages of (ideally) equal
duration, so each stage takes T/k time. The pipeline operates as follows:
1. The first instruction takes k × (T/k) = T time to complete, since it must pass through all k stages.
2. Once the pipeline is filled, one instruction completes every T/k time.
3. Thus, for n instructions, the total execution time is:
Tpipelined = T + (n − 1)(T/k) = (k + n − 1)T/k
For large n, the term k − 1 becomes negligible compared with n, so we approximate:
Tpipelined ≈ nT/k
Step 3: Speedup Calculation
Speedup S is defined as the ratio of execution time on the non-pipelined processor to that on the
pipelined processor:
S = Tnon-pipelined / Tpipelined = nT / [(k + n − 1)T/k] = nk / (k + n − 1)
Since (k + n − 1) ≥ n, the speedup S never exceeds k, and S → k as n → ∞.
Thus, under ideal conditions, a k-stage pipeline can be at most k times faster than a
non-pipelined processor.
However, in real-world scenarios, factors such as pipeline stalls, data hazards, and branch
mispredictions often reduce the actual speedup below this theoretical maximum.
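To get a concrete feel for the formula (a small sketch; the function name and the chosen values of n and k are arbitrary), the speedup S = nk / (k + n − 1) can be evaluated in Python to show it approaching, but never exceeding, k:

def pipeline_speedup(n, k):
    # ideal speedup of a k-stage pipeline over a non-pipelined processor, as derived above
    return n * k / (k + n - 1)

for n in (10, 100, 10000):
    print(n, round(pipeline_speedup(n, k=5), 3))   # 3.571, 4.808, 4.998 -> approaches k = 5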
3. ARRAY PROCESSORS
A synchronous array of parallel processors is called an array processor, which consists of
multiple processing elements (PEs) under the supervision of one control unit (CU). An array
processor can handle single instruction and multiple data (SIMD) streams. SIMD machines are
specifically designed to perform vector computations over matrices and arrays of data.

SIMD computers appear in two basic architectural organizations: array processors, using
random-access memory, and associative processors, using content-addressable (or associative) memory.

SIMD Computer Organization

In general, an array processor assumes the configuration shown in the figure. The
processor is structured with N synchronized PEs, all of which are under the control of one CU.
Each PEi is essentially an ALU with attached working registers and a local memory, PEMi, which
collectively form the storage space for the distributed data. The CU has its own main memory for
the storage of programs. The operating system and user programs are executed under the control
of the CU. The user programs are loaded into the CU memory from an external source. The
function of CU is to decode all the instructions and determine where the decoded instructions
should be executed. Scalar and control-type instructions are executed directly by the CU. Vector
instructions are broadcast to the PEs for distributed execution to achieve spatial parallelism
through duplicated arithmetic units.

All the PEs perform the same function synchronously in a lock-step fashion under the
command of the CU. Vector operands are distributed to the PEMs before parallel execution in
the array of PEs. The distributed data can be loaded into the PEMs from an external source via
the system data bus, or via the CU in a broadcast mode using the control bus.

An array processor is normally interfaced to a host computer through the control unit.
The host computer is a general-purpose machine which serves as a coordinator of the entire
system, consisting of the host system and the processor array. The functions of a host computer
include resource management and peripheral and I/O supervision. The control unit of the
processor array directly supervises the execution of programs.
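The lockstep behaviour described above can be mimicked in software as a rough sketch (purely illustrative; a real array processor does this in hardware, and the operand values and the interleaved data distribution used here are made up):

# Toy model of SIMD lockstep execution: each "PE" holds a slice of the vector
# operands in its local memory (PEM), and all PEs apply the same broadcast
# instruction (an elementwise add) to their own local data.
N_PE = 4
A = [3, 7, 1, 5, 9, 2, 8, 4]
B = [1, 1, 2, 2, 3, 3, 4, 4]

# distribute the vector operands over the PE local memories (interleaved)
pem_a = [A[i::N_PE] for i in range(N_PE)]
pem_b = [B[i::N_PE] for i in range(N_PE)]

# the CU broadcasts one vector instruction; every PE executes it in lockstep
pem_c = [[x + y for x, y in zip(a, b)] for a, b in zip(pem_a, pem_b)]
print(pem_c)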
4. The additional requirements of a multiprocessor operating system
The additional requirements of a multiprocessor operating system are enumerated below
(An operating system that fails to perform well in these respects tends to negate other advantages
which are associated with multiprocessing):
(1) Load balancing
The operating system must utilize the resources efficiently. This is commonly expressed
in terms of achieving a uniform balance of loads across the processors. The operating system
should schedule the subtasks such that there are no idle resources including the processors.
(2) Scheduling cooperating processes
Parallel programs which consist of concurrently executing tasks must be scheduled such
that collectively they are able to use the resources in the machine required to solve the given
problem. Scheduling half the subtasks of one parallel program and half the subtasks of another
parallel program which cannot cooperate can be wasteful. If a parallel program needs all its
subtasks to be running and cooperating, then scheduling only half of the tasks can lead to
starvation.
(3) Graceful degradation in case of failure of one of the resources:
Given that a parallel machine has multiple resources of the same kind, one must expect a
higher degree of fault tolerance from it. Failure of one of its resources should not result in a
catastrophic system crash. The operating system should be able to reschedule the task that had
been running on the failed resource and continue the parallel program. Ideally, there should be
only a fractional degradation in the performance of the parallel machine.
(4) Communication schemes:
Most parallel programs need to share data and intermediate results across subtasks during
the processing towards the solution of the problem. To achieve effective cooperation among the
subtasks of a parallel program, the operating system must provide adequate facilities for
communication between tasks. These facilities vary depending on whether the machine is of a
shared memory type or of a distributed memory type.

(5) Synchronization mechanisms:


To ensure integrity of the shared data across the subtasks of a parallel program,
synchronization between tasks is required. The shared data must be accessed under mutual
exclusion (implemented using a mechanism like semaphores). The tasks may need to wait till
some state is reached across all the tasks of the parallel program. The operating systems need to
provide signaling mechanisms for such synchronization requirements.
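As a small sketch of such a mechanism (using Python's threading module; the shared counter and the worker function are hypothetical examples, not part of the text above), a binary semaphore can enforce mutual exclusion on shared data:

import threading

counter = 0                          # shared data updated by several tasks
mutex = threading.Semaphore(1)       # binary semaphore used for mutual exclusion

def worker(increments):
    global counter
    for _ in range(increments):
        with mutex:                  # critical section: only one task at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                       # 400000; without the semaphore, updates may be lost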

5. P-RAM
The random access machine (RAM), not to be confused with random access memory, is a
model of a one-address computer. A RAM consists of a memory, a read-only input tape, a write-
only output tape, and a program. The program is not stored in the memory and cannot be
modified. The input tape consists of a sequence of integers. Every time an input value is read, the
input head advances one square.

The P-RAM (parallel RAM) consists of a control unit, global memory, and an unbounded
set of processors, each with its own private memory. Every processor has a unique index; this
value can be used to address the processor for passing signals and interrupts. All the processors
share access to global memory which is unbounded in capacity. Execution of instructions can be
started on any of the processors at any time, starting at any location of memory (its local or
global memory).

A processor can cause the start of execution of some task on some other processor
through the signalling mechanism. This signalling mechanism can be available through hardware
or software, or both. This model represents the capabilities of an MIMD machine for parallel
computations. Several simplifying assumptions are made while considering the development of
algorithms for such a machine. They are:

(1) There is no limit on the number of processors in the machine.


(2) Any memory location is uniformly accessible from any processor. (Uniformly in terms of
time required for the access.)
(3) There is no limit on the amount of shared memory in the system.
(4) Resource contention is absent. A processor does not have to wait for reading a memory
location till the other processor has finished accessing the location.
(5) The programs written on these machines are, in general, of type MIMD. Certain special cases
such as SIMD may also be handled in such a framework.

While mapping an algorithm developed on such a machine to real machines, some
constraints are applied to modify the implementation aspects of the algorithm. Formally, such
constraints are expressed in the following terms:
(1) EREW (Exclusive Read Exclusive Write): Read or write conflicts are not allowed.
(2) CREW (Concurrent Read Exclusive Write): Concurrent reading is allowed; however, write
conflicts are not allowed.
(3) CRCW (Concurrent Read Concurrent Write): Concurrent reading and concurrent writing are
allowed. Such a model, in general, is of little practical value.
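As an illustration of an algorithm written for this model (a sketch only, simulated sequentially in Python because the P-RAM is an abstract machine; the function name pram_sum is made up), an EREW-style tree reduction sums n values in O(log n) parallel steps, with every processor reading and writing distinct memory cells in each step:

import math

def pram_sum(a):
    # Simulated EREW P-RAM tree reduction: in step s, "processor" i adds
    # a[i + 2**s] into a[i]. The index pairs within a step are disjoint,
    # so on a P-RAM all of them could execute concurrently without conflicts.
    a = list(a)
    n = len(a)
    for s in range(int(math.ceil(math.log2(n)))):
        stride = 2 ** s
        for i in range(0, n - stride, 2 * stride):
            a[i] += a[i + stride]
    return a[0]

print(pram_sum([3, 7, 1, 5, 9, 2, 8, 4]))   # 39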

6. The Configuration of OS (see the handout provided by Tr. YYA)


Short Questions (8 marks)

1. What is a multiprocessor system, and how does it differ from a single-processor system?

A multiprocessor system is a computer system with two or more processors that share a
common physical memory and work together to execute tasks. These processors can execute
multiple instructions simultaneously, making them suitable for parallel processing tasks.

A single-processor system, on the other hand, contains only one processor that executes
one instruction at a time. While it can multitask using scheduling techniques, it does not achieve
true parallelism like a multiprocessor system.

Differences:

Aspect               | Multiprocessor System                  | Single-Processor System
Number of processors | Multiple                               | One
Parallelism          | True parallelism                       | Simulated through multitasking
Performance          | Higher, due to parallel tasks          | Limited by sequential execution
Cost                 | Higher                                 | Lower
Applications         | High-performance computing, databases  | Personal computers, basic tasks

Example:

- A multiprocessor system: a supercomputer used for weather prediction (e.g., IBM Summit).
- A single-processor system: a basic laptop performing web browsing.
2. Explain how multiple functional units contribute to parallelism in sequential machines.

Multiple functional units in a sequential machine enable the system to perform different
operations (e.g., addition, multiplication) simultaneously, thereby increasing throughput. These
units include Arithmetic Logic Units (ALUs), Floating-Point Units (FPUs), and other specialized
processors.

Functions:

- A CPU with multiple functional units divides tasks between these units.
- While one unit processes an addition operation, another performs multiplication or a
  memory fetch.

Example:

- Consider a CPU executing two instructions:
  1. Addition (handled by the ALU).
  2. Data fetching from memory (handled by a Memory Management Unit).
- The tasks are performed simultaneously instead of sequentially, reducing execution time.

Multiple functional units within a processor core enable parallelism in sequential
machines by allowing different operations to be executed simultaneously on separate data
streams, effectively overlapping the execution of instructions. This exploits instruction-level
parallelism (ILP) within a single core: several units can work on different parts of a task at the
same time, even though the instructions are issued sequentially.

Example
Imagine a processor with separate functional units for integer addition, floating-point
multiplication, and logical operations.
If an instruction sequence includes an addition, a multiplication, and a logical AND, these
operations can be executed simultaneously on their respective units, significantly reducing the
overall execution time compared to a single functional unit processing them sequentially.

3. Real-world applications (covered under Question 1 above).


4. Data Parallel Model

The programming model for data-parallel SIMD processors is an extension of
sequential programming. For example, Fortran 90 is specifically tailored for data parallelism.
The language compiler for a specific machine must translate the source code into executable
object code, which will exploit the parallelism in the hardware.

Data-parallel programs require the use of pre-distributed data sets. Thus the choice of
parallel data structure plays a significant role in data-parallel programming.

Synchronization of data-parallel programs is done at compile time rather than at run
time. Hardware synchronization is enforced by the control unit to carry out lockstep execution of
SIMD programs. The compiler must be aware of the underlying interconnection topology of the PEs
to generate optimal code for the array processor.
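For comparison with the Fortran 90 style mentioned above (a sketch in Python with NumPy, used here only because it is the language of the other examples in this document; the array names are arbitrary), a data-parallel program states one whole-array operation and leaves the elementwise work to the compiler or runtime:

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
c = 2.0 * a + b        # one whole-array expression, analogous to Fortran 90's C = 2.0*A + B
print(c[:5])           # [1. 3. 5. 7. 9.]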

5. Parallel processing

Parallel processing is the simultaneous execution of multiple tasks or instructions to
improve computational speed and efficiency. It is widely used in high-performance computing,
AI, and large-scale data processing. The types of parallelism encountered in computing include
look-ahead, pipelining, vectorization, concurrency, data parallelism, partitioning, interleaving,
overlapping, replication, time sharing, space sharing, multitasking, multiprogramming,
multithreading, and distributed computing.
The most well-known classification of computer architectures, Flynn's classification, is as follows:
(1) SISD: This is the conventional von Neumann computer. Only a single stream of instructions
and a single stream of data is involved.

(2) SIMD: Here we have only one instruction stream, but multiple data streams are active at a
time. This is typically done by replicating arithmetic units in the CPU and allowing the
different units to refer to different operands, but follow a common instruction. This model is
known as data parallelism. Vector computers and Array processors are examples of this
class.

(3) MISD: This is the opposite of SIMD. By definition this means multiple instructions on a
single stream of data. This appears counter intuitive and hence many people believe that
MISD is an impossible class. However, there are some who would place the systolic array
architecture in this class.
(4) MIMD: The other end of the spectrum, of course, would be to let loose both the streams.
There are multiple instructions and multiple data. An arbitrary combination of the instruction
and data streams is possible. This is the most general model in Flynn's classification.
