PDC
Q. Which of the following is not a classification in Flynn's taxonomy?
A. SIMD
B. MIMD
C. MISD
D. BIOS
Answer: D
Q. In a parallel computing system, which component is responsible for assigning tasks to processors or cores?
A. Compiler
B. Scheduler
C. Loader
D. Decoder
Answer: B
Q. Who proposed the taxonomy that classifies computer architectures by their instruction and data streams?
A. Alan Turing
B. Seymour Cray
C. Michael Flynn
Answer: C
Q8. Which of the following represents the earliest form of parallel computing?
A. Supercomputers
B. Multicore processors
C. Vector processors
Answer: C
Q9. Which technological development brought parallel computing into mainstream consumer computing?
A. Quantum computing
B. Multicore CPUs
C. Vacuum tubes
D. Mechanical computers
Answer: B
Q10. What role did supercomputers like the Cray-1 play in the history of parallel computing?
Answer: C
In parallel computing, granularity refers to the size of the computational tasks being performed
concurrently. It essentially measures the ratio of computation to communication, with coarse-grained
parallelism involving larger tasks and less frequent communication, and fine-grained parallelism
involving smaller tasks and more frequent communication.
More Details:
Coarse-grained parallelism:
This involves breaking a problem into a relatively small number of large, independent tasks that can be
executed concurrently. Communication and synchronization between these large tasks are less
frequent.
Fine-grained parallelism:
This involves breaking a problem into a large number of very small, independent tasks that can be
executed concurrently. Communication and synchronization between these smaller tasks are more
frequent.
Impact on performance:
The choice of granularity can significantly impact performance. Fine-grained parallelism can lead to
higher communication overhead and potential performance bottlenecks, while coarse-grained
parallelism may result in underutilization of resources or load imbalance.
The ideal granularity for a given problem depends on factors such as the algorithm's characteristics, the
communication overhead, and the available resources. Finding the optimal balance between
computation and communication is crucial for maximizing parallel performance.
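To make the contrast concrete, here is a minimal C/OpenMP sketch (an illustrative example, not from the original notes; the array sizes and chunk settings are arbitrary). The first loop uses coarse-grained division, giving each thread one large block of rows with a single synchronization point at the end; the second splits the same work into many tiny dynamically scheduled pieces, trading extra scheduling overhead for better load balance.

#include <omp.h>
#include <stdio.h>

#define ROWS 512
#define COLS 512

static double img[ROWS][COLS];

int main(void) {
    /* Coarse-grained: one contiguous block of rows per thread
       (few, large tasks; threads synchronize only once, at loop end). */
    #pragma omp parallel for
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            img[r][c] = r + c;

    /* Fine-grained: element-level work units handed out dynamically
       (many tiny tasks; more scheduling work per unit of computation). */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            img[r][c] *= 0.5;

    printf("%f\n", img[ROWS - 1][COLS - 1]);
    return 0;
}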
A vector processor, also known as an array processor, is a central processing unit (CPU) designed to
efficiently perform operations on large one-dimensional arrays of data called vectors. It's a type of
parallel processor that executes instructions on multiple data elements simultaneously, unlike scalar
processors which operate on individual data points.
Parallelism:
Vector processors exploit data-level parallelism: a single operation is applied to many data
elements at once, rather than each element being processed one at a time.
Vector Instructions:
They have instructions designed to operate on entire vectors, treating them as a single unit.
SIMD:
A core principle of vector processing is SIMD (Single Instruction, Multiple Data), where a single
instruction operates on multiple data elements simultaneously.
Pipeline:
Vector processors often employ pipelining to achieve fine-grained parallelism, hide memory
latency, and amortize control overhead.
Applications:
They are well-suited for data-intensive applications like image processing, scientific simulations, and
artificial intelligence.
Scalar vs. vector processors:
Scalar processors operate on individual data elements, while vector processors operate on entire vectors.
Scalar instructions are designed for single-data-element operations, while vector instructions are
optimized for parallel operations on vectors.
Scalar processors typically have a simpler architecture, whereas vector processors often include
specialized hardware for vector processing.
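As a small illustration (an assumed example, not from the notes), the same element-wise addition can be written as a plain scalar loop or hinted to the compiler for SIMD execution with OpenMP's simd directive, so that one instruction covers several elements at a time.

#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Scalar view: conceptually one addition per instruction. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Vector/SIMD view: the simd hint lets the compiler emit
       vector instructions that add several elements at once. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("%f\n", c[N - 1]);
    return 0;
}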
Typical application areas include image processing, scientific simulations, and artificial intelligence.
Supercomputers: early vector processors were common in supercomputers designed for high-performance computing.
Increased Performance:
Vector processing can significantly improve performance for data-intensive tasks by leveraging
parallelism.
Efficiency:
They can execute operations on large datasets more efficiently than scalar processors.
Simplicity:
Vector instructions can often simplify code and reduce the number of instructions needed.
Complexity: Programming for vector processors can be complex, requiring specialized knowledge.
Limited Applicability: Not all applications are suitable for vector processing, as some tasks are inherently
sequential.
Multicore Processors:
Multiple Cores:
A multicore processor contains multiple physical processing units (cores) on a single
chip.
Independent Execution:
Each core can execute instructions independently, enabling the processor to perform
multiple tasks or threads simultaneously.
Improved Performance:
This simultaneous execution allows for faster processing of multiple tasks, leading to
improved overall performance and responsiveness.
Multithreading and Parallel Processing:
Multicore processors are well-suited for tasks that can be broken down into multiple
smaller tasks that can be processed in parallel by different cores, such as video
editing, rendering, or scientific simulations.
Efficiency:
While each core may not run as fast as a single-core processor, the ability to process
multiple tasks concurrently makes multicore processors more efficient for various
workloads.
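A minimal POSIX-threads sketch of this idea (illustrative only; the task functions are made up for the example): two unrelated tasks run as separate threads, which a multicore CPU can place on different cores and execute at the same time.

#include <pthread.h>
#include <stdio.h>

/* Two unrelated tasks that a multicore CPU can run on different cores. */
void *count_task(void *arg) {
    long sum = 0;
    for (long i = 0; i < 100000000L; i++) sum += i;
    printf("count_task done, sum = %ld\n", sum);
    return NULL;
}

void *message_task(void *arg) {
    printf("message_task running concurrently\n");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, count_task, NULL);
    pthread_create(&t2, NULL, message_task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}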
In parallel computing, a scheduler manages and distributes tasks across
multiple processors or cores to improve processing speed and efficiency. It
determines which tasks are executed, when, and on which processor, aiming
to minimize overall execution time and optimize resource utilization. Different
scheduling algorithms exist, each with its strengths and weaknesses,
impacting performance and fairness.
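As a concrete illustration (an assumed OpenMP sketch, not from the notes), the schedule clause selects how the runtime scheduler hands loop iterations to threads; dynamic scheduling helps when iteration costs vary, at the price of a little extra scheduling overhead.

#include <omp.h>
#include <stdio.h>

#define N 1000

/* Iterations with uneven cost: later iterations do more work. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 100; k++) x += k * 1e-9;
    return x;
}

int main(void) {
    double total = 0.0;

    /* schedule(static) would split iterations into fixed blocks up front;
       schedule(dynamic, 16) lets idle threads grab the next 16-iteration
       chunk, balancing the uneven load at some scheduling cost. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}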
Introduction to Parallel Computing
Q1. In a shared memory parallel system, which of the following problems is most likely to occur
when multiple threads access the same memory location?
A. Deadlock
B. Starvation
C. Race condition
D. Paging fault
Answer: C
Q2. Which model of parallelism is best suited for problems with large data sets but relatively
simple computations on each data element?
A. Task Parallelism
B. Instruction-Level Parallelism
C. Data Parallelism
D. Bit-Level Parallelism
Answer: C
Q3. Which of the following scenarios would benefit the most from fine-grained parallelism?
A. Performing I/O operations in a database
B. Rendering frames in a video game engine
C. Weather prediction models using matrix computations
D. Batch processing of large logs
Answer: C
Q6. Which of the following best describes the evolution from vector processors to massively
parallel processors (MPPs)?
A. From control-driven to data-driven execution
B. From MIMD to SIMD models
C. From centralized memory to cache-based memory
D. From data-parallel to instruction-level parallelism
Answer: A
Q8. Which architectural advancement in the early 2000s led to widespread use of parallel
computing in consumer devices?
A. Introduction of FPGAs
B. Launch of GPU co-processors for AI
C. Advent of multicore CPUs
D. Rise of cloud computing
Answer: C
Q9. The introduction of GPU computing as a parallel computing paradigm is associated with
which company and architecture?
A. Intel and Itanium
B. IBM and PowerPC
C. AMD and Bulldozer
D. NVIDIA and CUDA
Answer: D
Q11. In parallel computing, which of the following primarily affects the efficiency of
synchronization between threads?
A. Cache coherency
B. Network topology
C. Thread stack size
D. Instruction pipelining
Answer: A
Cache coherence ensures a consistent view of shared memory, which is crucial during synchronization.
Q14. Which type of parallelism is commonly used in GPU programming models like CUDA?
A. Task parallelism
B. Thread-level parallelism
C. Data-level parallelism
D. Memory-level parallelism
Answer: C
CUDA is built on SIMD-like data-parallel execution.
Q15. In multithreaded parallel programming, which method is commonly used to avoid race
conditions?
A. Fork-Join Model
B. Locks and Mutexes
C. Memory paging
D. Static Scheduling
Answer: B
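A minimal pthreads sketch of option B (illustrative, not from the notes): a mutex serializes access to a shared counter so that concurrent increments cannot interleave and produce a race condition.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* enter critical section */
        counter++;                   /* protected shared update */
        pthread_mutex_unlock(&lock); /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}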
Q16. Which of the following early computers first implemented pipelining, a concept critical to
later parallel architectures?
A. ENIAC
B. Cray-1
C. IBM System/360
D. CDC 6600
Answer: D
CDC 6600 introduced pipelining for instruction execution.
Q20. What major shift allowed parallel computing to enter the mainstream consumer market?
A. Transition from mechanical to electronic computers
B. Inclusion of parallelism in OS scheduling
C. Development of VLIW architectures
D. Integration of multicore CPUs in personal computers
Answer: D
3. Types of Parallelism
Q4. In ILP, what limits the extent to which instructions can be executed in parallel?
A. Bit width
B. Data dependencies
C. Clock frequency
D. RAM size
Answer: ✅ B
Q6. Which of the following applications is most suited for data parallelism?
A. Web crawling
B. Sorting distributed files
C. Image processing (e.g., applying a filter to all pixels)
D. Compiling source code
Answer: ✅ C
🔹 4.1 SISD
🔹 4.2 SIMD
🔹 4.3 MISD
🔹 4.4 MIMD
Q1. Which type of parallelism would be least effective in a scenario involving heterogeneous,
independent tasks with minimal data overlap?
A. Data Parallelism
B. Task Parallelism
C. Bit-Level Parallelism
D. Instruction-Level Parallelism
Answer: ✅ A
Data parallelism assumes the same operation on chunks of data — not suitable for
heterogeneous tasks.
Q2. Which factor is most critical in achieving effective instruction-level parallelism in modern
processors?
A. Clock speed
B. Instruction pipelining with hazard detection and branch prediction
C. Size of shared memory
D. Number of threads
Answer: ✅ B
Q3. In pipeline parallelism, which of the following scenarios would most likely reduce
throughput significantly?
A. All stages of the pipeline are equally balanced
B. One stage takes significantly longer than others (bottleneck)
C. Static scheduling of tasks
D. Increasing instruction cache
Answer: ✅ B
Q4. Bit-level parallelism can result in significant performance improvement only when:
A. Tasks are independent
B. The CPU supports parallel ALUs
C. Operations are inherently word-based (e.g., arithmetic on large integers)
D. Multiple processors are available
Answer: ✅ C
Q5. Which combination of parallelism types is typically used in modern GPUs for maximum
performance?
A. Bit-level and instruction-level
B. Task-level and pipeline
C. Data-level and instruction-level
D. Data-level and task-level
Answer: ✅ C
GPUs use SIMD-like data parallelism and deep instruction pipelines.
Q6. Which Flynn category would best describe a distributed simulation system where each
node executes a different part of the simulation using its own code and data?
A. SISD
B. SIMD
C. MISD
D. MIMD
Answer: ✅ D
Q7. A radar signal processing system that applies multiple filters (algorithms) on the same
stream of incoming data is closest to:
A. SIMD
B. MISD
C. MIMD
D. SISD
Answer: ✅ B
This is one of the very rare real-world examples that approximate MISD.
Q9. Which of the following is a fundamental assumption in SISD architecture that does not
apply to MIMD?
A. Uniform instruction stream
B. Multiple ALUs
C. Multi-core processor
D. Synchronization primitives
Answer: ✅ A
Q1. Which architecture in Flynn's taxonomy is most suitable for applications with highly
regular data structures and operations (e.g., vector operations)?
A. SISD
B. SIMD
C. MISD
D. MIMD
Answer: ✅ B
Q2. Which of the following is a valid difference between SIMD and MIMD?
A. SIMD supports instruction-level concurrency, while MIMD does not
B. SIMD executes different instructions across all cores, MIMD executes the same
C. SIMD requires data to be partitioned identically, MIMD does not
D. MIMD cannot scale beyond 4 processors
Answer: ✅ C
Q3. Which Flynn architecture is considered mostly theoretical with very few practical
implementations?
A. SISD
B. SIMD
C. MISD
D. MIMD
Answer: ✅ C
Q4. A multicore CPU where each core executes a different thread independently follows which
Flynn category?
A. SISD
B. SIMD
C. MISD
D. MIMD
Answer: ✅ D
Q10. Which of the following best describes the memory in a distributed memory system?
A. All processors have access to a common physical memory
B. Each processor accesses all memory locations uniformly
C. Each processor has its own private local memory
D. All memory operations are atomic
Answer: ✅ C
Q11. A disadvantage of distributed memory systems is:
A. Cache coherence
B. Synchronization primitives
C. Programming complexity due to explicit communication
D. Limited memory bandwidth
Answer: ✅ C
Q12. Which parallel computing system would be most scalable for extremely large data sets and
thousands of processors?
A. SISD
B. Shared memory system
C. Distributed memory system
D. SIMD-based GPU
Answer: ✅ C
Q1. Which architecture type would suffer most from branch divergence in control flow?
A. SISD
B. SIMD
C. MIMD
D. MISD
Answer: ✅ B
SIMD requires all processing elements to follow the same instruction path; divergence reduces
efficiency.
Q2. In MIMD systems, synchronization mechanisms like barriers are essential because:
A. All processors execute the same instruction
B. Processors share a single clock
C. Tasks may progress at different speeds
D. Memory is distributed equally
Answer: ✅ C
Independent instruction streams lead to timing mismatches that must be synchronized.
Q3. Which parallel architecture model provides the highest flexibility in heterogeneous task
execution with asynchronous processing?
A. SIMD
B. MIMD
C. MISD
D. SISD
Answer: ✅ B
Q4. MISD systems, though rare, are best theoretically suited for:
A. Image processing
B. Signal redundancy and fault tolerance
C. Data-parallel matrix multiplication
D. Task-based load balancing
Answer: ✅ B
Q5. Which of the following systems can dynamically switch between SIMD and MIMD
modes, depending on workload?
A. Superscalar processors
B. Clustered GPUs
C. Heterogeneous hybrid architectures (e.g., modern CPUs + GPU cores)
D. VLIW systems
Answer: ✅ C
Q7. Which statement best highlights the primary scalability bottleneck in shared memory
systems?
A. High latency of message passing
B. Complex memory address translation
C. Contention for shared resources and synchronization overhead
D. Limited support for branch prediction
Answer: ✅ C
Q9. A high-performance computing system uses multiple nodes, each with multiple cores and
private memory, communicating via MPI. This architecture is best described as:
A. Shared memory system
B. SIMD system
C. Distributed memory system
D. MISD system
Answer: ✅ C
Q10. Which of the following best differentiates shared and distributed memory systems in terms
of programming model complexity?
A. Shared memory systems require explicit synchronization, distributed do not
B. Distributed memory systems are generally easier to program
C. Shared memory uses automatic communication between threads
D. Distributed memory requires explicit communication and data partitioning
Answer: ✅ D
Q11. Hybrid systems combining shared and distributed memory (e.g., clusters of multicore
machines) are designed to:
A. Limit the use of synchronization
B. Reduce cost by avoiding interconnects
C. Combine fast intra-node communication with scalable inter-node communication
D. Prevent data races entirely
Answer: ✅ C
Q12. Which of the following is NOT an issue in distributed memory systems?
A. Deadlock due to message dependencies
B. Synchronization of shared variables
C. Load balancing across nodes
D. Communication overhead
Answer: ✅ B
(Shared variables don’t exist in distributed memory; all communication is explicit.)
Message Passing Model
Concept: Each process has its own local memory; processes communicate via explicit
messages.
Communication: Through send/receive operations (e.g., MPI_Send, MPI_Recv).
Used in: Distributed memory systems (e.g., clusters, supercomputers).
Pros: Scales well across many nodes.
Cons: Requires careful handling of message synchronization and data distribution.
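A minimal MPI sketch of the model (illustrative; run with two processes, e.g. mpirun -np 2): rank 0 sends an integer to rank 1 with MPI_Send, and rank 1 receives it with MPI_Recv.

#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point message passing: rank 0 sends, rank 1 receives. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}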
Hybrid Model
Concept: Combines two or more models (e.g., OpenMP + MPI or CUDA + MPI).
Example: MPI is used across nodes, OpenMP or CUDA used within each node.
Used in: HPC systems with multi-core CPUs + GPUs across a cluster.
Pros: Maximizes hardware utilization, scalable and efficient.
Cons: Complex to program and debug due to multiple layers of parallelism.
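A minimal hybrid MPI + OpenMP sketch (illustrative only): OpenMP threads compute a partial sum inside each process, and MPI_Reduce combines the partial sums across processes or nodes.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model: MPI between processes (nodes), OpenMP threads within each. */
int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Intra-node, shared-memory parallelism with OpenMP threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    /* Inter-node communication: combine partial sums via MPI. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}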
Q1. Which of the following best describes the primary challenge of combining MPI and CUDA
in a hybrid system?
A. Incompatibility of programming languages
B. Lack of GPU support in MPI
C. Managing data transfers between GPU memory and other nodes
D. Threads cannot be created inside CUDA
Answer: ✅ C
Q2. In OpenMP, which clause helps prevent race conditions by allowing each thread to maintain
its own copy of a variable?
A. shared
B. private
C. reduction
D. firstprivate
Answer: ✅ B
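A small OpenMP sketch of the private clause (illustrative, not from the notes): each thread works on its own copy of tmp, so there is no race on the shared variable.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int tmp = -1;   /* would be a shared scratch variable without 'private' */

    /* Each thread gets its own copy of tmp, so threads cannot race on it
       while computing their thread-local results. */
    #pragma omp parallel private(tmp)
    {
        tmp = omp_get_thread_num() * 10;
        printf("thread %d has tmp = %d\n", omp_get_thread_num(), tmp);
    }

    printf("after the region, the original tmp is still %d\n", tmp);
    return 0;
}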
Q3. Which MPI function is typically used to gather data from all processes to a single
process?
A. MPI_Scatter
B. MPI_Bcast
C. MPI_Reduce
D. MPI_Gather
Answer: ✅ D
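A minimal MPI_Gather sketch (illustrative; the receive buffer size assumes at most 64 processes): each rank contributes one value and rank 0 collects them in rank order.

#include <mpi.h>
#include <stdio.h>

/* Every rank contributes one int; rank 0 collects them all in order. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;   /* this process's local result */
    int all[64];              /* receive buffer, used on the root only */

    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("value from rank %d: %d\n", i, all[i]);

    MPI_Finalize();
    return 0;
}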
Q4. What is the role of CUDA’s warp in data-parallel execution?
A. A synchronization barrier
B. A group of 32 threads executed in lock-step
C. A GPU memory space
D. A function for atomic operations
Answer: ✅ B
Q5. What distinguishes a hybrid parallel application from a traditional parallel program?
A. It only uses OpenMP for GPU parallelism
B. It runs on a single CPU
C. It combines two or more models, like OpenMP for shared memory and MPI for distributed
memory
D. It avoids synchronization completely
Answer: ✅ C
Q10. What is the main limitation of using OpenMP for large-scale distributed memory systems?
A. Too much code complexity
B. Limited compiler support
C. It only works on shared memory, not across nodes
D. It requires GPU support
Answer: ✅ C
Q3. Which of the following is a correct advantage of the pipeline model in parallel computing?
A. It removes the need for task synchronization
B. It increases sequential bottlenecks
C. It improves throughput by overlapping computations
D. It works only in distributed memory systems
Answer: ✅ C
Q4. What is the main role of a barrier in a parallel program?
A. Allocate shared memory
B. Lock a shared variable
C. Prevent race conditions by restricting communication
D. Block threads until all have reached a certain point
Answer: ✅ D
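A small OpenMP sketch of a barrier (illustrative, not from the notes): no thread begins phase 2 until every thread has finished phase 1.

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: phase 1 done\n", id);

        /* No thread proceeds past this point until all threads reach it. */
        #pragma omp barrier

        printf("thread %d: phase 2 starts\n", id);
    }
    return 0;
}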
Q5. Which synchronization mechanism allows multiple readers or one writer at a time?
A. Spinlock
B. Mutex
C. Semaphore
D. Read-Write Lock
Answer: ✅ D
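A minimal pthreads read-write lock sketch (illustrative only): readers take the lock in shared mode and may overlap with each other, while the writer takes it exclusively.

#include <pthread.h>
#include <stdio.h>

static int shared_value = 0;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

void *reader(void *arg) {
    pthread_rwlock_rdlock(&rwlock);   /* many readers may hold this at once */
    printf("reader sees %d\n", shared_value);
    pthread_rwlock_unlock(&rwlock);
    return NULL;
}

void *writer(void *arg) {
    pthread_rwlock_wrlock(&rwlock);   /* writer gets exclusive access */
    shared_value++;
    pthread_rwlock_unlock(&rwlock);
    return NULL;
}

int main(void) {
    pthread_t r1, r2, w;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r1, NULL, reader, NULL);
    pthread_create(&r2, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r1, NULL);
    pthread_join(r2, NULL);
    return 0;
}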
Q7. In shared memory communication, which of the following is the most common issue?
A. Lack of message delivery
B. Deadlock due to message order
C. Race conditions when accessing shared data
D. Data redundancy
Answer: ✅ C
Q2. Which of the following would most likely lead to a pipeline stall in pipeline parallelism?
A. All stages having equal workload
B. No inter-stage dependencies
C. One stage becoming a bottleneck
D. Perfect task balancing
Answer: ✅ C
Explanation: A bottleneck stage delays all following stages, reducing throughput.
Q5. In shared-memory systems, which of the following does NOT improve synchronization
efficiency?
A. Reducing critical section size
B. Replacing fine-grained locks with one global lock
C. Using lock-free data structures
D. Applying barriers only where required
Answer: ✅ B
Explanation: A global lock increases contention and reduces concurrency.
Q6. Message passing systems are less prone to race conditions than shared memory systems
because:
A. They don’t use locks
B. Each process has private memory, and explicit communication ensures isolation
C. They support atomic operations
D. They don’t require synchronization
Answer: ✅ B
Q7. Which of the following best describes the difference between a barrier and a semaphore?
A. A barrier blocks one thread; a semaphore blocks all
B. A barrier enforces collective synchronization, a semaphore regulates access to resources
C. A barrier is OS-level, semaphore is user-level
D. Semaphores are only used in shared memory, barriers in message-passing
Answer: ✅ B
Q9. When using message passing for communication, what factor most significantly affects
scalability?
A. Number of barriers
B. Instruction-level parallelism
C. Network bandwidth and latency
D. GPU core count
Answer: ✅ C
Q1. If a program has a serial portion of 20%, what is the theoretical maximum speedup
according to Amdahl’s Law, regardless of processor count?
A. 4
B. 5
C. 10
D. ∞
Answer: ✅ B
Explanation: Max Speedup = 1 / Serial Fraction = 1 / 0.2 = 5
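To make the arithmetic easy to reproduce, here is a tiny C helper (illustrative, not part of the question bank) that evaluates Amdahl's formula Speedup = 1 / (S + (1 - S)/P) and its limit as the processor count grows.

#include <stdio.h>

/* Amdahl's Law: speedup with serial fraction s on p processors. */
static double amdahl(double s, double p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double s = 0.2;   /* 20% serial portion */
    for (int p = 1; p <= 1024; p *= 4)
        printf("P = %4d -> speedup = %.2f\n", p, amdahl(s, p));
    printf("limit as P grows without bound: %.2f\n", 1.0 / s);   /* = 5 */
    return 0;
}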
Q4. Which of the following best explains why Amdahl’s Law limits scalability?
A. Because of hardware restrictions
B. Because some parts of the code cannot be parallelized
C. Because of cache misses
D. Because parallelism always reduces performance
Answer: ✅ B
Q7. What happens to efficiency in strong scaling when the number of processors increases but
the problem size remains fixed?
A. It remains constant
B. It always increases
C. It typically decreases due to overhead
D. It increases linearly
Answer: ✅ C
Q8. A program has 95% parallel code. What is the maximum theoretical speedup using
Amdahl’s Law?
A. 20
B. 10
C. 19
D. Cannot be determined without processor count
Answer: ✅ A
Explanation: Max speedup = 1 / 0.05 = 20
Q9. In a perfectly parallelizable task, adding more processors increases speedup linearly. Which
type of scalability does this represent?
A. Weak scalability
B. Strong scalability
C. Amdahl’s scalability
D. Temporal scalability
Answer: ✅ B
Q1. A program has a serial portion of 10%. If we double the number of processors from 4 to 8,
the speedup will:
A. Double
B. Increase slightly
C. Remain the same
D. Be exactly 8
Answer: ✅ B
Explanation: Due to the serial portion, doubling the processors yields diminishing returns
(Amdahl's Law): speedup rises from about 3.1 with 4 processors to about 4.7 with 8, well short of doubling.
Q2. Which scenario demonstrates superlinear speedup, which seems to violate Amdahl’s Law?
A. When total execution time increases after parallelization
B. When processors utilize shared memory
C. When parallel execution fits entirely into processor cache
D. When tasks are perfectly load balanced
Answer: ✅ C
Explanation: Cache effects can lead to speedup > P, which Amdahl's Law does not account for.
Q4. Suppose a program runs in 100 seconds on 1 processor, and 60 seconds on 4 processors.
What is the efficiency?
A. 0.25
B. 0.4
C. 0.6
D. 0.75
Answer: ✅ B
Explanation:
Speedup = 100 / 60 = 1.67
Efficiency = 1.67 / 4 = ~0.417
Q5. Which of the following statements is true regarding Amdahl’s Law in practice?
A. It assumes ideal scaling as processors increase
B. It favors increasing the problem size with processors
C. It highlights the impact of even a small serial portion
D. It allows infinite speedup with enough threads
Answer: ✅ C
Q6. Which of the following is not a reason why efficiency might decline as processors increase?
A. Increased communication overhead
B. Smaller workload per processor
C. Larger memory requirements
D. Decreased synchronization cost
Answer: ✅ D
Explanation: Decreased synchronization cost should improve efficiency.
Q7. Given a system where the parallel portion is 80%, how many processors are required to
achieve at least 4x speedup?
A. 5
B. 6
C. 8
D. 10
Answer: ✅ D
Explanation (using Amdahl's Law):
Speedup = 1 / (S + (1 - S)/P), with S = 0.2
With P = 10: Speedup = 1 / (0.2 + 0.8/10) = 1 / 0.28 ≈ 3.57
Setting the speedup equal to 4 gives 0.2 + 0.8/P = 0.25, so P = 16.
(None of the listed options actually reaches 4x; D, with 10 processors, comes closest, and a full 4x speedup requires 16 processors.)
Q8. A system achieves a speedup of 6 using 8 processors. What is the efficiency and what does
it suggest?
A. 0.75, excellent scaling
B. 0.8, possible overuse
C. 0.75, some parallel overhead
D. 1, perfect speedup
Answer: ✅ C
Explanation: Efficiency = Speedup / P = 6 / 8 = 0.75. This is good but not perfect scaling,
indicating some parallel overhead.
Q9. If a system exhibits high efficiency but low speedup, what could be inferred?
A. System has high overhead
B. Problem is mostly parallel
C. Problem is mostly serial
D. Number of processors is high
Answer: ✅ C
Explanation: Low speedup suggests serial bottleneck; high efficiency shows minimal loss from
parallelization overhead.
Q10. Why does Gustafson’s Law suggest more optimistic scaling than Amdahl’s Law?
A. It assumes fixed hardware
B. It scales problem size with processor count
C. It ignores the serial portion
D. It assumes constant communication cost
Answer: ✅ B