Long Questions (16 marks)
1. Describe a parallel algorithm to find the sum of all the integers in a list of integers. Explain
   how you would divide the work among multiple processors and how the results would be
   combined.
           The goal is to divide the work among multiple processors so that they concurrently
   compute partial sums of the array, and then combine these partial sums to produce the final
   result. The process combines divide-and-conquer with a reduction. The general algorithm for this
   procedure is as follows:
                   Input: Array A of size N, Number of processors P
                   Step 1: Divide the array into P parts (assume N is divisible by P)
                     chunk_size = N / P
                   Step 2: Assign each processor a chunk and compute its partial sum
                     for i = 0 to P-1 in parallel:
                         start_index = i * chunk_size
                         end_index = (i + 1) * chunk_size - 1
                         sum_i = sum(A[start_index : end_index]) // each processor computes one partial sum
                   Step 3: Gather all partial sums into an array S
                     S = [sum_0, sum_1, ..., sum_(P-1)]
                   Step 4: Sum the partial sums sequentially to obtain the final result
                     total_sum = sum(S)
                   Output: total_sum
           Let’s assume that we have an array A of 20 integers, and that 4 processors are
   available.
           A = [3, 7, 1, 5, 9, 2, 8, 4, 6, 10, 15, 12, 18, 11, 20, 17, 13, 14, 19, 16]
           To find the sum of all 20 integers in an array (A[20]) using a parallel algorithm, we can
   divide the work among multiple processors as follows:
      Divide the Work: Split the array of 20 integers into subsets based on the number of
       processors available. Each processor will handle a portion of the array.
      Calculate Partial Sums: Each processor computes the sum of its assigned subset of
       integers. For example, with 4 processors, the array A can be divided into 4 parts,
       each processor computing the sum of 5 integers.
      Combine Partial Results: After each processor has computed its partial sum, gather
       these partial sums into a list.
      Sequential Summation: Finally, sum up the list of partial sums sequentially to get the
       total sum of all 20 integers.
   (1) Divide the Work:
          Processor 1: Handles A[0] to A[4]
          Processor 2: Handles A[5] to A[9]
          Processor 3: Handles A[10] to A[14]
          Processor 4: Handles A[15] to A[19]
   (2) Calculate Partial Sums:
          Processor 1 computes: sum1 = 3 + 7 + 1 + 5 + 9 = 25
          Processor 2 computes: sum2 = 2 + 8 + 4 + 6 + 10 = 30
          Processor 3 computes: sum3 = 15 + 12 + 18 + 11 + 20 = 76
          Processor 4 computes: sum4 = 17 + 13 + 14 + 19 + 16 = 79
   (3) Combine Partial Results:
          Gather the partial sums into an array: S = [25, 30, 76, 79]
   (4) Sequential Summation:
          Sum up the elements of array S sequentially:
           25 + 30 = 55
           55 + 76 = 131
           131 + 79 = 210
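The four steps above can be sketched in runnable form. This is a minimal illustration, using a Python thread pool to stand in for the 4 processors; the chunking uses ceiling division so it also works when N is not divisible by P:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(A, P):
    """Sum A by splitting it into P chunks, summing each chunk in a
    separate worker, then reducing the partial sums sequentially."""
    chunk_size = (len(A) + P - 1) // P  # ceiling division handles N % P != 0
    chunks = [A[i * chunk_size:(i + 1) * chunk_size] for i in range(P)]
    with ThreadPoolExecutor(max_workers=P) as pool:
        partial_sums = list(pool.map(sum, chunks))  # Step 2: partial sums in parallel
    return sum(partial_sums)                        # Step 4: sequential reduction

A = [3, 7, 1, 5, 9, 2, 8, 4, 6, 10, 15, 12, 18, 11, 20, 17, 13, 14, 19, 16]
print(parallel_sum(A, 4))  # → 210
```

Note that Python threads are used here only to mirror the structure of the algorithm; on real multiprocessor hardware each chunk would run on a separate physical processor.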
       By dividing the work among processors, each processor independently computes a
smaller sum, which reduces the overall computational time compared to a single processor
handling all computations sequentially.
       After obtaining the partial sums from each processor, combining them into a single list is
straightforward and efficient.
       The final step, summing up the list of partial sums sequentially, ensures that all
processors' results are integrated to give the total sum of the original array. This approach
   optimizes the computation by leveraging parallel processing capabilities to handle larger data
   sets efficiently.
           Parallel algorithms are used to solve real-world problems.
   Real-World Applications:
   (1) Database Sorting: In large databases, where the dataset is enormous, parallel quicksort can
       be used to quickly sort data.
   (2) Distributed Systems: In cloud-based systems, where data is distributed across multiple
       machines, parallel quicksort allows sorting tasks to be distributed among available processors
       to achieve faster data retrieval and analysis.
   (3) Scientific Computing: In simulations and numerical analysis, large-scale matrix multiplications
       are often required. Parallel matrix multiplication helps in accelerating these simulations.
   (4) Machine Learning: Many machine learning algorithms, such as training neural networks,
       require operations involving large matrices (e.g., weight matrices, input-output matrices).
       Parallel matrix multiplication speeds up these operations, allowing faster model training.
2. Prove that a processor with k-stage pipeline can be at most k times faster than non-pipelined
   processor.
           Pipelined processor: Instructions are broken down into stages, and multiple instructions
   are processed concurrently, each in a different stage.
           Non-pipelined processor: Instructions are executed one at a time, completing all stages
   before the next instruction begins.
   Step 1: Execution Time in a Non-Pipelined Processor
   In a non-pipelined processor, each instruction is executed sequentially, meaning the next
   instruction starts only after the current one completes.
   - Let T be the time taken to execute one instruction.
   - For n instructions, the total execution time is:
   Tnon-pipelined = nT
   Step 2: Execution Time in a k-Stage Pipelined Processor
A k-stage pipeline splits the execution of each instruction into k stages, each taking T/k
time (so a single instruction still takes T in total). The pipeline operates as follows:
    1. The first instruction takes T time to complete (it must pass through all k stages,
       each taking T/k).
    2. After that, each subsequent instruction completes in T/k time, as the pipeline is filled.
    3. Thus, for n instructions, the total execution time is:
Tpipelined = T + (n−1)T/k
Simplifying:
Tpipelined = (n+k−1)T/k
For large n, the term k−1 becomes negligible, so we approximate:
Tpipelined ≈ nT/k
Step 3: Speedup Calculation
Speedup S is defined as the ratio of execution time in a non-pipelined processor to that in a
pipelined processor:
S = Tnon-pipelined / Tpipelined
Substituting the values:
S = nT / (nT/k) = k
         Thus, under ideal conditions, a k-stage pipeline can be at most k times faster than a
non-pipelined processor.
         However, in real-world scenarios, factors such as pipeline stalls, data hazards, and branch
mispredictions often reduce the actual speedup below this theoretical maximum.
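The bound can be checked numerically. The exact (not approximated) speedup is S = nk/(n+k−1); a small sketch evaluating it for a hypothetical 5-stage pipeline shows it approaching, but never exceeding, k:

```python
def pipeline_speedup(n, k, T=1.0):
    """Speedup of a k-stage pipeline over a non-pipelined processor for n
    instructions: S = nT / ((n + k - 1) * T / k) = nk / (n + k - 1)."""
    t_non_pipelined = n * T              # sequential: n instructions, T each
    t_pipelined = (n + k - 1) * (T / k)  # fill time, then one result per T/k
    return t_non_pipelined / t_pipelined

# Speedup grows with n but stays below k (here k = 5):
for n in (5, 100, 100000):
    print(n, round(pipeline_speedup(n, 5), 4))
```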
3. ARRAY PROCESSORS
         A synchronous array of parallel processors is called an array processor, which consists of
  multiple processing elements (PEs) under the supervision of one control unit (CU). An array
  processor can handle single instruction and multiple data (SIMD) streams. SIMD machines are
  specifically designed to perform vector computations over matrices and arrays of data.
          SIMD computers appear in two basic architectural organizations: array processors, using
   random-access memory, and associative processors, using content-addressable (or associative)
   memory.
   SIMD Computer Organization
         In general, an array processor assumes the configuration as shown in Figure. The
  processor is structured with N synchronized PEs, all of which are under the control of one CU.
   Each PE_i is essentially an ALU with attached working registers and local memory, PEM_i, which
  collectively form the storage space for the distributed data. The CU has its own main memory for
  the storage of programs. The operating system and user programs are executed under the control
  of the CU. The user programs are loaded into the CU memory from an external source. The
  function of CU is to decode all the instructions and determine where the decoded instructions
  should be executed. Scalar and control-type instructions are executed directly by the CU. Vector
  instructions are broadcast to the PEs for distributed execution to achieve spatial parallelism
  through duplicated arithmetic units.
         All the PEs perform the same function synchronously in a lock-step fashion under the
  command of the CU. Vector operands are distributed to the PEMs before parallel execution in
   the array of PEs. The distributed data can be loaded into the PEMs from an external source via
   the system data bus, or broadcast by the CU over the control bus before execution.
         An array processor is normally interfaced to a host computer through the control unit.
  The host computer is a general-purpose machine which serves as a coordinator of the entire
  system, consisting of the host system and the processor array. The functions of a host computer
  include resource management and peripheral and I/O supervision. The control unit of the
  processor array directly supervises the execution of programs.
4. The additional requirements of a multiprocessor operating system
          The additional requirements of a multiprocessor operating system are enumerated below
   (An operating system that fails to perform well in these respects tends to negate other advantages
   which are associated with multiprocessing):
   (1) Load balancing
          The operating system must utilize the resources efficiently. This is commonly expressed
   in terms of achieving a uniform balance of loads across the processors. The operating system
   should schedule the subtasks such that there are no idle resources including the processors.
   (2) Scheduling cooperating processes
          Parallel programs which consist of concurrently executing tasks must be scheduled such
   that collectively they are able to use the resources in the machine required to solve the given
   problem. Scheduling half the subtasks of one parallel program and half the subtasks of another
   parallel program which cannot cooperate can be wasteful. If a parallel program needs all its
   subtasks to be running and cooperating, then scheduling only half of the tasks can lead to
   starvation.
   (3) Graceful degradation in case of failure of one of the resources:
          Given that a parallel machine has multiple resources of the same kind, one must expect a
   higher degree of fault tolerance from it. Failure of one of its resources should not result in a
   catastrophic system crash. The operating system should be able to reschedule the task that had
   been running on the failed resource and continue the parallel program. Ideally, there should be
   only a fractional degradation in the performance of the parallel machine.
   (4) Communication schemes:
          Most parallel programs need to share data and intermediate results across subtasks during
   the processing towards the solution of the problem. To achieve effective cooperation among the
   subtasks of a parallel program, the operating system must provide adequate facilities for
   communication between tasks. These facilities vary depending on whether the machine is of a
   shared memory type or of a distributed memory type.
   (5) Synchronization mechanisms:
          To ensure integrity of the shared data across the subtasks of a parallel program,
   synchronization between tasks is required. The shared data must be accessed under mutual
  exclusion (implemented using a mechanism like semaphores). The tasks may need to wait till
  some state is reached across all the tasks of the parallel program. The operating systems need to
  provide signaling mechanisms for such synchronization requirements.
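As a concrete illustration of requirement (5), here is a minimal sketch (not from the source; standard Python threading is used as a stand-in) in which cooperating subtasks update shared data under mutual exclusion using a lock, which behaves like a binary semaphore:

```python
import threading

counter = 0              # shared data across the subtasks
lock = threading.Lock()  # mutual-exclusion mechanism (a binary semaphore)

def add_partial(value):
    global counter
    with lock:           # only one task may update the shared data at a time
        counter += value

# 100 cooperating subtasks of a "parallel program"
threads = [threading.Thread(target=add_partial, args=(v,)) for v in range(1, 101)]
for t in threads:
    t.start()
for t in threads:        # wait until every subtask has reached its final state
    t.join()
print(counter)  # → 5050
```

Without the lock, the read-modify-write on `counter` could interleave across tasks and lose updates, which is exactly the integrity problem synchronization mechanisms exist to prevent.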
5. P-RAM
          The random access machine (RAM), not to be confused with random access memory, is a
   model of a one-address computer. A RAM consists of a memory, a read-only input tape, a write-
  only output tape, and a program. The program is not stored in the memory and cannot be
  modified. The input tape consists of a sequence of integers. Every time an input value is read, the
  input head advances one square.
          The P-RAM (parallel RAM) consists of a control unit, global memory, and an unbounded
  set of processors, each with its own private memory. Every processor has a unique index; this
  value can be used to address the processor for passing signals and interrupts. All the processors
  share access to global memory which is unbounded in capacity. Execution of instructions can be
   started on any of the processors at any time, starting at any location of memory (in its local
   or in the global memory).
          A processor can cause the start of execution of some task on some other processor
  through the signalling mechanism. This signalling mechanism can be available through hardware
  or software, or both. This model represents the capabilities of an MIMD machine for parallel
  computations. Several simplifying assumptions are made while considering the development of
  algorithms for such a machine. They are:
  (1) There is no limit on the number of processors in the machine.
  (2) Any memory location is uniformly accessible from any processor. (Uniformly in terms of
      time required for the access.)
  (3) There is no limit on the amount of shared memory in the system.
  (4) Resource contention is absent. A processor does not have to wait for reading a memory
      location till the other processor has finished accessing the location.
  (5) The programs written on these machines are, in general, of type MIMD. Certain special cases
      such as SIMD may also be handled in such a framework.
         While mapping an algorithm developed on such a machine to real machines, some
  constraints are applied to modify the implementation aspects of the algorithm. Formally, such
  constraints are expressed in the following terms:
   (1) EREW (Exclusive Read Exclusive Write): Read or write conflicts are not allowed.
   (2) CREW (Concurrent Read Exclusive Write): Concurrent reading is allowed; however, write
       conflicts are not allowed.
   (3) CRCW (Concurrent Read Concurrent Write): Concurrent reading and concurrent writing are
       allowed. Such a model, in general, is of little practical value.
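To make the EREW constraint concrete, the sum of n numbers can be computed in O(log n) parallel rounds with no read or write conflicts: in every round each (simulated) processor touches a disjoint pair of memory cells. A minimal sequential sketch of this tree reduction:

```python
def erew_tree_sum(values):
    """Simulated EREW PRAM sum: in each round, processor i combines a
    disjoint pair of cells, so no cell is read or written by more than
    one processor in the same round (exclusive read, exclusive write)."""
    cells = list(values)
    stride = 1
    while stride < len(cells):
        # every iteration of this loop is one conflict-free parallel round
        for i in range(0, len(cells) - stride, 2 * stride):
            cells[i] += cells[i + stride]
        stride *= 2
    return cells[0]

print(erew_tree_sum(list(range(1, 21))))  # → 210
```

On a CREW machine the same reduction is legal as well; the EREW discipline simply shows that even the strictest model supports a logarithmic-time sum.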
6. The Configuration of OS (see the handout provided by Tr. YYA)
   Short Questions (8 marks)
1. What is a multiprocessor system, and how does it differ from a single-processor system?
          A multiprocessor system is a computer system with two or more processors that share a
   common physical memory and work together to execute tasks. These processors can execute
   multiple instructions simultaneously, making them suitable for parallel processing tasks.
           A single-processor system, on the other hand, contains only one processor that executes
   one instruction at a time. While it can multitask using scheduling techniques, it does not achieve
   true parallelism like a multiprocessor system.
   Differences:
     Aspect                  Multiprocessor System                    Single-Processor System
     Number of processors    Multiple                                 One
     Parallelism             True parallelism                         Simulated through multitasking
     Performance             Higher, due to parallel tasks            Limited by sequential execution
     Cost                    Higher                                   Lower
     Applications            High-performance computing, databases    Personal computers, basic tasks
   Example:
          A multiprocessor system: A supercomputer used for weather prediction (e.g., IBM
           Summit).
          A single-processor system: A basic laptop performing web browsing.
2. Explain how multiple functional units contribute to parallelism in sequential machines.
           Multiple functional units in a sequential machine enable the system to perform different
   operations (e.g., addition, multiplication) simultaneously, thereby increasing throughput. These
   units include Arithmetic Logic Units (ALUs), Floating-Point Units (FPUs), and other specialized
   processors.
   Functions:
          A CPU with multiple functional units divides tasks between these units.
          While one unit processes an addition operation, another performs multiplication or a
           memory fetch.
   Example:
          Consider a CPU executing two instructions:
              1. Addition (handled by the ALU).
              2. Data fetching from memory (handled by a Memory Management Unit).
          The tasks are performed simultaneously instead of sequentially, reducing execution time.
           Multiple functional units within a processor core enable parallelism in sequential
   machines by allowing different operations to be executed simultaneously on separate data
   streams, effectively overlapping the execution of instructions. This exploits instruction-level
   parallelism (ILP) within a single core: multiple units can work on different parts of a task at
   the same time, even though the instructions are issued sequentially.
   Example
           Imagine a processor with separate functional units for integer addition, floating-point
    multiplication, and logical operations.
           If an instruction sequence includes an addition, a multiplication, and a logical AND, these
   operations can be executed simultaneously on their respective units, significantly reducing the
   overall execution time compared to a single functional unit processing them sequentially.
3. Real-world applications (in algorithm).
4. Data Parallel Model
           The programming model for data-parallel SIMD processors is an extension of
   sequential programming. For example, Fortran 90 is specifically tailored for data parallelism.
   The language compiler for a specific machine must translate the source code into executable
   object code, which will exploit the parallelism in the hardware.
           Data-parallel programs require the use of pre-distributed data sets. Thus the choice of
   parallel data structure plays a significant role in data-parallel programming.
          Synchronization of data-parallel programs is done at compile-time rather than at run-
   time. Hardware optimization is enforced by the control unit to carry out lockstep execution of
   SIMD programs. The compiler must be aware of the underlying interconnection topology of PEs
   to generate optimal code for the array processor.
5. Parallel processing
          Parallel processing is the simultaneous execution of multiple tasks or instructions to
   improve computational speed and efficiency. It is widely used in high-performance computing,
   AI, and large-scale data processing. The types of parallelism encountered in computing include
   look-ahead, pipelining, vectorization, concurrency, data parallelism, partitioning, interleaving,
   overlapping, replication, time sharing, space sharing, multitasking, multiprogramming,
   multithreading, and distributed computing.
          The most well-known classification of computer architectures, Flynn's taxonomy, is as follows:
   (1) SISD: This is the conventional von Neumann computer. Only a single stream of instructions
       and a single stream of data is involved.
   (2) SIMD: Here we have only one instruction stream, but multiple data streams are active at a
       time. This is typically done by replicating arithmetic units in the CPU and allowing the
       different units to refer to different operands, but follow a common instruction. This model is
       known as data parallelism. Vector computers and Array processors are examples of this
       class.
   (3) MISD: This is the opposite of SIMD. By definition this means multiple instructions on a
       single stream of data. This appears counterintuitive, and hence many people believe that
       MISD is an impossible class. However, there are some who would place the systolic array
       architecture in this class.
(4) MIMD: The other end of the spectrum, of course, would be to let loose both the streams.
    There are multiple instructions and multiple data. An arbitrary combination of the instruction
    and data streams is possible. This is the most general model in Flynn's classification.