UNIT-4
1. Explain Multiprocessors and Multicomputers
Ans: Multiprocessors: A multiprocessor is a computer system with two or more central
processing units (CPUs) that share full access to a common RAM.
The main objective of using a multiprocessor is to boost the system’s execution
speed, with other objectives being fault tolerance and application matching.
There are two types of multiprocessors, one is called shared memory multiprocessor
and another is distributed memory multiprocessor.
In shared memory multiprocessors, all the CPUs share a common memory, but in a
distributed memory multiprocessor, every CPU has its own private memory.
The interconnection between two or more processors and the shared memory can be
implemented using three methods:
1)Time shared common bus
2)Multiport memories
3)Crossbar switch network
1)Time shared common bus
In this method, a single shared bus connects all the processors and memory units, and all
communication takes place over it.
If CPU-1 is interacting with a memory unit over the common shared bus, all other
processors must remain idle, because there is only one bus to communicate over.
Advantage:
Simple to implement.
Since only a single common bus is required, the cost of implementation is very low.
Disadvantage:
Data transfer rate is slow.
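To make the serialization concrete, here is a minimal sketch (hypothetical Python, not part of the syllabus material) of a time-shared bus: at most one CPU-memory transfer is granted per bus cycle, and every other requester must stay idle. The cycle structure and the fixed-priority tie-break are illustrative assumptions.

# Hypothetical sketch: one shared bus means at most one CPU<->memory
# transfer can be granted per cycle; all other requesters stay idle.

def time_shared_bus(requests_per_cycle):
    """requests_per_cycle: list of lists; each inner list holds the CPU ids
    requesting the bus in that cycle."""
    for cycle, requests in enumerate(requests_per_cycle):
        if requests:
            granted = min(requests)          # simple fixed-priority arbitration
            waiting = [c for c in requests if c != granted]
            print(f"cycle {cycle}: CPU-{granted} uses the bus, waiting: {waiting}")
        else:
            print(f"cycle {cycle}: bus idle")

# Example: CPU-1 and CPU-3 collide in cycle 0; CPU-3 must wait a full cycle.
time_shared_bus([[1, 3], [3], []])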
2)Multiport memories
Unlike the time-shared common bus method, this method provides a separate bus (port) for
each processor to communicate with the memory modules.
Suppose CPU-1 wants to interact with memory module 1; then port MM1 is enabled.
Similarly, if CPU-4 wants to interact with memory module 4, port MM4 is enabled. Hence
all the processors can communicate in parallel. If more than one CPU requests the same
memory module at the same time, priority is resolved in the order CPU-1, CPU-2, CPU-3, CPU-4.
(Figure: Multiport memory architecture)
Remaining ans in spectrum pg 4.6 Q5
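As a rough illustration of the fixed-priority rule stated above, the following hypothetical Python sketch grants each contended memory module to the lowest-numbered requesting CPU (CPU-1 highest priority, CPU-4 lowest); the function name and data layout are assumptions made only for this example.

# Hypothetical sketch of multiport-memory arbitration: each memory module
# has one port per CPU, and simultaneous requests to the same module are
# resolved by fixed priority CPU-1 > CPU-2 > CPU-3 > CPU-4.

def arbitrate(requests):
    """requests: dict mapping cpu id -> requested memory module id.
    Returns dict mapping module id -> winning cpu id."""
    winners = {}
    for cpu in sorted(requests):              # lower CPU number = higher priority
        module = requests[cpu]
        winners.setdefault(module, cpu)       # first (highest-priority) CPU wins
    return winners

# CPU-1 and CPU-2 both want module 1; CPU-1 wins, CPU-2 must retry.
# CPU-4 gets module 4 immediately since no one else contends for it.
print(arbitrate({1: 1, 2: 1, 4: 4}))          # {1: 1, 4: 4}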
3)Crossbar switch network
Here, unlike in multiport memories, a crosspoint switch is placed between each memory
module and CPU. The switch decides whether to pass a request on to a particular memory
module, based on the request made.
(Figure: Crossbar switch network)
Advantage:
High data transfer rate.
Disadvantage:
Complex to implement as more switches involved.
Costlier to implement.
Remaining ans in spectrum pg 4.5 Q3
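A minimal, hypothetical sketch of crossbar scheduling: because there is a crosspoint switch for every (CPU, memory module) pair, requests to distinct modules proceed in parallel, and only requests colliding on the same module are blocked. The tie-break order is an assumption.

# Hypothetical sketch of a crossbar: a crosspoint exists for every
# (CPU, memory module) pair, so distinct modules can be accessed in parallel.
# Conflicts only occur when two CPUs target the same module.

def crossbar_schedule(requests):
    """requests: dict cpu -> module. Returns (granted, blocked) dicts."""
    granted, blocked = {}, {}
    busy_modules = set()
    for cpu in sorted(requests):
        module = requests[cpu]
        if module in busy_modules:
            blocked[cpu] = module             # module already switched to another CPU
        else:
            busy_modules.add(module)
            granted[cpu] = module             # close the crosspoint switch (cpu, module)
    return granted, blocked

# Three CPUs reach three different modules simultaneously; only CPU-3 is blocked.
print(crossbar_schedule({1: 1, 2: 2, 3: 2, 4: 4}))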
Applications of Multiprocessor –
1. As a uniprocessor, such as single instruction, single data stream (SISD).
2. As a multiprocessor, such as single instruction, multiple data stream (SIMD), which is
usually used for vector processing.
3. Multiple series of instructions in a single perspective, such as multiple instruction,
single data stream (MISD), which is used for describing hyper-threading or pipelined
processors.
4. Inside a single system for executing multiple, individual series of instructions in
multiple perspectives, such as multiple instruction, multiple data stream (MIMD).
Benefits of using a Multiprocessor –
1. Enhanced performance.
2. Multiple applications.
3. Multi-tasking inside an application.
4. High throughput and responsiveness.
5. Hardware sharing among CPUs.
Advantages:
1. Improved performance
2. Better scalability
3. Increased reliability
4. Reduced cost
5. Enhanced parallelism
Disadvantages:
1. Increased complexity
2. Higher power consumption
3. Difficult programming
4. Synchronization issues
5. Limited performance gains
Multicomputer: A multicomputer system is a computer system with multiple processors
that are connected together to solve a problem. Each processor has its own private memory,
accessible only by that particular processor, and the processors communicate with each
other via an interconnection network.
Since a multicomputer is capable of passing messages between the processors, a task can
be divided among the processors to be completed. Hence, a multicomputer can be used for
distributed computing. It is more cost effective and easier to build a multicomputer than a
multiprocessor.
Difference between Multiprocessor and Multicomputer:
1. A multiprocessor consists of multiple processors within a single computer, whereas a
multicomputer is a set of interlinked autonomous computers.
2. In a multiprocessor, a single shared memory is attached to the processing elements; in a
multicomputer, the memory attached to the processing elements is distributed.
3. In a multiprocessor, it is necessary for the processing elements to communicate with
each other; in a multicomputer, the processing elements are not required to communicate.
4. A multiprocessor uses a dynamic interconnection network, whereas a multicomputer
uses a static network.
5. A multiprocessor supports parallel computing, whereas a multicomputer supports
distributed computing.
6. An example of a multiprocessor is the Sequent Symmetry S-81; an example of a
multicomputer is a message-passing multicomputer.
2. Explain the message passing mechanisms
Ans: Message Passing Mechanism
Message passing is a paradigm for inter-process communication (IPC) and parallel
computing. It involves exchanging messages between processes or nodes to coordinate
their actions and share data, allowing for synchronization and coordination in computations.
Message passing in a multicomputer network demands special hardware and software
support.
Message-Routing Schemes:
Message Format:
Information units used in message routing are specified in Fig. A message is the
logical unit for internode communication. It is often assembled from an arbitrary
number of fixed-length packets; thus, it may have a variable length.
A packet is the basic unit containing the destination address for routing purposes.
Because different packets may arrive at the destination asynchronously, a sequence
number is needed in each packet to allow reassembly of the message transmitted.
The packet is divided into several flits, which store the following:
a) Routing information
b) Sequence number
c) Data
A packet can be further divided into a number of fixed-length flits (flow control digits).
The routing information (destination) and sequence number occupy the header flits. The
remaining flits are the data elements of the packet.
The packet length is determined by the routing scheme and network implementation.
Typical packet lengths range from 64 to 512 bits. The sequence number may occupy
one to two flits depending on the message length.
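The flit structure above can be illustrated with a small hypothetical Python sketch; the flit width, packet length, and field names (DEST, SEQ, DATA) are assumptions chosen only to show how a variable-length message decomposes into fixed-length packets with header and data flits.

# Hypothetical sketch of the message -> packets -> flits hierarchy.
# Flit width and packet length here are illustrative, not fixed by the text.

FLIT_BITS = 32          # assumed flow-control digit size
FLITS_PER_PACKET = 8    # assumed fixed packet length (in flits)

def build_packet(dest, seq, data_flits):
    """Header flits carry the routing info (destination) and a sequence number,
    so packets arriving out of order can be reassembled into the message."""
    header = [("DEST", dest), ("SEQ", seq)]
    return header + [("DATA", d) for d in data_flits]

def build_message(dest, payload_flits):
    data_per_packet = FLITS_PER_PACKET - 2           # 2 header flits per packet
    packets = []
    for seq, i in enumerate(range(0, len(payload_flits), data_per_packet)):
        packets.append(build_packet(dest, seq, payload_flits[i:i + data_per_packet]))
    return packets   # variable-length message = arbitrary number of fixed-length packets

print(build_message(dest=5, payload_flits=list(range(10))))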
Store-and-Forward Routing:
Packets are the basic unit of information flow in store-and-forward routing. The
concept is illustrated in Fig. a.
Each node is required to use a packet buffer. A packet is transmitted from a source
node to a destination node through a sequence of intermediate nodes.
When a packet reaches an intermediate node, it is first stored in the buffer. Then it is
forwarded to the next node if the desired output channel and a packet buffer in the
receiving node are both available.
The latency in store-and-forward networks is directly proportional to the distance
(the number of hops) between the source and the destination.
This routing scheme was implemented in the first generation of multicomputers.
Wormhole Routing:
This scheme is implemented in later generations of multicomputers. Packets are divided
into smaller flits, and a flit buffer is provided in the hardware router attached to each node.
The transmission from the source node to the destination node is done through a
sequence of routers.
All the flits in the same packet are transmitted in order as inseparable companions in
a pipelined fashion.
The flits of a packet are transmitted in a cascade, with the header flit followed
immediately by the data flits. Only the header flit knows the path; it leads the data
flits to the desired destination.
Different packets can be interleaved during transmission. However, the flits from
different packets cannot be mixed up; otherwise they may be routed to the wrong
destinations.
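The practical difference between the two schemes is easiest to see in the commonly cited latency estimates, T_SF = (L/W)(D+1) for store-and-forward and T_WH = (Lf/W)D + L/W for wormhole routing, where L is the packet length, Lf the flit length, W the channel bandwidth, and D the number of hops. A minimal sketch, assuming these formulas:

# Sketch of the commonly cited latency estimates (assumed formulas, where
# L = packet length, Lf = flit length, W = channel bandwidth, D = hops):
#   store-and-forward: T_SF = (L / W) * (D + 1)    -- whole packet buffered per hop
#   wormhole:          T_WH = (Lf / W) * D + L / W -- only the header pays the per-hop cost

def store_and_forward_latency(L, W, D):
    return (L / W) * (D + 1)

def wormhole_latency(L, Lf, W, D):
    return (Lf / W) * D + L / W

# Example: 512-bit packet, 32-bit flits, 1 bit per time unit of bandwidth, 8 hops.
print(store_and_forward_latency(L=512, W=1, D=8))   # grows with distance
print(wormhole_latency(L=512, Lf=32, W=1, D=8))     # almost distance-insensitive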
Asynchronous Pipelining:
The pipelining of successive flits in a packet is done asynchronously using a
handshaking protocol as shown in Fig.
Along the path, a 1-bit ready/request (R/A) line is used between adjacent routers.
When the receiving router (D) is ready to receive a flit (i.e. the flit buffer is available),
it pulls the R/A line low. (Fig. a)
When the sending router(S) is ready, it raises the line high and transmits flit i through
the channel. (Fig.b)
While the flit is being received by D, the R/A line is kept high. (Fig. c)
After flit i is removed from D's buffer (i.e. is transmitted to the next node), the cycle
repeats itself for the transmission of the next flit i + 1 until the entire packet is
transmitted. (Fig.d)
Asynchronous pipelining can be very efficient, and the clock used can be faster than
that used in a synchronous pipeline.
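A hypothetical sketch of the four-step handshake described above, with the R/A line modelled as a simple variable (in hardware this is a single signal between adjacent routers):

# Hypothetical sketch of the flit-level handshake between a sending router S
# and a receiving router D over a 1-bit R/A line.

def send_packet(flits):
    ra_line = "LOW"                           # (a) D's flit buffer is free, so D pulls R/A low
    for i, flit in enumerate(flits):
        assert ra_line == "LOW"               # S may only transmit when D is ready
        ra_line = "HIGH"                      # (b) S raises the line and transmits flit i
        print(f"transmitting flit {i}: {flit} (R/A {ra_line})")
        # (c) R/A stays high while D holds flit i in its buffer
        ra_line = "LOW"                       # (d) flit i forwarded onward, buffer free again
    print("packet transmitted")

send_packet(["header", "seq", "data0", "data1"])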
Deadlock and Virtual Channels:
Virtual Channel: A virtual channel is a logical link between two nodes.
It is formed by a flit buffer in the source node, a physical channel between them, and
a flit buffer in the receiver node.
Figure shows the concept of four virtual channels sharing a single physical channel.
Four flit buffers are used at the source node and receiver node, respectively.
One source buffer is paired with one receiver buffer to form a virtual channel when
the physical channel is allocated for the pair.
Deadlock Avoidance:
By adding two virtual channels, V3 and V4 (Fig. c), one can break the deadlock cycle.
A modified channel-dependence graph is obtained by using the virtual channels V3
and V4 after the use of channel C2, instead of reusing C3 and C4.
The cycle in Fig. b is thereby converted into a spiral, thus avoiding deadlock. Channel
multiplexing can be done at the flit level, or at the packet level if the packet length is
sufficiently short.
Virtual channels can be implemented with either unidirectional channels or
bidirectional channels. The use of virtual channels may reduce the effective channel
bandwidth available to each request.
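A minimal, hypothetical sketch of flit-level channel multiplexing: each virtual channel is a pair of flit buffers, and the single physical channel is shared round-robin among them, so flits of different packets interleave without getting mixed up. The round-robin policy is an assumption for illustration.

# Hypothetical sketch: four virtual channels (pairs of source/receiver flit
# buffers) time-multiplexed over one physical channel at the flit level.
from collections import deque

def multiplex(virtual_channels):
    """virtual_channels: dict name -> deque of flits waiting in the source buffer.
    Flits are interleaved round-robin on the single physical channel."""
    delivered = {name: [] for name in virtual_channels}
    while any(virtual_channels.values()):
        for name, queue in virtual_channels.items():
            if queue:
                flit = queue.popleft()        # one flit of this VC uses the physical wire
                delivered[name].append(flit)  # it lands in the paired receiver buffer
    return delivered

vcs = {f"V{i}": deque([f"V{i}-flit{j}" for j in range(2)]) for i in range(1, 5)}
print(multiplex(vcs))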
3. Describe the Multiprocessors System interconnect Architectures
Ans:
Spectrum pg.no 4.3 Q1 generalized multiprocessor system
Q2 Hierarchical bus system
Q3 crossbar networks
Q4 crosspoint switch design
Q5 multiport memory
Q6 routing in butterfly networks
4. Explain directory-based cache coherence protocols
Ans:
CACHE COHERENCE : Cache coherence refers to the consistency of shared data stored in multiple
caches within a multiprocessor.
In systems with multiple processors, each processor typically has a private cache to reduce memory access
latency. However, when multiple caches store copies of the same memory block, cache coherence issues
arise:
Inconsistent Data: A processor modifies a memory block in its cache, but other caches hold stale
copies.
Race Conditions: Multiple processors attempt to modify the same data simultaneously.
Cache coherence ensures that all the processors have a consistent view of shared memory even when data
is cached in multiple locations.
Directory-based cache coherence protocols:
Directory-based cache coherence protocols are crucial for maintaining data
consistency in large-scale multiprocessor systems.
Directory-based cache coherence protocols maintain a centralized directory that
tracks the state and location of cached data across all processors in a multiprocessor
system
The directory acts as a global point of control, managing coherence by storing
information about which caches hold copies of each memory block and their
respective states (shared, exclusive, modified)
When a processor requests access to a memory block, it sends a message to the
directory, which consults its records to determine the appropriate actions required to
maintain coherence
o If the block is not present in any cache, the directory fetches it from main
memory and sends it to the requesting processor
o If the block is present in other caches, the directory coordinates the necessary
invalidation or update messages to ensure coherence before granting access to
the requesting processor
Directory-based protocols typically employ a set of coherence states (MESI, MOESI) to track the
status of each cached block and enforce the coherence invariants.
Common coherence states include:
o Modified (M): The block is exclusively owned by a single cache and has been modified
o Exclusive (E): The block is exclusively owned by a single cache but has not been modified
o Shared (S): The block is shared among multiple caches and is read-only
o Invalid (I): The block is not present in the cache or is outdated
Coherence invariants ensure that:
o At most one cache can have a block in the Modified state
o If a block is in the Shared state, no cache can have it in the Modified or Exclusive state
o A block in the Exclusive state cannot coexist with copies in other caches
The directory maintains a presence vector or a sharing list to keep track of which processors have
copies of each memory block, enabling efficient invalidation or update operations
Directory-based protocols offer better scalability compared to snooping-based
protocols:
o In snooping-based protocols, all processors monitor a shared bus for
coherence transactions, which can lead to increased traffic and limited
scalability as the number of processors grows
o Directory-based protocols avoid the need for a shared bus and
centralized snooping, reducing the communication overhead and
enabling more efficient use of interconnect bandwidth
Protocol Categorization:
o Full Map directories
o Limited directories
o Chained directories
1. Full Map directories:
The Full Map Directory Protocol is a directory-based cache coherence
protocol used to maintain consistency across multiple caches in a
shared-memory multiprocessor system. This protocol is called "Full Map"
because it maintains a complete directory for every cache line in the
shared memory, allowing precise tracking of which caches hold copies of
a given memory block.
The full-map protocol implements directory entries with one presence bit per
processor and a dirty bit (set when the cache block has been modified and the
cache copy is the only valid one in the system). Each presence bit represents the
status of the block in the corresponding processor's cache (present or absent).
If the dirty bit is set, then one and only one processor's bit is set, and only that
processor can write into the block.
Each directory entry effectively contains N pointers, where N is the number of
processors, so there can be up to N cached copies of a particular block shared by all
processors.
For every memory block, an N-bit vector is maintained, where N equals
the number of processors in the shared memory system. Each bit in the
vector corresponds to one processor.
The full-map protocol provides a useful upper bound for the performance
of centralized directory-based cache coherence. However, it is not
scalable due to excessive memory overhead.
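A minimal sketch of one full-map directory entry, assuming the presence-bit vector and dirty bit described above (illustrative Python only, not an actual protocol implementation):

# Hypothetical sketch of one full-map directory entry for an N-processor system:
# an N-bit presence vector (one bit per processor's cache) plus a dirty bit.

class FullMapEntry:
    def __init__(self, n_processors):
        self.present = [False] * n_processors    # bit i: block cached by processor i
        self.dirty = False                       # set => exactly one presence bit is set

    def add_sharer(self, cpu):
        assert not self.dirty, "must write back / invalidate the owner first"
        self.present[cpu] = True

    def give_exclusive(self, cpu):
        self.present = [i == cpu for i in range(len(self.present))]
        self.dirty = True                        # only this cache may write the block

entry = FullMapEntry(n_processors=4)
entry.add_sharer(0); entry.add_sharer(2)         # block shared read-only by P0 and P2
entry.give_exclusive(2)                          # P2 writes: other bits cleared, dirty set
print(entry.present, entry.dirty)                # [False, False, True, False] True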
2. Limited directories:
The Limited Directory Protocol (LDP) is a memory coherence protocol
used to maintain consistency in shared memory systems, particularly in
distributed shared memory (DSM) or multiprocessor systems. It is a
variation of the directory-based coherence protocol, optimized to reduce
the storage and communication overhead associated with directory
maintenance.
Limited directory protocols are designed to solve the directory size
problem. Restricting the number of simultaneously cached copies of any
particular block of data limits the growth of the directory to a constant
factor.
It has a fixed number of pointers per directory entry, regardless of the
number of processors.
These protocols are considered scalable with respect to memory
overhead because the resources required to implement them grow
approximately linearly with the number of processors in the system.
3. Chained directories:
The Chained Directory Protocol is a variant of directory-based
coherence protocols that uses a distributed and linked data structure to
track cache coherence. Unlike centralized directory protocols, where a
single directory maintains coherence information, the chained directory
protocol links directories across distributed memory modules in a chain-
like structure.
Chained directories realize the scalability of limited directories without
restricting the number of shared copies of data blocks. This type of cache
coherence scheme is called a chained scheme because it keeps track of
shared copies of data by maintaining a chain of directory pointers.
Instead of broadcasting coherence messages to all processors or a
centralized directory, coherence requests (e.g., read, write, or invalidate)
are propagated through the chain.
Each directory node forwards the request to the next node in the chain
until the coherence operation is completed.
Chained directories emulate the full-map scheme by distributing the directory among
the caches, solving the directory size problem without restricting the
number of shared block copies.
Chained directories keep track of shared copies of a particular block by
maintaining a chain of directory pointers.
This scheme should be called a gossip protocol [as opposed to a snoopy
protocol] because information is passed from individual to individual
rather than being spread by covert observation.
Although the chained protocols are more complex than the limited
directory protocols, they are still scalable in terms of the amount of
memory used for the directories. The number of pointers per cache or
memory block is independent of the number of processors.
IF ASKED IN DETAIL
1. Full Map directories:
The Full Map Directory Protocol is a directory-based cache coherence
protocol used to maintain consistency across multiple caches in a
shared-memory multiprocessor system. This protocol is called "Full Map"
because it maintains a complete directory for every cache line in the
shared memory, allowing precise tracking of which caches hold copies of
a given memory block.
Components of the Full Map Directory Protocol
1. Directory:
o A centralized or distributed structure that keeps metadata about every
memory block.
o Each entry in the directory corresponds to a memory block and contains:
State: Indicates the coherence state of the block (e.g., Modified,
Shared, Invalid).
Presence Vector: A bit vector where each bit corresponds to a
specific cache. A 1 in a bit indicates the cache holds a copy of the
block.
2. Cache States (for each block in individual caches):
o Modified (M): The block is updated in the cache and not consistent with
memory; the cache owns the block.
o Shared (S): The block is consistent across all caches and the memory.
o Invalid (I): The block is not valid in the cache.
3. Interconnect:
o A mechanism (e.g., bus or network) that facilitates communication
between caches, processors, and the directory.
Operations in the Protocol
The Full Map Directory Protocol handles four primary operations: Read, Write,
Invalidate, and Evict.
a. Read Miss (Cache Read Request)
If a processor requests a memory block not in its cache:
1. The cache sends a read request to the directory.
2. The directory checks the block's state:
If the block is Shared, the directory forwards the block from
memory or one of the caches with a copy.
If the block is Modified, the directory requests the cache with the
block to write it back to memory, then forwards the block to the
requesting cache.
3. The directory updates the presence vector to include the requesting
cache.
b. Write Miss (Cache Write Request)
If a processor wants to write to a block not in its cache or in the Shared state:
1. The cache sends a write request to the directory.
2. The directory checks the block's state:
If the block is Shared, the directory invalidates the block in all
other caches and updates the state to Modified.
If the block is Modified, the directory requests the owning cache
to write back the block, then forwards it to the requesting cache.
3. The directory updates the presence vector to reflect the requesting
cache as the sole owner.
c. Invalidate
When a cache needs exclusive access to a block (e.g., for a write operation):
1. The directory sends invalidate messages to all caches listed in the
presence vector.
2. Once all invalidations are acknowledged, the block is marked as
Modified in the requesting cache.
d. Eviction
When a cache evicts a block:
1. The cache informs the directory to remove its entry from the presence
vector.
2. If the block is Modified, the evicting cache writes it back to memory to
ensure consistency.
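The read-miss and write-miss flows above can be summarised in a hypothetical sketch; the messages to caches and memory are only simulated with prints, and the single-block Directory class is an assumption made for illustration.

# Hypothetical sketch of directory actions for read and write misses,
# following the Modified / Shared / Invalid states described above.

class Directory:
    def __init__(self):
        self.state = "I"                 # state of one memory block: M, S or I
        self.presence = set()            # caches holding a copy (presence vector)
        self.owner = None                # owning cache when state == "M"

    def read_miss(self, cpu):
        if self.state == "M":
            print(f"ask cache {self.owner} to write the block back to memory")
            self.owner = None
        self.state = "S"
        self.presence.add(cpu)           # requester recorded in the presence vector
        print(f"forward block to cache {cpu}; sharers = {sorted(self.presence)}")

    def write_miss(self, cpu):
        if self.state == "M":
            print(f"ask cache {self.owner} to write back and invalidate its copy")
        else:
            for other in sorted(self.presence - {cpu}):
                print(f"invalidate copy in cache {other}")
        self.state, self.owner, self.presence = "M", cpu, {cpu}
        print(f"cache {cpu} is now the sole (Modified) owner")

d = Directory()
d.read_miss(0)          # block fetched from memory, shared by cache 0
d.read_miss(2)          # second sharer added to the presence vector
d.write_miss(1)         # sharers 0 and 2 invalidated, cache 1 becomes owner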
2. Limited directories:
The Limited Directory Protocol (LDP) is a memory coherence protocol
used to maintain consistency in shared memory systems, particularly in
distributed shared memory (DSM) or multiprocessor systems. It is a
variation of the directory-based coherence protocol, optimized to reduce
the storage and communication overhead associated with directory
maintenance.
Key Features of Limited Directory Protocol:
1. Compact Directory Representation:
o Traditional directory protocols store information for all processors in the
system, which can lead to significant memory overhead in large-scale
systems.
o LDP limits the number of processors that can be explicitly tracked for a
cache line. Instead of maintaining a full list of sharers, it uses a limited
sharer list to track only a subset of processors that have a copy of the
data.
2. Fixed-Size Directory Entries:
o Each directory entry has a fixed size, with space for a limited number of
sharers.
o If the number of sharers exceeds the directory's capacity, it enters a
fallback state, such as broadcasting invalidation or downgrading the
block.
3. Coherence States:
o Like other directory protocols, LDP supports standard states such as
Modified, Shared, and Invalid (often abbreviated as MSI or MESI with
Exclusive).
o The directory keeps track of the coherence state and the subset of
processors holding a valid copy.
4. Efficient for Sparse Sharing:
o LDP performs well in scenarios where data sharing is limited to a small
number of processors at any given time. This is common in many parallel
applications where data locality is high.
Operation of Limited Directory Protocol:
1. Read Request:
If a processor requests a cache line, the directory checks its state:
o Shared or Uncached State: The requesting processor is added to the
limited sharer list, and the data is forwarded.
o Modified State: The directory sends an invalidation request to the
current owner, fetches the latest data, and updates the state to Shared.
2. Write Request:
For a write request, the protocol ensures exclusive access:
o If the data is in Shared State, invalidation messages are sent to all
sharers in the directory.
o The writer is granted exclusive ownership, and the state is updated to
Modified.
3. Overflow Handling:
When the number of sharers exceeds the directory's capacity:
o Broadcasting: The directory resorts to broadcasting coherence
messages to all processors, ensuring all sharers are updated or
invalidated.
o Coarser Tracking: The directory might switch to a coarse-grained
mode, marking the cache line as globally shared without tracking
individual sharers.
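A minimal sketch of a limited directory entry with a broadcast fallback on overflow; the two-pointer limit (MAX_PTRS) and the class structure are assumptions chosen only to illustrate the overflow handling described above.

# Hypothetical sketch of a limited directory entry: at most MAX_PTRS sharers
# are tracked exactly; beyond that the entry falls back to broadcast mode.

MAX_PTRS = 2            # fixed number of pointers, independent of processor count

class LimitedEntry:
    def __init__(self):
        self.sharers = []            # explicit pointers to sharing caches
        self.broadcast = False       # coarse "globally shared" fallback flag

    def add_sharer(self, cpu):
        if self.broadcast or cpu in self.sharers:
            return
        if len(self.sharers) < MAX_PTRS:
            self.sharers.append(cpu)
        else:
            self.broadcast = True    # overflow: stop tracking individual sharers

    def invalidate_all(self, n_processors):
        targets = range(n_processors) if self.broadcast else self.sharers
        for cpu in targets:
            print(f"invalidate copy in cache {cpu}")
        self.sharers, self.broadcast = [], False

e = LimitedEntry()
for cpu in (0, 1, 3):
    e.add_sharer(cpu)                # the third sharer overflows the 2-pointer entry
e.invalidate_all(n_processors=4)     # broadcast invalidation to every cache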
3. Chained Directory Protocol
The Chained Directory Protocol is a variant of directory-based coherence protocols that
uses a distributed and linked data structure to track cache coherence. Unlike centralized
directory protocols, where a single directory maintains coherence information, the chained
directory protocol links directories across distributed memory modules in a chain-like
structure.
Key Features
1. Distributed Directory: Each memory module maintains a local directory that tracks
which caches hold copies of its memory blocks. These directories are linked in a
chain.
2. Chaining Mechanism:
o Instead of broadcasting coherence messages to all processors or a centralized
directory, coherence requests (e.g., read, write, or invalidate) are propagated
through the chain.
o Each directory node forwards the request to the next node in the chain until
the coherence operation is completed.
3. Scalability:
o The chained structure reduces bottlenecks in centralized directories and scales
better with an increasing number of processors.
4. Reduced Network Traffic:
o By chaining requests instead of broadcasting them, the protocol minimizes
unnecessary message traffic, making it suitable for large-scale systems.
How It Works
1. Data Access and Directory Chain
o When a processor accesses a memory block, it checks its local cache.
o If the block is not found, a request is sent to the memory module that owns the
block. The local directory at the memory module identifies which processors
have cached copies of the block.
o If coherence actions are needed (e.g., invalidating other copies), the request
propagates through the chain of directories.
2. Read Operations
o If a processor requests a read, the directory checks whether the block is
available in shared state.
o If available, the block is sent to the requesting processor.
o If the block is modified in another cache, the protocol fetches the updated
block from that cache, writes it back to memory, and updates the requestor.
3. Write Operations
o For a write request, the protocol ensures that no stale copies exist in other
caches.
o It sends invalidate messages to all caches holding the block, propagating the
request through the chain.
4. Fault Tolerance
o If a link in the chain fails, the protocol can often reroute requests to maintain
functionality.
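A minimal, hypothetical sketch of the chain of directory pointers: memory holds a head pointer to the most recent sharer, each sharing cache points to the next sharer, and an invalidation walks the chain node by node instead of broadcasting. The singly linked layout is an assumption for illustration.

# Hypothetical sketch of a chained directory: memory holds a head pointer,
# and each sharing cache holds a pointer to the next sharer in the chain.

class ChainedDirectory:
    def __init__(self):
        self.head = None          # memory's pointer to the most recent sharer
        self.next = {}            # per-cache pointer to the next sharer (or None)

    def add_sharer(self, cpu):
        self.next[cpu] = self.head     # new sharer links to the previous head
        self.head = cpu

    def invalidate_chain(self):
        cpu = self.head
        while cpu is not None:         # request propagates cache-to-cache
            print(f"invalidate copy in cache {cpu}")
            cpu = self.next.pop(cpu)
        self.head = None

d = ChainedDirectory()
for cpu in (0, 2, 3):
    d.add_sharer(cpu)                  # chain: memory -> 3 -> 2 -> 0
d.invalidate_chain()                   # walks the chain instead of broadcasting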
5. Explain snoopy bus protocol
Ans: Snoopy bus Protocol:
The Snoopy Bus Protocol is a cache coherence mechanism used in multiprocessor systems
where multiple processors share a common bus and memory. This protocol ensures that the
data consistency is maintained across all processor caches. It operates by "snooping" or
monitoring the shared bus to observe memory transactions. Two key protocols used
to achieve this coherency through bus snooping are the write-invalidate protocol and the
write-update protocol. Here's a detailed breakdown:
Write Invalidate Protocol
When multiple caches share the same memory, a protocol is needed to keep them in sync.
The Write Invalidate Protocol ensures this by:
When a processor writes to a memory location, all other caches with that memory
line invalidate (nullify) their copies.
This ensures no other processor reads outdated data, and only the writing processor
has the latest data.
It prevents mismatched or outdated information in the system.
Advantages
Reduces Bandwidth Usage: The only messages transmitted on the bus are
invalidation messages, which are much smaller than transmitting the actual data.
Simpler Implementation: The protocol is relatively easy to implement because it
simplifies the cache coherence problem.
Prevents Stale Data Access: By invalidating the copies in other caches, it ensures that no
processor acquires outdated data from its own cache.
Disadvantages
Increased Cache Misses: Invalidation may result in more cache misses, since other
processors then have to re-read the data from either main memory or the cache of
the writing processor.
Performance Overhead: Frequent invalidations can add latency when there is high
write contention.
Write Update Protocol
The Write Update Protocol (or Write Broadcast Protocol) is a cache coherence method
used in multiprocessor systems.
When a processor writes to a memory location, it updates all other caches with the
new value instead of invalidating them.
This ensures all caches have the latest data without needing to nullify any copies.
Advantages
Reduced Cache Misses: Because the updated data is available in all caches,
processors do not have to fetch it again, which reduces the number of cache misses.
Improved Performance for Read-Heavy Workloads: Applications with frequent
reads benefit most when the latest data is present in all caches.
Consistent Data Across Caches: Ensures all caches hold up-to-date data, thus
improving data synchronization.
Disadvantages
Higher Bandwidth Consumption: Broadcasting the updated data to all caches
requires more bus bandwidth than sending invalidation messages only.
Complex Implementation: Coordinating and ensuring that all caches hold the
correct data is more challenging.
Potential for Increased Traffic: When many processors are involved, the number of
update messages can grow large, causing congestion on the bus.
The memory copy is also updated if write-through caches are used. With write-back
caches, the memory copy is updated later, at block replacement time.
Write-Through Caches: The states of a cache block copy change with respect to read, write,
and replacement operations in the cache. Figure shows the state transitions for two basic
write-invalidate snoopy protocols developed for write-through and write-back caches,
respectively. A block copy of a write-through cache i attached to processor i can assume one
of two possible cache states: valid or invalid (Fig. a).
Write-Back Caches: The valid state of a write-back cache can be further split into two cache
states, labeled RW (read-write) and RO (read-only), as shown in Fig. b. The INV (invalidated
or not-in-cache) cache state is equivalent to the invalid state mentioned before. This
three-state coherence scheme corresponds to an ownership protocol.
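A hypothetical sketch of the two-state (valid/invalid) write-invalidate scheme for write-through caches described above: local reads and writes keep the copy valid, while a write by another processor, observed by snooping the bus, invalidates it. The event names are assumptions for illustration.

# Hypothetical sketch of the two-state write-invalidate scheme for a
# write-through cache: local reads/writes keep a copy valid; a write by
# another processor, observed by snooping the bus, invalidates it.

def snoop(state, event):
    """state: 'VALID' or 'INVALID'; event: local/remote read/write of the block."""
    if event in ("local_read", "local_write"):
        return "VALID"                 # a local write also updates memory (write-through)
    if event == "remote_write":
        return "INVALID"               # another cache wrote the block: our copy is stale
    return state                       # remote reads do not change our copy

state = "INVALID"
for event in ["local_read", "remote_read", "remote_write", "local_read"]:
    state = snoop(state, event)
    print(event, "->", state)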
Write-once Protocol: James Goodman proposed a cache coherence protocol for bus-based
multiprocessors. This scheme combines the advantages of both write-through and
write-back invalidations. To reduce bus traffic, the very first write of a cache block
uses a write-through policy.