Multiprocessors and Multicomputers
        - Prajwala T R
        Dept. of CSE
        PESIT
Multiprocessor system interconnects
        Network characteristics
• Timing
  – Synchronous
  – Asynchronous
• Switching
  – Circuit switching
  – Packet switching
• Control
  – Centralized
  – Distributed
Hierarchical bus systems
• Local bus
  – Buses implemented within a processor chip or on a printed-circuit board (PCB)
  – Provide a communication path among the components mounted on the board
• Memory bus
• Data bus
• Backplane bus
  – A printed circuit board on which many connectors are used to plug in functional boards
  – Examples: VME bus, Multibus II, Futurebus+
Backplane bus
• I/O bus
  – SCSI (Small Computer System Interface) bus
  – Made of a coaxial cable with taps connecting to disks, printers, and other devices through interface logic
  – Ex: the Encore bus consists of a 32-bit address bus, a 64-bit data path, and a 14-bit vector bus
  – Clock speed: 12.5 MHz
SCSI bus cable
Encore Ultramax multiprocessor architecture
             Crossbar switch
• Single-stage networks
• Multistage networks
  – Blocking, ex: Omega and Baseline networks
  – Non-blocking (all possible connections between inputs and outputs can be made)
• Crossbar networks
Single stage
          Crossbar networks
• A single-stage, permutation, non-blocking network
• Each crosspoint is a unary switch that can be set open or closed, establishing a point-to-point connection
• N x M (N processors to M memory modules), or N = M for interprocessor connections
• All processors can send requests asynchronously and independently
Design
•   Multiplexer at each memory module
•   Arbitration logic to resolve conflicting requests
•   Acknowledgement signal returned to the granted processor
•   Memory read or write control
•   With 16 processors, 4-bit control (select) lines are needed (a control sketch follows this slide)
• Advantages
  – High bandwidth
  – Simple, cheap interface per processor
  – A single processor can send requests to multiple memory modules
• Disadvantages
  – Cost-effective only for a small number of processors
  – Not expandable once built
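The slides do not fix an arbitration policy, so the following is only a minimal sketch of one possible crosspoint controller: a per-memory-module round-robin arbiter grants one of the 16 requesting processors per cycle, and the winner's index becomes the 4-bit multiplexer select code mentioned above. The names (arbitrate, N_PROC) are illustrative, not from any real design.

```c
#include <stdio.h>

#define N_PROC 16   /* 16 processors -> log2(16) = 4-bit select lines */

/* One crossbar column: the memory module's arbiter picks a single
 * requesting processor per cycle (round-robin here); its index is the
 * 4-bit control code driven onto the module's multiplexer. */
static int arbitrate(unsigned request_mask, int last_grant)
{
    for (int i = 1; i <= N_PROC; i++) {
        int cand = (last_grant + i) % N_PROC;   /* rotating priority */
        if (request_mask & (1u << cand))
            return cand;                        /* grant: ack goes back to this CPU */
    }
    return -1;                                  /* no requests this cycle */
}

int main(void)
{
    unsigned requests = (1u << 2) | (1u << 5);  /* CPUs 2 and 5 request this module */
    int grant = arbitrate(requests, 2);
    printf("granted CPU %d -> 4-bit mux select = 0x%X\n", grant, (unsigned)grant);
    return 0;
}
```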
           Multiport memory
• A solution intermediate between a shared bus and a crossbar switch
• Each memory module has multiple ports; only one of the n processor requests to a module is honored at a time
• Drawbacks
  – Not scalable
  – Requires a large number of interconnection cables and connectors
Multiport memory
 Multistage and combining networks
• Omega network
• Baseline network
• Hot-spot problem
  – Ex: a memory module receiving a disproportionate share of the traffic, such as one holding a shared semaphore
  – Degrades the performance of the whole network
       Fetch and add primitive
• Atomically increments the content of a memory location
• Fetch&Add(x, e) returns the old value of x and adds the increment e to it as one atomic operation (a C sketch follows this slide)
• In a multiprocessor, while one process is making the change, no other process can access the intermediate result
• The network switches combine concurrent requests by adding their increments
• Disadvantage
  – Requires additional switch cycles to make the entire operation atomic
• Real system example: IBM RP3
  – 512 processors
  – Omega network with 128 ports
  – Bandwidth 13 Gbps
  – 50 MHz clock
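A minimal C11 sketch of the Fetch&Add(x, e) semantics described above, using atomic_fetch_add to stand in for the hardware (or combining-network) implementation; the wrapper name fetch_and_add is illustrative.

```c
#include <stdio.h>
#include <stdatomic.h>

/* Fetch&Add(x, e): atomically return the old value of x and add the
 * increment e to it.  In the machines above this is done in the memory
 * module or combined inside the network switches; C11 atomics stand in
 * for that hardware here. */
static int fetch_and_add(atomic_int *x, int e)
{
    return atomic_fetch_add(x, e);   /* old value comes back, x += e */
}

int main(void)
{
    atomic_int x = 10;
    int a = fetch_and_add(&x, 3);    /* a = 10, x becomes 13 */
    int b = fetch_and_add(&x, 3);    /* b = 13, x becomes 16 */
    printf("a=%d b=%d x=%d\n", a, b, atomic_load(&x));
    return 0;
}
```

With x = 10 and two increments of e = 3, the two callers observe 10 and 13 (in some order when run concurrently) and x ends at 16; no caller ever sees a partially updated value, which is the atomicity property the slide refers to.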
• Two methods to solve the cache coherence problem
  – Snoopy protocols: caches monitor (snoop on) the bus to observe writes to shared blocks
  – Directory-based protocols: no broadcasting of values; a directory records which caches hold copies of a block and what modifications have been made
            Snoopy protocols
• Snoopy protocols are used to ensure cache coherence over a shared bus
• The mechanisms are
  – Write-invalidate
  – Write-update
  – Write-through caches
  – Write-back caches
  – Write-once protocol
        Snoopy protocols contd…
• Write-invalidate protocol
  – Invalidates all remote copies when a local cache block is updated
• Write-update policy
  – Broadcasts the new data to all caches containing a copy of the block
         Snoopy protocols contd…
• Write-through caches
  – Processors i and j: i is the local processor, j any remote processor
  – Block states: VALID or INVALID
• Possible operations:
  – Read by the same processor: R(i)
  – Read by a different processor: R(j)
  – Write by the same processor: W(i); write by a different processor: W(j)
  – Replace by the same processor: Z(i); replace by a different processor: Z(j)
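A small sketch of the block-state transitions these operations imply for a write-through, write-invalidate cache (assuming write-allocate on a local write miss; the enum and function names are illustrative).

```c
typedef enum { INVALID, VALID } wt_state;

typedef enum { R_i, W_i, Z_i,      /* read / write / replace by the local processor i  */
               R_j, W_j, Z_j } op; /* the same operations observed from a remote processor j */

/* Next state of a block in cache i under the two-state write-through
 * write-invalidate snoopy protocol. */
wt_state next_state(wt_state s, op o)
{
    switch (o) {
    case R_i:
    case W_i:          /* the write goes through to memory, our copy stays usable */
        return VALID;
    case W_j:          /* remote write: the bus snoop invalidates our copy */
    case Z_i:          /* we replaced the block ourselves */
        return INVALID;
    case R_j:          /* remote read or remote replacement: no effect here */
    case Z_j:
    default:
        return s;
    }
}
```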
               Write back caches
• Data item states:
  – RO: Read-Only (valid state)
  – RW: Read-Write (valid state)
  – INV: Invalid state
• Possible operations:
  – Read by the same processor: R(i)
  – Read by a different processor: R(j)
  – Write by the same processor: W(i)
  – Write by a different processor: W(j)
  – Replace by the same processor: Z(i); replace by a different processor: Z(j)
Write back cache
             Snoopy protocols contd..
• Write-once protocol
  – The first write uses the write-through policy
  – Subsequent writes use the write-back policy
  – In both cases, the copy of the data item in remote caches is invalidated
• Data item states:
  – Valid: cache block is consistent with the main-memory copy
  – Reserved: data has been written exactly once and is consistent with the main-memory copy
  – Dirty: data has been written more than once and is not consistent with the main-memory copy
  – Invalid: block is not found in the cache or is inconsistent with the main-memory copy
Read hit, read miss, write hit, write miss
• Read hit: the information is supplied by the local cache
• Read miss: the data is read from main memory, after checking whether another cache holds the block in the Dirty or Reserved state (a Dirty copy supplies the data and is written back first)
• Write hit: if the block is in the Dirty or Reserved state, the write is performed locally and the state is updated to Dirty (a first write to a Valid block is written through and moves it to Reserved)
• Write miss: remote copies are invalidated; the block is brought into the cache and written
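A compact sketch of the write-once state machine described on the last few slides, split into the local-processor side and the bus-snooping side; the bus and memory actions (write-through, write-back, acknowledgements) are summarized in comments rather than modeled.

```c
typedef enum { INV, VALID_S, RESERVED, DIRTY } wo_state;

/* Local-processor side of the write-once protocol (sketch). */
wo_state on_local_write(wo_state s)
{
    switch (s) {
    case VALID_S:   /* first write: write-through once, remote copies invalidated */
        return RESERVED;
    case RESERVED:  /* subsequent writes stay local (write-back policy) */
    case DIRTY:
        return DIRTY;
    case INV:       /* write miss: fetch block, invalidate others, then write */
    default:
        return DIRTY;
    }
}

wo_state on_local_read(wo_state s)
{
    return (s == INV) ? VALID_S : s;   /* a read miss loads the block as Valid */
}

/* Bus-snooping side: what this cache does when it observes another
 * processor's transaction for the same block. */
wo_state on_remote_write(wo_state s)  { (void)s; return INV; }   /* invalidate our copy */
wo_state on_remote_read(wo_state s)
{
    /* a Dirty copy supplies the data / is written back; a Reserved copy is
     * already consistent; either way the block drops to Valid */
    return (s == DIRTY || s == RESERVED) ? VALID_S : s;
}
```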
     Multilevel cache coherence
• A write-invalidate is propagated vertically up the hierarchy in order to invalidate copies in the shared caches at higher levels
• Higher-level caches keep track of the dirty blocks beneath them
Protocol Performance issues
       Directory based protocols
• Snoopy protocols rely on broadcasting writes/invalidations
• In a large network this broadcasting is expensive
• The write-invalidate protocol leads to heavy bus traffic
• With the write-update protocol, the updated data may never be used by the remote processors
• Hence directory-based protocols are used in large-scale systems
• Cache directories store information on where the copies of each cache block reside
• A directory entry contains
  – A list of cached locations (pointers to the caches holding copies of the block)
  – A dirty bit (whether the block has been overwritten in some cache)
• Central directory scheme
  – Duplicates all the individual cache directories in one place
  – Consistency between the central and cache directories must be maintained
  – Drawbacks
     • Contention
     • Long search times
• Distributed directory scheme
  – Each memory module maintains its own directory
  – State information about a block is local to the memory module that holds it
  – On a read miss in cache C2, a request is sent to the memory module; the memory controller retransmits the request to cache C1, which holds the copy and supplies the data
  – On a write hit in C1, the controller sends invalidation requests to all caches holding a copy
           Types of directories
• Full-map directories
  – Each directory entry has n presence bits (pointers), where n is the number of processors
  – Each cache keeps 2 bits per block: a valid bit and a dirty bit (whether the block has been overwritten and may be written locally)
• Steps when cache C3 writes x, with copies also in C1 and C2 (a sketch follows this list)
  – Cache C3 finds that the block containing x is valid
  – C3 issues a write request to the memory module containing x
  – The memory module sends invalidation requests to C1 and C2
  – C1 and C2 reset the bit indicating x is valid (their copies are no longer valid) and acknowledge
  – The memory module sends write permission to C3
  – Cache C3 updates the value of x
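A minimal sketch of a full-map directory entry and the write sequence just listed. The names dir_entry and send_invalidate and the bit layout are hypothetical; acknowledgements and the write-permission reply to C3 are only noted in comments.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define N_CACHES 4                       /* one presence bit per cache (C1..C3 used here) */

/* Full-map directory entry for one memory block. */
typedef struct {
    uint32_t presence;                   /* bit i set => cache Ci holds a copy */
    bool     dirty;                      /* one cache holds a modified copy    */
} dir_entry;

/* Stand-in for the network transaction that tells cache c to invalidate. */
static void send_invalidate(int c) { printf("invalidate cache C%d\n", c); }

/* Write request from cache `writer` (the C3-writes-x example above):
 * invalidate every other copy (acks not modeled), then record that only
 * the writer holds the block and that it is dirty (write permission). */
static void directory_write(dir_entry *e, int writer)
{
    for (int c = 0; c < N_CACHES; c++)
        if (c != writer && (e->presence & (1u << c)))
            send_invalidate(c);          /* C1 and C2 drop their copies */

    e->presence = 1u << writer;          /* only C3 keeps a pointer     */
    e->dirty = true;                     /* C3 may now update x         */
}

int main(void)
{
    /* x is cached by C1, C2 and C3; C3 now issues a write request */
    dir_entry x = { .presence = (1u << 1) | (1u << 2) | (1u << 3), .dirty = false };
    directory_write(&x, 3);
    printf("presence=0x%X dirty=%d\n", (unsigned)x.presence, (int)x.dirty);
    return 0;
}
```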
            Limited directories
• The directory size problem is solved by keeping pointer entries only for the caches that actually hold a copy of block X, and only a limited number of them
• Dir_i X notation
  – i: number of pointers per directory entry
  – X = NB: no-broadcast scheme
• Dir_n NB: the full-map scheme without broadcast
• A limited directory uses i < n pointers
• Dir_2 NB: when a third cache requests a copy, one of the two existing pointers is replaced (eviction of that copy)
• Pointer replacement in the directory is analogous to set-associative cache replacement
• Limited directories give scalable protocols (directory storage grows slowly with n)
• Dir_i B: broadcast scheme; allows more than i copies of each block of data to exist
           Chained directories
• Singly linked chain (see the sketch after this slide)
  – Initially there are no shared copies of x
  – P1 reads x: memory supplies the copy to C1 along with a chain-termination (CT) pointer
  – When P2 reads x, it obtains the data and a pointer to C1, the previous head of the chain
  – Memory then keeps a pointer to C2
  – These are called gossip protocols: information is passed from individual to individual
• Doubly linked chain
  – Two pointers per copy: backward and forward chain pointers
  – Needs more memory because two pointers are stored per copy
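A minimal sketch of the singly linked chained directory, assuming the chain-termination (CT) pointer is modeled as NULL and that each new reader is linked in at the head of the chain; the struct names are illustrative.

```c
#include <stdio.h>
#include <stddef.h>

/* Singly linked chained directory: memory keeps one pointer to the head
 * of the sharing chain; each cached copy points to the next sharer, and
 * the last copy holds the chain-termination (CT) pointer, NULL here. */
typedef struct copy {
    int          cache_id;     /* which cache holds this copy of x */
    struct copy *next;         /* next sharer, or NULL = CT        */
} copy;

typedef struct {
    copy *head;                /* memory's single pointer          */
} mem_dir;

/* A new cache reads x: its copy receives a pointer to the previous head
 * of the chain, and memory now points to the new copy. */
void read_block(mem_dir *d, copy *new_copy)
{
    new_copy->next = d->head;  /* C2's copy points at C1's copy (or CT) */
    d->head = new_copy;        /* memory keeps a pointer to C2          */
}

int main(void)
{
    mem_dir dir = { NULL };            /* initially no shared copies of x */
    copy c1 = { 1, NULL }, c2 = { 2, NULL };
    read_block(&dir, &c1);             /* P1 reads x: C1 -> CT            */
    read_block(&dir, &c2);             /* P2 reads x: C2 -> C1 -> CT      */
    for (copy *p = dir.head; p; p = p->next)
        printf("C%d -> ", p->cache_id);
    printf("CT\n");                    /* prints: C2 -> C1 -> CT          */
    return 0;
}
```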
• Cache design alternatives
• Shared caches: no private caches, or a shared second-level cache; reduces main-memory access time
• Shared data can be made non-cacheable
• Cache flushing: caches are flushed at synchronization points, I/O operations, and process migration
             Atomic operation
• Synchronization primitives
  – Test&Set lock (a C sketch follows this slide)
  – Lock = 1: set (locked)
  – Lock = 0: reset (free)
  – Spin lock: busy-wait until the lock is acquired
• Wired barrier synchronization
  – Wired-NOR logic
  – Control vector X in each processor
  – Common monitor vector Y observed by all processors
  – Each bit X_i drives an input of a wired-NOR barrier line; Y_i is that line's output
• X_i = 1 indicates process i has been initiated (has not yet reached the barrier)
• The barrier line going to 1 signals that synchronization is complete
• Only one barrier line is needed to initiate and complete a single synchronization operation
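A minimal C11 sketch of the Test&Set spin lock from the first bullet group above, with atomic_flag_test_and_set standing in for the hardware Test&Set instruction.

```c
#include <stdatomic.h>

/* Spin lock built on a test-and-set primitive: lock = 1 (set) means
 * held, lock = 0 (reset) means free. */
typedef struct { atomic_flag flag; } spinlock;

void spin_lock(spinlock *l)
{
    /* busy-wait (spin) until the old value comes back 0 (= lock was free) */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;                               /* keep spinning */
}

void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);  /* reset to 0 */
}
```

A lock is declared and initialized as `spinlock l = { ATOMIC_FLAG_INIT };`; the acquire/release memory orders mirror the usual lock-entry and lock-exit semantics.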
3 generations of multicomputers
      Message passing schemes
• Message formats
  – Fixed-length packets
  – Each packet carries the destination address and a sequence number
  – Packets are further divided into flits (flow control digits)
  – Store-and-forward routing operates on packets
  – Wormhole routing operates on flits
  – Typical packet size: 64-512 bits
      Store and forward routing
• The basic units of transfer are packets
• A packet is transmitted through a series of intermediate nodes
• Buffers at each node store the entire packet before it is forwarded to an output channel
• Latency is proportional to the number of hops (see the latency sketch after the wormhole slide)
           Wormhole routing
• Flits are the unit of transfer
• Transmission from source to destination is done through routers
• All flits of a packet are transmitted in order, as inseparable companions
• A header flit leads, followed by the data flits
• Latency is almost independent of the distance (number of hops)
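To make the latency contrast concrete, a sketch of the usual first-order models (assumed formulas that ignore startup time and contention): store-and-forward pays the full packet time L/W at every hop, while wormhole pays only one flit time Lf/W per hop plus one packet time.

```c
#include <stdio.h>

/* Rough latency models:
 *   T_sf = (L / W) * D           -- whole packet buffered at each hop
 *   T_wh = (Lf / W) * D + L / W  -- only the header pays the per-hop cost
 * L = packet length, Lf = flit length, W = channel bandwidth, D = hops. */
double t_store_forward(double L, double W, int D)        { return (L / W) * D; }
double t_wormhole(double L, double Lf, double W, int D)  { return (Lf / W) * D + L / W; }

int main(void)
{
    double L = 512, Lf = 16, W = 1.0;   /* bits, bits, bits per cycle */
    for (int D = 1; D <= 8; D *= 2)
        printf("D=%d  T_sf=%.0f  T_wh=%.0f cycles\n",
               D, t_store_forward(L, W, D), t_wormhole(L, Lf, W, D));
    return 0;
}
```

With L = 512 bits, Lf = 16 bits and D = 8 hops the model gives 4096 versus 640 cycles, which is why wormhole latency is nearly independent of distance.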
       Asynchronous pipelining
• A handshaking protocol is used between routers
• A 1-bit ready/request line is used between adjacent routers for each channel
• No global clock is needed
             virtual channels
• A virtual channel is a logical link between two nodes
• It consists of a flit buffer in the source node, a physical channel between the nodes, and a flit buffer in the receiver node
• One source buffer is paired with one receiver buffer to form a virtual channel
• The physical channel is time-shared by all the virtual channels
                   deadlocks
• Why do deadlocks occur?
  – Ex: flits from four messages each occupy one of four channels and wait for the next, forming a circular wait
• How is deadlock detected?
  – A cycle in the channel-dependence graph indicates deadlock
• Deadlock avoidance
  – Add virtual channels so that the channel-dependence graph becomes cycle-free
  – Virtual channels can be unidirectional or bidirectional
        Flow control strategies
• Packet collisions occur when two packets need the same channel
• Elements involved: the source buffer and destination buffer holding the flit, and the channel between them
• Packet collision resolution must decide
  – Which packet will be allocated the channel?
  – What will be done with the packet that is denied the channel?
                   Solution 1
• Virtual cut-through routing
  – The blocked packet (packet 2) is temporarily stored in a buffer at the current node
  – Advantage: channel resources are not wasted
  – Disadvantages: requires large packet buffers and adds storage delay
  – The packet buffers must not introduce cycles
                 Solution 2
• Blocking flow control
  – Second packet is blocked but not abandoned
                  Solution 3
• Discard and retransmit
  – The blocked packet is dropped
  – Disadvantages: wastes resources and gives an unstable delivery rate
  – Requires packet retransmission and acknowledgement
                  Solution 4
• Detour after being blocked
  – The blocked packet is rerouted along a longer path, which can waste channel resources
  – Offers flexibility in routing
  – A rerouted packet can enter a livelock, which wastes resources
       Dimension order routing
• Deterministic routing
  – The communication path is completely determined by the source and destination addresses
  – Dimension-order routing: X-Y routing (2-D mesh), E-cube routing (hypercube)
• Adaptive routing chooses the path depending on network conditions
              E cube routing
• N = 2^n nodes in an n-dimensional hypercube
• Source s, destination d, intermediate node v; bits are numbered 1, ..., n
1. Compute the direction bits r_i = s_(i-1) XOR d_(i-1) for i = 1, 2, ..., n
2. In dimension i: if r_i = 1, route from the current node v to v XOR 2^(i-1); if r_i = 0, skip this dimension
3. Move to dimension i+1 and repeat until the destination is reached
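A short C sketch of the three E-cube steps (node addresses are plain unsigned integers; the loop index i here runs 0 .. n-1, i.e. one less than the i used in the steps above).

```c
#include <stdio.h>

/* E-cube routing in an n-dimensional hypercube (N = 2^n nodes):
 * r = s XOR d gives the direction bits; the route flips one address bit
 * per hop, lowest dimension first. */
void ecube_route(unsigned s, unsigned d, int n)
{
    unsigned v = s;
    unsigned r = s ^ d;                     /* direction bits r_i        */
    printf("%u", v);
    for (int i = 0; i < n; i++) {           /* dimension i+1 in the text */
        if (r & (1u << i)) {                /* r_i = 1: cross this link  */
            v ^= 1u << i;                   /* v = v XOR 2^(i-1)         */
            printf(" -> %u", v);
        }                                   /* r_i = 0: skip dimension   */
    }
    printf("\n");                           /* v == d at this point      */
}

int main(void)
{
    ecube_route(6, 13, 4);   /* node 0110 to node 1101 in a 4-cube: 6 -> 7 -> 5 -> 13 */
    return 0;
}
```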
Example
Adaptive routing
    Multicast routing algorithms
• Communication patterns
  – Unicast
  – Broadcast
  – Multicast
           Routing efficiency
• Efficiency criteria: channel bandwidth (traffic generated) and communication delay (latency)
• Multicast is implemented by replicating the packet at intermediate nodes so that copies of the packet reach every destination
Virtual networks
Network partitioning