UNIT 2
Distributed System
Basic Concept of Distributed Shared Memory (DSM)
Distributed Shared Memory (DSM) is a system that allows multiple computers
(or nodes) in a distributed system to share a single logical memory space. This
makes it easier for programs to communicate and work together as if they are
running on a single system.
Key Points:
   1. Shared Memory Illusion: DSM provides an illusion of a single shared
      memory, even though the memory is physically distributed across
      multiple machines.
   2. Communication Simplification: Instead of sending explicit messages
      between computers, programs can simply read and write to the shared
      memory.
   3. Transparency: The system hides the complexity of data transfer between
      nodes, making it appear as though all memory is local.
   4. Consistency: DSM ensures data consistency so that when one node
      updates the shared memory, other nodes see the latest changes.
   5. Applications: DSM is used in parallel computing, distributed systems,
      and for tasks requiring high-speed data sharing between nodes.
Example Analogy:
Imagine a group of people working together on a whiteboard. Each person can
write and read from the same whiteboard without having to pass notes to each
other. DSM is like that whiteboard in a computer network.
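To make the illusion concrete, here is a tiny Python sketch (purely illustrative; the names DSMNode and fetch_from_owner are invented, and a real DSM would do this with network messages and OS page faults). A read or write looks local, but a missing page is quietly fetched from the node that owns it:

class DSMNode:
    def __init__(self, node_id, owners):
        self.node_id = node_id
        self.owners = owners           # page number -> node that owns it
        self.local_pages = {}          # pages currently cached here

    def read(self, page, offset):
        if page not in self.local_pages:               # a "page fault"
            self.local_pages[page] = self.fetch_from_owner(page)
        return self.local_pages[page][offset]

    def write(self, page, offset, value):
        if page not in self.local_pages:
            self.local_pages[page] = self.fetch_from_owner(page)
        self.local_pages[page][offset] = value         # looks like a local write

    def fetch_from_owner(self, page):
        # A real DSM would send a request to self.owners[page] over the
        # network; here we just hand back an empty page so the sketch runs.
        return [0] * 4096

node = DSMNode("A", owners={0: "B"})
node.write(0, 10, 99)                  # fetches page 0 from node "B" first
print(node.read(0, 10))                # 99, now served from the local copy

The program never sends an explicit message; the DSM layer does that behind the scenes.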
Design & Implementation Issues in DSM Systems
Building a Distributed Shared Memory (DSM) system involves several
challenges and design decisions. Here's a simple explanation:
1. Memory Management
   •   Problem: How to organize and manage memory across multiple
       computers.
   •   Solution: Use techniques to divide memory into pages (like chapters in a
       book) and ensure efficient access.
2. Data Placement
   •   Problem: Where should data be stored across the network?
   •   Solution: Place data near the nodes that use it the most to reduce
       delays.
3. Data Consistency
   •   Problem: If one computer updates the shared memory, how do others
       see the changes?
   •   Solution: Use consistency models (like sequential or eventual
       consistency) to ensure data is synchronized.
4. Communication Overhead
   •   Problem: Sharing data between computers involves network
       communication, which can be slow.
   •   Solution: Minimize unnecessary communication by caching data or
       batching updates.
5. Synchronization
   •   Problem: How to handle situations where multiple computers try to
       access or modify the same data at the same time.
   •   Solution: Use locks, semaphores, or atomic operations to manage
       access.
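A minimal Python sketch of this solution, using a local threading.Lock as a stand-in for the distributed lock a real DSM would need:

import threading

shared = {"counter": 0}
lock = threading.Lock()

def worker():
    for _ in range(100_000):
        with lock:                     # only one thread updates at a time
            shared["counter"] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"])               # always 400000 with the lock held

Without the lock, the threads' read-modify-write steps can interleave and some increments are silently lost.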
6. Fault Tolerance
   •   Problem: What happens if one computer crashes or loses connection?
   •   Solution: Implement mechanisms to detect failures and replicate data to
       avoid loss.
7. Scalability
   •   Problem: Can the system handle more computers as needed?
   •   Solution: Design the system so that adding more nodes does not
       degrade performance.
8. Security
   •   Problem: How to protect shared memory from unauthorized access or
       attacks.
   •   Solution: Use encryption, authentication, and access controls.
Example Analogy:
Imagine a group project where everyone works on a shared document:
   •   Memory Management: Decide how to divide and assign parts of the
       document.
   •   Data Placement: Keep sections with the person who edits them the
       most.
   •   Synchronization: Ensure two people don’t edit the same sentence at the
       same time.
These issues must be carefully addressed to create an efficient and reliable
DSM system!
Consistency Models in Distributed Shared Memory (DSM)
A consistency model defines the rules about how and when changes (updates)
made to shared memory on one node become visible to other nodes in a
distributed system. These rules affect the system’s behavior and performance.
Here’s a simple explanation of common consistency models:
1. Strict Consistency
   •   What it means: Any read operation on shared memory will always return
       the most recent write.
   •   Example: Imagine a chalkboard where if one person writes something,
       everyone immediately sees it.
   •   Pros: Perfect synchronization.
   •   Cons: Hard to achieve because it requires instant updates to all nodes,
       which is slow in real-world networks.
2. Sequential Consistency
   •   What it means: All nodes see memory operations (reads and writes) in
       the same order, but the order doesn’t have to be the real-time order.
   •   Example: Think of a queue where actions happen one by one. If Node A
       writes and then Node B writes, all nodes will see those writes in the
       same sequence (A first, then B).
   •   Pros: Easier to implement than strict consistency.
   •   Cons: Can still be slow due to synchronization requirements.
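A deliberately simplified, single-process Python sketch of the idea: all writes are appended to one global log, and every node builds its memory by replaying that log, so all nodes observe the same order:

global_log = []                        # one agreed-upon total order of writes

def write(key, value):
    global_log.append((key, value))    # the sequencer fixes the position

def node_view(log):
    memory = {}
    for key, value in log:             # every node replays the same log
        memory[key] = value
    return memory

write("x", 1)                          # Node A writes first
write("x", 2)                          # Node B writes second
print(node_view(global_log))           # every node ends with {'x': 2}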
3. Causal Consistency
   •   What it means: If one operation (write) causes another, all nodes must
       see them in the same order. But if operations are independent, they can
       be seen in different orders.
   •   Example: If you post a message ("Hello") and someone replies to it
       ("Hi"), all nodes will see "Hello" before "Hi." But two unrelated messages
       can appear in any order.
   •   Pros: More flexible and faster than sequential consistency.
   •   Cons: Slightly more complex to implement.
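One common way to track such dependencies is with vector clocks, where slot i counts the events node i has seen. The Python sketch below (illustrative values, not a full protocol) checks whether one write causally precedes another; if neither precedes the other, the writes are concurrent and may be applied in any order:

def happens_before(a, b):
    # a causally precedes b if a <= b in every slot and they differ somewhere
    return all(x <= y for x, y in zip(a, b)) and a != b

hello = (1, 0)    # "Hello" posted on node 0
other = (0, 1)    # a message node 1 wrote before seeing "Hello"
hi    = (1, 2)    # "Hi", node 1's reply after it saw "Hello"

print(happens_before(hello, hi))       # True:  "Hello" must appear first
print(happens_before(hello, other))    # False: concurrent with "Hello"...
print(happens_before(other, hello))    # False: ...so any order is allowed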
4. Eventual Consistency
   •   What it means: Changes to shared memory will eventually become
       visible to all nodes, but not immediately.
   •   Example: A news update takes time to reach everyone, but eventually,
       everyone gets the same information.
   •   Pros: Very fast and used in systems like NoSQL databases (e.g.,
       DynamoDB, Cassandra).
   •   Cons: Temporary inconsistencies can occur, which might confuse some
       applications.
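A minimal Python sketch of the behavior: a write is applied locally right away and handed to a background thread, which delivers it to a replica after a simulated network delay. The replica is briefly stale but eventually converges:

import queue
import threading
import time

primary, replica = {}, {}
updates = queue.Queue()

def replicate():
    while True:
        key, value = updates.get()
        time.sleep(0.1)                # simulated network delay
        replica[key] = value

threading.Thread(target=replicate, daemon=True).start()

primary["score"] = 42                  # applied locally immediately
updates.put(("score", 42))
print(replica.get("score"))            # likely None: not propagated yet
time.sleep(0.3)
print(replica.get("score"))            # 42: the replicas have converged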
5. Weak Consistency
   •   What it means: Updates become visible only after certain
       synchronization points.
   •   Example: You’re editing a document. Changes become visible to others
       only when you save the file.
   •   Pros: Improves performance by reducing communication overhead.
   •   Cons: Requires careful programming to avoid errors.
6. Release Consistency
   •   What it means: Shared memory updates are visible only when a special
       "release" operation is performed.
   •   Example: Imagine workers at a construction site. Everyone gets updated
       instructions only after the foreman announces them.
   •   Pros: High performance for certain applications.
   •   Cons: Programmers need to manage release operations.
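A small Python sketch of the pattern (the class name ReleaseConsistentBuffer is invented for illustration): writes are buffered privately and pushed to shared memory in one batch when release() is called:

shared_memory = {}

class ReleaseConsistentBuffer:
    def __init__(self):
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value      # buffered, invisible to other nodes

    def release(self):
        shared_memory.update(self.pending)   # one batched, visible update
        self.pending.clear()

buf = ReleaseConsistentBuffer()
buf.write("plan", "v2")
print(shared_memory)                   # {}: nothing visible before release
buf.release()
print(shared_memory)                   # {'plan': 'v2'}: visible afterwards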
7. Consistency in Real-Time Systems
   •   What it means: Adds time constraints to ensure updates are visible
       within a specific time.
   •   Example: Imagine a live sports score update system. The score must be
       consistent across all devices within seconds.
   •   Pros: Suitable for time-sensitive applications.
   •   Cons: Requires precise synchronization, which can be costly.
Summary Table
Model                    Updates Visible                Speed       Use Case
Strict Consistency       Immediately                    Slow        Real-time systems
Sequential Consistency   Same order for all nodes       Moderate    Collaborative tools
Causal Consistency       Based on dependencies          Moderate    Social media (post/reply threads)
Eventual Consistency     Eventually consistent          Fast        Cloud databases (NoSQL)
Weak Consistency         After synchronization points   Very fast   High-performance systems
Release Consistency      After a "release" operation    Fast        Parallel computing, gaming servers
What is Thrashing?
Thrashing happens when a computer system spends more time swapping
data between RAM and the hard drive (or other storage) than doing actual
work. This can make the system very slow.
Why Does Thrashing Happen?
   1. Not Enough RAM: When the computer doesn’t have enough memory
      (RAM) to run programs, it uses the hard drive as "virtual memory."
   2. Too Many Programs Open: If too many programs are running at the
      same time, the computer keeps moving data back and forth between the
      RAM and storage.
   3. Frequent Page Faults: A "page fault" occurs when the program needs
      data that isn’t currently in RAM, so the computer has to fetch it from
       storage. Too many page faults cause thrashing.
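The effect can be shown with a short Python simulation of LRU page replacement (a simplification of what the OS really does). With enough RAM frames the cyclic workload faults only on its first pass; with one frame too few, every single access becomes a page fault:

from collections import OrderedDict

def count_faults(frames, accesses):
    ram = OrderedDict()                # pages currently held in RAM
    faults = 0
    for page in accesses:
        if page in ram:
            ram.move_to_end(page)      # mark as recently used
        else:
            faults += 1                # page fault: fetch from storage
            if len(ram) >= frames:
                ram.popitem(last=False)    # evict least recently used page
            ram[page] = True
    return faults

workload = list(range(5)) * 100        # a working set of 5 pages, repeated
print(count_faults(5, workload))       # 5 faults: the working set fits
print(count_faults(4, workload))       # 500 faults: every access faults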
How Does It Affect the System?
   •   The system becomes very slow.
   •   Programs take a long time to respond.
   •   The hard drive (or SSD) is used a lot, which can cause wear and tear.
Simple Analogy:
Imagine you’re studying with limited desk space. If your desk is too small to
hold all your books, you keep putting books on the shelf and taking them back.
You waste time moving books instead of studying. This is like thrashing in a
computer!
How to Avoid Thrashing?
   1. Add More RAM: More memory reduces the need for virtual memory.
   2. Close Unnecessary Programs: Free up resources by running fewer
      programs.
   3. Optimize Programs: Use lightweight software that uses less memory.
   4. Increase Page File Size: Adjust virtual memory settings to handle more
      data.
Thrashing is a sign that the system is struggling with memory management and
needs attention to work efficiently.
Structure of Shared Memory Space
The shared memory space in a computer system is the part of memory that
multiple processes (or computers in a distributed system) can access to share
data and communicate with each other. The structure of this memory is
carefully organized to ensure efficiency and consistency.
1. Divided into Segments
The shared memory space is usually divided into segments to manage data
easily. Each segment stores a specific type of information:
   •   Data Segment: Stores shared data like variables, arrays, or objects.
   •   Code Segment: Stores shared executable code or instructions.
   •   Synchronization Segment: Stores locks, semaphores, or signals to
       coordinate access.
2. Addressing
Each segment in the shared memory space has a unique address so processes
can find and access it:
   •   Logical Address: A program’s view of memory, making it simple to use.
   •   Physical Address: The actual location in the system memory (RAM).
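Python's multiprocessing.shared_memory module gives a hands-on version of this: a segment is created under a name (its logical address), and any process that knows the name can attach to the same physical memory. The segment name demo_segment below is arbitrary:

from multiprocessing import shared_memory

# Create a 16-byte segment under a chosen logical name.
seg = shared_memory.SharedMemory(name="demo_segment", create=True, size=16)
seg.buf[0] = 42                        # write into the shared segment

# A second process (or, here, the same one) attaches by that name.
view = shared_memory.SharedMemory(name="demo_segment")
print(view.buf[0])                     # 42: the same physical memory

view.close()
seg.close()
seg.unlink()                           # release the segment when done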
3. Access Control
To ensure safety, access to the shared memory is controlled:
   •   Read-Only Access: Some processes can only read data but not modify it.
   •   Read-Write Access: Some processes can both read and write data.
   •   Access is managed using permissions to prevent conflicts or errors.
4. Synchronization Mechanisms
To avoid multiple processes modifying the same data at the same time:
   •   Locks: Prevent simultaneous access.
   •   Semaphores: Allow controlled access by multiple processes.
   •   Barriers: Ensure all processes reach a certain point before proceeding.
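All three mechanisms exist in Python's threading module, used below as a simple stand-in for their distributed counterparts:

import threading

lock = threading.Lock()                # at most one holder at a time
pool = threading.Semaphore(2)          # at most two holders at a time
barrier = threading.Barrier(3)         # all three must arrive to proceed

def worker(name):
    with pool:                         # semaphore: limited concurrent access
        with lock:                     # lock: exclusive critical section
            print(name, "in the critical section")
    barrier.wait()                     # barrier: wait until everyone arrives
    print(name, "past the barrier")

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()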
5. Consistency Management
The system ensures that:
   •   When one process updates data, other processes see the updated value.
   •   Changes are synchronized across all users of the shared memory.
Simple Analogy:
Think of shared memory space like a whiteboard in a classroom:
   •   Segments: Different sections of the whiteboard for different subjects
       (math, science, etc.).
   •   Addressing: Each section has a label, so you know where to write or
       read.
   •   Access Control: Rules decide who can write or just read (e.g., students
       vs. teachers).
   •   Synchronization: Only one person writes at a time to avoid overwriting.
Benefits of Shared Memory Space:
   •   Fast Communication: Processes can exchange data quickly.
   •   Reduced Overhead: No need for frequent messaging or data transfers.
   •   Efficient Resource Use: Memory is shared, reducing duplication.
This structure makes shared memory a powerful tool for collaborative tasks in
computing!
File Model
In a distributed system, where files are stored and managed across multiple
computers or servers, file models are classified based on structure and
modifiability. Let’s explain this in simple terms:
1. Based on Structure
This focuses on how data is organized in the file, even when it’s spread across
multiple systems.
   •   Structured Files:
          o   Data is highly organized and follows a predictable format (e.g.,
              rows, columns, or key-value pairs).
          o   These files make it easier for distributed systems to process, share,
              and query the data quickly.
          o   Example: A distributed database like HBase or a file stored in a
              table-like structure in Hadoop.
          o   Analogy: Think of a shared online spreadsheet where every entry
              fits neatly into a predefined box.
   •   Unstructured Files:
          o   Data does not follow a strict format. It could be plain text, images,
              videos, or logs.
          o   In distributed systems, unstructured data requires extra tools (like
              indexing or AI) to organize or analyze it.
          o   Example: A distributed storage system like Amazon S3 storing
              photos, videos, or PDFs.
          o   Analogy: Imagine sharing random notes or pictures with no
              consistent arrangement.
2. Based on Modifiability
This focuses on whether files can be changed after being stored in the
distributed system.
   •   Mutable Files:
          o   Files can be updated or edited directly, even when distributed.
          o   These systems allow changes but often require coordination to
              ensure all copies (on different servers) stay consistent.
          o   Example: A shared Google Doc that multiple users can edit
              simultaneously.
          o   Challenge: In distributed systems, ensuring consistency (so all
              servers have the same version) is hard and requires protocols like
              locks or version control.
   •   Immutable Files:
          o   Files cannot be edited after being created. If changes are needed,
              a new version of the file is created instead.
          o   These systems are easier to manage in distributed environments
              because no coordination is required for updates.
          o   Example: Log files in systems like Apache Kafka, where each record
              is appended and never changed.
          o   Analogy: Like sending a letter—it’s permanent once written and
              sent. If you want changes, you write a new one.
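The contrast can be captured in a tiny Python sketch of an immutable, versioned store (class and method names invented for illustration): a "change" appends a new version instead of editing in place, so servers never need to coordinate an update:

class ImmutableStore:
    def __init__(self):
        self.versions = {}             # filename -> list of versions

    def put(self, name, content):
        self.versions.setdefault(name, []).append(content)
        return len(self.versions[name])      # version number just written

    def get(self, name, version=-1):
        return self.versions[name][version]  # default: the latest version

store = ImmutableStore()
store.put("report.txt", "draft")
store.put("report.txt", "final")       # a new version, not an edit
print(store.get("report.txt"))         # 'final'
print(store.get("report.txt", 0))      # 'draft' is still intact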
Why It Matters in Distributed Systems:
   1. Structured files are great for tasks like querying or analytics, as they’re
      easier to process in systems like Hadoop or Spark.
   2. Unstructured files work well for storing raw data like media or logs but
      need tools (like Elasticsearch) for searching or analyzing them.
   3. Mutable files are useful for collaboration but add complexity because all
      servers must stay synchronized.
   4. Immutable files are simpler to handle in distributed systems, reducing
      the risk of conflicts or corruption.
Example in Practice:
   •   Structured + Mutable: A shared database that multiple servers can
       update, like a distributed SQL database.
   •   Unstructured + Immutable: A distributed file system like AWS S3 storing
       videos or backups that don’t change after upload.
Understanding these models helps in designing efficient distributed systems
based on the types of files and operations needed!
File Sharing Semantics
File Sharing Semantics in distributed systems define the rules for how changes
to files are shared and visible to users. Here’s a simple explanation of four
specific types: Unix, Session, Immutable, and Transaction-like Semantics:
1. Unix Semantics
   •   What It Means: Every user sees the most recent changes to a file
       immediately after any modification.
   •   How It Works:
          o   If User A edits a file, User B will instantly see the updates, as if
              they are both working on the same computer.
          o   This is the behavior of traditional Unix file systems (like Linux).
   •   Challenge in Distributed Systems: Achieving this in a distributed system is
       difficult because updates need to sync across all servers instantly.
   •   Analogy: Imagine writing on a whiteboard. Everyone can see the changes
       as you write.
2. Session Semantics
   •   What It Means: Changes made by a user are visible only to that user
       during their editing session. Other users see the changes only after the
       session ends (when the file is closed and saved).
   •   How It Works:
          o   If User A opens a file, edits it, and saves it, User B will only see the
              changes after User A finishes and closes the file.
          o   This reduces conflicts and ensures smoother editing.
   •   Common Use: Systems like Dropbox often work this way.
   •   Analogy: Think of borrowing a book. You can read and make notes, but
       others won’t see your changes until you return it.
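A small Python sketch of session semantics (a single-process illustration, not a real file server): opening takes a private snapshot, writes stay inside the session, and closing publishes them:

server_files = {"notes.txt": "v1"}     # the copy every other user reads

class Session:
    def __init__(self, name):
        self.name = name
        self.copy = server_files[name]       # private working copy

    def write(self, text):
        self.copy = text                     # visible only in this session

    def close(self):
        server_files[self.name] = self.copy  # publish on close

s = Session("notes.txt")
s.write("v2")
print(server_files["notes.txt"])       # 'v1': others still see the old file
s.close()
print(server_files["notes.txt"])       # 'v2': visible once the session ends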
3. Immutable Semantics
   •   What It Means: Files cannot be modified once created. If changes are
       needed, a new version of the file is created instead.
   •   How It Works:
          o   If User A wants to update a file, they create a new version (e.g.,
              "file_v2") rather than editing the original file.
          o   This ensures consistency, as the original file remains unchanged.
   •   Common Use: Systems like Git or blockchain use this approach to avoid
       conflicts and maintain history.
   •   Analogy: Think of taking a photo. You can’t change the photo itself, but
       you can take another one if needed.
4. Transaction-like Semantics
   •   What It Means: File changes happen in a series of steps, and all the steps
       must succeed or fail together (like a database transaction).
   •   How It Works:
          o   If User A wants to update a file, the system ensures all changes are
              completed successfully. If something goes wrong, the file reverts
              to its original state (no partial updates are allowed).
          o   This ensures data integrity, especially in critical systems.
   •   Common Use: Used in systems requiring high reliability, like banking or
       airline reservation systems.
   •   Analogy: Think of transferring money between two accounts. The
       transfer only succeeds if both accounts are updated correctly—
       otherwise, it’s canceled.
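A classic single-machine version of the all-or-nothing idea is the write-then-rename trick sketched below (only a local illustration, not a distributed transaction protocol): all changes go into a temporary file, and os.replace() commits them atomically, so readers see either the old file or the new one, never a half-written mixture:

import os
import tempfile

def transactional_write(path, new_contents):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_contents)      # a crash here leaves the original intact
        os.replace(tmp, path)          # atomic commit: old file swaps to new
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)             # roll back by discarding partial work
        raise

transactional_write("balance.txt", "100")
transactional_write("balance.txt", "250")
print(open("balance.txt").read())      # '250': each write landed atomically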
Summary
   1. Unix Semantics: Changes are immediately visible to everyone.
   2. Session Semantics: Changes are visible only after the file is closed.
   3. Immutable Semantics: Files cannot be changed, only replaced with new
      versions.
   4. Transaction-like Semantics: Changes are made in an all-or-nothing
      manner, ensuring data consistency.
Each semantic is chosen based on the needs of the distributed system,
balancing speed, consistency, and reliability.