Disk Storage & Virtualization Basics
Unit 3
Aronya Baksy
March 2021
1. Seek Time: The time needed for the controller to position the disk head to the correct
cylinder of the disk
2. Rotational Latency: The time needed for the first sector of the block to position itself
under the disk head
3. Transfer Time: The time needed for the disk controller to read/write all the sectors of the block.
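The total access time for a block is the sum of these three components. As a rough worked example (with assumed values): a 4 ms average seek, a 7200 rpm disk (average rotational latency = 0.5 × 60/7200 s ≈ 4.17 ms) and a 1 MB block transferred at 200 MB/s (5 ms) give an access time of roughly 13 ms.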
• RAID (Redundant Array of Independent Disks) is a storage virtualization technology that com-
bines multiple physical disks into one or more logical volumes for increased redundancy and faster
performance.
• The driving technologies behind RAID are striping, mirroring and parity checking.
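As an illustration of parity checking, the sketch below computes an XOR parity block over three data blocks and uses it to reconstruct a lost block; the block contents and sizes are arbitrary assumptions, not a real RAID implementation.

# Minimal sketch of RAID-style XOR parity (illustrative only).
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]      # three data blocks striped across disks
parity = xor_blocks(data)               # parity block stored on a fourth disk

# If one data block is lost, XOR of the survivors and the parity recovers it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]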
• DAS (Direct Attached Storage) is only accessible from the node to which the storage device is physically attached.
• Network Attached Storage (NAS) is a file-level storage device connected to a heterogeneous
group of clients.
• A single NAS device containing physical storage devices (these may be arranged in RAID) serves
all file requests from any client in the connected network.
• NAS removes the responsibility of file serving from other servers on the network. Data is transferred
over Ethernet using the TCP/IP protocol.
• Storage Area Network (SAN) is a network that provides access to block-level data storage.
• A SAN is built from a combination of servers and storage over a high speed, low latency interconnect
that allows direct Fibre Channel connections from the client to the storage volume to provide the
fastest possible performance.
• The SAN may also require a separate, private Ethernet network between the server and clients to
keep the file request traffic out of the Fibre Channel network for even more performance.
• It allows for simultaneous shared access, but it is more expensive than NAS and DAS.
• Distinct protocols were developed for SANs, such as Fibre Channel, iSCSI and InfiniBand.
Figure 1: Storage Architectures
• LVM (Logical Volume Manager) provides a method of allocating space on mass-storage devices that is
more flexible than conventional partitioning schemes.
• LVM makes it possible to:
1. Extend a volume while it is active and holds a full file system (shrinking a volume requires
unmounting it and suitable free space)
2. Collect multiple physical drives into a volume group
• LVM consists of the following basic components layered on top of each other:
– A physical volume corresponds to a physical disk detected by the OS (often labelled sda or
sdb). (NOTE: partitions of a single physical disk appear to the OS as separate block devices,
and each can be used as a separate physical volume.)
– A volume group groups together one or more physical volumes
– A logical volume is a logical partition of the volume group. Each logical volume hosts its own
file system.
• The /boot partition cannot be included in LVM, as GRUB (the GRand Unified Bootloader, which loads
the bootstrap program from the master boot record) cannot read LVM metadata.
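A minimal sketch of the usual physical volume → volume group → logical volume workflow, driven from Python; the device and volume names are hypothetical, and the commands assume root privileges and the lvm2 tools.

import subprocess

# Hypothetical device/volume names; requires root and the lvm2 userspace tools.
commands = [
    ["pvcreate", "/dev/sdb"],                               # register the disk as a physical volume
    ["vgcreate", "vg_data", "/dev/sdb"],                    # collect physical volumes into a volume group
    ["lvcreate", "-L", "10G", "-n", "lv_home", "vg_data"],  # carve a logical volume out of the group
    ["mkfs.ext4", "/dev/vg_data/lv_home"],                  # each logical volume hosts its own file system
]
for cmd in commands:
    subprocess.run(cmd, check=True)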
2 Storage Virtualization
• Abstraction of physical storage devices into logical entities presented to the user, hiding the un-
derlying hardware complexity and access functionality (either direct access or network access)
• Advantages of storage virtualization are:
– Enables higher resource usage by aggregating multiple heterogeneous devices into pools
– Easy centralized management, provisioning of storage as per application needs (performance
and cost).
– The client passes this layout to a Logical Object Volume (LOV). The LOV maps the layout
to objects and their actual locations on different OSTs
– The client then locks the file range being operated on and executes one or more parallel
reads/writes directly to the OSTs
– Server delivers the combined disk space of all the physical storage servers as a single file
system
– Client implements highly available, massively parallel access to each storage node along with
node failure handling
• A storage brick is a server (containing directly attached storage or connected to a SAN) on which
a file system (like ext3 or ext4) is created
• A translator is a layer between a brick and the actual user. It acts as a file system interface and
implements one single Gluster functionality
• I/O Scheduling Translators are responsible for load balancing.
• Automatic File Replication (AFR) translator keeps identical copies of a file/directory on all its
subvolumes (used for replication)
2.2.3 Network-Level BLV
• The most commonly implemented and most scalable form, implemented as part of the interconnect
network between storage and hosts (e.g. a Fibre Channel SAN)
• Switch-based: the actual virtualization occurs in an intelligent switch in the network, and it
works in conjunction with a metadata manager
• Appliance-based: I/O is routed through an appliance that manages the virtualization layer
• In-band appliances sit in the data path: all I/O flows through the appliance, with no direct interaction between client and storage.
• Out-of-band appliances manage only metadata (control paths) while the actual data flows di-
rectly between client and storage server (each client having an agent to manage this)
– Access Control Lists: Set permissions to allow other users to access an object
– Audit Logs: Once enabled, S3 stores an access log for the bucket, recording the AWS account,
IP address, time of access and the operations performed by each requester.
• Data Security is maintained in S3 using:
– Replication: Data is replicated across multiple devices, tolerating up to 2 replica failures (a
cheaper option is Reduced Redundancy Storage, which survives only 1 replica failure), but
consistency across replicas is not guaranteed.
– Versioning: If enabled, S3 stores the full history of each object. It allows for changes to be
undone, including file deletions.
– Regions: select location of S3 bucket for performance/legal reasons.
• S3 allows for large objects to be uploaded in parts. These parts can be uploaded in parallel for
maximum network utilization
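A hedged sketch of such a multipart upload using boto3's managed transfer; the bucket and object names are placeholders, and AWS credentials are assumed to be configured.

import boto3
from boto3.s3.transfer import TransferConfig

# Hypothetical bucket/key names; requires configured AWS credentials.
s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # objects above 8 MB are split into parts
    multipart_chunksize=8 * 1024 * 1024,   # each part is 8 MB
    max_concurrency=4,                     # parts are uploaded in parallel
)
s3.upload_file("large_backup.tar", "my-example-bucket",
               "backups/large_backup.tar", Config=config)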
• A table is a collection of items; an item is a collection of attribute-value pairs. The primary key identifies
items uniquely in a table.
• A partition is an allocation of storage for a table, backed by SSDs and automatically replicated
across multiple Availability Zones within an AWS Region.
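A minimal boto3 sketch of the table/item/attribute model; the table name, key schema and attribute names are assumptions (the table is assumed to already exist with user_id as its primary key).

import boto3

# Hypothetical table and attribute names; assumes the table already exists.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

table.put_item(Item={"user_id": "u123", "name": "Alice", "plan": "free"})  # item = attribute-value pairs
item = table.get_item(Key={"user_id": "u123"})["Item"]                     # look up by primary key
print(item)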
• RDS provides encryption at rest and in transit, as well as APIs for applications.
4 Partitioning
• Breaking down large DBs into smaller units that are stored on different machines. Each row belongs
to exactly one partition
• Supports operations that touch multiple partitions at the same time.
• Motivation is scalability in terms of load balancing and query throughput, as well as fault tolerance
(when combined with replication)
• Small queries can be independently processed by one partition. Large queries can be parallelized
between multiple partitions.
• When some partitions have more data than others, they are said to be skewed. A partition with
disproportionately high load is called a hot spot
• Disadvantage of assigning records to nodes randomly: when trying to read a particular item, there is
no way of knowing which node it is on, so all nodes need to be queried in parallel.
4.1.3 Partitioning by Hash of Key
• Using a suitable hash function for keys, each partition has a range of hash values assigned to it
(rather than a range of keys), and every key whose hash falls within a partition’s range will be
stored in that partition.
• A good hash function takes skewed data and makes it uniformly distributed
• Simple hash partitioning does not allow efficient range queries. This is solved using composite keys.
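A minimal sketch of partitioning by hash of key, where each partition owns a contiguous range of hash values; the hash function and partition count are arbitrary choices.

import hashlib

# Each partition owns a range of hash values rather than a range of keys.
NUM_PARTITIONS = 4
HASH_SPACE = 2**32
RANGE_SIZE = HASH_SPACE // NUM_PARTITIONS

def partition_for(key: str) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h // RANGE_SIZE   # the partition whose hash range contains h

print(partition_for("user:42"), partition_for("user:43"))  # adjacent keys scatter, so range queries hit many partitions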
• Consistent hashing is a way of evenly distributing load across an internet-wide system of servers
such as a content delivery network
• It uses randomly chosen partition boundaries to avoid the need for central control or distributed
consensus
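A minimal consistent-hashing ring (no virtual nodes); the node names are hypothetical. A key is assigned to the next node hash clockwise on the ring, so adding or removing a node only remaps the keys in the affected arc.

import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)   # partition boundaries on the ring

    def node_for(self, key: str):
        hashes = [point for point, _ in self.ring]
        i = bisect.bisect(hashes, h(key)) % len(self.ring)   # wrap around the ring
        return self.ring[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))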
• Each partition maintains its own secondary index, covering only the documents in that partition.
• Reading involves reading from each and every partition and separately combining the results. This
approach is called scatter-gather, and it makes read queries expensive
• Even if the partitions are queried in parallel, scatter/gather is prone to tail latency amplification
• Writes to a global (term-partitioned) secondary index are less efficient, as a single write may affect
multiple partitions of the index; keeping the index up to date synchronously would require a distributed
transaction across all partitions affected by the write
• In practice, updates to global secondary indexes are often asynchronous
• Assigning keys to nodes as hash mod N is simple, but its drawback is that any change in N leads to
rehashing of a large number of keys, which makes rebalancing very expensive
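A small demonstration of this rebalancing cost: going from N = 4 to N = 5 nodes moves most keys. The hash values here are simulated as uniformly distributed integers.

# Simulated, uniformly distributed key hashes (stand-ins for real hash values).
hashes = range(10_000)
moved = sum(1 for h in hashes if h % 4 != h % 5)     # node assignment changes when N goes 4 -> 5
print(f"{moved / 10_000:.0%} of keys change node")   # 80% of keys would have to be moved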
4.3.2 Fixed number of partitions
• Move only entire partitions. Assignment of keys to partitions does not change, but only assignment
of partitions to nodes changes.
• Create many more partitions than there are nodes and assign several partitions to each node
• If a node is added to the cluster, the new node can steal a few partitions from every existing node
until partitions are fairly distributed once again
• Keeping the number of partitions fixed is operationally simpler, so many fixed-partition databases choose not to implement partition splitting and merging
• Choosing the right number of partitions is difficult if the size of the dataset is variable
• In dynamic partitioning, the partitions split if they grow beyond an upper bound. If the partition
shrinks below a lower bound, it can be merged with an adjacent partition
• Can be used with both key-range partitioned and hash partitioned data
4.4.1 ZooKeeper
• A distributed metadata management system for clusters.
• ZooKeeper maintains an authoritative mapping between partitions and nodes, and each node
registers itself with the ZooKeeper service.
• Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this infor-
mation in ZooKeeper
• When partitioning changes or node removal/addition occurs, ZooKeeper notifies the routing tier
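A hedged sketch of this registration/subscription pattern using the third-party kazoo client; the ZooKeeper address, znode paths and node data are all assumptions.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper ensemble
zk.start()

# A storage node registers itself under an ephemeral znode (removed automatically if the node dies).
zk.ensure_path("/cluster/nodes")
zk.create("/cluster/nodes/node-1", b"10.0.0.1:6379", ephemeral=True)

# The routing tier subscribes to membership changes and is notified on node addition/removal.
@zk.ChildrenWatch("/cluster/nodes")
def on_membership_change(children):
    print("active nodes:", children)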
5 Replication
• Keeping multiple copies of a single partition on different nodes connected by a network
• Motivation for replication:
– Reduce latency by reducing distance to user
– Increase availability by allowing fault tolerance
– Increase read throughput by allowing more parallel reads (scalable)
5.1.2 Implementation
• Statement Replication: The leader logs every write request that it executes and sends that
statement log to its followers (fails for non-deterministic functions like rand() and now())
• Write-Ahead Log Shipping: The leader writes the log (an append-only byte stream) to disk
and sends it across the network to its followers. When the follower processes this log, it builds a
copy of the exact same data structures as found on the leader.
• Logical Log Replication: Uses different log formats for replication and for the storage engine.
A logical log (aka the replication log) is a sequence of records describing writes to database tables
at the row level (a small sketch of applying such records follows this list)
• Trigger-Based Replication: A trigger on the leader table logs the change to another table, from which
an external process can read it and apply the change to another system
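A minimal sketch of applying row-level logical log records on a follower, as mentioned in the Logical Log Replication bullet above; the record format and table layout are assumptions.

# Row-level change records, applied in order on the follower's copy of the data.
replication_log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Alice"}},
    {"op": "update", "table": "users", "key": 1, "set": {"name": "Alice B."}},
    {"op": "delete", "table": "users", "key": 1},
]

follower = {"users": {}}   # follower's tables, keyed by primary key

def apply(record):
    table = follower[record["table"]]
    if record["op"] == "insert":
        table[record["row"]["id"]] = record["row"]
    elif record["op"] == "update":
        table[record["key"]].update(record["set"])
    elif record["op"] == "delete":
        del table[record["key"]]

for record in replication_log:
    apply(record)
print(follower)   # empty again: insert, then update, then delete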
5.2 Replication lag
• The delay between a write happening on the leader and the same being reflected on a follower is
known as the replication lag.
• Read-After-Write consistency is a guarantee for a single user: if that user reads the data at any time
after writing it, they will see the updated data.
• Solutions:
– Read critical data from leader, rest from follower (negates scaling advantage)
– Prevent queries on any follower that is lagging significantly behind the leader
– The client remembers the timestamp of its most recent write, and the system ensures that the
node serving that user has been updated at least up to that timestamp (see the sketch after this list)
– Monotonic reads: each user always reads from the same replica
– Consistent prefix reads: if a sequence of writes happens in a certain order, then anyone reading
those writes should see them appear in the same order
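A minimal sketch of the remembered-write-timestamp approach from the list above; the Replica objects and their replication lag are simulated, not a real client library.

class Replica:
    def __init__(self):
        self.data = {}
        self.last_applied = 0.0    # timestamp of the latest write this replica has applied

    def read(self, key):
        return self.data.get(key)

def read_your_writes(client_last_write_ts, replicas, key):
    # Serve the read only from a replica that has caught up to the client's last write.
    for replica in replicas:
        if replica.last_applied >= client_last_write_ts:
            return replica.read(key)
    raise RuntimeError("no sufficiently fresh replica; retry later or read from the leader")

fresh, stale = Replica(), Replica()
fresh.last_applied, stale.last_applied = 105.0, 90.0
fresh.data["profile"] = "new bio"
print(read_your_writes(100.0, [stale, fresh], "profile"))   # served by the fresh replica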
• In a multi-leader config, each datacenter can continue operating independently of the others, and
replication catches up when the failed datacenter is back online.
• In a single-leader config, the public internet is used for synchronous updates between leader and
follower, making the setup sensitive to problems in this network
• A multi-leader config with asynchronous replication tolerates network problems better, as temporary
network problems do not prevent writes from being processed
• In some implementations, the client sends writes to multiple nodes at the same time
• In others, a single co-ordinator node does this on behalf of the client, but it does not enforce a
particular order of writes (like a leader in a single-leader set up does)
• Writes may be sent to multiple nodes while some of those nodes are down and hence miss the write;
when the failed nodes come back online, the data on them is out of date (stale)
• To solve this issue, each data item has a version number associated with it. The client reading
from multiple replicas checks the version number of the data and selects the most recent one.
• When the client reads values with different version numbers, it writes the most recent version of the
data back to all the nodes holding less recent versions. This is called read repair.
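A minimal read-repair sketch; plain dictionaries with version numbers stand in for the replicas.

replicas = [
    {"value": "v2", "version": 2},
    {"value": "v1", "version": 1},   # stale: missed the latest write
    {"value": "v2", "version": 2},
]

def read_with_repair(replicas):
    latest = max(replicas, key=lambda r: r["version"])
    for r in replicas:
        if r["version"] < latest["version"]:
            r.update(latest)          # write the newest value back to stale replicas
    return latest["value"]

print(read_with_repair(replicas))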
• A background process (rather than the client itself) monitors all data values and their versions
across all nodes, and periodically writes the latest value of the data to all the replicas. This is
called an anti-entropy process.
• Let there be n nodes, with r nodes queried for each read and w nodes confirming each write. If
w + r > n, then an up-to-date copy of the data is guaranteed while reading, as at least one of the r
nodes being read from must be up to date.
• Reads and writes that obey the above rule are called quorum reads and writes.
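• For example, with n = 3 replicas, a common (assumed) configuration is w = 2 and r = 2: since 2 + 2 > 3, any 2 nodes read from must overlap in at least one node with the 2 nodes that acknowledged the most recent write, so every read sees an up-to-date copy.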
5.4.1 Monitoring
• Monitoring in leaderless systems is difficult as writes do not happen in any particular order
• In single-leader systems, the writes are in a fixed order maintained on the edit log of the leader.
The out-of-date follower can compare its position (timestamp) with that of the leader and make
the necessary changes.
• Version vectors are sent from the database replicas to clients when values are read, and need to be
sent back to the database when a value is subsequently written
• The version vector allows the database to distinguish between overwrites and concurrent writes,
and ensures that it is safe to read from one replica and write to another.
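A minimal version-vector sketch showing how overwrites are distinguished from concurrent writes; the replica names and counters are hypothetical.

def merge(a, b):
    """Element-wise maximum of two version vectors."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def happened_before(a, b):
    """True if a is causally older than b (b supersedes a)."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in a.keys() | b.keys()) and a != b

v1 = {"replica_a": 2, "replica_b": 1}
v2 = {"replica_a": 2, "replica_b": 2}
v3 = {"replica_a": 3, "replica_b": 1}

print(happened_before(v1, v2))                            # True: v2 is an overwrite of v1
print(happened_before(v2, v3), happened_before(v3, v2))   # False, False: concurrent writes
print(merge(v2, v3))   # version vector to attach after the application merges the concurrent values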
6 Consistency Models
• Most distributed systems only guarantee eventual consistency
• In eventual consistency, data read at any point may not be consistent across nodes, but if there
are no writes for some unspecified interval then all the nodes can catch up to the consistent state
• This is a weak guarantee, as it does not give any guarantees about actual time of consistency.
6.1 Linearizability
• The illusion that there is only one copy of a data item across a distributed system. (implies that
all data must be up to date at all times, no staleness in caching)
• Ensures that applications running on the distributed system do not need to worry about replication.
• Main point of linearizability: After any one read has returned the new value, all following reads
(on the same or other clients) must also return the new value.
• Compare-and-Set (CAS) is an operation on the database (see the sketch below):
– The CAS operation takes 3 arguments: a memory location to read from (called X), an old
value (v_old) and a new value (v_new)
– If X == v_old, then set X := v_new
– If X ≠ v_old, then return an error and don't change the value in X
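A minimal sketch of compare-and-set on a single in-process register; a lock stands in for the atomicity that a linearizable store would provide.

import threading

class Register:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_set(self, v_old, v_new):
        with self._lock:              # the comparison and the update happen atomically
            if self.value == v_old:
                self.value = v_new
                return True
            return False              # someone else changed the value first

x = Register(10)
print(x.compare_and_set(10, 20))   # True: value was 10, now 20
print(x.compare_and_set(10, 30))   # False: value is 20, so the CAS fails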
• Test for linearizable behaviour: record the timings of all requests and responses and check whether
a valid sequential ordering can be constructed from them.
• In synchronous mode, single leader replication is linearizable.
• Consensus algorithms implement measures to avoid stale replicas, and implement safe linearizable
storage. (e.g.: ZooKeeper)
• Multi-leader replication is not linearizable; leaderless replication is probably not linearizable either
• The modern CAP goal is to maximize combinations of consistency and availability that make
sense for the specific application, while incorporating plans for unavailability and recovery of failed
partitions.
6.3.1 Phase 1
• The coordinator places the record Prepare T in its log; the Prepare T message is then sent to all the sites.
• Each site that receives the message decides whether to commit its component of transaction T
or to abort it.
• A site that wants to commit enters the pre-commit stage (in this state the site can no longer abort
the transaction)
• The site takes the necessary actions to ensure that its component of T will not be aborted, then
writes the log message Ready T.
• Once the log is stored on disk at the site, the site sends the Ready T message back to the
coordinator
• A site that doesn’t want to commit sends the message Don’t Commit T back to the coordinator
6.3.2 Phase 2
• If Coordinator gets Ready T from all the sites, it logs the message Commit T and sends it to
all the sites
• If the coordinator has received Don't Commit T from one or more sites, it logs Abort T at its
site and then sends Abort T messages to all sites involved in T
• If a site receives a Commit T message, it commits the component of T at that site, logging
Commit T as it does
• If a site receives the message Abort T, it aborts T and writes the log record Abort T
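A minimal sketch of the two phases from the coordinator's point of view; the site objects and their prepare/commit/abort methods are hypothetical stand-ins for messages to the participating sites.

# Sketch of two-phase commit as seen by the coordinator (site objects are hypothetical).
def two_phase_commit(coordinator_log, sites, txn_id):
    # Phase 1: ask every site to prepare (pre-commit) its component of the transaction.
    coordinator_log.append(f"Prepare {txn_id}")
    votes = [site.prepare(txn_id) for site in sites]   # each site logs "Ready T" or votes no

    # Phase 2: commit only if every site voted to commit, otherwise abort everywhere.
    if all(votes):
        coordinator_log.append(f"Commit {txn_id}")
        for site in sites:
            site.commit(txn_id)
    else:
        coordinator_log.append(f"Abort {txn_id}")
        for site in sites:
            site.abort(txn_id)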