Great Ideas in Computer Architecture (a.k.a. Machine Structures)
Dependability
Dan Garcia, UC Berkeley Teaching Professor
Bora Nikolić, UC Berkeley Professor
cs61c.org
6 Great Ideas in Computer Architecture
1. Abstraction (Layers of Representation/Interpretation)
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Computers Fail…
May fail transiently…
…or permanently
We will discuss hardware failures
and methods to mitigate them
Great Idea #6: Dependability via Redundancy
Redundancy so that a failing piece doesn’t
make the whole system fail
[Figure: three units each compute 1+1; two answer 2, one fails with 1, and the “2 of 3 agree” vote masks the failure]
Increasing transistor density reduces the cost of redundancy
Great Idea #6: Dependability via Redundancy
Applies to everything from datacenters to
storage to memory to instructors
Redundant datacenters so that we can lose one datacenter but the Internet service stays online
Redundant disks so that we can lose one disk without losing data (Redundant Arrays of Independent Disks/RAID)
Redundant memory bits so that we can lose one bit without losing data (Error Correcting Code/ECC memory)
Dependability
Fault: failure of a component
May or may not lead to system failure
[Figure: service alternates between two states: “Service accomplishment” (service delivered as specified) and “Service interruption” (deviation from specified service); a failure moves the system from accomplishment to interruption, and a restoration moves it back]
Dependability via Redundancy: Time vs. Space
Spatial Redundancy – replicated data, check information, or hardware to handle both hard and soft (transient) failures
Temporal Redundancy – redundancy in time
(retry) to handle soft (transient) failures
Dependability Measures
Reliability: Mean Time To Failure (MTTF)
Service interruption: Mean Time To Repair (MTTR)
Mean time between failures (MTBF)
MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
Improving Availability
Increase MTTF: More reliable hardware/software
+ Fault Tolerance
Reduce MTTR: improved tools and processes for diagnosis
and repair
Availability Measures
Availability = MTTF / (MTTF + MTTR) as %
MTTF, MTBF usually measured in hours
Since we hope systems are rarely down, the shorthand is
“number of 9s of availability per year”
1 nine: 90% => 36 days of repair/year
2 nines: 99% => 3.6 days of repair/year
3 nines: 99.9% => 526 minutes of repair/year
4 nines: 99.99% => 53 minutes of repair/year
5 nines: 99.999% => 5 minutes of repair/year
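The nines table is just the availability formula applied to a year of operation; a minimal C sketch (not from the slides) that reproduces it:

```c
// Minimal sketch: downtime per year implied by "k nines" of availability,
// reproducing the table above.
#include <stdio.h>
#include <math.h>

int main(void) {
    const double minutes_per_year = 365.0 * 24.0 * 60.0;
    for (int nines = 1; nines <= 5; nines++) {
        double availability = 1.0 - pow(10.0, -nines);   // e.g. 3 nines -> 0.999
        double downtime_min = (1.0 - availability) * minutes_per_year;
        printf("%d nines: %.3f%% -> %.1f minutes (%.2f days) of repair/year\n",
               nines, 100.0 * availability, downtime_min, downtime_min / (24 * 60));
    }
    return 0;
}
```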
Reliability Measures
Another measure is the average number of failures per year: the Annualized Failure Rate (AFR)
E.g., 1000 disks with 100,000 hour MTTF
365 days * 24 hours = 8760 hours
(1000 disks * 8760 hrs/year) / 100,000
= 87.6 failed disks per year on average
87.6/1000 = 8.76% annual failure rate
Google’s 2007 study* found that actual AFRs
for individual drives ranged from 1.7% for first
year drives to over 8.6% for three-year old
drives
*research.google.com/archive/disk_failures.pdf
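A minimal C sketch (not part of the lecture) of the AFR arithmetic above, using the same 1000-disk, 100,000-hour-MTTF example:

```c
// Minimal sketch of the AFR arithmetic: expected failures per year for a
// population of disks, given a per-disk MTTF in hours.
#include <stdio.h>

int main(void) {
    const double hours_per_year = 365.0 * 24.0;   // 8760
    const double mttf_hours     = 100000.0;       // per-disk MTTF
    const double num_disks      = 1000.0;

    double failures_per_year = (num_disks * hours_per_year) / mttf_hours;
    double afr = failures_per_year / num_disks;    // annualized failure rate

    printf("%.1f failed disks/year, AFR = %.2f%%\n", failures_per_year, 100.0 * afr);
    return 0;   // prints 87.6 failed disks/year, AFR = 8.76%
}
```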
Hard Drive Failures
[Figure: annualized hard-drive failure rates by drive age]
Failures In Time (FIT) Rate
The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10^9) device-hours of operation
Or 1000 devices for 1 million hours,
1 million devices for 1000 hours each
MTBF = 1,000,000,000 x 1/FIT
Relevant: the Automotive Safety Integrity Level (ASIL) standard defines FIT rates for different classes of components in vehicles
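A small C sketch (my addition, with an assumed example FIT value of 100) showing the FIT-to-MTBF conversion above and the annual failure rate it implies:

```c
// Minimal sketch of the FIT/MTBF relationship: FIT is failures per 10^9
// device-hours, so MTBF (hours) = 10^9 / FIT.
#include <stdio.h>

int main(void) {
    double fit = 100.0;                        // assumed example: 100 failures per 10^9 device-hours
    double mtbf_hours = 1e9 / fit;             // 10,000,000 hours
    double afr = (365.0 * 24.0) / mtbf_hours;  // fraction of devices failing per year
    printf("FIT=%.0f -> MTBF=%.0f hours, AFR=%.4f%%\n", fit, mtbf_hours, 100.0 * afr);
    return 0;
}
```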
Dependability Design Principle
Design Principle: No single points of failure
“Chain is only as strong as its weakest link”
Dependability corollary of Amdahl’s Law:
It doesn’t matter how dependable you make one portion of the system
Dependability is limited by the part you do not improve
Error Detection/Correction Codes
Memory systems generate errors
(accidentally flipped bits)
DRAMs store very little charge per bit
“Soft” errors occur occasionally when cells are struck by alpha
particles or other environmental upsets
“Hard” errors can occur when chips permanently fail
Problem gets worse as memories get denser and larger
Memories protected against soft errors with EDC/ECC
Extra bits are added to each data-word
Used to detect and/or correct faults in the memory system
Each data word value mapped to unique code word
A fault changes valid code word to invalid one, which can be
detected
Block Code Principles
Hamming distance = # of bit positions in which two words differ
p = 011011, q = 001111, Ham. distance (p,q) = 2
p = 011011,
q = 110001,
distance (p,q) = ?
Can think of the extra bits as creating a code with the data
What if the minimum distance between codewords is 2 and we get a 1-bit error?
[Photo: Richard Hamming, 1915-98, Turing Award winner]
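As a small illustration (not from the slides), Hamming distance can be computed by XOR-ing the two words and counting the 1 bits; a C sketch using the example values above:

```c
// Minimal sketch: Hamming distance = number of differing bit positions,
// found by XOR-ing the words and counting the 1 bits in the result.
#include <stdio.h>

int hamming_distance(unsigned int p, unsigned int q) {
    unsigned int diff = p ^ q;   // 1 wherever the bits differ
    int count = 0;
    while (diff) {
        count += diff & 1;
        diff >>= 1;
    }
    return count;
}

int main(void) {
    printf("%d\n", hamming_distance(0x1B, 0x0F));  // 011011 vs 001111 -> 2
    printf("%d\n", hamming_distance(0x1B, 0x31));  // 011011 vs 110001 -> answers the question above
    return 0;
}
```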
Parity: Simple Error-Detection Coding
Each data value, before it is written to memory, is “tagged” with an extra bit to force the stored word to have even parity.
Each word, as it is read from memory, is “checked” by finding its parity (including the parity bit).
[Diagram: on write, data bits b7…b0 are XORed (+) to produce the parity bit p; on read, b7…b0 and p are XORed (+) to produce the check result c]
Minimum Hamming distance of parity code is 2
A non-zero parity check indicates an error occurred:
2 errors (on different bits) are not detected, nor is any even number of errors; only odd numbers of errors are detected
Parity Example
Write to memory:
Data 0101 0101 has 4 ones, so even parity now
Write 0101 0101 0 to keep parity even
Data 0101 0111 has 5 ones, so odd parity now
Write 0101 0111 1 to make parity even
Read from memory:
0101 0101 0: 4 ones => even parity, so no error
1101 0101 0: 5 ones => odd parity, so error
What if the error is in the parity bit?
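A minimal C sketch (my addition) of the even-parity scheme above, reproducing the 0101 0101 write and the erroneous 1101 0101 0 read; parity8 is a hypothetical helper name:

```c
// Minimal sketch of even parity: the tag bit written with the data forces an
// even number of 1s; odd overall parity on read signals a (single-bit) error.
#include <stdio.h>
#include <stdint.h>

// Parity of a byte: 1 if the number of 1 bits is odd, 0 if even.
int parity8(uint8_t b) {
    int p = 0;
    for (int i = 0; i < 8; i++) p ^= (b >> i) & 1;
    return p;
}

int main(void) {
    uint8_t data = 0x55;                  // 0101 0101, four 1s
    int tag = parity8(data);              // 0, so stored word 0101 0101 0 has even parity

    uint8_t read_back = 0xD5;             // 1101 0101: one bit flipped in memory
    int check = parity8(read_back) ^ tag; // non-zero parity check => error detected
    printf("parity check = %d (%s)\n", check, check ? "error" : "ok");
    return 0;
}
```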
Suppose Want to Correct One Error?
Hamming came up with a simple-to-understand mapping that allows error correction at a minimum distance of three
Single error correction, double error detection
Called “Hamming ECC”
Worked weekends on relay computer with unreliable
card reader, frustrated with manual restarting
Got interested in error correction; published 1950
R. W. Hamming, “Error Detecting and Correcting
Codes,” The Bell System Technical Journal, Vol. XXVI,
No 2 (April 1950) pp 147-160.
Detecting/Correcting Code Concept
Space of possible bit patterns (2^N)
Sparse population of code words (2^M << 2^N), each with an identifiable signature
Error changes a bit pattern to a non-codeword
Detection: bit pattern fails codeword check
Correction: map to nearest valid code word
Hamming Distance: Eight Code Words
[Figure: the eight 3-bit patterns drawn as corners of a cube; edges connect patterns one bit apart]
Hamming Distance 2: Detection
Detect single-bit errors
• No 1-bit error goes to another valid codeword; a 1-bit error always lands on an invalid codeword
• ½ of the codewords are valid
Hamming Distance 3: Correction
Correct single-bit errors
• 1-bit errors land near valid codewords (one bit away from 000 or one bit away from 111)
• ¼ of the codewords are valid
Hamming ECC
Interleave data and parity bits
Place parity bits at binary positions 1, 10, 100, etc
p1 covers all positions with LSB = 1
p2 covers all positions whose next address bit (above the LSB) = 1, etc.
Can continue indefinitely
Bit position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Encoded data bits p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
Parity bit coverage:
p1 covers positions 1, 3, 5, 7, 9, 11, …
p2 covers positions 2, 3, 6, 7, 10, 11, …
p4 covers positions 4-7, 12-15, 20-23, …
p8 covers positions 8-15, 24-31, …
p16 covers positions 16-31, …
Hamming ECC
Set parity bits to create even parity for each
group
A byte of data: 10011010
Create the coded word, leaving spaces for the
parity bits:
_ _ 1 _ 0 0 1 _ 1 0 1 0
1 2 3 4 5 6 7 8 9 a b c – bit position (a = 10, b = 11, c = 12)
Calculate the parity bits
Hamming ECC
Position 1 checks bits 1,3,5,7,9,11:
? _ 1 _ 0 0 1 _ 1 0 1 0. Set position 1 to a 0: 0 _ 1 _ 0 0 1 _ 1 0 1 0
Position 2 checks bits 2,3,6,7,10,11:
0 ? 1 _ 0 0 1 _ 1 0 1 0. Set position 2 to a 1: 0 1 1 _ 0 0 1 _ 1 0 1 0
Position 4 checks bits 4,5,6,7,12:
0 1 1 ? 0 0 1 _ 1 0 1 0. Set position 4 to a 1: 0 1 1 1 0 0 1 _ 1 0 1 0
Position 8 checks bits 8,9,10,11,12:
0 1 1 1 0 0 1 ? 1 0 1 0. Set position 8 to a 0: 0 1 1 1 0 0 1 0 1 0 1 0
Hamming ECC
Final code word: 011100101010
Data word: 1 001 1010
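A hedged C sketch (not from the lecture) of the same Hamming(12,8) encoding: data bits in positions 3,5,6,7,9,10,11,12 and even-parity bits in positions 1,2,4,8; it reproduces the code word above for the data byte 1001 1010:

```c
// Minimal sketch of the encoding above. code[1..12] holds the bits (index 0
// unused), with the data byte placed MSB first, matching the slide's ordering.
#include <stdio.h>
#include <stdint.h>

void hamming_encode(uint8_t data, int code[13]) {
    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    for (int i = 0; i < 8; i++)                  // place data bits, MSB first
        code[data_pos[i]] = (data >> (7 - i)) & 1;

    for (int p = 1; p <= 8; p <<= 1) {           // parity positions 1, 2, 4, 8
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos != p && (pos & p)) parity ^= code[pos];
        code[p] = parity;                        // force even parity per group
    }
}

int main(void) {
    int code[13];
    hamming_encode(0x9A, code);                  // 1001 1010 from the slide
    for (int pos = 1; pos <= 12; pos++) printf("%d", code[pos]);
    printf("\n");                                // prints 011100101010
    return 0;
}
```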
Hamming ECC
Suppose receive
011100101110
0 1 1 1 0 0 1 0 1 1 1 0
Hamming ECC Error Check
Suppose receive
011100101110
Parity group 1 (bits 1,3,5,7,9,11): 0 1 0 1 1 1 √
Parity group 2 (bits 2,3,6,7,10,11): 1 1 0 1 1 1 X (parity 2 in error)
Parity group 4 (bits 4,5,6,7,12): 1 0 0 1 0 √
Parity group 8 (bits 8,9,10,11,12): 0 1 1 1 0 X (parity 8 in error)
Implies position 8 + 2 = 10 is in error:
011100101110
Hamming ECC Error Correct
Flip the incorrect bit …
011100101010
Hamming ECC Error Correct
Check the corrected word:
011100101010
Parity group 1 (bits 1,3,5,7,9,11): 0 1 0 1 1 1 √
Parity group 2 (bits 2,3,6,7,10,11): 1 1 0 1 0 1 √
Parity group 4 (bits 4,5,6,7,12): 1 0 0 1 0 √
Parity group 8 (bits 8,9,10,11,12): 0 1 0 1 0 √
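A matching C sketch (my addition) of the check-and-correct steps: recomputing each parity group over the received word gives a syndrome equal to the flipped bit's position:

```c
// Minimal sketch of the check/correct steps above: the XOR of the failing
// parity positions (the "syndrome") is the position of the flipped bit.
#include <stdio.h>

int main(void) {
    // received word 011100101110 from the slide, in code[1..12]
    int code[13] = {0, 0,1,1,1,0,0,1,0,1,1,1,0};

    int syndrome = 0;
    for (int p = 1; p <= 8; p <<= 1) {        // parity groups 1, 2, 4, 8
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p) parity ^= code[pos]; // includes the parity bit itself
        if (parity) syndrome += p;            // odd group => this check fails
    }

    if (syndrome) {
        printf("error at position %d, correcting\n", syndrome);  // prints 10
        code[syndrome] ^= 1;                  // flip the bad bit back
    }
    for (int pos = 1; pos <= 12; pos++) printf("%d", code[pos]);
    printf("\n");                             // prints 011100101010
    return 0;
}
```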
What if More Than 2-Bit Errors?
Use double-error correction, triple-error
detection (DECTED)
For network transmissions, disks, and distributed storage, a common failure mode is bursts of bit errors, not just one- or two-bit errors
Contiguous sequence of B bits in which first, last and any
number of intermediate bits are in error
Caused by impulse noise or by fading in wireless
Effect is greater at higher data rates
Solve with Cyclic Redundancy Check (CRC),
interleaving or other more advanced codes
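As a hedged illustration, one widely used CRC is the reflected CRC-32 with polynomial 0xEDB88320; this particular variant is my choice, not something specified in the lecture. An n-bit CRC detects any single error burst no longer than n bits. A bit-at-a-time C sketch:

```c
// Minimal sketch of a CRC: the common reflected CRC-32 (polynomial 0xEDB88320),
// computed one bit at a time rather than with a lookup table.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

uint32_t crc32(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)   // process one bit per iteration
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

int main(void) {
    const char *msg = "123456789";
    printf("%08X\n", (unsigned)crc32((const uint8_t *)msg, strlen(msg)));  // CBF43926
    return 0;
}
```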
RAID: Redundant Arrays of (Inexpensive) Disks
Data is stored across multiple disks
Files are "striped" across multiple disks
Redundancy yields high data availability
Availability: service still provided to user, even if
some components failed
Disks will still fail
Contents reconstructed from data
redundantly stored in the array
− Capacity penalty to store redundant info
− Bandwidth penalty to update redundant info
Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing
[Figure: each disk and its mirror form a recovery group]
• Each disk is fully duplicated onto its “mirror”
Very high availability can be achieved
• Writes limited by single-disk speed
• Reads may be optimized
Most expensive solution: 100% capacity overhead
RAID 3: Parity Disk
[Figure: a logical record (e.g. 10010011 11001101 …) is striped as physical records, bit by bit, across the data disks, with a parity disk P per stripe]
P contains the sum of the other disks per stripe, mod 2 (“parity”)
If a disk fails, subtract P from the sum of the other disks to find the missing information
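A minimal C sketch (my addition, with made-up stripe contents and an assumed 4-data-disk layout) of the parity idea above: P is the XOR (sum mod 2) of the data blocks, so a failed disk is recovered by XOR-ing P with the survivors:

```c
// Minimal sketch of RAID parity: P = XOR of the data disks per stripe;
// a failed disk's contents = P XOR (all surviving data disks).
#include <stdio.h>
#include <stdint.h>

#define NDISKS 4   /* assumed number of data disks for this example */

int main(void) {
    uint8_t stripe[NDISKS] = {0x93, 0xCD, 0x93, 0x0F};  // one byte per data disk (made up)

    uint8_t P = 0;
    for (int d = 0; d < NDISKS; d++) P ^= stripe[d];    // parity disk contents

    int failed = 2;                                     // pretend disk 2 died
    uint8_t recovered = P;
    for (int d = 0; d < NDISKS; d++)
        if (d != failed) recovered ^= stripe[d];        // XOR of P and survivors

    printf("recovered 0x%02X (original 0x%02X)\n", recovered, stripe[failed]);
    return 0;
}
```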
RAID 4: High I/O Rate Parity
[Figure: insides of 5 disks, with logical disk addresses increasing down the columns]
D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
.    .    .    .    .
Example: small read of D0 & D5; large write of stripe D12-D15
Inspiration for RAID 5
RAID 4 works well for small reads
Small writes (write to one disk):
Option 1: read other data disks, create new sum and
write to Parity Disk
Option 2: since P has old sum, compare old data to
new data, add the difference to P
Small writes are limited by the Parity Disk: writes to D0 and D5 both also write to the P disk
D0  D1  D2  D3  P
D4  D5  D6  D7  P
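A one-line illustration (my addition, with made-up block values) of Option 2 above: the new parity is the old parity XOR the old data XOR the new data, so a small write needs only the old data block and the parity block:

```c
// Minimal sketch of "Option 2": P_new = P_old XOR D_old XOR D_new,
// updating parity without reading the other data disks.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t d_old = 0x3C, d_new = 0x5A;      // old and new contents of one data block (made up)
    uint8_t p_old = 0xA7;                    // current parity block for that stripe (made up)

    uint8_t p_new = p_old ^ d_old ^ d_new;   // fold the difference into the parity

    printf("P_old=0x%02X  P_new=0x%02X\n", p_old, p_new);
    return 0;
}
```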
RAID 5: High I/O Rate Interleaved Parity
Independent writes possible because of interleaved parity
[Figure: insides of 5 disks, with logical disk addresses increasing down the columns; the parity block rotates across the disks]
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
.    .    .    .    .
Example: writes to D0 and D5 use disks 0, 1, 3, 4
“And in Conclusion…”
Great Idea: Redundancy to Get Dependability
Spatial (extra hardware) and Temporal (retry if error)
Reliability: MTTF, Annualized Failure Rate (AFR), and FIT
Availability: % uptime (MTTF/(MTTF+MTTR))
Memory
Hamming distance 2: Parity for Single Error Detect
Hamming distance 3: Single Error Correction Code + encode
bit position of error
Treat disks like memory, except you know when a disk has failed (an erasure), which makes parity an Error Correcting Code
RAID-2, -3, -4, -5 (and -6, -10): Interleaved data and
parity