Great Ideas in Computer Architecture (a.k.a. Machine Structures)
Dependability
Dan Garcia, UC Berkeley Teaching Professor
Bora Nikolić, UC Berkeley Professor
cs61c.org
6 Great Ideas in Computer Architecture
1. Abstraction (Layers of Representation/Interpretation)
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Computers Fail…
May fail transiently…
…or permanently
We will discuss hardware failures
and methods to mitigate them
Great Idea #6: Dependability via Redundancy
Redundancy so that a failing piece doesn’t
make the whole system fail
[Figure: three units each compute 1+1; two answer 2, one fails with 1, and the “2 of 3 agree” vote masks the failure]
Increasing transistor density reduces the cost of redundancy
Great Idea #6: Dependability via Redundancy
Applies to everything from datacenters to
storage to memory to instructors
Redundant datacenters so that we can lose one datacenter but the Internet service stays online
Redundant disks so that we can lose one disk without losing data (Redundant Arrays of Independent Disks/RAID)
Redundant memory bits so that we can lose one bit without losing data (Error Correcting Code/ECC memory)
Dependability
Fault: failure of a component
May or may not lead to system failure
[Figure: service alternates between two states: “Service accomplishment” (service delivered as specified) and “Service interruption” (deviation from specified service); a failure moves the system from accomplishment to interruption, and a restoration moves it back]
Dependability via Redundancy: Time vs. Space
Spatial Redundancy – replicated data, check information, or hardware to handle both hard and soft (transient) failures
Temporal Redundancy – redundancy in time
(retry) to handle soft (transient) failures
Dependability Measures
Reliability: Mean Time To Failure (MTTF)
Service interruption: Mean Time To Repair (MTTR)
Mean time between failures (MTBF)
MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
Improving Availability
Increase MTTF: More reliable hardware/software
+ Fault Tolerance
Reduce MTTR: improved tools and processes for diagnosis
and repair
Availability Measures
Availability = MTTF / (MTTF + MTTR) as %
MTTF, MTBF usually measured in hours
Since we hope systems are rarely down, the shorthand is
“number of 9s of availability per year”
1 nine: 90% => 36 days of repair/year
2 nines: 99% => 3.6 days of repair/year
3 nines: 99.9% => 526 minutes of repair/year
4 nines: 99.99% => 53 minutes of repair/year
5 nines: 99.999% => 5 minutes of repair/year
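The nines table is just the availability formula applied to a year of operation; a minimal C sketch (not from the slides) that reproduces it:

```c
// Minimal sketch: downtime per year implied by "k nines" of availability,
// reproducing the table above.
#include <stdio.h>
#include <math.h>

int main(void) {
    const double minutes_per_year = 365.0 * 24.0 * 60.0;
    for (int nines = 1; nines <= 5; nines++) {
        double availability = 1.0 - pow(10.0, -nines);   // e.g. 3 nines -> 0.999
        double downtime_min = (1.0 - availability) * minutes_per_year;
        printf("%d nines: %.3f%% -> %.1f minutes (%.2f days) of repair/year\n",
               nines, 100.0 * availability, downtime_min, downtime_min / (24 * 60));
    }
    return 0;
}
```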
Reliability Measures
Another measure is the average number of failures per year: the Annualized Failure Rate (AFR)
E.g., 1000 disks with 100,000 hour MTTF
365 days * 24 hours = 8760 hours
(1000 disks * 8760 hrs/year) / 100,000
= 87.6 failed disks per year on average
87.6/1000 = 8.76% annual failure rate
Google’s 2007 study* found that actual AFRs
for individual drives ranged from 1.7% for first
year drives to over 8.6% for three-year old
drives
*research.google.com/archive/disk_failures.pdf
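A minimal C sketch (not part of the lecture) of the AFR arithmetic above, using the same 1000-disk, 100,000-hour-MTTF example:

```c
// Minimal sketch of the AFR arithmetic: expected failures per year for a
// population of disks, given a per-disk MTTF in hours.
#include <stdio.h>

int main(void) {
    const double hours_per_year = 365.0 * 24.0;   // 8760
    const double mttf_hours     = 100000.0;       // per-disk MTTF
    const double num_disks      = 1000.0;

    double failures_per_year = (num_disks * hours_per_year) / mttf_hours;
    double afr = failures_per_year / num_disks;    // annualized failure rate

    printf("%.1f failed disks/year, AFR = %.2f%%\n", failures_per_year, 100.0 * afr);
    return 0;   // prints 87.6 failed disks/year, AFR = 8.76%
}
```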
Hard Drive Failures
[Figure: annualized hard-drive failure rates by drive age]
Failures In Time (FIT) Rate
The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10^9) device-hours of operation
Or 1000 devices for 1 million hours,
1 million devices for 1000 hours each
MTBF = 1,000,000,000 x 1/FIT
Relevant: the Automotive Safety Integrity Level (ASIL) standard defines FIT rates for different classes of components in vehicles
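A small C sketch (my addition, with an assumed example FIT value of 100) showing the FIT-to-MTBF conversion above and the annual failure rate it implies:

```c
// Minimal sketch of the FIT/MTBF relationship: FIT is failures per 10^9
// device-hours, so MTBF (hours) = 10^9 / FIT.
#include <stdio.h>

int main(void) {
    double fit = 100.0;                        // assumed example: 100 failures per 10^9 device-hours
    double mtbf_hours = 1e9 / fit;             // 10,000,000 hours
    double afr = (365.0 * 24.0) / mtbf_hours;  // fraction of devices failing per year
    printf("FIT=%.0f -> MTBF=%.0f hours, AFR=%.4f%%\n", fit, mtbf_hours, 100.0 * afr);
    return 0;
}
```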
Dependability Design Principle
Design Principle: No single points of failure
“Chain is only as strong as its weakest link”
Dependability corollary of Amdahl’s Law:
It doesn’t matter how dependable you make one portion of the system
Dependability is limited by the part you do not improve
Error Detection/Correction Codes
Memory systems generate errors
(accidentally flipped bits)
DRAMs store very little charge per bit
“Soft” errors occur occasionally when cells are struck by alpha
particles or other environmental upsets
“Hard” errors can occur when chips permanently fail
Problem gets worse as memories get denser and larger
Memories protected against soft errors with EDC/ECC
Extra bits are added to each data-word
Used to detect and/or correct faults in the memory system
Each data word value mapped to unique code word
A fault changes valid code word to invalid one, which can be
detected
Block Code Principles
Hamming distance = # of bit positions in which two words differ
p = 011011, q = 001111, Ham. distance (p,q) = 2
p = 011011,
q = 110001,
distance (p,q) = ?
Can think of the extra bits as creating a code with the data
What if the minimum distance between codewords is 2 and we get a 1-bit error?
[Photo: Richard Hamming, 1915-98, Turing Award winner]
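As a small illustration (not from the slides), Hamming distance can be computed by XOR-ing the two words and counting the 1 bits; a C sketch using the example values above:

```c
// Minimal sketch: Hamming distance = number of differing bit positions,
// found by XOR-ing the words and counting the 1 bits in the result.
#include <stdio.h>

int hamming_distance(unsigned int p, unsigned int q) {
    unsigned int diff = p ^ q;   // 1 wherever the bits differ
    int count = 0;
    while (diff) {
        count += diff & 1;
        diff >>= 1;
    }
    return count;
}

int main(void) {
    printf("%d\n", hamming_distance(0x1B, 0x0F));  // 011011 vs 001111 -> 2
    printf("%d\n", hamming_distance(0x1B, 0x31));  // 011011 vs 110001 -> answers the question above
    return 0;
}
```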
Parity: Simple Error-Detection Coding
Each data value, before it is written to memory, is “tagged” with an extra bit to force the stored word to have even parity.
Each word, as it is read from memory, is “checked” by finding its parity (including the parity bit).
[Diagram: on write, data bits b7…b0 are XORed (+) to produce the parity bit p; on read, b7…b0 and p are XORed (+) to produce the check result c]
Minimum Hamming distance of parity code is 2
A non-zero parity check indicates an error occurred:
2 errors (on different bits) are not detected, nor is any even number of errors; only odd numbers of errors are detected
Parity Example
Write to memory:
Data 0101 0101 has 4 ones, so even parity now
Write 0101 0101 0 to keep parity even
Data 0101 0111 has 5 ones, so odd parity now
Write 0101 0111 1 to make parity even
Read from memory:
0101 0101 0: 4 ones => even parity, so no error
1101 0101 0: 5 ones => odd parity, so error
What if the error is in the parity bit?
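A minimal C sketch (my addition) of the even-parity scheme above, reproducing the 0101 0101 write and the erroneous 1101 0101 0 read; parity8 is a hypothetical helper name:

```c
// Minimal sketch of even parity: the tag bit written with the data forces an
// even number of 1s; odd overall parity on read signals a (single-bit) error.
#include <stdio.h>
#include <stdint.h>

// Parity of a byte: 1 if the number of 1 bits is odd, 0 if even.
int parity8(uint8_t b) {
    int p = 0;
    for (int i = 0; i < 8; i++) p ^= (b >> i) & 1;
    return p;
}

int main(void) {
    uint8_t data = 0x55;                  // 0101 0101, four 1s
    int tag = parity8(data);              // 0, so stored word 0101 0101 0 has even parity

    uint8_t read_back = 0xD5;             // 1101 0101: one bit flipped in memory
    int check = parity8(read_back) ^ tag; // non-zero parity check => error detected
    printf("parity check = %d (%s)\n", check, check ? "error" : "ok");
    return 0;
}
```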
Suppose Want to Correct One Error?
Hamming came up with a simple-to-understand mapping that allows error correction at a minimum distance of three
Single error correction, double error detection
Called “Hamming ECC”
Worked weekends on relay computer with unreliable
card reader, frustrated with manual restarting
Got interested in error correction; published 1950
R. W. Hamming, “Error Detecting and Correcting
Codes,” The Bell System Technical Journal, Vol. XXVI,
No 2 (April 1950) pp 147-160.
Detecting/Correcting Code Concept
Space of possible bit patterns (2^N)
Sparse population of code words (2^M << 2^N), each with an identifiable signature
Error changes a bit pattern to a non-codeword
Detection: bit pattern fails codeword check
Correction: map to nearest valid code word
Hamming Distance: Eight Code Words
[Figure: the eight 3-bit patterns drawn as corners of a cube; edges connect patterns one bit apart]
Hamming Distance 2: Detection
Detect single-bit errors
• No 1-bit error goes to another valid codeword; a 1-bit error always lands on an invalid codeword
• ½ of the codewords are valid
Hamming Distance 3: Correction
Correct single-bit errors
• 1-bit errors land near valid codewords (one bit away from 000 or one bit away from 111)
• ¼ of the codewords are valid
Hamming ECC
Interleave data and parity bits
Place parity bits at binary positions 1, 10, 100, etc
p1 covers all positions with LSB = 1
p2 covers all positions whose next address bit (above the LSB) = 1, etc.
Can continue indefinitely
Bit position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Encoded data bits p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
Parity bit coverage:
p1 covers positions 1, 3, 5, 7, 9, 11, …
p2 covers positions 2, 3, 6, 7, 10, 11, …
p4 covers positions 4-7, 12-15, 20-23, …
p8 covers positions 8-15, 24-31, …
p16 covers positions 16-31, …
Hamming ECC
Set parity bits to create even parity for each
group
A byte of data: 10011010
Create the coded word, leaving spaces for the
parity bits:
_ _ 1 _ 0 0 1 _ 1 0 1 0
1 2 3 4 5 6 7 8 9 a b c – bit position (a = 10, b = 11, c = 12)
Calculate the parity bits
Hamming ECC
Position 1 checks bits 1,3,5,7,9,11:
? _ 1 _ 0 0 1 _ 1 0 1 0. Set position 1 to a 0: 0 _ 1 _ 0 0 1 _ 1 0 1 0
Position 2 checks bits 2,3,6,7,10,11:
0 ? 1 _ 0 0 1 _ 1 0 1 0. Set position 2 to a 1: 0 1 1 _ 0 0 1 _ 1 0 1 0
Position 4 checks bits 4,5,6,7,12:
0 1 1 ? 0 0 1 _ 1 0 1 0. Set position 4 to a 1: 0 1 1 1 0 0 1 _ 1 0 1 0
Position 8 checks bits 8,9,10,11,12:
0 1 1 1 0 0 1 ? 1 0 1 0. Set position 8 to a 0: 0 1 1 1 0 0 1 0 1 0 1 0
Hamming ECC
Final code word: 011100101010
Data word: 1 001 1010
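A hedged C sketch (not from the lecture) of the same Hamming(12,8) encoding: data bits in positions 3,5,6,7,9,10,11,12 and even-parity bits in positions 1,2,4,8; it reproduces the code word above for the data byte 1001 1010:

```c
// Minimal sketch of the encoding above. code[1..12] holds the bits (index 0
// unused), with the data byte placed MSB first, matching the slide's ordering.
#include <stdio.h>
#include <stdint.h>

void hamming_encode(uint8_t data, int code[13]) {
    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    for (int i = 0; i < 8; i++)                  // place data bits, MSB first
        code[data_pos[i]] = (data >> (7 - i)) & 1;

    for (int p = 1; p <= 8; p <<= 1) {           // parity positions 1, 2, 4, 8
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos != p && (pos & p)) parity ^= code[pos];
        code[p] = parity;                        // force even parity per group
    }
}

int main(void) {
    int code[13];
    hamming_encode(0x9A, code);                  // 1001 1010 from the slide
    for (int pos = 1; pos <= 12; pos++) printf("%d", code[pos]);
    printf("\n");                                // prints 011100101010
    return 0;
}
```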
Hamming ECC
Suppose receive
011100101110
0 1 1 1 0 0 1 0 1 1 1 0
Hamming ECC Error Check
Suppose receive
011100101110
Parity group 1 (bits 1,3,5,7,9,11): 0 1 0 1 1 1 √
Parity group 2 (bits 2,3,6,7,10,11): 1 1 0 1 1 1 X (parity 2 in error)
Parity group 4 (bits 4,5,6,7,12): 1 0 0 1 0 √
Parity group 8 (bits 8,9,10,11,12): 0 1 1 1 0 X (parity 8 in error)
Implies position 8 + 2 = 10 is in error:
011100101110
Hamming ECC Error Correct
Flip the incorrect bit …
011100101010
Hamming ECC Error Correct
Check the corrected word:
011100101010
Parity group 1 (bits 1,3,5,7,9,11): 0 1 0 1 1 1 √
Parity group 2 (bits 2,3,6,7,10,11): 1 1 0 1 0 1 √
Parity group 4 (bits 4,5,6,7,12): 1 0 0 1 0 √
Parity group 8 (bits 8,9,10,11,12): 0 1 0 1 0 √
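A matching C sketch (my addition) of the check-and-correct steps: recomputing each parity group over the received word gives a syndrome equal to the flipped bit's position:

```c
// Minimal sketch of the check/correct steps above: the XOR of the failing
// parity positions (the "syndrome") is the position of the flipped bit.
#include <stdio.h>

int main(void) {
    // received word 011100101110 from the slide, in code[1..12]
    int code[13] = {0, 0,1,1,1,0,0,1,0,1,1,1,0};

    int syndrome = 0;
    for (int p = 1; p <= 8; p <<= 1) {        // parity groups 1, 2, 4, 8
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p) parity ^= code[pos]; // includes the parity bit itself
        if (parity) syndrome += p;            // odd group => this check fails
    }

    if (syndrome) {
        printf("error at position %d, correcting\n", syndrome);  // prints 10
        code[syndrome] ^= 1;                  // flip the bad bit back
    }
    for (int pos = 1; pos <= 12; pos++) printf("%d", code[pos]);
    printf("\n");                             // prints 011100101010
    return 0;
}
```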
What if More Than 2-Bit Errors?
Use double-error correction, triple-error
detection (DECTED)
For network transmissions, disks, and distributed storage, a common failure mode is bursts of bit errors, not just one- or two-bit errors
Contiguous sequence of B bits in which first, last and any
number of intermediate bits are in error
Caused by impulse noise or by fading in wireless
Effect is greater at higher data rates
Solve with Cyclic Redundancy Check (CRC),
interleaving or other more advanced codes
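As a hedged illustration, one widely used CRC is the reflected CRC-32 with polynomial 0xEDB88320; this particular variant is my choice, not something specified in the lecture. An n-bit CRC detects any single error burst no longer than n bits. A bit-at-a-time C sketch:

```c
// Minimal sketch of a CRC: the common reflected CRC-32 (polynomial 0xEDB88320),
// computed one bit at a time rather than with a lookup table.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

uint32_t crc32(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)   // process one bit per iteration
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

int main(void) {
    const char *msg = "123456789";
    printf("%08X\n", (unsigned)crc32((const uint8_t *)msg, strlen(msg)));  // CBF43926
    return 0;
}
```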
RAID: Redundant Arrays of (Inexpensive) Disks
Data is stored across multiple disks
Files are "striped" across multiple disks
Redundancy yields high data availability
Availability: service still provided to user, even if
some components failed
Disks will still fail
Contents reconstructed from data
redundantly stored in the array
− Capacity penalty to store redundant info
− Bandwidth penalty to update redundant info
Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing
[Figure: each disk and its mirror form a recovery group]
• Each disk is fully duplicated onto its “mirror”
Very high availability can be achieved
• Writes limited by single-disk speed
• Reads may be optimized
Most expensive solution: 100% capacity overhead
RAID 3: Parity Disk
[Figure: a logical record (e.g. 10010011 11001101 …) is striped as physical records, bit by bit, across the data disks, with a parity disk P per stripe]
P contains the sum of the other disks per stripe, mod 2 (“parity”)
If a disk fails, subtract P from the sum of the other disks to find the missing information
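A minimal C sketch (my addition, with made-up stripe contents and an assumed 4-data-disk layout) of the parity idea above: P is the XOR (sum mod 2) of the data blocks, so a failed disk is recovered by XOR-ing P with the survivors:

```c
// Minimal sketch of RAID parity: P = XOR of the data disks per stripe;
// a failed disk's contents = P XOR (all surviving data disks).
#include <stdio.h>
#include <stdint.h>

#define NDISKS 4   /* assumed number of data disks for this example */

int main(void) {
    uint8_t stripe[NDISKS] = {0x93, 0xCD, 0x93, 0x0F};  // one byte per data disk (made up)

    uint8_t P = 0;
    for (int d = 0; d < NDISKS; d++) P ^= stripe[d];    // parity disk contents

    int failed = 2;                                     // pretend disk 2 died
    uint8_t recovered = P;
    for (int d = 0; d < NDISKS; d++)
        if (d != failed) recovered ^= stripe[d];        // XOR of P and survivors

    printf("recovered 0x%02X (original 0x%02X)\n", recovered, stripe[failed]);
    return 0;
}
```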
RAID 4: High I/O Rate Parity
[Figure: insides of 5 disks, with logical disk addresses increasing down the columns]
D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
.    .    .    .    .
Example: small read of D0 & D5; large write of stripe D12-D15
Inspiration for RAID 5
RAID 4 works well for small reads
Small writes (write to one disk):
Option 1: read other data disks, create new sum and
write to Parity Disk
Option 2: since P has old sum, compare old data to
new data, add the difference to P
Small writes are limited by the Parity Disk: writes to D0 and D5 both also write to the P disk
D0  D1  D2  D3  P
D4  D5  D6  D7  P
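A one-line illustration (my addition, with made-up block values) of Option 2 above: the new parity is the old parity XOR the old data XOR the new data, so a small write needs only the old data block and the parity block:

```c
// Minimal sketch of "Option 2": P_new = P_old XOR D_old XOR D_new,
// updating parity without reading the other data disks.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t d_old = 0x3C, d_new = 0x5A;      // old and new contents of one data block (made up)
    uint8_t p_old = 0xA7;                    // current parity block for that stripe (made up)

    uint8_t p_new = p_old ^ d_old ^ d_new;   // fold the difference into the parity

    printf("P_old=0x%02X  P_new=0x%02X\n", p_old, p_new);
    return 0;
}
```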
RAID 5: High I/O Rate Interleaved Parity
Independent writes possible because of interleaved parity
[Figure: insides of 5 disks, with logical disk addresses increasing down the columns; the parity block rotates across the disks]
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
.    .    .    .    .
Example: writes to D0 and D5 use disks 0, 1, 3, 4
“And in Conclusion…”
Great Idea: Redundancy to Get Dependability
Spatial (extra hardware) and Temporal (retry if error)
Reliability: MTTF, Annualized Failure Rate (AFR), and FIT
Availability: % uptime (MTTF/(MTTF+MTTR))
Memory
Hamming distance 2: Parity for Single Error Detect
Hamming distance 3: Single Error Correction Code + encode
bit position of error
Treat disks like memory, except you know when a disk has failed (an erasure), which makes parity an Error Correcting Code
RAID-2, -3, -4, -5 (and -6, -10): Interleaved data and
parity