Fault Tolerant Design: An Introduction: Elena Dubrova
AN INTRODUCTION
ELENA DUBROVA
Department of Microelectronics and Information Technology
Royal Institute of Technology
Stockholm, Sweden
Acknowledgments
1. INTRODUCTION
1 Definition of fault tolerance
2 Fault tolerance and redundancy
3 Applications of fault-tolerance
2. FUNDAMENTALS OF DEPENDABILITY
1 Introduction
2 Dependability attributes
2.1 Reliability
2.2 Availability
2.3 Safety
3 Dependability impairments
3.1 Faults, errors and failures
3.2 Origins of faults
3.3 Common-mode faults
3.4 Hardware faults
3.4.1 Permanent and transient faults
3.4.2 Fault models
3.5 Software faults
4 Dependability means
4.1 Fault tolerance
4.2 Fault prevention
4.3 Fault removal
4.4 Fault forecasting
5 Problems
4.3 Pair-and-a-spare
5 Hybrid redundancy
5.1 Self-purging redundancy
5.1.1 Reliability evaluation
5.2 N-modular redundancy with spares
5.3 Triplex-duplex redundancy
6 Problems
5. INFORMATION REDUNDANCY
1 Introduction
2 Fundamental notions
2.1 Code
2.2 Encoding
2.3 Information rate
2.4 Decoding
2.5 Hamming distance
2.6 Code distance
2.7 Code efficiency
3 Parity codes
4 Linear codes
4.1 Basic notions
4.2 Definition of linear code
4.3 Generator matrix
4.4 Parity check matrix
4.5 Syndrome
4.6 Constructing linear codes
4.7 Hamming codes
4.8 Extended Hamming codes
5 Cyclic codes
5.1 Definition
5.2 Polynomial manipulation
5.3 Generator polynomial
5.4 Parity check polynomial
5.5 Syndrome polynomial
5.6 Implementation of polynomial division
5.7 Separable cyclic codes
5.8 CRC codes
5.9 Reed-Solomon codes
6 Unordered codes
6.1 M-of-n codes
6.2 Berger codes
7 Arithmetic codes
7.1 AN-codes
7.2 Residue codes
8 Problems
6. TIME REDUNDANCY
1 Introduction
2 Alternating logic
3 Recomputing with shifted operands
4 Recomputing with swapped operands
5 Recomputing with duplication with comparison
6 Problems
7. SOFTWARE REDUNDANCY
1 Introduction
2 Single-version techniques
2.1 Fault detection techniques
2.2 Fault containment techniques
2.3 Fault recovery techniques
2.3.1 Exception handling
2.3.2 Checkpoint and restart
2.3.3 Process pairs
2.3.4 Data diversity
3 Multi-version techniques
3.1 Recovery blocks
3.2 N-version programming
3.3 N self-checking programming
3.4 Design diversity
4 Software Testing
4.1 Statement and Branch Coverage
4.1.1 Statement Coverage
4.1.2 Branch Coverage
4.2 Preliminaries
4.3 Statement Coverage Using Kernels
4.4 Computing Minimum Kernels
I would like to thank KTH students Xavier Lowagie, Sergej Koziner, Chen Fu,
Henrik Kirkeby, Kareem Refaat, Julia Kuznetsova, and Dr. Roman Morawek
from Technikum Wien for carefully reading and correcting the draft of the
manuscript.
I am grateful to the Swedish Foundation for International Cooperation in Re-
search and Higher Education (STINT) for the scholarship KU2002-4044 which
supported my trip to the University of New South Wales, Sydney, Australia,
where the first draft of this book was written during October - December 2002.
INTRODUCTION
3. Applications of fault-tolerance
Originally, fault-tolerance techniques were used to cope with physical defects
of individual hardware components. Designers of early computing systems
employed redundant structures with voting to eliminate the effect of failed
components, error-detection or correcting codes to detect or correct information
errors, diagnostic techniques to locate failed components and automatic switch-
overs to replace them.
Following the development of semiconductor technology, hardware components became intrinsically more reliable, and the need to tolerate component defects diminished in general-purpose applications. Nevertheless, fault tolerance remained necessary in many safety-, mission- and business-critical applications.
FUNDAMENTALS OF DEPENDABILITY
Ah, this is obviously some strange usage of the word 'safe' that I wasn't previously aware of.
—Douglas Adams, "The Hitchhiker's Guide to the Galaxy"
1. Introduction
The ultimate goal of fault tolerance is the development of a dependable system. In broad terms, dependability is the ability of a system to deliver its intended level of service to its users. As society comes to rely on computer systems more and more, the dependability of these systems becomes a critical issue. In airplanes, chemical plants, heart pacemakers and other safety-critical applications, a system failure can cost people's lives or cause an environmental disaster.
In this section, we study three fundamental characteristics of dependability: attributes, impairments and means. Dependability attributes describe the properties which are required from a system. Dependability impairments express the reasons for a system to cease to perform its function or, in other words, the threats to dependability. Dependability means are the methods and techniques enabling the development of a dependable computing system.
2. Dependability attributes
The attributes of dependability express the properties which are expected
from a system. Three primary attributes are reliability, availability and safety.
Other possible attributes include maintainability, testability, performability, confidentiality and security. Depending on the application, one or more of these attributes are needed to evaluate the system behavior appropriately. For example, in an automatic teller machine (ATM), the proportion of time in which the system is able to deliver its intended level of service (the system availability) is an important measure.
2.1 Reliability
Reliability R(t) of a system at time t is the probability that the system operates without failure in the interval [0, t], given that the system was performing correctly at time 0.
Reliability is a measure of the continuous delivery of correct service. High
reliability is required in situations when a system is expected to operate without
interruptions, as in the case of a pacemaker, or when maintenance cannot be
performed because the system cannot be accessed. For example, a spacecraft mission control system is expected to provide uninterrupted service. A flaw in the system is likely to cause the destruction of the spacecraft, as in the case of NASA's earth-orbiting Lewis spacecraft launched on August 23rd, 1997.
The spacecraft entered a flat spin in orbit that resulted in a loss of solar power
and a fatal battery discharge. Contact with the spacecraft was lost, and it then
re-entered the atmosphere and was destroyed on September 28th. According
to the report of the Lewis Spacecraft Mission Failure Investigation, the failure
was due to a combination of a technically flawed attitude-control system design
and inadequate monitoring of the spacecraft during its crucial early operations
phase.
Reliability is a function of time. The way in which time is specified varies
considerably depending on the nature of the system under consideration. For
example, if a system is expected to complete its mission in a certain period of time, as in the case of a spacecraft, time is likely to be defined as calendar time or as a number of hours. For software, the time interval is often specified in so-called natural or time units. A natural unit is a unit related to the amount
of processing performed by a software-based product, such as pages of output,
transactions, telephone calls, jobs or queries.
2.2 Availability
Relatively few systems are designed to operate continuously without inter-
ruption and without maintenance of any kind. In many cases, we are interested
not only in the probability of failure, but also in the number of failures and, in
particular, in the time required to make repairs. For such applications, the attribute which we would like to maximize is the fraction of time that the system is in the operational state, expressed by availability.
Availability A(t) of a system at time t is the probability that the system is functioning correctly at the instant of time t.

A(t) is also referred to as point availability, or instantaneous availability. Often it is necessary to determine the interval or mission availability. It is defined by

A(T) = (1/T) ∫_0^T A(t) dt.   (2.1)

A(T) is the value of the point availability averaged over some interval of time T. This interval might be the life-time of a system or the time to accomplish some particular task. Finally, it is often found that after some initial transient effect, the point availability assumes a time-independent value. In this case, the steady-state availability is defined by

A(∞) = lim_{T→∞} A(T).   (2.2)

If a system cannot be repaired, the point availability A(t) equals the system's reliability, i.e. the probability that the system has not failed between 0 and t. Thus, as T goes to infinity, the steady-state availability of a non-repairable system goes to zero:

A(∞) = 0.
Steady-state availability is often specified in terms of downtime per year.
Table 2.1 shows the values for the availability and the corresponding downtime.
Availability    Downtime
90%             36.5 days/year
99%             3.65 days/year
99.9%           8.76 hours/year
99.99%          52 minutes/year
99.999%         5 minutes/year
99.9999%        31 seconds/year

Table 2.1. Availability and the corresponding downtime per year.
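The downtime figures above follow directly from the availability; a small sketch of the conversion (the helper name and the checked values are ours, not from the book):

```python
# Hypothetical helper (name is ours): converts a steady-state availability
# into the expected downtime per year, reproducing the figures in Table 2.1.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

# 99.9% availability -> 8.76 hours/year, as in the table
assert abs(downtime_hours_per_year(0.999) - 8.76) < 1e-9
# "five nines" -> about 5 minutes/year
assert abs(downtime_hours_per_year(0.99999) * 60 - 5.256) < 1e-6
```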
such web sites should be available all the time and should respond quickly even when a large number of clients access them concurrently. Another example is an electric power control system. Customers expect power to be available 24 hours a day, every day, in any weather conditions. In some cases, a prolonged power failure may lead to a health hazard, due to the loss of services such as water pumps, heating, light, or medical attention. Industries may suffer substantial financial losses.
2.3 Safety
Safety can be considered as an extension of reliability, namely reliability with respect to failures that may create safety hazards. From the reliability point of view, all failures are equal. In the case of safety, failures are partitioned into fail-safe and fail-unsafe ones.
As an example consider an alarm system. The alarm may either fail to
function even though a dangerous situation exists, or it may give a false alarm
when no danger is present. The former is classified as a fail-unsafe failure. The
latter is considered a fail-safe one. More formally, safety is defined as follows.
Safety S(t) of a system is the probability that the system will either perform its function correctly or will discontinue its operation in a fail-safe manner.
Safety is required in safety-critical applications where a failure may result in human injury, loss of life or environmental disaster. Examples are chemical or nuclear power plant control systems, and aerospace and military applications.
Many unsafe failures are caused by human mistakes. For example, the Chernobyl accident on April 26th, 1986, happened because all safety systems were shut off to allow an experiment aimed at investigating the possibility of producing electricity from the residual energy in the turbo-generators. The experiment was badly planned, and was led by an electrical engineer who was not familiar with the reactor facility. The experiment could not be canceled when things went wrong, because all automatic shutdown systems and the emergency core cooling system of the reactor had been manually turned off.
3. Dependability impairments
Dependability impairments are usually defined in terms of faults, errors and failures. A common feature of the three terms is that they tell us that something went wrong. The difference is that, in the case of a fault, the problem occurred on the physical level; in the case of an error, the problem occurred on the computational level; in the case of a failure, the problem occurred on the system level.
The fourth cause of faults are external factors, which arise from outside the
system boundary, the environment, the user or the operator. External factors
include phenomena that directly affect the operation of the system, such as tem-
perature, vibration, electrostatic discharge, nuclear or electromagnetic radiation
or that affect the inputs provided to the system. For instance, radiation causing
a bit to flip in a memory location is a fault caused by an external factor. Faults
caused by user or operator mistakes can be accidental or malicious. For exam-
ple, a user can accidentally provide incorrect commands to a system that can
lead to system failure, e.g. improperly initialized variables in software. Mali-
cious faults are the ones caused, for example, by software viruses and hacker
intrusions.
Coupling faults are more difficult to test because they depend upon more than
one line. An example of a coupling fault would be a short-circuit between two
adjacent word lines in a memory. Writing a value to a memory cell connected
to one of the word lines would also result in that value being written to the
corresponding memory cell connected to the other short-circuited word line.
Two types of transition coupling faults include inversion coupling faults in
which a specific transition in one memory cell inverts the contents of another
memory cell, and idempotent coupling faults in which a specific transition of
one memory cell results in a particular value (0 or 1) being written to another
memory cell.
Clearly, fault models are not accurate in 100% of cases, because faults can cause a variety of different effects. However, studies have shown that a combination of several fault models can give very precise coverage of actual faults. For example, for memories, practically all faults can be modeled as a combination of stuck-at faults, transition faults and idempotent coupling faults.
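To illustrate how such fault models are exploited in practice, here is a toy simulation, entirely our own sketch and not from the book, of a simple MATS+ march test detecting a stuck-at-0 memory cell:

```python
# Toy memory model with an optional stuck-at-0 cell (illustrative only).
class FaultyMemory:
    def __init__(self, size, stuck_at_zero=None):
        self.size = size
        self.cells = [0] * size
        self.stuck = stuck_at_zero  # index of a stuck-at-0 cell, or None

    def write(self, addr, value):
        self.cells[addr] = 0 if addr == self.stuck else value

    def read(self, addr):
        return self.cells[addr]

def mats_plus(mem):
    """March test MATS+: up(w0); up(r0, w1); down(r1, w0).
    Returns True if no fault is observed, False on a mismatch."""
    for a in range(mem.size):            # up(w0)
        mem.write(a, 0)
    for a in range(mem.size):            # up(r0, w1)
        if mem.read(a) != 0:
            return False                 # fault detected
        mem.write(a, 1)
    for a in reversed(range(mem.size)):  # down(r1, w0)
        if mem.read(a) != 1:
            return False                 # fault detected
        mem.write(a, 0)
    return True

assert mats_plus(FaultyMemory(8)) is True                    # fault-free memory
assert mats_plus(FaultyMemory(8, stuck_at_zero=3)) is False  # stuck cell caught
```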
4. Dependability means
Dependability means are the methods and techniques enabling the development of a dependable system. Fault tolerance, which is the subject of this book, is one such method. It is normally used in combination with other methods to attain dependability, such as fault prevention, fault removal and fault forecasting. Fault prevention aims to prevent the occurrence or introduction
of faults. Fault removal aims to reduce the number of faults which are present
in the system. Fault forecasting aims to estimate how many faults are present,
possible future occurrences of faults, and the impact of the faults on the system.
5. Problems
2.1. What is the primary goal of fault tolerance?
2.2. Give three examples of applications in which a system failure can cost
people’s lives or environmental disaster.
2.6. Define the reliability of a system. What property of a system does reliability characterize? In which situations is high reliability required?
2.8. What is the difference between reliability and availability? How does the point availability compare to the system's reliability if the system cannot be repaired? What is the steady-state availability of a non-repairable system?
A common mistake that people make when trying to design something completely foolproof
is to underestimate the ingenuity of complete fools.
—Douglas Adams, Mostly Harmless
1. Introduction
Along with cost and performance, dependability is the third critical criterion on which system-related decisions are made. Dependability evaluation is important because it helps identify which aspects of system behavior, e.g. component reliability, fault coverage or maintenance strategy, play a critical role in determining overall system dependability. Thus, it provides a proper focus for product improvement efforts from early in the development stage through fabrication and test.
There are two conventional approaches to dependability evaluation: (1) mod-
eling of a system in the design phase, or (2) assessment of the system in a later
phase, typically by test. The first approach relies on probabilistic models that
use component level failure rates published in handbooks or supplied by the
manufacturers. This approach provides an early indication of system depend-
ability, but the model as well as the underlying data later need to be validated
by actual measurements. The second approach typically uses test data and re-
liability growth models. It involves fewer assumptions than the first, but it can
be very costly. The higher the dependability required for a system, the longer
the test. A further difficulty arises in the translation of reliability data obtained
by test into those applicable to the operational environment.
Dependability evaluation has two aspects. The first is qualitative evaluation, which aims to identify, classify and rank the failure modes, or the event combinations that would lead to system failures. For example, component faults or
0 ≤ p(A) ≤ 1.   (3.1)

Let A̅ denote the event "not A". For example, if A stands for "it rains", A̅ stands for "it does not rain". The second axiom of probability theory says that the probability of the event A̅ equals 1 minus the probability of the event A:

p(A̅) = 1 - p(A).   (3.2)

The probability that both A and B occur equals the probability that B occurs times the conditional probability P(A|B):

p(A ∩ B) = p(B) p(A|B).   (3.3)

If p(B) is greater than zero, the equation (3.3) can be written as

p(A|B) = p(A ∩ B) / p(B).   (3.4)

An important condition that we will often assume is that two events are mutually independent. For events A and B to be independent, the probability p(A) must not depend on whether B has already occurred or not, and vice versa. Thus, p(A|B) = p(A). So, for independent events, the rule (3.3) reduces to

p(A ∩ B) = p(A) p(B).   (3.5)

This is the definition of independence: the probability of two events both occurring is the product of the probabilities of each event occurring. Situations also arise when the events are mutually exclusive. That is, if A occurs, B cannot, and vice versa. So, p(A|B) = 0 and p(B|A) = 0, and the equation (3.3) becomes

p(A ∩ B) = 0,  if A and B are mutually exclusive events.   (3.6)

This is the definition of mutual exclusiveness: the probability of two events both occurring is zero.

Let us now consider the situation when either A, or B, or both events may occur. The probability p(A ∪ B) is given by

p(A ∪ B) = p(A) + p(B) - p(A ∩ B).   (3.7)

Combining (3.6) and (3.7), we get

p(A ∪ B) = p(A) + p(B),  if A and B are mutually exclusive events.   (3.8)
2
individual components is independent. By rule (3.2), the probability that a
% 2 &
single component fails is 1 R. Then, by rule (3.5), the probability that a single
component fails and the other two remain operational is 1 R R 2 . Since, the
% 2 &
probabilities of any of the three components to fail are the same, the the overall
probability of one component failing and other two not is 3 1 R R 2 . The three
probabilities are added by applying rule (3.8), because the events are mutually
% 3&
inclusive. Suppose that one event, A is dependent on another event, B. Then
P A B denotes the conditional probability of event A, given event B.
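The chain of rules above is easy to check numerically; a small sketch (the component reliability value is our own illustrative choice, not from the book):

```python
# Probability that exactly one of three independent, identical components
# fails, combining rules (3.5) and (3.8).
R = 0.9  # assumed component reliability (illustrative)

# rule (3.5): one specific component fails while the other two keep working
p_single = (1 - R) * R * R
# rule (3.8): the three single-failure events are mutually exclusive
p_exactly_one = 3 * p_single

assert abs(p_exactly_one - 0.243) < 1e-12
```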
if a processor fails, on average, once every 1000 hours, then it has a failure rate λ = 1/1000 failures/hour.
Often failure rate data is available at component level, but not for the entire
system. This is because several professional organizations collect and publish
failure rate estimates for frequently used components (diodes, switches, gates,
flip-flops, etc.). At the same time the design of a new system may involve new
configurations of such standard components. When component failure rates are
available, a crude estimation of the failure rate of a non-redundant system can
be done by adding the failure rates λi of the components:

λ = ∑_{i=1}^{n} λi.
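A sketch of this crude series estimate (the component names and rate values are illustrative, not handbook data):

```python
# Crude failure-rate estimate for a non-redundant system: the sum of the
# failure rates of its components (illustrative values, failures per hour).
component_rates = {
    "diode": 1e-7,
    "switch": 5e-7,
    "flip_flop": 2e-7,
}

system_rate = sum(component_rates.values())  # lambda = sum of lambda_i
assert abs(system_rate - 8e-7) < 1e-18
```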
components of the system physically wear out.

Figure 3.1. Typical evolution of failure rate over a life-time of a hardware system.
During the useful life phase of the system, failure rate function is assumed to
have a constant value λ. Then, the reliability of the system varies exponentially
as a function of time:
R(t) = e^(-λt).   (3.9)
function of time is shown in Figure 3.2.
rate due to an upgrade, the failure rate levels off gradually, partly because of
the bugs found and fixed after the upgrades. The second difference is that, in
the last phase, software does not have an increasing failure rate as hardware
does. In this phase, the software is approaching obsolescence and there is no
motivation for more upgrades or changes.
Figure 3.3. Typical evolution of failure rate over a life-time of a software system.
occurrence of the first system failure.

If n identical systems are placed into operation at time t = 0 and the time ti, i = 1, 2, …, n, that each system i operates before failing is measured, then the average of these times is the MTTF:

MTTF = (1/n) ∑_{i=1}^{n} ti.   (3.10)
In terms of the system reliability R(t), MTTF is defined as

MTTF = ∫_0^∞ R(t) dt.   (3.11)
So, MTTF is the area under the reliability curve in Figure 3.2. If the reliability
function obeys the exponential failure law (3.9), then the solution of (3.11) is
given by
MTTF = 1/λ,   (3.12)
where λ is the failure rate of the system. The smaller the failure rate is, the
longer is the time to the first failure.
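The relation between (3.11) and (3.12) is easy to verify numerically; a sketch (the failure rate value and step size are our own illustrative choices):

```python
# Numeric sanity check that the area under the exponential reliability
# curve R(t) = exp(-lambda * t) equals 1/lambda.
import math

lam = 0.001      # failure rate, failures/hour (illustrative)
dt = 0.1         # integration step, hours
steps = 200_000  # integrate out to t = 20/lambda, far into the tail

mttf = 0.0
for k in range(steps):
    mttf += math.exp(-lam * k * dt) * dt  # accumulate R(t) dt

assert abs(mttf - 1.0 / lam) < 1.0        # close to 1/lambda = 1000 hours
```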
In general, MTTF is meaningful only for systems that operate without repair
until they experience a system failure. In a real situation, most of the mission
critical systems undergo a complete check-out before the next mission is under-
taken. All failed redundant components are replaced and the system is returned
to a fully operational status. When evaluating the reliability of such systems,
mission time rather than MTTF is used.
MTTR = 1/µ.   (3.13)

If the system experiences n failures during its lifetime, the total time that the system is operational is n · MTTF. Likewise, the total time the system is being repaired is n · MTTR. The steady-state availability given by the expression (2.2) can be approximated as

A(∞) ≈ (n · MTTF) / (n · MTTF + n · MTTR) = MTTF / (MTTF + MTTR).   (3.14)

The mean time between failures is the sum of the two:

MTBF = MTTF + MTTR.   (3.15)
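A minimal sketch of these relations (the MTTF and MTTR values are assumed for illustration, not taken from the book):

```python
# Steady-state availability approximated from MTTF and MTTR,
# and MTBF as their sum (illustrative values, hours).
mttf = 2000.0  # mean time to failure (assumed)
mttr = 4.0     # mean time to repair (assumed)

availability = mttf / (mttf + mttr)
mtbf = mttf + mttr

assert abs(availability - 2000.0 / 2004.0) < 1e-12
assert mtbf == 2004.0
```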
form the expected actions when a fault occurs. More precisely, fault coverage is defined in terms of the conditional probability P(A|B), read as "the probability of A given B".
Fault detection coverage is the conditional probability that, given the exis-
tence of a fault, the system detects it.
Figure 3.4. Reliability block diagram of a two-component system: (a) serial; (b) parallel.

Figure 3.5. Reliability block diagram of a three-component system.
Reliability block diagrams are a popular model, because they are easy to
understand and to use for modeling systems with redundancy. In the next
section we will see that they are also easy to evaluate using analytical methods.
However, reliability block diagrams, as well as other combinatorial reliability
models, have a number of serious limitations.
First, reliability block diagrams assume that the system components are limited to the operational and failed states and that the system configuration does not change during the mission. Hence, they cannot model standby components, repair, or complex fault detection and recovery mechanisms. Second, the failures of the individual components are assumed to be independent. Therefore, the case when the sequence of component failures affects system reliability cannot be adequately represented.
the state space is discrete. For example, a system might have two states: op-
erational and failed. The time scale is usually continuous, which means that
component failure and repair times are random variables. Thus, Continuous
Time Markov Chains are the most commonly used. In some textbooks, they are
called Continuous Markov Models. There are, however, applications in which the time scale is discrete. Examples include synchronous communication protocols, shifts in equipment operation, etc. If both the time and the state space are discrete, then the process is called a Discrete Time Markov Chain.
Markov processes are illustrated graphically by state transition diagrams. A state transition diagram is a directed graph G = (V, E), where V is the set of vertices representing system states and E is the set of edges representing system transitions. A state transition diagram is a mathematical model which can be used to represent a wide variety of processes, e.g. radioactive decay or chemical reactions. For dependability models, a state is defined to be a particular combination of operating and failed components. For example, if we have a system consisting of two components, then there are four different combinations, enumerated in Table 3.2, where O indicates an operational component and F indicates a failed component.
Component 1    Component 2    State Number
O              O              1
O              F              2
F              O              3
F              F              4

Table 3.2. Markov states of a two-component system.
The state transitions reflect the changes which occur within the system state. For example, if a system with two identical components is in the state (11), and the first module fails, then the system moves to the state (01). So, a Markov process represents the possible chains of events which occur within a system. In the case of dependability analysis, these events are failures and repairs.
Each edge carries a label, reflecting the rate at which the state transitions
occur. Depending on the modeling goals, this can be failure rate, repair rate or
both.
We illustrate the concept first on a simple system, consisting of a single component (Figure 3.6).

Figure 3.6. State transition diagram of a single-component system.

If repair is allowed, then a transition between the failed and the operational states is possible, with a repair rate µ (Figure 3.7). State diagrams incorporating repair are used in availability analysis.

Figure 3.7. State transition diagram of a single-component system incorporating repair.
transition between the state 1 and the failed-unsafe state 3 depends on the failure rate λ and the probability that a fault is not detected, i.e. 1 - C.

Figure 3.8. State transition diagram of a single-component system for safety analysis.
Figure 3.9. The failure rates λ1 and λ2 for components 1 and 2 indicate the rates at which the transitions are made between the states. The two components are assumed to be independent and non-repairable.

Figure 3.9. State transition diagram for a system of two independent components.
component system shown in Figure 3.9 are in parallel. If the components have identical failure rates λ1 = λ2 = λ, then it is not necessary to distinguish between the states 2 and 3. Both states represent a condition where one component is operational and one is failed. So, we can merge these two states into one
(Figure 3.10). The assignments of the state numbers in the simplified transition
diagram are shown in Table 3.3. Since the failures of components are assumed
Component 1    Component 2    State Number
O              O              1
O              F              2
F              O              2
F              F              3

Table 3.3. Markov states of a simplified state transition diagram of a two-component parallel system.
to be independent events, the transition rate from the state 1 to the state 2 in Figure 3.10 is the sum of the transition rates from the state 1 to the states 2 and 3 in Figure 3.9, i.e. 2λ.
Figure 3.10. Simplified state transition diagram of a two-component parallel system.
solution is composed from the reliabilities of the parts.

Given a system consisting of n components with Ri(t) being the reliability of the ith component, the reliability of the overall system is given as follows. In a serial system, all components should be operational for the system to function correctly. Hence, by rule (3.5),

R_serial(t) = ∏_{i=1}^{n} Ri(t).

In a parallel system, only one of the components is required for the system to be operational.
For example, if a serial system with 100 components is to be built, and each of the components has a reliability of 0.999, the overall system reliability is 0.999^100 = 0.905.
On the other hand, a parallel system can be made reliable despite the unreliability of its component parts. For example, a parallel system of four identical modules with module reliability 0.95 has the system reliability 1 - (1 - 0.95)^4 = 0.99999375. Clearly, however, the cost of the parallelism can be high.
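The two worked examples above can be reproduced with a short sketch (our own code, not from the book):

```python
# Serial reliability: product of the component reliabilities.
# Parallel reliability: one minus the product of the unreliabilities.
def r_serial(reliabilities):
    p = 1.0
    for r in reliabilities:
        p *= r
    return p

def r_parallel(reliabilities):
    q = 1.0
    for r in reliabilities:
        q *= (1.0 - r)
    return 1.0 - q

# 100 components of reliability 0.999 in series -> about 0.905
assert round(r_serial([0.999] * 100), 3) == 0.905
# 4 modules of reliability 0.95 in parallel -> 0.99999375
assert abs(r_parallel([0.95] * 4) - 0.99999375) < 1e-12
```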
failed components simultaneously. Given a system consisting of n components with Ai(t) being the availability of the ith component, the availability of the overall system is given by

A_parallel(t) = 1 - ∏_{i=1}^{n} (1 - Ai(t)).

on this model.
The aim of Markov process analysis is to calculate Pi(t), the probability that the system is in the state i at time t. Once this is known, the system reliability, availability or safety can be computed as a sum taken over all the operating states.
Let us designate the state 1 as the state in which all the components are operational. Assuming that at t = 0 the system is in the state 1, we get

P1(0) = 1.

Since at any time the system can only be in one state, Pi(0) = 0 for all i ≠ 1, and we have

∑_i Pi(t) = 1,   (3.18)

where the sum is over all possible states.
To determine the Pi(t), we derive a set of differential equations, one for each state of the system. These equations are called state transition equations, because they allow the Pi(t) to be determined in terms of the rates (failure, repair) at which transitions are made from one state to another. State transition equations are usually presented in matrix form. The matrix M whose entry mij is the rate of transition between the states i and j is called the transition matrix associated with the system. We use the first index i for the columns of the matrix and the second index j for the rows, i.e. M has the following structure:

      | m11  m21  ...  mk1 |
M =   | m12  m22  ...  mk2 |
      | ...  ...  ...  ... |
      | m1k  m2k  ...  mkk |
where k is the number of states in the state transition diagram representing the system. In reliability or availability analysis, the components of the system are normally assumed to be in either operational or failed states. So, if a system consists of n components, then k = 2^n. In safety analysis, where the system can fail in either a safe or an unsafe way, k can be up to 3^n. The entries in each column of the transition matrix must sum to 0. So, the entries mii corresponding to self-transitions are computed as -∑ mij, over all j ∈ {1, 2, …, k} such that j ≠ i.
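This construction rule can be sketched in a few lines (the helper and the rate values are ours, not from the book; we follow the column-i, row-j convention of the text):

```python
# Build a transition matrix from a dictionary of rates, filling the
# diagonal so that every column sums to zero.
def transition_matrix(k, rates):
    """rates maps (i, j) -> rate of the transition i -> j, states 1..k.
    Entry m[j-1][i-1] holds m_ij (column i, row j)."""
    m = [[0.0] * k for _ in range(k)]
    for (i, j), rate in rates.items():
        m[j - 1][i - 1] = rate
        m[i - 1][i - 1] -= rate  # keep column i summing to zero
    return m

lam, mu = 0.01, 0.1
# Single repairable component (Figure 3.7): 1 -> 2 at rate lambda, 2 -> 1 at mu
m = transition_matrix(2, {(1, 2): lam, (2, 1): mu})
assert m == [[-lam, mu], [lam, -mu]]  # matches the matrix (3.20)
```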
For example, the transition matrix for the state transition diagram of a single-component system shown in Figure 3.6 is

M = | -λ  0 |
    |  λ  0 |.   (3.19)

The rate of the transition from the state 1 to the state 2 is λ, so m12 = λ and therefore m11 = -λ. The rate of transition from the state 2 to the state 1 is 0, so m21 = 0 and thus m22 = 0.
Similarly, the transition matrix for the state transition diagram in Figure 3.7,
which incorporates repair, is
M = | -λ   µ |
    |  λ  -µ |.   (3.20)
The transition matrix for the state transition diagram in Figure 3.8 is of size 3 × 3, since, for safety analysis, the system is modeled to be in three different states: operational, failed-safe and failed-unsafe:

M = | -λ        0  0 |
    | λC        0  0 |.   (3.21)
    | λ(1 - C)  0  0 |
The transition matrix for the simplified state transition diagram of the two-component system, shown in Figure 3.10, is

M = | -2λ   0  0 |
    |  2λ  -λ  0 |.   (3.22)
    |   0   λ  0 |
Using state transition matrices, state transition equations are derived as follows. Let P(t) be a vector whose ith element is the probability Pi(t) that the system is in the state i at time t. Then the matrix representation of a system of state transition equations is given by

d/dt P(t) = M · P(t).   (3.23)

Once the system of equations is solved and the Pi(t) are known, the system reliability, availability or safety can be computed as a sum taken over all the operating states.
We illustrate the computation process on a number of simple examples.
Applying (3.23) to the matrix (3.22), we get

     | P1(t) |   | -2λ   0  0 |   | P1(t) |
d/dt | P2(t) | = |  2λ  -λ  0 | · | P2(t) |,
     | P3(t) |   |   0   λ  0 |   | P3(t) |

or, written out,

d/dt P1(t) = -2λ P1(t),
d/dt P2(t) = 2λ P1(t) - λ P2(t),
d/dt P3(t) = λ P2(t).

Solving this system with the initial condition P1(0) = 1, we obtain

P1(t) = e^(-2λt),
P2(t) = 2e^(-λt) - 2e^(-2λt),
P3(t) = 1 - 2e^(-λt) + e^(-2λt).
Since the $P_i(t)$ are known, we can now calculate the reliability of the system. For the parallel configuration, both components should fail to cause a system failure. Therefore, the reliability of the system is the sum of the probabilities $P_1(t)$ and $P_2(t)$:
$$R_{parallel}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \qquad (3.24)$$
In the general case, the reliability of the system is computed using the equation
$$R(t) = \sum_{i \in O} P_i(t) \qquad (3.25)$$
where the sum is taken over all the operating states $O$. Alternatively, the reliability can be calculated as
$$R(t) = 1 - \sum_{i \in F} P_i(t)$$
where the sum is taken over all the states $F$ in which the system has failed.
Note that, for constant failure rates, the component reliability is $R(t) = e^{-\lambda t}$. Therefore, equation (3.24) can be written as $R_{parallel} = 2R - R^2$, which agrees with the expression (3.16) derived using reliability block diagrams. The two results are the same because in this example we assumed the component failures to be mutually independent.
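The agreement between the Markov solution and the closed-form result can also be checked numerically. The following sketch (not part of the original text; $\lambda = 1$ is an illustrative value) integrates the state transition equations $dP/dt = M \cdot P$ for the matrix (3.22) with a fourth-order Runge-Kutta scheme and compares the sum of the operating-state probabilities with equation (3.24):

```python
import math

lam = 1.0  # illustrative failure rate
# Transition matrix (3.22) of the two-component parallel system
M = [[-2*lam, 0.0, 0.0],
     [ 2*lam, -lam, 0.0],
     [ 0.0,    lam, 0.0]]

def deriv(P):
    """Right-hand side of dP/dt = M * P."""
    return [sum(M[i][j] * P[j] for j in range(3)) for i in range(3)]

def rk4_step(P, h):
    k1 = deriv(P)
    k2 = deriv([P[i] + h/2*k1[i] for i in range(3)])
    k3 = deriv([P[i] + h/2*k2[i] for i in range(3)])
    k4 = deriv([P[i] + h*k3[i] for i in range(3)])
    return [P[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(3)]

P = [1.0, 0.0, 0.0]      # system starts in state 1 (both components operational)
t, h = 0.0, 0.001
while t < 1.0 - 1e-12:
    P = rk4_step(P, h)
    t += h

R_markov = P[0] + P[1]                                # sum over operating states, (3.25)
R_closed = 2*math.exp(-lam*t) - math.exp(-2*lam*t)    # closed form (3.24)
print(R_markov, R_closed)
```

The two values agree to within the integration error, illustrating that solving the state transition equations reproduces the reliability block diagram result when the failures are independent.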
component increases. The increased failure rates of components 1 and 2 are denoted by $\lambda_1'$ and $\lambda_2'$, respectively.
Figure 3.11. State transition diagram of a two-component parallel system with load sharing.
From the state transition diagram in Figure 3.11, we can derive the state transition equations for $P_i(t)$. In matrix form, they are
$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix} = \begin{bmatrix} -(\lambda_1 + \lambda_2) & 0 & 0 & 0 \\ \lambda_1 & -\lambda_2' & 0 & 0 \\ \lambda_2 & 0 & -\lambda_1' & 0 \\ 0 & \lambda_2' & \lambda_1' & 0 \end{bmatrix} \cdot \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix}$$
Solving this system, we get
$$P_1(t) = e^{-(\lambda_1 + \lambda_2) t}$$
$$P_2(t) = \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda_2'} \left( e^{-\lambda_2' t} - e^{-(\lambda_1 + \lambda_2) t} \right)$$
$$P_3(t) = \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda_1'} \left( e^{-\lambda_1' t} - e^{-(\lambda_1 + \lambda_2) t} \right)$$
$$P_4(t) = 1 - P_1(t) - P_2(t) - P_3(t)$$
Since state 4 is the only failed state, the reliability of the system is $P_1(t) + P_2(t) + P_3(t)$:
$$R_{parallel}(t) = e^{-(\lambda_1 + \lambda_2) t} + \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda_2'} \left( e^{-\lambda_2' t} - e^{-(\lambda_1 + \lambda_2) t} \right) + \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda_1'} \left( e^{-\lambda_1' t} - e^{-(\lambda_1 + \lambda_2) t} \right) \qquad (3.26)$$
If $\lambda_1' = \lambda_1$ and $\lambda_2' = \lambda_2$, the above equation is equal to (3.24). The effect of the increased loading can be illustrated as follows. Assume that the two components are identical, i.e. $\lambda_1 = \lambda_2 = \lambda$ and $\lambda_1' = \lambda_2' = \lambda'$. Then, the equation (3.26) reduces to
$$R_{parallel}(t) = \frac{2\lambda}{2\lambda - \lambda'} \, e^{-\lambda' t} - \frac{\lambda'}{2\lambda - \lambda'} \, e^{-2\lambda t}$$
Figure 3.12. Reliability of a two-component parallel system with load sharing.
Figure 3.12 shows the reliability of a two-component parallel system with load sharing for different values of $\lambda'$. The reliability $e^{-\lambda t}$ of a single-component system is also plotted for comparison. In the case $\lambda' = \lambda$, the two components are independent, so the reliability is given by (3.24). The case $\lambda' = \infty$ corresponds to total dependency: the failure of one component brings an immediate failure of the other component, so the reliability equals the reliability of a series system with two components (3.16). It can be seen that, the more the value of $\lambda'$ exceeds the value of $\lambda$, the closer the reliability of the system approaches that of a two-component series system.
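The two limiting cases can be checked directly from the reduced formula. The sketch below (illustrative rates, not from the original text) evaluates $R(t) = \frac{2\lambda}{2\lambda-\lambda'}e^{-\lambda' t} - \frac{\lambda'}{2\lambda-\lambda'}e^{-2\lambda t}$ and confirms that $\lambda' = \lambda$ recovers the independent-parallel result (3.24), while a very large $\lambda'$ approaches the series-system reliability $e^{-2\lambda t}$:

```python
import math

def r_load_sharing(lam, lam_p, t):
    """Reduced load-sharing reliability for identical components.

    lam   -- failure rate of each component while both operate
    lam_p -- increased failure rate of the survivor (lam_p != 2*lam)
    """
    return (2*lam / (2*lam - lam_p)) * math.exp(-lam_p * t) \
         - (lam_p / (2*lam - lam_p)) * math.exp(-2*lam * t)

# lam_p = lam: independent components, should match 2e^{-lt} - e^{-2lt}
print(r_load_sharing(1.0, 1.0, 0.5))
# very large lam_p: survivor fails almost immediately, close to e^{-2lt}
print(r_load_sharing(1.0, 1000.0, 0.5), math.exp(-1.0))
```

Note the formula has a removable singularity at $\lambda' = 2\lambda$, which the direct evaluation above does not handle.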
Component 1   Component 2   State Number
O             O             1
F             O             2
F             F             3
Table 3.4. Markov states of a simplified state transition diagram of a two-component parallel system incorporating repair.
is a transition between the states 1 and 2. If the system is in state 2 and the backup component fails, there is a transition to state 3. Since we assumed that the backup unit cannot fail while in the standby mode, the combination (O, F) cannot occur. States 1 and 2 are operational states. State 3 is the failed state.
Figure 3.13. State transition diagrams for a standby two-component system: (a) for reliability analysis, (b) for availability analysis.
Suppose the primary unit can be repaired with a rate $\mu$. For reliability analysis, this implies that a transition between states 2 and 1 is possible. The corresponding transition matrix is given by
$$M = \begin{bmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -(\lambda_2 + \mu) & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix}$$
For availability analysis, we should be able to repair the backup unit as well. This adds a transition between states 3 and 2. We assume that the repair rates for the primary and backup units are the same. We also assume that the backup unit will be repaired first. The corresponding transition matrix is given by
$$M = \begin{bmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -(\lambda_2 + \mu) & \mu \\ 0 & \lambda_2 & -\mu \end{bmatrix} \qquad (3.27)$$
One can see that, in the matrix for availability calculations, none of the diagonal elements is zero. This is because the system should be able to recover from the failed state. By solving the system of state transition equations, we can get the $P_i(t)$ and compute the availability of the system as
$$A(t) = 1 - \sum_{i \in F} P_i(t) \qquad (3.28)$$
In the steady state, the probabilities no longer change, so the state transition equations reduce to
$$M \cdot P(\infty) = 0 \qquad (3.29)$$
In our example, for the matrix (3.27) this represents the system of equations
$$-\lambda_1 P_1(\infty) + \mu P_2(\infty) = 0$$
$$\lambda_1 P_1(\infty) - (\lambda_2 + \mu) P_2(\infty) + \mu P_3(\infty) = 0$$
$$\lambda_2 P_2(\infty) - \mu P_3(\infty) = 0$$
Since these three equations are linearly dependent, they are not sufficient to solve for $P(\infty)$. The needed piece of additional information is the condition (3.18) that the sum of all probabilities is one:
$$\sum_i P_i(\infty) = 1 \qquad (3.30)$$
Solving (3.29) together with (3.30) for the case $\lambda_1 = \lambda_2 = \lambda$, we get
$$P_1(\infty) = \left( 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right)^{-1}$$
$$P_2(\infty) = \left( 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right)^{-1} \frac{\lambda}{\mu}$$
$$P_3(\infty) = \left( 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right)^{-1} \left( \frac{\lambda}{\mu} \right)^2$$
Since state 3 is the only failed state, the steady-state availability is
$$A(\infty) = 1 - \left( \frac{\lambda}{\mu} \right)^2 \left( 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right)^{-1}$$
If we further assume that $\lambda / \mu \ll 1$, we can write
$$A(\infty) \approx 1 - \left( \frac{\lambda}{\mu} \right)^2$$
2
To summarize, steady-state availability problems are solved by the same pro-
cedure as time-dependent availability. Any n 1 of the n equations represented
% &
by (3.29) are combined with the condition 3.30 to solve for the components of
P ∞ . These are then substituted into (3.28) to obtain availability.
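This procedure is mechanical enough to sketch in a few lines of code. The example below (illustrative rates, not from the original text) takes two of the three dependent equations from the matrix (3.27) with $\lambda_1 = \lambda_2 = \lambda$, replaces the third by the normalization condition (3.30), and solves the resulting linear system by Gaussian elimination:

```python
lam, mu = 0.01, 1.0
# Transition matrix (3.27) with lam1 = lam2 = lam
Mmat = [[-lam,          mu,  0.0],
        [ lam, -(lam + mu),  mu ],
        [ 0.0,         lam, -mu ]]

# Two rows of M * P(inf) = 0, plus the normalization sum(P) = 1
A = [Mmat[0][:], Mmat[1][:], [1.0, 1.0, 1.0]]
b = [0.0, 0.0, 1.0]

# Gaussian elimination with partial pivoting
n = 3
for i in range(n):
    p = max(range(i, n), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, n):
        f = A[r][i] / A[i][i]
        for c in range(i, n):
            A[r][c] -= f * A[i][c]
        b[r] -= f * b[i]

# Back substitution
P = [0.0] * n
for i in reversed(range(n)):
    P[i] = (b[i] - sum(A[i][c] * P[c] for c in range(i + 1, n))) / A[i][i]

rho = lam / mu
avail = 1.0 - P[2]                              # state 3 is the failed state, (3.28)
closed = 1.0 - rho**2 / (1.0 + rho + rho**2)    # closed-form steady-state availability
print(avail, closed)
```

The numerical solution matches the closed-form expression derived above; for $\lambda/\mu = 0.01$ both are close to $1 - (\lambda/\mu)^2$.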
%&
Its state transition matrix is given by (3.21). So, the state transition equations
for Pi t are given by
% & 2λ % &
4 | % & -
% & *
P1 t 0 0 P1 t
dt | % & | λ % 1 2 C&
d
λC
%&
P2 t 0 0 P2 t
P3 t 0 0 P3 t
% &* eA
P1 t λt
P % t &* C 2 Ce A λt
The safety of the system is the sum of probabilities of being in the operational
or fail-safe states, i.e.
*
% &¶*
At time t 0 the safety of the system is 1, as expected. As time approaches
*
infinity, the safety approaches the fault detection coverage, S 0 C. So, if
C 1, the system has a perfect safety.
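As a small worked check of the two limits just stated, the safety function $S(t) = C + (1 - C)e^{-\lambda t}$ can be evaluated directly (the rate and coverage values below are illustrative, not from the text):

```python
import math

def safety(lam, cov, t):
    """Safety of the fail-safe/fail-unsafe model: S(t) = C + (1 - C) e^{-lt}."""
    return cov + (1.0 - cov) * math.exp(-lam * t)

print(safety(0.1, 0.9, 0.0))      # at t = 0 the safety is 1
print(safety(0.1, 0.9, 1000.0))   # for large t the safety approaches the coverage C
```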
6. Problems
3.1. Why is dependability evaluation important?
3.2. What is the difference between qualitative and quantitative evaluation?
3.3. Define the failure rate. How can the failure rate of a non-redundant system be computed from the failure rates of its components?
3.4. How does a typical evolution of failure rate over a system’s life-time differ
for hardware and software?
3.5. What is the mean time to failure of a system? How can the MTTF of a
non-redundant system be computed from the MTTF of its components?
3.6. What is the difference between the mean time to repair and the mean time
between failures?
3.7. A heart pacemaker has a constant failure rate of $\lambda = 8.1678 \cdot 10^{-9}$ per hour.
(a) What is the probability that it will fail during the first year of operation?
(b) What is the probability that it will fail within 5 years of operation?
(c) What is the MTTF?
3.8. A logic circuit with no redundancy consists of 16 two-input NAND gates
and 3 J-K flip-flops. Assuming the constant failure rates of a two-input
NAND gate and a J-K flip-flop are 0.2107 and 0.4667 per hour, respectively,
compute
(a) the failure rate of the logic circuit,
(b) the reliability for a 72-hour mission,
(c) the MTTF.
3.9. An automatic teller machine manufacturer determines that his product has a constant failure rate of $\lambda = 77.16$ per $10^6$ hours in normal use. For how long should the warranty be set if no more than 5% of the machines are to be returned to the manufacturer for repair?
3.10. A car manufacturer estimates that the reliability of his product is 99% during
the first 7 years.
(a) How many cars will need a repair during the first year?
(b) What is the MTTF?
3.11. A two-year guarantee is given on a TV-set based on the assumption that no
more than 3% of the items will be returned for repair. Assuming exponential
failure law, what is the maximum failure rate that can be tolerated?
3.12. A DVD-player manufacturer determines that the average DVD set is used
930 hr/year. A two-year warranty is offered on the DVD set having MTTF
of 2500 hr. If the exponential failure law holds, what fraction of the DVD-sets will fail during the warranty period?
3.13. Suppose the failure rate of a jet engine is $\lambda = 10^{-3}$ per hour. What is the probability that more than two engines on a four-engine aircraft will fail during a 4-hour flight? Assume that the failures are independent.
3.14. A non-redundant system with 50 components has a design life reliability of 0.95. The system is re-designed so that it has only 35 components. Estimate the design life reliability of the re-designed system. Assume that all the components have a constant failure rate of the same value and that the failures are independent.
3.15. At the end of one year of service, the reliability of a component with a constant failure rate is 0.96.
(a) What is the failure rate?
(b) If two components are put in parallel, what is the one year reliability?
Assume that the failures are independent.
3.16. A lamp has three 25 W bulbs. The failure rate of a bulb is $\lambda = 10^{-3}$ per year. What is the probability that more than one bulb fails during the first month?
3.17. Suppose a component has a failure rate of $\lambda = 0.007$ per hour. How many components should be placed in parallel if the system is to run for 200 hours with a failure probability of no more than 0.02? Assume that the failures are independent.
3.18. Consider a system with three identical components with failure rate λ. Find
the system failure rate λsys for the following cases:
(a) All three components in series.
(b) All three components in parallel.
3.22. Suppose that the steady-state availability of a standby system should be 0.9. What is the maximum acceptable value of the failure-to-repair ratio $\lambda / \mu$?
3.23. A computer system is designed to have a failure rate of one fault in 5 years.
The rate remains constant over the life of the system. The system has no
fault-tolerance capabilities, so it fails upon occurrence of the first fault. For
such a system:
(a) What is the MTTF?
(b) What is the probability that the system will fail in its first year of oper-
ation?
(c) The vendor of this system wishes to offer insurance against failure for
the three years of operation of the system at some extra cost. The vendor
determined that it should charge $200 for each 10% drop in reliability
to offer such an insurance. How much should the vendor charge the client for this insurance?
3.24. What are the basic assumptions regarding the failures of the individual com-
ponents (a) in reliability block diagram model; (b) in Markov process model?
3.25. A system consists of three modules: $M_1$, $M_2$ and $M_3$. It was analyzed, and the following reliability expression was derived:
$$R_{system} = R_1 R_3 + R_2 R_3 - R_1 R_2 R_3$$
3.26. A system with four modules, $A$, $B$, $C$ and $D$, is connected so that it operates correctly only if one of the two following conditions is satisfied:
modules A and D operate correctly,
module A operates correctly and either B or C operates correctly.
Answer the following questions:
(a) Draw the reliability block diagram of the above system such that every
module appears only once.
(b) What is the reliability of the system? Assume that the reliability of module $X$ is $R_X$ and that the modules fail independently of each other.
(c) What is the reliability of the above system as a function of time $t$, if the failure rates of the components are $\lambda_A = \lambda_B = \lambda_C = \lambda_D = 0.01$ per year? Assume that the exponential failure law holds.
3.27. How many states does a non-simplified state transition diagram of a system consisting of $n$ components have? Assume that every component has only two states: operational and failed.
3.28. Construct the Markov model of the three-component system shown in Figure
3.5. Assume that the components are independent and non-repairable. The
failure rate of the processors 1 and 2 is $\lambda_p$. The failure rate of the memory is $\lambda_m$. Derive and solve the system of state transition equations representing
this system. Compute the reliability of the system.
3.29. Construct the Markov model of the three-component system shown in Figure
3.5 for the case when a failed processor can be repaired. Assume that the
components are independent and that a processor can be repaired as long as
the system has not failed. The failure rate of the processors 1 and 2 is λ p .
The failure rate of the memory is λm . Derive and solve the system of state
transition equations representing this system. Compute the reliability of the
system.
3.30. What is the difference between treatment of repair for reliability and avail-
ability analysis?
3.31. Construct the Markov model of the three-component system shown in Figure
3.5 for availability analysis. Assume that the components are independent
and that the processors and the memory can be repaired after the system
failure. The failure rate of the processors 1 and 2 is λ p . The failure rate of
the memory is λm . Derive and solve the system of state transition equations
representing this system. Compute the availability of the system.
3.32. Suppose that the reliability of a system consisting of 4 blocks, two of which are identical, is given by the following equation:
$$R_{system} = R_1 R_2 R_3 + R_1^2 - R_1^2 R_2 R_3$$
HARDWARE REDUNDANCY
Those parts of the system that you can hit with a hammer (not advised) are called hardware;
those program instructions that you can only curse at are called software.
—Anonymous
1. Introduction
Hardware redundancy is achieved by providing two or more physical in-
stances of a hardware component. For example, a system can include redundant
processors, memories, buses or power supplies. Hardware redundancy is often the only available method for improving the dependability of a system, when other techniques, such as the use of better components, design simplification, manufacturing quality control and software debugging, have been exhausted or shown to be more costly than redundancy.
Originally, hardware redundancy techniques were used to cope with the low reliability of individual hardware elements. Designers of early computing systems replicated components at the gate and flip-flop levels and used comparison or voting to detect or correct faults. As the reliability of hardware components improved, redundancy was shifted to the level of larger components, such as whole memories or arithmetic units, thus reducing the relative complexity of the voter or comparator with respect to that of the redundant units.
There are three basic forms of hardware redundancy: passive, active and hybrid. Passive redundancy achieves fault tolerance by masking the faults that occur, without requiring any action on the part of the system or an operator. Active redundancy requires a fault to be detected before it can be tolerated. After the detection of the fault, the actions of location, containment and recovery are performed to remove the faulty component from the system. Hybrid redundancy combines the passive and active approaches. Fault masking is used to prevent the system from producing momentary erroneous results.
2. Redundancy allocation
The use of redundancy does not immediately guarantee an improvement in
the dependability of a system. The increase in weight, size and power consump-
tion caused by redundancy can be quite severe. The increase in complexity may
diminish the dependability improvement, unless a careful analysis is performed
to show that a more dependable system can be obtained by allocating the re-
dundant resources in a proper way.
A number of possibilities have to be examined to determine at which level
redundancy needs to be provided and which components need to be made redundant. To understand the importance of these decisions, consider a serial system consisting of two components with reliabilities $R_1$ and $R_2$. If the system reliability $R = R_1 R_2$ does not satisfy the design requirements, the designer may decide to make some of the components redundant. The possible choices of redundant configurations are shown in Figure 4.1(a) and (b). Assuming the component failures are mutually independent, the corresponding reliabilities of these systems are
$$R_a = (2R_1 - R_1^2) R_2$$
$$R_b = (2R_2 - R_2^2) R_1$$
Figure 4.1. Redundancy allocation.
It follows from these expressions that higher reliability is achieved if we duplicate the least reliable component. If $R_1 < R_2$, then the configuration of Figure 4.1(a) is preferable, and vice versa.
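The allocation rule can be verified numerically. The following sketch (the reliability values 0.7 and 0.9 are hypothetical, chosen only for illustration) evaluates both configurations of Figure 4.1:

```python
def r_dup_first(r1, r2):
    """Configuration (a): component 1 duplicated, in series with component 2."""
    return (2*r1 - r1**2) * r2

def r_dup_second(r1, r2):
    """Configuration (b): component 2 duplicated, in series with component 1."""
    return (2*r2 - r2**2) * r1

# With R1 = 0.7 < R2 = 0.9, duplicating the weaker component gives the
# higher system reliability:
print(r_dup_first(0.7, 0.9), r_dup_second(0.7, 0.9))
```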
Another important parameter to examine is the level of redundancy. Consider
the system consisting of three serial components. In high-level redundancy, the
entire system in duplicated, as shown in Figure 4.2(a). In low-level redundancy,
the duplication takes place at component level, as shown in Figure 4.2(b). If
each of the block of the diagram is a subsystem, the redundancy can be placed
C G D C G D
at even lower levels.
C G D C G D
?¥© @ ?¥ª @
Figure 4.2. High-level and low-level redundancy.
Let us compare the reliabilities of the systems in Figures 4.2(a) and (b). Assuming that the component failures are mutually independent, we have
$$R_a = 1 - (1 - R_1 R_2 R_3)^2$$
$$R_b = \left( 1 - (1 - R_1)^2 \right) \left( 1 - (1 - R_2)^2 \right) \left( 1 - (1 - R_3)^2 \right)$$
As we can see, the reliabilities $R_a$ and $R_b$ differ, although the systems have the same number of components. If $R_1 = R_2 = R_3 = R$, then the difference is
$$R_b - R_a = 6R^3 (1 - R)^2$$
Consequently, $R_b \geq R_a$, i.e. low-level redundancy yields higher reliability than high-level redundancy. However, this relation only holds if the component failures are truly independent in both configurations. In reality, common-mode failures are more likely to occur with low-level rather than with high-level redundancy, since in high-level redundancy the redundant units are normally physically isolated and are therefore less prone to common sources of stress.
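The identity $R_b - R_a = 6R^3(1-R)^2$ is easy to confirm by direct evaluation. A short sketch (the value $R = 0.9$ is illustrative):

```python
def r_high_level(r):
    """Entire three-component system duplicated, Figure 4.2(a)."""
    return 1 - (1 - r**3)**2

def r_low_level(r):
    """Each component duplicated individually, Figure 4.2(b)."""
    return (1 - (1 - r)**2)**3

r = 0.9
diff = r_low_level(r) - r_high_level(r)
# The computed difference matches the closed form 6 R^3 (1 - R)^2
print(diff, 6 * r**3 * (1 - r)**2)
```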
3. Passive redundancy
The passive redundancy approach masks faults rather than detecting them. Masking ensures that only correct values are passed to the system output in spite of the presence of faults.
Figure 4.3. Triple modular redundancy.
A TMR system can mask only one module fault. A failure in either of the
remaining modules would cause the voter to produce an erroneous result. In
Section 5 we will show that the dependability of a TMR system can be improved
by removing failed modules from the system.
TMR is usually used in applications where a substantial increase in reliability is required for a short period. For example, TMR is used in the logic section of the launch vehicle digital computer (LVDC) of Saturn 5, the rocket that carried Apollo spacecraft into orbit. The functions of the LVDC include the
monitoring, testing and diagnosis of rocket systems to detect possible failures or
unsafe conditions. As a result of using TMR, the reliability of the logic section
for a 250-hr mission is approximately twenty times larger than the reliability
of an equivalent simplex system. However, as we see in the next section, for
longer duration missions, a TMR system is less reliable than a simplex system.
we need to take the reliability of the modules as well as the duration of the mission into account.
A TMR system operates correctly as long as at least two modules operate correctly. Assuming that the voter is perfect and that the module failures are mutually independent, the reliability of a TMR system is given by
$$R_{TMR} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$$
The term $R_1 R_2 R_3$ gives the probability that the first module functions correctly and the second module functions correctly and the third module functions correctly. The term $(1 - R_1) R_2 R_3$ stands for the probability that the first module has failed and the second and third modules function correctly, etc. The overall probability is the sum of the probabilities of these terms, since the corresponding events are mutually exclusive. If $R_1 = R_2 = R_3 = R$, the above equation reduces to
$$R_{TMR} = 3R^2 - 2R^3 \qquad (4.1)$$
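Equation (4.1) can be checked by brute force. The sketch below (not part of the original text; $R = 0.8$ is illustrative) enumerates the eight fail/work combinations of three independent modules and sums the probabilities of those where a majority works, assuming a perfect voter:

```python
from itertools import product

def r_tmr(r):
    """TMR reliability by enumeration of all module states (perfect voter)."""
    total = 0.0
    for works in product([True, False], repeat=3):
        prob = 1.0
        for ok in works:
            prob *= r if ok else (1.0 - r)
        if sum(works) >= 2:    # majority of modules deliver the correct value
            total += prob
    return total

# Enumeration agrees with the closed form 3R^2 - 2R^3
print(r_tmr(0.8), 3*0.8**2 - 2*0.8**3)
```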
Figure 4.4 compares $R_{TMR}$ with the reliability of a simplex system consisting of a single module with reliability $R$. The reliabilities of the modules composing the TMR system are assumed to be equal to $R$.
Figure 4.4. TMR reliability compared to simplex system reliability.
As can be seen, there is a point at which $R_{TMR} = R$. This point can be found by solving the equation $3R^2 - 2R^3 = R$. The three solutions are 0.5, 1 and 0, implying that the reliability of a TMR system is equal to the reliability of a simplex system when the reliability of the module is $R = 0.5$, when the module is perfect ($R = 1$), or when the module has failed ($R = 0$).
* -
system can be fault-tolerant and still have a low overall reliability. For example,
* -
a TMR system build out of poor-quality modules with R 0 2 will have a low
reliability of RT MR 0 136. Vice versa, a system which cannot tolerate any
faults can have a high reliability, e.g. when its components are highly reliable.
However, such a system will fail as soon as the first fault occurs.
Next, let us consider how the reliability of a TMR system changes as a function of time. For a constant failure rate $\lambda$, the reliability of a module varies exponentially as a function of time, $R(t) = e^{-\lambda t}$ (3.9). Substituting this expression into (4.1), we get
$$R_{TMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \qquad (4.2)$$
Figure 4.5 shows how the reliabilities of simplex and TMR systems change as functions of time. The value of $\lambda t$, rather than $t$, is shown on the x-axis.
Figure 4.5. TMR reliability as a function of λt.
The reliability of a TMR system is higher than the reliability of a simplex system in the period between 0 and approximately $\lambda t = 0.7$. That is why TMR is suitable for applications whose mission time is shorter than 0.7 of the MTTF.
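The "approximately 0.7" has a closed form. Substituting $x = e^{-\lambda t}$ into $3x^2 - 2x^3 = x$ gives $2x^2 - 3x + 1 = 0$, whose nontrivial root is $x = 1/2$, i.e. $\lambda t = \ln 2 \approx 0.693$. A quick check:

```python
import math

# Crossover where R_TMR(t) equals the simplex reliability e^{-lt}
crossover = math.log(2.0)   # lambda*t = ln 2, roughly 0.693

r_tmr = 3*math.exp(-2*crossover) - 2*math.exp(-3*crossover)
r_simplex = math.exp(-crossover)
print(crossover, r_tmr, r_simplex)   # both reliabilities equal 1/2 at the crossover
```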
3.1.2 Voting techniques
In the previous section we evaluated the reliability of a TMR system assuming that the voter is perfect. Clearly, such an assumption is not realistic. A more precise estimate of the reliability of a TMR system takes the reliability of the voter into account:
$$R_{TMR} = (3R^2 - 2R^3) R_v$$
The voter is in series with the redundant modules, since if it fails, the whole system fails. The reliability of the voter must be very high in order to keep the overall reliability of TMR higher than the reliability of the simplex system. Fortunately, the voter is typically a very simple device compared to the redundant components, and therefore its failure probability is much smaller. Still, in some systems the presence of a single point of failure is not acceptable by qualitative requirement specifications. We call a single point of failure any component within a system whose failure leads to the failure of the system. In such cases, more complicated voting schemes are used. One possibility is to decentralize voting by having three voters instead of one, as shown in Figure 4.6. Decentralized voting avoids the single point of failure, but requires establishing consensus among the three voters.
Figure 4.6. TMR system with three voters.
Figure 4.7. Logic diagram of a majority voter with 3 inputs.
The value of the output $f$ is determined by the majority of the input values $x_1$, $x_2$, $x_3$. The defining table for this voter is given in Table 4.1.
x1 x2 x3 f
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
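The majority function of Table 4.1 is $f = x_1 x_2 + x_1 x_3 + x_2 x_3$, which the logic diagram of Figure 4.7 realizes with three AND gates and an OR gate. The sketch below implements it and checks it against the defining table:

```python
def majority(x1, x2, x3):
    """2-out-of-3 majority voter: f = x1 x2 + x1 x3 + x2 x3."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)

# Defining table of the voter (Table 4.1)
table = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 1,
}
assert all(majority(*xs) == f for xs, f in table.items())
```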
In general, an NMR system with $N$ modules can mask up to $\lfloor (N-1)/2 \rfloor$ module faults.
Figure 4.8. N-modular redundancy.
Figure 4.9 plots the reliabilities of NMR systems for $n = 1, 3, 5$ and 7. Note that the x-axis shows the interval of time between 0 and $\lambda t = 1$, i.e. up to the MTTF. This interval of time is of most interest for reliability analysis. As expected, larger values of $n$ result in a greater increase in the reliability of the system. At approximately $\lambda t = 0.7$, the reliabilities of the simplex, TMR, 5MR and 7MR systems become equal. After $\lambda t = 0.7$, the reliability of a simplex system is higher than
the reliabilities of redundant systems. So, similarly to TMR, NMR is suitable
for applications with short mission times.
Figure 4.9. Reliability of an NMR system for different values of n.
4. Active redundancy
Active redundancy achieves fault tolerance by first detecting the faults which
occur and then performing actions needed to recover the system back to the op-
erational state. Active redundancy techniques are common in the applications
where temporary erroneous results are preferable to the high degree of redun-
dancy required to achieve fault masking. Infrequent, occasional errors are
allowed, as long as the system recovers back to normal operation in a specified
interval of time.
A duplication with comparison scheme can detect only one module fault.
After the fault is detected, no actions are taken by the system to return to the operational state.
$$R_{DC} = R_1 R_2 \qquad (4.3)$$
or $R_{DC} = R^2$ if $R_1 = R_2 = R$.
Figure 4.11 compares the reliability $R_{DC}$ of a duplication with comparison system to the reliability of a simplex system consisting of a single module with reliability $R$. It can be seen that, unless the modules are perfect ($R = 1$), the reliability of a duplication with comparison system is always smaller than the reliability of a simplex system.
Figure 4.11. Duplication with comparison reliability compared to simplex system reliability.
sic configuration is shown in Figure 4.12. Only one of the $n$ modules is operational and provides the system's output. The remaining $n - 1$ modules serve as spares. A spare is a redundant component which is not needed for the normal system operation. A switch is a device that monitors the active module and switches operation to a spare if an error is reported by the fault-detection unit (FD).
Figure 4.12. Standby sparing redundancy.
There are two types of standby sparing: hot standby and cold standby. In
the hot standby sparing, both operational and spare modules are powered up.
The spares can be switched into use immediately after the operational module
has failed. In the cold standby sparing, the spare modules are powered down
until needed to replace the faulty module. A disadvantage of cold standby
sparing is that time is needed to apply power to a module, perform initialization
and re-computation. An advantage is that the stand-by spares do not consume
power. This is important in applications like satellite systems, where power
consumption is critical. Hot standby sparing is preferable where the momentary interruption of normal operation due to reconfiguration needs to be minimized, as in a nuclear plant control system.
A standby sparing system with $n$ modules can tolerate $n - 1$ module faults.
Here by “tolerate” we mean that the system will detect and locate the faults,
successfully recover from them and continue delivering the correct service.
When the nth fault occurs, it will still be detected, but the system will not be
able to recover back to normal operation.
Standby sparing redundancy technique is used in many systems. One exam-
ple is Apollo spacecraft’s telescope mount pointing computer. In this system,
two identical computers, an active and a spare, are connected to a switching
device that monitors the active computer and switches operation to the backup
in case of a malfunction.
Another example of using standby sparing is Saturn 5 launch vehicle digital
computer (LVDC) memory section. The section consists of two memory blocks,
with each memory being controlled by an independent buffer register and parity-
checked. Initially, only one buffer register output is used. When a parity error
is detected in the memory being used, operation immediately transfers to the
other memory. Both memories are then re-generated by the buffer register of
the "correct" memory, thus correcting possible transient faults.
Standby sparing is also used in Compaq’s NonStop Himalaya server. The
system is composed of a cluster of processors working in parallel. Each proces-
sor has its own memory and copy of the operating system. A primary process
and a backup process are run on separate processors. The backup process mir-
rors all the information in the primary process and is able to instantly take over
in case of a primary processor failure.
Figure 4.13. Standby sparing system with one spare.
Consider a standby sparing scheme with one spare, shown in Figure 4.13.
Let module 1 be a primary module and module 2 be a spare. The state transi-
tion diagram of the system is shown in Figure 4.14. The states are numbered
according to Table 4.2.
Component 1   Component 2   State Number
O             O             1
F             O             2
F             F             3
Table 4.2. Markov states of the state transition diagram of a standby sparing system with one spare.
When the primary component fails, there is a transition between state 1 and state 2. If the system is in state 2 and the spare fails, there is a transition to state 3. Since we assumed that the spare cannot fail while in the standby mode, the combination (O, F) cannot occur. States 1 and 2 are operational states. State 3 is the failed state.
Figure 4.14. State transition diagram of a standby sparing system with one spare.
The transition matrix for the state transition diagram in Figure 4.14 is given by
$$M = \begin{bmatrix} -\lambda_1 & 0 & 0 \\ \lambda_1 & -\lambda_2 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix}$$
So, we get the following system of state transition equations
$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix} = \begin{bmatrix} -\lambda_1 & 0 & 0 \\ \lambda_1 & -\lambda_2 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix} \cdot \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}$$
or
$$\frac{d}{dt} P_1(t) = -\lambda_1 P_1(t), \qquad \frac{d}{dt} P_2(t) = \lambda_1 P_1(t) - \lambda_2 P_2(t), \qquad \frac{d}{dt} P_3(t) = \lambda_2 P_2(t)$$
Solving this system of equations, we get
$$P_1(t) = e^{-\lambda_1 t}$$
$$P_2(t) = \frac{\lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$
$$P_3(t) = 1 - \frac{1}{\lambda_2 - \lambda_1} \left( \lambda_2 e^{-\lambda_1 t} - \lambda_1 e^{-\lambda_2 t} \right)$$
Since state 3 is the only state corresponding to system failure, the reliability of the system is the sum of $P_1(t)$ and $P_2(t)$:
$$R_{SS}(t) = e^{-\lambda_1 t} + \frac{\lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$
which can also be written as
$$R_{SS}(t) = e^{-\lambda_1 t} + \frac{\lambda_1}{\lambda_2 - \lambda_1} \, e^{-\lambda_1 t} \left( 1 - e^{-(\lambda_2 - \lambda_1) t} \right) \qquad (4.4)$$
Expanding $e^{-(\lambda_2 - \lambda_1) t}$ as a series, we can write (4.4) as
$$R_{SS}(t) = e^{-\lambda_1 t} + \lambda_1 e^{-\lambda_1 t} \left( t - \tfrac{1}{2} (\lambda_2 - \lambda_1) t^2 + \cdots \right) \qquad (4.5)$$
Next, let us see how equation (4.5) would change if we ignored the dependency between the failures. If the primary and spare module failures are treated as mutually independent, the reliability of a standby sparing system is the sum of two probabilities: (1) the probability that module 1 operates correctly, and (2) the probability that module 1 has failed and module 2, which replaced it, operates correctly. Then, we get the following expression:
$$R_{SS} = R_1 + (1 - R_1) R_2$$
If $R_1 = R_2 = R$, then
$$R_{SS} = 2R - R^2$$
or
$$R_{SS}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \qquad (4.6)$$
Figure 4.15 compares the plots of the reliabilities (4.5) and (4.6). One can
see that neglecting the dependencies between failures leads to underestimating
the standby sparing system reliability.
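The underestimate is easy to see numerically. Taking the limit $\lambda_2 \to \lambda_1 = \lambda$ in (4.4) gives $R_{SS}(t) = e^{-\lambda t}(1 + \lambda t)$, which the sketch below (illustrative rate $\lambda = 1$, not from the original text) compares with the independent approximation (4.6):

```python
import math

def r_standby(lam, t):
    """Markov (dependent) model, equal rates: limit of (4.4) as lam2 -> lam1."""
    return math.exp(-lam * t) * (1.0 + lam * t)

def r_independent(lam, t):
    """Independent approximation (4.6): 2e^{-lt} - e^{-2lt}."""
    return 2.0*math.exp(-lam * t) - math.exp(-2.0*lam * t)

# The dependent model gives the higher (correct) reliability
print(r_standby(1.0, 1.0), r_independent(1.0, 1.0))
```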
Figure 4.15. Standby sparing reliability compared to simplex system reliability.
Suppose the probability that the switch successfully replaces the primary unit by a spare is $p$. Then, the probability that the switch fails to do so is $1 - p$. The state transition diagram under these assumptions is shown in Figure 4.16. The transition from state 1 is partitioned into two transitions: the failure rate is multiplied by $p$ to get the rate of the successful transition to state 2, and by $1 - p$ to get the rate of the switch failure.
Figure 4.16. State transition diagram of a standby system with one spare.
The state transition equations corresponding to the state transition diagram in Figure 4.16 are

d/dt P1(t) = −λ1 P1(t)

d/dt P2(t) = p λ1 P1(t) − λ2 P2(t)

d/dt P3(t) = λ2 P2(t) + (1 − p) λ1 P1(t).

Solving this system of equations, we get

P1(t) = e^{−λ1 t}

P2(t) = (p λ1 / (λ2 − λ1)) (e^{−λ1 t} − e^{−λ2 t})

P3(t) = 1 − (1 + p λ1 / (λ2 − λ1)) e^{−λ1 t} + (p λ1 / (λ2 − λ1)) e^{−λ2 t}.

As before, P3(t) corresponds to system failure. So, the reliability of the system is the sum of P1(t) and P2(t):

RSS(t) = e^{−λ1 t} + (p λ1 / (λ2 − λ1)) (e^{−λ1 t} − e^{−λ2 t}).
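As a quick numerical check of these formulas, the following sketch (the parameter values and the function name are ours, chosen for illustration) evaluates RSS(t) = P1(t) + P2(t) under the exponential failure law:

```python
import math

def standby_reliability(t, lam1, lam2, p):
    """R_SS(t) = P1(t) + P2(t) for a two-module standby system whose
    switch succeeds with probability p (requires lam1 != lam2)."""
    p1 = math.exp(-lam1 * t)
    p2 = (p * lam1 / (lam2 - lam1)) * (math.exp(-lam1 * t) - math.exp(-lam2 * t))
    return p1 + p2

t, lam1, lam2 = 100.0, 0.002, 0.001
simplex = math.exp(-lam1 * t)
# p = 0: the spare never takes over, so the system degrades to a simplex one.
print(standby_reliability(t, lam1, lam2, 0.0) == simplex)
# p = 1: perfect switching gives the highest reliability.
print(standby_reliability(t, lam1, lam2, 1.0) > simplex)
```

As p decreases from 1 to 0 the computed reliability falls monotonically to the simplex value, matching the behavior shown in Figure 4.17.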
Figure 4.17 compares the reliability of a standby sparing system for different
values of p. As p decreases, the reliability of the standby sparing system
decreases. When p reaches zero, the standby sparing system reliability reduces
to the reliability of a simplex system.
4.3 Pair-and-a-spare
The pair-and-a-spare technique combines the standby sparing and duplication with comparison approaches (Figure 4.18). The idea is similar to standby sparing; however, two modules instead of one are operated in parallel. As in the duplication with comparison case, the results are compared to detect disagreement. If an error signal is received from the comparator, the switch analyzes the report
Figure 4.17. Reliability of a standby sparing system for different values of p.
from the fault detection block and decides which of the two modules' outputs is faulty. The faulty module is removed from operation and replaced with a spare module.
Figure 4.18. Pair-and-a-spare redundancy.
A pair-and-a-spare system with n modules can tolerate n − 1 module faults. When the (n − 1)th fault occurs, it will be detected and located by the switch and the correct result will be passed to the system's output. However, since there will be no more spares available, the switch will not be able to replace the faulty module with a spare module. The system's configuration will be reduced to a simplex system with one module. So, the nth fault will not be detected.
5. Hybrid redundancy
The main idea of hybrid redundancy is to combine the attractive features of the passive and active approaches. Fault masking is used to prevent the system from producing momentary erroneous results. Fault detection, location and recovery are used to reconfigure the system after a fault occurs. In this section, we consider three basic techniques for hybrid redundancy: self-purging redundancy, N-modular redundancy with spares and triplex-duplex redundancy.
Figure 4.19. Self-purging redundancy.
A self-purging redundancy system with n modules can mask n − 2 module faults. When n − 2 modules are purged and only two are left, the system will be able to detect the next, (n − 1)th, fault, but, as in the duplication with comparison case, the voter will not be able to distinguish which of the results is the correct one.
To evaluate the reliability of a self-purging redundancy system, suppose that the voter and the switches are perfect, and that all the modules have the same reliability R1 = R2 = ... = Rn = R. Then the system is not reliable if all the modules have failed (probability (1 − R)^n), or if all but one of the modules have failed (probability R(1 − R)^{n−1}). Since there are n choices for one of n modules to remain operational, we get the equation

Rsystem = 1 − ((1 − R)^n + n R (1 − R)^{n−1}).
Figure 4.20 compares the reliability of self-purging redundancy systems with three, five and seven modules.
Figure 4.20. Reliability of a self-purging redundancy system with 3, 5 and 7 modules.
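The reliability expression above is easy to evaluate numerically; a minimal sketch (the module reliability R = 0.9 is an illustrative value, and the voter and switches are assumed perfect):

```python
def self_purging_reliability(R, n):
    """System fails only when all n modules fail, or when all but one
    fail: R_sys = 1 - ((1-R)^n + n*R*(1-R)^(n-1))."""
    return 1.0 - ((1.0 - R) ** n + n * R * (1.0 - R) ** (n - 1))

# For a reliable enough module, adding modules raises system reliability.
for n in (3, 5, 7):
    print(n, self_purging_reliability(0.9, n))
```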
N-modular redundancy with k spares is similar to self-purging redundancy with n + k modules, except that only n modules provide input to a majority voter (Figure 4.21). The additional k modules serve as spares. If one of the primary modules becomes faulty, the voter will mask the erroneous result and the switch will replace the faulty module with a spare one. Various techniques are used to identify faulty modules. One approach is to compare the output of the voter with the individual outputs of the modules, as shown in Figure 4.21. A module which disagrees with the majority is declared faulty.
The fault-tolerant capabilities of an N-modular redundancy system with k spares depend on the form of voting used as well as on the implementation of the switch and comparator. One possibility is that, after the spares are exhausted, the disagreement detector is switched off and the system continues working as a passive NMR system. Then, such a system can mask ⌊n/2⌋ + k faults, i.e. the number of faults an NMR system can mask plus the number of spares.
Figure 4.21. N-modular redundancy with spares.
Another possibility is that the disagreement detector remains on, but the voter is designed to be capable of adjusting to the decreasing number of inputs. In this case, the behavior of the system is similar to the behavior of a self-purging system with n + k modules, i.e. up to k + n − 2 module faults can be masked. Suppose the spares are exhausted after the first k faults, and the (k + 1)th fault occurs. As before, the erroneous result will be masked by the voter, the output of the voter will be compared to the individual outputs of the modules, and the faulty module will be removed from consideration. The difference is that it will not be replaced with a spare one; instead, the system will continue working as an (n − 1)-modular system. In general, when the (k + i)th fault occurs, the voter votes on n − i modules.
Figure 4.22. Triplex-duplex redundancy.
6. Problems
4.1. Explain the difference between passive, active and hybrid hardware redun-
dancy. Discuss the advantages and disadvantages of each approach.
4.2. Suppose that in the system shown in Figure 4.1 the two components have
the same cost and R1 = 0.75, R2 = 0.96. If it is permissible to add two
components to the system, would it be preferable to replace component 1 by
a three-component parallel system, or to replace components 1 and 2 each
by two-component parallel systems?
4.3. A disk drive has a constant failure rate and an MTTF of 5500 hr.
(a) What is the probability of failure for one year of operation?
(b) What is the probability of failure for one year of operation if two of the
drives are placed in parallel and the failures are independent?
4.4. Construct the Markov model of the TMR system with three voters shown in
Figure 4.6. Assume that the components are independent. The failure rate
of the modules is λm. The failure rate of the voters is λv. Derive and solve
the system of state transition equations representing this system. Compute
the reliability of the system.
4.5. Draw a logic diagram of a majority voter with 5 inputs.
4.6. Suppose the design life reliability of a standby system consisting of two
identical units should be at least 0.97. If the MTTF of each unit is 6 months,
determine the design life time. Assume that the failures are independent
and ignore switching failures.
4.7. An engineer designs a system consisting of two subsystems in series with
the reliabilities R1 = 0.99 and R2 = 0.85. The cost of the two subsystems
is approximately the same. The engineer decides to add two redundant
components. Which of the following is the best to do:
(a) Duplicate subsystems 1 and 2 in high-level redundancy (Figure 4.2(a)).
(b) Duplicate subsystems 1 and 2 in low-level redundancy (Figure 4.2(b)).
(c) Replace the second subsystem by a three-component parallel system.
4.8. A computer with MTTF of 3000 hr is to operate continuously on a 500 hr
mission.
(a) Compute the computer's mission reliability.
(b) Suppose two such computers are connected in a standby configuration.
If there are no switching failures and no failures of the backup computer
while in the standby mode, what is the system MTTF and the mission
reliability?
(c) What is the mission reliability if the probability of switching failure is
0.02?
4.9. A chemical process control system has a reliability of 0.97. Because reli-
ability is considered too low, a redundant system of the same design is to
be installed. The design engineer should choose between a parallel and a
standby configuration. How small must the probability of switching failure
be for the standby configuration to be more reliable than the parallel con-
figuration? Assume that there are no failures of the backup system while in
the standby mode.
4.10. Give examples of applications where you would recommend using cold
standby sparing and hot standby sparing (two examples each). Justify your
answer.
4.11. Compare the MTTF of a standby spare system with 3 modules and a pair-
and-a-spare system with 3 modules, provided the failure rate of a single
module is 0.01 failures per hour. Assume the modules obey the exponential
failure law. Ignore the switching failures and the dependency between the
module’s failures.
4.12. A basic non-redundant controller for a heart pacemaker consists of an analog
to digital (A/D) converter, a microprocessor and a digital to analog (D/A)
converter. Develop a design making the controller tolerant to any two com-
ponent faults (component here means A/D converter, microprocessor or D/A
converter). Show the block diagram of your design and explain why you
recommend it.
4.13. Construct the Markov model of a hybrid N-modular redundancy with 3 active
modules and one spare. Assume that the components are independent and
that the probability that the switch successfully replaces the failed module
by a spare is p.
4.14. How many faulty modules can you tolerate in:
(a) 5-modular passive redundancy?
(b) standby sparing redundancy with 5 modules?
(c) self-purging hybrid modular redundancy with 5 modules?
4.15. Design a switch for hybrid N-modular redundancy with 3 active modules
and 1 spare.
4.16. (a) Draw a diagram of standby sparing active hardware redundancy tech-
nique with 2 spares.
(b) Using Markov models, write an expression for the reliability of the
system shown in your diagram for
(a) perfect switching case,
(b) non-perfect switching case.
(c) Calculate the reliabilities for (a) and (b) after 1000 hrs for the failure
rate λ = 0.01 per 100 hours.
4.17. Which redundancy would you recommend to combine with self-purging
hybrid hardware redundancy to distinguish between transient and perma-
nent faults? Briefly describe what would be the main benefit of such a
combination.
4.18. Calculate the MTTF of a 5-modular hardware redundancy system, pro-
vided the failure rate of a single module is 0.001 failures per hour. Assume
the modules obey the exponential failure law. Compare the MTTF of the
5-modular redundancy system with the MTTF of a 3-modular hardware
redundancy system having the failure rate 0.01 failures per hour.
4.19. Draw a simplified Markov model for a 5-modular hardware redundancy scheme
with failure rate λ. Explain which state of the system each of the nodes in
your chain represents.
INFORMATION REDUNDANCY
The major difference between a thing that might go wrong and a thing that cannot possibly
go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns
out to be impossible to get at or repair.
—Douglas Adams, "Mostly Harmless"
1. Introduction
In this chapter we study how fault tolerance can be achieved by means of
encoding. Encoding is a powerful technique that allows us to ensure that infor-
mation has not been changed during storage or transmission. Attaching special
check bits to blocks of digital information enables special-purpose hardware
to detect and correct a number of communication and storage faults, such as
changes in single bits or changes to several adjacent bits. Parity code used
for memories in computer systems is a common example of an application of
encoding. Another example is communication protocols, providing a variety
of detection and correction options, including the encoding of large blocks of
data to withstand multiple faults and provisions for multiple retries in case error
correcting facilities cannot cope with faults.
Coding theory originated in the late 1940s with two seminal works by
Hamming and Shannon. Hamming, working at Bell Laboratories in the USA,
was studying possibilities for protecting storage devices from the corruption of
a small number of bits by a code which would be more efficient than simple
repetition. He realized the need to consider sets of words, or codewords, where
every pair differs in a large number of bit positions. Hamming defined the notion
of distance between two words and observed this was a metric, thus leading to
interesting properties. This distance is now called Hamming distance. His first
attempt produced a code in which four data bits were followed by three check
bits which allowed not only the detection but the correction of a single error.
The repetition code would require nine check bits to achieve this. Hamming
published his results in 1950.
Slightly prior to Hamming’s publication, in 1948, Shannon, also at Bell Labs,
wrote an article formulating the mathematics behind the theory of communi-
cation. In this article, he developed probability and statistics to formalize the
notion of information. Then, he applied this notion to study how a sender can
communicate efficiently over different media, or more generally, channels of
communication to a receiver. The channels under consideration were of two
different types: noiseless or noisy. In the former case, the goal is to compress
the information at the sender’s end and to minimize the total number of symbols
communicated while allowing the receiver to recover transmitted information
correctly. The latter case, which is more important to the topic of this book,
considers a channel that alters the signal being sent by adding noise to it. The
goal in this case is to add some redundancy to the message being sent so that a
few erroneous symbols at the receiver’s end still allow the receiver to recover the
sender's intended message. Shannon's work showed, somewhat surprisingly,
that the same underlying notions captured the rate at which one could communicate
over either class of channels. Shannon’s methods involved encoding messages
using long random strings, and the theory relied on the fact that long messages
chosen at random tend to be far away from each other. Shannon had shown that
it was possible to encode messages in such a way that the number of extra bits
transmitted was as small as possible.
Although Shannon's and Hamming's works were chronologically and technically
intertwined, each researcher seems to have regarded the other's work as far from
his own. Shannon's papers never explicitly refer to distance in his main
technical results. Hamming, in turn, does not mention the applicability of his
results to reliable computing. Both works, however, were immediately seen
to be of monumental impact. Shannon’s results started driving the theory of
communication and storage of information. This, in turn, became the primary
motivation for much research in the theory of error-correcting codes.
The value of error-correcting codes for transmitting information became
immediately apparent. A wide variety of codes were constructed, achieving
both economy of transmission and error-correction capacity. Between 1969
and 1973 the NASA Mariner probes used a powerful Reed-Muller code capable
of correcting 7 errors out of 32 bits transmitted. The codewords consisted of 6
data bits and 26 check bits. The data was sent to Earth at a rate of over 16,000
bits per second.
Another application of error-correcting codes came with the development of
the compact disk (CD). To guard against scratches, cracks and similar damage
two "interleaved" codes which can correct up to 4,000 consecutive errors (about
2.5 mm of track) are used.
2. Fundamental notions
In this section, we introduce the basic notions of coding theory. We assume
that our data is in the form of strings of binary bits, 0 or 1. We also assume
that the errors occur randomly and independently from each other, but at a
predictable overall rate.
2.1 Code
A binary code of length n is a set of binary n-tuples satisfying some well-defined
set of rules. For example, an even parity code contains all n-tuples that
have an even number of 1s. The set B^n = {0, 1}^n of all possible 2^n binary
n-tuples is called the codespace.
A codeword is an element of the codespace satisfying the rules of the code.
To make error detection and error correction possible, the codewords are chosen
to be a nonempty subset of all possible 2^n binary n-tuples. For example, a
parity code of length n has 2^{n−1} codewords, which is one half of all possible 2^n
n-tuples. An n-tuple not satisfying the rules of the code is called a word.
The number of codewords in a code C is called the size of C, denoted by |C|.
2.2 Encoding
Encoding is the process of computing a codeword for given data. An
encoder takes a binary k-tuple representing the data and converts it to a codeword
using the rules of the code. For example, to compute a codeword for an even
parity code, the parity of the data is first determined. If the parity is odd, a
1-bit is attached to the end of the k-tuple. Otherwise, a 0-bit is attached. The
difference n − k between the length n of the codeword and the length k of the
data gives the number of check bits which must be added to the data to do the
encoding. A separable code is a code in which the check bits can be clearly separated
from the data bits. The parity code is an example of a separable code. A non-separable
code is a code in which the check bits cannot be separated from the data bits. The cyclic
code is an example of a non-separable code.
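Encoding and decoding for the even parity code can be sketched in a few lines (the bit-list representation and function names are ours, chosen for illustration):

```python
def parity_encode(data):
    """Even parity code: append a check bit that makes the total
    number of 1s even (so n = k + 1)."""
    return data + [sum(data) % 2]

def parity_decode(codeword):
    """The code is separable: decoding simply truncates the check bit."""
    return codeword[:-1]

c = parity_encode([1, 0, 1, 1])   # data has odd parity, check bit 1 appended
assert sum(c) % 2 == 0            # every codeword has even parity
assert parity_decode(c) == [1, 0, 1, 1]
```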
2.3 Information rate
To encode 2^k data words, a code C should contain at least 2^k codewords,
since any data word should be assigned its own individual codeword
from C. Vice versa, a code of size |C| encodes data of length k = ⌊log2 |C|⌋
bits. The ratio k/n is called the information rate of the code. The information
rate determines the redundancy of the code. For example, a repetition code
obtained by repeating the data three times has the information rate 1/3. Only
one out of three bits carries the message; the other two are redundant.
2.4 Decoding
Decoding is the process of restoring the data encoded in a given codeword. A
decoder reads a codeword and recovers the original data using the rules of the
code. For example, a decoder for a parity code truncates the codeword by one
bit.
Suppose that an error has occurred and a non-codeword is received by a
decoder. A usual assumption in coding theory is that a pattern of errors that
involves a small number of bits is more likely to occur than any pattern that
involves a large number of bits. Therefore, to perform decoding, we search for
a codeword which is “closest” to the received word. Such a technique is called
maximum likelihood decoding. As a measure of distance between two binary
n-tuples x and y we use Hamming distance.
2.5 Hamming distance
The Hamming distance between two binary n-tuples x and y, denoted by δ(x, y), is the number of bit positions in which x and y differ. Hamming distance is a metric, i.e. it satisfies the following properties:
1 δ(x, y) ≥ 0, with δ(x, y) = 0 if and only if x = y.
2 δ(x, y) = δ(y, x).
3 δ(x, y) + δ(y, z) ≥ δ(x, z).
Consider the codespace B^3 presented by a three-dimensional cube shown in Figure 5.1. The codewords {000, 011, 101, 110} are marked with large solid dots. Adjacent vertices differ by a single bit.
Figure 5.1. Code {000, 011, 101, 110} in the codespace B^3.
It is easy to see that Hamming distance satisfies the metric properties listed above, e.g. δ(000, 011) + δ(011, 111) = 2 + 1 = 3 = δ(000, 111).
2.6 Code distance
The code distance of a code is the minimum Hamming distance between any two distinct codewords of the code. For example, the code distance of a parity
code equals two. Code distance determines the error detecting and correcting
capabilities of a code. For instance, consider the code {000, 011, 101, 110} in
Figure 5.1. The code distance of this code is two. Any one-bit error in any
codeword produces a word lying at distance one from the affected codeword.
Since all codewords are at distance two from each other, the error will be
detected.
As another example, consider the code {000, 111} shown in Figure 5.2.
Figure 5.2. Code {000, 111} in the codespace B^3.
The
codewords are marked with large solid dots. Suppose an error occurred in the
first bit of the codeword 000. The resulting word 100 is on distance one from
000 and on distance two from 111. Thus, we correct 100 to the codeword 000,
which is closest to 100 according to the Hamming distance.
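Maximum likelihood decoding as described above amounts to a nearest-codeword search; a small sketch (the string representation and function names are ours):

```python
def hamming_distance(x, y):
    """Number of bit positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def ml_decode(word, code):
    """Maximum likelihood decoding: return the codeword closest to
    the received word in Hamming distance."""
    return min(code, key=lambda c: hamming_distance(c, word))

# The word 100 is at distance 1 from 000 and distance 2 from 111,
# so it is corrected to 000.
print(ml_decode("100", ["000", "111"]))   # prints 000
```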
The code {000, 111} is a replication code, obtained by repeating the data three
times. Only one bit of a codeword carries the data; the other two are redundant.
By its nature, this redundancy is similar to TMR, but it is implemented in the
information domain. In TMR, the voter compares the output values of the modules.
In a replication code, a decoder analyzes the bits of the received word. In both
cases, the majority of the bit values determines the decision.
In general, to be able to correct ε-bit errors, a code should have a code
distance of at least 2ε + 1. To be able to detect ε-bit errors, the code distance
should be at least ε + 1.
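The code distance, and hence the detection and correction capabilities, of a small code can be computed by exhaustive pairwise comparison (a sketch; the function name is ours):

```python
from itertools import combinations

def code_distance(code):
    """Minimum Hamming distance between any two distinct codewords."""
    return min(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(code, 2))

for code in (["000", "011", "101", "110"], ["000", "111"]):
    d = code_distance(code)
    # A code of distance d detects d-1 and corrects (d-1)//2 bit errors.
    print(code, d, "detects", d - 1, "corrects", (d - 1) // 2)
```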
When choosing a code, the following objectives are considered:
1 The number of bit errors the code can detect or correct, reflecting the fault tolerance
capabilities of the code.
2 The information rate k/n, reflecting the amount of information redundancy added.
3 The complexity of the encoding and decoding schemes.
The first item in the list above is the most important. Ideally, we would like
to have a code that is capable of correcting all errors. The second objective is
an efficiency issue. We would rather not waste resources by exchanging data
at a very low rate. Easy encoding and decoding schemes are likely to have a
simple implementation in either hardware or software. They are also desirable
for efficiency reasons. In general, the more errors that a code needs to correct
per message digit, the less efficient the communication and usually the more
complicated the encoding and decoding schemes. A good code balances these
objectives.
3. Parity codes
Parity codes are the oldest family of codes. They were used to detect
errors in the calculations of the relay-based computers of the late 1940s.
The even (odd) parity code of length n is composed of all binary n-tuples that
contain an even (odd) number of 1s. Any subset of n − 1 bits of a codeword
can be viewed as data bits carrying the information, while the remaining nth bit
checks the parity of the codeword. Any single-bit error can be detected, since
the parity of the affected n-tuple will be odd (even) rather than even (odd). It is
not possible to locate the position of the erroneous bit, and thus it is not possible
to correct it.
Figure 5.3. A memory protected by a parity code; PG = parity generator; PC = parity checker.
Before being written into a memory, the data is encoded by computing its
parity. In most computer systems, one parity bit per byte (8 bits) of data is
computed. The generation of parity bits is done by a parity generator (PG)
implemented as a tree of exclusive-OR (XOR) gates. Figure 5.4 shows a logic
diagram of an even parity generator for 4-bit data (d0, d1, d2, d3).
Figure 5.4. Logic diagram of a parity generator for 4-bit data (d0, d1, d2, d3).
When data is written into memory, parity bits are written along with the
corresponding bytes of data. For example, for a 32-bit word size, four parity
bits are attached to the data and a 36-bit codeword is stored in the memory. Some
systems, like the Pentium processor, have a 64-bit wide memory data path. In this
case, eight parity bits are attached to the data, and the resulting codeword is 72 bits long.
When the data is read back from the memory, parity bits are re-computed
and the result is compared to the previously stored parity bits. Re-computation
of parity is done by a parity checker (PC). Figure 5.5 shows a logic diagram
of an even parity checker for 4-bit data (d0, d1, d2, d3). The logic diagram is
similar to the one of a parity generator, except that one more XOR gate is added
to compare the re-computed parity bit to the previously stored parity bit. If the
parity bits disagree, the output of the XOR gate is 1. Otherwise, the output is
0.
Figure 5.5. Logic diagram of a parity checker for 4-bit data (d0, d1, d2, d3).
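The XOR trees of Figures 5.4 and 5.5 can be mirrored in software; a sketch assuming even parity and 4-bit data (function names are ours):

```python
from functools import reduce
from operator import xor

def parity_generator(d):
    """Even parity generator (as in Figure 5.4): XOR of all data bits."""
    return reduce(xor, d)

def parity_checker(d, stored):
    """Parity checker (as in Figure 5.5): re-compute the parity and
    XOR it with the stored bit; an output of 1 signals an error."""
    return parity_generator(d) ^ stored

d = [0, 1, 1, 0]
p = parity_generator(d)
assert parity_checker(d, p) == 0   # parity bits agree: no error
d[2] ^= 1                          # inject a single-bit error
assert parity_checker(d, p) == 1   # error detected
```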
Any computed parity bit that does not match the stored parity bit indicates
that there was at least one bit error in the corresponding byte of data, or in the
parity bit itself. An error signal, called a non-maskable interrupt, is sent to the
CPU to indicate that the memory data is not valid and to instruct the processor
to immediately halt.
All operations related to the error-detection (encoding, decoding, compari-
son) are done by memory control logic on the mother-board, in the chip set, or,
for some systems, in the CPU. The memory itself only stores the parity bits,
just as it stores the data bits. Therefore, parity checking does not slow down
the operation of the memory. The parity bit generation and checking are done
in parallel with the writing and reading of the memory using logic which
is much faster than the memory itself. Nothing in the system waits for a "no
error" signal from the parity checker. The system only raises an interrupt
when it finds an error.
Example 5.1. Suppose the data which is written in the memory is [0110110]
and odd-parity code is used. Then the check bit 1 is stored along with the data,
to make the overall parity odd. Suppose that the data read out of the memory
is [01111100]. The re-computed parity is 0. Because the re-computed parity
disagrees with the stored parity, we know that an error has occurred.
A parity code can only detect errors affecting an odd number of bits. If an
even number of bits is affected, the computed parity matches the stored parity,
and the erroneous data is accepted with no error notification, possibly causing
mysterious problems later. Studies have shown that approximately 98%
of all memory errors are single-bit errors. Thus, protecting a memory by a
parity code is an inexpensive and efficient technique. For example, a 1 GByte
dynamic random access memory (DRAM) with a parity code has a failure rate
of 0.7 failures per year. If the same memory uses a single-error correction
double-error detection Hamming code, requiring 7 check bits in a 32-bit wide
memory system, then the failure rate reduces to 0.03 failures per year. An error
correcting memory is typically slower than a non-correcting one, due to the
error correcting circuitry. Depending on the application, 0.7 failures per year
may be viewed as an acceptable level of risk, or not.
A modification of the parity code is the horizontal and vertical parity code, which
arranges the data in a two-dimensional array and adds one parity bit on each row
and one parity bit on each column. Such a technique is useful for correcting
single-bit errors within a block of data words; however, it may fail to correct
multiple errors.
4. Linear codes
Linear codes provide a general framework for generating many codes, includ-
ing Hamming code. The discussion of linear codes requires some knowledge
of linear algebra, which we first briefly review.
4.1 Basic notions
Field Z2. A field Z2 is the set {0, 1} together with two binary operations, addition
"+" and multiplication "·", such that the following properties are satisfied for
all a, b, c ∈ Z2:
1 Z2 is closed under "+" and "·", meaning that a + b ∈ Z2 and a · b ∈ Z2.
2 a + (b + c) = (a + b) + c.
3 a + b = b + a.
4 There exists an element 0 in Z2 such that a + 0 = a.
5 For each a ∈ Z2, there exists an element −a ∈ Z2 such that a + (−a) = 0.
6 a · (b · c) = (a · b) · c.
7 a · b = b · a.
8 There exists an element 1 in Z2 such that a · 1 = a.
9 For each a ∈ Z2 such that a ≠ 0, there exists an element a^{−1} ∈ Z2 such that a · a^{−1} = 1.
10 a · (b + c) = a · b + a · c.
It is easy to see that the above properties are satisfied if "+" is defined as addition
modulo 2, "·" is defined as multiplication modulo 2, 0 = 0 and 1 = 1. Throughout
the chapter, we assume this definition of Z2.
Vector space Vn. Let Z2^n denote the set of all n-tuples containing elements
from Z2. For example, for n = 3, Z2^3 = {000, 001, 010, 011, 100, 101, 110, 111}.
A vector space Vn over a field Z2 is a subset of Z2^n, with two operations, addition
"+" and multiplication "·", such that the following axioms are satisfied for all
x, y, z ∈ Vn and all a ∈ Z2:
1 Vn is closed under "+", meaning that x + y ∈ Vn.
2 x + y = y + x.
3 x + (y + z) = (x + y) + z.
4 There exists an element 0 in Vn such that x + 0 = x.
5 For each x ∈ Vn, there exists an element −x ∈ Vn such that x + (−x) = 0.
6 a · x ∈ Vn.
A linear combination of k vectors v0, v1, ..., v_{k−1} ∈ Vn is an expression of the form
a0 v0 + a1 v1 + ... + a_{k−1} v_{k−1},
where a0, a1, ..., a_{k−1} ∈ Z2. The span of the vectors v0, v1, ..., v_{k−1} is the set of
all their linear combinations. A set of vectors v0, v1, ..., v_{k−1} is linearly independent
if no vector in the set can be written as a linear combination of the others. If a set
of vectors is not linearly independent, then they are linearly dependent.
A basis is a set of vectors in a vector space V n that are linearly independent
and span V n .
The dimension of a vector space is defined to be the number of vectors in its
basis.
4.2 Definition of linear code
An (n, k) linear code is a k-dimensional subspace of the codespace Vn, i.e. it is spanned
by k linearly independent vectors. All codewords can be written as a linear
combination of the k basis vectors v0, v1, ..., v_{k−1} as follows:
c = d0 v0 + d1 v1 + ... + d_{k−1} v_{k−1},
where d = (d0, d1, ..., d_{k−1}) is the data word being encoded.
Since a different codeword is obtained for each different combination of coefficients,
an (n, k) linear code can encode 2^k data words.
Example 5.2. As an example, let us construct a (4, 2) linear code.
The data we are encoding are the 2-bit words {[00], [01], [10], [11]}. These words
need to be encoded so that the resulting 4-bit codewords form a two-dimensional
subspace of V4. To do this, we have to select two linearly independent vectors
for a basis of the two-dimensional subspace. One possibility is to choose the
vectors v0 = [1000] and v1 = [0110]. They are linearly independent since neither
is a multiple of the other.
To find the codeword c corresponding to the data word d = (d0, d1), we
compute the linear combination of the two basis vectors v0, v1 as c = d0 v0 + d1 v1.
Thus, the data word d = [11] is encoded to c = 1 · [1000] + 1 · [0110] = [1110].
The basis vectors, taken as rows, form the generator matrix of the code:

G = [ 1 0 0 0
      0 1 1 0 ] .   (5.1)

The codeword c is the product of the data word d and the generator matrix G:
c = dG.
Note that in Example 5.2 the first two bits of each codeword are exactly
the same as the data bits, i.e. the code we have constructed is a separable code.
Separable linear codes are easy to decode by truncating the last n − k check bits.
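Encoding with a generator matrix is a vector–matrix product with arithmetic modulo 2. A sketch using the (4, 2) code of Example 5.2 (the function name is ours):

```python
def encode(d, G):
    """c = dG over Z2: each codeword bit is the modulo-2 dot product
    of the data word with the corresponding column of G."""
    n = len(G[0])
    return [sum(d[i] * G[i][j] for i in range(len(d))) % 2
            for j in range(n)]

G = [[1, 0, 0, 0],   # v0 = [1000]
     [0, 1, 1, 0]]   # v1 = [0110]
print(encode([1, 1], G))   # the data word [11] -> codeword [1, 1, 1, 0]
```

Since the code is separable, the first two bits of the result are the data word itself.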
In general, separability can be achieved by ensuring that the basis vectors form a
generator matrix of the form [Ik A], where Ik is an identity matrix of size k × k
and A is a k × (n − k) matrix.
Note also that the code distance of the code in Example 5.2 is one.
Therefore, such a code cannot detect errors. It is possible to predict the code
distance by examining the basis vectors. Consider the vector v0 = [1000] from
Example 5.2. Since it is a basis vector, it is a codeword. A code is a subspace of
Vn and thus is itself a vector space. A vector space is closed under addition, thus
c + [1000] also belongs to the code, for any codeword c. The distance between
c and c + [1000] is 1, since they differ only in the first bit position. Therefore, a
code with a code distance δ should have basis vectors of weight (number of 1s)
greater than or equal to δ.
Example 5.3. Consider the (6, 3) linear code spanned by the basis vectors
[100011], [010110] and [001101]. The generator matrix for this code is

G = [ 1 0 0 0 1 1
      0 1 0 1 1 0
      0 0 1 1 0 1 ] .   (5.2)

For example, the data word d = [011] is encoded to
c = 0 · [100011] + 1 · [010110] + 1 · [001101] = [011011].
Recall that "+" is addition modulo 2, so 1 + 1 = 0. Similarly, we can encode the other
data words, obtaining the code shown in the following table:
data codeword
d1 d2 d3 c1 c2 c3 c4 c5 c6
0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 1 0 1
0 1 0 0 1 0 1 1 0
0 1 1 0 1 1 0 1 1
1 0 0 1 0 0 0 1 1
1 0 1 1 0 1 1 1 0
1 1 0 1 1 0 1 0 1
1 1 1 1 1 1 0 0 0
4.4 Parity check matrix
With every (n, k) linear code we can associate an (n − k) × n parity check matrix H,
which is used to check the correctness of the codewords.
The matrix H has the property that, for any codeword c,
H c^T = 0.
By c^T we denote the transpose of the vector c. Recall that the
transpose of an n × k matrix A is the k × n matrix obtained by defining the ith
column of A^T to be the ith row of A.
The parity check matrix is related to the generator matrix by the equation
H G^T = 0.
If the generator matrix has the form G = [Ik A], then the parity check matrix can be
computed as
H = [A^T I_{n−k}],
since then
H G^T = [A^T I_{n−k}] [Ik A]^T = A^T Ik + I_{n−k} A^T = A^T + A^T = 0.
Example 5.4. For the generator matrix (5.2) of Example 5.3, we have

A = [ 0 1 1
      1 1 0
      1 0 1 ] .

So, we get

H = [A^T I3] = [ 0 1 1 1 0 0
                 1 1 0 0 1 0
                 1 0 1 0 0 1 ] .   (5.3)
4.5 Syndrome
Encoded data can be checked for errors by multiplying it by the parity check
matrix:
s = H c^T.   (5.4)
The resulting (n − k)-bit vector s is called the syndrome. If the syndrome is zero, no
error has occurred. If s matches one of the columns of H, then a single-bit error
has occurred. The bit position of the error corresponds to the position of the
matching column in H. For example, if the syndrome coincides with the second
column of H, the error is in the second bit of the codeword. If the syndrome is
not zero and is not equal to any of the columns of H, then a multiple-bit error
has occurred.
Example 5.5. As an example, consider the data d = [110] encoded using
the (6,3) linear code from Example 5.3 as c = dG = [110101]. Suppose
that an error occurs in the second bit of c, transforming it to [100101]. By
multiplying this word by the parity check matrix (5.3), we obtain the syndrome
s = [110]. The syndrome matches the second column of the parity check matrix
H, indicating that the error has occurred in the second bit.
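Syndrome computation s = Hc^T can be sketched the same way (an illustrative sketch; H is the matrix (5.3)):

```python
# Syndrome computation for the (6,3) code, reproducing Example 5.5.
H = [
    [0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def syndrome(c, H):
    """Compute s = H c^T over GF(2)."""
    return [sum(h * b for h, b in zip(row, c)) % 2 for row in H]

c = [1, 1, 0, 1, 0, 1]          # codeword for d = [110]
assert syndrome(c, H) == [0, 0, 0]

c_err = [1, 0, 0, 1, 0, 1]      # error injected in the second bit
print(syndrome(c_err, H))       # [1, 1, 0] -- the second column of H
```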
The code distance of a linear code can be determined from its parity check matrix rather than from
the corresponding generator matrix. It can be proved that a code has distance of at
least δ if and only if every subset of δ − 1 columns of its parity check matrix
H is linearly independent. So, to have a code distance of two, we must ensure
that every column of the parity check matrix is linearly independent. This is
equivalent to the requirement of not having a zero column, since the zero vector
can never be a member of a set of linearly independent vectors.
Example 5.6. In the parity check matrix (5.1) of the code which we have
constructed in Example 5.2, the first column is zero:

H = [ 0 1 1 0
      0 0 0 1 ]

Therefore, the columns of H are linearly dependent and the code distance is one.
Let us modify H to construct a new code with a code distance of at least
two. Suppose we replace the zero column by a column containing 1 in all its
entries:

H = [ 1 1 1 0
      1 0 0 1 ]

This matrix is in the form [A^T I_2]. So, now A is

A = [ 1 1
      1 0 ]

and therefore G can be constructed as

G = [ I_2 A ] = [ 1 0 1 1
                  0 1 1 0 ]
Using this generator matrix, the data words are encoded as dG, resulting in the
code shown in Table 5.2.
The code distance of the resulting (4,2) code is two. So, this code can be
used to detect single-bit errors.
data codeword
d1 d2 c1 c2 c3 c4
0 0 0 0 0 0
0 1 0 1 1 0
1 0 1 0 1 1
1 1 1 1 0 1
Example 5.7. Let us construct a code with a minimum code distance three,
capable of correcting single-bit errors. We apply an approach similar to the one
in Example 5.6. First, we create a parity check matrix in the form [A^T I_{n-k}], such
that every pair of its columns is linearly independent. This can be achieved by
ensuring that each column is non-zero and no column is repeated.
For example, if our goal is to construct a (3,1) code, then the following
matrix

H = [ 1 1 0
      1 0 1 ]

has all its columns non-zero and no column repeated. The matrix A^T
in this case is

A^T = [ 1
        1 ]

So, A = [1 1] and therefore G is

G = [ 1 1 1 ].

The resulting (3,1) code consists of two codewords, 000 and 111.
4.7 Hamming codes
Hamming codes were among the first linear codes developed for error correction, and these
codes remain important to this day.
Consider the following parity check matrix, corresponding to a (7,4) Ham-
ming code:

H = [ 1 1 0 1 1 0 0
      1 0 1 1 0 1 0      (5.5)
      1 1 1 0 0 0 1 ]

H has n = 7 columns of length n − k = 3. Note that 7 = 2^3 − 1 = 2^(n-k) − 1, so the
columns of H represent all possible non-zero vectors of length 3. In general, the
parity check matrix of an (n,k) Hamming code is constructed as follows. For a given
n − k, construct a binary (n − k) × (2^(n-k) − 1) matrix H such that each non-zero
binary (n − k)-tuple occurs exactly once as a column of H. Any code with such a
check matrix is called a binary Hamming code. The code distance of any binary
Hamming code is 3, so a Hamming code is a single-error correcting code.
If the columns of H are permuted, the resulting code remains a Hamming
code, since the new check matrix is still a set of all possible non-zero (n − k)-tuples.
Different parity check matrices can be selected to suit different purposes. For
example, by permuting the columns of the matrix (5.5), we can get the following
matrix:

H = [ 0 0 0 1 1 1 1
      0 1 1 0 0 1 1      (5.6)
      1 0 1 0 1 0 1 ]

This matrix is a parity check matrix for a different (7,4) Hamming code.
Note that its column i contains the binary representation of the integer
i ∈ {1, 2, ..., 2^(n-k) − 1}. A check matrix satisfying this property is called lexicographic.
For a code with a lexicographic check matrix, error correction is simpler than the general procedure described
in Section 4.5. To check a codeword x for errors, we first calculate the syn-
drome s = Hx^T. If s is zero, then no error has occurred. If s is not zero, then
it is the binary representation of some integer i ∈ {1, 2, ..., 2^(n-k) − 1}. Then, x is
decoded assuming that a single error has occurred in the ith bit of x.
Example 5.8. Let us construct the (7,4) Hamming code corresponding to the
parity check matrix (5.5). H is in the form [A^T I_3], where

A^T = [ 1 1 0 1
        1 0 1 1
        1 1 1 0 ]

The generator matrix is of the form G = [I_4 A], or

G = [ 1 0 0 0 1 1 1
      0 1 0 0 1 0 1
      0 0 1 0 0 1 1
      0 0 0 1 1 1 0 ]

Suppose the data to be encoded is d = [1110]. We multiply d by G to get the
codeword c = [1110001]. Suppose that an error occurs in the last bit of c, trans-
forming it to [1110000]. Before decoding this word, we first check it for errors
by multiplying it by the parity check matrix (5.5). The resulting syndrome
s = [001] matches the last column of H, indicating that the error has occurred
in the last bit. So, we correct [1110000] to [1110001] and then decode it by
taking the first four bits as the data d = [1110].
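The encode/correct cycle of a (7,4) Hamming code can be sketched as follows (an illustrative sketch; G and H are the matrices of Example 5.8, and error location works by matching the syndrome to a column of H, as in the text):

```python
# Single-error correction with the (7,4) Hamming code of Example 5.8.
G = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 0],
]
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def encode(d):
    return [sum(d[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def correct(x):
    """Correct a single-bit error by matching the syndrome to a column of H."""
    s = [sum(row[j] * x[j] for j in range(7)) % 2 for row in H]
    if any(s):
        cols = [[row[j] for row in H] for j in range(7)]
        x = x[:]
        x[cols.index(s)] ^= 1     # flip the bit at the matching column
    return x

c = encode([1, 1, 1, 0])          # [1, 1, 1, 0, 0, 0, 1]
r = c[:]; r[6] ^= 1               # error in the last bit
assert correct(r) == c
```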
Example 5.9. The generator matrix corresponding to the lexicographic parity
check matrix (5.6) is

G = [ 0 1 0 1 0 1 0
      0 0 1 1 0 0 1
      0 0 0 1 1 1 1
      1 0 0 0 0 1 1 ]

So, the data d = [d0 d1 d2 d3] is encoded as [d3 d0 d1 p1 d2 p2 p3], where p1, p2, p3 are
parity check bits defined by p1 = d0 + d1 + d2, p2 = d0 + d2 + d3 and p3 = d1 + d2 + d3. The addition is modulo 2.
The information rate of a (7,4) Hamming code is k/n = 4/7. In general, the
rate of an (n,k) Hamming code is k/(2^(n-k) − 1).
Hamming codes are widely used for DRAM error correction. Encoding is usually
performed on complete words, rather than individual bytes. As in the parity
code case, when a word is written into a memory, the check bits are computed by
a check bits generator. For instance, for the (7,4) Hamming code from Example
5.9, the check bits are computed as p1 = d0 + d1 + d2, p2 = d0 + d2 + d3 and
p3 = d1 + d2 + d3, using a tree of XOR gates.
When the word is read back, the check bits are recomputed and the syndrome
is generated by taking the XOR of the read and recomputed check bits. If
the syndrome is zero, no error has occurred. If the syndrome is non-zero, it
is used to locate the faulty bit by comparing it to the columns of H. This
can be implemented either in hardware or in software. If an (n,k) Hamming
code with a lexicographic parity check matrix is used, then the error correction
can be implemented using a decoder and XOR gates. If the syndrome s = i,
i ∈ {1, 2, ..., 2^(n-k) − 1}, the ith bit of the word is faulty. An example of error
correction for the (7,4) Hamming code from Example 5.8 is shown in Figure 5.6.
The first level of XOR gates compares the read check bits p_r with the recomputed ones
p. The result of this comparison is the syndrome s = [s0 s1 s2], which is fed into the
decoder. For the syndrome s = i, i ∈ {0, 1, ..., 7}, the
ith output of the decoder is high. The second level of XOR gates complements
the ith bit of the word, thus correcting the error.
Figure 5.6. Error-correcting circuit for a (7,4) Hamming code.
Often, an extended Hamming code rather than a regular Hamming code is used,
which allows not only single-bit errors to be corrected, but also
double-bit errors to be detected. We describe this code in the next section.
4.8 Extended Hamming codes
An extended Hamming code can correct single-bit
errors and detect double-bit errors.
The parity check matrix for an extended (n,k) Hamming code can be obtained
by first adding a zero column in front of a lexicographic parity check matrix of
an (n,k) Hamming code, and then attaching a row consisting of all 1's as
the (n − k + 1)th row of the resulting matrix. For example, the matrix H for an
extended (1,1) Hamming code is given by

H = [ 0 1
      1 1 ]

The matrix H for an extended (3,1) Hamming code is given by

H = [ 0 0 1 1
      0 1 0 1
      1 1 1 1 ]
The matrix H for an extended (7,4) Hamming code is given by

H = [ 0 0 0 0 1 1 1 1
      0 0 1 1 0 0 1 1
      0 1 0 1 0 1 0 1
      1 1 1 1 1 1 1 1 ]
5. Cyclic codes
Cyclic codes are a special class of linear codes. Cyclic codes are used in
applications where burst errors can occur, in which a group of adjacent bits is
affected. Such errors are typical in digital communication as well as in storage
devices, such as disks and tapes. A scratch on a compact disc is one example of
a burst error. Two important classes of cyclic codes which we will consider
are cyclic redundancy check (CRC) codes, used in modems and network pro-
tocols and Reed-Solomon codes, applied in satellite communication, wireless
communication, compact disk players and DVDs.
5.1 Definition
A linear code is called cyclic if [c_(n-1) c_0 c_1 c_2 ... c_(n-2)] is a codeword whenever
[c_0 c_1 c_2 ... c_(n-2) c_(n-1)] is a codeword. So, any end-around shift of a codeword
of a cyclic code produces another codeword.
When working with cyclic codes, it is convenient to think of words as poly-
nomials rather than vectors. For a binary cyclic code, the coefficients of the
polynomials are 0 or 1. For example, a data word [d_0 d_1 d_2 ... d_(k-1) d_k] is repre-
sented as the polynomial

d_0·x^0 + d_1·x^1 + d_2·x^2 + ... + d_(k-1)·x^(k-1) + d_k·x^k

where addition and multiplication are in the field Z_2, i.e. modulo 2. The degree
of a polynomial equals its highest exponent. For example, the word [1011]
corresponds to the polynomial 1·x^0 + 0·x^1 + 1·x^2 + 1·x^3 = 1 + x^2 + x^3 (least
significant bit on the left). The degree of this polynomial is 3.
Before continuing with cyclic codes, we first review the basics of polynomial
arithmetic necessary for understanding the encoding and decoding algorithms.
Polynomials with coefficients in Z_2 are added and multiplied in the usual way,
with the coefficients reduced modulo 2. For example,

(1 + x + x^3)·(1 + x^2) = 1 + x + x^2 + x^3 + x^3 + x^5 = 1 + x + x^2 + x^5

since x^3 + x^3 = 0. Polynomials can also be divided, using long division. For
example, dividing 1 + x^3 + x^5 + x^6 by 1 + x + x^3 yields the quotient
1 + x + x^2 + x^3 and the remainder 0. So, 1 + x + x^3 divides 1 + x^3 + x^5 + x^6
without a remainder, and the result is 1 + x + x^2 + x^3.
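Both operations can be sketched with polynomials held as integer bit masks, bit i being the coefficient of x^i (an illustrative sketch, not from the text):

```python
# Polynomial arithmetic over Z2, reproducing the worked examples above.

def pmul(a, b):
    """Multiply two polynomials over Z2 (carry-less multiplication)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pdivmod(a, b):
    """Long division of a by b over Z2; returns (quotient, remainder)."""
    q = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        q |= 1 << shift
        a ^= b << shift
    return q, a

# (1 + x + x^3)(1 + x^2) = 1 + x + x^2 + x^5
assert pmul(0b1011, 0b101) == 0b100111
# (1 + x^3 + x^5 + x^6) / (1 + x + x^3) = 1 + x + x^2 + x^3, remainder 0
assert pdivmod(0b1101001, 0b1011) == (0b1111, 0)
```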
A polynomial f(x) can also be reduced modulo a polynomial p(x). To find f(x) mod p(x), we divide
f(x) by p(x) and take the remainder. For example, let us compute
(1 + x^3) mod (1 + x + x^3). Dividing 1 + x^3 by 1 + x + x^3 gives the quotient 1
and the remainder x, so

(1 + x^3) mod (1 + x + x^3) = x.
In a cyclic code, a data word is encoded by multiplying its polynomial by a fixed generator
polynomial g(x); the choice of the generator polynomial determines the properties of the resulting cyclic code. For example,
suppose we encode the data [1011] using the generator polynomial g(x) =
1 + x + x^3 (least significant bit on the left). The polynomial representing the
data is d(x) = 1 + x^2 + x^3. By computing g(x)·d(x), we get 1 + x + x^2 + x^3 +
x^4 + x^5 + x^6. So, the codeword corresponding to the data [1011] is [1111111].
The choice of the generator polynomial is guided by the property that g(x)
is the generator polynomial for a linear cyclic code of length n if and only if
g(x) divides 1 + x^n without a remainder.
The length of the data is k = n − deg(g(x)), where deg(g(x)) denotes the degree of the generator polyno-
mial g(x). A cyclic code with a generator polynomial of degree n − k is called an
(n,k) cyclic code. An (n,k) cyclic code can detect burst errors affecting n − k
bits or less.
Example 5.13. Find a generator polynomial for a code of length n = 7 for
encoding data of length k = 4.
We are looking for a polynomial of degree n − k = 7 − 4 = 3 which divides 1 + x^7. Since

1 + x^7 = (1 + x + x^3)·(1 + x^2 + x^3)·(1 + x),

we can choose either g(x) = 1 + x + x^3 or g(x) = 1 + x^2 + x^3.
Table 5.3 shows the cyclic code generated by g(x) = 1 + x + x^3.
Since deg(g(x)) = 3, 3-bit burst errors can be detected by this code.
data codeword
d0 d1 d2 d3 c0 c1 c2 c3 c4 c5 c6
0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 1 1 0 1
0 0 1 0 0 0 1 1 0 1 0
0 0 1 1 0 0 1 0 1 1 1
0 1 0 0 0 1 1 0 1 0 0
0 1 0 1 0 1 1 1 0 0 1
0 1 1 0 0 1 0 1 1 1 0
0 1 1 1 0 1 0 0 0 1 1
1 0 0 0 1 1 0 1 0 0 0
1 0 0 1 1 1 0 0 1 0 1
1 0 1 0 1 1 1 0 0 1 0
1 0 1 1 1 1 1 1 1 1 1
1 1 0 0 1 0 1 1 1 0 0
1 1 0 1 1 0 1 0 0 0 1
1 1 1 0 1 0 0 0 1 1 0
1 1 1 1 1 0 0 1 0 1 1
Let C be an (n,k) cyclic code generated by the generator polynomial g(x).
The codewords corresponding to the polynomials x^i·g(x), for i = 0, 1, ..., k − 1, form a basis of C, since every codeword

d(x)g(x) = d_0·g(x) + d_1·x·g(x) + ... + d_(k-1)·x^(k-1)·g(x)

is a linear combination of the x^i·g(x). So, the following matrix G, whose ith row
contains the coefficients of x^(i-1)·g(x), is a generator matrix for C:

G = [ g_0 g_1 ... g_(n-k) 0   ...  0
      0   g_0 g_1 ... g_(n-k)  ... 0
      ...
      0   ...  0  g_0 g_1 ... g_(n-k) ]
Example 5.14. If C is the binary cyclic code with the generator polynomial
g(x) = 1 + x + x^3, then the generator matrix is given by:

G = [ 1 1 0 1 0 0 0
      0 1 1 0 1 0 0
      0 0 1 1 0 1 0
      0 0 0 1 1 0 1 ]
Every (n,k) cyclic code also has a parity check polynomial
h(x) determined by

g(x)h(x) = 1 + x^n

A parity check matrix H contains as its first row the coefficients of h(x),
starting from the most significant one. Every following row of H is a right
cyclic shift of the first row:

H = [ h_k h_(k-1) ... h_0 0   ...  0
      0   h_k h_(k-1) ... h_0  ... 0
      ...
      0   ...  0  h_k h_(k-1) ... h_0 ]

Since deg(h(x)) = k, the matrix H has n − k rows and n columns.
Example 5.15. Suppose C is the binary cyclic code with the generator polyno-
mial g(x) = 1 + x + x^3. Let us compute its check polynomial. Since
g(x)h(x) = 1 + x^7, the check polynomial is h(x) = (1 + x^7)/(1 + x + x^3) = 1 + x + x^2 + x^4.
Since cyclic codes are linear codes, we can use the parity check polyno-
mial to detect errors which may have occurred in a codeword c(x) during data
transmission or storage. We define the syndrome polynomial

s(x) = h(x)c(x) mod (1 + x^n)
Polynomial division can be implemented efficiently in hardware using linear feedback shift reg-
isters (LFSR). The logic diagram of an LFSR for a generator polynomial of
degree r = n − k is shown in Figure 5.7. It consists of a simple shift register
with feedback connections determined by the coefficients g_i,
i ∈ {0, 1, ..., r − 1}, of the generator polynomial g(x) =
g_0·x^0 + g_1·x^1 + ... + g_r·x^r. Each g_i is either 0, meaning "no connection", or 1,
meaning "connection". An exception is g_r, which is always 1 and therefore is
always connected.
To compute the syndrome, the received word c(x) is divided by the generator
polynomial g(x), resulting in the quotient d(x) and the remainder s(x). After the division is completed, the
coefficients [s_0 s_1 ... s_(r-1)] of the syndrome polynomial are contained in the register cells.
Example 5.16. As an example, consider the logic circuit shown in Figure 5.8.
It implements an LFSR for the generator polynomial g(x) = 1 + x + x^3. Let s_i'
denote the next state value of the register cell s_i. Then, the state transitions of the
LFSR in Figure 5.8 are given by

s_0' = s_2 + c(x)
s_1' = s_0 + s_2
s_2' = s_1

where c(x) is the input bit fed in at the current clock cycle.

Figure 5.8. Implementation of the decoding circuit for a cyclic code with the generator polyno-
mial g(x) = 1 + x + x^3.
Table 5.4 shows the register states as the word [1010001], i.e. 1 + x^2 + x^6, is fed
into the LFSR of Figure 5.8; the first bit of the quotient appears at
the output at the 4th clock cycle. In general, the first bit of the quotient comes out
at the clock cycle r + 1, for an LFSR of size r = n − k. After the division is
completed, the registers contain
[000], so [1010001] is a codeword and the quotient [1101] is valid data. We can
verify the obtained result by dividing 1 + x^2 + x^6 by 1 + x + x^3. The quotient is
1 + x + x^3, which is indeed [1101].
Example 5.17. Next, suppose that a single-bit error has occurred in the 4th
bit of the codeword [1010001], and the word [1011001] is received instead. Table 5.5
illustrates the decoding process. As we can see, after the division is completed,
the registers contain the remainder [110], which matches the 4th column of the
parity check matrix H in (5.7).
clock input register state output
period c(x) s0 s1 s2 d(x)
0 0 0 0
1 1 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 1 1 0 1
5 1 1 1 1 0
6 0 1 0 1 1
7 1 0 0 0 1
Table 5.4. Register values of the circuit in Figure 5.8 for the input [1010001].
clock input register state output
period c(x) s0 s1 s2 d(x)
0 0 0 0
1 1 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 1 0 1 0 1
5 1 1 0 1 0
6 0 1 0 0 1
7 1 1 1 0 0
Table 5.5. Register values of the circuit in Figure 5.8 for the input [1011001].
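The LFSR of Figure 5.8 can be simulated directly from its next-state equations, reproducing the register columns of Tables 5.4 and 5.5 (a behavioural sketch; the MSB-first feeding order is an assumption consistent with the tables):

```python
# Simulation of the LFSR for g(x) = 1 + x + x^3 (Figure 5.8).

def lfsr_divide(bits):
    """Feed bits most-significant coefficient first; return (states, out)."""
    s0 = s1 = s2 = 0
    states, out = [], []
    for c in bits:
        out.append(s2)                    # quotient bit (valid from cycle r+1)
        s0, s1, s2 = c ^ s2, s0 ^ s2, s1  # next-state equations from the text
        states.append((s0, s1, s2))
    return states, out

# codeword [1010001], i.e. 1 + x^2 + x^6, fed as c6, c5, ..., c0
states, out = lfsr_divide([1, 0, 0, 0, 1, 0, 1])
assert states[-1] == (0, 0, 0)     # zero remainder: a valid codeword
assert out[3:] == [1, 0, 1, 1]     # quotient [1101], i.e. 1 + x + x^3
```

Feeding the erroneous word [1011001] instead leaves the non-zero remainder [110] in the register, as in Table 5.5.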
The cyclic codes constructed by multiplying the data polynomial by g(x) are not separable. We can
construct a separable cyclic code by applying the following technique.
First, we take the data d = [d_0 d_1 ... d_(k-1)] to be encoded and shift it right by
n − k positions:

[0, 0, ..., 0, d_0, d_1, ..., d_(k-1)]

Shifting d(x) right by n − k positions corresponds to multiplying it by x^(n-k):

d(x)·x^(n-k) = d_0·x^(n-k) + d_1·x^(n-k+1) + ... + d_(k-1)·x^(n-1)

Next, we divide d(x)·x^(n-k) by the generator polynomial g(x), obtaining a
quotient q(x) and a remainder r(x):

d(x)·x^(n-k) = q(x)g(x) + r(x)

By moving r(x) from the right-hand side of the equation to the left-hand side
(recall that subtraction modulo 2 is the same as addition), we get

d(x)·x^(n-k) + r(x) = q(x)g(x)

The left-hand side is a multiple of g(x), and therefore it is a codeword. Since
the degree of r(x) is less than n − k, the remainder does not overlap the shifted
data, so the data can be read directly from the last k bits of the codeword.
Example 5.18. Let us construct a separable cyclic codeword for 4-bit data
with the generator polynomial g(x) = 1 + x + x^3. Let d(x) = x + x^3, i.e. [0101].
First, we compute x^(n-k)·d(x) = x^3·(x + x^3) = x^4 + x^6. Next, we employ the
division algorithm:

x^4 + x^6 = (1 + x^3)·(1 + x + x^3) + (1 + x)

So, r(x) = 1 + x, and the codeword is x^4 + x^6 + 1 + x,
i.e. [1100101]. We can easily separate the data part of the codeword; it is
contained in the last four bits.
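The shift-divide-append procedure can be sketched with the bit-mask representation used earlier (an illustrative sketch; bit i is the coefficient of x^i):

```python
# Separable (systematic) cyclic encoding, reproducing Example 5.18.

def pmod(a, b):
    """Remainder of a(x) divided by b(x) over Z2."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

def encode(d, g):
    r = g.bit_length() - 1          # number of check bits, deg g(x)
    shifted = d << r                # d(x) * x^(n-k)
    return shifted ^ pmod(shifted, g)

# d = [0101] (x + x^3), g(x) = 1 + x + x^3
c = encode(0b1010, 0b1011)
print(format(c, '07b')[::-1])       # 1100101, as in Example 5.18
```

Every codeword produced this way divides evenly by g(x), so the same `pmod` serves as the error check on reception.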
Example 5.19. Suppose C is a binary separable cyclic code with the generator
polynomial g(x) = 1 + x + x^3. Let us compute its generator and parity check matrices.
Consider the parity check matrix (5.7). To obtain a separable code, we need
to permute its columns to bring it to the form H = [A^T I_(n-k)]. One of the solutions
is:

H = [ 1 0 1 1 1 0 0
      1 1 1 0 0 1 0
      0 1 1 1 0 0 1 ]

The corresponding generator matrix G = [I_k A] is:

G = [ 1 0 0 0 1 1 0
      0 1 0 0 0 1 1
      0 0 1 0 1 1 1
      0 0 0 1 1 0 1 ]
The encoding of a separable cyclic code can be imple-
mented using an LFSR identical to the one used for decoding. The multiplica-
tion by x^(n-k) is done by shifting the data right by n − k positions. After the last
bit of d(x) has been fed in, the LFSR contains the remainder of the division of the input
polynomial by g(x). By subtracting this remainder from x^(n-k)·d(x), we obtain the
encoded word.
The following generator polynomials are used by three common CRC standards:

CRC-16: 1 + x^2 + x^15 + x^16
CRC-CCITT: 1 + x^5 + x^12 + x^16
CRC-32: 1 + x + x^2 + x^4 + x^5 + x^7 + x^8 + x^10 + x^11 + x^12 + x^16 + x^22 + x^23 + x^26 + x^32
CRC-16 and CRC-CCITT are widely used in modems and network protocols
in the USA and Europe, respectively, and give adequate protection for most
applications. An attractive feature of CRC-16 and CRC-CCITT is the small
number of non-zero terms in their polynomials (just four). This is an advantage
because the LFSR required to implement encoding and decoding is simpler for
generator polynomials with a small number of terms. Applications that need
extra protection, such as Department of Defense applications, use CRC-32.
The encoding and decoding are done either in software or in hardware, using
the procedure from Section 5.7. To perform an encoding, the data polynomial is
first shifted right by deg(g(x)) bit positions, and then divided by the genera-
tor polynomial. The coefficients of the remainder form the check bits of the
CRC codeword. The number of check bits equals the degree of the generator
polynomial. So, a CRC code detects all burst errors of length less than or equal to
deg(g(x)). A CRC code also detects many errors which are longer than deg(g(x)).
For example, apart from detecting all burst errors of length 16 or less, CRC-16
and CRC-CCITT are also capable of detecting 99.997% of burst errors of length
17 and 99.9985% of burst errors of length 18.
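The CRC check-bit computation is exactly the polynomial division of Section 5.7; a sketch using the CRC-16 generator (the 10-bit message is an arbitrary illustration):

```python
# CRC check bits via polynomial division over Z2.
# Polynomials are integer bit masks, bit i = coefficient of x^i.

CRC16 = (1 << 16) | (1 << 15) | (1 << 2) | 1   # 1 + x^2 + x^15 + x^16

def pmod(a, g):
    """Remainder of a(x) divided by g(x) over Z2."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a

msg = 0b1100101011                   # an arbitrary 10-bit message
check = pmod(msg << 16, CRC16)       # shift by deg g = 16, take remainder
codeword = (msg << 16) ^ check

assert pmod(codeword, CRC16) == 0            # error-free word divides evenly
assert pmod(codeword ^ 0b111, CRC16) != 0    # a 3-bit burst error is detected
```

Any burst error shorter than deg(g(x)) is a non-zero polynomial of degree below 16 and therefore leaves a non-zero remainder.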
5.9 Reed-Solomon codes
Reed-Solomon codes are cyclic codes encoded using the separable technique
described in Section 5.7: the codeword is computed by shifting the data right
n − k positions, dividing it by the generator polynomial and then adding the
obtained remainder to the shifted data. A key difference is that groups of m bits,
rather than individual bits, are used as symbols of the code. Usually m = 8, i.e.
a byte. The underlying theory is that of the finite field of degree m over {0, 1}; the elements
of such a field are m-tuples of 0's and 1's, rather than just 0 and 1.
An encoder for a Reed-Solomon code takes k data symbols of m bits each
and computes a codeword containing n symbols of m bits each. The maximum
codeword length is related to m as n = 2^m − 1. A Reed-Solomon code can
correct up to ⌊(n − k)/2⌋ symbols that contain errors.
For example, a popular Reed-Solomon code is RS(255,223), whose symbols
are a byte (8 bits) long. Each codeword contains 255 bytes, of which 223 bytes
are data and 32 bytes are check symbols. So, n = 255 and k = 223, and therefore
this code can correct errors in up to 16 bytes. Note that each of these
16 bytes can have multiple bit errors.
Decoding of Reed-Solomon codes is performed using an algorithm designed
by Berlekamp. The popularity of Reed-Solomon codes is due to a large extent
to the efficiency of this algorithm. Berlekamp's algorithm was used by Voyager II
for transmitting pictures of outer space back to Earth. It is also the basis for
decoding CDs in players. Many additional improvements have been made over the
years to make Reed-Solomon codes practical. Compact discs, for example, use
a modified version of the RS code called the cross-interleaved Reed-Solomon code.
6. Unordered codes
Unordered codes are designed to detect unidirectional errors. A unidirec-
tional error is an error which changes either 0's of the word to 1's, or 1's of
the word to 0's, but not both. An example of a unidirectional error is an error
changing the word [1011000] to the word [0001000]. It is possible to apply
special design techniques to ensure that most of the faults occurring in a logic
circuit cause only unidirectional errors at the output. For example, consider
the logic circuit shown in Figure 5.9. If a single stuck-at fault occurs at any of
the lines in the circuit, it will cause a unidirectional error in the output word
[f1 f2 f3].
Two binary n-tuples x = [x_1 ... x_n] and y = [y_1 ... y_n] are ordered if either
x_i ≤ y_i for all i ∈ {1, 2, ..., n}, or x_i ≥ y_i for all i. For example, if x = [0101] and
y = [0000], then x and y are ordered, namely x ≥ y. An unordered code is a
code satisfying the property that any two of its codewords are unordered.

Figure 5.9. Logic diagram of a circuit in which any single stuck-at fault causes a unidirectional
error on the output.

The ability of unordered codes to detect all unidirectional errors is directly
related to the above property. A unidirectional error always changes a word
x to a word y which is either smaller or greater than x; it cannot change x to
a word which is not ordered with x. Therefore, if any two codewords
of a code are unordered, then a unidirectional error will never
map a codeword to another codeword, and thus will be detected.
In this section we describe two unordered codes: m-of-n codes and Berger
codes.
6.1 M-of-n codes
An m-of-n code consists of all n-bit words with exactly m 1's. Any k-bit
unidirectional error forces the affected codeword to have either m + k or m − k
1's, and is thus detected.
An easy way to construct an m-of-n code is to take the original k bits of data
and append k bits so that the resulting 2k-bit codeword has exactly k 1's. For
example, the 3-of-6 code is shown in Table 5.6. All codewords have exactly
three 1's.
data codeword
d0 d1 d2 c0 c1 c2 c3 c4 c5
0 0 0 0 0 0 1 1 1
0 0 1 0 0 1 1 1 0
0 1 0 0 1 0 1 0 1
0 1 1 0 1 1 1 0 0
1 0 0 1 0 0 0 1 1
1 0 1 1 0 1 0 1 0
1 1 0 1 1 0 0 0 1
1 1 1 1 1 1 0 0 0
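A sketch of the 3-of-6 construction and its weight check (the table of appended bits is copied from Table 5.6 above; the checker flags any word whose weight differs from 3):

```python
# 3-of-6 unordered code: encode 3-bit data, check the constant weight.

APPEND = {
    (0, 0, 0): [1, 1, 1], (0, 0, 1): [1, 1, 0],
    (0, 1, 0): [1, 0, 1], (0, 1, 1): [1, 0, 0],
    (1, 0, 0): [0, 1, 1], (1, 0, 1): [0, 1, 0],
    (1, 1, 0): [0, 0, 1], (1, 1, 1): [0, 0, 0],
}

def encode_3of6(d):
    return list(d) + APPEND[tuple(d)]

def check_3of6(c):
    return sum(c) == 3          # every valid codeword has weight 3

c = encode_3of6([0, 1, 1])
assert check_3of6(c)
c[0] ^= 1                       # a unidirectional 0 -> 1 error
assert not check_3of6(c)        # weight is now 4: detected
```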
6.2 Berger codes
Check bits in a Berger code represent the number of 1's in the data word. A
Berger code of length n has k data bits and m check bits, where m = ⌈log2(k + 1)⌉
and n = k + m. A codeword is created by appending to the data word the complement
of the m-bit binary representation of its number of 1's. An example of a Berger
code for 4-bit data is shown in Table 5.7.
data codeword
d0 d1 d2 d3 c0 c1 c2 c3 c4 c5 c6
0 0 0 0 0 0 0 0 1 1 1
0 0 0 1 0 0 0 1 1 1 0
0 0 1 0 0 0 1 0 1 1 0
0 0 1 1 0 0 1 1 1 0 1
0 1 0 0 0 1 0 0 1 1 0
0 1 0 1 0 1 0 1 1 0 1
0 1 1 0 0 1 1 0 1 0 1
0 1 1 1 0 1 1 1 1 0 0
1 0 0 0 1 0 0 0 1 1 0
1 0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 0 1 0 1
1 0 1 1 1 0 1 1 1 0 0
1 1 0 0 1 1 0 0 1 0 1
1 1 0 1 1 1 0 1 1 0 0
1 1 1 0 1 1 1 0 1 0 0
1 1 1 1 1 1 1 1 0 1 1
Table 5.7. Defining table for Berger code for 4-bit data.
The primary advantages of a Berger code are that it is a separable code and
that it detects all unidirectional multiple errors. It is shown that the Berger code is
the least redundant separable code detecting all unidirectional errors. Still, for short data words the
redundancy of a Berger code is high. However, as k increases, the relative number of
check bits drops substantially. The Berger codes with k = 2^m − 1 are called
maximal length Berger codes.
7. Arithmetic codes
Arithmetic codes are usually used for detecting errors in arithmetic opera-
tions, such as addition or multiplication. The data representing the operands,
say b and c, is encoded before the operation is performed. The operation is
carried out on the resulting codewords A(b) and A(c). After the operation, the
codeword A(b) ∗ A(c) representing the result of the operation "∗" is decoded
and checked for errors.
An arithmetic code relies on the property of invariance with respect to the
operation "∗":

A(b ∗ c) = A(b) ∗ A(c)

Invariance guarantees that the operation "∗" on the codewords A(b) and A(c) gives
us the same result as encoding b ∗ c. So, if no error has occurred, decoding A(b) ∗ A(c)
gives us b ∗ c, the result of the operation "∗" on b and c.
Two common types of arithmetic codes are AN-codes and residue codes.
7.1 AN-codes
The AN-code is the simplest representative of arithmetic codes. The codewords
are obtained by multiplying the data words N by a constant A. For example,
if the data is of length two, namely [00], [01], [10], [11], then the 3N-code is
[0000], [0011], [0110], [1001]. Each codeword is computed by multiplying a
data word by 3. Vice versa, to decode a codeword, we divide it by 3. If
there is no remainder, no error has occurred. AN-codes are non-separable codes.
The AN-code is invariant with respect to addition and subtraction, but not
with respect to multiplication and division. For example, clearly 3·(a·b) ≠ (3a)·(3b) for all
non-zero a, b.
The constant A determines the information rate of the code and its error de-
tection capability. For binary codes, A should not be a power of two. If A = 2^a,
then a single-bit error in bit position r ≥ a adds 2^r = 2^(r-a)·A to the codeword;
the resulting word is again a multiple of A, i.e. a codeword, and the error cannot be detected.
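A minimal 3N-code sketch showing encoding, decoding, invariance under addition, and detection of a single-bit error:

```python
# 3N-code: multiply by A = 3 to encode, divide to decode; a non-zero
# remainder signals an error.
A = 3

def encode(n):
    return A * n

def decode(cw):
    if cw % A != 0:
        raise ValueError("error detected")
    return cw // A

b, c = 2, 3
assert decode(encode(b) + encode(c)) == b + c   # invariance w.r.t. addition
assert encode(2) == 0b0110                      # [01] -> [0110], as above

corrupted = encode(2) ^ 0b1                     # single-bit error
assert corrupted % A != 0                       # detected: not a multiple of 3
```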
7.2 Residue codes
A residue code is a separable arithmetic code obtained by appending to the
data its residue modulo some constant m. The residue of a sum can be computed
from the residues of the operands:

(b + c) mod m = ((b mod m) + (c mod m)) mod m

where b and c are data words and m is the modulus. This allows us to handle
the residues separately from the data during the addition process. The value of the modulus
determines the information rate and the error detection capability of the code.
A variation of residue codes are inverse residue codes, where the inverse of
the residue, rather than the residue itself, is appended to the data. These codes
are shown to have better fault detection capabilities for common-mode faults.
8. Problems
5.1. Give an example of a binary code of length 4 and of size 6. How many
words are contained in the codespace of your code?
5.2. Why is separability of a code considered to be a desirable feature?
5.3. Define the information rate. How is the information rate related to redun-
dancy?
5.4. What is the main difference in the objectives of encoding for the coding
theory and the cryptography?
5.5. What is the maximum Hamming distance between two words in the codespace
{0, 1}^4?
5.6. Consider the code C = {01100101110, 10010110111, 01010011001}.
(a) What is the code distance of C?
(b) How many errors can be detected/corrected by code C?
5.7. How would you generalize the notions of Hamming distance and code dis-
tance to ternary codes using {0, 1, 2} as valid symbols? Find a generalization
which preserves the following two properties: to be able to correct ε-digit
errors, a ternary code should have a code distance of at least 2ε + 1; to
be able to detect ε-digit errors, the ternary code distance should be at least
ε + 1.
5.8. Prove that, for any n ≥ 1, a parity code of length n has code distance two.
5.9. (a) Construct an even parity code C for 3-bit data.
(b) Suppose the word (1101) is received. Assuming single bit error, what
are the codewords that have possibly been transmitted?
5.10. Draw a gate-level logic circuit of an odd parity generation circuit for 5-bit
data. Limit yourself to use of two-input gates only.
5.11. How would you generalize the notion of parity for ternary codes? Give an
example of a ternary parity code for 3-digit data, satisfying your definition.
5.12. Construct the parity check matrix H and the generator matrix G for a linear
code for 4-bit data which can:
(a) detect 1 error
(b) correct 1 error
(c) correct 1 error and detect one additional error.
5.13. Construct the parity check matrix H and the generator matrix G for a linear
code for 5-bit data which can:
(a) detect 1 error
(b) correct 1 error
(c) correct 1 error and detect one additional error.
5.14. (a) Construct the parity check matrix H and the generator matrix G of a
Hamming code for 11-bit data.
(b) Determine whether you can construct a Hamming code for data of lengths 1, 2
and 3. Construct the parity check matrix H and the generator matrix G
for those lengths for which it is possible.
5.15. The parity code is a linear code, too. Construct the parity check matrix H
and the generator matrix G for a parity code for 4-bit data.
5.16. Find the generator matrix for the (7,4) cyclic code C with the generator
polynomial 1 + x^2 + x^3. Prove that C is a Hamming code.
5.17. Find the generator matrix for the (15,11) cyclic code C with the generator
polynomial 1 + x + x^4. Prove that C is a Hamming code.
5.18. Compute the check polynomial for the (7,4) cyclic code with the generator
polynomial g(x) = 1 + x^2 + x^3.
5.19. Let C be an (n, k) cyclic code. Prove that the only burst errors of length
n − k + 1 that are codewords (and therefore not detectable errors) are shifts
of scalar multiples of the generator polynomial.
5.20. Suppose you use a cyclic code generated by the polynomial g(x) = 1 + x + x^3.
You have received the word c(x) = 1 + x + x^4 + x^5. Check whether an error
has occurred during the data transmission.
5.21. Develop an LFSR for decoding of 4-bit data using the generator polynomial
g(x) = 1 + x^4. Show the state table for the word c(x) = x^7 + x^6 + x^5 + x^3 + 1
(like the one in Table 5.4). Is c(x) a valid codeword?
5.22. Construct a separable cyclic code for 4-bit data generated by the polynomial
g(x) = 1 + x + x^4. What code distance does the resulting code have?
5.23. (a) Draw an LFSR decoding circuit for CRC codes with the following gen-
erator polynomials:
CRC-16: 1 + x^2 + x^15 + x^16
CRC-CCITT: 1 + x^5 + x^12 + x^16
You may use "..." between the registers 2 and 15 in the first polynomial
and 5 and 12 in the second, to make the picture shorter.
(b) Use the first generator polynomial to encode the data 1 + x^3 + x^4.
(c) Suppose that the error 1 + x + x^2 is added to the codeword you obtained
in the previous task. Check whether this error will be detected or not.
5.24. Construct a Berger code for 3-bit data. What code distance does the resulting
code have?
5.25. Suppose we know that the original 4-bit data words will never include the
word 0000. Can we reduce the number of check bits required for a Berger
code and still cover all unidirectional errors?
5.26. Suppose we encoded 8-bit data using a Berger code.
(a) How many check bits are required?
(b) Take one codeword c and list all possible unidirectional errors which can
affect c.
TIME REDUNDANCY
1. Introduction
Space redundancy techniques discussed so far impact physical entities like
cost, weight, size, power consumption, etc. In some applications extra time is
of less importance than extra hardware.
Time redundancy is achieved by repeating the computation or data trans-
mission and comparing the result to a stored copy of the previous result. If
the repetition is done twice, and if the fault which has occurred is transient,
then the stored copy will differ from the re-computed result, so the fault will be
detected. If the repetition is done three or more times, a fault can be corrected.
In this section, we show that time redundancy techniques can also be used for
detecting permanent faults.
Apart from detection and correction of faults, time redundancy is useful for
distinguishing between transient and permanent faults. If the fault disappears
after the re-computation, it is assumed to be transient. In this case the hardware
module is still usable, and it would be a waste of resources to take it out of
operation.
2. Alternating logic
The alternating logic time redundancy scheme was developed by Reynolds
and Metze in 1978. It has been applied to permanent fault detection in digital
data transmission and in digital circuits.
Suppose the data is transmitted over a parallel bus as shown in Figure 6.1.
5
At time t0 , the original data is transmitted. Then, the data is complemented and
re-transmitted at time t0 ∆. The two results are compared to check whether
they are complements of each other. Any disagreement indicates a fault. Such
a scheme is capable of detecting permanent stuck-at faults at the bus lines.
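The bus scheme can be sketched with a small model. The function names and the `stuck` fault map below are our own illustrative constructs, not part of the original scheme description.

```python
def transmit(word, stuck):
    """Model one transfer over a parallel bus; lines listed in `stuck`
    are forced to their stuck-at value."""
    for line, value in stuck.items():
        if value:
            word |= 1 << line       # line stuck-at-1
        else:
            word &= ~(1 << line)    # line stuck-at-0
    return word

def alternating_ok(word, n, stuck):
    """Send the word at t0, its complement at t0 + delta, and check that
    the two transfers are bitwise complements of each other."""
    mask = (1 << n) - 1
    first = transmit(word, stuck)
    second = transmit(word ^ mask, stuck)
    return (first ^ second) == mask
```

A line stuck at either value produces the same bit in both transfers, so the complement check fails and the fault is detected.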
The alternating logic concept can also be used for detecting faults in logic circuits
which implement self-dual functions. The dual of a function f(x1, x2, ..., xn) is the
function fd(x1, x2, ..., xn) = ¬f(¬x1, ¬x2, ..., ¬xn). A function f is self-dual if
f = fd, i.e. the value of f for the input assignment (¬x1, ¬x2, ..., ¬xn) equals the
complement of the value of f for (x1, x2, ..., xn).
Figure 6.2. Logic diagram of a full-adder.
For example, consider the full-adder shown in Figure 6.2, implementing the functions
s = a ⊕ b ⊕ cin and cout = ab + (a ⊕ b)cin. Table 6.1 shows the defining table for s and
cout. It is easy to see that the property f(x1, x2, ..., xn) = ¬f(¬x1, ¬x2, ..., ¬xn) holds
for both functions.
The stuck-at fault marked in Figure 6.2 can be detected by applying the input
assignment (a, b, cin) = (100), followed by the complemented assignment (011).
Table 6.1. Defining table of the full-adder.

a b cin | s cout
0 0 0   | 0 0
0 0 1   | 1 0
0 1 0   | 1 0
0 1 1   | 0 1
1 0 0   | 1 0
1 0 1   | 0 1
1 1 0   | 0 1
1 1 1   | 1 1
In a fault-free full-adder, s(100) = 1, cout(100) = 0 and s(011) = 0, cout(011) = 1,
i.e. the outputs for the complemented inputs are complements of each other. In the
presence of the marked fault, s(100) = 1, cout(100) = 1 and s(011) = 0, cout(011) = 1.
Since cout(100) is not the complement of cout(011), the fault is detected.
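The self-duality of the full-adder outputs can be verified exhaustively. The following sketch assumes nothing beyond the defining table; the helper names are ours.

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout

def is_self_dual(f, arity):
    """A function is self-dual iff complementing all inputs complements
    the output, for every input assignment."""
    for m in range(2 ** arity):
        x = [(m >> i) & 1 for i in range(arity)]
        xbar = [1 - xi for xi in x]
        if f(*xbar) != 1 - f(*x):
            return False
    return True
```

Both the sum and the carry-out pass the check, while a plain 2-input AND does not.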
If the function f(x1, x2, ..., xn) realized by the circuit is not self-dual, it can be
transformed into a self-dual function of n + 1 variables, defined by

fsd = xn+1 · f + ¬xn+1 · fd

where fd is the dual of f.
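The construction can be illustrated on a function which is not self-dual, e.g. a 2-input AND (our example, not the book's):

```python
def f(x1, x2):
    return x1 & x2                    # AND, not self-dual

def f_dual(x1, x2):
    return 1 - f(1 - x1, 1 - x2)      # the dual of f (here: OR)

def f_sd(x1, x2, x3):
    """Self-dual extension of f: x_{n+1}*f + (not x_{n+1})*f_d."""
    return (x3 & f(x1, x2)) | ((1 - x3) & f_dual(x1, x2))
```

An exhaustive check over all eight input assignments confirms that f_sd is self-dual.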
3. Recomputing with shifted operands
The recomputing with shifted operands (RESO) technique performs each
computation twice, the second time on shifted operands, and compares the results.
At time t0, bit slice i of the circuit performs the computation on the original operands.
Then, the operands are shifted left and the computation is repeated at time t0 + δ. The
shift can be either an arithmetic or a logical shift. After the computation, the result is
shifted right and the two results are compared. If there is no error, they are the
same. Otherwise, they disagree in the ith bit, the (i+1)th bit, or both.
The fault detection capability of RESO depends on the amount of shift. For
example, for a bit-sliced ripple-carry adder, a 2-bit arithmetic shift is required to
guarantee fault detection. A fault in the ith bit slice can have one of
three effects:

1 The sum bit is erroneous. Then, the incorrect result differs from the correct
one by either −2^i (if the erroneous sum is 0), or +2^i (if the erroneous sum is 1).

2 The carry bit is erroneous. Then, the incorrect result differs from the correct
one by either −2^(i+1) (if the erroneous carry is 0), or +2^(i+1) (if the erroneous carry is 1).
3 Both the sum and the carry bits are erroneous. Then, we have four possibilities:

sum is 0, carry is 0: −2^i − 2^(i+1) = −3·2^i;
sum is 0, carry is 1: −2^i + 2^(i+1) = +2^i;
sum is 1, carry is 0: +2^i − 2^(i+1) = −2^i;
sum is 1, carry is 1: +2^i + 2^(i+1) = +3·2^i.

Summarizing, if the operands are not shifted, the erroneous result differs
from the correct one by one of the values {0, ±2^i, ±2^(i+1), ±3·2^i}.
A similar analysis shows that if the operands are shifted left
by two bits, the erroneous result differs from the correct one by one of the
values {0, ±2^(i−2), ±2^(i−1), ±3·2^(i−2)}. Since the two sets intersect only at 0,
the results of the non-shifted and shifted computations cannot agree unless they
are both correct.
A primary problem with the RESO technique is the additional hardware required
to store the shifted bits.
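The argument can be checked with a small model of a bit-sliced adder in which a hypothetical fault inverts the sum bit of one slice. The function names and the fault model are ours.

```python
def ripple_add(a, b, n, faulty_slice=None):
    """Bit-sliced ripple-carry addition of two n-bit operands; if
    faulty_slice is set, the sum bit of that slice is inverted."""
    result, carry = 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | ((ai ^ bi) & carry)
        if i == faulty_slice:
            s ^= 1                  # fault injection
        result |= s << i
    return result

def reso_detects(a, b, n, faulty_slice):
    """RESO: compute once normally, once with operands shifted left by
    two bits; shift the second result back and compare."""
    plain = ripple_add(a, b, n, faulty_slice)
    shifted = ripple_add(a << 2, b << 2, n + 2, faulty_slice) >> 2
    return plain != shifted
```

In the shifted computation the faulty slice processes a different operand bit, so the two erroneous results cannot coincide.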
4. Recomputing with swapped operands
The recomputing with swapped operands (RESWO) technique repeats the computation
with the lower and upper halves of the operands swapped. As an example,
consider a bit-sliced ripple-carry adder with n-bit operands. Suppose the lower
half of an operand contains the bits from 0 to r − 1, where r = n/2, and the upper half
contains the bits from r to n − 1. During the first computation, if the sum or carry
bit of slice i is faulty, the resulting sum differs from the correct one
by ±2^i or ±2^(i+1), respectively. If both the sum and the carry are faulty, the
result differs from the correct one by ±2^i ± 2^(i+1). So, if the operand
halves are not swapped, a faulty bit slice i causes the result to differ from the
correct result by one of the values {0, ±2^i, ±2^(i+1), ±2^i ± 2^(i+1)}. If i < r, the result of
the re-computation with the lower and the upper halves of the operands swapped
differs from the correct result by one of the values {0, ±2^(i+r), ±2^(i+r+1), ±2^(i+r) ± 2^(i+r+1)}.
Since the two sets intersect only at 0, the results of the non-swapped and swapped
computations cannot agree unless they are both correct.
5. Recomputing with duplication with comparison
The recomputing with duplication with comparison (REDWC) technique com-
bines hardware redundancy with time redundancy. An n-bit operation is per-
formed by using two n/2-bit devices twice. The operands are split into two
halves. First, the operation is carried out on the lower halves and their dupli-
cates and the results are compared. This is then repeated for the upper halves
of the operands.
As an example, consider how REDWC is performed on an n-bit full adder.
First, lower and upper parts of the adder are used to compute the sum of the
lower parts of the operands. A multiplexer is used to handle the carries at the
boundaries of the adder. The results are compared and one of them is stored
to represent the lower half of the final sum. The second computation is carried
out on the upper parts of the operands. Selection of the appropriate half of the
operands is performed using multiplexers.
The REDWC technique allows detection of all single faults in one half of the adder,
as long as both halves do not become faulty in a similar manner at the
same time.
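A sketch of REDWC applied to addition is given below; `make_adder` and the single-bit fault model are illustrative assumptions, not part of the original description.

```python
def make_adder(width, flip_bit=None):
    """Return a width-bit adder; a faulty adder inverts bit flip_bit
    of every sum it produces."""
    mask = (1 << width) - 1
    def add(x, y, cin):
        total = x + y + cin
        s, cout = total & mask, total >> width
        if flip_bit is not None:
            s ^= 1 << flip_bit
        return s, cout
    return add

def redwc_add(a, b, n, adder1, adder2):
    """Add two n-bit operands using two n/2-bit adders twice; after each
    half, the duplicated results are compared."""
    h = n // 2
    mask = (1 << h) - 1
    carry, result = 0, 0
    for step, (x, y) in enumerate(((a & mask, b & mask), (a >> h, b >> h))):
        r1, r2 = adder1(x, y, carry), adder2(x, y, carry)
        if r1 != r2:
            raise RuntimeError("fault detected")
        s, carry = r1
        result |= s << (step * h)
    return result
```

A fault in one of the two adders makes the duplicated results disagree, and the comparison raises an error instead of returning a wrong sum.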
6. Problems
6.1. Give three examples of applications where time is less important than hard-
ware.
6.2. Two independent methods for fault detection on buses are:
use a parity bit,
use alternating logic.
Neither of these methods has the capability of correcting an error. However,
together these two methods can be used to correct any single permanent fault
(stuck-at type). Explain how. Use an example to illustrate your algorithm.
6.3. Write a truth table for a 2-bit full adder and check whether sum s and carry
out cout are self-dual functions.
SOFTWARE REDUNDANCY
Programs are really not much more than the programmer’s best guess about what a system
should do.
—Russell Abbott
1. Introduction
In this chapter, we discuss techniques for software fault tolerance. In general,
fault tolerance in the software domain is not as well understood and mature as
fault tolerance in the hardware domain. Controversial opinions exist on whether
reliability can be used to evaluate software. Software does not degrade with
time. Its failures are mostly due to the activation of specification or design faults
by the input sequences. So, if a fault exists in software, it will manifest itself the first
time the relevant conditions occur. This makes the reliability of a software
module dependent on the environment that generates inputs to the module over
time. Different environments might result in different reliability values.
The Ariane 5 rocket accident is an example of how a piece of software, safe for
the Ariane 4 operating environment, can cause a disaster in a new environment.
As we described in Section 3.2, the Ariane 5 rocket exploded 37 seconds after its
lift-off, due to a complete loss of guidance and attitude information. The loss of
information was caused by a fault in the software of the inertial reference system,
resulting from the violation of an assumption on the maximum size of a floating-point number.
Many current techniques for software fault tolerance attempt to leverage the
experience of hardware redundancy schemes. For example, software N-version
programming closely resembles hardware N-modular redundancy. Recovery
blocks use the concept of retrying the same operation in expectation that the
problem is resolved after the second try. However, traditional hardware fault
tolerance techniques were developed primarily to fight permanent component faults.
2. Single-version techniques
Single version techniques add to a single software module a number of func-
tional capabilities that are unnecessary in a fault-free environment. Software
structure and actions are modified to be able to detect a fault, isolate it and
prevent the propagation of its effect throughout the system. In this section, we
consider how fault detection, fault containment and fault recovery are achieved
in the software domain.
2 Local exceptions are signaled by a module when its fault detection mecha-
nism detects a fault within its internal operations. This type of exception is
supposed to be handled by the faulty module.
3 Failure exceptions are signaled by a module when it has detected that its
fault recovery mechanism is unable to recover successfully. This type of
exception is supposed to be handled by the system.
There are two types of checkpoints: static and dynamic. A static checkpoint
takes a single snapshot of the system state at the beginning of the program
execution and stores it in the memory. Fault detection checks are placed at
the output of the module. If a fault is detected, the system returns to this
state and starts the execution from the beginning. Dynamic checkpoints are
created dynamically at various points during the execution. If a fault is detected,
the system returns to the last checkpoint and continues the execution. Fault
detection checks need to be embedded in the code and executed before the
checkpoints are created.
A number of factors influence the efficiency of checkpointing, including exe-
cution requirements, the interval between checkpoints, fault activation rate and
overhead associated with creating fault detection checks, checkpoints, recov-
ery, etc. In the static approach, the expected time to complete the execution grows
exponentially with the execution requirements. Therefore, static checkpointing
is effective only if the processing requirement is relatively small. In the dynamic
approach, it is possible to achieve a linear increase in execution time as the pro-
cessing requirements grow. There are three strategies for dynamically placing
checkpoints:
1 Equidistant, which places checkpoints at deterministic fixed time intervals.
The time between checkpoints is chosen depending on the expected fault
rate.
2 Modular, which places checkpoints at the end of the sub-modules in a mod-
ule, after the fault detection checks for the sub-module are completed. The
execution time depends on the distribution of the sub-modules and expected
fault rate.
3 Random, placing checkpoints at random.
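The effect of checkpoint placement can be illustrated with a toy simulation. The model below is our own simplification: one work unit per time step, zero checkpointing overhead, and transient faults that wipe out all uncommitted work.

```python
def time_to_complete(work, ckpt_interval, fault_times):
    """Return the time needed to finish `work` units when a checkpoint is
    committed every ckpt_interval units and a transient fault at a time
    in fault_times rolls execution back to the last checkpoint."""
    t = committed = progress = 0
    while committed + progress < work:
        t += 1
        if t in fault_times:
            progress = 0                  # roll back to last checkpoint
        else:
            progress += 1
            if progress == ckpt_interval:
                committed += progress     # commit a checkpoint
                progress = 0
    return t
```

With a single fault at time 7, checkpointing every 5 units finishes 10 units of work in 12 steps, while a static scheme (one snapshot at the start, i.e. an interval equal to the whole execution) needs 17.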
Overall, the restart recovery mechanism has the following advantages:
It is conceptually simple.
It is independent of the damage caused by a fault.
It is applicable to unanticipated faults.
It is general enough to be used at multiple levels in a system.
A problem with restart recovery is that non-recoverable actions exist in
some systems. These actions are usually associated with external events that
cannot be compensated for by simply reloading the state and restarting the system.
Examples of non-recoverable actions are firing a missile or soldering a pair of
wires. The recovery from such actions needs to include special treatment, for
example compensating for their consequences (e.g. undoing a solder), or
delaying the output until additional confirmation checks are completed
(e.g. performing a friend-or-foe confirmation before firing).
Data diversity can also be used in combination with the multi-version fault-
tolerance techniques, presented in the next section.
3. Multi-version techniques
Multi-version techniques use two or more versions of the same software
module, which satisfy the design diversity requirements. For example, differ-
ent teams, different coding languages or different algorithms can be used to
maximize the probability that all the versions do not have common faults.
Figure 7.3. Recovery blocks.
Acceptance tests, which provide fault detection in recovery blocks, are highly
application dependent: they are difficult to create,
and they cannot test for a specific correct answer, but only for "acceptable" values.
Figure 7.4. N-version programming.
Many different types of voters have been developed, including the formalized ma-
jority voter, the generalized median voter, the formalized plurality voter and the weighted
averaging technique. These voters have the capability to perform inexact voting
by using the concept of a metric space (X, d). The set X is the output space of the
software and d is a metric function that associates any two elements in X with
a real-valued number (see Section 2.5 for the definition of a metric). Two inexact
values are declared equal if their metric distance is less than some pre-defined
threshold ε. In the formalized majority voter, the outputs are compared and, if
more than half of the values agree, the voter output is selected as one of the val-
ues in the agreement group. The generalized median voter selects the median
of the values as the correct result. The median is computed by successively
eliminating the pair of values that are farthest apart until only one value remains.
The formalized plurality voter partitions the set of outputs based on metric
equality and selects the output from the largest partition group. The weighted
averaging technique combines the outputs in a weighted average to produce
the result. The weight can be selected in advance based on the characteristics
of the individual versions. If all the weights are equal, this technique reduces
to the mean selection technique. The weight can be also selected dynamically
based on pair-wise distances of the version outputs or the success history of the
versions measured by some performance metric.
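A formalized majority voter with inexact comparison can be sketched as follows; the function name and the agreement rule (the first candidate whose ε-neighbourhood holds a majority) are our illustrative choices.

```python
def majority_vote(outputs, d, eps):
    """Return a version output whose eps-neighbourhood (under metric d)
    contains more than half of all outputs; raise if no such group exists."""
    n = len(outputs)
    for candidate in outputs:
        group = [y for y in outputs if d(candidate, y) <= eps]
        if len(group) > n / 2:
            return candidate
    raise ValueError("no majority agreement")
```

For three versions returning 1.000, 1.001 and 5.0, with d(a, b) = |a − b| and ε = 0.01, the first two outputs form the agreement group and one of them is selected.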
The selection algorithms are normally developed taking into account the
consequences of erroneous output for dependability attributes like reliability,
availability and safety. For applications where reliability is important, the se-
lection algorithm should be designed so that the selected result is correct with
a very high probability. If availability is an issue, the selection algorithm is
expected to produce an output even if it is incorrect. Such an approach would
be acceptable as long as the program execution is not subsequently dependent
on previously generated (possibly erroneous) results. For applications where
safety is the main concern, the selection algorithm is required to correctly dis-
tinguish the erroneous version and mask its results. In cases when the algorithm
cannot select the correct result with a high confidence, it should report to the
system an error condition or initiate an acceptable safe output sequence.
4. Software Testing
Software testing is the process of executing a program with the intent of
finding errors [Beizer, 1990]. Testing is a major consideration in software
development. In many organizations, more time is devoted to testing than to
any other phase of software development. On complex projects, there may be
two or three times as many test developers as code developers on a project team.
There are two types of software testing: functional and structural. Functional
testing (also called behavioral testing, black-box testing or closed-box testing)
compares the program behavior against its specification. Structural testing
(also called white-box testing, glass-box testing) checks the internal structure
of a program for errors. For example, suppose we test a program which adds two
integers. The goal of functional testing is to verify whether the implemented
operation is indeed addition and not, e.g., multiplication. Structural testing
does not question the functionality of the program, but checks whether the internal
structure is consistent. A strength of the structural approach is that the entire
software implementation is taken into account during testing, which facilitates
error detection even when the software specification is vague or incomplete.
The effectiveness of structural testing is normally expressed in terms of
test coverage metrics, which measure the fraction of code exercised by test
cases. Common test coverage metrics are statement, branch, and path cover-
age [Beizer, 1990]. Statement coverage requires that the program under test is
run with enough test cases, so that all its statements are executed at least once.
Branch coverage requires that all branches of the program are executed at
least once. Path coverage requires that each of the possible paths through the
program is followed. Path coverage is the most reliable metric; however, it is
not applicable to large systems, since the number of paths is exponential in the
number of branches.
This section describes a technique for structural testing which finds a part
of the program's flowgraph, called a kernel, with the property that any set of tests
which executes all vertices (edges) of the kernel executes all vertices (edges)
of the flowgraph [Dubrova, 2005].
Related works include Agrawal's algorithm [Agrawal, 1994] for computing the
super block dominator graph, which represents all kernels of the flowgraph;
Bertolino and Marre’s algorithm [Bertolino and Marre, 1994] for finding path
covers in a flowgraph, in which unconstrained arcs are analogous to the leaves
of the dominator tree; Ball’s [Ball, 1993] and Podgurski’s [Podgurski, 1991]
techniques for computing control dependence regions in a flowgraph, which are
similar to the super blocks of [Agrawal, 1994]; and Agrawal's algorithm [Agrawal,
1999], which addresses the coverage problem at an inter-procedural level.
Consider the following code fragment:

    x = 0;
    if (condition)
        x = x + 1;
    y = 10/x;
If there is no test case which causes condition to evaluate to false, the error
in this code (a division by zero) will not be detected in spite of 100% statement coverage. The
error will appear only if condition evaluates to false for some test case. Since
if-statements are common in programs, this problem is a serious drawback of
statement coverage.
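Rendered in executable form (our hypothetical reconstruction of the fragment above), a single test with condition true reaches every statement, yet the division by zero stays hidden:

```python
def compute(condition):
    x = 0
    if condition:
        x = x + 1
    return 10 / x   # fails when the if-branch is not taken

# The suite {condition = True} gives 100% statement coverage but never
# exercises the failing case condition = False.
```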
Consider another example:

    if (condition1)
        x = 0;
    else
        x = 5;
    if (condition2)
        y = 10*x;
    else
        y = 10/x;
A 100% branch coverage can be achieved by two test cases which cause
both condition1 and condition2 to evaluate to true, and both condition1
and condition2 to evaluate to false. However, the error which occurs when
condition1 evaluates to true and condition2 evaluates to false (again a division
by zero) will not be detected by these two tests.
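The second fragment in executable form (again our reconstruction); the two branch-covering tests pass, while the dangerous combination remains untested:

```python
def evaluate(condition1, condition2):
    x = 0 if condition1 else 5
    return 10 * x if condition2 else 10 / x

# Two tests achieve 100% branch coverage:
#   evaluate(True, True)   -> x = 0, y = 10*x
#   evaluate(False, False) -> x = 5, y = 10/x
# The combination (True, False) divides by zero and is missed by both.
```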
The error in the example above can be detected by exercising every path
through the program. However, since the number of paths is exponential in
the number of branches, testing every path is not possible for large systems.
For example, if one test case takes 0.1 × 10^−5 seconds to execute, then testing
all paths of a program containing 30 if-statements will take 18 minutes, and
testing all paths of a program with 60 if-statements will take 366 centuries.
Branch coverage differs from basis path coverage, which requires each basis
path in the program flowgraph to be executed during a test [Watson, 1996]. Basis
paths are a minimal subset of paths that can generate all possible paths by linear
combination. The number of basis paths is called the cyclomatic number of the
flowgraph.
4.2 Preliminaries
A flowgraph is a directed graph G = (V, E, entry, exit), where V is the set of
vertices representing basic blocks of the program, E ⊆ V × V is the set of edges
connecting the vertices, and entry and exit are two distinguished vertices of V.
Every vertex in V is reachable from the entry vertex, and the exit vertex is
reachable from every vertex in V.
Figure 7.8 shows the flowgraph of the C program in Figure 7.7, where
b1, b2, ..., b16 are blocks whose contents are not relevant for our purposes.
A vertex v pre-dominates a vertex u if every path from entry to u contains v.
Similarly, v post-dominates u if every path from u to
exit contains v.
By Pre(v) and Post(v) we denote the sets of all vertices which pre-dominate and
post-dominate v, respectively. E.g., in Figure 7.8, Pre(5) = {1, 2, 3, 4} and
Post(5) = {9, 10, 16}.
Many properties are common to pre- and post-dominators. In what follows,
we use the word dominator to refer to statements which apply to both rela-
tionships.
A vertex v is the immediate dominator of u if v dominates u and every other
dominator of u dominates v. Every vertex v ∈ V except entry (exit) has a
unique immediate pre-dominator (post-dominator), idom(v) [Lowry and Med-
lock, 1969]. For example, in Figure 7.8, vertex 4 is the immediate pre-dominator
of 5, and vertex 9 is the immediate post-dominator of 5. The edges (idom(v), v)
form a directed tree rooted at entry for pre-dominators and at exit for post-
dominators. Figures 7.9 and 7.10 show the pre- and post-dominator trees of the
flowgraph in Figure 7.8.
The problem of finding dominators was first considered in the late 1960s by Lowry
and Medlock [Lowry and Medlock, 1969], who presented an O(|V|^4) algorithm
for computing all dominators.
Figure 7.8. Flowgraph of the C program in Figure 7.7 (vertices entry, 1, ..., 16, exit; edges labeled a, ..., z).
Figure 7.9. Pre-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves of the
tree in Figure 7.10.
Definition 7.1. A vertex v ∈ V of the flowgraph is covered by a test case t
if the basic block of the program representing v is reached at least once during
the execution of t.

Definition 7.2. A set of vertices K ⊆ V is a kernel of the flowgraph G if any
set of tests which covers the vertices of K covers all vertices of G.
The following lemma states the basic property of the presented technique [Agrawal,
1994].

Lemma 7.1. If a test case t covers u ∈ V, then it covers any post-dominator of u.
Let Lpost (Lpre) denote the set of leaf vertices of the post- (pre-) dominator tree
of G. The set LDpost ⊆ Lpost contains the vertices of Lpost which pre-dominate some
vertex of Lpost:

LDpost = {v ∈ Lpost | v ∈ Pre(u) for some u ∈ Lpost, u ≠ v}.

Theorem 7.1. The set Lpost − LDpost is a minimum kernel of G.

Proof: Every vertex of G post-dominates some leaf of the post-dominator tree,
so, by Lemma 7.1, Lpost is a kernel. If v ∈ LDpost pre-dominates u ∈ Lpost,
then v is covered by any test case covering u,
since every path from entry to u contains v as well. Thus, any set of tests which
covers Lpost − LDpost covers Lpost as well. Since Lpost is a kernel, Lpost − LDpost
is a kernel, too.
Next, we prove that the set Lpost − LDpost is a minimum kernel. Suppose that
there exists another kernel, K, such that |K| < |Lpost − LDpost|. If v ∈ K and
v ∉ Lpost, then v ∈ Post(u) for some u ∈ Lpost. Since every path from u to exit
contains v, if u is reached at least once during the execution of some test case,
then v is reached, too. Therefore, K remains a kernel if we replace v by u.
Suppose we replaced all v ∈ K such that v ∉ Lpost by u ∈ Lpost such that
v ∈ Post(u). Now K ⊆ Lpost. If there exists w ∈ Lpost − K such that, for all
u ∈ K, w ∉ Pre(u), then there exists at least one path from entry to each u which
does not contain w. This means that there exists a test set, formed by the set
of paths path(u), where path(u) is a path to u which does not contain w, that
covers K but not w. According to Definition 7.2, this implies that K is not a
kernel. Therefore, to guarantee that K is a kernel, every w ∈ Lpost − K must be
a pre-dominator of some u ∈ K. This implies that |K| = |Lpost − LDpost|.
The next theorem shows that the set Lpre − LDpre is also a minimum kernel.

Theorem 7.2. |Lpost − LDpost| = |Lpre − LDpre|.

Proof: The proof is done by showing that the proof of minimality of Theorem 7.1
can be carried out starting from Lpre.
Figure 7.10. Post-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves of
the tree in Figure 7.9.
We now describe the algorithm for computing minimum ker-
nels from [Dubrova, 2005]. The pseudo-code is shown in Figure 7.11.
First, the pre- and post-dominator trees of the flowgraph G = (V, E, entry, exit),
denoted by Tpre and Tpost, are computed. Then, the numbers of leaves of
the trees, |Lpre| and |Lpost|, are compared. According to Theorems 7.1 and 7.2,
both Lpost − LDpost and Lpre − LDpre represent minimum kernels. The procedure
FindDominated is applied to the smaller of the sets Lpre and Lpost: it checks
whether the leaves Lpre of the tree Tpre are dominated by
some vertex of Lpre in the other tree, Tpost, or vice versa.
Theorem 7.3. The algorithm MinimumKernel computes a minimum kernel of
the flowgraph G in O(|V| + |E|) time.

Proof: The correctness of the algorithm follows directly from Theorems 7.1
and 7.2. The complexity of MinimumKernel is determined by the complexity
of computing the Tpre and Tpost trees. A dominator tree can be computed
in O(|V| + |E|) time [Alstrup et al., 1999]. Thus, the overall complexity is
O(|V| + |E|).
As an example, let us compute a minimum kernel for the flowgraph in Fig-
ure 7.8. Its pre- and post-dominator trees are shown in Figures 7.9 and 7.10.
Tpre has 7 leaves, Lpre = {7, 8, 9, 12, 14, 15, 16}, and Tpost has 9 leaves, Lpost =

    algorithm MinimumKernel(G = (V, E, entry, exit))
        Tpre = pre-dominator tree of G;
        Tpost = post-dominator tree of G;
        Lpre = set of leaves of Tpre;
        Lpost = set of leaves of Tpost;
        if |Lpre| <= |Lpost| then
            K = Lpre - FindDominated(Lpre, Tpost);
        else
            K = Lpost - FindDominated(Lpost, Tpre);
        return K;
    end

Figure 7.11. Pseudo-code of the minimum kernel computation algorithm.
{1, 3, 4, 6, 7, 8, 12, 13, 14}. So, we check which of the leaves of Tpre dominate
at least one other leaf of Tpre in Tpost. The leaves Lpre are marked as shaded circles
in Tpost in Figure 7.10. We can see that, in Tpost, vertex 9 dominates 7 and
8, vertex 15 dominates 12 and 14, and vertex 16 dominates all other leaves of
Lpre. Thus, LDpre = {9, 15, 16}. The minimum kernel Lpre − LDpre consists of four
vertices: 7, 8, 12 and 14.
For comparison, let us compute the minimum kernel given by Lpost − LDpost.
The leaves Lpost are marked as shaded circles in Tpre in Figure 7.9. We can see
that, in Tpre, vertex 4 dominates 6, 7 and 8, vertex 6 dominates 7 and 8, vertex
13 dominates 14, vertex 3 dominates all leaves of Lpost except 1, and vertex
1 dominates all leaves of Lpost. Thus, LDpost = {1, 3, 4, 6, 13}. The minimum
kernel Lpost − LDpost consists of four vertices: 7, 8, 12 and 14. In this example, the
kernels Lpre − LDpre and Lpost − LDpost are the same, but this is not always the case.
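The kernel computation can be sketched on a small flowgraph of our own (simpler than Figure 7.8). For clarity, dominator sets are computed with a naive data-flow iteration rather than the O(|V| + |E|) algorithm of [Alstrup et al., 1999].

```python
def dominator_sets(succ, root):
    """dom[v] = set of vertices contained in every path from root to v."""
    nodes = set(succ)
    pred = {v: [u for u in nodes if v in succ[u]] for v in nodes}
    dom = {v: set(nodes) for v in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in nodes - {root}:
            new = set.intersection(*(dom[p] for p in pred[v])) | {v}
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

def minimum_kernel(succ, entry, exit_):
    """Compute Lpost - LDpost: leaves of the post-dominator tree minus
    those which pre-dominate another such leaf."""
    rev = {v: [u for u in succ if v in succ[u]] for v in succ}
    pre = dominator_sets(succ, entry)       # pre-dominators
    post = dominator_sets(rev, exit_)       # post-dominators
    def leaves(dom):
        # a leaf of the dominator tree strictly dominates no other vertex
        return {v for v in dom if not any(v in dom[u] for u in dom if u != v)}
    l_post = leaves(post)
    ld_post = {v for v in l_post if any(u != v and v in pre[u] for u in l_post)}
    return l_post - ld_post

# entry -> 1 -> {2, 3} -> 4 -> exit: any tests covering 2 and 3 cover all vertices.
graph = {'entry': ['1'], '1': ['2', '3'], '2': ['4'], '3': ['4'],
         '4': ['exit'], 'exit': []}
```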
A 100% branch coverage can be achieved by constructing a set of tests for
the edge kernel. The minimum kernels for the edge dominator trees in Figures 7.12
and 7.13 are Lpre − LDpre = {i, h, k, m, p, q, t, y} and
Lpost − LDpost = {f, g, k, r, q, s, m, y}.
Figure 7.12. Edge pre-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves
of the tree in Figure 7.13.
5. Problems
7.1. A program consists of 10 independent routines, all of them being used
in the normal operation of the program. The probability that a routine
is faulty is 0.10 for each of the routines. It is intended to use 3-version
programming, with voting to be conducted after execution of each routine.
The effectiveness of the voting in eliminating faults is 0.85 when one of the
three routines is faulty and 0 when more than one routine is faulty. What is
the probability of a fault-free program:
(a) When only a single version is produced and no routine testing is con-
ducted.
(b) When only a single version of each routine is used, but extensive routine
testing is conducted that reduces the fault content to 10% of the original
level.
(c) When three-version programming is used.
Figure 7.13. Edge post-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves
of the tree in Figure 7.12.
1. Introduction
The gene regulatory network is one of the most important signaling net-
works in living cells. It is composed of the interactions of proteins with the
genome. In this section, we consider a model of the gene regulatory network
called the Kauffman network. Introduced by Kauffman in 1969 in the context of
gene expression and fitness landscapes [Kaufmann, 1969], Kauffman networks
were later applied to the problems of cell differentiation, immune response,
evolution, and neural networks [Aldana et al., ]. They have also attracted the
interest of physicists due to their analogy with the disordered systems studied
in statistical mechanics, such as the mean field spin glass [Derrida and Pomeau,
1986, Derrida and Flyvbjerg, 1986, Derrida and Flyvbjerg, 1987].
The Kauffman network is a class of random nk-Boolean networks. An nk-
Boolean network is a synchronous Boolean automaton with n vertices. Each
vertex has exactly k incoming edges, assigned at random, and an associated
Boolean function. Functions are selected so that they evaluate to the values
0 and 1 with given probabilities p and 1 − p, respectively. Time is viewed
as proceeding in discrete steps. At each step, the new state of a vertex v is
a Boolean function of the previous states of the predecessors of v. Cellular
automata can be considered as a special case of random nk-Boolean networks,
in which the incoming edges are assigned from the immediate neighbors.
The Kauffman network is an nk-Boolean network with input fan-in two
(k = 2) and the functions assigned independently and uniformly at random from
the set of 2^(2^k) = 16 possible Boolean functions (p = 0.5). The functions represent
the rules of regulatory interactions between the genes. The states of vertices
represent the status of gene expressions; "1" indicates that a gene is expressed,
"0" - not expressed. For edges, the value "1" means that a protein is present,
"0" - absent.
The parameters k and p determine the dynamics of the network. If a vertex
controls many other vertices and the number of controlled vertices grows in
time, the network is said to be in a chaotic phase. Typically, such behavior
occurs for large values of k, close to n. The next states are random with respect to the
previous ones. The dynamics of the network is very sensitive to changes in the
state of a particular vertex, associated Boolean function, or network connection.
If a vertex controls only a small number of other vertices and their number
remains constant in time, the network is said to be in a frozen phase. Usually,
independently of the initial state, after a few steps the network reaches a stable
state. This behavior typically occurs for small values of k, such as k = 0 or 1.
There is a critical line between the frozen and the chaotic phases, when
the number of vertices controlled by a vertex grows in time, but only up to a
certain limit. Statistical features of nk-Boolean networks on the critical line are
shown to match the characteristics of real cells and organisms [Kaufmann, 1969,
Kaufmann, 1993]. The minimal disturbances typically cause no variations in
the network’s dynamics. Only some rare events evoke a radical change.
For a given probability p, there is a critical number of inputs kc below which
the network is in the frozen phase and above which the network is in the chaotic
phase [Derrida and Pomeau, 1986]:

kc = 1 / (2p(1 − p)).

For p = 0.5, the critical number of inputs is kc = 2, so the Kauffman networks
are on the critical line.
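The critical-connectivity condition can be evaluated directly; a one-line sketch:

```python
def critical_k(p):
    """Critical number of inputs k_c = 1 / (2p(1 - p)) [Derrida and Pomeau, 1986]."""
    return 1 / (2 * p * (1 - p))

# p = 0.5 gives k_c = 2, placing Kauffman networks (k = 2) on the critical line;
# any bias away from p = 0.5 raises k_c above 2.
```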
Since the number of possible states of a Kauffman network is finite (up to
2^n), any sequence of consecutive states of a network eventually converges to
either a single state, or a cycle of states, called an attractor. The number and
length of the attractors represent two important parameters of the cell modeled by
a Kauffman network. The number of attractors corresponds to the number of
different cell types. For example, humans have n ≈ 25,000 genes and 254 cell
types. The attractor's length corresponds to the cell cycle time. Cell cycle time
refers to the amount of time required for a cell to grow and divide into two
daughter cells. The length of the total cell cycle varies for different types of
cells.
The human body has a sophisticated system for maintaining normal cell repair
and growth. The body interacts with cells through a feedback system that
signals a cell to enter different phases of the cycle. If a person is sick, e.g. has
cancer, then this feedback system does not function normally and cancer cells
enter the cell cycle independently of the body's signals. The number and length
of the attractors of a Kauffman network serve as indicators of the health of the
modeled cell.
2. Kauffman Networks
The Kauffman network is a directed cyclic graph G = (V, E), where V is the
set of vertices and E ⊆ V × V is the set of edges connecting the vertices.
The set V has n vertices. Each vertex v ∈ V has exactly two incoming edges,
selected at random from V (including v itself). The set of predecessors of v is
denoted by Pv, Pv = {u ∈ V | (u, v) ∈ E}. The set of successors of v is denoted
by Sv, Sv = {u ∈ V | (v, u) ∈ E}.
Each vertex v ∈ V has an associated Boolean function fv of type {0, 1}^2 →
{0, 1}. Functions are selected so that they evaluate to the values 0 and 1 with
equal probability, p = 0.5.
The state σv of a vertex v at time t + 1 is determined by the states of its
predecessors u1, u2 ∈ Pv as:

    σv(t + 1) = fv(σu1(t), σu2(t)).

The vector (σv1(t), σv2(t), ..., σvn(t)) represents the state of the network at time
t. An example of a Kauffman network with ten vertices is shown in Figure 8.1.
An infinite sequence of consecutive states of a Kauffman network is called
a trajectory. A trajectory is uniquely defined by the initial state. Since the
number of possible states is finite, all trajectories eventually converge to either
a single state or a cycle of states, called an attractor. The basin of attraction of
an attractor A is the set of all trajectories leading to A. The attractor length is
the number of states in the attractor's cycle.
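These notions are easy to explore by simulation. The sketch below (all names are ours; a toy illustration rather than an algorithm from the text) builds a random Kauffman network with two inputs per vertex and follows a trajectory until a state repeats, which gives the attractor length:

```python
import random

def random_kauffman(n, seed=0):
    """Each vertex gets two random predecessors and a random 2-input
    Boolean function, stored as a 4-entry truth table."""
    rnd = random.Random(seed)
    preds = [(rnd.randrange(n), rnd.randrange(n)) for _ in range(n)]
    funcs = [tuple(rnd.randint(0, 1) for _ in range(4)) for _ in range(n)]
    return preds, funcs

def next_state(state, preds, funcs):
    """Synchronous update: sigma_v(t+1) = f_v(sigma_u1(t), sigma_u2(t))."""
    return tuple(funcs[v][2 * state[preds[v][0]] + state[preds[v][1]]]
                 for v in range(len(state)))

def attractor_length(state, preds, funcs):
    """Follow the trajectory from `state` until a state repeats;
    the distance between the two visits is the attractor length."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = next_state(state, preds, funcs)
        t += 1
    return t - seen[state]

preds, funcs = random_kauffman(10, seed=42)
print(attractor_length((0,) * 10, preds, funcs))
```

Because the state space has at most 2^n states, the loop in `attractor_length` is guaranteed to terminate.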
3. Redundant Vertices
Redundancy is an essential feature of biological systems, ensuring their correct
behavior in the presence of internal or external disturbances. An overwhelming
majority of the vertices of a typical Kauffman network are redundant, i.e. they
can be removed without affecting the dynamic behavior of the network.
Figure 8.1. Example of a Kauffman network. The state of a vertex vi at time t + 1 is given by
σvi(t + 1) = fvi(σvj(t), σvk(t)), where vj and vk are the predecessors of vi, and fvi is the Boolean
function associated to vi (shown by the Boolean expression inside vi).
algorithm RemoveRedundant(G(V, E))
    /* I. Simplifying functions with one predecessor */
    for each v ∈ V do
        if two incoming edges of v come from the same vertex then
            Simplify fv;
    /* II. Removing constants and implied */
    R1 = ∅;
    for each v ∈ V do
        if fv is a constant then
            Append v at the end of R1;
    for each v ∈ R1 do
        for each u ∈ Sv − R1 do
            Simplify fu by substituting the constant fv;
            if fu is a constant then
                Append u at the end of R1;
    Remove all v ∈ R1 and all edges connected to v;
    /* III. Simplifying 1-variable functions */
    for each v ∈ V do
        if fv is a 1-variable function then
            Remove the edge (u, v), where u is the
            predecessor of v on which v does not depend;
    /* IV. Removing functions with no output and implied */
    R2 = ∅;
    for each v ∈ V do
        if Sv = ∅ then
            Append v at the end of R2;
    for each v ∈ R2 do
        for each u ∈ Pv − R2 do
            if all ancestors of u are in R2 then
                Append u at the end of R2;
    Remove all v ∈ R2 and all edges connected to v;
end
Figure 8.2. The algorithm for finding redundant vertices in Kauffman networks.
RemoveRedundant first checks whether there are vertices v with two
incoming edges coming from the same vertex. If yes, the associated functions
fv are simplified. In the example in Figure 8.1, there are no such cases of
redundancy.
Then, RemoveRedundant classifies as redundant all vertices v whose
associated function fv is constant 0 or constant 1. Such vertices are collected
in a list R1. Then, for every vertex v ∈ R1, the ancestors of v are visited and the
functions associated to the ancestors are simplified. The simplification is done
by substituting the constant value of fv in the function of the ancestor u. If,
as a result of the simplification, the function fu reduces to a constant, then u is
appended to R1.
Figure 8.3. Reduced network GR for the Kauffman network in Figure 8.1.
In the example in Figure 8.1, vertices v3 and v6 have constant associated
functions. Thus, initially R1 = {v3, v6}. The vertex v3 has an ancestor v4.
Figure 8.4. State transition graph of the Kauffman network in Figure 8.3. Each vertex represents
a 5-tuple (σv1, σv2, σv5, σv7, σv9) of values of states on the relevant vertices.
The Boolean function associated to v4 is a single-variable function which depends
only on its predecessor v3. By substituting σv3 by fv3 = 0 in the Boolean
expression of v4, we get fv4 = ¬0 = 1. Since fv4 reduces to a constant, it is
appended to R1.
The vertex v6 has no ancestors, so no simplification is done. The vertex v4
has ancestors v0 and v3. The vertex v3 is already a constant. By substituting σv4
by fv4 = 1 in the Boolean function associated to v0, we get fv0 = 1 + σv0 = 1.
The vertex v0 is appended to R1.
Similarly, the associated functions of the ancestors v2 and v9 of v0 are
simplified to fv2 = 1 · σv9 = σv9 and fv9 = 1 ⊕ σv5 = ¬σv5. The ancestor v8 remains
unchanged because it does not depend on v0. Since no new vertices are
appended to R1, phase II ends. The vertices R1 = {v0, v3, v4, v6} and all edges
connected to these vertices are removed from the network.
Next, RemoveRedundant finds all vertices whose associated function
fv is a single-variable function. The edge between v and the predecessor of v
on which v does not depend is removed.
In the example in Figure 8.1, vertices v1, v2, v5, v8 and v9 have single-variable
associated functions. Recall that, after the simplifications in phase II,
fv2 = σv9, fv9 = ¬σv5 and the function of v4 is a constant. The edges from the
predecessors of v1, v2, v5, v8 on which they do not depend are removed.
Next, RemoveRedundant classifies as redundant all vertices which have
no ancestors. Such vertices are collected in a list R2. For every vertex v ∈ R2,
both predecessors of v are visited. If all ancestors of some predecessor u ∈ Pv
are redundant, u is appended at the end of R2.
In the example in Figure 8.1, the vertex v8 has no ancestors, thus R2 = {v8}.
None of the other vertices has a predecessor with all ancestors redundant. So, no
vertices are added to R2 in phase IV. The vertex v8 is removed from the network.
The worst-case time complexity of RemoveRedundant is O(|V| + |E|),
where |V| is the number of vertices and |E| is the number of edges in G.
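Phase II of the reduction, the propagation of constant functions, can be sketched as a work-list algorithm. The code below is an illustrative re-implementation under our own data layout (each function stored as a 4-entry truth table over its two predecessors), not the book's code:

```python
from collections import deque

def find_constant_vertices(preds, funcs):
    """Collect vertices whose function reduces to a constant (the list R1).
    funcs[v] is a 4-entry truth table indexed by 2*sigma_a + sigma_b,
    where (a, b) = preds[v]."""
    n = len(preds)
    succ = [[] for _ in range(n)]          # S_v: vertices that read v
    for v, (a, b) in enumerate(preds):
        succ[a].append(v)
        succ[b].append(v)
    const = {v: funcs[v][0] for v in range(n) if len(set(funcs[v])) == 1}
    work = deque(const)
    while work:
        v = work.popleft()
        for u in succ[v]:
            if u in const:
                continue
            a, b = preds[u]
            # keep only truth-table rows consistent with the known constants
            rows = [i for i in range(4)
                    if (a not in const or (i >> 1) & 1 == const[a])
                    and (b not in const or (i & 1) == const[b])]
            vals = {funcs[u][i] for i in rows}
            if len(vals) == 1:             # f_u reduced to a constant
                const[u] = vals.pop()
                work.append(u)
    return const
```

Each vertex enters the work list at most once and each edge is inspected a constant number of times, matching the O(|V| + |E|) bound stated above.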
As we mentioned before, RemoveRedundant might not identify all cases
of functional redundancy. For example, a vertex may have a constant output
value due to the correlation of its input variables: if a vertex v with
an associated OR (AND) function has predecessors u1 and u2 with functions
fu1 = σw and fu2 = ¬σw, then the value of fv is always 1 (0). Such cases of
redundancy are not detected by RemoveRedundant.
Let GR be the reduced network obtained from G by removing redundant
vertices. The reduced network for the example in Figure 8.1 is shown in
Figure 8.3, and its state transition graph is shown in Figure 8.4. Each vertex of
this graph represents a 5-tuple of values of states on the relevant vertices
v1, v2, v5, v7, v9. There are two attractors:
{01111, 01110, 00100, 10000, 10011, 01011}, of length six, and {00101, 11010, 00111, 01010},
of length four. By Definition 8.1, by removing redundant vertices we do not
change the total number and length of attractors in a Kauffman network. Therefore,
G has the same number and length of attractors as GR.
4. Connected Components
Vertices of GR induce a number of connected components.
Definition 8.2. Two vertices are in the same connected component if and
only if there is an undirected path between them.
A path is called undirected if it ignores the direction of edges. For example,
the vertices of the reduced network in Figure 8.3 induce two connected
components, {v2, v5, v9} and {v1, v7}.
Finding connected components can be done in O(|V| + |E|) time, where |V| is
the number of vertices and |E| is the number of edges of GR, using
depth-first search [Tarjan, 1972].
To find connected component number i, the search procedure is called from
a vertex v which has not yet been assigned to a component.
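A straightforward way to label the components (using breadth-first search for brevity, rather than the depth-first search of [Tarjan, 1972]; all names are ours):

```python
from collections import deque

def connected_components(n, edges):
    """Label the vertices of an n-vertex graph with component numbers,
    treating every edge as undirected (cf. Definition 8.2)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    comp = [-1] * n
    c = 0
    for s in range(n):
        if comp[s] != -1:
            continue
        comp[s] = c                 # start a new component at s
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if comp[w] == -1:
                    comp[w] = c
                    q.append(w)
        c += 1
    return comp
```

Both BFS and DFS visit each vertex and edge once, so either achieves the O(|V| + |E|) bound.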
Let GA be a connected component of GR and AA be an attractor of GA. An
attractor AA of length L is represented by a sequence of states {Σ0, Σ1, ..., Σ(L−1)},
where Σ((i+1) mod L) is the next state of the state Σi, i ∈ {0, 1, ..., L − 1}.
The support set of an attractor AA, sup(AA), is the set of vertices of GA. For
example, the left-hand side connected component in Figure 8.3 has the support
set {v2, v5, v9}.
Definition 8.3. Given two attractors AA = {Σ^A_0, Σ^A_1, ..., Σ^A_(LA−1)} and
AB = {Σ^B_0, Σ^B_1, ..., Σ^B_(LB−1)} such that sup(AA) ∩ sup(AB) = ∅, the
composition of AA and AB is a set of attractors defined by:

    AA ∘ AB = ∪_{k=0}^{d−1} A^k,   A^k = {Σ^k_i},   Σ^k_i = Σ^A_(i mod LA) × Σ^B_((i+k) mod LB),

where d is the greatest common divisor of LA and LB, i ∈ {0, 1, ..., m − 1}, m is
the least common multiple of LA and LB, and "×" is the Cartesian product. For
example, for LA = 2, LB = 3 and k = 0:

    Σ^0_0 = Σ^A_0 × Σ^B_0        Σ^0_3 = Σ^A_1 × Σ^B_0
    Σ^0_1 = Σ^A_1 × Σ^B_1        Σ^0_4 = Σ^A_0 × Σ^B_1
    Σ^0_2 = Σ^A_0 × Σ^B_2        Σ^0_5 = Σ^A_1 × Σ^B_2
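Definition 8.3 translates directly into code. In the sketch below (names are ours), states are strings and the Cartesian product of two states over disjoint supports is their concatenation; composing attractors of lengths LA and LB yields gcd(LA, LB) attractors of length lcm(LA, LB):

```python
from math import gcd

def compose(A, B):
    """Compose attractors A and B (state sequences over disjoint supports)
    following Definition 8.3."""
    LA, LB = len(A), len(B)
    d = gcd(LA, LB)
    m = LA * LB // d              # least common multiple
    return [[A[i % LA] + B[(i + k) % LB] for i in range(m)]
            for k in range(d)]

# Two attractors of lengths 2 and 4 compose into gcd(2, 4) = 2
# attractors of length lcm(2, 4) = 4 each.
attractors = compose(['0', '1'], ['a', 'b', 'c', 'd'])
```

The d attractors together contain every one of the LA · LB combined states exactly once, which is the content of the lemma that follows.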
Lemma 8.1. The composition AA ∘ AB consists of all possible cyclic sequences
of states which can be obtained from AA and AB.
Proof: By Definition 8.3, the result of the composition of AA and AB is d
attractors {A^0, A^1, ..., A^(d−1)} of length m each, where d is the greatest common
divisor of LA and LB and m is the least common multiple of LA and LB.
Consider any two states of the attractor A^k, Σ^k_i and Σ^k_j, for some i, j ∈
{0, 1, ..., m − 1}, i ≠ j, and some k ∈ {0, 1, ..., d − 1}. By Definition 8.3,

    Σ^k_i = Σ^A_(i mod LA) × Σ^B_((i+k) mod LB)
and
    Σ^k_j = Σ^A_(j mod LA) × Σ^B_((j+k) mod LB).

We prove that

    (Σ^A_(i mod LA) = Σ^A_(j mod LA))  implies  (Σ^B_((i+k) mod LB) ≠ Σ^B_((j+k) mod LB)).

If Σ^A_(i mod LA) = Σ^A_(j mod LA), then we can express j as

    j = i + X · LA,                                  (8.2)

where X is some constant which satisfies X · LA < m.
By substituting j by (8.2) in the expression (j + k) mod LB, we get

    (j + k) mod LB = (i + k + X · LA) mod LB.        (8.3)

Clearly, if X · LA is not evenly divisible by LB, then the right-hand side of
the expression (8.3) is not equal to (i + k) mod LB. On the other hand, X · LA
cannot be evenly divisible by LB, because then X · LA would be a common
multiple of LA and LB smaller than their least common multiple m. Thus

    (Σ^B_((i+k) mod LB) = Σ^B_((j+k) mod LB))  implies  (Σ^A_(i mod LA) ≠ Σ^A_(j mod LA)).

Therefore, for a given k ∈ {0, 1, ..., d − 1}, no two states in the attractor A^k are
equal.
Similarly to the above, we can show that no two states in two different
attractors can be the same. If the first parts of two states are the same, then the
second parts differ due to the property

    (k + X · LA) mod LB ≠ 0

for any k ∈ {1, 2, ..., d − 1}.
There are LA · LB different pairs of indexes in the Cartesian product {1, ...,
LA} × {1, ..., LB}. Thus, since LA · LB = m · d, at least d attractors of length
m are necessary to represent all possible combinations. Since no two states of
A^0, A^1, ..., A^(d−1) are the same, exactly d attractors of length m are sufficient to
represent all possible combinations.
Let {G1, G2, ..., Gp} be the set of components of GR. Throughout the rest
of the section, we use Ni to denote the number of attractors of Gi, Aij to
denote the jth attractor of Gi, and Lij to denote the length of Aij, i ∈ {1, 2, ..., p},
j ∈ {1, 2, ..., Ni}.
Let I = I1 × I2 × ··· × Ip be the Cartesian product of the sets Ii = {1, 2, ..., Ni},
where p is the number of components of GR. The set Ii represents the indexes of
the attractors of the component Gi. For example, if Ni = 3, then Gi has 3 attractors:
Ai1, Ai2 and Ai3, and the set Ii is Ii = {1, 2, 3}. The set I enumerates all
possible combinations of attractors of the components.
Lemma 8.2. The total number of attractors in the reduced network GR is

    N = Σ_{(i1,...,ip) ∈ I}  Π_{j=2}^{p}  (((L_{1i1} ∨ L_{2i2}) ∨ L_{3i3}) ··· ∨ L_{(j−1)i(j−1)}) ∧ L_{jij},

where "∨" is the least common multiple operation and "∧" is the greatest common
divisor operation.
Proof: (1) The state space of any component is partitioned into basins of
attraction. There are no common states between different basins of attraction.
Lemma 8.3. The maximum length of attractors in the reduced network GR is

    Lmax = max_{(i1,...,ip) ∈ I}  L_{1i1} ∨ L_{2i2} ∨ L_{3i3} ∨ ··· ∨ L_{pip}.
Figure 8.5. A Kauffman network with two connected components.

Figure 8.6. (a) State space of the component G1 = {v2, v5, v9}. There are two attractors,
A11 = {011, 100} and A12 = {000, 001, 101, 111, 110, 010}. (b) State space of the component
G2 = {v1, v7}. There is one attractor, A21 = {00, 10, 11, 01}.
As an example, consider the network in Figure 8.5 with two components:
G1 = {v2, v5, v9} and G2 = {v1, v7}. Their state spaces are shown in Figure 8.6.
The first component has two attractors: A11 = {011, 100} of length L11 = 2 and
A12 = {000, 001, 101, 111, 110, 010} of length L12 = 6. The second component
has one attractor, A21 = {00, 10, 11, 01}, of length L21 = 4.
The Cartesian product of I1 = {1, 2} and I2 = {1} contains 2 pairs: I =
{(1, 1), (2, 1)}. For the pair (1, 1) we have L11 ∧ L21 = 2 ∧ 4 = 2 and L11 ∨ L21 =
2 ∨ 4 = 4. So, A11 and A21 compose into two attractors of length 4.
Similarly, for the pair (2, 1) we have L12 ∧ L21 = 6 ∧ 4 = 2 and L12 ∨ L21 =
6 ∨ 4 = 12. So, A12 and A21 compose into two attractors of length 12:

    A12 ∘ A21 = { {00000, 00110, 10111, 11101, 11000, 01010,
                   00011, 00101, 10100, 11110, 11011, 01001},
                  {00010, 00111, 10101, 11100, 11010, 01011,
                   00001, 00100, 10110, 11111, 11001, 01000} }.

The total number of attractors is N = 4. The maximum attractor length is
Lmax = 12.
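The two quantities in this example can be checked mechanically. The helper below (ours, not from the text) walks the Cartesian product I and accumulates gcds and lcms of the component attractor lengths:

```python
from math import gcd
from itertools import product

def count_attractors(lengths_per_component):
    """Return (N, Lmax) given, for each component of G_R, the list of
    its attractor lengths."""
    N, Lmax = 0, 0
    for combo in product(*lengths_per_component):
        n_comb, l = 1, combo[0]
        for L in combo[1:]:
            n_comb *= gcd(l, L)      # composing multiplies the attractor count
            l = l * L // gcd(l, L)   # and the common length is the lcm
        N += n_comb
        Lmax = max(Lmax, l)
    return N, Lmax

# Components G1 (attractor lengths 2 and 6) and G2 (attractor length 4):
print(count_attractors([[2, 6], [4]]))  # -> (4, 12)
```

This reproduces N = 4 and Lmax = 12 for the example above.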
6. Simulation Results
This section shows simulation results for Kauffman networks of sizes from
ô ô Ç¿ú@Æ@ÇsûÆü as a function of n. Table 8.1 presents
10 to 107 vertices. Figure 8.7 plots the average number of relevant vertices
computed using Ï~õ÷öùø¿Ï Ï
the same results numerically in columns 1 and 2. The number of relevant
vertices scales as , n.
Column 4 gives the average number of connected components in G R as a
function of n. The number of components grows of order of log n.
%&
Column 3 of Table 8.1 shows the number of vertices in the largest component
of GR . As we can see, the size of this component is Θ n . “Giant” component
phenomena is well-studied in general random graphs [Molloy and Reed, 1998].
* 8
%& ¹
For a random graph with n vertices in which each edge appears independently
with probability p c n, c 1, the size of the largest component of the graph
is known to be Θ n . No theoretical explanation why the giant component
phenomena appears in the the subgraph induced by the relevant vertices of
Kauffman networks has been found yet.
Table 8.1. Simulation results for Kauffman networks of size n.

    |G|     |GR|    size of the largest    number of components
                    component of GR        in GR
    10      5       5                      1
    10^2    25      25                     1
    10^3    93      92                     1
    10^4    270     266                    2
    10^5    690     682                    3
    10^6    1614    1596                   3
    10^7    3502    3463                   4
Kauffman networks have been shown to be stable with respect to the following
types of disturbances:

- a predecessor of a vertex v is changed, i.e. the edge (u, v) is replaced by an
  edge (w, v), v, u, w ∈ V;
- the state of a vertex is changed to the complemented value;
- the Boolean function of a vertex is changed to a different Boolean function.
On one hand, the stability of Kauffman networks is due to the large percentage
of redundancy in the network: on the order of n − √n of the n vertices are typically
redundant. On the other hand, the stability is due to the non-uniqueness of the
network representation: the same dynamic behavior (state space transition
graph) can be achieved by many different Kauffman networks.
An essential feature of living organisms is their capability to adapt to a
changing environment. Kauffman networks have been shown to be successful
in evolving to a predefined target function, supporting their use as models of
living cells.
[Agrawal, 1994] Agrawal, H. (1994). Dominators, super blocks, and program coverage. In
Symposium on principles of programming languages, pages 25–34, Portland, Oregon.
[Agrawal, 1999] Agrawal, H. (1999). Efficient coverage testing using global dominator graphs.
In Workshop on Program Analysis for Software Tools and Engineering, pages 11–20,
Toulouse, France.
[Aho and Ullman, 1972] Aho, A. V. and Ullman, J. D. (1972). The Theory of Parsing, Trans-
lating, and Compiling, Vol. II. Prentice-Hall, Englewood Cliffs, NJ.
[Aldana et al., ] Aldana, M., Coopersmith, S., and Kadanoff, L. P. Boolean dynamics with
random couplings. http://arXiv.org/abs/adap-org/9305001.
[Alstrup et al., 1999] Alstrup, S., Harel, D., Lauridsen, P. W., and Thorup, M. (1999). Domi-
nators in linear time. SIAM Journal on Computing, 28(6):2117–2132.
[Ball, 1993] Ball, T. (1993). What’s in a region?: or computing control dependence regions in
near-linear time for reducible control flow. ACM Letters on Programming Languages and
Systems (LOPLAS), 2:1–16.
[Bastola and Parisi, 1998] Bastola, U. and Parisi, G. (1998). The modular structure of Kauffman
networks. Phys. D, 115:219.
[Beizer, 1990] Beizer, B. (1990). Software Testing Techniques. Van Nostrand Reinhold, New
York.
[Bertolino and Marre, 1994] Bertolino, A. and Marre, M. (1994). Automatic generation of
path covers based on the control flow analysis of computer programs. IEEE Transactions on
Software Engineering, 20:885 – 899.
[Buchsbaum et al., 1998] Buchsbaum, A. L., Kaplan, H., Rogers, A., and Westbrook, J. R.
(1998). A new, simpler linear-time dominators algorithm. ACM Transactions on Program-
ming Languages and Systems, 20(6):1265–1296.
[Derrida and Flyvbjerg, 1986] Derrida, B. and Flyvbjerg, H. (1986). Multivalley structure in
Kaufmann’s model: Analogy with spin glass. J. Phys. A: Math. Gen., 19:L1103.
[Derrida and Flyvbjerg, 1987] Derrida, B. and Flyvbjerg, H. (1987). Distribution of local mag-
netizations in random networks of automata. J. Phys. A: Math. Gen., 20:L1107.
[Derrida and Pomeau, 1986] Derrida, B. and Pomeau, Y. (1986). Random networks of au-
tomata: a simple annealed approximation. Biophys. Lett., 1:45.
[Dubrova, 2005] Dubrova, E. (2005). Structural testing based on minimum kernels. In Pro-
ceedings of DATE’2005, Munich, Germany.
[Dubrova and Teslenko, 2005] Dubrova, E. and Teslenko, M. (2005). Compositional properties
of random boolean networks. Physics Review Letters.
[Dubrova et al., 2005] Dubrova, E., Teslenko, M., and Tenhunen, H. (2005). Computing attrac-
tors in dynamic networks. In Proceedings of International Symposium on Applied Computing
(IADIS’2005), pages 535–543, Algarve, Portugal.
[Harrel, 1985] Harrel, D. (1985). A linear time algorithm for finding dominators in flow graphs
and related problems. Annual Symposium on Theory of Computing, 17(1):185–194.
[Lengauer and Tarjan, 1979] Lengauer, T. and Tarjan, R. E. (1979). A fast algorithm for finding
dominators in a flowgraph. Transactions of Programming Languages and Systems, 1(1):121–
141.
[Lowry and Medlock, 1969] Lowry, E. S. and Medlock, C. W. (1969). Object code optimization.
Communications of the ACM, 12(1):13–22.
[Molloy and Reed, 1998] Molloy, M. and Reed, B. (1998). The size of the giant component of
a random graph with a given degree sequence. Combin. Probab. Comput., 7:295–305.
[Ntafos, 1988] Ntafos, S. (1988). A comparison of some structural testing strategies. IEEE
Transactions on Software Engineering, 14(6):868–874.
[Podgurski, 1991] Podgurski, A. (1991). Forward control dependence, chain equivalence, and
their preservation by reordering transformations. Technical Report CES-91-18, Computer
Engineering & Science Department, Case Western Reserve University, Cleveland, Ohio,
USA.
[Purdom and Moore, 1972] Purdom, P. W. and Moore, E. F. (1972). Immediate predominators
in a directed graph. Communications of the ACM, 15(8):777–778.
[Roper, 1994] Roper, M. (1994). Software Testing. McGraw-Hill Book Company, London.
[Suurkula, 2004] Suurkula, J. (2004). Over 95 percent of DNA has largely unknown function.
http://www.psrast.org/junkdna.htm.
[Tarjan, 1972] Tarjan, R. E. (1972). Depth-first search and linear graph algorithms. SIAM
Journal on Computing, 1(2):146–160.
[Watson, 1996] Watson, A. H. (1996). Structured testing: Analysis and extensions. Technical
Report TR-528-96, Princeton University.