

pacemakers, for instance, is the energy source, since circuit failures in pacemakers occur with a probability of less than 140 × 10⁻⁹ per hour.

Besides this, our technical systems are more and more put to use in hostile environments; they have to be suitable for a wider variety of environments. Just think of applications in the process industry (heat, humidity, chemical substances) and mobile applications in aircraft, ships, and vehicles (mechanical vibrations, shocks, badly defined power supply voltages, high electromagnetic interference levels).

All in all, these are sufficient reasons for reliability engineering to be so much in the limelight these days. Add to that the emphasis on reliability in situations where no maintenance is possible because of an isolated location (unmanned arctic weather stations, remote space probes, underwater amplification stations in transatlantic cables, etc.). Even if maintenance were possible, it is often better (more cost-effective) to increase the initial reliability of a system because of the high costs associated with that system being down for repairs. Despite the higher initial costs, the life cycle cost may turn out to be lower. This is called the invest now, save later principle of reliability.

The socio-ethical aspects of products with too low a reliability must also not be underestimated. Such low-reliability disposable products lead to a waste of labour, energy, and raw materials that are becoming more and more scarce.

1.3 DEFINITION

The concept of reliability has been interpreted in many ways in numerous works. Since many of these do not agree in content, it is expedient to examine the main ones.

The following definitions of reliability are most often met with in the literature.

1. Reliability is the integral of the distribution of probabilities of failure-free operation from the instant of switch-on to the first failure.

2. The reliability of a component (or a system) is the probability that the component (or the system) will not fail for a time t.

3. Reliability is the probability that a device will operate without failure for a given period of time under given operating conditions.

4. Reliability is the mean operating time of a given specimen between two failures.

5. The reliability of a system is called its capacity for failure-free operation for a definite period of time under given operating conditions, and for minimum time lost for repair and preventive maintenance.

6. The reliability of equipment is arbitrarily assumed to be the equipment's capacity to maintain given properties under specified operating conditions and for a given period of time.

One of the definitions which has been accepted by most contemporary reliability authorities is given by the Electronic Industries Association (EIA), USA (formerly known as RETMA), which states:

The reliability of an item (a component, a complex system, a computer program or a human being) is defined as the probability of performing its purpose adequately for the period of time intended under the operating and environmental conditions encountered.

This definition stresses four elements:

1. Probability
2. Adequate performance
3. Time
4. Operating and environmental conditions.

The true reliability is never exactly known, but numerical estimates quite close to this value can be obtained by the use of statistical methods and probability calculations. How close the statistically estimated reliability comes to the true reliability depends on the amount of testing, the completeness of field service reporting of all successes and failures, and other essential data. For the statistical evaluation of an equipment, the equipment has to be operated and its performance observed for a specified time under actual operating conditions in the field or under well-simulated conditions in a laboratory. Criteria of what is considered adequate performance have to be exactly spelled out for each case, in advance.

Measurement of the adequate performance of a device requires measuring all important performance parameters. As long as these parameters remain within the specified limits, the equipment is judged as operating satisfactorily. When the performance parameters drift out of the specified tolerance limits, the equipment is judged as having malfunctioned or failed. For instance, if the gain of an electronic amplifier reduces to a value K1 from the designed ...

... analysis begins with the definition of an undesirable event and traces this event down through the system to identify basic causes. In systems parlance, the FMEA is a bottom-up procedure while the FTA is a top-down technique.

1.4 CAUSES OF FAILURES

The specific causes of failures of components and equipments in a system can be many. Some are known and others are unknown due to the complexity of the system and its environment. A few of them are listed below:

1. Poor Design, Production and Use

Poor design and incorrect manufacturing techniques are obvious reasons for low reliability. Some manufacturers hesitate to invest more money in an improved design and modern techniques of manufacturing and testing. Improper selection of materials is another cause of poor design.

Components and equipments do not operate in the same manner in all conditions. A complete knowledge of their characteristics, applications, and limitations will avoid their misuse and minimize the occurrence of failures. All failures have a cause, and the lack of understanding of these causes is the primary cause of the unreliability of a given system.

2. System Complexity

In many cases a complex and sophisticated system is used to accomplish a task which could have been done by a simpler scheme. The implications of complexity are costly. First, it employs more components, thereby decreasing the overall reliability of the system. Second, a complex scheme presents problems in terms of the users' understanding and maintenance. On the other hand, simplicity costs less, causes fewer problems, and has more reliability. A basic rule of reliability with respect to complexity is: keep the system as simple as is compatible with the performance requirements.

3. Poor Maintenance

The important period in the life cycle of a product or a system is its operating period. Since no product is perfect, it is likely to fail. However, its lifetime can be increased if it can be repaired and put into operation again. In many cases preventive measures are possible, and a judiciously designed preventive-maintenance policy can help eliminate failures to a large extent. The adage Prevention is better than cure applies to products and equipments as well.

4. Communication and Coordination

Reliability is a concern of almost all departments of an organization. It is essentially a birth-to-death problem involving such areas as raw materials and parts, conceptual and detailed engineering design, production, test and quality control, product shipment and storage, installation, operation and maintenance. A well-organized management with an efficient system of communication is required to share the information and experiences about components. Sufficient opportunity should be available for the people concerned to discuss the causes of failures. In some organizations, rigidity of rules and procedures inhibits creative thinking and design.

5. Human Reliability

In spite of the increased application of automation techniques in industries and other organisations, it is impossible to completely eliminate human involvement in the operation and maintenance of systems. The contribution of human errors to unreliability may come at various stages of the product cycle. Failures due to human error can be due to:

* Lack of understanding of the equipment
* Lack of understanding of the process
* Carelessness
* Forgetfulness
* Poor judgemental skills
* Absence of correct operating procedures and instructions
* Physical inability
Although it is not possible to eliminate all human errors, it is possible to minimize some of them by the proper selection and training of personnel, standardization of procedures, simplification of control schemes and other incentive measures. The designer should ensure that the operation of the equipment is as simple as possible, with practically the minimum probability of error. The operator should be comfortable in his work and should be free from unnecessary stresses. The following checklist should prove useful to the design engineer:

* Is the operator position comfortable for operating the controls?
* Do any of the operations require excessive physical effort?
* Is lighting of the workplace and surrounding area satisfactory?
* Does the room temperature cause any discomfort to the operator?
* Are noise and vibration within the tolerable limits?
* Does the layout ensure the required minimum movement of operator?
* Can the operator's judgement be further minimized?

would then be evidenced when the area exceeds a specified amount. A third possibility would be to use the number of crossings of the limits as an indicator of unsatisfactory performance.

Fig. 1.1 (a) Non-monotonic drift of a variable Y(t) between the limits Ymax and Ymin. (b) v(t) is the total time Y(t) has spent in the region of degradation.

1.6 CHARACTERISTIC TYPES OF FAILURES

Reliability Engineering distinguishes three characteristic types of failures (excluding damage caused by careless handling, storing, or improper operation by the users) which may be inherent in the equipment and occur without any fault on the part of the operator.

First, there are the failures which occur early in the life of a component.
They are called early failures. Some examples of early failures are:

• Poor welds or seals
• Poor solder joints
• Poor connections
• Dirt or contamination on surfaces or in materials
• Chemical impurities in metal or insulation
• Voids, cracks, thin spots in insulation or protective coatings
• Incorrect positioning of parts

Many of these early failures can be prevented by improving the control over
the manufacturing process. Sometimes, improvements in design or materials
are required to increase the tolerance for these manufacturing deviations,
but fundamentally these failures reflect the manufacturability of the component
or product and the control of the manufacturing processes. Consequently,
these early failures would show up during:

* In-process and final tests
* Process audits
* Life tests
* Environmental tests.

Early failures can be eliminated by the so-called debugging or burn-in process. The debugging process consists of operating an equipment for a number of hours under conditions simulating actual use. The weak or substandard components fail in these early hours of the equipment's operation and they are replaced by good components. Similarly, poor solder connections or other assembly faults show up and are corrected. Only then is the equipment released for service.

Secondly, there are failures which are caused by wearout of parts. These occur in an equipment only if it is not properly maintained, or not maintained at all. Wearout failures are due primarily to deterioration of the design strength of the device as a consequence of operation and exposure to environmental fluctuations. Deterioration results from a number of familiar chemical and physical phenomena:

* Corrosion or oxidation
* Insulation breakdown or leakage
* Ionic migration of metals in vacuum or on surfaces
* Frictional wear or fatigue
* Shrinkage and cracking in plastics

In most cases wearout failures can be prevented. For instance, in repeatedly operated equipment one method is to replace at regular intervals the accessible parts which are known to be subject to wearout, and to make the replacement intervals shorter than the mean wearout life of the parts. Or, when the parts are inaccessible, they are designed for a longer life than the intended life of the equipment. This second method is also applied to so-called one-shot equipment, such as missiles, which are used only once during their lifetime.

Third, there are so-called chance failures which neither good debugging techniques nor the best maintenance practices can eliminate. These failures are caused by sudden stress accumulations beyond the design strength of the component. Chance failures occur at random intervals, irregularly and
unexpectedly. No one can predict when chance failures will occur. However,
they obey certain rules of collective behaviour so that the frequency of
their occurrence during sufficiently long periods is approximately constant.
Chance failures are sometimes called catastrophic failures, which is
inaccurate because early failures and wearout failures can be as catastrophic
as chance failures. It is not normally easy to eliminate chance failures.
However, reliability techniques have been developed which can reduce the
chance of their occurrence and, therefore, reduce their number to a minimum
within a given time interval.

Reliability engineering is concerned with eliminating early failures by observing their distribution and determining accordingly the length of the necessary debugging period and the debugging methods to be followed. Further, it is concerned with preventing wearout failures by observing the statistical distribution of wearout and determining the overhaul or preventive replacement periods for the various parts, or their design life. Finally, its main attention is focused on chance failures and their prevention, reduction, or complete elimination, because it is the chance failure phenomenon which most undesirably affects equipment after it has been debugged and before its parts begin to wear out.

1.7 USEFUL LIFE OF COMPONENTS

If we take a large sample of components and operate them under constant conditions and replace the components as they fail, then approximately the same number of failures will occur in sufficiently long periods of equal length. The physical mechanism of such failures is a sudden accumulation of stresses acting on and in the component. These sudden stress accumulations occur at random, and the randomness of the occurrence of chance failures is therefore an obvious consequence.

If we plot the curve of the failure rate against the lifetime T of a very large sample of a homogeneous component population, the resulting failure rate graph is shown in Fig 1.3. At the time T = 0 we place in operation a very large number of new components of one kind. This population will initially exhibit a high failure rate if it contains some proportion of substandard, weak specimens. As these weak components fail one by one, the failure rate decreases comparatively rapidly during the so-called burn-in or debugging period, and stabilizes to an approximately constant value at the time Tb when the weak components have died out. The component population, after having been burned in or debugged, reaches its lowest failure rate level, which is approximately constant. This period of life is called the useful life period and it is in this period that the exponential law is a good approximation. When the components reach the life Tw, wearout begins to make itself noticeable. From this time on, the failure rate increases rather rapidly. If up to the time Tw only a small percentage of the component population has failed, then of the many components which survived up to the time Tw, about one-half will fail in the time period from Tw to M. The time M is the mean wearout life of the population. We call it simply the mean life, distinguished from the mean time between failures, m = 1/λ, in the useful life period.

Fig. 1.3 Component failure rate as a function of operating life (age) T: early failures before Tb, chance failures in the useful life period from Tb to Tw, and wearout failures beyond Tw; M is the mean wearout life.

If the chance failure rate is very small in the useful life period, the mean time between failures can reach hundreds of thousands or even millions of hours. Naturally, if a component is known to have a mean time between failures of say 100,000 hours (or a failure rate of 0.00001 per hour), that certainly does not mean that it can be used in operation for 100,000 hours.

The mean time between failures tells us how reliable the component is in its useful life period, and such information is of utmost importance. A component with a mean time between failures of 100,000 hours will have a reliability of 0.9999, or 99.99 percent, for any 10-hour operating period. Further, if we operate 100,000 components of this quality for 1 hour, we would expect only one to fail. Equally, we would expect only one failure if we operate 10,000 components under the same conditions for 10 hours, or 1000 components for 100 hours, or 100 components for 1000 hours.

Chance failures cannot be prevented by any replacement policy because of the constant failure rate of the components within their useful life. If we try to replace good nonfailed components during useful life, we would improve absolutely nothing. We would more likely do harm, as some of the components used for replacement may not have been properly burned in, and the presence of such components could only increase the failure rate. Therefore, the very best policy in the useful life period of components is to replace them only as they fail. However, we must stress again that no component must be allowed to remain in service beyond its wearout replacement time Tw. Otherwise, the component probability of failure increases tremendously and the system probability of failure increases even more.

The golden rule of reliability is, therefore: Replace components as they fail
within the useful life of the components, and replace each component
preventively, even if it has not failed, not later than when it has reached the
end of its useful life. The burn-in procedure is an absolute must for missiles,
rockets, and space systems in which no component replacements are
possible once the vehicle takes off and where the failure of any single
component can cause the loss of the system. Component burn-in before
assembly followed by a debugging procedure of the system is, therefore,
another golden rule of reliability.

1.8 THE EXPONENTIAL CASE OF CHANCE FAILURES


In the simplest case, when a device is subject only to failures which occur at random intervals, and the expected number of failures is the same for equally long operating periods, its reliability is mathematically defined by the well-known exponential formula

R(t) = exp(-λt)     (1.1)

In this formula λ is a constant called the failure rate, and t is the operating time. The failure rate must be expressed in the same units as the time t, usually in hours. However, it may be better to use cycles or miles in some cases. The reliability R is then the probability that the device, which has a constant failure rate λ, will not fail in the given operating time t.

This reliability formula is correct for all properly debugged devices which are
not subject to early failures, and which have not yet suffered any degree
of wearout damage or performance degradation because of their age.

To illustrate the important fact of an equal chance of survival for periods of equal length throughout the useful life, let us assume that a device with a 1000-hour useful life has a constant failure rate λ = 0.0001 per hour. Its reliability for any 10 hours' operation within these 1000 hours is

R = exp(-0.0001 × 10) = 0.9990 (or 99.9 percent)

The probability that the device will not fail in its entire useful life period of
1000 hours is

R = exp(-0.0001 × 1000) = 0.9048 (or 90.48 percent)

Thus, it has a chance of 90 percent to survive up to 1000 hours counted from the moment when first put into operation. But if it survives up to 990 hours, then its chance to survive the last 10 hours (from 990 to 1000 hours) of its useful life is again 99.9 percent.
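
These figures are easy to reproduce in a few lines of Python. The sketch below is our own illustration (not from the text), using the values of this example; the last computation verifies the equal chance of survival for periods of equal length:

    import math

    def reliability(lam, t):
        """R(t) = exp(-lam*t) for a constant failure rate lam."""
        return math.exp(-lam * t)

    lam = 0.0001                      # failures per hour
    print(reliability(lam, 10))       # any 10-hour period: ~0.9990
    print(reliability(lam, 1000))     # the whole 1000-hour useful life: ~0.9048

    # Survival of the last 10 hours, given survival up to 990 hours:
    cond = reliability(lam, 1000) / reliability(lam, 990)
    print(round(cond, 4))             # ~0.9990, the same as any 10-hour period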

We often use the reciprocal value of the failure rate, which is called the mean time between failures, m. The mean time between failures, abbreviated MTBF, can be measured directly in hours. By definition, in the exponential case, the mean time between failures, or MTBF, is

m = 1/λ     (1.2)

The reliability function can, therefore, also be written in the form

R(t) = exp(-t/m) (1.3)

When plotting this function, with Reliability values on the ordinate and the
corresponding time values on the abscissa, we obtain a curve which is often
referred to as the survival characteristic and is shown in Fig 1.4.

It is important to understand that the time t on the abscissa is not a measure of the calendar life. It counts only the hours of any arbitrarily chosen operating period, with t = 0 designating the beginning of the considered operating period. Therefore, t in this formula is often called the mission time. It is assumed that the device has survived previous missions, and that it will not reach the end of its useful life in the mission now under consideration. The first assumption is written as R = 1 at t = 0, which means that the device has survived to the beginning of the mission. The second assumption is contained in the original assumption of λ = constant. Second, it is seen that the time t in the graph extends to infinity, which seems to make no sense. However, when only chance failures are considered, the certainty that a device will fail because of a chance failure exists only for an infinitely long operating period.

There are a few points on this curve which are easy to remember and which help greatly in rough prediction work. For an operating time t = m, the device has a probability of only 36.8 percent (or approximately 37 percent) to survive. For t = m/10, the curve shows a reliability of R = 0.9; for t = m/100, the reliability is R = 0.99; and for t = m/1000, it is 0.999.

Fig. 1.4 The standardised reliability curve: (a) the full curve (R falls from 1.0 at t = 0 to 0.367 at t = m); (b) the upper portion of the reliability curve (t = m/100, m/20, m/10).

For fast reliability calculations, we can use a nomogram as shown in Fig 1.5. If we know any two of the following three parameters, the third can be directly read on the straight line joining the first two:

(i) Failure rate (or MTBF)
(ii) Reliability
(iii) Operating time
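
In software the nomogram reduces to solving R = exp(-λt) for whichever quantity is missing. A minimal sketch, assuming the exponential model (the function name and interface are our own, not from the text):

    import math

    def third_parameter(lam=None, R=None, t=None):
        """Given any two of failure rate lam, reliability R and operating
        time t, return the third (exponential model R = exp(-lam*t))."""
        if R is None:
            return math.exp(-lam * t)
        if t is None:
            return -math.log(R) / lam
        if lam is None:
            return -math.log(R) / t
        raise ValueError("leave exactly one argument unspecified")

    print(third_parameter(lam=0.0001, t=100))    # R   ~ 0.990 (Example 1.1)
    print(third_parameter(R=0.99, t=100))        # lam ~ 0.0001/hr
    print(third_parameter(lam=0.0001, R=0.99))   # t   ~ 100 hr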

Example 1.1

Consider the failure rate of an instrument as 0.0001/hr. What will be its reliability for an operating period of 100 hours?

Solution

λ = 0.0001/hr

Therefore, m = 1/λ = 10,000 hr, and

R(100) = exp(-100/10,000) = exp(-0.01) = 0.99 (or 99 percent)

1.9 RELIABILITY MEASURES

The reliability of a component can be interpreted as the fraction of the number of components surviving a test to the total number of components present at the beginning of the test.

If a fixed number N0 of components is tested, there will be, after a time t, Ns(t) components which survive the test and Nf(t) components which fail. Therefore, N0 = Ns(t) + Nf(t) is a constant throughout the test. The reliability, expressed as a fraction by the probability definition at any time t during the test, is:

R(t) = Ns(t)/N0 = Ns(t)/[Ns(t) + Nf(t)]     (1.4)

In the same way, we can also define the probability of failure Q (called unreliability) as

Q(t) = Nf(t)/N0 = Nf(t)/[Ns(t) + Nf(t)]     (1.5)

It is at once evident that at any time t,

R(t) + Q(t) = 1     (1.6)

The events of component survival and component failure are called complementary events because each component will either survive or fail. They are also called mutually exclusive events because if a component has failed, it has not survived, and vice versa.

The reliability can also be written as

R(t) = 1 - Nf(t)/N0     (1.7)

By differentiation of this equation we obtain

dR(t)/dt = -(1/N0)(dNf(t)/dt)     (1.8)

Rearranging,

dNf(t)/dt = -N0 dR(t)/dt     (1.9)

The term dNf(t)/dt can be interpreted as the number of components failing in the time interval dt between the times t and t + dt, which is equivalent to the rate at which the component population still in test at time t is failing.

At the time t, we still have Ns(t) components in test; therefore, dNf(t)/dt components will fail out of these Ns(t) components. When we now divide both sides of equation (1.9) by Ns(t), we obtain the rate of failure, or the instantaneous probability of failure per one component, which we call the failure rate:

λ(t) = (1/Ns(t))(dNf(t)/dt) = -(N0/Ns(t))(dR(t)/dt)     (1.10)

Using (1.4) we get

λ(t) = -(1/R(t))(dR(t)/dt)     (1.11)

which is the most general expression for the failure rate because it applies to exponential as well as non-exponential distributions. In the general case, λ is a function of the operating time t, for both R and dR/dt are functions of t. Only in one case will the equation yield a constant, and that is when failures occur exponentially at random intervals in time. By rearrangement and integration of the above equation, we obtain the general formula for reliability,

λ(t)dt = -dR(t)/R(t)

or, ln R(t) = -∫₀ᵗ λ(t) dt

Solving for R(t) and knowing that at t = 0, R(t) = 1, we obtain

R(t) = exp[-∫₀ᵗ λ(t) dt]     (1.12)

So far in this derivation we have made no assumption regarding the nature of the failure rate, and therefore it can be any variable and integrable function of the time t. Consequently, in equation (1.12), R(t) mathematically describes reliability in a most general way and applies to all possible kinds of failure distributions.

When we specify that the failure rate is constant in the above equation, the exponent becomes

-∫₀ᵗ λ(t) dt = -λt

and the known reliability formula for constant failure rate results,

R(t) = exp(-λt)     (1.13)
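
To make equation (1.12) concrete, the integral can be evaluated numerically for any integrable hazard function. The sketch below is our own illustration, not from the text; it checks the constant-hazard case against (1.13) and also handles a linearly increasing, wearout-like hazard:

    import math

    def reliability_general(hazard, t, steps=10_000):
        """R(t) = exp(-integral of hazard from 0 to t), trapezoidal rule."""
        h = t / steps
        integral = 0.5 * (hazard(0.0) + hazard(t)) * h
        integral += sum(hazard(i * h) for i in range(1, steps)) * h
        return math.exp(-integral)

    lam = 0.001
    print(reliability_general(lambda u: lam, 100.0))    # ~0.9048
    print(math.exp(-lam * 100.0))                       # 0.9048..., eq. (1.13)

    a = 1e-6                                            # hazard(t) = a*t
    print(reliability_general(lambda u: a * u, 100.0))  # ~exp(-a*t**2/2) = 0.9950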



Example 1.3:

The failure data for ten electronic components is as given in Table 1.3. Compute and plot the failure density, failure rate, reliability and unreliability functions.

Table 1.3: Data for Example 1.3

Failure No             1    2    3    4    5    6    7    8    9   10
Operating time (hrs)   8   20   34   46   63   86  111  141  186  266
Solution

The computation of failure density and failure rate is shown in Table 1.4. Similarly, the computation of the reliability and unreliability functions is shown in Table 1.5. These results are also shown in Fig 1.8. As shown, we can compute R(t) for this example using the formula R(t) = Ns(ti)/N0 at each value of ti and connecting these points by a set of straight lines. In data analysis one usually finds it convenient to work with the λ(t) curve and deduce the reliability and density functions theoretically. For example, in this illustration, we can see that the hazard rate can be modeled as a constant.

***

Table 1.4: Computation of failure density and failure rate

Time interval (hrs)   Failure density            Failure rate
0-8                   1/(10 × 8)  = 0.0125       1/(10 × 8)  = 0.0125
8-20                  1/(10 × 12) = 0.0084       1/(9 × 12)  = 0.0093
20-34                 1/(10 × 14) = 0.0072       1/(8 × 14)  = 0.0089
34-46                 1/(10 × 12) = 0.0084       1/(7 × 12)  = 0.0119
46-63                 1/(10 × 17) = 0.0059       1/(6 × 17)  = 0.0098
63-86                 1/(10 × 23) = 0.0044       1/(5 × 23)  = 0.0087
86-111                1/(10 × 25) = 0.0040       1/(4 × 25)  = 0.0100
111-141               1/(10 × 30) = 0.0033       1/(3 × 30)  = 0.0111
141-186               1/(10 × 45) = 0.0022       1/(2 × 45)  = 0.0111
186-266               1/(10 × 80) = 0.0013       1/(1 × 80)  = 0.0125
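
The entries of Tables 1.4 and 1.5 follow mechanically from the raw failure times of Table 1.3. A short Python sketch (our own illustration):

    failure_times = [8, 20, 34, 46, 63, 86, 111, 141, 186, 266]  # Table 1.3
    N0 = len(failure_times)

    prev = 0
    for i, t in enumerate(failure_times):
        dt = t - prev                # width of the interval
        survivors = N0 - i           # components alive at the interval start
        fd = 1 / (N0 * dt)           # failure density, per hour
        fr = 1 / (survivors * dt)    # failure rate (hazard), per hour
        R = (N0 - (i + 1)) / N0      # reliability at the interval end
        print(f"{prev:>3}-{t:<3} fd={fd:.4f} fr={fr:.4f} R({t})={R:.1f}")
        prev = t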

We now show how we can measure the constant failure rate of a component population very conveniently. Referring to the previous experiment, if λ is constant, the product (1/Ns(t))(dNf(t)/dt) must also be constant throughout a test.

I) "'(I)

0 time time
(a) (b)
(I) Q(I)

(c)
lime
L (d)
time

FIg. 1.8 Reliability Parameters for Example 1.3.

Table 1.5: Computation of reliability and unreliability

Time (hrs)   Reliability   Unreliability
0            1.0           0.0
8            0.9           0.1
20           0.8           0.2
34           0.7           0.3
46           0.6           0.4
63           0.5           0.5
86           0.4           0.6
111          0.3           0.7
141          0.2           0.8
186          0.1           0.9
266          0.0           1.0

That means that 1/Ns(t) and dNf(t)/dt must either decrease at the same rate or must be held constant through the entire test. A simple way to measure a constant failure rate is to keep the number of components in the test constant by immediately replacing the failed components with good ones. The number of live components Ns(t) is then equal to N0 throughout the test. Therefore, 1/Ns(t) = 1/N0 is constant, and dNf(t)/dt in this test must also be constant if the failure rate is to be constant. But dNf(t)/dt will be constant only if the total number of failed components Nf(t), counted from the beginning of the test, increases linearly with time. If Nf components have failed in time t at a constant rate, the number of components failing per unit time becomes Nf/t, and in this test we can substitute Nf/t for dNf(t)/dt and 1/N0 for 1/Ns(t). Therefore,

λ = Nf/(N0 t)     (1.29)

Thus, we need to count only the number of failures Nf and the straight hours of operation t. The constant failure rate is then the number of failures divided by the product of the test time t and the number of components in test, which is kept continuously at N0. This product N0·t is the number of unit-hours accumulated during the test. Of course, this procedure for determining the failure rate can be applied only if λ is constant.

If only one equipment (N0 = 1) is tested but is repairable, so that the test can continue after each failure, the failure rate becomes λ = Nf/t, where the unit-hours t amount to the straight test time.

Example 1.4:

Consider another example wherein the time scale is now divided into equally spaced intervals called class intervals. The data is tabulated in Table 1.6 in class intervals of 1000 hours. Compute the failure density and failure rate functions.

Table 1.6: Data for Example 1.4

Time interval (hours)   Failures in the interval
0000 - 1000             59
1001 - 2000             24
2001 - 3000             29
3001 - 4000             30
4001 - 5000             17
5001 - 6000             13

Solution:

The solution for this example is shown in Table 1.7.

Table 1.7: Computation of failure density and failure rate

Interval       Failure density                  Failure rate
0000 - 1000    59/(172 × 1000) = 0.000343       59/(172 × 1000) = 0.000343
1001 - 2000    24/(172 × 1000) = 0.000140       24/(113 × 1000) = 0.000212
2001 - 3000    29/(172 × 1000) = 0.000169       29/( 89 × 1000) = 0.000326
3001 - 4000    30/(172 × 1000) = 0.000174       30/( 60 × 1000) = 0.000500
4001 - 5000    17/(172 × 1000) = 0.000099       17/( 30 × 1000) = 0.000567
5001 - 6000    13/(172 × 1000) = 0.000076       13/( 13 × 1000) = 0.001000

It can be seen that the failure rate in this case can be approximated by a
linearly increasing time function.

Example 1.5:

A sample of 100 electric bulbs was put on test for 1500 hrs. During this period 20 bulbs failed at 840, 861, 901, 939, 993, 1060, 1100, 1137, 1184, 1200, 1225, 1251, 1270, 1296, 1314, 1348, 1362, 1389, 1421, and 1473 hours. Assuming a constant failure rate, determine the value of the failure rate.

Solution:

In this case,

Nf = 20
N0t = 840 + 861 + 901 + 939 + 993 + 1060 + 1100 + 1137 + 1184 + 1200 + 1225 + 1251 + 1270 + 1296 + 1314 + 1348 + 1362 + 1389 + 1421 + 1473 + 80(1500) = 143,564 unit-hrs

Hence, λ = Nf/(N0t) = 20/143,564 = 1.39 × 10⁻⁴ /hr.
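
A quick Python check of this computation (our own sketch):

    failure_hours = [840, 861, 901, 939, 993, 1060, 1100, 1137, 1184, 1200,
                     1225, 1251, 1270, 1296, 1314, 1348, 1362, 1389, 1421, 1473]
    N0, T = 100, 1500

    # Failed bulbs accumulate hours up to their failure; survivors run the full test.
    unit_hours = sum(failure_hours) + (N0 - len(failure_hours)) * T
    lam = len(failure_hours) / unit_hours
    print(unit_hours)      # 143564
    print(f"{lam:.3e}")    # ~1.393e-04 per hour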

***

Fig. 1.9 The normal distribution: (a) the density function f(x); (b) the distribution function F(x).

Recall that a random variable is a function defined on the sample space S of the underlying experiment. Thus the above family of random variables is a family of functions {X(t,s) | s ∈ S, t ∈ T}. For a fixed t = t1, X(t1,s) is a random variable [denoted by X(t1)] as s varies over the sample space S. At some other fixed instant of time t2, we have another random variable X(t2,s). For a fixed sample point s1 ∈ S, X(t,s1) is a single function of time t, called a sample function or a realization of the process. When both s and t are varied, we have the family of random variables constituting a stochastic process.

If the state space of a stochastic process is discrete, then it is called a discrete-state process, often referred to as a chain. In this case, the state space is often assumed to be {0, 1, 2, ...}. Alternatively, if the state space is continuous, then we have a continuous-state process. Similarly, if the index set T is discrete, then we have a discrete (time)-parameter process; otherwise we have a continuous-parameter process.

2.7 MARKOV CHAINS

A Markov process is a stochastic process whose dynamic behaviour is such that probability distributions for its future development depend only on the present state and not on how the process arrived in that state. If we assume that the state space, I, is discrete (finite or countably infinite), then the Markov process is known as a Markov chain.

In order to formulate a Markov model (to be more precise, we are talking about continuous-time and discrete-state models) we must first define all the mutually exclusive states of the system. For example, in a system composed of a single non-repairable element x1 there are two possible states: s0 = x1, the element is good, and s1 = x1', the element is bad. The states of the system at t = 0 are called the initial states, and those representing a final or equilibrium state are called final states. The set of Markov state equations describes the probabilistic transitions from the initial to the final states.

The transition probabilities must obey the following two rules:

1. The probability of transition in time Δt from one state to another is given by z(t)Δt, where z(t) is the hazard associated with the two states in question. If all the zi(t)'s are constant, zi(t) = λi, and the model is called homogeneous. If any hazards are time functions, the model is called nonhomogeneous.

2. The probabilities of more than one transition in time Δt are infinitesimals of a higher order and can be neglected.

2.71 One Component System:

The probability of being in state s0 at time t + Δt is written P0(t + Δt). This is given by the probability that the system is in state s0 at time t, P0(t), times the probability of no failure in time Δt, 1 - z(t)Δt, plus the probability of being in state s1 at time t, P1(t), times the probability of repair in time Δt, which equals zero. (We are neglecting the possibility of repairs for the present.)

The resulting equation is

P0(t + Δt) = [1 - z(t)Δt] P0(t)     (2.37)

Similarly, the probability of being in state s1 at time t + Δt is given by

P1(t + Δt) = z(t)Δt P0(t) + P1(t)     (2.38)

The transition probability z(t)Δt is the probability of failure (change from state s0 to s1), and the probability of remaining in state s1 is unity.

Rearrangement of the above equations yields

[P0(t + Δt) - P0(t)]/Δt = -z(t) P0(t)

[P1(t + Δt) - P1(t)]/Δt = z(t) P0(t)

Passing to the limit as Δt becomes small, we obtain

dP0(t)/dt = -z(t) P0(t)     (2.39)

dP1(t)/dt = z(t) P0(t)     (2.40)

These equations can be solved in conjunction with the appropriate initial conditions for P0(t) and P1(t). The most common initial condition is that the system is good at t = 0, that is, P0(t = 0) = 1 and P1(t = 0) = 0.

The solution of these equations is:

P0(t) = exp[-∫₀ᵗ z(τ)dτ]     (2.41)

and

P1(t) = 1 - exp[-∫₀ᵗ z(τ)dτ]     (2.42)

Of course, a formal solution of the second equation is not necessary to obtain, since it is possible to recognize at the outset that

P0(t) + P1(t) = 1     (2.43)

The role played by the initial conditions is clearly evident. If there is a fifty-fifty chance that the system is good at t = 0, then P0(0) = 1/2, and

P0(t) = (1/2) exp[-∫₀ᵗ z(τ)dτ]     (2.44)
It is often easier to characterize Markov models by a graph composed of nodes representing system states and branches labeled with transition probabilities. Such a Markov graph for the problem described above is given in Fig 2.10. Note that the sum of the transition probabilities for the branches leaving each node must be unity. Treating the nodes as signal sources and the transition probabilities as transmission coefficients, we can write difference equations by inspection. Thus, the probability of being at any node at time t + Δt is the sum of all signals arriving at that node. All other nodes are considered probability sources at time t, and all transition probabilities serve as transmission gains. A simple algorithm for writing the differential equations by inspection is to equate the derivative of the probability at any node to the sum of the transmissions coming into the node. Any unity gain factors of the self-loops must first be set to zero, and the Δt factors are dropped from the branch gains.

Fig. 2.10 Markov graph for a single nonrepairable element: state s0 carries a self-loop 1 - z(t)Δt and a branch z(t)Δt to state s1, whose self-loop is unity.

2.72 Two-element system

If a two-element system consisting of elements x1 and x2 is considered, there are four system states: s0 = x1x2, s1 = x1'x2, s2 = x1x2' and s3 = x1'x2'. The Markov graph is shown in Fig 2.11. The probability expression for state s0 is given by

P0(t + Δt) = {1 - [z01(t) + z02(t)]Δt} P0(t)     (2.44)

where [z01(t) + z02(t)]Δt is the probability of a transition in time Δt from s0 to s1 or s2. For state s1,

P1(t + Δt) = [1 - z13(t)Δt] P1(t) + z01(t)Δt P0(t)     (2.45)

where z13(t)Δt is the probability of a transition from state s1 to s3. Similarly, for state s2,

P2(t + Δt) = [1 - z23(t)Δt] P2(t) + z02(t)Δt P0(t)     (2.46)

where z23(t)Δt is the probability of a transition from state s2 to s3.

For state s3 the transition equation is

P3(t + Δt) = P3(t) + z13(t)Δt P1(t) + z23(t)Δt P2(t)     (2.47)

Fig. 2.11 Markov graph for two distinct nonrepairable elements.

Rearranging these equations and passing to the limit yields

dP0(t)/dt = -[z01(t) + z02(t)] P0(t)     (2.48a)

dP1(t)/dt = -z13(t) P1(t) + z01(t) P0(t)     (2.48b)

dP2(t)/dt = -z23(t) P2(t) + z02(t) P0(t)     (2.48c)

dP3(t)/dt = z13(t) P1(t) + z23(t) P2(t)     (2.48d)

The initial conditions associated with this set of equations are P0(0), P1(0), P2(0), and P3(0). These equations, of course, could have been written by inspection using the algorithm previously stated.

It is difficult to solve these equations for a general hazard function z(t), but if the hazards are specified, the solution is quite simple. If all the hazards are constant, z01(t) = λ1, z02(t) = λ2, z13(t) = λ3, and z23(t) = λ4.
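
With constant hazards these four coupled equations are easy to integrate numerically. The following sketch is our own illustration (the values of λ1 ... λ4 are arbitrary assumptions); it uses SciPy and checks that the probabilities always sum to one and that P0(t) = exp[-(λ1 + λ2)t]:

    import numpy as np
    from scipy.integrate import solve_ivp

    l1, l2, l3, l4 = 0.002, 0.003, 0.004, 0.005   # z01, z02, z13, z23 (assumed)

    def rhs(t, P):
        P0, P1, P2, P3 = P
        return [-(l1 + l2) * P0,          # eq. (2.48a)
                -l3 * P1 + l1 * P0,       # eq. (2.48b)
                -l4 * P2 + l2 * P0,       # eq. (2.48c)
                l3 * P1 + l4 * P2]        # eq. (2.48d)

    sol = solve_ivp(rhs, (0, 500), [1, 0, 0, 0], dense_output=True)
    P = sol.sol(500.0)
    print(P, P.sum())                     # state probabilities; sum ~ 1
    print(np.exp(-(l1 + l2) * 500.0))     # analytic P0(500), for comparison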
3
RELIABILITY ANALYSIS OF
SERIES PARALLEL SYSTEMS

3.1 INTRODUCTION

Reliability is not confined to single components. We really want to evaluate the reliabilities of systems, simple as well as extremely complex, and to use these evaluation techniques for designing reliable systems. System reliabilities are calculated by means of the calculus of probability. To apply this calculus to systems, we must have some knowledge of the probabilities of its components, since they affect the reliability of the system.

Component reliabilities are derived from tests which yield information about failure rates. The actual value of this failure rate can be obtained only by means of statistical procedures because of the two main factors which govern the probability of survival of a component:

1. The uncertainties of the production process.

2. The uncertainties of the stresses which the component must withstand in operation.

In reliability tests we actually measure the failure rate of a component, which means we measure its instantaneous probability of failure at a given set of environmental and operating stress conditions. System reliability calculations are based on two important operations:

1. As precise as possible a measurement of the reliability of the components used in the system environment.

2. The calculation of the reliability of some complex combination of these components.

Once we have the right figures for the reliabilities of the components in a system, or good estimates of these figures, we can then perform very exact calculations of system reliability, even when the system is the most complex combination of components conceivable. The exactness of our results does not hinge on the probability calculations, because these are perfectly accurate; rather, it hinges on the exactness of the reliability data of the components. In system reliability calculations for series-parallel systems we need use only the basic rules of the probability calculus.

The following assumptions are made:

1. The reliabilities of all constituent components of the system are known and these are constant during the time interval in which the reliability of the network is being examined.

2. All components are always operating except possibly in the case of redundancy.

3. There does not exist any correlation between failures of different links, i.e. the states of all elements are s-independent.

4. The state of each element and of the entire network is either good (operating) or bad (failed).

5. The nodes of the network are perfect.

6. There is no limitation on the flow transmission capability of any component, i.e. each link/node can transmit the required amount of flow.

These assumptions are primarily made for mathematical practicability. Several of these assumptions are removed in the published work on reliability analysis.

3.2 RELIABILITY BLOCK DIAGRAMS

A block diagram which depicts the operational relationship of various elements in a physical system, as regards the success of the overall system, is called a Reliability Block Diagram or Reliability Logic Diagram. While the system diagram depicts the physical relationship of the system elements, the reliability block diagram shows the functional relationship and indicates which elements must operate successfully for the system to accomplish its intended function. The function which is performed may be the simple action of a switch which opens or closes a circuit, or may be a very complex activity such as the guidance of a spacecraft.

Two blocks in a block diagram are shown in series if the failure of either of them results in system failure. In a series block diagram of many blocks, such as Fig 3.1, it is imperative that all the blocks must operate successfully for system success. Similarly, two blocks are shown in parallel in the block diagram if the success of either of them results in system success. In a parallel block diagram of many blocks, such as Fig 3.2, successful operation of any one or more blocks ensures system success. A block diagram in which both the above connections are used is termed a Series-Parallel Block Diagram.

A closely related structure is a k-out-of-m structure. Such a block diagram represents a system of m components in which any k must be good for the system to operate successfully. A simple example of such a type of system is a piece of stranded wire with m strands in which at least k are necessary to pass the required current. Such a block diagram cannot be recognised without a description inscribed on it, as in Fig 3.3. Series and parallel reliability block diagrams can be described as special cases of this type with k equal to m and unity respectively.

Fig. 3.1 A Series Block Diagram
Fig. 3.2 A Parallel Block Diagram
Fig. 3.3 A k-out-of-m Block Diagram (at least k of the m blocks needed)
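
For m identical, independent components each of reliability p, the reliability of a k-out-of-m structure follows the binomial law, R = Σ C(m,i) pⁱ(1-p)^(m-i) summed over i = k, ..., m. This formula is not spelled out in the text above, so the sketch below is our own illustration; note how k = m and k = 1 recover the series and parallel cases:

    from math import comb

    def k_out_of_m(k, m, p):
        """Probability that at least k of m identical, independent
        components (each of reliability p) are good."""
        return sum(comb(m, i) * p**i * (1 - p)**(m - i)
                   for i in range(k, m + 1))

    p = 0.9
    print(k_out_of_m(3, 3, p))   # k = m: series,   p**3       = 0.729
    print(k_out_of_m(1, 3, p))   # k = 1: parallel, 1-(1-p)**3 = 0.999
    print(k_out_of_m(2, 3, p))   # 2-out-of-3 majority         = 0.972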

A block diagram which cannot be completely described through series or parallel operational relationships is called a non-series-parallel block diagram. The analysis methods for such systems are discussed in the next chapter.

3.3 SERIES SYSTEMS

Many complex systems are series systems as per reliability logic. The block diagram of a series system was shown in Fig 3.1. If Ei and Ei' denote the events of satisfactory and unsatisfactory operation of the component i, the event representing system success is the logical intersection of E1, E2, ..., En. The reliability of the system is the probability of success of this event and is given by

R = Pr(E1 ∩ E2 ∩ ... ∩ En)     (3.1)

  = Pr(E1) Pr(E2|E1) Pr(E3|E1E2) ...     (3.2)

where Pr(E2|E1) is the probability of event E2 given that E1 has occurred. For independent components,

R = Pr(E1) Pr(E2) ... Pr(En)     (3.3)

If Pr(Ei) = pi(t), the time-dependent reliability function is

R(t) = ∏ᵢ₌₁ⁿ pi(t)     (3.4)

The above equation is commonly known as the product law of reliabilities.

In the case of exponential distributions, if λi is the failure rate of component i,

pi(t) = exp(-λi t)

and

R(t) = exp[-t Σᵢ₌₁ⁿ λi]     (3.5)

Therefore, the reliability law for the whole system is still exponential. Also, for series systems with constant failure rate components, the system failure rate is the sum of the failure rates of the individual components, i.e.,

λs = Σᵢ₌₁ⁿ λi     (3.6)

and the MTBF of the system is related to the MTBF of the individual components by

ms = 1/Σᵢ₌₁ⁿ (1/mi)     (3.7)

Example 3.1

An electronic circuit consists of 5 silicon transistors, 10 silicon diodes, 20 composition resistors, and 5 ceramic capacitors in continuous series operation. Assume that under the actual stress conditions in the circuit the components have the following failure rates:

Silicon transistors      λt = 0.000008/hr
Silicon diodes           λd = 0.000002/hr
Composition resistors    λr = 0.000001/hr
Ceramic capacitors       λc = 0.000004/hr

Estimate the reliability of this circuit for a 10-hour operation.

Solution

The circuit failure rate is given as:

λs = 5λt + 10λd + 20λr + 5λc
   = 5(0.000008) + 10(0.000002) + 20(0.000001) + 5(0.000004) = 0.0001/hr

This sum is the expected hourly failure rate λs of the whole circuit. The estimated reliability of the circuit is then

R(t) = exp(-0.0001t)

for an operating time t. For a 10-hour operation the reliability is

R(10) = 0.999 = 99.9%

Also, the expected mean time between failures is

ms = 1/λs = 1/0.0001 = 10,000 hours

This does not mean that the circuit could be expected to operate without failure for 10,000 hours. We know from the exponential function that its chance to survive for 10,000 hours is only about 37%.
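
A small Python sketch (our own) reproducing the numbers of Example 3.1:

    import math

    # (count, failure rate per hour) for each component type in the circuit
    parts = [(5, 0.000008),    # silicon transistors
             (10, 0.000002),   # silicon diodes
             (20, 0.000001),   # composition resistors
             (5, 0.000004)]    # ceramic capacitors

    lam_s = sum(n * lam for n, lam in parts)   # series failure rate, eq. (3.6)
    print(lam_s)                               # 0.0001 per hour
    print(math.exp(-lam_s * 10))               # R(10) ~ 0.9990
    print(1 / lam_s)                           # MTBF = 10,000 hours
    print(math.exp(-lam_s * 10_000))           # ~0.37: surviving one full MTBF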

***
It may be noted that the component failure rate figures apply to definite operating stress conditions; for instance, to an operation at rated voltage, current, temperature, and at a predicted level of mechanical stresses, such as shock and vibration. Failure rates usually change radically with changes in the stress levels. If a capacitor is operated at only half of its rated voltage, its failure rate may drop to 1/30th of the failure rate at full rated voltage operation.

Thus, to upgrade the reliability of the circuit it becomes necessary to reduce the stresses acting on the components; that is, to use components of higher voltage and current ratings, and to make provisions for a reduction of the operating temperature levels. Using these techniques, component failure rate reductions by a factor of ten are often easily achieved.

Thus, when designing the circuits and their packaging, the circuit designer
should always keep two things in mind:

1. Do not overstress the components, but operate them well below their
rated values, including temperature. Provide good packaging against
shock and vibration, but remember that in tightly packaged
equipment without adequate heatsinks, extremely high operating
temperatures may develop which can kill all reliability efforts.

2. Design every equipment with as few components as possible. Such simplification of the design increases reliability and also makes assembly and maintenance easier.

It may be observed that the time t used above is the system operating time. Only when a component operates continuously in the system will the component's operating time be equal to the system's operating time. In general, when a component operates on the average for t1 hours in t system operating hours, it assumes in the system's time scale a failure rate of

λ = λ' t1/t     (3.8)

where λ' is the component's failure rate while in operation.

The above equation is based on the assumption that in the non-operating or de-energized condition the component has a zero failure rate even though the system is in operation. This is not always the case. Components may exhibit some failure rate even in their quiescent or idle condition while the system is operating. If the component has a failure rate of λ' when operating and λ'' when de-energized, and it operates for t1 hours every t hours of system operation, the system will see this component behaving with an average failure rate of

λ = [λ' t1 + λ''(t - t1)]/t     (3.9)

If the failure rate of a component is expressed in terms of operating cycles, and if the component performs on the average C operations in t system hours, the system will see this component behave with a failure rate of

λ = C λc/t     (3.10)

But if this component also has a time-dependent failure rate of λ' while energized, and a failure rate of λ'' when de-energized (with the system still operating), the component assumes in the system time scale a failure rate of

λ = [C λc + λ' t1 + λ''(t - t1)]/t     (3.11)

Example 3.2

An electric bulb has a failure rate of 0.0002/hr when glowing and 0.00002/hr when not glowing. At the instant of switching ON, the failure rate is estimated to be 0.0005/switching. What is the average failure rate of the bulb if on the average it is switched 6 times every day and remains ON for a total of 8 hrs in the day?

Solution

Here,

t = 24 hrs
t1 = 8 hrs
λ' = 0.0002/hr
λ'' = 0.00002/hr
λc = 0.0005/switching
C = 6

Therefore, using equation (3.11),

λ = [6(0.0005) + 8(0.0002) + 16(0.00002)]/24
  = 0.00492/24 = 0.000205/hr

An interesting point to be made here is that, purely from reliability considerations, it is better to keep the bulb on for the whole day rather than switching it off when not needed. (We have not discussed the question of energy consumption here, which may force the other decision on us.)
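
Equation (3.11) and the comparison just made are easy to script. A sketch (our own; the function name is assumed):

    def average_failure_rate(C, lam_c, t1, lam_on, lam_off, t=24.0):
        """Average failure rate seen by the system, eq. (3.11): C switching
        cycles and t1 operating hours out of every t hours."""
        return (C * lam_c + t1 * lam_on + (t - t1) * lam_off) / t

    # The bulb of Example 3.2: switched 6 times/day, ON for 8 h/day.
    print(average_failure_rate(6, 0.0005, 8, 0.0002, 0.00002))    # ~0.000205/hr
    # Left ON the whole day, never switched: only the operating rate remains.
    print(average_failure_rate(0, 0.0005, 24, 0.0002, 0.00002))   # 0.0002/hr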

***
In case the components in a series system are identical and independent, each with reliability p or unreliability q,

R = pⁿ = (1-q)ⁿ     (3.12)

For the high reliability region,

R ≈ 1 - nq     (3.13)

is a good approximation and can be used for fast calculation.

Example 3.3

A series system is composed of 10 identical independent components. If the desired value of system reliability is 0.99, how good must the components be from the reliability point of view?

Solution

Using relation (3.13),

R ≈ 1 - nq
or, 0.99 = 1 - 10q
or, q = 0.001
Hence, p = 0.999

On the other hand, if we use the exact relationship,

R = p¹⁰
or, p¹⁰ = 0.99
p = (0.99)^0.1 = 0.99899

We can thus see that the difference between the exact calculation and the approximate calculation is negligible, and hence the approximate relation is frequently used in practical design. In simple words, this means that the system unreliability is approximately the component unreliability multiplied by the number of components in the system.
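
The exact and approximate component requirements of Example 3.3 can be compared directly (our own sketch):

    n, R_target = 10, 0.99

    q = (1 - R_target) / n         # approximation (3.13): R ~ 1 - n*q
    p_exact = R_target ** (1 / n)  # exact relation (3.12): R = p**n
    print(1 - q)                   # 0.999
    print(p_exact)                 # 0.99899...
    print((1 - q) ** n)            # check: 0.999**10 ~ 0.99004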

***

3.4 PARALLEL SYSTEMS

When a system must be designed to a quantitatively specified reliability figure, it is generally not enough for the designer to simply reduce the number of components and the stresses acting on them. He must, during the various stages of the design, duplicate components, and sometimes whole circuits, to fulfil such requirements. In other words, he must use parallel systems, such as shown in Fig 3.2.

If Ei and Ei' are the events of satisfactory and unsatisfactory operation of the component i, the event for system success now is the union of E1, E2, ..., Em. The reliability of the system is the probability of success of this event and is given by

R = Pr(E1 ∪ E2 ∪ ... ∪ Em)     (3.14)

  = 1 - Pr(E1' ∩ E2' ∩ ... ∩ Em')     (3.15)

For independent components,

R = 1 - Pr(E1') Pr(E2') ... Pr(Em')     (3.16)

If Pr(Ei') = qi and Pr(Ei) = pi, the time-dependent reliability function is

R(t) = 1 - ∏ᵢ₌₁ᵐ qi(t)     (3.17)

     = 1 - ∏ᵢ₌₁ᵐ [1 - pi(t)]     (3.18)

In the case of identical components,

R = 1 - [1 - p(t)]ᵐ     (3.19)

and the unreliability

Q = q(t)ᵐ     (3.20)

which is commonly called the product law of unreliabilities. For designing a system having unreliability less than Q, the number of parallel components, each with unreliability q, can be determined easily using the above equation.

For constant failure rates,

R(t) = 1 - [1 - exp(-λt)]ᵐ     (3.21)

and the MTBF for the system is given by



ms = ∫₀^∞ {1 - [1 - exp(-λt)]ᵐ} dt     (3.22)

It can easily be derived now that:

ms = (1/λ) Σᵢ₌₁ᵐ (1/i)     (3.23)

For large values of m, equation (3.23) can be reduced to:

ms ≈ (1/λ)[ln(m) + 0.577 + 1/(2m)]     (3.24)

Reliability improvement through redundancy is thus seen to be logarithmic.

This implies that although more components in parallel are advantageous from the reliability point of view, the incremental advantage keeps on reducing with every additional component used. A designer must weigh this against the observation that cost will generally be a linearly increasing function of the number of components. While the designer has the option of adding redundant components for improved reliability, this option should therefore not be used indiscriminately.
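
The diminishing return is easy to see numerically. The sketch below (our own illustration, with an assumed failure rate of 0.001/hr) tabulates the exact MTBF of equation (3.23) against the approximation (3.24):

    import math

    lam = 0.001   # per hour; the MTBF of a single unit is 1000 h

    for m in (1, 2, 3, 5, 10, 20):
        exact = sum(1 / i for i in range(1, m + 1)) / lam        # eq. (3.23)
        approx = (math.log(m) + 0.577 + 1 / (2 * m)) / lam       # eq. (3.24)
        print(f"m={m:2d}  MTBF={exact:7.1f} h  approx={approx:7.1f} h")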

When two components with the failure rates λ1 and λ2 operate in parallel, the reliability Rp of this parallel system is given by

Rp = 1 - [1 - exp(-λ1t)][1 - exp(-λ2t)]
   = exp(-λ1t) + exp(-λ2t) - exp[-(λ1 + λ2)t]     (3.25)

The mean time between failures in this case is

mp = ∫₀^∞ Rp dt = 1/λ1 + 1/λ2 - 1/(λ1 + λ2)     (3.26)
When the failure rates of two parallel components are equal, so that λ1 = λ2 = λ, the unreliability of this parallel combination of two identical components is

Qp = Q1Q2 = q² = [1 - exp(-λt)]²

The reliability is

Rp = 1 - Qp = 1 - [1 - exp(-λt)]² = 2exp(-λt) - exp(-2λt)     (3.27)

The mean time between failures now is

mp = 2/λ - 1/(2λ) = 1/λ + 1/(2λ) = 3/(2λ)     (3.28)



For three identical components in parallel, we have

Rp = 1 - Qp = 1 - q³ = 1 - [1 - exp(-λt)]³
   = 3exp(-λt) - 3exp(-2λt) + exp(-3λt)     (3.29)

or, mp = 3/λ - 3/(2λ) + 1/(3λ) = 11/(6λ), which can also be expressed as:

mp = 1/λ + 1/(2λ) + 1/(3λ) = 11/(6λ)     (3.30)

When the three components in parallel are not similar,

Rp = 1 - [1 - exp(-λ1t)][1 - exp(-λ2t)][1 - exp(-λ3t)]

mp = 1/λ1 + 1/λ2 + 1/λ3 - 1/(λ1 + λ2) - 1/(λ1 + λ3) - 1/(λ2 + λ3) + 1/(λ1 + λ2 + λ3)     (3.31)

Finally, for n similar components in parallel, we obtain

Rp = 1 - Qp = 1 - qⁿ = 1 - [1 - exp(-λt)]ⁿ

mp = 1/λ + 1/(2λ) + 1/(3λ) + ... + 1/(nλ)     (3.32)

Although the improvement in reliability achieved by operating components


in parallel is quite obvious, it must be remembered that not all components
are suitable for what we have defined as parallel operation, i.e.,
continuous operation of two parallel sets for the sole purpose of having one
to carry on the operation alone should the other fail. Resistors and
capacitors are particularly unsuitable for this kind of operation because if
one fails out of two parallel units, this changes the circuit constants. When
high reliability requirements make redundant arrangements of such units a
necessity, these arrangements must then be of the stand-by type where
only one unit operates at a time and the second unit, which is standing by
idly, is switched into the circuit if the first unit fails. Such systems are
discussed in a subsequent section.

Example 3.4

A broadcast station has three active and independent transmitters. At least


one of these must function for the system's success. Calculate the reliability
of transmission if the reliabilities of individual transmitters are 0.92, 0.95,
and 0.96 respectively.

Solution

Rp = 1 - ∏_{i=1}^{m} (1 - pi)

   = 1 - (0.08)(0.05)(0.04) = 0.99984 (or 99.984%)
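A minimal Python sketch of this calculation, using the parallel product law:

from math import prod

def parallel_reliability(ps):
    # R = 1 - product of the component unreliabilities (1 - p_i)
    return 1.0 - prod(1.0 - p for p in ps)

print(parallel_reliability([0.92, 0.95, 0.96]))   # -> 0.99984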


***
3.5 SERIES PARALLEL SYSTEMS

In such systems, the product law of reliabilities and the product law of
unreliabilities have to be applied repeatedly for reliability analysis.
This is best clarified with the help of some examples:

Example 3.5

A system consists of five components connected as shown in Fig 3.4 with


given values of component reliabilities. Find the overall system reliability.

[Fig. 3.4: System for Example 3.5. Component A (0.98) is in series with the parallel pair B and C (0.92 each); this branch is in parallel with the series pair D and E (0.98 each).]

Solution

The reliability of the series combination D-E is:

Rd Re = 0.98 × 0.98 = 0.9604

The reliability of the parallel combination B-C is:

1 - (1 - 0.92)^2 = 0.9936

Hence, the reliability of A, B and C together is:

(0.98)(0.9936) = 0.9737

Therefore the overall system reliability is:

0.9737 + 0.9604 - (0.9737)(0.9604) = 0.99896
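A minimal Python sketch composing the two product laws for this example:

def series(*rs):
    # product law of reliabilities
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    # product law of unreliabilities
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

branch_abc = series(0.98, parallel(0.92, 0.92))   # A in series with B || C
branch_de = series(0.98, 0.98)                    # D in series with E
print(parallel(branch_abc, branch_de))            # -> ~0.99896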

***
Example 3.6

Three generators, one with a capacity of 100 kW and the other two with a
capacity of 50 kW each, are connected in parallel. Draw the reliability logic
diagram if the required load is:
(i) 100 kW (ii) 150 kW

Determine the reliability of both the arrangements if the reliability of each


generator is 0.95.

Solution

The reliability logic diagram for case (i) is drawn as shown in Fig 3.5(a)
because in this case either the one 100 kW generator or both 50 kW generators
must function. Similarly, the logic diagram for case (ii) is drawn as shown in
Fig 3.5(b), as in this case the 100 kW generator must function and, of the
remaining two, any one is to function.

[Fig. 3.5(a): Case (i), the 100 kW generator in parallel with the series pair of 50 kW generators. (b): Case (ii), the 100 kW generator in series with the parallel pair of 50 kW generators.]

If r is the reliability of each generator, the system reliabilities R1 and R2
are respectively computed as:

R1 = r + r^2 - r^3
R2 = r[2r - r^2]

With r = 0.95, R1 = 0.995 and R2 = 0.948

***
3.51 Redundancy at Component Level

The pertinent question here is: at what level should the components be
duplicated, i.e., at the component level, the subsystem level, or the system
level? We will explain this with the help of an example. Consider the two
configurations given in Fig 3.6.

[Fig 3.6: Redundancy at component level. (a) Two series strings of n components placed in parallel (duplication at subsystem level); (b) n parallel pairs connected in series (duplication at component level).]

In configuration 3.6(a), there are n components connected in series, and
this set of n components is placed in parallel with another identical set.
In configuration 3.6(b), the components have first been placed in parallel in
pairs, and the pairs in turn connected in series. Which configuration gives
the better reliability, that is, components duplicated at the component level
[Fig 3.6(b)] or at the subsystem level [Fig 3.6(a)]?

Let the reliability of each component be r. The reliability of the system (Rs) in
the case of configuration 3.6(a) can be expressed as

Rs = 1 - (1 - r^n)^2 = r^n (2 - r^n)

The reliability of the system (Rs') in the case of configuration 3.6(b) is
expressed as

Rs' = [1 - (1 - r)^2]^n = r^n (2 - r)^n

The ratio of Rs' to Rs gives

Rs'/Rs = (2 - r)^n / (2 - r^n)

It can be shown that the ratio Rs':Rs is greater than unity for r < 1. Hence,
configuration 3.6(b) always provides the higher reliability. Thus, as a
generalisation, it can be said that components duplicated at the component
level give higher system reliability than components duplicated at the
subsystem level (here each set is considered a subsystem). In general, it
should be borne in mind that redundancy should be provided at the component
level unless there are overriding reasons or constraints from the design
point of view. A numerical sketch follows.
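A minimal Python sketch comparing the two configurations of Fig 3.6, with hypothetical values r = 0.9 and n = 5:

def subsystem_level(r: float, n: int) -> float:
    # Fig 3.6(a): two series strings of n components, in parallel
    return 1.0 - (1.0 - r**n) ** 2

def component_level(r: float, n: int) -> float:
    # Fig 3.6(b): n parallel pairs connected in series
    return (1.0 - (1.0 - r) ** 2) ** n

r, n = 0.9, 5
print(subsystem_level(r, n))   # -> ~0.8323
print(component_level(r, n))   # -> ~0.9510, component-level duplication wins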

3.6 K-OUT-OF-M SYSTEMS

In many practical systems more than one of the parallel components
are required to work satisfactorily for successful operation of the system.
For example, we can consider a power plant where two of its four
generators are required to meet the customer's demand. In a 6-cylinder
automobile, it may be possible to drive the car if only four cylinders are
firing. Such systems are known as k-out-of-m systems. For identical,
independent components, with p as the reliability of each component, the
probability that exactly x out of m components are successful is:

P(x) = mCx p^x (1 - p)^(m-x)    (3.33)

For a k-out-of-m system, the event of system success occurs when k, k+1,
k+2, ..., or m components function successfully. So the system reliability is
the sum of the probabilities for x varying from k to m, i.e.

R = Σ_{i=k}^{m} mCi p^i (1 - p)^(m-i)    (3.34)

For constant failure rates,

R(t) = Σ_{i=k}^{m} mCi exp(-iλt) [1 - exp(-λt)]^(m-i)    (3.35)

and

ms = (1/λ) Σ_{i=k}^{m} 1/i    (3.36)
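A minimal Python sketch of equation (3.34), evaluated for the generator example discussed next (a hypothetical per-generator reliability p = 0.95 is assumed):

from math import comb

def k_out_of_m(k: int, m: int, p: float) -> float:
    # Sum of binomial terms for i = k, ..., m successful components
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k, m + 1))

print(k_out_of_m(2, 4, 0.95))   # 2-out-of-4 system, ~0.9995
print(k_out_of_m(3, 4, 0.95))   # 3-out-of-4 system, ~0.9860: lower, as the text notes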

In a k-out-of-m system, (m - k) components are redundant, and any increase
in the value of k decreases the system reliability. For example, let us
suppose that there are four generators of 200 kW each in a power plant and
the demand is 400 kW. This demand can be met by any two of the generators,
so this becomes a 2-out-of-4 system, leaving two generators as redundant. In
case the demand increases to 600 kW, it can be met by three generators, and
this would become a 3-out-of-4 system, leaving only one generator as
redundant, with a decreased system reliability.

If the components are not identical but have different reliabilities, the
calculations become more complicated.

Assume three components with reliabilities R1, R2 and R3 operating
simultaneously and in parallel. Then,

(R1 + Q1)(R2 + Q2)(R3 + Q3) = R1R2R3 + (R1R2Q3 + R1R3Q2 + R2R3Q1)
                            + (R1Q2Q3 + R2Q1Q3 + R3Q1Q2) + Q1Q2Q3 = 1

To obtain the system reliability of a 1-out-of-3 system, we discard the last
term only, i.e., Q1Q2Q3; for a 2-out-of-3 system, the last four terms are to
be discarded.

Example 3.7

An electrical system consists of four active, identical, and independent units
whose failure rates are constant. For the system's success at least three
units must function normally. Each unit has a constant failure rate equal to
0.0005 failures/hr. Calculate the system mean time to failure.

Solution

Now, m = 4, k = 3 and λ = 0.0005 failures/hr

Using equation (3.35),

R(t) = Σ_{i=3}^{4} 4Ci exp(-iλt) [1 - exp(-λt)]^(4-i) = 4exp(-3λt) - 3exp(-4λt)

Also, using equation (3.36),

ms = (1/λ)(1/3 + 1/4) = 7/(12λ) ≈ 1167 hr



Qs = Pr{j ≥ k} = Σ_{j=k}^{m} mCj ps^j (1 - ps)^(m-j)    (3.51)

Again using the rare-event approximation that ps << 1, we may approximate
this expression by its leading term,

Qs ≈ mCk ps^k    (3.52)

From Eqs. (3.50) and (3.52) the trade-off between fail-to-danger and spurious
operation is seen: the fail-safe unreliability is decreased by increasing k,
and the fail-to-danger unreliability is decreased by increasing m - k.

3.8 STAND-BY SYSTEMS

Often it is not feasible or practical to operate components or units in
parallel, and so-called stand-by arrangements must be applied; that is, when
a component or unit is operating, one or more components or units are
standing by to take over the operation when the first fails.

Stand-by arrangements normally require failure-sensing and switchover
devices to put the next unit into operation. Let us first assume that the
sensing and switchover devices are 100 percent reliable and that the
operating component and the stand-by components have the same constant
failure rate.

We can regard such a group of stand-by components as a single unit
or system which is allowed to fail a number of times before it definitely
stops performing its function. If n components are standing by to support
one operating component, we have (n + 1) components in the system, and n
failures can occur without causing the system to fail. Only the (n + 1)th
failure would cause system failure.

Since exp(-λt) exp(λt) = 1, we have

exp(-λt)[1 + λt + (λt)^2/2! + (λt)^3/3! + ...] = 1

In this expression the term exp(-λt)·1 represents the probability that no
failure will occur, the term exp(-λt)(λt) the probability that exactly one
failure will occur, exp(-λt)(λt)^2/2! the probability that exactly two
failures will occur, etc. Therefore, the probability that two or one or no
failures will occur, i.e. that not more than two failures occur, equals:

exp(-λt) + exp(-λt)λt + exp(-λt)(λt)^2/2!

If we denote by Rs and Qs the reliability and the unreliability of the
system, then because Rs + Qs = 1 we can write

Rs + Qs = exp(-λt)[1 + λt + (λt)^2/2! + (λt)^3/3! + ...]

        = exp(-λt) + exp(-λt)λt + exp(-λt)(λt)^2/2! + ...

        = 1

If in this expanded form we allow one failure, then the reliability of a
stand-by system composed of one operating component and another standing by
idly to take over if the first fails is given by:

Rs = exp(-λt)[1 + λt]    (3.53)

The mean time between failures for this two-component system is:

ms = ∫_0^∞ Rs dt = 1/λ + λ/λ^2 = 2/λ    (3.54)
For a stand-by system of three units which have the same failure rate, where
one unit is operating and the other two are standing by to take over the
operation in succession, we have

Rs = exp(-λt)[1 + λt + (λt)^2/2!]    (3.55)

and

ms = 1/λ + 1/λ + 1/λ = 3/λ    (3.56)

In general, when n identical components or units are standing by to support
one which operates,

Rs = exp(-λt) Σ_{i=0}^{n} (λt)^i / i!    (3.57)

ms = (n + 1)/λ    (3.58)
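A minimal Python sketch of equation (3.57), the Poisson-sum reliability of a cold stand-by group with a perfect switch:

from math import exp, factorial

def standby_reliability(lam: float, t: float, n: int) -> float:
    # One operating unit plus n identical stand-bys, each with failure rate lam
    return exp(-lam * t) * sum((lam * t)**i / factorial(i) for i in range(n + 1))

print(standby_reliability(0.05, 10.0, 1))   # -> ~0.9098 (cf. Example 3.9)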

Stand-by arrangements are slightly more reliable than parallel operating
units and have a considerably longer mean time between failures. However,
these advantages are easily lost when the reliability of the
sensing-switching device Rss is less than 100%, which is more often the
case. Taking this into consideration, and when the circuits are arranged so
that the reliability of the operating unit is not affected by the
unreliability of the sensing-switching device, we obtain for a system in
which one stand-by unit is backing up one operating unit:

Rs = exp(-λt) + Rss exp(-λt) λt    (3.59)

It is the exception rather than the rule that the failure rates of the stand-by
units are equal to those of the operating unit. For instance, a hydraulic
actuator will be backed up by an electrical actuator, and there may be even
a third stand-by unit, pneumatic or mechanical. In such cases, the failure
rates of the stand-by units will not be equal and the formulae which we
derived above will no longer apply.

If the system contains two different elements, A and B, the reliability
functions can be found directly as follows.

The system will be successful at time t if either of the following two
conditions holds (letting A be the primary element):

1. A succeeds up to time t, or
2. A fails at time t1 < t and B operates from t1 to t.

Translating these two conditions into time-dependent probabilities gives

R(t) = ∫_t^∞ fa(t) dt + ∫_0^t fa(t1) [∫_{t-t1}^∞ fb(t) dt] dt1    (3.60)

where f(t) is the time-to-failure density function of an element.

The first term of this equation represents the probability that element A
will succeed until time t. The second term, excluding the outside integral,
is the density function for A failing exactly at t1 and B succeeding for the
remaining (t - t1) hours. Since t1 can range from 0 to t, t1 is integrated
over that range.

For the exponential case where the element failure rates are λa and λb,

R(t) = ∫_t^∞ λa exp(-λa t) dt + ∫_0^t λa exp(-λa t1) [∫_{t-t1}^∞ λb exp(-λb t) dt] dt1

     = exp(-λa t) + ∫_0^t λa exp(-λa t1) exp[-λb(t - t1)] dt1

     = exp(-λa t) + λa exp(-λb t) ∫_0^t exp[-(λa - λb)t1] dt1

or, R(t) = [λb exp(-λa t) - λa exp(-λb t)]/(λb - λa)    (3.61)

and ms = ∫_0^∞ R(t) dt = 1/λa + 1/λb    (3.62)

It can be shown that it does not matter whether the more reliable element
is used as the primary or the stand-by element.
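A minimal Python sketch of equation (3.61), showing the symmetry in the two rates (hypothetical values λa = 0.01/hr and λb = 0.02/hr are assumed):

from math import exp

def standby_two_rates(la: float, lb: float, t: float) -> float:
    # Stand-by pair with dissimilar rates and perfect sensing/switching
    return (lb * exp(-la * t) - la * exp(-lb * t)) / (lb - la)

# Swapping the roles of the two elements gives the same reliability:
print(standby_two_rates(0.01, 0.02, 50.0))
print(standby_two_rates(0.02, 0.01, 50.0))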

Example 3.9

One generator is placed in standby redundancy to the main generator. The
failure rate of each generator is estimated to be λ = 0.05/hr. Compute the
reliability of the system for 10 hrs and its MTBF, assuming that the sensing
and switching device is 100% reliable. If the reliability of this device is
only 80%, how are the results modified?

Solution

When the sensing and switching device is 100% reliable,

Rs = (1 + λt) exp(-λt) = (1 + (0.05)(10)) exp(-(0.05)(10)) = 0.9098

Also, MTBF = 2/λ = 2/0.05 = 40 hrs.

When the sensing and switching device is 80% reliable,

Rs = (1 + 0.80 λt) exp(-λt) = 0.8491

and,

MTBF = (1 + 0.80)/λ = 1.80/0.05 = 36 hrs

The reader may observe the appreciable decrease in reliability and MTBF
caused by the imperfect sensing and switching device.

***
3.81 Types of Standby Redundancy

There can be several variations of standby arrangements in actual practice;
some of these are discussed below.

1. Cold Standby

The standby configuration discussed earlier, having perfect or imperfect
sensing and switchover devices, is known as cold standby: the primary
component operates and one or more secondary components are placed as
standbys. It is assumed that the secondary components in the standby mode
do not fail.

2. Tepid Standby

In this case, the condition of the standby component degrades progressively.
For example, components having rubber parts deteriorate over time, which
ultimately affects the reliability of the standby component.

3. Hot Standby

The standby component in this case fails without being operated, because of
a limited shelf life. For example, batteries will fail even in standby due
to chemical reactions.

4. Sliding Standby

Consider a system consisting of N components connected in series. To this
system, a sliding standby component is attached which will function when
any of the components of the system fails. This is shown in Fig 3.9.

[Fig 3.9: Sliding standby: N series components with a single standby component that can substitute for any failed member.]

It may be noted that a sliding standby arrangement may have more than one
component in standby, depending upon the reliability requirement.

5. Sliding Standby with AFL

In this case, an Automatic Fault Locator (AFL) is provided with the main
system, which accomplishes the function of locating the faulty component,
disconnecting it, and connecting the standby component. AFLs are generally
provided in automatic and highly complex systems. The sliding standby
redundancy having AFL is shown in Fig 3.10.

[Fig 3.10: Sliding standby with AFL: the standby component is switched in through an Automatic Fault Locator that detects, disconnects, and replaces a failed component.]


8
MAINTAINABILITY AND AVAILABILITY

8.1 INTRODUCTION

The principal objectives of maintenance can be defined as follows:

1. To extend the useful life of assets. This is particularly important in view


of the lack of resources.
2. To ensure the optimum availability of installed equipments for
production (or service) and obtain the maximum possible return on
investment.
3. To ensure the operational readiness of all equipment required for
emergency use, such as standby units, firefighting and rescue
equipment, etc.
4. To ensure the safety of personnel using facilities.

From time to time, statistics are generated which emphasize the costliness
of maintenance actions. While estimates of actual costs vary, they
invariably reflect the immensity of maintenance expenditures. According to
one source, approximately 800,000 military and civilian technicians in the
U.S.A. are directly concerned with maintenance. Another source states that
for a sample of four equipments in each of three classes (radar,
communication, and navigation) the yearly support cost is 0.6, 12, and 6
times, respectively, the cost of the original equipment. Such figures
clearly indicate the need for continually improved maintenance techniques.

In addition to these cost considerations, maintainability has a significant
effect on other system-effectiveness characteristics. System effectiveness
is a function of system performance capability, system dependability, and
system availability.

8.4 MAINTAINABILITY FUNCTION

Maintainability is an index associated with an equipment under repair. It is
the probability that the failed equipment will be repaired within time t.
If T is a random variable representing the repair time, then maintainability
is defined as

M(t) = Pr(T ≤ t)    (8.3)

If the repair time is exponentially distributed with parameter μ, then
the repair-density function is

g(t) = μ exp(-μt)    (8.4)

and therefore,

Pr(T ≤ t) = ∫_0^t μ exp(-μt) dt = 1 - exp(-μt)    (8.5)

[Fig. 8.4: Maintainability graph: M(t) rises from 0 toward 1, reaching 1 - 1/e at t = 1/μ.]

Thus the maintainability equation is

M(t) = 1 - exp(-μt)    (8.6)

The graph of M(t) against t is shown in Fig. 8.4.



The expected value of the repair time is called the mean time to repair
(MTTR) and is given by

MTTR = ∫_0^∞ t g(t) dt = ∫_0^∞ μ t exp(-μt) dt = 1/μ    (8.7)
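A minimal Python sketch of equations (8.6) and (8.7), assuming a hypothetical repair rate μ = 0.5 repairs per hour:

from math import exp

mu = 0.5   # assumed repair rate, per hour

def maintainability(t: float) -> float:
    # M(t) = Pr(repair completed by time t), equation (8.6)
    return 1.0 - exp(-mu * t)

print(maintainability(2.0))   # probability the repair is completed within 2 h
print(1.0 / mu)               # MTTR = 1/mu = 2 h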

8.5 AVAILABILITY FUNCTION


The availability function can be computed using the familiar Markov model.
It is assumed that the failure and repair rates are constant. The Markov
graph for the availability of a single component with repair is shown in
Fig. 8.5. The repair starts as soon as the component fails.

λ = failure rate (failures per unit time)
μ = repair rate (repairs per unit time)

[Fig. 8.5: Markov graph for availability: state 0 (up) and state 1 (down), with transition probability λΔt from state 0 to state 1 and μΔt from state 1 to state 0.]

State 0 denotes that no failure has occurred and state 1 denotes that one
failure has occurred (i.e. the component is down). If the component has not
failed at time t, then the probability that it will fail in the time
interval (t, t + Δt) is λΔt. On the other hand, if the component is in
state 1 (failed state), then the probability that it will return to state 0
is μΔt.

From the Markov graph, it can be seen that the probability that the
component will be in state 0 at time t + Δt is

P0(t + Δt) = P0(t)(1 - λΔt) + P1(t) μΔt    (8.8)



Similarly, the probability that the component will be in state 1 at time
t + Δt is

P1(t + Δt) = P1(t)(1 - μΔt) + P0(t) λΔt    (8.9)

The above equations can be rewritten as follows:

[P0(t + Δt) - P0(t)]/Δt = -P0(t) λ + P1(t) μ

[P1(t + Δt) - P1(t)]/Δt = P0(t) λ - P1(t) μ

Letting Δt → 0, the resultant differential equations are

dP0(t)/dt = -λ P0(t) + μ P1(t)    (8.10a)

dP1(t)/dt = λ P0(t) - μ P1(t)    (8.10b)

At time t = 0,

P0(0) = 1 and P1(0) = 0

The solution of this set of two differential equations yields:

P0(t) = μ/(λ + μ) + [λ/(λ + μ)] exp[-(λ + μ)t]    (8.11a)

P1(t) = λ/(λ + μ) - [λ/(λ + μ)] exp[-(λ + μ)t]    (8.11b)

As per the definition of availability,

A(t) = P0(t) = μ/(λ + μ) + [λ/(λ + μ)] exp[-(λ + μ)t]    (8.12)

The availability function is plotted in Fig. 8.6(a).

As time becomes large, the availability function reaches a steady-state
value. The steady-state or long-term availability of a single component is

A = A(∞) = μ/(λ + μ)    (8.13)

[Fig. 8.6: Behaviour of a single repairable unit. (a) Availability of the unit versus normalized time; (b) average history of the output of the unit, alternating mean up-time U and mean down-time D over the cycle time Tc; (c) two-state (up/down) transition diagram.]



This equation can be modified as

A = (1/λ) / (1/λ + 1/μ)    (8.14)

Here, 1/λ is the mean time between failures (MTBF). It may be noted that
this has been defined as the mean time to failure (MTTF) in the case of
non-repairable components. 1/μ is the mean repair time or mean time to
repair (MTTR). Fig. 8.6(b) characterizes the expected or mean behaviour of
the component. U represents the mean up-time (MTBF) and D represents the
mean down-time (MTTR). Tc is known as the cycle time. Here,

U = 1/λ
D = 1/μ

The steady-state availability is a number greater than zero and less than
one. It is equal to zero when no repair is performed (μ = 0) and equal to
one when the equipment does not fail (λ = 0). Normally, 1/μ is much smaller
than 1/λ, and therefore the availability can be approximated as

A = 1/(1 + λ/μ) ≈ 1 - (λ/μ)    (8.15)

When λ/μ approaches zero, A approaches unity.

P1(t) defines the unavailability of the equipment, and hence

A'(t) = [λ/(λ + μ)][1 - exp(-(λ + μ)t)]

A' = A'(∞) = λ/(λ + μ)    (8.16)

The number of failures per unit time is called the frequency of failures.
This is given by

f = 1/Tc = 1/(U + D)    (8.17)

The availability, the transition rates (λ and μ), and the mean cycle time
can be related as follows:

A = U/(U + D) = fU = f/λ    (8.18)

A' = D/(U + D) = f/μ    (8.19)

f = Aλ = A'μ    (8.20)
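A minimal Python sketch of equations (8.12) and (8.13), assuming hypothetical rates λ = 1e-3 failures/hr and μ = 0.02 repairs/hr:

from math import exp

lam, mu = 1e-3, 0.02   # assumed failure and repair rates, per hour

def availability(t: float) -> float:
    s = lam + mu
    return mu / s + (lam / s) * exp(-s * t)   # equation (8.12)

print(availability(10.0))    # transient availability at t = 10 h
print(mu / (lam + mu))       # steady-state availability, ~0.952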



Example 8.1

The following data was collected for an automobile:


mean time between failures = 500 hr
mean waiting time for spares = 5 hr
mean time for repairs = 48 hr
mean administrative time = 2 hr
Compute the availability of the automobile.

Solution

Total mean down time = 5 + 48 + 2 = 55 hrs.

Using relation (8.18), we get

Availability = U/(U + D) = 500/(500 + 55) = 500/555 = 0.90

The automobile would be available 90% of the time.

***
Example 8.2

An equipment is to be designed to have a minimum reliability of 0.8 and a
minimum availability of 0.98 over a period of 2 × 10^3 hr. Determine the
mean repair time and the frequency of failure of the equipment.

Solution

R(t) = exp(-λt)

Now, R(t) = 0.8 for t = 2 × 10^3 hr

Therefore,

λ = -0.5 × 10^-3 ln(0.8) = 1.12 × 10^-4 /hr

Also, the steady-state availability is given by equation (8.13):

μ/(λ + μ) = 0.98

or, μ = 0.98 μ + 1.12 × 10^-4 × 0.98

or, μ = 5.49 × 10^-3 /hr

Hence, the mean repair time is given by

MTTR = 1/μ = 10^3/5.49 = 182.2 hrs

Also, f = Aλ = 1.12 × 10^-4 × 0.98 = 1.1 × 10^-4 /hr
***
8.6 TWO UNIT PARALLEL SYSTEM WITH REPAIR
8.61 System Reliability

The reliability of a parallel system can be influenced by repairs. Consider
a simple system having two units in parallel. In such systems, when a unit
fails it goes to repair and the other unit starts meeting the system demand.
The system fails only when the second unit fails before the failed one is
restored to operation. A two-unit system can be represented by the
three-state Markov model shown in Fig. 8.7. At state 0 both units are good,
at state 1 one unit has failed, and at state 2 both units have failed.

[Fig. 8.7: Markov reliability model for a two-unit parallel system: states 0 (both good), 1 (one failed) and 2 (both failed); state 2 is absorbing, since reliability analysis stops at the first system failure.]

The following set of differential equations can be obtained from the
state-probability equations. After solving for the P's, we find that the
system reliability is

R(t) = P0(t) + P1(t)    (8.21)

     = [s1 exp(s2 t) - s2 exp(s1 t)]/(s1 - s2)    (8.22)

where s1 and s2 are the roots of s^2 + (λ0 + λ1 + μ1)s + λ0 λ1 = 0.

The mean time to first system failure (MTFF) is another system parameter
useful for the analysis of system effectiveness when repairs are performed.
This parameter is often referred to as the mean time between failures
(MTBF), as the system states alternate between good and bad continuously
due to repair.

MTFF = ∫_0^∞ R(t) dt = ∫_0^∞ [s1 exp(s2 t) - s2 exp(s1 t)]/(s1 - s2) dt = -(s1 + s2)/(s1 s2)    (8.24)

For a two-unit system,

s1 + s2 = -(λ0 + λ1 + μ1)

s1 s2 = λ0 λ1

MTFF = (λ0 + λ1 + μ1)/(λ0 λ1)    (8.25)

For the active-redundant system, this turns out to be

MTFF = (3λ + μ)/(2λ^2) = 3/(2λ) + μ/(2λ^2)    (8.26)

For μ = 0, we get MTFF = 3/(2λ), which is the mean time to failure of a
two-unit non-maintained parallel system. Similarly, for a standby two-unit
system,

MTFF = (2λ + μ)/λ^2 = 2/λ + μ/λ^2    (8.27)

which reduces to 2/λ for μ = 0.
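A minimal Python sketch of equations (8.25) to (8.27), assuming hypothetical rates λ = 1e-3/hr and μ = 0.05/hr:

def mtff(lam0: float, lam1: float, mu1: float) -> float:
    # Equation (8.25): mean time to first failure of a two-unit system
    return (lam0 + lam1 + mu1) / (lam0 * lam1)

lam, mu = 1e-3, 0.05
print(mtff(2 * lam, lam, mu))   # active redundant: (3 lam + mu)/(2 lam^2)
print(mtff(lam, lam, mu))       # cold standby:     (2 lam + mu)/lam^2

Note how strongly the repair rate μ stretches the time to first system failure.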


8.62 System Availability

The approach to the computation of availability is the same as that for
reliability. However, since availability is concerned with the status of the
system at time t, the repair at state 2 is also considered. The Markov
availability model is shown in Fig. 8.8.
[Fig. 8.8: Markov availability model for a two-unit parallel system: as in Fig. 8.7, but with an additional repair transition (rate μ2) from state 2 back to state 1.]

The steady-state availability of the system is

A(∞) = 1 - λ0 λ1/(λ0 λ1 + λ0 μ2 + μ1 μ2)    (8.28)

For the case of a two-unit active redundant system (λ0 = 2λ, λ1 = λ,
μ1 = μ, μ2 = 2μ), therefore,

A(∞) = 1 - λ^2/(λ^2 + 2λμ + μ^2) = 1 - [λ/(λ + μ)]^2    (8.29)

For a two-unit series system, the availability becomes

A = μ1/(λ0 + μ1) = μ/(2λ + μ)    (8.30)

If we have n units in series, then

A = μ/(nλ + μ)    (8.31)

Example 8.3

Two transmitters are installed at a particular station, each capable of
meeting the full requirement. One transmitter has a mean constant failure
rate of 9 faults per 10^4 hrs, and the occurrence of each fault renders it
out of service for a fixed time of 50 hours. The other transmitter has a
corresponding failure rate of 15 faults per 10^4 hours and an out-of-service
time per fault of 20 hours. What is the mean availability of the system?

Solution

For the first transmitter,

λ1 = 9 × 10^-4 /hr

μ1 = 1/50 = 0.02 /hr

Hence,

A1 = μ1/(μ1 + λ1) = 0.02/(0.02 + 9 × 10^-4) = 0.9569

Similarly, for the second transmitter,

λ2 = 15 × 10^-4 /hr

μ2 = 1/20 = 0.05 /hr

Hence,

A2 = μ2/(μ2 + λ2) = 0.05/(0.05 + 15 × 10^-4) = 0.9709

Hence, the system availability for the two transmitters in parallel is given by:

A = 1 - (1 - A1)(1 - A2)
  = 1 - (1 - 0.9569)(1 - 0.9709)
  = 1 - 0.0431 × 0.0291 = 0.9987
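A minimal Python sketch of this example, using A = μ/(μ + λ) per unit and the parallel product law:

def unit_availability(lam: float, mttr_hours: float) -> float:
    mu = 1.0 / mttr_hours   # repair rate from the out-of-service time
    return mu / (mu + lam)

a1 = unit_availability(9e-4, 50.0)    # -> ~0.9569
a2 = unit_availability(15e-4, 20.0)   # -> ~0.9709
print(1.0 - (1.0 - a1) * (1.0 - a2))  # -> ~0.9987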

***
8.7 PREVENTIVE MAINTENANCE

Preventive maintenance is sometimes considered as a procedure intended


primarily for the improvement of maintenance effectiveness. However, it is
more proper to describe preventive maintenance as a particular category of
maintenance, designed to optimize the related concepts of reliability and
availability.

Preventive maintenance is advantageous for systems and parts whose failure
rates increase with time. The cost savings of preventive maintenance
(planned replacement) accrue only if the parts under consideration exhibit
increasing failure rates. Many types of electron tubes, batteries, lamps,
motors, relays and switches fall within this category. Most semiconductor
devices and certain types of capacitors exhibit decreasing failure rates,
for which planned replacement offers no benefit.
12
ECONOMICS OF RELIABILITY ENGINEERING

12.1 INTRODUCTION

Any manufacturing industry is basically a profit-making organization, and no
organization can survive for long without minimum financial returns on its
investments. There is no doubt that the expense connected with reliability
procedures increases the initial cost of every device, equipment or system.
However, when a manufacturer can lose important customers because his
products are not reliable enough, there is no choice other than to incur this
expense. How much reliability cost is worth in a particular case depends
on the cost of the system and on the importance of the system's failure-free
operation. If a component or equipment failure can cause the loss of a
multimillion-dollar system or of human lives, the worth of reliability
and the corresponding incurred cost must be weighed against these factors.
For the producer, it is a matter of remaining in business. However, his
business volume and profit will be substantially increased once his
reliability reputation is established. Therefore, from the manufacturer's
point of view, two important economic issues are involved:

(i) Financial profit


(ii) Customers' satisfaction

If a manufacturer intends to stay in business, he has not only to optimize
his own costs and profits but to maximize customers' satisfaction as well.

12.2 RELIABILITY COSTS

Reliability costs can be divided into five categories, as shown in Fig. 12.1.
Components of each classification are described below.

[Fig. 12.1: Classifications of reliability costs.]


Classification I

This classification includes all those costs associated with internal failures,
in other words, the costs associated with materials, components, and
products and other items which do not satisfy quality requirements.

Furthermore, these are those costs which occur before the delivery of the
product to the buyer. These costs are associated with things such as the
following:

1. Scrap
2. Failure analysis studies
3. Testing
4. In-house components and materials failures
5. Corrective measures

Classification II

This classification is concerned with prevention costs. These costs are


associated with actions taken to prevent defective components, materials,
and products. Prevention costs are associated with items such as the
following:

1. Evaluating suppliers
2. Calibrating and certifying inspection and test devices and
instruments.

3. Receiving inspection
4. Reviewing designs
5. Training personnel
6. Collecting quality-related data
7. Coordinating plans and programs
8. Implementing and maintaining sampling plans
9. Preparing reliability demonstration plans

Classification III

Under this classification are costs associated with external failures, in
other words, costs due to defective products shipped to the buyers. These
costs are associated with items such as the following:

1. Investigation of customer complaints


2. Liability
3. Repair
4. Failure analysis
5. Warranty charges
6. Replacement of defective items

Classification IV

This category includes all the administration-oriented costs, for example,
costs associated with the following:
1. Reviewing contracts
2. Preparing proposals
3. Performing data analysis
4. Preparing budgets
5. Forecasting
6. Management
7. Clerical

Classification V

This category includes costs associated with detection and appraisal. The
principal components of such costs are as follows:
1. Cost of testing
2. Cost of inspection (i.e., in-process, source, receiving, shipping,
and so on)
3. Cost of auditing

12.3 EFFECT OF RELIABILITY ON COST

Any effort on the part of the manufacturer to increase the reliability of his
products will increase reliability design costs and internal failure costs.
However, after some time internal failure costs will start decreasing.
External costs like transportation do not depend on reliability, but
installation, commissioning, and maintenance costs will show a decline with
an increase in reliability.

[Fig. 12.2: Cost curves of a product: failure cost, manufacturing cost, and operating cost plotted against reliability, together with the resulting total cost curve.]

In general, it is not profitable to aim for complete perfection by
eliminating all failures (even if that were possible). This is clear from
the reliability cost curves given in Fig. 12.2 for the various categories of
cost of an equipment. Up to a certain point it is worthwhile to make
appropriate investments in reliability, and further investment will be
advisable only where the reliability has an
13
RELIABILITY MANAGEMENT

13.1 INTRODUCTION

Reliability is no longer a subject of interest confined to academicians and
scientists. It has become a serious concern for practising engineers and
manufacturers, sales managers and customers, economists and government
leaders. The reliability of a product is directly influenced by every aspect
of design and manufacturing, quality engineering and control, commissioning
and subsequent maintenance, and feedback of field-performance data. The
relationships between these activities are shown in Fig. 13.1.

[Fig. 13.1: Reliability and product life-cycle, linking design, manufacturing, quality control, maintenance, field-performance feedback and external sources.]

A well-planned and efficiently managed reliability programme makes possible
a more effective use of resources and results in an increase in productivity
and a decrease in wastage of money, material, and manpower. As organizations


grow more and more complex, communication and coordination between
various activities become less and less effective. The cost of ineffective
communication can be dangerously expensive in terms of both time and
money. Moreover reliability achievement needs, in addition to proper
coordination of information, a specialized knowledge of each and all of the
interrelated components in a system. This places a great emphasis on the
creation of an independent group which could not only coordinate between
different departments but also carry out all reliability activities of the
organization.

Managing the reliability and quality control areas under the impact of
today's organized world competition is a highly complex and challenging
task. Surmounting the technological developments required for plant
equipment, process controls, and manufactured hardware requires a close
working relationship between all producer- and user-organization elements
concerned.

The techniques and applications of reliability and quality control are
rapidly advancing and changing on an international basis. Industry views the
use of higher performance and reliability standards as scientific management
tools for securing a major advantage over the competition. The application
of these modern sciences to military equipment, space systems, and
commercial products offers both challenge and opportunity to those
responsible for organizational effectiveness. The use of intensified
reliability and quality programs as a means of improving product designs,
proving hardware capability, and reducing costs offers far-reaching
opportunity for innovations in organization and methods.

The effects of the increasing complexity, reliability, schedule, and cost


competition on the reliability and quality control organization have required
that all top management be aware of the most logical cost-saving areas and
be assured that the product is as dependable as possible under the allowable
conditions of contract or competition.

To manufacture an excellent quality product with a very high numerical


reliability sometimes requires much more money than a customer is willing
to pay. Therefore, since high reliability and acceptable product costs are
often initially difficult to achieve, it becomes necessary that timely
management decisions be made regarding reliability, schedule, and cost
trade-offs. These decisions require the use of very exacting and cautiously
selected information and careful organization of implementing action in order
to obtain the most value for the money expended.

13.2 MANAGEMENT OBJECTIVES

The management objectives in organizing the reliability and quality control
department should be to design and develop an organizational plan that will
provide the controls necessary to assure that the services and products of
the parent organization meet contractual requirements. These management
objectives may be stated in many different ways, but in essence the purpose
of the quality control and reliability department is to assure that
competitively priced services and hardware that meet or exceed the
customer's requirements are provided.

Of course, there must be an optimum balance between the quality and


reliability aspects of a product and its cost; otherwise, the industry may
price itself out of the range that the customer is willing or has the ability
to pay. Also, in some instances the customer may deliberately elect to
sacrifice some reliability assurance for schedule reasons. Deliberate actions
are required of management in order to accomplish its planned objectives
for a program effectively and to assure that any trade-offs affecting product
reliability and maintenance are clearly understood by the producer and
customer.

Management is responsible for the business enterprise showing a profit. It is


in this area that quality control and reliability have the responsibility to assist
top management by assuring that planned actions are met in the design,
manufacture, and use phases of the hardware. The company that develops a
reputation for the manufacture of reliable products within budget will usually
grow and prosper. Certainly a manufacturing or service enterprise of high
integrity and enthusiasm will increase the prosperity and security of the
organization and employees, as well as contribute to the social well-being of
the community and nation.

Management of each organization element must be flexible and able to react


quickly to meet the demands of any possible competition or new customer
requirement. The ability to react quickly, objectively, and effectively to
quality and reliability challenges and to anticipate these needs before
difficulties arise is an organization characteristic most desired. Quality control
and reliability departments have a responsibility to minimize warranty and
customer service complaints by planned preventive actions as well as timely
corrective-action coordinations. A satisfied customer is a most important
contributing factor to the continuance of the manufacturing enterprise and
the achievement of management objectives.

The reliability requirements should be clearly stated at the design and


development stage itself. While setting reliability objectives it is worth
considering the following objectives of the organization:

1. Maximize output,
2. Optimize reliability,
3. Minimize waste,
4. Maximize customer satisfaction and reputation,
5. Optimize job satisfaction, and
6. Minimize discontent.

All concerned should participate in deciding specific objectives and agree
on the ways and means of achieving them. The management-by-objectives
approach places greater emphasis on the importance of the basic decisions
made during the design and development cycle in terms of reliability and how
well the product satisfies the needs for which it is intended.

All objectives, whether requirement specifications or design instructions, are


essentially a means of communicating information to others. Therefore they
should be:

1. Clearly understandable,
2. Unambiguous, and
3. Realistic in terms of resources available.

A reliability specification format can be prepared for each type of product.


Even though the content may vary considerably from one type to another,
the typical contents may include:

1. The type and source of component failure data.


2. Reliability assessment methods to be employed.
3. Confidence levels required for reliability predictions
4. Mode of reliability specification:

(a) MTTF (mean time to failure) for nonrepairable items,


(b) MTBF (mean time between failures) for repairable items,
(c) Probability of success for one-shot devices whose operation is limited
to a single operation cycle,
(d) Failure rate, and
(e) Mean number of operations before an item fails (for devices such as
switches, connectors, relays, circuit breakers, etc.)

5. Maximum acceptable down time and mean time to repair (maintainability
characteristics).

6. Maintenance policy:

(a) Repair plan,


(b) Availability of spares,

(c) Maintenance personnel requirements, and
(d) Test facilities.

7. Details of environmental conditions and methods of operation.

13.3 TOP MANAGEMENT'S ROLE IN RELIABILITY AND QUALITY CONTROL PROGRAMS

Management must provide the controls needed to assure that all quality
attributes affecting reliability, maintainability, safety, and cost comply with
commitments and satisfy the customer's requirements. Tersely stated,
management must have well-planned policies, effective program planning,
timely scheduling, and technical training. Management must clearly state and
support its objectives and policies for accomplishing the product quality and
reliability and assign responsibility for accomplishment to appropriate
functions throughout the organization.

Top management's basic objective is to provide and maintain quality and


reliability organizations capable of efficiently accomplishing the necessary
inspection, test, and analytical laboratory services to assure that all
products satisfy the specified requirements of quality and reliability. The
quality control organization must support these objectives in a timely,
objective, and helpful manner. Improved product performance and lower
costs must be continually emphasized, and the results must be made visible
to management.

Fig. 13.2 depicts a typical top-management organization showing the
responsible management of the combined quality control and reliability
control departments. This arrangement provides for the entire function to be
headed by a director, with the quality control and reliability control
functions headed by managers. In this manner the necessary coordination,
services, and assurances at the equally important policy-setting and
operating levels of the various programs are kept on the policy course and
not allowed to drift off to the detriment of any one aspect. Advantages of
this combined quality control and reliability organization are that top
management has one point of communication, and the overhead costs of a
combined R&QC organization may be lower than for separate organizations.

13.31 Time-phase Planning, Scheduling, and Implementation

The importance of management control through detailed scheduling of each
item of the reliability and quality task must be emphasized. Care must be
exercised to sequence reliability and quality program elements to coincide
with related total program plans. For example,

it would not be practical to request a major change in existing procedures


when the contract is nearing completion and the return will not justify the
effort expended. Nor would it be practical to expect the accomplishment of
tests in nonessential areas of operation when the cost of the test equipment
would not be justified by the service the equipment would provide.
However, the purchase and installation of equipment for assurance may
more than justify itself when compared with the potential impact of
equipment failure in customer operations.

[Fig. 13.2: Top-management organisation, headed by the president or plant general manager.]

Management follow-up and evaluation of reliability and quality program


progress should be accomplished by use of audits and simple reports that
are specifically designed for the purpose. These management reports serve
as decision-making tools and forewarn management in the event progress
becomes static. Timely management action must be readily available and
applied as needed to many areas of the manufacturing sequences to
maintain a good, smooth-flowing, low-cost operation.

13.32 Management Selection of Key Personnel

Management must recognize and choose the type of persons that are needed
to fill the key positions in the reliability and quality control organization.
Management must know that these selected people will be able to work
closely with and motivate others to accomplish their respective tasks.
Top management philosophy establishes the element for employee
motivation throughout the enterprise.

Top management must be organizationally situated to apprise, counsel, and


instruct the middle management that reports to them. All levels of
management must maintain clear two-way communications and motivate
others without destroying initiative and creativity.

When top management can report improvements in progress, whether it be


in implementing a new program or during the actual manufacturing process,
the chances are good that the operations of the particular departments
are contributing effectively to assuring a fair profit for the business
enterprise.

13.4 COST EFFECTIVENESS CONSIDERATIONS

13.41 Organization Responsibility

Responsibility for costs within the reliability and quality control organizations
can be most effectively accomplished when specific, capable individuals are
charged with coordinating all matters relating to cost analysis and budget
control. However, the assignment of coordination responsibility to these
individuals must not be allowed to detract from the duty of each member of
the reliability and quality control organization to maintain a high level of cost
effectiveness.

The cost control function within the reliability and quality control
organization is most frequently located within the Quality Control
Administrative Group, the Quality Control Systems Group, or the Quality
Control Engineering Group. Regardless of which group is given the
responsibility, the director of reliability and quality control and his
department managers must maintain very close and continuing communications
with the responsible individuals. Timely analysis of trends, and decisions
and guidance, should be provided frequently.

13.42 Timely Cost Planning

The reliability and quality control management team has value to the total
organization that is related directly to its favourable impact on product
reliability, performance, and costs. Its contribution is of greatest value
when the performance, reliability, and maintainability of the product are
optimized together with total program costs.

Although many individuals cooperatively contribute to the overall


performance schedule-cost profit objective, it is necessary that the
executive authority of R&QC management enter into the cycle whenever
the desired voluntary cooperation in other branches of the organization
falters or the need for new ground rules and policy decisions becomes
evident.

Product quality assurance is most economically secured when the conditions
which might lead to loss of sale, customer rejection, or excessive warranty
cost are predicted, prevented, or corrected at the earliest possible time.

13.43 Incentive Contracts

The abrupt deemphasis of cost-plus-fixed-fee military contracting has
focused attention upon the incentive contract as a means of assuring
effective management interest in achieving product reliability and
maintenance commitments. With this medium, a specified scale of incentive,
and sometimes penalty, is applied as a factor in the total contract price.
Penalty scales are usually applied at lower rates than incentive scales and
may be omitted in competitive fixed-price contracts.

13.44 Cost Analysis and Budgeting

Every product merits an analysis of the total tasks to be performed with the
allowed costs. The estimation of costs for every function must be quite
close to the final actual costs of the specific function if effective results are
to be achieved. It is apparent that the general readjustment (usually arbitrary
cuts) of budgetary estimates by top management will be in those areas
where the departmental estimates and accounting reports of past
performance on similar programs are in obvious disagreement.

13.45 Equipment and Facility Costs

Cost estimation of the equipment and facilities required for standards and
calibration, process control, inspection and test is another essential task
for reliability and quality control engineers. Applicable staff and line
personnel should be given the opportunity to take part in the planning of all
equipment and facilities expansion, retirement, or replacement.

Great care must be exercised to determine that adequate justification exists
for the addition or replacement of facilities. Improved product reliability
and lower costs must be tangible and measurable. Predicted savings should
offset the cost of new equipment and facilities within a period prescribed
by top management.

13.46 Cost Records

Reliability and quality control organizations have the responsibility for


generating and maintaining the important segments of product records of
rework and scrap costs, testing costs, warranty costs, etc., upon which
pricing structures, company procedures, redesign, and even critical litigation
have been founded. The cost of these record-keeping and data processing
activities must certainly be compared with their worth to the company.
The responsibility for this falls upon those who implement and make the
system work.

Cost estimation for this requirement must include consideration of the
savings obtainable through automated data processing equipment, the
ever-increasing cost of records storage and data retrieval, the nature of
any contractual requirement for data reproduction and translation, and
participation in data centers.

13.47 Quality and Reliability Cost Control

To control cost in the quality and reliability programs, careful long range
planning must be exercised by management. This planning must be
accomplished by those to whom top management has delegated the
responsibility and who will be held accountable for the implementation of the
plans. The controlling of these long range plans at the time of
implementation is one of the basic principles of cost control.

Study programs, research and development programs, production programs, and
prevention, assessment, rework, and scrap cost estimates should all be
included in the long-range plans, whereby proper budgeting may be forecast
and arrangements made.

13.5 THE MANAGEMENT MATRIX

The adroitness of a company to remain competitive and maintain its profit


level requires more than the ability to engineer and produce products in
quantity. The matrix technique applied to decision making provides an
objective means for solving various management problems. Quality
assurance of a product or system is a significant factor in the growth pattern
of a company. The departmental functions, policies and responsibilities
dictate the type of organizational structure which can best fulfill the

objectives of the consumer and the company. At the top management level,
the matrix technique is useful in determining the organisation structure
based upon the responsibilities delegated to each department and as a
basis for penetrating new market areas. In all cases, the effectiveness
of the management process is directly related to profitability through
consumer assurance that product performance and quality are maximized
within the negotiated cost structure.

Management of a department responsible for administration of the quality


assurance program in a division of a company primarily oriented to
research, development and production of diversified products and systems
requires special planning, techniques and philosophy. The management
must have the capability to continually maintain the proper level of
customer satisfaction and evaluate product performance even though the
products and systems are usually required to perform at limits bounded by
the state of the art. In general, each product or system has performance
requirements in scope and magnitude such that the product assurance
requirements specified are as diverse as the product line, depending upon
the customer documents or procurement agency involved in the contract.

The solution to the stated conditions must be one of dynamic planning of
the steps in organizing to accomplish the department objectives. Elements of
the matrix can then be sequentially incorporated into the organizational
structure in logically phased steps. Matrix planning is always an
evolutionary process intended to eliminate the administrative stresses
associated with revolutionary changes due to new business and profound
requirements. A continual audit of the structure and contract requirements
should be conducted to validate the effectiveness of the organization in
cost and performance and its applicability to program demands.

A study of programs determines the need for an operational analysis, since
the interface relations between the sections for each contract have to be
established during the proposal stage. Each new program is placed in the
organization after a decision has been made as to the need for establishing
it as a project. Several factors are considered, and the methodology of
decision theory is applied. The following factors are the most heavily
weighted.

1. Customer Requirement

Certain programs are of such magnitude that management and communications
must extend in an unbroken line through all levels of procurement. The need
for a specific organizational structure is a customer requirement. This does
not assure that all activities will be performed by the project, but that
authority and responsibility for compliance with requirements is maintained
by the project.

2. Special Requirements

The product or system and/or contractual requirements are so specific and


different that existing procedures cannot suffice.

3. Schedule

This objective requires special attention. A tight schedule requires


appropriate manpower to evaluate acceptability of the production flow. In
some cases, the personnel performing acceptance must be certified in special
ways or have specific talents.

4. Product Complexity and Skill Levels

Product complexity (processes, test techniques, production fabrication) and


skill levels are such that the product is significantly different from related
products.

5. Dollar Volume as a Function of Time

The ratio of program cost to time is high. This implies that a concentrated
program effort is required.

6. Manpower Availability

The program requirements for specialized manpower are such that this
factor is considered. This objective is not heavily weighted since it is
related to attainment of other objectives.

These objectives are weighted in terms of the various courses of action


using the matrix approach to establish a decision. This approach has a
basic purpose of analyzing the array of actions and depicting the decision in
mathematical terms.

The management function then utilizes this tool for planning and action in
performance of its activities. The organization matrix provides the
mechanism for management in an expeditious manner and efficient
departmental control commensurate with this company's products and
philosophies.

The placement of quality and reliability assurance in the overall


organizational structure should be considered on the basis of optimum
product control and assurance which minimizes the total program costs.
