Abstract—When the concept of reliability began to formally become an integrated engineering approach in the 1950s, reliability was associated with failure rate. Today the term "reliability" is used as an umbrella definition covering a variety of subjects including availability, durability, quality, and sometimes the function of the product. Reliability engineering was developed to quantify "how reliable" a component, product, or system was when used in a specific application for a specific period of time. The data center industry has come to rely on "tier classifications," as presented in a number of papers by the Uptime Institute, as a gradient scale of data center configurations and requirements from least (Tier 1) to most reliable (Tier 4). This paper applies the principles and modeling techniques of reliability engineering to specific examples of each of the tier classifications and discusses the results. A review of the metrics of reliability engineering being used is also included.

Index Terms—Availability, component, failure rate, mean time between failures (MTBF), mean time to repair (MTTR), reliability.

Manuscript received February 15, 2010; accepted May 17, 2010. Date of publication December 21, 2011; date of current version March 21, 2012. Paper 2010-PSEC-021, presented at the 2010 IEEE/IAS Industrial and Commercial Power Systems Technical Conference, Tallahassee, FL, May 9–13, and approved for publication in the IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS by the Power System Engineering Committee of the IEEE Industry Applications Society. The authors are with HP Critical Facility Services, Frankfort, NY 13340 USA (e-mail: rarno@hp.com; afriedl@hp.com; pgross@hp.com; bschuerger@hp.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIA.2011.2180872

I. RELIABILITY TERMINOLOGY AND METRICS

In this section, reliability terminology and metrics are introduced.

Availability (A) is the long-term average fraction of time that a repairable component or system is in service and satisfactorily performing its intended function. For example, if the electricity is off for 1 h in a year, but the rest of the year the electricity is on, the availability of electrical power for that year is 8759 h divided by 8760 h, which is 0.999886.

An availability of 0.99999 could mean that the system was down for 5.3 min (or about 315 s) per year. It would make no difference in the availability calculation if there was one 5.3-min outage or 315 one-second outages. It could also be one outage of 1.77 h in 20 years. In all three cases, the availability is 0.99999.

There are two common measures of availability: inherent availability and operational availability. The difference between the two is based on what is included as "repair time." For inherent availability, only the time it takes to fix the equipment is included. Inherent availability assumes that the technician is immediately available to work on the equipment the moment it fails and has all the parts, etc., necessary to complete the repair. For operational availability, all the delays for scheduling, travel time, parts, etc., are included. If it takes 24 h to fly a part in to repair the equipment, that time adds to the "repair time."

Inherent availability and operational availability show different aspects of the system being analyzed. Operational availability reflects the "real world," i.e., how the system really operates. There are usually delays between the time a piece of equipment fails and when the repair begins. Spare parts inventories are also very significant and directly impact operational availability. Therefore, when determining spare parts inventories, on-site personnel and their level of training, etc., operational availability is a useful tool.

Inherent availability is a more useful tool for analyzing the system design. Since there are wide variations in maintenance practices from facility to facility, operational availability could vary significantly between two facilities with identical infrastructures. Eliminating all of the logistics involved with getting the parts and a trained individual to the piece of equipment, and counting only the actual repair time, provides a more accurate evaluation of the infrastructure design. It shows the availability that is "inherent" to the design, assuming the spare parts inventory and repair are perfect. In this paper, all of the values and discussion concerning availability are for inherent availability.

The failure rate (λ) is defined as the rate at which failures occur per unit time in an interval, given that no failure has occurred prior to the beginning of the interval.

Mean time between failures (MTBF), as its name implies, is the average time the equipment performs its intended function between failures. A very common distribution function is the "normal" distribution. For the case of a constant failure rate,

MTBF = 1/λ.

Electronic equipment, along with many other types of equipment, has a relatively constant failure rate over much of its useful life and follows an exponential statistical distribution. The common assumption for reliability analysis is that all the equipment in the system to be analyzed falls within this statistical distribution, where the failures are random and the failure rate is constant. All of the calculations shown below assume a constant failure rate for the equipment.

Mean time to repair (MTTR) is the average time it takes to repair the failure and get the equipment back into service.

Inherent availability is mathematically defined as the MTBF divided by the MTBF plus the MTTR

A = MTBF/(MTBF + MTTR).

Reliability (R) is the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure.
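To make these metrics concrete, the following minimal Python sketch (with hypothetical MTBF and MTTR values, not taken from the paper) computes the inherent availability, the resulting expected annual downtime, and the reliability over a one-year mission, under the constant-failure-rate assumption described above.

import math

HOURS_PER_YEAR = 8760.0

def inherent_availability(mtbf_h, mttr_h):
    # A = MTBF / (MTBF + MTTR), counting only the actual repair time.
    return mtbf_h / (mtbf_h + mttr_h)

def reliability(mtbf_h, mission_h):
    # R(t) = exp(-lambda * t) for a constant failure rate lambda = 1 / MTBF.
    failure_rate = 1.0 / mtbf_h  # failures per hour
    return math.exp(-failure_rate * mission_h)

# Hypothetical example values, for illustration only (not from the paper).
mtbf = 100_000.0  # hours between failures
mttr = 4.0        # hours to repair

A = inherent_availability(mtbf, mttr)
downtime_min_per_year = (1.0 - A) * HOURS_PER_YEAR * 60.0
R_one_year = reliability(mtbf, HOURS_PER_YEAR)

print(f"Inherent availability A = {A:.6f}")
print(f"Expected downtime       = {downtime_min_per_year:.1f} min/year")
print(f"Reliability over 1 year = {R_one_year:.4f}")

Note how, as in the examples above, a very high availability can coexist with a reliability noticeably below 1, because availability says nothing about how the downtime is distributed across outages.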
TABLE I
MTBF OF OUTAGES EXAMPLES

TABLE II
OVERVIEW OF TIER CLASSIFICATION REQUIREMENTS

For two blocks in series with failure rates of λ1 and λ2, the reliability as a function of time R(t) is

R(t) = exp[−(λ1 + λ2)t].
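The series formula follows directly from the constant-failure-rate model: failure rates of series blocks add, so the series combination behaves like a single block with λ = λ1 + λ2 and MTBF = 1/(λ1 + λ2). A minimal sketch with hypothetical failure rates:

import math

def series_reliability(failure_rates_per_hour, t_hours):
    # Blocks in series: the system fails if any block fails, so the rates add.
    return math.exp(-sum(failure_rates_per_hour) * t_hours)

# Hypothetical failure rates for two blocks in series (failures per hour).
lam1 = 1.0 / 50_000
lam2 = 1.0 / 80_000

t = 8760.0  # one year
print(f"Series MTBF    = {1.0 / (lam1 + lam2):,.0f} h")
print(f"Series R(1 yr) = {series_reliability([lam1, lam2], t):.4f}")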
fails, power is lost to the critical IT loads when the batteries run out of power. If the critical output distribution (COD) or one of the power distribution units (PDUs) fails, power is lost immediately to the critical IT loads.
Fig. 3 shows the addition of a redundant UPS module and a redundant generator. All of the single points of failure (SPOFs) that exist in Fig. 2 also exist in Fig. 3, except for the UPS module. There is also a redundant generator for when the utility power has failed and the system is operating on generator power.

Fig. 3. Tier 2—N + 1.
Fig. 4 shows the addition of a second path. In this design, there are two automatic transfer switches (ATSs) and two sets of CODs and PDUs. The side with the UPS modules is the "active" source, as it is normally in service. The second ATS provides the "passive" source, which can now be used to perform maintenance on the active source. The system in Fig. 4 is "concurrently maintainable," using the passive source. As we will see later in the paper, the reliability has not been greatly improved, since the passive side requires manual switching.
Fig. 5 shows two complete active paths providing power to the critical IT loads. In this design, all of the SPOFs have been eliminated from the electrical distribution system.
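To illustrate why eliminating SPOFs matters for the overall numbers, the sketch below uses purely hypothetical failure rates (not the values behind Table III) to model a single path as blocks in series and the 2N design of Fig. 5 as two independent paths in parallel. A real model would also capture transfer-switch behavior, maintenance, and common-cause failures, so this is only the block-diagram arithmetic.

import math

T = 8760.0  # mission time: one year, in hours

# Hypothetical failure rates (failures per hour) for the blocks in one path.
path_blocks = {
    "utility/ATS": 1.0 / 20_000,
    "UPS":         1.0 / 60_000,
    "COD":         1.0 / 500_000,
    "PDU":         1.0 / 300_000,
}

def path_reliability(blocks, t):
    # Series path: a failure of any block takes the whole path down.
    return math.exp(-sum(blocks.values()) * t)

r_single = path_reliability(path_blocks, T)

# 2N design: two independent, identical paths; the load is lost only if both fail.
r_2n = 1.0 - (1.0 - r_single) ** 2

print(f"Single path R(1 yr): {r_single:.4f}")
print(f"2N paths    R(1 yr): {r_2n:.4f}")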
As you can see by comparing the reliability calculations of
the four systems in Table III, the reliability is not very good
until Tier 4 is reached. That can give the mistaken impression
that in order to have a reliable system, it must be Tier 4. That
is not necessarily the case. Shown below in Fig. 6 is another
example of a Tier 3 design. This one has static transfer switches
(STSs) to switch the power from the active source to the passive
source on failure of the active source.
STSs are designed to transfer from one source to another so quickly that the IT equipment is not affected by the transfer. For the example in Fig. 6, should the active source have a failure, the STS transfers the critical IT load to the alternate (passive) source.

Since the utility power is available the vast majority of the time and the active source does not fail very often, the reliability is greatly improved with this configuration over the single path from an N + 1 UPS system.
A much more subtle issue with our Tier 3 system in Fig. 6 shows up in the availability number. As shown in Table IV, the availability is quite high, better than the 2N system of Tier 4. When we investigate this further, we find the answer in the MTTR. Using the formula from Section I, we find the MTTR for the Tier 3—STS example to be 0.47 h and for the Tier 4 example to be 3.2 h. For the Tier 3—STS design, the critical IT load is directly on utility power if the UPS system fails. Therefore, any voltage sag will be seen directly by the critical IT load. From the MTTR, we know that failures caused by voltage sags while the IT load is on the alternate source are a significant percentage of the failures. If we were to add STSs to the Tier 4 2N design and do the calculations, we would get higher reliability than the numbers in Table IV.

In this paper, we have looked at a few simple systems for what would be a small data center. A large enterprise (owned and operated by a single company for its own use) data center often has multiple UPS systems to carry megawatts of critical IT loads. The large colocation facilities also can be quite large. There are many other designs for critical electrical distribution systems that have advantages over some of the systems shown.
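The effect of MTTR on availability can be reproduced with the Section I formula. In the sketch below, the MTTR values (0.47 h and 3.2 h) come from the text above, while the MTBF values are assumed purely for illustration and are not the Table IV numbers.

HOURS_PER_YEAR = 8760.0

def availability(mtbf_h, mttr_h):
    return mtbf_h / (mtbf_h + mttr_h)

# (assumed MTBF in hours, MTTR in hours from the text)
systems = {
    "Tier 3 - STS": (60_000.0, 0.47),   # more frequent events, very fast restoration
    "Tier 4 - 2N":  (250_000.0, 3.2),   # rarer events, longer repairs
}

for name, (mtbf, mttr) in systems.items():
    a = availability(mtbf, mttr)
    downtime_min = (1.0 - a) * HOURS_PER_YEAR * 60.0
    print(f"{name}: A = {a:.7f}  ({downtime_min:.1f} min/yr expected downtime)")

Even with the lower assumed MTBF, the much shorter repair time gives the Tier 3 STS configuration the higher availability, which is the subtlety discussed above.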
TABLE III
RELIABILITY CALCULATIONS FOR TIER DRAWINGS
What can occur is that the complexity of the system starts working against the increase in redundancy, and what is gained on one hand is lost on the other. Reliability modeling can be a very powerful tool to assist in finding the point of diminishing returns and maximizing the investment in reliability.

Finally, the concept of reliability needs to be brought into the more practical context of financial risk. Risk is a function of both the severity of the failure (typically measured in terms of financial losses caused by the failure) and the probability of a failure occurring during a certain period of time

Risk ($/year) = Failure rate (failures/year) × Severity ($/failure)

R = λS.
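As a worked example of the risk formula (with made-up numbers purely for illustration):

def annual_risk(failures_per_year, dollars_per_failure):
    # Risk ($/year) = failure rate (failures/year) x severity ($/failure)
    return failures_per_year * dollars_per_failure

# Hypothetical: one outage every 10 years, $500,000 of losses per outage.
print(f"Expected risk: ${annual_risk(0.1, 500_000):,.0f} per year")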
V. CONCLUSION

From the discussion above, it is obvious that the Tier classifications provide guidelines and a gradient scale of data center designs. It is a very useful tool that can be used in conjunction with reliability engineering to design or evaluate an existing critical facility.

Fully specifying "reliability" requires five major metrics: MTBF, MTTR, availability, reliability, and time. These metrics are significantly impacted by what the definition of "failure" is for the system to be modeled. They are also significantly impacted by the size of the facility and the number of "critical loads" used in the model.

Reliability modeling is a very effective tool when used for comparison between similar systems. Understanding of the basic concepts involved is necessary to correctly model the systems and utilize the data provided to reach the proper conclusions.