
IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 48, NO. 2, MARCH/APRIL 2012

Reliability of Data Centers by Tier Classification

Robert Arno, Senior Member, IEEE, Addam Friedl, Peter Gross, Senior Member, IEEE, and Robert J. Schuerger, Member, IEEE

Abstract—When the concept of reliability began to formally become an integrated engineering approach in the 1950s, reliability was associated with failure rate. Today, the term "reliability" is used as an umbrella definition covering a variety of subjects, including availability, durability, quality, and sometimes the function of the product. Reliability engineering was developed to quantify "how reliable" a component, product, or system was when used in a specific application for a specific period of time. The data center industry has come to rely on "tier classifications," as presented in a number of papers by the Uptime Institute, as a gradient scale of data center configurations and requirements from least (Tier 1) to most reliable (Tier 4). This paper will apply the principles and modeling techniques of reliability engineering to specific examples of each of the tier classifications and discuss the results. A review of the metrics of reliability engineering being used will also be included.

Index Terms—Availability, component, failure rate, mean time between failures (MTBF), mean time to repair (MTTR), reliability.

Manuscript received February 15, 2010; accepted May 17, 2010. Date of publication December 21, 2011; date of current version March 21, 2012. Paper 2010-PSEC-021, presented at the 2010 IEEE/IAS Industrial and Commercial Power Systems Technical Conference, Tallahassee, FL, May 9–13, and approved for publication in the IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS by the Power System Engineering Committee of the IEEE Industry Applications Society. The authors are with HP Critical Facility Services, Frankfort, NY 13340 USA (e-mail: rarno@hp.com; afriedl@hp.com; pgross@hp.com; bschuerger@hp.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIA.2011.2180872

I. RELIABILITY TERMINOLOGY AND METRICS

In this section, reliability terminology and metrics are introduced. Availability (A) is the long-term average fraction of time that a repairable component or system is in service and satisfactorily performing its intended function. For example, if the electricity is off for 1 h in a year, but the rest of the year the electricity is on, the availability of electrical power for that year is 8759 h divided by 8760 h, which is 0.999886.

An availability of 0.99999 could mean that the system was down for 5.3 min (or 315 s) per year. It would make no difference in the availability calculation if there was one 5.3-min outage or 315 one-second outages. It could also be one outage of 1.77 h in 20 years. In all three cases, the availability is 0.99999.

There are two common measures of availability: inherent availability and operational availability. The difference between the two is based on what is included as "repair time." For inherent availability, only the time it takes to fix the equipment is included. Inherent availability assumes that the technician is immediately available to work on the equipment the moment it fails and that he has all the parts, etc., necessary to complete the repair. For operational availability, all the delays for scheduling, travel time, parts, etc., are included. If it takes 24 h to fly a part in to repair the equipment, that adds to the "repair time."

Inherent availability and operational availability show different aspects of the system being analyzed. Operational availability would be the "real world": how the system really operates. There are usually delays between the time a piece of equipment fails and when the repair begins. Spare parts inventories are also very significant and directly impact operational availability. Therefore, when determining spare parts inventories, on-site personnel and their level of training, etc., operational availability is a useful tool.

Inherent availability is a more useful tool in analyzing the system design. Since there are wide variations in maintenance practices from facility to facility, operational availability could vary significantly between two facilities with identical infrastructures. Eliminating all of the logistics involved with getting the parts and a trained individual to the piece of equipment, and counting only the actual repair time, provides a more accurate evaluation of the infrastructure design. It shows the availability that is "inherent" to the design, if the spare parts inventory and repair are perfect. In this paper, all of the values and discussion concerning availability will be for inherent availability.

The failure rate (λ) is defined as the rate at which failures occur per unit time in an interval, given that no failure has occurred prior to the beginning of the interval.

Mean time between failures (MTBF), as its name implies, is the average time the equipment performs its intended function between failures. A very common distribution function is the "normal" distribution. For the case of a constant failure rate,

MTBF = 1/λ.

Electronic equipment, along with many other types of equipment, has a relatively constant failure rate over much of its useful life and follows an exponential statistical distribution. The common assumption for reliability analysis is that all the equipment in the system to be analyzed falls within this statistical distribution, where the failures are random and the failure rate is constant. All of the calculations shown below assume a constant failure rate for the equipment.

Mean time to repair (MTTR) is the average time it takes to repair the failure and get the equipment back into service.

Inherent availability is mathematically defined as the MTBF divided by the MTBF plus the MTTR:

A = MTBF/(MTBF + MTTR).
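To make the availability arithmetic above concrete, the short Python sketch below reproduces the 8759/8760 example, the three outage patterns that all yield an availability of 0.99999, and the A = MTBF/(MTBF + MTTR) relation. It is an illustration only, not from the paper; the failure rate and repair time in the last example are assumed values.

```python
# Illustrative sketch (not from the paper): the availability arithmetic above.

HOURS_PER_YEAR = 8760.0

def availability_from_downtime(downtime_h: float, period_h: float = HOURS_PER_YEAR) -> float:
    """Long-term fraction of the period the system was in service."""
    return (period_h - downtime_h) / period_h

def inherent_availability(mtbf_h: float, mttr_h: float) -> float:
    """A = MTBF / (MTBF + MTTR), assuming a constant failure rate."""
    return mtbf_h / (mtbf_h + mttr_h)

if __name__ == "__main__":
    # The paper's example: 1 h of downtime in a year -> 8759/8760.
    print(round(availability_from_downtime(1.0), 6))                       # 0.999886

    # Three outage patterns that all give an availability of 0.99999:
    print(round(availability_from_downtime(5.3 / 60.0), 5))                # one 5.3-min outage per year
    print(round(availability_from_downtime(315.0 / 3600.0), 5))            # 315 one-second outages per year
    print(round(availability_from_downtime(1.77, 20 * HOURS_PER_YEAR), 5)) # one 1.77-h outage in 20 years

    # Hypothetical component (assumed values, not from the paper):
    lam = 1.0 / (10.0 * HOURS_PER_YEAR)  # assumed: one failure per 10 years
    mtbf = 1.0 / lam                     # MTBF = 1/lambda for a constant failure rate
    print(round(inherent_availability(mtbf, mttr_h=4.0), 6))               # assumed 4-h repair time
```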

Reliability (R) is the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure. Reliability is time dependent. The longer the time, the lower the reliability, regardless of what the system design is. The better the system design, the higher the probability of successful operation for a longer period of time.

For a constant failure rate λ, reliability as a function of time R(t) is

R(t) = e^(−λt).

From the equations shown above, we see that there are five important factors that define the "reliability" of a system: MTBF, MTTR, availability, reliability, and time. It can also be seen how these five factors are interrelated. What is not as obvious is that "availability" is time independent, since it is the combination of two terms that are themselves averages over long periods of time (MTBF and MTTR). Reliability, as we can see from the equation above, is very "time dependent."

TABLE I: MTBF of Outages Examples

Reliability is the "probability of success" for a given period of time. Reliability is a metric directly related to how often (or how fast) the system fails. As shown in Table I, the system that failed once in a year for 5.3 min would have a much better reliability than the system that failed 315 times for 1 s, but nowhere near as good as the system that failed once in 20 years for 1.77 h, even though all have the same availability.

The reliability has dropped to 36.8% when the MTBF of the system is reached (see the MTBF of 1 year). Therefore, the system that fails 315 times a year has a reliability of 36.8% a little over a day after you start it, while the system that fails one time takes a year to reach this same level of reliability. The last one takes 20 years for the reliability to drop to 36.8%!

The discussion above demonstrates the importance of using both reliability and availability as metrics to determine how dependable the component or system is.

II. RELIABILITY BLOCK DIAGRAMS

There are several common methodologies to perform reliability calculations. The one used for this paper is reliability block diagrams (RBDs), which are a graphical representation of the components of the system and how they are connected. For electrical systems, the one-line diagram is used, and each major component, such as a switchboard, generator, uninterruptible power system (UPS) module, transformer, etc., is represented as a block on the diagram. The failure and repair rates for each component are entered in the block that represents it in the RBD. The blocks are connected in the same manner as the flow of electrical power, including parallel paths where they exist. Calculations are then performed to determine the reliability, availability, and MTBF for the system modeled in the RBD.

For two blocks in series with failure rates of λ1 and λ2, the reliability as a function of time R(t) is

R(t) = R1 × R2 = e^(−(λ1 + λ2)t).

For two blocks in parallel with redundancy, where 1 out of 2 is necessary for successful operation, the reliability as a function of time R(t) is

R(t) = R1 + R2 − (R1 × R2) = e^(−λ1·t) + e^(−λ2·t) − e^(−(λ1 + λ2)t).

Most critical facilities will consist of many blocks combined in both series and parallel. If the components of the system are repairable, this further complicates the matter. For complex systems with multiple interconnections, where some of the components are neither in series nor in parallel but in a standby mode (such as a generator plant that is only active during a utility failure), direct analytical calculations are impractical. The reliability is calculated using a computer program that performs random simulations, called a Monte Carlo simulation.

When performing a Monte Carlo simulation, a random series of simulations is performed on the RBD. These simulations are test runs through the system (from the start node through the end node) to determine whether the system completes its task or fails. During each iteration or test, the software uses the properties of each block to decide whether that block is operating or not, and therefore determines whether the system is operating.
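Before moving on to the tier examples, the following sketch ties the formulas above together: the exponential reliability curve behind Table I, the series and parallel (1-out-of-2) block combinations, and a toy Monte Carlo check of the parallel case. It is a minimal, non-repairable illustration, far simpler than the repairable-system simulations the paper describes, and the 5-year MTBF used for the redundant blocks is an assumed value.

```python
# Illustrative sketch (not the paper's RBD tooling): exponential reliability,
# series/parallel block combination, and a toy Monte Carlo run on a tiny RBD.
import math
import random

HOURS_PER_YEAR = 8760.0

def reliability(lam: float, t: float) -> float:
    """R(t) = e^(-lambda*t) for a constant failure rate lambda."""
    return math.exp(-lam * t)

def series(r1: float, r2: float) -> float:
    """Two blocks in series: both must operate, so R = R1 * R2."""
    return r1 * r2

def parallel(r1: float, r2: float) -> float:
    """Two redundant blocks (1-out-of-2): R = R1 + R2 - R1*R2."""
    return r1 + r2 - r1 * r2

# Table I idea: reliability at t = 1 year for three systems with the same
# availability but very different MTBF (about 1.2 days, 1 year, 20 years).
for mtbf_h in (HOURS_PER_YEAR / 315.0, HOURS_PER_YEAR, 20.0 * HOURS_PER_YEAR):
    print(round(reliability(1.0 / mtbf_h, HOURS_PER_YEAR), 4))
# The MTBF = 1 year case gives e^-1, i.e. the 36.8% mentioned in the text.

def monte_carlo_parallel(lams, t, trials=100_000, seed=1):
    """Toy Monte Carlo on a 1-out-of-2 RBD: draw exponential lifetimes for the
    two parallel blocks and count the runs in which at least one survives to t.
    (Non-repairable, so much simpler than the simulations the paper refers to.)"""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        lifetimes = [rng.expovariate(lam) for lam in lams]
        if max(lifetimes) >= t:
            successes += 1
    return successes / trials

lam = 1.0 / (5.0 * HOURS_PER_YEAR)  # assumed: one failure per 5 years per block
r_block = reliability(lam, HOURS_PER_YEAR)
print(round(series(r_block, r_block), 4),    # both blocks needed: lower than one block
      round(parallel(r_block, r_block), 4),  # 1-out-of-2 redundancy: higher
      round(monte_carlo_parallel([lam, lam], HOURS_PER_YEAR), 4))  # should match the analytic parallel value
```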

III. IMPORTANT FACTORS IN RELIABILITY ANALYSIS

The aim of this paper is to provide a broader understanding of reliability engineering and how it can be successfully used as a tool for designing better systems. We will also point out a few of the pitfalls to be avoided along the way.

The first pitfall is a subtle one: What is the definition of failure? On a superficial level, the answer may seem obvious. If the UPS system fails and all the critical loads for the data center lose power, that would obviously be a "failure." What about one 20-A circuit breaker tripping and one rack of equipment losing power? Is that a "failure" for the data center?

Therefore, the first step of any reliability analysis is to define what "failure" is for the analysis. This is the single most important issue to come to agreement on and probably the most difficult. The reason for doing the analysis in the first place should drive this definition. For this paper, we are going to use the definition of failure recommended in Chapter 8 of the IEEE Gold Book, Standard 493-2007, IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems. Section 8.3.3 states, "The loss of power to a power distribution unit (PDU) (or UPS distribution panel) is the recommended definition of failure for most types of system calculations. In most data centers, the loss of an entire PDU would impact the overall mission of the facility." In the systems used for our examples, this is modeled by keeping a single load for each PDU for N systems and a single load per two PDUs for 2N systems.

Another important concept in reliability analysis is "single points of failure" (SPOF). SPOFs are all the places in the one-line, from the utility entrance to the critical loads, in which one component failing causes the system to fail. When we find the system is not as reliable as desired, the first step in improving it is to eliminate the SPOFs. In the tier examples that follow, we will see this addressed at each level; each step up in the tier classification eliminates some of the previous level's SPOFs, until we reach Tier 4, which has eliminated all of the SPOFs.

IV. RELIABILITY AS A TOOL IN COMPARISON OF TIER CLASSIFICATIONS

The data center industry has come to rely on "tier classifications" as a gradient scale of data center configurations and requirements from least (Tier 1) to most reliable (Tier 4). The tier classifications provide some very useful guidelines to work from when trying to determine what is needed for a specific application. However, as we will see in some examples of simple and small systems, Tier 4 is not necessarily "best" in terms of overall client needs.

The following is a general overview of the tier classification system as presented in a number of papers by the Uptime Institute [1]. This overview is not intended to completely define or modify in any way the classifications developed by the Uptime Institute. It is presented here to provide some understanding of what the Tier classifications are for those not already familiar with this terminology.

By "N" we mean the number of generators, UPS modules, etc., needed to carry the load. If the load is 500 kW, one 500-kW UPS would be "N." "N + 1" means there is one redundant component. In the case above of a 500-kW load, "N + 1" would mean there were two 500-kW UPS modules: one to carry the load and one "redundant" UPS module.

"2N" means there are two complete systems, either one of which can carry the load. There is not only a second UPS module, there is a complete second UPS system, including the UPS input and output switchboards, automatic transfer switch (ATS), etc.

For the data center to be "concurrently maintainable," it must be designed so that all of the necessary maintenance can be performed while the critical IT load continues to operate. No maintenance outages are required that would remove power from the critical IT load for the data center to be properly maintained indefinitely.

For the data center to be "fault tolerant," it must be able to sustain a major failure, such as the loss of one whole UPS system, while the critical IT load continues to operate without interruption (Fig. 1).

Fig. 1. Dual corded IT equipment.

In all of the examples that follow, we are assuming that all of the critical IT loads have "dual cords." Dual corded IT equipment has two power supplies built into it, either of which can power the equipment. For our analysis, we will assume that 99% of the time, the dual cord IT equipment will continue to function when power is lost to one of the two cords. This assumption is based on experience with dual corded IT equipment in actual operation. When a large quantity of dual corded IT equipment loses power to one side, between 1% and 3% of the equipment fails to continue operating on one cord.

Fig. 2. Tier 1—N.

Fig. 2 shows a design that would most likely be considered Tier 1. There is a single UPS module supplying power to the critical IT loads. In this example, we have included a generator and ATS, which Table II shows as "optional."

For this system, loss of utility power or of one of the remote power panels are about the only failures that would not impact the critical IT load. (The UPS module could fail and the static bypass switch could carry the critical IT load, provided utility power was available.) If the ATS or UPS input switchboard

fails, power is lost to the critical IT loads when the batteries run out of power. If the critical output distribution (COD) or one of the PDUs fails, power is lost immediately to the critical IT loads.

TABLE II: Overview of Tier Classification Requirements

Fig. 3. Tier 2—N + 1.

Fig. 3 shows the addition of a redundant UPS module and a redundant generator. All of the SPOFs that exist in Fig. 2 also exist in Fig. 3, except for the UPS module. There is also a redundant generator for when the utility power has failed and the system is operating on generator power.

Fig. 4 shows the addition of a second path. In this design, there are two ATSs and two sets of CODs and PDUs. The side with the UPS modules is the "active" source, as it is normally in service. The second ATS provides the "passive" source, which can now be used to perform maintenance on the active source. The system in Fig. 4 is "concurrently maintainable," using the passive source. As we will see later in the paper, the reliability has not been greatly improved, since the passive side requires manual switching.

Fig. 5 shows two complete active paths providing power to the critical IT loads. In this design, all of the SPOFs have been eliminated from the electrical distribution system.

As you can see by comparing the reliability calculations of the four systems in Table III, the reliability is not very good until Tier 4 is reached. That can give the mistaken impression that, in order to have a reliable system, it must be Tier 4. That is not necessarily the case. Shown below in Fig. 6 is another example of a Tier 3 design. This one has static transfer switches (STSs) to switch the power from the active source to the passive source on failure of the active source.

STSs are designed to transfer from one source to another so quickly that the IT equipment is not affected by the transfer. For the example in Fig. 6, should the active source have a failure, the STS transfers the critical IT load to the alternate (passive) source. Since the utility power is available the vast majority of the time and the active source does not fail very often, the reliability is greatly improved with this configuration over the single path from a UPS system that is N + 1.

A much more subtle issue of our Tier 3 system in Fig. 6 shows up in the availability number. As shown in Table IV, the availability is quite high, better than the 2N system of Tier 4. When we investigate this further, we find the answer in the MTTR. Using the formula from Section I, we find the MTTR for the Tier 3—STS example to be 0.47 h and for the Tier 4 example to be 3.2 h. For the Tier 3—STS design, the critical IT load is directly on utility power if the UPS system fails. Therefore, any voltage sag will be seen directly by the critical IT load. From the MTTR, we know that failures caused by voltage sags while the IT load is on the alternate source are a significant percentage of the failures. If we were to add STSs to the Tier 4 2N design and do the calculations, we would get higher reliability than the numbers in Table IV.
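The MTTR values above (0.47 h for the Tier 3—STS design and 3.2 h for the Tier 4 design) come from the text; the MTBF figures in the sketch below are made-up placeholders, used only to illustrate how a shorter repair time can push the inherent availability higher even when the MTBF is lower.

```python
# Illustrative sketch: how a shorter MTTR can give the Tier 3-STS design a
# higher availability than the Tier 4 design, even with a lower MTBF.
# The MTTR values (0.47 h and 3.2 h) are quoted in the text; the MTBF values
# below are hypothetical placeholders, not the paper's calculated results.

HOURS_PER_YEAR = 8760.0

def inherent_availability(mtbf_h: float, mttr_h: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

designs = {
    "Tier 3 with STS": {"mtbf_h": 200_000.0, "mttr_h": 0.47},  # MTBF assumed
    "Tier 4 (2N)":     {"mtbf_h": 500_000.0, "mttr_h": 3.2},   # MTBF assumed
}

for name, d in designs.items():
    a = inherent_availability(d["mtbf_h"], d["mttr_h"])
    downtime_min = (1.0 - a) * HOURS_PER_YEAR * 60.0
    print(f"{name}: A = {a:.7f}, about {downtime_min:.1f} min/year of downtime")
```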

In this paper, we have looked at a few simple systems for what would be a small data center. A large enterprise data center (owned and operated by a single company for its own use) often has multiple UPS systems to carry megawatts of critical IT loads. The large colocation facilities also can be quite large. There are many other designs for critical electrical distribution systems that have advantages over some of the systems shown here. Reliability modeling can be a very valuable tool in evaluating all of the various design options and determining the point of diminishing returns for a specific application.

Fig. 4. Tier 3—N + 1, 1 active and 1 passive.

Fig. 5. Tier 4—2N, 2 active paths.

TABLE III: Reliability Calculations for Tier Drawings

Fig. 7 shows a curve for availability versus cost. Though this is a hypothetical curve, it conveys an important concept quite well. Once "five 9's" (an availability of 0.99999, which would mean the facility is operational 99.999% of the time) is reached, adding more equipment may not provide higher availability.

Fig. 7. Availability versus cost.

What can occur is that the complexity of the system starts working against the increase in redundancy, and what is gained on one hand is lost on the other. Reliability modeling can be a very powerful tool to assist in finding the point of diminishing returns and maximizing the investment in reliability.

Fig. 6. Tier 3—N + 1, 1 active and 1 passive with STS.

TABLE IV: Reliability Calculations for Tier 3 with STS

Finally, the concept of reliability needs to be brought into the more practical context of financial risk. Risk is a function of both the severity of the failure (typically measured in terms of the financial losses caused by the failure) and the probability of a failure occurring during a certain period of time:

Risk ($/year) = Failure rate (failures/year) × Severity ($/failure)

or, more compactly, R = λS.
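As a small illustration of the risk expression above, the snippet below multiplies an assumed failure rate by an assumed severity; both numbers are hypothetical and are not taken from the paper.

```python
# Illustrative only: the Risk = failure rate x severity relation with
# hypothetical numbers (neither value comes from the paper).

def annual_risk(failure_rate_per_year: float, severity_dollars_per_failure: float) -> float:
    """Risk ($/year) = Failure rate (failures/year) x Severity ($/failure)."""
    return failure_rate_per_year * severity_dollars_per_failure

if __name__ == "__main__":
    lam = 0.1             # assumed: one failure every 10 years
    severity = 500_000.0  # assumed: $500k of losses per failure
    print(f"Risk = ${annual_risk(lam, severity):,.0f} per year")  # Risk = $50,000 per year
```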

It is indeed difficult to calculate the severity component, since it is not constant but a function of many variables. Estimating severity simply as an impact on productivity is typically insufficient, since it does not incorporate intangibles such as loss of customer confidence or, in some cases, even loss of vital data.

In the past, there has been a tendency to go from one extreme to the other. As IT equipment became more and more intrinsic to the production of the company, the systems and complexity would expand quite rapidly, often without much long-term planning. Then, the inevitable failure would occur, causing massive disruption to the business, and the management would go into a "never again" mentality. While in this reactionary state, decisions would be made that in many cases would put in motion building "IT fortresses" for the business. In some cases, such as the enterprise data center of a major financial institution, an "IT fortress" may be the correct solution; the financial risk is quite high. For many other businesses, the correct solution is somewhat less than the fortress, and reliability engineering can be a very powerful tool to find the proper solution.

V. CONCLUSION

From the discussion above, it is obvious that the Tier classifications provide guidelines and a gradient scale of data center designs. They are a very useful tool that can be used in conjunction with reliability engineering to design or evaluate an existing critical facility.

Fully specifying "reliability" requires five major metrics: MTBF, MTTR, availability, reliability, and time. These metrics are significantly impacted by what the definition of "failure" is for the system to be modeled. They are also significantly impacted by the size of the facility and the number of "critical loads" used in the model.

Reliability modeling is a very effective tool when used for comparison between similar systems. Understanding of the basic concepts involved is necessary to correctly model the systems and utilize the data provided to reach the proper conclusions.

REFERENCES

[1] W. P. Turner IV and K. G. Brill, "Industry standard tier classifications define site infrastructure performance," The Uptime Institute, New York, NY, 2001.
[2] R. Arno, P. Gross, and R. Schuerger, "What five 9's really mean and managing expectations," in Conf. Rec. IEEE IAS Annu. Meeting, 2006, pp. 270–275.
[3] IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, IEEE Standard 493-2007.

Robert Arno (A'04–M'08–SM'09) received the B.S. degree in electrical engineering from the State University of New York at Utica/Rome in 1982. He has worked in the reliability field for 32 years and currently is employed by HP Critical Facility Services (HP CFS), Albany, NY. He currently leads the Intelligence Group at HP CFS, servicing the needs of government critical facilities. His principal responsibilities include leading and directing programs and bringing technology to the development of better facilities.

Addam Friedl received the B.S. degree in electrical engineering from the University of South Florida, Tampa, in 1998. He is a critical facilities and power reliability specialist with 19 years of experience. In addition to designing new data centers and commercial buildings, his experience encompasses providing comprehensive risk/reliability assessments, identifying single points of failure and their impact on operations, and upgrading the electrical infrastructure of existing facilities. His expertise includes electrical utility feeders, generators, UPS systems and batteries, computer room power distribution and EPO systems, fire alarm and leak detection systems, computer room grounding systems, power distribution grounding systems, and building monitoring and control systems. He has worked with numerous clients in the technology, broadcasting, financial, and corporate sectors. Mr. Friedl is a Registered Professional Engineer in multiple states.

Peter Gross (M'01–SM'02) graduated from the Polytechnic Institute of Bucharest, Romania, with a degree in electrical engineering, and received the Master's degree in business administration from California State University. He leads strategic technology planning and business development as Managing Partner for HP Critical Facilities Services. He has played a pivotal role in the rapid growth of EYP MCF's business since its founding in 1997, leading to its acquisition by HP in 2007. He co-founded the Critical Power Coalition, an organization focused on public policy for improving the reliability and the quality of electric power in the public and private sectors. Mr. Gross is a Registered Professional Engineer in the States of California, Arizona, and New York.

Robert J. Schuerger (S'88–M'02) received the B.S.E.E. degree from the University of Akron, Akron, OH. He has over 35 years of experience in power engineering, specializing in electrical testing and maintenance, power quality, and the design, commissioning, and reliability analysis of mission critical facilities. He has done start-up, commissioning, and maintenance for nuclear and fossil fuel generation plants, transmission and distribution substations, and power distribution systems for a variety of industrial facilities and critical facilities such as data centers. Currently, he is Principal Reliability Analysis Corporate Lead at HP Critical Facilities Services, El Segundo, CA. Mr. Schuerger is the current Chair of the LA Chapter of the IAS. He was Chapter 4 Chairman for the IEEE Emerald Book, Standard 1100-2005, Recommended Practice for Powering and Grounding Electronic Equipment. He also was Chapter 8 (7X24 Continuous Facilities) Chairman for the IEEE Gold Book, Standard 493-2007, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems. He was a member of the Working Group and wrote Chapter 6 of the IEEE Yellow Book, Standard 902-1998, Guide for Maintenance, Operation, and Safety of Industrial and Commercial Power Systems. He is currently a Project Chair for several Working Groups that are revising these standards and a member of the Technical Book Coordinating Committee in charge of the process. He is a Registered Professional Engineer in several states. He is a member of the Grounding Subcommittee and was part of the ballot committee for the latest revision of the Green Book. He was the Chairman for IEEE Std. 3007.2-2010, Recommended Practice for Maintenance of Industrial and Commercial Power Systems.
