Failure Analysis of FMEA
Zigmund Bluvband, ALD Ltd.
Pavel Grabov, ALD Ltd.
Key Words: CAPA, Effectiveness evaluation, Failure Mode and Effect Analysis, Ranking procedure.
1-4244-2509-9/09/$20.00 ©2009 IEEE

SUMMARY & CONCLUSIONS

Failure Mode and Effect Analysis (FMEA) is a proactive tool developed to identify, evaluate and prevent product and/or process failures. The conventional FMEA procedure suffers from inadequate definitions for some steps, high uncertainty, and even decision-making failures throughout the procedure.

The effectiveness of an FMEA can be significantly improved by identifying potential pitfalls and raising awareness of potential problems. Applying a strategy that utilizes controls and rules can efficiently mitigate, or even avoid, all known possible harmful effects.

This article proposes proven solutions that support the entire end-to-end FMEA sequence of activities, from the initiation of the analysis (Failure Modes identification) up to its culmination (evaluation of the effectiveness of the procedure and of the proposed remedies in reducing risk).

1. INTRODUCTION

FMEA is a classic tool of what we refer to as "Disciplined Engineering": a systematic framework intended to reduce potential errors, prevent common mistakes, and improve the consistency of the engineering work.

The purpose of FMEA is to examine possible failure modes and determine the impact of these failures on the product (Design FMEA - DFMEA) and process (Process FMEA - PFMEA):
• DFMEA is used to analyze product designs before they are released to production. It focuses on potential failure modes associated with the functions of the product and caused by design deficiencies;
• PFMEA is used to analyze new or existing processes. It focuses on the potential failure modes associated with both the process safety/effectiveness/efficiency, and problems with the functions of a product caused by problems in the process.

Traditionally, in order to assess risk, the FMEA team ranks the Severity (S) of the failure, the probability of its Occurrence (O) and the probability of detecting the failure mode or its cause, i.e. Detectability (D). Risk is assessed via the RPN (Risk Priority Number), which is calculated by multiplying the ranking values of Severity, Occurrence and Detectability, yielding one categorization number for each possible cause of each failure:

$$ RPN = S \cdot O \cdot D \qquad (1) $$

Once all items under consideration have been analyzed and the estimated RPN values assigned, corrective actions can be planned for the RPN values in descending order. The ultimate goal of a corrective action is to achieve an appropriate reduction in the severity, occurrence and/or detection rankings in order to obtain "acceptable" RPNs.

2. POSSIBLE FAILURES OF CONVENTIONAL FMEA STEPS AND PROPOSED REMEDIES

2.1. Step 1: Failure Modes Identification

Pitfall: Missing Failure Modes. The first step of the FMEA ‘Step by Step’ procedure is compiling a list of system functions or system equipment items and identifying the failure modes of each item.

One of the main problems besetting the FMEA process is the omission of Failure Modes because the brainstorming session is not sufficiently comprehensive.

One of the causes of this problem is inherent in the well-known classical definition of failure: "The inability of an item, product or service to perform required functions on demand due to one or more defects" [1]. We are of the opinion that this definition is too narrow and, therefore, does not cover all possible aspects of failure analysis.

Remedy. This paper proposes a checklist of 10 types of Failure Modes that can be utilized by the FMEA team as a basis for defining the customized list of failures associated with any given activity or item. This checklist is based on the Key Question "What Can Go Wrong?":
1. The intended function (mission) is not performed.
2. The intended function (mission) is performed, but there is some safety problem or a problem in meeting a regulation (for example, ecological) associated with the intended function (mission) performance.
3. The intended function (mission) is performed, but at a wrong time (availability problems).
4. The intended function (mission) is performed, but at a wrong place.
5. The intended function (mission) is performed, but in a wrong way (efficiency problems).
6. The intended function (mission) is performed, but the performance level is lower than planned.
7. The intended function (mission) is performed, but its cost is higher than planned (unscheduled maintenance or repair, higher consumption of required resources, etc.).
8. An unintended (unplanned) and/or undesirable function (mission) is performed.
9. The period of intended function (mission) performance (lifetime) is shorter than planned (reliability problem).
10. Support for intended function (mission) performance is impossible or problematic (maintenance, repairability, serviceability problems).

2.2. Step 2: Ranking Procedure

Pitfall: Use of Irrelevant Statistics. After the failure effects have been identified, Severity (S), Occurrence (O) and Detectability (D) should be evaluated. One possible method is the conventional ranking procedure, which ranks these risk components on the ‘1’ (Best Case) to ‘10’ (Worst Case) ordinal scale that appears on standard FMEA forms [2].

A comprehensive FMEA team discussion on a specific item can result in a wide spread of ranks, raising the question of how to resolve this situation. Drop outliers? Calculate the average rank? Take the highest rank (Worst Case Approach)?

Conventional FMEA does not provide any guidelines for this eventuality. Typically, such problems are resolved by applying the arithmetic mean value. In some cases more sophisticated specialists calculate the standard deviation of the proposed values and then, using a Normal distribution approximation, apply all kinds of statistical sensitivity analyses. This is a mistake!

The RPN components are evaluated on the Ordinal Scale, which calls for so-called Non-Parametric Statistics. Such measures as mean, standard deviation, etc. are absolutely irrelevant to the Ordinal Scale because the distance between ranks is meaningless.

Remedy. The following is a short list of proven guidelines that could be useful for FMEA teams:
• Team members could decide not to participate in the ranking of a given item or given component due to lack of relevant knowledge or experience.
• A wide rank spread indicates some problem (usually due to the heterogeneity of the team). Nonetheless, we always try to obtain consensus. On the other hand, zero difference of ranks could indicate total indifference by team members towards the item under discussion.
• Outliers should be considered. They may represent true estimates proposed by “process experts”, or they may be the result of some misunderstanding or irrelevant experience.
• Either the Median or Mode (certainly not the Mean!) should be used as the team’s rank estimate.

Remark. Actually, even the RPN calculation obtained by multiplying the Ordinal Scale values (1) is a kind of pitfall, which is, unfortunately, regulated by the Automotive Industry Action Group (AIAG) Standards [2]! As a result, this case needs to be dealt with as well. The situation can be improved by using some alternative scales, considering RPN as an illustration of the Pareto Priority Index PPI [3]. For example, one could use ‘Rational Scales’ for RPN components evaluation, such as Failure Rate for Occurrence, probability of misdetection for Detectability and the Failure Cost for Severity.

2.3. Step 3: Total Risk Estimate

Pitfall: Undefined Risk Acceptance Criteria. Once all items have been analyzed and evaluated by an RPN value, it is common to plan corrective actions for the failure modes/causes from the highest RPN value down.

While the goal of any corrective action is the reduction of the Severity, Occurrence and/or Detectability rankings, the question is whether corrective action is necessary. During the Risk Management process, based on the overall risk analysis results, important decisions are made to either modify or accept the tasks at hand. If a risk level does not exceed an acceptable risk level, set at the project start, the operation is permissible and no corrective action is required. Acceptance of a ‘Zero’ risk level as an ultimate requirement is foolish in any business area. Firstly, it is impossible to achieve, and secondly, even if it were (theoretically) possible, it would not be profitable.

The only method of achieving zero risk is to go out of business! But then you are taking another risk…

Unfortunately, the conventional FMEA procedure does not set any Risk Acceptance Criteria, nor does it require any evaluation of the general necessity of corrective actions. Furthermore, in some cases the teams tinker with the process/product when it is unnecessary.

Remedy. This paper proposes the use of calculated RPN values in order to derive the Total Risk Estimate (TRE) characterizing the overall risk level for each given project, where RPN_i is the RPN value for the i-th cause and n is the number of causes in the FMEA table:

$$ TRE = \frac{\sum_{i=1}^{n} RPN_i}{n \cdot 1000} \cdot 100\% \qquad (2) $$

One can see that the TRE values will always lie between 0.1% (all RPN values equal to 1) and 100% (all RPN values equal to 1000). The Risk Acceptability Criterion could be established as 17%, i.e. with risky projects assigned higher TRE values. The boundary value of 17% approximately corresponds to the multiplied Midpoint (5.5) values for three RPN components ranked on a ‘1’ to ‘10’ scale (5.5 · 5.5 · 5.5 ≈ 166, i.e. 16.6% of 1000).

This does not mean that no corrective action is required for TRE<17%. Obviously, extremely high RPN values should be dealt with. Nevertheless, calculated TRE values could be used for comparative analysis of different processes or operations in order to focus efforts on the most critical operation, or as an indicator of design maturity when deciding when to claim a design freeze and transfer a design to production.

2.4. Step 4: Critical Items Identification

Pitfall: Wrongly Defined Criteria for High Priority Items.
From the risk values point of view, the items covered by the FMEA procedure are usually very different. Obviously, the most significant items, characterized by high RPN, should be separated from those characterized by a significantly lower RPN value. Selected ‘High Priority’ items represent issues for corrective action plan development.

Some FMEA instructions recommend the acceptance of failures with RPN≤80 [2, 4], and therefore require corrective action for all failure causes with RPN>80. This rule tends to mislead the team, requiring a large number of corrective actions.

Another common practice resorted to by FMEA teams analyzing RPN values in Pareto fashion is to limit the list of recommended corrective actions to the ‘Top X Issues’. In such cases, the X-value chosen could be 3 or 5 or 10, etc. In other words, the ‘X’ selected will be an entirely arbitrary choice. Obviously, this kind of decision-making is very problematic.

Remedy. We recommend the usage of a very simple and quite effective graphical tool for RPN value analysis, the so-called Scree Plot used in principal component analysis [5]. A Scree Plot requires the preliminary ordering of the RPN values by size, from the smallest to the largest. These values are then plotted, by size, across the graph, and typically appear, when observed from the right, like a cliff descending to ground level (see Fig. 1).

The long lower part of the plot is characterized by a gradual increase of the RPN values that can, usually, be fitted by a straight line with a rather slight slope. The RPN values scattered around this line should be considered as a kind of ‘Information Noise’. The issues characterized by these RPN values do not require immediate attention.

The short uppermost part of the Scree Plot is characterized by a very steep increase of the RPN values (RPN jumps). It can be fitted by a straight line with a very steep slope. The RPN values scattered around this line are related to the most critical issues of the FMEA, which need to be dealt with promptly.

2.5. Step 5: Corrective Action & Preventive Action (CAPA)

Pitfall: Lack of Guidelines for the Optimal Choice. There are, usually, several possible competing corrective actions that, theoretically, are capable of reducing the RPN for any given failure mode. Since conventional FMEA does not provide any guidelines for selecting the optimal option among competing corrective actions, the FMEA team faces a difficult task.

Remedy. We propose a simple procedure that provides the basis for the optimal corrective action choice. This procedure evaluates both the feasibility of a corrective action implementation and the expected RPN value after implementing this action.

Similar to the conventional FMEA procedure, the feasibility rank (F) is estimated on a ‘1’ (Best Case) to ‘10’ (Worst Case) scale using the criteria proposed by the authors and presented in [5]. The final decision, i.e. the choice of the optimal corrective action, is based on a comparative analysis of the differences between the RPN values before and after the implementation of the given corrective actions, divided by the corresponding feasibility ranking factors:

$$ \frac{RPN_i^{Before} - RPN_i^{After}}{F_i} = \frac{\Delta RPN}{F_i} \qquad (3) $$

Where: RPN_i^{Before} and RPN_i^{After} are the RPN values for a given item before and after implementation of the i-th corrective action, ΔRPN is the difference between these values, and F_i is the feasibility rank of the i-th corrective action.

Obviously, the most preferable corrective action is the one characterized by the largest ratio.

2.6. Step 6: FMEA Effectiveness Evaluation

Pitfall: Lack of Guidelines for FMEA Effectiveness Evaluation. Since FMEA performance is a rather time-consuming activity, requiring the participation of highly experienced personnel (team members), its cost is rather high. Therefore the effectiveness of the procedure should be evaluated after completing the FMEA.

Remedy. We suggest calculating a normalized improvement estimate for this evaluation:

$$ \Delta RPN_{Rel} = \frac{\sum RPN_i^{Before} - \sum RPN_i^{After}}{\sum RPN_i^{Before}} \cdot 100\% \qquad (4) $$

Where: ∑RPN_i^{Before} and ∑RPN_i^{After} represent the sum of the RPN values before and after CAPA implementation, respectively.

In our experience, if the FMEA is performed correctly, the reduction in the risk level after FMEA completion is expected to be at least about 30%.

3. CASE STUDY

The proposed procedure was applied to the evaluation of a medical device (Design FMEA). After the identification of all failure effects and the application of root-cause analysis of failure modes in teamwork, 40 corresponding RPN values were calculated and used for TRE evaluation.

The calculated TRE value in the initial state was rather high, indicating serious risk-related problems:

$$ TRE_{Initial} = \frac{11022}{40 \cdot 1000} \cdot 100\% = 27.6\% \qquad (5) $$

The RPN values were sorted and plotted on a graph in ascending order (see Fig. 1). Eight critical issues appearing at the uppermost part of the Scree Plot were identified and reviewed.

In planning corrective actions, 1 to 3 alternatives for every issue were suggested by the FMEA team (see Table 1). Each corrective action was then evaluated using the proposed ‘standardized-improvement’ criterion (3), calculated as the RPN reduction divided by the corresponding feasibility rank (ΔRPN/F), and the software package [6] supporting the proposed improved FMEA procedure [5].

Post-FMEA evaluation of CAPA effectiveness revealed significant risk reduction:

$$ \Delta RPN_{Rel} = \frac{11022 - 5682}{11022} \cdot 100\% = 48.4\% \qquad (6) $$
$$ TRE_{Final} = \frac{5682}{40 \cdot 1000} \cdot 100\% = 14.2\% \qquad (7) $$
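The case-study figures can be re-derived in a few lines. The following is a minimal Python sketch, not the authors' software: the function names are illustrative assumptions, and only the totals quoted in the text are used (40 causes, RPN sums of 11022 before and 5682 after CAPA), since the individual RPN values are not listed.

```python
MAX_RPN = 10 * 10 * 10  # largest possible RPN on the '1' to '10' scales


def total_risk_estimate(rpn_sum, n_causes):
    """TRE in percent: sum of RPN values over the worst-case total, as in equation (2)."""
    return rpn_sum / (n_causes * MAX_RPN) * 100.0


def relative_improvement(rpn_sum_before, rpn_sum_after):
    """Normalized RPN reduction in percent, as in equation (4)."""
    return (rpn_sum_before - rpn_sum_after) / rpn_sum_before * 100.0


# Totals reported in the case study (assumed here to cover the same 40 causes).
n_causes = 40
tre_initial = total_risk_estimate(11022, n_causes)
tre_final = total_risk_estimate(5682, n_causes)
improvement = relative_improvement(11022, 5682)

print(f"TRE initial: {tre_initial:.3f}%")
print(f"TRE final:   {tre_final:.3f}%")
print(f"Improvement: {improvement:.3f}%")
```

After rounding to one decimal place, these values agree with the reported 27.6%, 14.2% and 48.4%. The final TRE falls below the 17% boundary suggested in Section 2.3, and the improvement exceeds the 30% reduction the authors expect from a correctly performed FMEA.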
Item (Failure Mode & Cause): Failed Measurement Due to Low Accuracy

Recommended Corrective Action    | S         | O         | D         | RPN       | ΔRPN | F | ΔRPN/F | Priority
                                 | Cur. Exp. | Cur. Exp. | Cur. Exp. | Cur. Exp. |      |   |        |
Detector Change                  | 10   10   | 7    2    | 6    6    | 420  120  | 300  | 6 | 50     | 2
Change of Measurement Procedure  | 10   10   | 7    4    | 6    6    | 420  240  | 180  | 2 | 90     | 1
Change of Calibration Procedure  | 10   10   | 7    5    | 6    6    | 420  300  | 120  | 4 | 30     | 3

Cur. = current value; Exp. = value expected after the corrective action; ΔRPN = current RPN minus expected RPN; F = feasibility rank; Priority = corrective action's priority.

Table 1. Example of Corrective Actions Prioritization
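The prioritization in Table 1 can be reproduced with a short script. This is a sketch under stated assumptions rather than the authors' tool: the `Action` class and its field names are hypothetical, while the S, O, D and F values are those listed in Table 1.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical record for one candidate corrective action."""
    name: str
    current: tuple    # (S, O, D) before the corrective action
    expected: tuple   # (S, O, D) expected after the corrective action
    feasibility: int  # F rank, 1 (best case) to 10 (worst case)

    @staticmethod
    def rpn(sod):
        # Equation (1): RPN = S * O * D
        s, o, d = sod
        return s * o * d

    @property
    def delta_rpn(self):
        return self.rpn(self.current) - self.rpn(self.expected)

    @property
    def ratio(self):
        # Criterion (3): RPN reduction divided by the feasibility rank.
        return self.delta_rpn / self.feasibility


actions = [
    Action("Detector Change", (10, 7, 6), (10, 2, 6), 6),
    Action("Change of Measurement Procedure", (10, 7, 6), (10, 4, 6), 2),
    Action("Change of Calibration Procedure", (10, 7, 6), (10, 5, 6), 4),
]

# The most preferable action is the one with the largest ratio.
ranked = sorted(actions, key=lambda a: a.ratio, reverse=True)
for priority, action in enumerate(ranked, start=1):
    print(priority, action.name, action.delta_rpn, action.ratio)
```

The ordering matches the Priority column of Table 1. Note that Detector Change gives the largest RPN reduction (300) but ranks second, because its feasibility rank of 6 is worse than the rank of 2 assigned to the change of measurement procedure.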
[Figure: RPN (vertical axis, 0 to 1000) plotted against Item (horizontal axis, 0 to 42) in ascending order, with an annotation marking the ‘Critical Items’.]

Figure 1. Scree Plot of Ordered RPN Values

REFERENCES

1. ‘Quality Glossary’, Quality Progress, June 2007.
2. ‘Potential Failure Mode and Effects Analysis (FMEA)’, QS-9000 Reference Manual, 1995.
3. ‘Quality Greatest Hits: Classic Wisdom from the Leaders of Quality’, Z. Bluvband, ASQ Quality Press, Milwaukee, Wisconsin, 2002.
4. ‘Supplier's Quality Assurance Manual’, SQAM, April 2008.
5. ‘Expanded FMEA (EFMEA)’, Z. Bluvband, P. Grabov, O. Nakar, RAMS, 2003.
6. PFMEA, RAM Commander RAMC 7.2 New Features, ALD WEB Site, www.aldservice.com

BIOGRAPHIES

Zigmund Bluvband, Ph.D.
A.L.D. Ltd.
52 Menachem Begin Road
Tel-Aviv, Israel 67137
Internet (e-mail): zigmund@ald.co.il

Zigmund Bluvband is the President of Advanced Logistics Developments Ltd. His Ph.D. (1974) is in Operations Research. He is a Fellow of ASQ and an ASQ-Certified Quality & Reliability Engineer, Quality Manager and Six Sigma Black Belt. He has accrued 30 years of industrial and academic experience and has published more than 60 papers and tutorials, 4 patents and three books. He was the President of the Israel Society of Quality from 1989 to 1994. In 2006 Zigmund Bluvband was honored with the IEEE Reliability Society Lifetime Achievement Award.

Pavel Grabov, Ph.D.
A.L.D. Ltd.
52 Menachem Begin Road
Tel-Aviv, Israel 67137
Internet (e-mail): grabov@ald.co.il

Pavel Grabov is VP and CTO of A.L.D. Ltd. His Ph.D. (1978) is in Nuclear Physics. He is a member of ASQ and an ASQ-Certified Quality Engineer and Six Sigma Black Belt. His area of expertise is quantitative methods of Quality & Reliability Engineering. He is a senior lecturer at Technion (Israel Institute of Technology). He is the Chairman of the Six Sigma Forum of the Israel Society for Quality and has over 35 years of academic and industrial experience in Quality and Reliability.