Accident Causation Models
DSTO-TR-2094
ABSTRACT
The increasing complexity in highly technological systems such as aviation, maritime, air
traffic control, telecommunications, nuclear power plants, defence and aerospace, chemical
and petroleum industry, and healthcare and patient safety is leading to potentially disastrous
failure modes and new kinds of safety issues. Traditional accident modelling approaches are
not adequate to analyse accidents that occur in modern sociotechnical systems, where
accident causation is not the result of an individual component failure or human error. This
report provides a review of key traditional accident modelling approaches and their
limitations, and describes new system-theoretic approaches to the modelling and analysis of
accidents in safety-critical systems. It also discusses current research on the application of
formal (mathematically-based) methods to accident modelling and organisational theories on
safety and accident causation. This report recommends new approaches to the modelling and
analysis of complex systems that are based on systems theory and interdisciplinary research,
in order to capture the complexity of modern sociotechnical systems from a broad systemic
view for understanding the multi-dimensional aspects of safety and accident causation.
Executive Summary
Large complex systems such as the Bhopal chemical plant and the Operation Provide
Comfort Command and Control System are semantically complex (it generally takes a
great deal of time to master the relevant domain knowledge), with tight couplings
between various parts, and where operations are often carried out under time pressure
or other resource constraints (Woods et al., 1994). In such systems, accidents gradually
develop over a period of time through a conjunction of several small failures, both
machine and human (Perrow, 1984; Reason, 1990). This pattern is generally found in
different industrial and aerospace accidents, despite the fact that every sociotechnical
system is unique and each accident has many different aspects.
Traditionally, accidents have been viewed as resulting from a chain of failure events,
each related to its “causal” event or events. Almost all safety analysis and risk
assessment techniques are based on this linear notion of causality, which has severe
limitations in the modelling and analysis of modern complex systems. As opposed to
conventional engineered systems, modern complex systems constitute different kinds
of elements, intentional and non-intentional: social institutions, human agents and
technical artefacts (Kroes et al., 2006). In these systems, referred to as sociotechnical
systems, humans interact with technology to deliver outcomes that cannot be attained
by humans or technology functioning in isolation. In sociotechnical systems human
agents and social institutions are integrated, and the attainment of organisational
objectives is not met by the optimisation of technical systems alone, but by the joint
optimisation of the technical and social aspects (Trist & Bamforth, 1951). Thus, the
study of modern complex systems requires an understanding of the interactions and
interrelationships between the technical, human, social and organisational aspects of
systems. These interactions and interrelationships are complex and non-linear, and
traditional modelling approaches cannot fully analyse the behaviours and failure
modes of such systems.
The findings of this survey recommend new approaches to the modelling and analysis
of complex systems that are based on systems theory. The sociotechnical system must
be treated as an integrated whole, and the emphasis should be on the simultaneous
consideration of social and technical aspects of systems, including social structures and
cultures, social interaction processes, and individual factors such as capability and
motivation as well as engineering design and technical aspects of systems.
Interdisciplinary research is needed to capture the complexity of modern sociotechnical
systems from a broad systemic view for understanding the multi-dimensional aspects
of safety and modelling sociotechnical system accidents.
Author
Zahid H. Qureshi
Defence and Systems Institute, Division of Information
Technology, Engineering and the Environment, University of
South Australia
1. Introduction
System safety is generally considered to be the characteristic of a system that prevents
injury to or loss of human life, damage to property, and adverse consequences to the
environment. The IEC 61508 (1998-2000) safety standard defines safety as, “freedom
from unacceptable risk of physical injury or of damage to the health of people, either
directly, or indirectly as a result of damage to property or to the environment”.
Bhopal is the site of probably the greatest industrial disaster in history. In the early
hours of 3rd December 1984, a pesticide plant owned by Union Carbide, a US-based
multinational company, released a cloud of deadly gas into the atmosphere
(Srivastava, 1992). Within minutes, it had drifted over the sleeping town of Bhopal in
India. Estimates of the number of deaths on that night vary widely. The Indian
government's official estimate is that 1,700 people died within 48 hours. Unofficially, it
is said that around 6,000 people perished in the days immediately following the gas
leak. What is certain is that the victims of Bhopal suffered horribly, most of them
drowning in their own bodily fluids as the gas attacked their lungs. To date, over
20,000 people have died as a result of the accident. An estimated 10 to 15 people die
of gas-related illnesses each month. More than 50,000 are too sick to work, while
around 5,000 families continue to drink poisoned water. As a result, the infant
mortality rate is significantly higher in Bhopal than in the rest of the country. The
Bhopal disaster was a result of a combination of legal, technological, organisational,
and human errors (Rasmussen, 1997).
One of the worst air-to-air friendly fire accidents involving US aircraft in military
history occurred on April 14, 1994 over northern Iraq (AAIB, 1994) during Operation
Provide Comfort. A pair of F-15Cs of the 52nd Fighter Wing enforcing the No Fly Zone
mistakenly shot down two UH-60 Black Hawk helicopters, killing 26 American and
United Nations personnel who were providing humanitarian aid to Kurdish areas of
Iraq. One of the helicopters was destroyed by an AIM-120, the other by a Sidewinder.
After a series of investigations by military and civilian boards with virtually unlimited
resources, no culprit emerged; no bad guy showed himself and no smoking gun was
found (Snook, 2002). The major reasons for what went wrong were organisational
factors and the human operational use of technical systems that were embedded in a
complex Command and Control structure (Leveson et al., 2002). Furthermore, it should
be noted that, except for the failure of the Identify Friend or Foe (IFF) equipment, there
were no technical malfunctions which contributed to the accident.
Large complex systems such as the Bhopal chemical plant and the Operation Provide
Comfort Command and Control System are semantically complex (it generally takes a
great deal of time to master the relevant domain knowledge), with tight couplings
between various parts, and where operations are often carried out under time pressure
or other resource constraints (Woods et al., 1994). In such systems, accidents gradually
develop over a period of time through a conjunction of several small failures, both
machine and human (Perrow, 1984; Reason, 1990). This pattern is generally found in
different industrial and aerospace accidents, despite the fact that every sociotechnical
system is unique and each accident has many different aspects.
The historical development of accident models and various approaches for accident
analysis have been discussed by engineers, scientists, cognitive psychologists, and
sociologists (Ferry, 1988; Hayhurst & Holloway, 2003; Hollnagel & Woods, 2005;
Johnson, 2003; Leveson, 1995; Leveson, 2001; Perrow, 1984; Rasmussen & Svedung,
2000; Reason, 1997; Skelt, 2002; Vaughn, 1996). In particular, Hollnagel (2001) provides
an overview of the major changes to accident models since the 1950s, and argues that
this reflects the developments in the commonly agreed understandings of the nature of
an accident.
One of the earliest accident causation models is the Domino theory proposed by
Heinrich in the 1940s (Heinrich et al., 1980), which describes an accident as a chain of
discrete events which occur in a particular temporal order. This theory belongs to the
class of sequential accident models or event-based accident models, which underlie
most traditional safety and risk analysis techniques such as Failure Modes and Effects Analysis (FMEA), Fault Tree
Analysis (FTA), Event Tree Analysis, and Cause-Consequence Analysis (Leveson,
1995). These models work well for losses caused by failures of physical components or
human errors in relatively simple systems. Typically, in these models, causal factors in
an accident that were not linked to technical component failures were classified as
human error, a kind of catchall or “garbage can” (Hollnagel, 2001). These models are
limited in their capability to explain accident causation in the more complex systems
that were developed in the last half of the 20th century.
Modern technology and automation have significantly changed the nature of human
work from mainly manual tasks to predominantly knowledge intensive activities and
cognitive tasks. This has created new problems for human operator performance (such
as cognitive load) and new kinds of failure modes in the overall human-machine
systems. Cognitive systems engineering (Hollnagel & Woods, 1983) has emerged as a
framework to model the behaviour of human-machine systems in the context of the
environment in which work takes place. Two systemic accident models for safety and
accident analysis have been developed based on the principles of cognitive systems
engineering: CREAM - Cognitive Reliability and Error Analysis Method (Hollnagel,
1998); and FRAM - Functional Resonance Accident Model (Hollnagel, 2004).
During the last decade many attempts have been made to use formal methods
for building mathematically-based models to conduct accident analysis (Fields et al.,
1995; Burns, 2000; Johnson & Holloway, 2003a; Vernez et al., 2003). Formal methods
can improve accident analysis by emphasising the importance of precision in
definitions and descriptions, and providing notations for describing and reasoning
about certain aspects of accidents. One of the most advanced applications of formal
methods to accident analysis is the Why-Because Analysis method (Ladkin & Loer, 1998),
which employs a formal logic for accident modelling and rigorous reasoning for causal
analysis. This method has been successfully applied to a number of case studies in
aviation and rail transportation (Höhl & Ladkin, 1997; Ladkin, 2005).
In parallel, sociologists and organisational theorists have developed their own approach towards understanding the social and organisational causes of accidents (see,
for example: Perrow, 1984; Vaughn, 1996; Hopkins, 2000).
This report provides a review of key traditional accident modelling approaches and
their limitations, and describes new system-theoretic approaches to the modelling and
analysis of accidents in complex sociotechnical systems. An overview of traditional
approaches, in particular event-based models, to accident modelling is given in
Chapter 2, including their limitations in analysing accidents in modern complex systems. In
Chapter 3, we discuss the nature and complexity of modern sociotechnical systems,
describe Reason’s organisational model of accident causation, and discuss the recent
developments of systemic accident models. Two main systemic models, Rasmussen’s
risk management framework and AcciMap accident analysis technique, and Leveson’s
Systems Theoretic Accident Modelling and Processes approach are discussed in
Chapters 4 and 5 respectively. The recent work on the application of formal methods,
based on formal logics, to accident modelling and analysis is discussed in Chapter 6. In
Chapter 7, we discuss the social, cultural and organisational factors in system
accidents, and review sociological and organisational theories on safety and accident
causation. Finally, we discuss future trends in the application and development of
systemic accident models that consider the simultaneous interactions of technical,
human, social, cultural and organisational aspects of modern complex systems.
[Figure 1: Heinrich's Domino model of accident causation: a sequence of dominoes (social environment, fault of person, unsafe act, accident, injury) arranged along a timeline.]
Sequential models work well for losses caused by failures of physical components or
human errors in relatively simple systems. While the Domino model considers only a
single chain of events, event-based accident models can also be represented by
multiple sequences of events in the form of hierarchies such as event trees and networks.
A detailed description of these models can be found in Leveson (1995). The events
considered in these models generally correspond to component failure, human error,
or energy-related event. For example, in the Multilinear Events Sequencing (MES)
model (Benner, 1975) every event is a single action by an actor. A timeline is included
to show the timing and sequencing of the events and conditions (Figure 2). Multiple chains
of events, corresponding to different actors, are synchronised using the timeline. The
MES charting method provides criteria to guide the development of the explanation of
specific accidents in a manner that facilitates the transfer of knowledge among accident
investigators. The accident sequence begins when a stable situation is disturbed. If the
actor involved in the sequence adapts to the disturbance, the accident is averted.
Countermeasures can be formulated by examination of each individual event to see
where changes can be introduced to alter the process.
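To illustrate the form of such a multilinear representation, the following Python sketch orders single-actor events on a shared timeline so that several actor-specific event chains can be synchronised. The actors, events and class names are invented for illustration and are not taken from Benner's MES notation.

    from dataclasses import dataclass

    @dataclass
    class Event:
        actor: str    # the person or component performing the action
        action: str   # a single action by that actor (one event per action)
        time: float   # minutes after the start of the accident sequence

    # A hypothetical two-actor accident sequence; each actor has its own chain.
    events = [
        Event("Pump A", "loses suction", 0.0),
        Event("Operator", "notices alarm", 2.0),
        Event("Pump A", "overheats", 4.0),
        Event("Operator", "opens bypass valve", 5.0),
    ]

    # Synchronise the separate actor chains on the common timeline.
    for e in sorted(events, key=lambda e: e.time):
        print(f"t={e.time:4.1f} min  {e.actor:>8}: {e.action}")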
Although the MES model shows how events are related and combine to cause
accidents, the development and analysis of such models is time consuming and
requires significant analyst expertise. Insensitivity of the analyst to the possibility of
missing information has been shown to cause overconfidence in model predictions
(Fischoff et al., 1978).
Figure 2: Activity events and outcomes for two actors including events (Ferry, 1988)
In event-based models, the events have a direct linear relationship. These models can
only describe linear causality, and it is difficult to incorporate non-linear relationships.
The first event in the chain is often considered the “initiating event”; however, the
selection of the initiating event is arbitrary and previous events and conditions could
always be added (Leveson, 2001). A particular event may be selected as the cause
because it is the event immediately preceding the accident. The friendly fire shoot
down of the two US Black Hawk helicopters in Iraq (AAIB, 1994) could be blamed on
the F-15 pilots as the root cause, since the last condition before the accident was the
firing of the missiles. However, the accident report has shown that there were a large
number of factors and events that contributed to the accident. One reason for this
tendency to look for a single cause is to assign blame, often for legal purposes.
Occasionally, an accident investigator will stop at a particular event or condition that is
familiar and can be used as an acceptable explanation of the accident. Usually there is
no objective criterion for distinguishing one factor or several factors from the other
factors that make up the cause of the accident (Leveson, 2001).
In many systems engineering areas, complex and safety critical systems development
employ hazard analysis techniques to predict the occurrence of accidents in order to
reduce risk and ensure safety in system design and operation. Hazard analysis is an
activity by which sequences of events that can lead to hazards or accidents are
identified, and the chance of such a sequence occurring is estimated (Leveson, 1986;
ATEA, 1998). Leveson evaluates a number of models and techniques that are used in
accident investigations and occasionally in predictive analysis. We discuss two widely
used hazard analysis techniques that are employed during the early stages of system
design.
Fault Tree Analysis (FTA) is primarily a technique for analysing the causes of hazards,
and traditionally used for the safety analysis of electromechanical systems. A fault tree
is a logical diagram that shows the relationship between a system failure, i.e. a specific
undesirable hazardous event in the system, and failures of the components of the
system. The component failures can be events associated with hardware, software and
human error. It is a technique based on deductive logic. The analyst first assumes a
particular system state and a top (hazardous) event, and then identifies the causal
events (component failures) related to the top event and the logical relations between
them, using logic symbols to describe the relations. A fault tree is a simplified
representation of a very complex process. It does not convey any notion of time
ordering or time delay. A fault tree is a snapshot of the state of the system at one point
in time.
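As a concrete illustration of how the logical structure of a fault tree can be evaluated, the following Python sketch computes a top-event probability from independent basic-event probabilities using the usual AND/OR gate formulas. The tree and the probabilities are invented for illustration, and the independence assumption is itself a simplification of the kind discussed in this section.

    from functools import reduce

    def and_gate(*probs):
        # All inputs must occur: product of independent probabilities.
        return reduce(lambda acc, p: acc * p, probs, 1.0)

    def or_gate(*probs):
        # At least one input occurs: complement of none occurring.
        return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)

    # Hypothetical basic-event probabilities (per demand).
    p_pump_fails     = 1e-3
    p_valve_sticks   = 5e-4
    p_operator_error = 1e-2
    p_alarm_fails    = 2e-3

    # Top event: loss of cooling occurs if an initiating failure occurs
    # (pump fails OR valve sticks) AND the recovery path fails
    # (operator error OR alarm failure).
    p_initiator      = or_gate(p_pump_fails, p_valve_sticks)
    p_recovery_fails = or_gate(p_operator_error, p_alarm_fails)
    p_top            = and_gate(p_initiator, p_recovery_fails)

    print(f"P(top event) = {p_top:.2e}")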
Failure Modes and Effects Analysis (FMEA) was originally developed to predict the
reliability of hardware systems. The objective of the analysis is to validate the design
by listing all possible sources of failures of a system’s components and by determining
the various effects of these failures on the behaviour of the system. FMEA uses forward
search based on an underlying chain-of-events model, where the initiating events are
failures of individual components. FMEA is most appropriate for standard components
with few and well-known failure modes, and is effective for analysing single point
failure modes. FMEA considers each failure as an independent occurrence with no
relation to other failures in the system. Thus this technique does not consider multiple
or common cause failures, and it is quite difficult to investigate accidents that could
arise due to a combination of failure modes. It cannot easily be used to analyse the
interactions between complex subsystems. Furthermore, the analysis is static, i.e., real-
time aspects are ignored. Because FMEAs establish the end effects of failures, they are
sometimes used in safety analysis for predicting the failures and hazards that may lead
to accidents. Failure Modes and Effects Criticality Analysis (FMECA) is basically an
FMEA with more detailed analysis of the criticality of the failure.
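The worksheet character of FMEA and FMECA can be illustrated with the short Python sketch below. The components, failure modes and ratings are invented; the risk priority number (severity x occurrence x detection) is one commonly used criticality ranking, and the sketch deliberately treats each failure mode as an independent, single-point occurrence, which is exactly the limitation noted above.

    from dataclasses import dataclass

    @dataclass
    class FailureMode:
        component: str
        mode: str
        effect: str
        severity: int    # 1 (negligible) .. 10 (catastrophic)
        occurrence: int  # 1 (remote) .. 10 (frequent)
        detection: int   # 1 (almost certain to detect) .. 10 (undetectable)

        @property
        def rpn(self) -> int:
            # Risk priority number: a common FMECA criticality ranking.
            return self.severity * self.occurrence * self.detection

    worksheet = [
        FailureMode("Relief valve", "stuck closed", "vessel overpressure", 9, 3, 6),
        FailureMode("Level sensor", "drifts high", "tank overfill", 7, 5, 4),
        FailureMode("Pump seal", "leaks", "loss of containment", 6, 6, 3),
    ]

    # Rank the single-point failure modes by criticality. Interactions and
    # common-cause failures are outside the scope of this kind of worksheet.
    for fm in sorted(worksheet, key=lambda fm: fm.rpn, reverse=True):
        print(f"RPN={fm.rpn:4d}  {fm.component}: {fm.mode} -> {fm.effect}")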
FTA, FMEA, and FMECA are standard risk analysis methods for component failure
analysis. Such traditional approaches have serious limitations in the analysis of
complex sociotechnical systems, since they do not consider the organisational, social,
and complex interactions between the various system components.
Sequential models assume that the cause-effect relation between consecutive events is
linear and deterministic. Analysing an accident may show that cause A led to effect B
in a specific situation, while A may be a composite event (or state) in turn having
numerous causes (Hollnagel, 2001). Thus, these models cannot comprehensively
explain accident causation in modern sociotechnical systems where multiple factors
combine in complex ways leading to system failures and accidents.
Charles Perrow’s seminal work on normal accident theory (Perrow, 1984) provides an
approach to understanding accident causation in complex organisations managing
hazardous technologies such as nuclear power plants, petrochemical plants, aircraft,
marine vessels, space, and nuclear weapons. Perrow analyses many notable accidents
involving complex systems such as the 1979 Three Mile Island nuclear power accident,
and identifies that the characteristics that make a technological system or organisation
more prone to accidents are complex interactions and tight coupling.
A complex system is composed of many components that interact with each other in
linear and complex manners. Linear interactions are those that are expected in
production or maintenance sequences, and those that are quite visible even if
unplanned (during design), while complex (nonlinear) interactions are those of
unfamiliar sequences, unplanned and unexpected sequences, and either not visible or
not immediately comprehensible (Perrow, 1984). Two or more discrete failures can
interact in unexpected ways which designers could not predict and operators cannot
comprehend or control without exhaustive modelling or testing.
The type of coupling (tight or loose coupling) of components in a system affects its
ability to recover from discrete failures before they lead to an accident or disaster.
Perrow (1984) discusses the characteristics of tightly and loosely coupled systems.
Tightly coupled systems have more time-dependent processes, so that a failure or
event in one component has an immediate impact on the interacting component.
Tightly coupled systems have little slack: quantities must be precise and resources
cannot be substituted for one another. For example, a production system must be
shut down if a subsystem fails because the temporary substitution of other equipment
is not possible. In contrast, loosely coupled systems are more forgiving; delays are
possible, products can be produced in a number of ways, and slack in resources is
possible.
[Figure: Reason's defences-in-depth ("Swiss cheese") model: some holes in the successive layers of defence are due to active failures (unsafe acts), others to latent conditions arising from organisational and line management factors, through which hazards can penetrate to cause losses.]
Active failures are the unsafe acts committed by people who are in direct contact with
the patient or system. They take a variety of forms: slips, lapses, fumbles, mistakes, and
procedural violations (Reason, 1990). Active failures have a direct and usually short
lived impact on the integrity of the defences. At Chernobyl, for example, the operators
wrongly violated plant procedures and switched off successive safety systems, thus
creating the immediate trigger for the catastrophic explosion in the core. Followers of
the person approach often look no further for the causes of an adverse event once they
have identified these proximal unsafe acts. But, as discussed below, virtually all such
acts have a causal history that extends back in time and up through the levels of the
system.
Latent conditions are the inevitable “resident pathogens” within the system (Reason,
1997). They arise from decisions made by designers, builders, procedure writers, and
top-level management. Such decisions may be mistaken, but they need not be. All such
strategic decisions have the potential for introducing pathogens into the system. Latent
conditions have two kinds of adverse effect: they can translate into error provoking
conditions within the local workplace (for example, time pressure, understaffing,
inadequate equipment, fatigue, and inexperience) and they can create long-lasting
holes or weaknesses in the defences (untrustworthy alarms and indicators, unworkable
procedures, design and construction deficiencies, etc). Latent conditions, as the term
suggests, may lie dormant within the system for many years before they combine with
active failures and local triggers to create an accident opportunity. Unlike active
failures, whose specific forms are often hard to foresee, latent conditions can be
identified and remedied before an adverse event occurs. Understanding this leads to
proactive rather than reactive risk management.
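The intuition behind the model can be expressed in a few lines of Python. The defence layers, the particular holes and the sequence of events below are invented purely for illustration; the point is that a hazard leads to a loss only when a weakness exists, at the same moment, in every successive layer of defence.

    # Each defence layer has a set of current weaknesses ("holes"): latent
    # conditions that lie dormant, plus active failures that appear transiently.
    defences = {
        "alarms and indicators": {"untrustworthy alarm"},   # latent condition
        "procedures":            {"unworkable procedure"},  # latent condition
        "front-line operation":  set(),                     # no hole at present
    }

    def trajectory_penetrates(defences):
        # A hazard reaches the victims or assets only if every layer has at
        # least one hole at the same time, i.e. the holes momentarily line up.
        return all(holes for holes in defences.values())

    print(trajectory_penetrates(defences))   # False: the last layer is intact

    # An active failure (a slip under time pressure) opens a short-lived hole
    # in the last layer, and an accident trajectory can now pass through.
    defences["front-line operation"].add("slip under time pressure")
    print(trajectory_penetrates(defences))   # True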
The notion of latent factors supports the understanding of accident causation beyond
the proximate causes, which is particularly advantageous in the analysis of complex
systems that may present multiple-failure situations. This model has been particularly
useful in accident investigation, as it addresses the identification of latent failures
within the causal sequence of events as well. This model has been widely applied in
many domains to understand how accidents are caused, such as the oil and gas
industry (Wagenaar et al., 1994) and commercial aviation (Maurino et al., 1995), and it has
become a standard in medicine (Reason et al., 2000; Reason, 2000).
This model places a great emphasis on the search for latent or organisational causes
and provides an understanding of how these are related to the immediate causes at the
sharp end. Reason (1990) conducted a number of case studies of the Three Mile Island,
Bhopal, and Chernobyl accidents and identified several latent failures related to
organisational, management and design failures. In the Swiss cheese model the latent
and active errors are causally linked to management as a linear sequence of events, and
this can lead to the illusion that the roots of all accidents or even errors stem from the
organisation’s management. Shorrock et al. (2003) argue that, in some cases, the main
contributory factors might well have been active errors with more direct implications
for the accident causation.
Johnson & Botting (1999) employed Reason’s model to understand the organisational
aspects of the Watford Junction railway accident. They studied the latent conditions
that contributed to the train driver's active failure in violating two sets of signals.
Numerous organisational factors were identified as the causal factors that contributed
to the probability of the accident (see Table 1). However, this model does not give a
clear explanation of how these causal factors combined to provide the circumstances for
an accident to take place. For example, the main defences at Watford Junction, the
positioning of the Permanent Speed Restrictions signs and the junction signals, were
not independent, and Johnson & Botting recommend the use of formal methods to
analyse this complexity in detail. Furthermore, the causal links between distant latent
conditions (organisational factors) and the accident outcome are complex and loosely
coupled (Shorrock et al., 2003), and Reason’s model only guides a high-level analysis
of the contributory factors involved in an organisational accident.
Table 1: Watford Junction Railway Accident – Active failures and latent conditions
(Johnson & Botting, 1999)
Reason’s model shows a static view of the organisation, whereas the defects are often
transient, i.e. the holes in the Swiss cheese are continuously moving. In reality, the
sociotechnical system is more dynamic than the model suggests.
In sociotechnical systems, computers and technical artefacts in general are becoming more
and more tightly integrated with human activities. Failures in sociotechnical systems
are the result of a combination of factors meshed into a complex causal network spread
over several hierarchical levels within an organisation (Reason, 1990; 1997). Besnard &
Baxter (2003) argue that technical and organisational issues need to be simultaneously
considered to capture the causal mesh leading to accidents and discuss the integrative
representation of the event chain and Reason’s Swiss cheese models. There are strong
common ideas between these two models (Besnard & Baxter, 2003):
Besnard & Baxter (2003) state that each organisational layer invariably contains one or
more holes, which can be attributed to the occurrence of fault-error-failure chains
during its creation or functioning. One then gets an elementary failure generation chain
for a hole in a given system’s layer (Figure 6) that provides an identifiable causal path
for each hole. In other words, this approach provides a mapping between failures and
holes in the system’s layers, where event chain and Reason’s models can be turned into
compatible representations of systems failures. This opens up a new area of application
for the event chain model, that of sociotechnical system failures. Equally, it allows
Reason’s model to connect to technical causal paths of failures in systems.
Besnard & Baxter (2003) developed a three-layer model for the THERAC-25 system: the
regulation authorities, the company that developed the system, and the programmer
who wrote the code, and introduced a fault-error-failure chain for each hole in the
various system layers (see Figure 7). One of the many chains for each of the layers is
described below:
• The programmer did not take all of the system’s real-time requirements into
account (fault).
• This led to the possibility of flaws in some software modules (error) that
degraded the reliability of the software (failure).
• The company did not perform all the required tests on the software (fault).
This resulted in bugs in some modules remaining undetected and hence
unfixed (error), thereby triggering exceptions when the given modules were
called (failure).
• The regulation authorities did not thoroughly inspect the system (fault). This
led to some flaws remaining undetected (error). In turn, these flaws caused
injuries and deaths when the system was used (failure).
Figure 7: Integrating Event Chain and Reason’s Models (Besnard & Baxter, 2003)
The resulting integrated model offers a richer description of sociotechnical failures by
suggesting a mapping between sequences of events (a fault-error-failure chain) and
holes in the layers of a system (Reason’s Swiss cheese model). This approach provides
some intrinsic interest since it constitutes a step forward in reconciling technical and
organisational views on failures in sociotechnical systems.
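A minimal sketch of this mapping, using the three THERAC-25 layers described above, is given below in Python. The class and field names are illustrative rather than Besnard & Baxter's notation; each hole in a layer is simply paired with the fault-error-failure chain that explains it.

    from dataclasses import dataclass

    @dataclass
    class FaultErrorFailureChain:
        fault: str    # the originating deviation
        error: str    # the erroneous system state it produces
        failure: str  # the externally visible failure

    # One hole per organisational layer, each explained by an identifiable
    # causal chain, paraphrasing the THERAC-25 example discussed above.
    holes = {
        "Programmer": FaultErrorFailureChain(
            "real-time requirements not fully taken into account",
            "flaws in some software modules",
            "degraded software reliability"),
        "Company": FaultErrorFailureChain(
            "required tests not all performed",
            "bugs remained undetected and unfixed",
            "exceptions raised when the affected modules were called"),
        "Regulation authorities": FaultErrorFailureChain(
            "system not thoroughly inspected",
            "flaws remained undetected",
            "injuries and deaths when the system was used"),
    }

    for layer, chain in holes.items():
        print(f"{layer}: {chain.fault} -> {chain.error} -> {chain.failure}")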
New approaches to accident modelling adopt a systemic view which considers the
performance of the system as a whole. In systemic models, an accident occurs when
several causal factors (such as human, technical and environmental) exist
coincidentally in a specific time and space (Hollnagel, 2004). Systemic models view
accidents as emergent phenomena, which arise from the complex interactions
between system components that may lead to degradation of system performance, or
result in an accident.
Systemic models have their roots in systems theory. Systems theory includes the
principles, models, and laws necessary to understand complex interrelationships and
interdependencies between components (technical, human, organisational and
management) of a complex system.
In this view, a system is composed of interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system is not regarded as a static design, but as a dynamic
process that is continually adapting to achieve its objectives and react to changes in
itself and its environment. The system design should enforce constraints on its
behaviour for safe operation, and must adapt to dynamic changes to maintain safety.
Accidents are treated as the result of flawed processes involving interactions among
people, social and organisational structures, engineering activities, and physical and
software system components (Leveson, 2004).
Modern technology has changed the nature of human work from mainly manual tasks
to predominantly knowledge intensive activities and cognitive tasks. Technology-
driven approaches to automation have created new problems for human operator
performance and new kinds of failure modes in the overall human-machine systems,
which have led to many catastrophic accidents in the fields of aviation, nuclear power
plants and military command and control (Parasuraman, 1997). This has influenced the
development of new approaches for human performance and error modelling, and
accident analysis of joint human-machine systems.
Two systemic accident models for safety and accident analysis have been developed
based on the principles of cognitive systems engineering: the Cognitive Reliability and
Error Analysis Method (CREAM); and the Functional Resonance Accident Model
(FRAM).
Structural Hierarchy
The top level L1 describes the activities of government, who through legislation control
the practices of safety in society. Level L2 describes the activities of regulators,
industrial associations and unions (such as medical and engineering councils) that are
responsible for implementing the legislation in their respective sectors. Understanding
these two levels usually requires knowledge of political science, law, economics and
sociology. Level L3 describes the activities of a particular company, and usually
requires knowledge of economics, organisational behaviour, decision theory and
sociology. Level L4 describes the activities of the management in a particular company
that lead, manage and control the work of their staff. Knowledge of management
theories and industrial-organisational psychology is used to understand this level.
Level L5 describes the activities of the individual staff members who interact
directly with the technology or process being controlled, such as power plant control
operators, pilots, doctors and nurses. This level requires knowledge of disciplines
such as psychology, human-machine interaction and human factors. The bottom level
L6 describes the work itself, that is, the physical processes and equipment being
controlled, and requires knowledge of engineering disciplines such as mechanical,
chemical and electrical engineering.
[Figure 8: The hierarchical structure of a complex sociotechnical system involved in risk management. The levels L1 Government, L2 Regulators and Associations, L3 Company, L4 Management, L5 Staff and L6 Work are studied by different research disciplines (political science, law, economics and sociology; economics, decision theory and organisational sociology; industrial engineering, management and organisation; psychology, human factors and human-machine interaction; mechanical, chemical and electrical engineering) and are subject to environmental stressors such as a changing political climate and public awareness, changing market conditions and financial pressure, changing competency and levels of education, and the fast pace of technological change.]
As shown on the right of Figure 8, the various layers of complex sociotechnical systems
are increasingly subjected to external disruptive forces, which are unpredictable,
rapidly changing and have a powerful influence on the behaviour of the sociotechnical
system. When different levels of the system are being subjected to different pressures,
each operating at different time scales, it is imperative that efforts to improve safety
within a level be coordinated with the changing constraints imposed by other levels.
System Dynamics
Decision making and human activities are required to remain within the bounds of
the workspace defined by administrative, functional and safety constraints. Rasmussen
argues that in order to analyse a work domain’s safety, it is important to identify the
boundaries of safe operations and the dynamic forces that may cause the sociotechnical
system to migrate towards or cross these boundaries. Figure 9 shows the dynamic
forces that can cause a complex sociotechnical system to modify its structure and
behaviour over time.
[Figure 9: Dynamic forces acting on work practices: gradients towards least effort and economic efficiency drive behaviour within a space bounded by the boundary of functionally acceptable behaviour (beyond which accidents occur), the boundary of economic failure, the boundary of unacceptable work load, and the boundary set by official safety regulations.]
Over a period of time, this adaptive behaviour causes people to cross the boundary of
safe work regulations and leads to a systematic migration toward the boundary of
functionally acceptable behaviour. This may lead to an accident if control is lost at the
boundary. The migration in work practices does not usually have any visible,
immediate threat to safety prior to an accident, because violation of procedures does
not immediately lead to a catastrophe. At each level in the sociotechnical hierarchy,
people are working hard, striving to respond to cost-effectiveness pressures, but they do
not see how their decisions interact with those made by other actors at different levels
in the system (Woo & Vicente, 2003). Rasmussen asserts that these uncoordinated
attempts of adapting to environmental stressors are slowly but surely “preparing the
stage for an accident”.
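This systematic migration can be caricatured in a few lines of Python. The pressure terms, the boundary value and the occasional safety campaign below are invented solely to illustrate the idea of drift under competing gradients; they are not a model drawn from Rasmussen's work.

    import random

    random.seed(1)

    # Position of the operating point relative to the boundary of functionally
    # acceptable behaviour: 0.0 = nominal practice, 1.0 = the boundary itself.
    position = 0.2
    boundary = 1.0

    for month in range(1, 61):
        effort_gradient = 0.02     # pressure towards least effort
        cost_gradient   = 0.02     # pressure towards economic efficiency
        safety_campaign = 0.05 if month % 12 == 0 else 0.0   # occasional push-back
        noise = random.uniform(-0.01, 0.01)

        position += effort_gradient + cost_gradient - safety_campaign + noise
        if position >= boundary:
            print(f"Month {month}: control lost at the boundary (accident).")
            break
    else:
        print("No accident within the simulated period.")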
The safety control structure often changes over time, which accounts for the
observation that accidents in complex systems frequently involve a migration of the
system towards a state where a small deviation (in the physical system or operator
behaviour) can lead to a catastrophe. The analyses of several accidents such as Bhopal
and Chernobyl demonstrate that they have not been caused by coincidence of
independent failures and human errors, but by a systematic migration of
organisational behaviour towards an accident under the influence of pressure toward
cost-effectiveness in an aggressive, competitive environment (Rasmussen, 1997).
Rasmussen’s approach for improving safety and risk management raises the need for
the identification of the boundaries of safe operation, making these boundaries visible
to the actors and giving opportunities to control behaviour at the boundaries.
A representative set of accident cases is selected for the industrial sector under
investigation. For each of these accident scenarios the causal chains of events are then
analysed. From here an overview of the patterns of accidents related to a particular
activity or system is generated by a cause-consequence analysis that is represented by a
cause-consequence chart.
[Figure 10: Structure of a cause-consequence chart. Causes such as operator errors, technical faults and faulty maintenance combine through AND and OR gates to produce a critical event (a disturbance of a major energy balance); consequence branches then test whether automatic control functions, operator interference, safety system functions and intact barriers terminate the sequence, and if not, an accident results.]
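The following Python sketch shows, in a toy form, how the two halves of such a chart fit together: a cause side expressed with AND/OR relations, and a consequence side in which each barrier either terminates the sequence or lets it propagate towards an accident. The events, barriers and truth values are invented for illustration.

    # Cause side: the critical event occurs if a technical fault or faulty
    # maintenance disturbs the process AND the automatic control fails to act.
    technical_fault        = True
    faulty_maintenance     = False
    auto_control_functions = False

    critical_event = (technical_fault or faulty_maintenance) and not auto_control_functions

    # Consequence side: each barrier either terminates the sequence ("Yes"
    # branch) or passes it on towards an accident ("No" branch).
    barriers = [
        ("operator interferes",      False),
        ("safety systems function",  False),
        ("physical barriers intact", True),
    ]

    outcome = "no critical event"
    if critical_event:
        outcome = "accident"
        for name, works in barriers:
            if works:
                outcome = f"sequence terminated: {name}"
                break

    print(outcome)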
Identification of Actors
The cause-consequence chart focuses on the control of the hazardous process at the
lowest level of the sociotechnical system (level 6 in Figure 8). In order to conduct a
vertical analysis across the hierarchical levels, the cause-consequence chart
representation is extended to explicitly include the normal work decisions made at the
higher levels of the sociotechnical system (levels 1-6 in Figure 8). This extension results
in an AcciMap which shows the activities of various decision makers contributing to or
preventing an accident. The AcciMap represents a mapping of these contributing
factors onto the respective levels of a complex sociotechnical system identified in
Figure 8.
Figure 11: AcciMap Structure and Symbols (Rasmussen & Svedung, 2000)
The basic AcciMap is developed from analysis of one particular accident case, i.e., it
reflects one particular course of events. The layout and symbols used in an AcciMap
are shown in Figure 11 (Rasmussen & Svedung, 2000):
- At the bottom is a level representing the topography of the accident scene: the
configuration and physical characteristics of the landscape, buildings,
equipment, tools, vehicles, etc. found at the location and involved in the
accident.
- The next higher level represents the accident processes, that is, the causal
and functional relations of the dynamic flow, described using the cause-consequence
chart conventions. The flow includes "Decision/Action"
boxes connected to consequence boxes where the flow has been or could be
changed by human (or automated) intervention.
- At the levels above this, the "Decision/Action" box symbol is used to represent
all decision-makers that, through decisions in their normal work context, have
influenced the accidental flow at the bottom.
In this way, the AcciMap serves to identify relevant decision-makers and the normal
work situation in which they influence the occurrence of accidents. The focus is not on
the traditional search for identifying the “guilty person”, but on the identification of
those people in the system who can make decisions resulting in improved risk management.
The basic AcciMap represents the flow of events from one particular accident. From the
set of AcciMaps based on the set of accident scenarios, a generalised map, a generic
AcciMap is developed that identifies the interaction among the different decision
makers and the events leading to an accident. The generic AcciMap regarding the
transport of dangerous goods is shown in Rasmussen & Svedung (2000) and Svedung &
Rasmussen (2002), and an AcciMap of the F-111 Chemical Exposure accident is shown
in the next section (see Figure 12).
Work Analysis
For each accident scenario, the decision-makers, planners, and actors who have been
involved in the preparation of accidental conditions are identified and represented in
an ActorMap. This map should identify the individuals and groups that are involved in
an adverse event at all relevant levels of society shown in Figure 8. An ActorMap is an
extract of the generic AcciMap showing the involved decision makers, and an
ActorMap in the transport of dangerous goods case study is shown in Rasmussen &
Svedung (2000).
In early 2000, after the health of more than 400 maintenance workers had been
seriously affected, the RAAF finally recognised the problem and the fuel tank repair program
was suspended. This had a negative impact on the availability of F-111 aircraft, which
resulted in a detriment to defence capability.
Initially, the material made available to the F-111 Board of Inquiry (BOI) pointed to
ongoing failings at a managerial level to implement a safe system of work and co-
ordinate processes within a complex organisation. The BOI hence pointed out that if
anybody was to be held accountable, it should be the RAAF itself. The aim of the
investigation, however, was not to assign blame; it was conducted to understand how
the exposure occurred and to make recommendations designed to reduce the chance of
recurrence.
A wide array of causal and contributory factors, occurring over 20 years, combined in
complex ways to affect the health of hundreds of RAAF maintenance workers
(Clarkson et al., 2001). A causal analysis was conducted for the spray seal program,
and a causal diagram was developed based on Rasmussen’s (1997) AcciMap technique.
This analysis is based on the assumption that there is no ultimate cause or causes
responsible for the accident; rather many causal factors contribute to the final outcome,
including latent factors within the organisation as discussed in Reason’s (1997)
organisational accident model.
The causal diagram of the spray seal program (Figure 12) comprises six hierarchical
levels, where the principle employed is that the more remote the cause with respect to
the final outcome, the higher up in the diagram it is located. The diagram is
constructed by starting with the “Health outcomes” (the accident) and asking why it
occurred, which leads to the identification of preceding causes. Counterfactual
reasoning is employed to determine the necessary causal factor, in the sense that, had
this factor been otherwise, the accident probably would not have occurred. The causal
pathways are then determined, proceeding from the bottom of the diagram upwards
(see Figure 12).
At the bottom are the outcomes - damage to the health of Air Force workers leading to
suspension of the spray seal program and consequent reduction in the availability of
F-111 aircraft. Next level up are the immediate causes. Above that are the
organisational causes, to do with the way the Air Force as an organisation functioned.
Above that are shown a number of Air Force values that accounted for many of the
factors at the organisational level. Finally there are two levels, government and society,
both beyond the Air Force organisation and over which the Air Force therefore has no
control.
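The counterfactual test used to build the diagram can be sketched over a small causal graph in Python. The factor names below are abbreviated paraphrases of the kinds of factors discussed in the BOI analysis, and both the graph structure and the assumption that each cause is individually sufficient are simplifications made purely for illustration; they are not taken from Figure 12.

    # A toy causal graph: each factor lists the factors it contributes to, from
    # remote (societal, governmental, organisational) causes down to the outcome.
    contributes_to = {
        "government downsizing":              ["broad span of responsibilities"],
        "broad span of responsibilities":     ["weak chain of command"],
        "can-do culture":                     ["weak chain of command"],
        "weak chain of command":              ["hazardous work practices unnoticed"],
        "hazardous work practices unnoticed": ["health outcomes"],
    }

    def outcome_occurs(graph, outcome, removed=None):
        """Would the outcome still occur if `removed` had not happened?

        In this toy model a factor occurs if it is a root factor, or if at
        least one of its causes occurs (each cause treated as sufficient).
        """
        targets = {t for effects in graph.values() for t in effects}
        roots = [f for f in graph if f not in targets and f != removed]
        occurred, stack = set(), list(roots)
        while stack:
            node = stack.pop()
            if node in occurred or node == removed:
                continue
            occurred.add(node)
            stack.extend(graph.get(node, []))
        return outcome in occurred

    # Counterfactual test: a factor is treated as a necessary causal factor if,
    # had it been otherwise, the outcome would not have occurred.
    for factor in contributes_to:
        necessary = not outcome_occurs(contributes_to, "health outcomes", removed=factor)
        print(f"{factor:38s} necessary causal factor: {necessary}")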
A summary of the main findings and explanations of the various contributory factors and
causal pathways is given in the BOI report (Clarkson et al., 2001: Chap. 11). Here,
the causal paths leading to the failure of the chain of command to operate optimally are
described.
Can-do culture
There was a particular weak link in the chain of command between the senior NCOs
and the junior engineering officers, and there was limited communication between
these two levels. Part of the reason for this was the very broad span of responsibilities
which junior engineering officers were expected to shoulder. This in turn was a
consequence of reductions in staff numbers as part of a general downsizing. Senior
officers, too, were suffering extreme work overload as a result of being expected to
carry out market testing (outsourcing) functions as well as their normal supervisory
functions. The result was that senior officers had relatively little conception of what
was occurring on the hangar floor. These weaknesses at the upper levels of the chain of
command stem fairly directly from government policy decisions lying largely outside
the control of the Air Force.
The causal diagram in Figure 12 is based on the official F-111 Board of Inquiry report
(Clarkson et al., 2001). The causal flow diagram looks at the culture of RAAF as well as
factors that lie beyond the organisational limits of the RAAF. This analysis concludes that
the failure of the chain of command to operate optimally lies predominantly in the
values and culture of the RAAF, in government policies such as government-
initiated cost-cutting and down-sizing of employees, and in social attitudes such as the
focus on air safety driven partly by public pressure.
In this way, the causal analysis serves to identify relevant decision-makers and the
normal work situation in which they influence and condition possible accidents. The
focus is not on the traditional search for identifying the “guilty person”, but on the
identification of those people in the system that can make decisions resulting in
improved risk management and hence to the design of improved system safety.
The most basic concept in STAMP is a constraint, rather than an event. Traditional
accident models explain accident causation in terms of a series of events, while STAMP
views accidents as the result of a lack of constraints (control laws) imposed on the
system design and during operational deployment. Thus, the process that causes
accidents can be understood in terms of flaws in the control loops between system components.
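A minimal illustration of this control-loop view is sketched below in Python. The controller, the process variables and the single constraint are invented and greatly simplified; the sketch illustrates the general idea of a safety constraint left unenforced by a controller with a flawed process model, rather than Leveson's formal STAMP notation.

    from dataclasses import dataclass

    @dataclass
    class ProcessState:
        friendly_aircraft_in_zone: bool
        weapons_released: bool

    def safety_constraint(state: ProcessState) -> bool:
        # Constraint to be enforced on system behaviour:
        # weapons must not be fired at friendly aircraft.
        return not (state.friendly_aircraft_in_zone and state.weapons_released)

    class Controller:
        """A controller issues control actions based on its model of the process."""

        def __init__(self):
            # Flawed process model: the controller believes the aircraft are hostile.
            self.believes_aircraft_hostile = True

        def decide(self, state: ProcessState) -> ProcessState:
            # The control action follows the controller's model, not the actual
            # process state; an inaccurate model leads to an inadequate control
            # action even though no individual component has "failed".
            if self.believes_aircraft_hostile:
                state.weapons_released = True
            return state

    state = ProcessState(friendly_aircraft_in_zone=True, weapons_released=False)
    state = Controller().decide(state)
    print("safety constraint enforced:", safety_constraint(state))   # False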
Here we provide a summary of the STAMP analysis of the Black Hawk fratricide
in 1994 during Operation Provide Comfort in Iraq, which is described in detail in
Leveson et al. (2002) and Leveson (2002).
The hierarchical control structure of the Black Hawk accident is shown in Figure 14,
starting from the Joint Chiefs of Staff down to the aircraft involved in the accident. At
the lowest level in the control structure are the pilots who directly controlled the
aircraft (operator at the sharp end).
Figure 14: Hierarchical Control Structure in the Iraqi No-Fly Zone (Leveson, 2002)
The AWACS mission crew was responsible for tracking and controlling aircraft. The
AWACS also carried an Airborne Command Element (ACE), who was responsible for
ensuring that the larger OPC mission was completed. The ACE reported to a
ground-based Mission Director. The Army headquarters (Military Coordination
Center) Commander controlled the U.S. Black Hawk operations while the Combined
Forces Air Component (CFAC) Commander was responsible for the conduct of OPC
missions. The CFAC Commander had tactical control over all aircraft flying in the No
Fly Zone (NFZ) including both Air Force fighters and Army helicopters, but had
operational control only over the Air Force fixed-wing aircraft.
In addition to the formal control channels, there were also communication channels,
shown in Figure 14 as dashed lines, between the process components at each level of
the hierarchy.
The hierarchical control structure (Figure 14) is then analysed to identify the safety
constraints at each level in the hierarchy and the reasons for the flawed control. Using
the general classification in Figure 13, Leveson (2002) describes the analysis at each of
the levels in the Hierarchical Control Structure. For example, at the Physical Process
Level (see Figure 15), the safety constraint required that weapons must not be fired at
friendly aircraft. All the physical components worked exactly as intended, except
perhaps for the IFF (Identify Friend or Foe) system, which gave an intermittent
response (this has never been completely explained). There were, however, several
dysfunctional interactions between the system components:
• The Black Hawks and F-15s were on different radio frequencies and thus could
not communicate or hear the radio transmission between the two F-15 pilots
and between the lead F-15 pilot and the AWACS.
• The F-15 aircraft were equipped with the latest anti-jamming HAVE-QUICK II
radios while the Army helicopters were not. The F-15 pilots could have
switched to non-HAVE QUICK mode to enable communication with the Black
Hawk pilots; however, the procedures given to the F-15 pilots did not contain
this instruction.
• The Black Hawks were not squawking the required IFF code for flying within
the NFZ, and this was concluded to be the reason that the F-15s received no response to
their Mode IV IFF query. However, according to an Air Force analysis of the IFF
system, the F-15s should have received a Mode IV response regardless of the
code squawked by the targets; this contradiction has never been explained.
Figure 15: Physical Process Level: Classification and Analysis of Flawed Control
A major reason for these dysfunctional interactions was the use of
advanced technology by the Air Force, which was incompatible with the Army radios
in the Black Hawks. The hilly terrain also contributed to the interference in the line-of-
sight transmissions.
However, it is also important to analyse the safety constraints and flawed control at the
higher levels in the hierarchical control structure to obtain a system-wide
understanding of the contributory causal factors. Leveson (2002) conducted a detailed
analysis at each of the other levels in the Hierarchical Control Structure, namely, The
Pilots Level, ACE and Mission Director Level, AWACS Control Level, CFAC and MCC
Level, CTF Level, and the National Command Authority and Commander-in-Chief
Europe levels.
The following four causes have been generally accepted by the military community as
the explanation for the shootdown (AAIB, 1994):
While there certainly were mistakes made at the pilot and AWACS levels, as identified
by the special Air Force Task Force, and the four factors identified by the accident
report were involved in the accident, the STAMP analysis (Leveson, 2002)
provides a much more complete explanation, including:
Leveson identifies the lack of coordination and communication arising from
organisational factors at the highest levels of command as a key accident factor, which
led to the failures at the lower technical and operational levels.
Using the traditional accident models based on event chains would have resulted in
focusing attention on the proximate events of this accident and on the identification of
the humans at the sharp end such as the pilots and the AWACS personnel. The STAMP
method clearly identifies other organisational factors and actors and the role they
played.
Clarke & Wing (1996) provide a survey of formal methods and tools for specifying and
verifying complex hardware and software systems. They assess the application of
formal methods in industry and describe some successful case studies, such as the
formal specification of IBM’s Customer Information Control System (CICS) and an
on-line transaction processing system in the Z language, and the formal requirements
specification for the Traffic Collision Avoidance System (TCAS II) using the
Requirements State Machine Language (RSML). Model checking and theorem proving
are two well-established approaches for formal verification. Clarke & Wing describe
some notable examples of the successful application of these techniques and associated
tools in industry.
Typically formal methods have been applied to various software development phases
such as requirements analysis, specification, design and implementation (Bjørner &
Druffel, 1990); they are currently mainly used for stabilising requirements and re-
engineering existing systems (Gaudel, 1994; Wildman, 2002).
Wildman (2002) describes the use of formal specification techniques to reformulate the
requirements of the Nulka Electronic Decoy. The Nulka Electronic Decoy is a joint
Australian/US project to counter anti-ship missiles. The requirements specification
contained informal natural language requirements relating both to time-related
behaviour and to functional behaviour of the system.
The formal analysis of the NULKA PIDS consisted of translating the original informal
requirements into the Interval Calculus, and the resultant formal requirements were
then manually checked for critical properties, namely, consistency, correctness,
precision, and abstraction. The results of this application have demonstrated the
usefulness of mathematical modelling of the English language specification and its
subsequent reverse engineering back into English, which provided an accurate, readable,
and clear understanding of the natural language specification.
The tremendous potential of formal methods has been recognised by theoreticians for a
long time. There are comprehensive accounts of experience on the use of formal
methods in industry and research (see for instance: Butler et al., 2002; Hinchey &
Bowen, 1995). A comprehensive database of industrial and space applications is
available at the Formal Methods Europe applications database (FME, 2004).
There is a variety of formal methods which support the rigorous specification, design
and verification of computer systems, for example, COLD, Circal Process Algebra,
Estelle, Esterel, LOTOS, Petri Nets, RAISE, SDL, VDM and Z (see for example: FMVL,
2007). Lindsay (1998) provides a tutorial example to illustrate the use of formal
methods for system and software development. The example concerns part of a
simplified Air Traffic Control system, using the Z notation and Cogito methodology
for modelling, specification, validation and design verification.
Formal languages and methods are frequently applied to gain high confidence in the
accuracy of information in the design of safety-critical systems (Hinchey & Bowen,
1995). For example, the Federal Aviation Administration’s air traffic collision
avoidance system (TCAS II) was specified in the formal language, RSML
(Requirements State Machine Language), when it was discovered that a natural
language specification could not cope with the complexity of the system (Leveson et
al., 1994). Haveland & Lowry (2001) discuss an application of the finite state model
checker SPIN to formally analyse a software-based multi-threaded plan execution
module programmed in LISP, which is one component of NASA’s Remote Agent, an
artificial intelligence-based spacecraft control system. A total of five previously
undiscovered concurrency errors were identified in the formal model; each represented
an error in the LISP code. In other words, the errors found were real and not only
errors in the model. The formal verification effort had a major impact: locating errors
that would probably not have been located otherwise and identifying a major design
flaw. Formal approaches to development are particularly justified for systems that are
complex, concurrent, quality-critical, safety and security-critical.
Formal methods presently do not scale up to the modelling and verification of large
complex systems. Furthermore, there is a need for further information on the practical
application of formal methods in industry to assist in the procurement, management,
design, testing and certification of safety and security critical systems. The formal
methods group of European Workshop on Industrial Computer Systems (EWICS) have
released guidelines on the use of Formal Methods in the Development and Assurance
of High Integrity Systems (Anderson et al., 1998a; 1998b). These guidelines provide
practical advice for those wishing to use or evaluate formal methods in an industrial
environment. The employment of formal methods does not a priori guarantee
correctness; however, they can enhance our understanding of a system by revealing
inconsistencies, ambiguities, and incompletenesses that might otherwise go undetected
(Clarke & Wing, 1996). Thus, the main benefits can be seen as achieving a high degree
of confidence in the correctness and completeness of specifications and a high degree
of assurance that the design satisfies the system specification.
Ensuring the quality of accident reports should be a high priority for organisations as
they have a moral responsibility to prevent accident recurrence. They also have a
financial responsibility to their investors; accident recurrence carries the possibility of
damaging litigation and loss of customer confidence (Burns, 2000). However, the
structure, content, quality, and effectiveness of accident reports have been much
criticised (e.g., Burns et al., 1997; Ladkin & Loer, 1998). A large number of accident
investigation reports do not accurately reflect the events, or are unable to identify
critical causal factors, and sometimes conclude with incorrect causes of the accident.
Omissions, ambiguities, or inaccurate information in a report can lead to unsafe system
designs and misdirected legislation (Leveson, 1995). Thus, there is a critical need to
improve the accuracy of the information found in conventional accident investigation
reports.
Burns (2000) provides an overview of aspects of natural language accident reports that
inhibit the accurate communication of the report contents:
• Size: The sheer size of accident reports makes it difficult for the reader to
absorb all the salient points; a great deal of information can be forgotten, lost
track of, or simply missed. The size also increases the chances of syntactic and
semantic errors, ambiguities, and omissions in the transcription of the report.
• Structure: Sections of the report cannot be read in isolation, and thus the reader
needs to read the full document to comprehend the information provided.
During the last decade in particular, many attempts have been made to use formal methods
for building mathematically-based models to conduct accident analysis. A comprehensive
survey on the application of various formal logics and techniques to model and reason
about accident causation is given by Johnson & Holloway (2003a) and Johnson (2003).
They discuss the weakness of classical (propositional) logic in capturing the different
forms of causal reasoning that are used in accident analysis. In addition, the social and
political aspects in accident analysis cannot easily be reconciled with the classical logic-
based approach. Johnson & Holloway argue that the traditional theorem proving
mechanisms cannot accurately capture the wealth of inductive, deductive and
statistical forms of inference that investigators routinely use in their analysis of adverse
events.
Thomas (1994) used first-order logic to formalise the software code known to be a
source of error in the Therac-25 radiation therapy machine (Leveson, 1993). The automated
theorem prover LP (Larch Prover) was employed to reason about the behaviour of the
code, which helped identify the underlying cause of the unexpected behaviour of the
code that contributed to the accident. This approach assisted in correcting the software
error and in providing rigorous evidence (via formal proofs) that the modified
software executed according to the expected/specified behaviour.
Fields et al. (1995) employed CSP Process Algebra (Hoare, 1985) to formally specify
both the tasks of the human operators and the behaviour of the system. The
performance model contributed to the analysis of human error in system failures by
identifying the sequence of actions (erroneous traces) related to the failure modes of
the operator.
Petri nets have been successfully used for dynamic modelling of parallel and
concurrent systems with time constraints in a wide range of applications including
safety-critical systems. Vernez et al. (2003) provide a review of the current uses of Petri
nets in the fields of risk analysis and accident modelling. They demonstrate that Petri
nets can explicitly model the complex cause-to-consequence relationships between
events. Vernez et al. provide a translation of key safety concepts onto the Petri net
formalism and suggest that this can facilitate the development of accident models.
They investigate the modelling capability of Coloured Petri Nets (CPN) to predict
possible accident scenarios in the Swiss metro, a high-speed underground train
planned for interurban linking in Switzerland. Relevant actors, events and causal
relationships were translated into the CPN formalism, and the Design/CPN tool and
the state space method were employed to analyse the states or accident scenarios (a
succession of possible system states) generated in the occurrence graph. Vernez et al.
argue that the results obtained in the CPN modelling and analyses of accident
processes are realistic as compared to both previous tunnel accidents and tunnel safety
principles.
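As a minimal illustration of the formalism (an ordinary place/transition net sketched in Python, not the Coloured Petri Nets or the Design/CPN tool used by Vernez et al.), the fragment below lets a "fire" transition occur only when both a fuel-leak place and an ignition-source place hold a token; the places, transitions and scenario are assumptions made for illustration.

```python
# Minimal place/transition Petri net sketch. Places, transitions and the
# accident scenario are illustrative assumptions only.

marking = {                 # tokens currently held in each place
    "fuel_leak": 1,
    "ignition_source": 1,
    "fire": 0,
    "alarm_raised": 0,
}

transitions = {             # name: (input places, output places)
    "ignite":      (["fuel_leak", "ignition_source"], ["fire"]),
    "detect_fire": (["fire"], ["alarm_raised"]),
}

def enabled(name):
    """A transition is enabled when every input place holds at least one token."""
    inputs, _ = transitions[name]
    return all(marking[p] >= 1 for p in inputs)

def fire_transition(name):
    """Consume one token from each input place and add one to each output place."""
    inputs, outputs = transitions[name]
    if not enabled(name):
        raise RuntimeError(f"transition {name!r} is not enabled")
    for p in inputs:
        marking[p] -= 1
    for p in outputs:
        marking[p] += 1

if __name__ == "__main__":
    for t in ["ignite", "detect_fire"]:
        fire_transition(t)
        print(f"after {t}: {marking}")
```

The causal structure of the scenario is carried by the net itself: the fire transition cannot occur unless both of its preconditions hold, which is the kind of explicit cause-to-consequence relationship the Petri net formalism is intended to capture.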
Burns et al. (1997) applied a Sorted First Order Logic (SOFAL) to specify and reason
about the human contribution to major accidents. SOFAL has the advantage over other
formalisms, such as first-order logic and Petri nets, that it explicitly specifies agents
(people, operators) as distinct from other system objects (such as inanimate
objects). This feature supports accident analysis by focusing attention on those
objects which directly affect the system behaviour. SOFAL can also support reasoning
over temporal aspects of the system behaviour (Burns et al., 1997), for example it can
demonstrate that there exists a sequence of actions which, when performed, will lead
to a scenario where an accident occurs.
Deontic logics were developed for reasoning about norms in complex organisational
and procedural structures within a system, in particular, to reason about notions in
ethics and philosophy of laws (Wieringa & Meyer, 1994). Deontic logic can also express
normative (e.g. legal) and non-normative (e.g. illegal, non-permitted) behaviour, which
can be used for modelling and reasoning about ideal (normative) and non-ideal (non-
normative) or actual system behaviour which is commonly found in accidents (Burns,
2000). The concept of non-normative scenarios is important in accidents as it can be
used to model non-ideal, non-permitted or illegal behaviour (such as smoking in
a “non-smoking” zone) which, if it occurs, may lead to an accident. Burns (2000) describes
an Extended Deontic Action Logic (EDAL) language for formally modelling accident
reports, which is an extension of Deontic Action Logic (Khosla, 1988). EDAL models
both the prescribed (expected) and the actual behaviour of the system and the
relationship between the two; this facilitates an analysis of the conflict between the two
behaviours and thus can greatly assist in understanding the causal factors in an
accident. Using the Channel Tunnel fire accident report as a case study, Burns (2000)
developed a formal model in EDAL and demonstrated that this approach can be used
to reason about qualitative failure, errors of omission and commission, and
prescriptive failures. For example, EDAL enabled the specification and analysis of
where, how, and by whom, norms were broken within the system. Burns has
demonstrated that constructing and reasoning about formal accident report models
highlights problems in the accident report.
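A rough sketch of the underlying idea of contrasting normative and actual behaviour is given below; it is ordinary Python, not EDAL, and the agents, obligations and recorded actions are hypothetical rather than taken from the Channel Tunnel report.

```python
# Hypothetical sketch of comparing prescribed (normative) and actual behaviour,
# loosely in the spirit of deontic modelling. Agents, actions and norms are
# illustrative assumptions only.

prescribed = {                # what each agent is obliged to do
    "driver":  {"report_fire", "stop_train"},
    "control": {"raise_alarm"},
}
forbidden = {                 # what each agent is not permitted to do
    "passenger": {"smoke_in_non_smoking_zone"},
}
actual = {                    # what was actually done, e.g. as recorded in a report
    "driver":    {"report_fire"},
    "control":   set(),
    "passenger": {"smoke_in_non_smoking_zone"},
}

def violations():
    """List omissions (obliged but not done) and commissions (forbidden but done)."""
    found = []
    for agent, duties in prescribed.items():
        for act in duties - actual.get(agent, set()):
            found.append((agent, act, "omission: obliged action not performed"))
    for agent, banned in forbidden.items():
        for act in banned & actual.get(agent, set()):
            found.append((agent, act, "commission: non-permitted action performed"))
    return found

if __name__ == "__main__":
    for agent, act, kind in violations():
        print(f"{agent}: {act} ({kind})")
```

The value of the deontic treatment lies in making the gap between the two behaviours explicit, so that questions of where, how and by whom norms were broken can be posed systematically.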
While the focus of EDAL has been on the deontic modalities in an accident, other formal
modelling techniques, such as Why-Because Analysis (Ladkin & Loer, 1998), have
considered further aspects of accidents, such as epistemic and real-time behaviour.
The accident modelling approaches discussed so far are based on deterministic models
of causality. These models focus on the identification of deterministic sequences of
cause and effect relationships, which are difficult to validate (Johnson, 2000). For
example, it cannot be guaranteed that a set of effects will be produced even if necessary
and sufficient conditions can be demonstrated to hold at a particular moment. Johnson
argues that the focus should be on those conditions that make effects more likely
within a given context, and examines the application of probabilistic models of
causality to support accident analysis. Probabilistic causation designates a group of
philosophical theories that aim to characterise the relationship between cause and
effect using the tools of probability theory (Hitchcock, 2002). The central idea
underlying these theories is that causes raise the probabilities of their effects. Johnson
proposes an approach for the causal analysis of adverse accidents that is based on the
integration of deterministic and probabilistic models of causality.
The use of conditional probabilities has some significant benefits for accident analysis
(Johnson, 2000); for example, in the Nanticoke fire we need to know the probability of
ignition from each source (indicator taps, exposed manifolds) given the fuel leak
characteristics. Johnson & Holloway (2003a) discuss the use of Bayesian logic (which
exploits conditional probabilities) for accident analysis, as an example of reasoning about
the manner in which the observation of evidence affects our belief in a causal hypothesis.
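The sketch below shows the arithmetic involved in such an update, using Bayes' rule; the hypothesis, the prior and the likelihoods are invented numbers for illustration, not figures from any accident investigation.

```python
# Bayes' rule applied to a causal hypothesis, as a minimal sketch of the kind of
# conditional-probability reasoning discussed by Johnson & Holloway. The prior
# and likelihoods below are invented numbers.

def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ~H) P(~H)]"""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / denominator

if __name__ == "__main__":
    # H: "this particular source ignited the leak"; E: the observed burn pattern.
    prior = 0.30                      # belief in H before seeing the evidence
    p_e_h, p_e_not_h = 0.80, 0.20     # how likely the evidence is under H and under ~H
    print(f"P(H | E) = {posterior(prior, p_e_h, p_e_not_h):.2f}")   # prints 0.63
```

The point of the calculation is simply that observing evidence which is much more likely under the causal hypothesis than under its negation raises our belief in that hypothesis, which is the behaviour the probabilistic theories of causation seek to formalise.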
The probabilistic theory of causality has been developed in slightly different ways by
many authors. Hitchcock (2002) reviews these developments and
discusses the issues in, and criticisms of, the probabilistic theories of causation. Here, we
discuss the mathematical theory of causality developed by Pearl (2000), a
structural-model approach that evolved from the area of Bayesian networks. The main idea
behind the structure-based causal models is that the world is modelled by random
variables, which may have causal influence on each other (Eiter & Lukasiewicz, 2001).
The variables are divided into exogenous variables, which are influenced by factors
outside the model, and endogenous variables, which are influenced by exogenous and
endogenous variables. This latter influence is expressed through functional
relationships (described by structural equations) between them.
Pearl (2000) defines a causal model as a triple M = (U, V, F), where:
(i) U is a set of background variables (also called exogenous) that are determined
by factors outside the model;
(ii) V is a set {V1, …, Vn} of variables, called endogenous, that are determined by
variables in the model, that is, by variables in U ∪ V; and
(iii) F is a set of functions {f1, f2, …, fn} such that each fi is a mapping from (the
respective domains of) U ∪ (V \ Vi) to Vi, and such that the entire set F forms
a mapping from U to V. In other words, each fi tells us the value of Vi
given the values of all other variables in U ∪ V, and the entire set F has a
unique solution V(u). Symbolically, the set of functions F can be
represented by writing vi = fi(pai, ui), i = 1, …, n,
where pai is any realisation of the unique minimal set of variables PAi in
V \ Vi (connoting parents) sufficient for representing fi. Likewise, Ui ⊆ U
stands for the unique minimal set of variables in U sufficient for
representing fi.
Pearl (2000) uses the structural causal model semantics and defines a probabilistic
causal model as a pair (M, P(u)) where M is a causal model and P(u) is a probability
function defined over the domain of the background variables U.
Pearl (2000) has also demonstrated how counterfactual queries, both deterministic and
probabilistic, can be answered formally using structural model semantics. He also
compares the structural models with other models of causality and counterfactuals,
most notably those based on Lewis’s closest-world semantics.
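A minimal sketch of these ideas is given below: background (exogenous) variables determine endogenous variables through structural equations, and a counterfactual is evaluated by overriding one equation (an intervention) while keeping the background fixed. The variables and equations are illustrative assumptions, not an example taken from Pearl (2000).

```python
# Minimal sketch of a structural causal model: exogenous variables U, endogenous
# variables V, and functions F determining each endogenous variable. The
# variables (leak, spark, fire) and the equations are illustrative assumptions.

def evaluate(u, interventions=None):
    """Solve the structural equations given background variables u.
    `interventions` overrides selected endogenous variables (an intervention)."""
    interventions = interventions or {}
    v = {}
    # F: each endogenous variable as a function of its parents
    v["leak"]  = interventions.get("leak",  u["corroded_pipe"])
    v["spark"] = interventions.get("spark", u["faulty_switch"])
    v["fire"]  = interventions.get("fire",  int(v["leak"] and v["spark"]))
    return v

if __name__ == "__main__":
    u = {"corroded_pipe": 1, "faulty_switch": 1}             # the actual background situation
    actual = evaluate(u)
    counterfactual = evaluate(u, interventions={"leak": 0})  # "had there been no leak"
    print("actual fire:        ", actual["fire"])            # 1
    print("counterfactual fire:", counterfactual["fire"])    # 0: the leak was a cause
```

Holding the background fixed while forcing a variable to a different value is what distinguishes a counterfactual query from simple conditioning, and it is this distinction that the structural-model semantics makes precise.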
A number of research groups are investigating the use, extension and development of
formal languages and methods for accident modelling and analysis, such as the
Glasgow Accident Analysis Group (GAAG, 2006) and the NASA Langley formal
methods research program on accident analysis (LaRC, 2004). The research program at
NASA Langley is investigating the suitability of using one or more existing
mathematical representations of causality as the basis for developing accident analysis tools.
Formal methods have been applied successfully to the design and verification of
safety-critical systems; however, they need to be extended to capture the many factors
and aspects that are found in accidents and accident reports. A single modelling
language is unlikely to model all the factors and aspects in an accident (Burns, 2000).
Formal methods also have limitations in scaling up to model complete sociotechnical
systems; they require specialists in mathematics, and not everything can be formalised.
Why-Because Analysis (WBA) is a method for the failure analysis of complex, open,
heterogeneous systems (Ladkin, 1999). The adjective “open” means that the behaviour
of the system is highly affected by its environment, and “heterogeneous” refers to a
system comprised of a group of closely connected components that are not alike, such
as digital, physical, human and procedural components, which are all supposed to
work together. For example, modern aviation operations have all of these components
and thus form a complex, open, heterogeneous system.
The investigation of failures of complex systems is a wide field of practical interest that
has traditionally not been carried out with any significant use of formal methods.
Ladkin & Loer (1998) developed the formal Why-Because Analysis (WBA) method,
which enables one to develop, and then formally prove, the correctness and relative
sufficiency of causal explanations of system failures.
In general, the term “cause” is not well defined and there is little consensus on what
constitutes a cause. One philosophical approach to causation views counterfactual
dependence as the key to the explanation of causal facts: for example, events c (the
cause) and e (the effect) both occur, but had c not occurred, e would not have occurred
either (Collins et al., 2004). The term ‘‘counterfactual’’ or ‘‘contrary-to-fact’’ conditional
carries the suggestion that the antecedent of such a conditional is false.
Ladkin & Loer (1998) introduce notation and inference rules which allow them to
express the Lewis criterion for counterfactuals in the form shown in Figure 16. This logic
provides semantics for informal concepts such as “cause” that are used to explain the
causal-factor relation between facts A and B.
Inference Rule:
A ∧ B
¬A □→ ¬B
∴ A causes B
If we know that A and B occurred and that if A had not occurred then B
would not have occurred then we can conclude that A causes B.
Figure 16: WBA formal notations and rules for causal relation
Lewis’s semantics for causation in terms of counterfactuals, and the combination of
Lamport’s Temporal Logic (Lamport, 1994) and other logics into a formal logic called
Explanatory Logic, form the basis of the formal method WBA. WBA is based around
two complementary stages:
1) Construction of the WB-Graph; and
2) Formal Proof of Correctness of the WB-Graph
The WB-Graph is subjected to a rigorous proof to verify that: the causal relations in the
graph are correct, that is they satisfy the semantics of causation defined by Lewis; and
there is a sufficient causal explanation for each identified fact that is not itself a root
cause. The formal logics employed in the WBA formal proof method are shown in
Figure 17. A detailed development of the formal proof of correctness and the EL logic
is described in Ladkin & Loer (1998).
The WBA method has been used for analysing a fairly large number of accident
reports, mainly for aircraft accidents. In the Lufthansa A320 accident in Warsaw, the
logic of the braking system was considered the main cause of the accident. The
accident report contained facts that were significantly causally related to the accident.
However, these facts were not identified in the list of “probable cause/contributing
factors” of the accident report.
[0] accident
/\ [1] death of 1st person
/\ [2] death of 2nd person
/\ [3] damage to AC
Figure 18: Extract of Textual Form of WB-Graph from the Warsaw Accident
(Höhl & Ladkin, 1997)
Höhl & Ladkin (1997) analysed the text of the accident report and identified the
relevant states and events concerning the accident. The events and states were used to
prepare a textual version of the WB-Graph with path numbering (Figure 18). The WB-
Graph commences from the accident event (node 0), and proceeds via a backwards-
chronological search investigating which events and states were causal factors. This
search continues with reasoning about why each subsequent event occurred until a
source event or state is reached. A source node has no incoming links, i.e. it has no
significant causal factors and is considered an original source of a sequence of
events; examples are nodes 3.1.2.1, 3.1.1.3.2.2 and 3.1.1.3.1.1.1.3 in Figure 19. The information
and path numbering in the textual version is then used to draw the WB-Graph (Figure
19). The WB-Graph can be used to answer questions such as: why did event X
happen? The event X happened because of events A, B and C; the “because” part
shows the conjunction of explanations (events A, B and C) for why event X happened. The
causal graph grows by investigating why the next event, such as X.A, happened, and
explaining that X.A occurred because of events D and F. Therefore, the event 3.1,
Aircraft hits earth bank, occurred because of event 3.1.1, Aircraft overruns runway, and
state 3.1.2, earth bank in overrun path. The state 3.1.2 occurred because of the source
node 3.1.2.1, earth bank was built by airport authority for radio equipment.
[Figure 19 (a graphical WB-Graph extract) is not reproduced here. Its nodes include: 3.1 AC hits earth bank; 3.1.1 AC overruns RWY; 3.1.2 earth bank in overrun path; 3.1.2.1 built by airport authority for radio equipment; unstabilised approach; CRW’s actions; braking delayed; wheel braking delayed; speed brakes and thrust reverser deployment delayed; aquaplaning; braking system’s logical design; low weight on each main gear wheel; RWY very wet.]
Figure 19: WB-Graph Extract of the Warsaw Accident (Höhl & Ladkin, 1997)
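As an indication of how mechanically the why-question can be asked of such a graph once it is recorded, the sketch below encodes a few of the Warsaw nodes quoted above as a mapping from each event to its causal factors and walks backwards from the top event; the encoding is partial and for illustration only, and is not Höhl & Ladkin's tool support.

```python
# Sketch of a WB-Graph encoded as "event -> its causal factors", using a few of
# the Warsaw nodes quoted above. The encoding is partial and illustrative only.
# In the complete graph, nodes with no recorded causal factors are the source
# nodes, i.e. the candidate original causes.

wb_graph = {
    "3.1 AC hits earth bank": [
        "3.1.1 AC overruns RWY",
        "3.1.2 earth bank in overrun path",
    ],
    "3.1.2 earth bank in overrun path": [
        "3.1.2.1 earth bank built by airport authority for radio equipment",
    ],
    # factors explaining the runway overrun are omitted in this partial sketch
}

def why(node, depth=0):
    """Backwards walk: print each node followed, indented, by the factors that explain it."""
    print("  " * depth + node)
    for factor in wb_graph.get(node, []):
        why(factor, depth + 1)

if __name__ == "__main__":
    why("3.1 AC hits earth bank")
```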
The rigorous reasoning employed in the WBA-Method enabled Höhl & Ladkin (1997)
to identify two fundamental causes (source nodes in the WB-graph) that occurred in
the accident report but were omitted as “probable cause” or “contributing factors”: the
position of the earth bank (node 3.1.2.1), and the runway surfacing (node
3.1.1.3.1.1.1.3). Once the position of the earth bank was identified as an original causal
factor, it could be concluded that had the bank not been where it was, the accident would
not have happened. Also, had the condition of the runway surface been otherwise,
the wheel braking systems could have functioned earlier and the collision with
the bank perhaps avoided. The rigorous reasoning in the WB-Graph enabled the
recommendation of appropriate preventative strategies, e.g. removal of the earth bank to
provide a free overrun area, to mitigate the occurrence of similar accidents in the future.
Thus the WB-Graph helped to identify logical mistakes in the accident report. This
example has illustrated how the WB-method renders reasoning rigorous, and enables
the true original causal factors to be identified from amongst all the causally-relevant
states and events.
Twenty-six people were killed by friendly fire during peace-keeping operations after the Gulf
War on April 14, 1994, when two U.S. Air Force F-15 fighters shot down two U.S. Army
Black Hawk helicopters in the no-fly zone over northern Iraq (AAIB, 1994; GAO, 1997).
The major reasons for this fratricide are attributed to multiple coordination failures at
the individual, group and organisational levels in a complex command and control
structure (see Figure 14). It is interesting to note that there were no notable technical
failures; indeed, the failure of the IFF system in the F-15s to receive the Black Hawks’
identification code remains unexplained.
Snook (2000) employed social and organisational theories to explain the accidental
shootdown of the two Black Hawk helicopters. He developed a timeline of significant
events and a complex Causal Map of the incident. Ladkin & Stuphorn (2003) conducted
a Why-Because Analysis of facts as presented in the Executive Summary of the U.S.A.F.
Aircraft Accident Investigation Board report (AAIB, 1994), and compared their analysis
with Snook's Causal Map.
Figure 20: Partial Time Line of Significant Events (Ladkin & Stuphorn, 2003)
A timeline in which actors are represented along with the times of the events in which
they participated is generally considered useful. Ladkin & Stuphorn identified a number
of ambiguities in the method used by Snook to develop the timeline of significant
events. They constructed a single vertical timeline of all events, and annotated the
events with the actors participating in this event, as shown in Figure 20. Thin columns
lying to the right of the time line represent the actors, and a mark (a cross) in a column
by an event indicates that the corresponding actor participated in that event. Use of a
vertical timeline with columns for actor participation easily accommodates a greater
number of actors than appears visually feasible in Snook's representation.
Figure 21: A Partial List of Facts (Adapted from: Ladkin & Stuphorn, 2003)
Ladkin & Stuphorn (2003) derived the List of Facts (a partial list is shown in Figure 21)
directly from the AAIB report (AAIB, 1994). This list of Facts differs considerably from
that of Snook, and Ladkin & Stuphorn argue that the nodes in Snook’s Causal Map do
not appear to correspond to the facts in the AAIB report. They conducted a
methodological check of Snook’s Causal Map by checking the relations of the nodes to
each other using the Counterfactual Test. Ladkin & Stuphorn concluded that one-
quarter of the causal connections proposed by Snook were not correct since they did
not pass the Counterfactual Test.
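The check itself is simple to mechanise once the analyst's counterfactual judgements are recorded, as the sketch below indicates; the proposed links and the judgements shown are hypothetical and are not taken from Snook's Causal Map or the AAIB report.

```python
# Sketch of the methodological check described above: apply the Counterfactual
# Test to each proposed causal link and report those that fail. The links and
# the analyst's judgements below are hypothetical examples only.

proposed_links = [
    # (cause, effect, "had the cause not occurred, would the effect not have occurred?")
    ("proposed cause A", "proposed effect X", True),
    ("proposed cause B", "proposed effect Y", False),
]

def failed_counterfactual_test(links):
    """Return the links whose recorded counterfactual judgement does not hold."""
    return [(a, b) for a, b, holds in links if not holds]

if __name__ == "__main__":
    failures = failed_counterfactual_test(proposed_links)
    print(f"{len(failures)} of {len(proposed_links)} proposed links fail the test")
    for a, b in failures:
        print(f"  not a necessary causal factor: {a} -> {b}")
```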
A WB-Graph was constructed from the List of Facts, and the Counterfactual Test was
applied to determine the necessary causal factor relation amongst them (Ladkin &
Stuphorn, 2003). The complete WB-Graph that was produced is quite hard to read, and
has been split into three parts, the top part, the middle section, and the lower part; the
top part of the WB-graphs is reproduced in Figure 22 showing the links to the two
lower parts. The WB-Graphs illustrate the accuracy of the causal explanation and the
advantages of applying a methodological approach such as WBA to the task of
determining causality.
Figure 22: The AAIB WB-Graph, Top Part (Ladkin & Stuphorn, 2003)
A number of studies on aviation and maritime accidents have shown human and
organisational factors to be major contributors to accidents and incidents. Johnson &
Holloway (2007) analysed major aviation and maritime accidents in North America
during 1996-2006, and concluded that the proportion of causal and contributory factors
related to organisational issues exceeds that due to human error. For example, the
combined causal and contributory factors of aviation accidents in the USA showed:
48% related to organisational factors, 37% to human factors, 12% to equipment and 3%
to other causes; and the analysis of maritime accidents classified the causal and
contributory factors as: 53% due to organisational factors, 24-29% as human error, 10-
19% to equipment failures, and 2-4% as other causes.
Hopkins (2000) examined, from a cultural and organisational perspective, the findings
of the Royal Commission into the Esso gas plant explosion at Longford, Victoria, in
September 1998. This accident killed two workers, injured eight
others and cut Melbourne’s gas supply for two weeks. Hopkins argues that the
accident’s major contributory factors were related to a series of organisational failures:
the failure to respond to clear warning signs, communication problems, lack of
attention to major hazards, superficial auditing and, a failure to learn from previous
experience. The Royal Commission invited Hopkins to appear as an expert witness
at the Longford inquiry; he remarked, with some astonishment: “It is
most unusual in this country for a sociologist to be called as an expert witness in a
disaster or coronial inquiry, but in accepting my evidence the Commission was
acknowledging the value of the sociological approach to its inquiry” (Hopkins, 2000:
Preface).
Hopkins was a member of the Board of Inquiry into the F-111 chemical exposure of
RAAF maintenance workers (Clarkson et al., 2001). He identified many cultural and
organisational causes of this incident, and employed the AcciMap technique to
produce a diagram identifying the network of causes that contributed to the damage
done to the health of the Air Force workers (see Figure 12). Hopkins (2005) discusses
various aspects of the Air Force culture and identified several fundamental values
which contributed to the incapacity of the Air Force to recognise and respond to what
was happening to its fuel tank workers. This emphasises the significance of
organisational factors and their influence on safety in the workplace (see also Blackman
et al., 2000).
NASA’s Space Shuttle Challenger disintegrated in a ball of fire 73 seconds after launch
on 28 January 1986. The Rogers Commission Report (1986) on the Space Shuttle
Challenger Accident identified the cause of the disaster: the O-rings that seal the Solid
Rocket Booster joints failed to seal, allowing hot gases at ignition to erode the O-rings
and penetrate the wall of the booster, which ultimately destroyed Challenger and its crew. The
Commission also discovered an organisational failure in NASA. In a midnight hour
teleconference on the eve of the Challenger launch, NASA managers had proceeded
with launch despite the objections of contractor engineers who were concerned about
the effect of predicted cold temperatures on the rubber-like O-rings. Further, the
investigation discovered that NASA managers had suppressed information about the
teleconference controversy, violating rules about passing information to their
superiors; NASA had been incurring O-ring damage on shuttle missions for years. The
Rogers Commission Report also identified “flawed decision making” as a contributing
cause of the accident, in addition to other causal factors such as production and
schedule pressures, and violation of internal rules and procedures in order to launch
on time.
Vaughan's (1996) analysis traced the flawed decision making to three organisational
elements:
– An enacted work group culture, that is, how culture is created as people interact
in work groups;
– A culture of production built from occupational, organisational, and
institutional influences; and
– A structure-induced dispersion of data that made information more like a body
of secrets than a body of knowledge, which silenced people.
These elements had shaped shuttle decision making for over a decade. What was
unique in this particular situation was that this was the first time all three influences
came together simultaneously across multiple levels of authority and were focused on
a single decision to meet the Challenger launch deadline.
The physical cause of the loss of Columbia and its crew was a breach in the Thermal
Protection System on the leading edge of the left wing, caused by a piece of insulating
foam which struck the wing (CAIB, 2003). The foam impact caused a crack in the wing
that allowed superheated gas to penetrate through the leading edge insulation and
progressively melt the aluminium structure of the left wing, resulting in the
disintegration of the Orbiter during re-entry on 1st February 2003.
The Columbia Accident Investigation Board reviewed the contemporary social science
literature on accidents and invited experts in sociology and organisational theory.
These experts examined NASA’s organisational, historical and cultural factors and
provided insights into how these factors contributed to the accident (CAIB, 2003). In
the Board’s view, NASA’s organisational structure and culture were as much a causal
factor of the accident as the physical cause (the foam debris strike). In particular,
Vaughan recognised similarities between the Columbia and Challenger accidents in that
both resulted from organisational system failures, and presented a causal
explanation that links the culture of production, the normalisation of deviance, and
structural secrecy in NASA (CAIB, 2003: Chap. 8).
Complex technological systems have many interrelated parts, and component failures
in one or more parts of the system interact in unanticipated ways that lead to
catastrophic accidents. Organisations managing and operating high-risk technologies
can be considered as complex sociotechnical systems with systemic dependencies and
tight coupling in the organisation structure and management policies, which can lead
to organisational failures as contributory causal factors in system accidents. It is
important to consider the organisational context in which such technological systems
operate as it adds to their complexity and susceptibility to the occurrence of system
accidents. Sagan (1993) examines two important schools of thought in organisation
theory, namely, Normal Accident Theory and High Reliability Organisation theory,
concerning the issue of safety and reliability of organisations involved in the
development, management and operation of complex technological systems such as
nuclear power plants, petrochemical plants, and nuclear weapons. Sagan argues that
organisation theories on accidents and risk are necessary to understand and address
the social causes of an accident, and to enhance the capacity of technologically
complex organisations to safely operate and manage high-risk technological systems.
The Columbia investigation Report identifies a “broken safety culture” as a focal point
of the accident’s organisational causes (CAIB, 2003). The report examines how NASA’s
organisational culture and structure weakened the safety structure and created
structural secrecy, causing decision makers to miss the threat posed by successive
foam debris strikes. Organisational culture refers to the values, norms,
beliefs, and practices that govern how an institution functions. Schein (1992) refers to
shared basic assumptions and provides a more formal definition as follows:
The culture of a group can now be defined as a pattern of shared basic assumptions that
the group learned as it solved its problems of external adaptation and internal integration,
that has worked well enough to be considered valid, and therefore, to be taught to new
members as the correct way to perceive, think, and feel in relation to those problems
(Schein, 1992: 12).
A safety culture is the set of assumptions, and their associated practices, through which
a group understands and conceives of the dangers and hazards of the world, to which
the exposure of people and society should be minimised (Pidgeon, 1991). Pidgeon argues
that a safety culture is created and recreated as members of a group repeatedly behave
and communicate in ways which seem to them to be “natural”, obvious and
unquestionable, and as such will serve to construct a particular version of risk, danger
and safety. Organisational culture has an influence on the overall safety, reliability and
effectiveness of the operations in an organisation. Safety culture is a part of the
organisational culture; it is the leaders of an organisation who determine how it
functions, and it is their decision making in particular which determines whether an
organisation exhibits the practices and attitudes which go to make up a culture of
safety (Hopkins, 2005). The disaster at the Moura coal mine in central Queensland,
which exploded in 1994, killing 11 men, presents an excellent illustration of the
importance of safety culture in organisations. The accident inquiry revealed a culture,
as a set of practices, focused on maximising production and largely oblivious to the
potential for explosion (Hopkins, 1999). This accident is indicative of the systematic
attention that was paid to production by managers at Moura and the systematic lack of
attention paid to safety. This managerial focus shaped the whole culture of the mine.
Organisational cultures may be detrimental to safety, not because leaders have chosen
to sacrifice safety for the sake of production, but because they have not focused their
attention on safety at all (Hopkins, 2005). Hopkins argues that if leaders attend to both
production and safety, the organisations they lead will exhibit a culture which
potentially emphasises both.
Pidgeon (1991) discusses a number of features that characterise a “good” safety culture:
senior management commitment to safety; shared care and concern for hazards and a
solicitude over their impacts upon people; realistic and flexible norms and rules about
hazards; and continual reflection upon practice through monitoring, analysis and
feedback systems. Modern industrial organisations are facing strong pressures for
change due to competition and change of generation (both technology and people) and
at the same time they need to be able to ensure and demonstrate their reliability and
safety in managing high-risk technological systems to the general public (Reiman &
Oedewald, 2005). A central finding of the Columbia investigation report is the
recommendation that NASA should address the “political, budgetary and policy
decisions” that influenced the organisational structure, culture and systems safety
which led to the flawed decision-making (CAIB, 2003). Leveson et al. (2004) propose a
systems-oriented approach that links system safety and engineering systems to
address safety culture and other organisational dynamics in NASA.
Sagan’s (1993) analysis of accidents and near-misses in the US nuclear weapons system
provides compelling evidence that power and politics in complex organisations
contribute to accidents, and furthermore emphasises the role of group interests in
producing accident-prone systems.
Vaughan (1996) describes the Challenger accident as a "social construction of reality" that
allowed the banality of bureaucracy to create a habit of normalizing deviations from
safe procedures. While Perrow (1999) concurs with Vaughan’s account of the Challenger
accident as an appropriately sociological and organisational explanation, he argues
that Vaughan’s analysis minimises the corruption of the safety culture, and more
particularly drains this case of the extraordinary display of organisational power that
overcame the objections of the engineers who opposed the launch. Perrow concludes
that this was not the normalization of deviance or the banality of bureaucratic
procedures and hierarchy or the product of an engineering "culture;" it was the exercise
of organisational power.
Perrow (1994) discusses organisational politics where corporate leaders pay lip service
to safety and use their power to impose risk on the many for the benefit of the few. He
identifies the reasons for such corporate behaviour: the latency period for a catastrophic
accident may be longer than any decision maker’s career; and few managers are
punished for not putting safety first, even after an accident, but will quickly be
punished for not putting profits, market share or prestige first. Moreover, managers
may start to believe their own rhetoric about safety first, because information that
would create awareness of the lack of safety is suppressed for reasons of organisational
politics. Sagan (1994) argues that even if organisational leaders place safety first and try
to enforce this goal, clashes of power and interest at lower levels may defeat it.
It is essential to understand the role of politics and power in organisations as they have
high potential to contribute to accident causation and disasters in complex
sociotechnical systems. Sagan’s (1993) study of nuclear weapons organisations found
them to be infused with politics, with many conflicting interests at play both within the
military command and control structure and between military and civilian leaders. Sagan
concludes that power and politics should be taken seriously: they are necessary not only to
understand the organisational causes of accidents, but also to start the difficult process
of designing reforms to enhance safety and reliability in organisations. Sagan
encourages organisational theorists to study these organisational factors in order to
bring the culture and operational practices of hazardous organisations into public
view.
Accident models generally used for the prediction of accidents during the development
of safety-critical systems, in particular, are based on sequential models. Furthermore,
traditional safety and risk analysis techniques such as Fault Tree Analysis and
Probabilistic Safety Analysis are not adequate to account for the complexity of modern
sociotechnical systems. The choice of accident model has consequences for how post hoc
accident analysis and risk assessment are done; thus we need to consider the extension
and development of systemic accident models both for accident analysis and for risk
assessment and hazard analysis of complex critical systems.
Similarly, STAMP has been applied to a number of case studies for post hoc accident
analysis (e.g., Leveson et al., 2002; Johnson & Holloway, 2003b). There is a need for a
methodology for the development of the STAMP model including guidelines for
developing the control models and interpretation of the flawed control classification.
Some advances have been made in extending the STAMP model to conduct a proactive
accident investigation in the early stages of system design. Leveson & Dulac (2005)
discuss the use of the STAMP model for hazard analysis, safety (risk) assessment, and as a
basis for a comprehensive risk management system.
From the cognitive systems engineering perspective, modelling the human operator as a
separate system is not feasible; rather, the human-machine ensemble is
regarded as a whole, where the dynamics and complexity of the interaction can be
captured by providing a joint model (Hollnagel & Woods, 2005).
The recent advances in new systemic accident models, based on cognitive systems
engineering, such as the Functional Resonance Accident Model (Hollnagel, 2004),
should be investigated further and applied to the modelling of complex sociotechnical
systems to understand the variability in human and system performance and how this
relates to accident causation.
Although formal methods have been applied successfully to the design and
verification of safety-critical systems, they need to be extended to capture the many
factors, including human behaviour and organisational aspects that are found in
accidents and accident reports. Further research is needed to develop languages and
semantics for modelling the various aspects of accidents in modern complex systems,
such as: organisational, cultural and social properties, and human performance.
However, formal methods have limitations in scaling up to model complete
sociotechnical systems; they require specialists in mathematics; and it should be noted
that not every aspect of a complex system can be formalised in a mathematical sense.
Why-Because Analysis is probably the most mature formal method for accident
analysis. WBA has also been compared with other causal analysis methods; in
particular the comparison with Rasmussen’s AcciMap technique showed that the
methodical approach employed by WBA produces greater precision in determining
causal factors than does the informal approach of the AcciMap (Ladkin, 2005).
However, a single case study is not sufficient to draw general results; comparisons of
these methods need to be conducted on a large variety of sociotechnical systems in
diverse domains.
9. Acknowledgements
This research was initiated at the Defence Science and Technology Organisation, under
the Critical Systems Development Task JTW 04/061, sponsored by the Defence
Materiel Organisation. I am particularly indebted to Dr. Tony Cant, DSTO for his
continuous encouragement and inspiring discussions on safety-critical systems. I am
grateful to Drs. Brendan Mahony and Jim McCarthy, High Assurance Systems Cell,
Command, Control, Communications and Intelligence Division, DSTO who have
assisted me greatly with their expertise in many “formal” aspects of safety-critical
systems and accident modelling research, and for the many stimulating afternoon
discussions.
I would like to thank Professor Stephen Cook and Associate Professor David Cropley,
Director and Deputy-Director respectively, of the Defence and Systems Institute at the
University of South Australia for their support in the writing of this report and in
providing a congenial atmosphere for system safety research.
10. References
AAIB (1994). U.S. Army Black Hawk Helicopters 87-26000 and 88-26060: Volume 1.
Executive Summary: UH-60 Black Hawk Helicopter Accident, 14 April, USAF Aircraft
Accident Investigation Board.
Anderson, S. O., Bloomfield, R. E., & Cleland, G. L. (1998a). Guidance on the use of
Formal Methods in the Development and Assurance of High Integrity Industrial Computer
Systems, Parts I and II. Working Paper 4001, European Workshop on Industrial
Computer Systems (EWICS) Technical Committee 7.
http://www.ewics.org/docs/formal-methods-subgroup
Anderson, S. O., Bloomfield, R. E., and Cleland, G. L. (1998b). Guidance on the use of
Formal Methods in the Development and Assurance of High Integrity Industrial Computer
Systems, Part III: A Directory of Formal Methods. Working Paper 4002, European
Workshop on Industrial Computer Systems (EWICS) Technical Committee 7.
http://www.ewics.org/docs/formal-methods-subgroup
ATEA (1998). Def (Aust) 5679: The Procurement of Computer-Based Safety-Critical Systems.
Australian Defence Standard, August, Australia: Army Technology Engineering
Agency.
Blackman, H., Gertman, D. & Hallbert, B. (2000). The need for organisational analysis.
Cognition, Technology & Work, 2, 206-208.
Booch, G. (1994). Object-Oriented Analysis and Design with Applications. 2nd Ed., Menlo
Park, CA: Addison-Wesley.
Bowen, J. & Stavridou, V. (1993). Safety Critical Systems: formal methods and
standards. Software Engineering Journal, 8(4), 189-209, UK: IEE.
Buede, D. M. (2000). The Engineering Design of Systems: Models and Methods. New York:
Wiley.
Burns, C. P. (2000). Analysing Accident Reports Using Structured and Formal Methods.
Ph.D. Thesis, February, Glasgow: The University of Glasgow.
http://www.dcs.gla.ac.uk/research/gaag/colin/thesis.pdf
Burns, C. P., Johnson, C. W. & Thomas, M. (1997). Agents and actions: Structuring
Butler, R. W., Carreño, V. A., Di Vito, B. L., Holloway, C. M. & Miner, P. S. (2002).
NASA Langley’s Research and Technology-Transfer Program in Formal Methods. Assessment
Technology Branch, Hampton, Virginia: NASA Langley Research Center.
http://shemesh.larc.nasa.gov/fm/NASA-over.pdf
CAIB (2003). Columbia Accident Investigation Board Report Volume I. Washington, D.C.:
Columbia Accident Investigation Board.
Collins, J., Hall, N. &. Paul, L. A. (2004). Counterfactuals and Causation: History,
Problems, and Prospects, Chapter 1, In Collins, J., Hall, N. &. Paul, L. A. (Eds.),
Causation and Counterfactuals. Cambridge, MA: The MIT Press.
Clarke, E. M. & Wing, J. M. (1996). Formal Methods: State of the Art and Future Directions.
Report CMU-CS-96-178, School of Computer Science, Pittsburgh PA: Carnegie Mellon
University.
Clarkson, J., Hopkins, A. & Taylor, K. (2001): Report of the Board of Inquiry into F-111
(Fuel Tank) Deseal/Reseal and Spray Seal Programs - Vol. 1. Canberra, ACT: Royal
Australian Air Force.
http://www.defence.gov.au/raaf/organisation/info_on/units/f111/Volume1.htm
Eiter, T. & Lukasiewicz, T. (2002). Complexity Results for Explanations in The Structural-
Model Approach. INFSYS Research Report 1843-01-08, July, Institut für
Informationssysteme, Abtg. Wissensbasierte Systeme Technische, Universität Wien
Favoritenstraße 9-11 A-1040, Wien, Austria.
Ferry, T. S. (1988). Modern Accident Investigation and Analysis. Second Edition, New
York: J. Wiley.
GAO (1997). Operation Provide Comfort: Review of Air Force Investigation of Black Hawk
Fratricide Incident. GAO/OSI-9804, Office of Special Investigations, Washington DC: US
General Accounting Office.
Heinrich, H. W., Petersen, D. & Roos, N. (1980). Industrial Accident Prevention. New
York: McGraw-Hill.
Höhl, M. & Ladkin, P. (1997). Analysing the 1993 Warsaw Accident with a WB-Graph.
Article RVS-Occ-97-09, 8 September, Faculty of Technology, Bielefeld University.
http://www.rvs.uni-bielefeld.de
Hollnagel, E. (1998). Cognitive Reliability and Error Analysis Method. Oxford: Elsevier
Science.
Hollnagel, E. & Woods, D. D. (1983). Cognitive Systems Engineering: New wine in new
bottles. International Journal of Man-Machine Studies, 18, 583-600. Reprinted in
International Journal of Human-Computer Studies, 1999, 51, 339-356.
Hollnagel, E., Woods, D. D. & Leveson, N. (2006). Resilience Engineering: Concepts and
Precepts. Aldershot: Ashgate.
Hopkins, A. (1999). Managing Major Hazards: The Lessons of the Moura Mine Disaster.
Sydney: Allen and Unwin.
Hopkins, A. (2000). Lessons from Longford: The Esso Gas Plant Explosion. Sydney: CCH.
Hopkins, A. (2005). Safety, Culture and Risk: The Organisational Causes of Disasters.
Sydney: CCH.
Huang, Yu-Hsing (2007). Having a New Pair of Glasses: Applying Systemic Accident Models
on Road Safety. Dissertation No. 1051, Department of Computer and Information
Science, Linköping, Sweden: Linköping University.
Johnson, C., & Holloway, C. M. (2003b). The ESA/NASA SOHO Mission Interruption:
Using the STAMP Accident Analysis Technique for a Software Related `Mishap'.
Software: Practice and Experience, 33(12), 1177-1198.
Khosla, S. (1988). System Specification: A Deontic Approach. PhD thesis, Imperial College
of Science and Technology, UK: University of London.
Kroes, P., Franssen, M., van de Poel, Ibo. & Ottens, M. (2006). Treating socio-technical
systems as engineering systems: some conceptual problems. Systems Research and
Behavioral Science, 23(6), 803-814.
Ladkin, P.B. (2005). Why-Because Analysis of the Glenbrook, NSW Rail Accident and
Comparison with Hopkins's Accimap. Report RVS-RR-05-05, 19 December, Faculty of
Technology, Bielefeld University. http://www.rvs.uni-bielefeld.de
Ladkin, P. B. & Loer, K. (1998). Why-because analysis: Formal reasoning about incidents.
Technical Report RVS-Bk-98-01, Faculty of Technology, Bielefeld University
http://www.rvs.uni-bielefeld.de
Ladkin, P. B. & Stuphorn, J. (2003). Two Causal Analyses of the Black Hawk
Shootdown During Operation Provide Comfort, Proceedings of the 8th Australian
Workshop on Safety Critical Software and Systems, Peter Lindsay and Tony Cant (Eds.),
Conferences in Research and Practice in Information Technology, Volume 33,
Canberra: Australian Computer Society.
LaRC (2004). The CAUSE Project, Research on Accident Analysis, NASA Langley
Formal Methods Site. http://shemesh.larc.nasa.gov/fm/fm-now-cause.html
Leveson, N. G. (1986). Software Safety: Why, What, and How, Computing Surveys, 18, 2
June.
Leveson, N. (1993). An Investigation of the Therac-25 accidents. IEEE Computer, 26, 18-
41.
Leveson, N. G. (1995). Safeware: System Safety and Computers. Reading. MA: Addison-
Wesley.
Leveson, N. (2001). Evaluating Accident Models using Recent Aerospace Accidents. Part I:
Event-based Models. Technical Report, Aeronautics and Astronautics Department,
Massachusetts Institute of Technology, June 28, Cambridge, MA: MIT.
Leveson, N. G. (2002). System Safety Engineering: Back to the Future. Aeronautics and
Astronautics Department, Massachusetts Institute of Technology, Cambridge, MA:
MIT. http://sunnyday.mit.edu/book2.pdf
Leveson, N. (2004). A New Accident Model for Engineering Safer Systems, Safety
Science, 42, 237-270.
Leveson, N. G., Allen, P. & Storey, Margaret-Anne. (2002). The Analysis of a Friendly
Fire Accident using a Systems Model of Accidents. Proceedings of the 20th International
System Safety Conference, Denver, Colorado, 5-9 August.
Leveson, N., Cutcher-Gershenfeld, J., Barrett, B., Brown, A., Carroll, J., Dulac, N., Fraile,
L. & Marais, K. (2004). Effectively addressing NASA's organization and safety culture.
Engineering Systems Division Symposium, Massachusetts Institute of Technology,
March 29-31, Cambridge, MA: MIT.
Leveson, N. G. & Dulac, N. (2005). Safety and Risk-Driven Design in Complex Systems-
of-Systems. 1st NASA/AIAA Space Exploration Conference, Orlando.
Leveson, N., Dulac, N., Zipkin, D., Cutcher-Gershenfeld, J., Carroll, J. & Barrett, B
(2006). Engineering Resilience into Safety-Critical Systems. In Hollnagel, E., Woods, D.
D. & Leveson, N. (2006). Resilience Engineering: Concepts and Precepts. Aldershot:
Ashgate.
Leveson, N. G., Heimdahl, M .P. E., Hildreth, H. & Reese, J. D. (1994). Requirements
specification for process-control systems, IEEE Transactions on Software Engineering,
20(9), 684-707, September.
Marais, K., Dulac, N., & Leveson, N. (2004). Beyond Normal Accidents and High Reliability
Organizations: The Need for an Alternative Approach to Safety in Complex Systems, ESD
Symposium, Massachusetts Institute of Technology, Cambridge, MA: MIT.
Maurino, D., Reason, J. T., Johnston, N. & Lee, R. (1995). Beyond aviation human factors,
Aldershot: Ashgate.
NAS (2003). Securing the Future of U.S. Air Transportation: A System in Peril, Committee
on Aeronautics Research and Technology for Vision 2050, Aeronautics and Space
Engineering Board, Washington, DC: National Academy of Sciences.
Parasuraman, R. (1997). Humans and Automation: use, misuse, disuse, abuse. Human
Factors, 39(2), 230-253.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference, UK: Cambridge University
Press.
Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. New York: Basic
Books.
Reason, J. (2000). Human error: models and management. British Medical Journal, 320,
768-770.
Rogers Commission Report (1986). Report of the Presidential Commission on the Space
Shuttle Challenger Accident. June 6, Washington, D.C.: NASA.
http://history.nasa.gov/rogersrep/genindex.htm
Shappell, S. A. & Wiegmann, D. A. (2000). Human factors analysis and classification system
- HFACS. Report DOT/FAA/AM-00/7, Washington, DC: Department of
Transportation, FAA.
Shorrock, S., Young, M. & Faulkner, J. (2003). Who moved my (Swiss) cheese? Aircraft
and Aerospace, January/February, 31-33.
Skelt, S. (2002). Methods for accident analysis. Report No. ROSS (NTNU) 2000208,
Norwegian University of Science and Technology, Trondheim: NTNU.
Snook, S. (2002). Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern
Iraq, Princeton, New Jersey: Princeton University Press.
Vaughan, D. (1996). The Challenger Launch Decision: Risky Technology, Culture and
Deviance at NASA. Chicago: University of Chicago Press.
Vernez, D., Buchs, D. & Pierrehumbert, G. (2003). Perspectives in the use of coloured
Petri nets for risk analysis and accident modelling. Safety Science, 41, 445-463.
Vicente, K. J. (1999). Cognitive Work Analysis: Towards Safe, Productive, and Healthy
Computer-Based Work. Mahwah, NJ: Lawrence Erlbaum.
Wagenaar, W. A., Groeneweg, J., Hudson, P. T. W. & Reason J. T. (1994). Safety in the
oil industry. Ergonomics, 37, 1999-2013.
Wieringa, R. J. & Meyer, J. -J. Ch. (1994). Applications of Deontic Logic in Computer
Science: A Concise Overview, In J.-J. Ch. Meyer & R.J. Wieringa (Eds.), Deontic Logic in
Computer Science: Normative System Specification. Chichester: John Wiley and Sons.
Woods, D. D., Johannesen, L. J. & Sarter, N. B. (1994). Behind Human Error: Cognitive
Systems, Computers and Hindsight. SOAR Report 94-01, Wright-Patterson Air Force Base,
Ohio: CSERIAC.
Yourdon, E. (1989). Modern Structured Analysis. Englewood Cliffs, NJ: Yourdon Press.