A Survey On Fault Injection Techniques
Abstract: Fault tolerant circuits are currently required in several major application sectors. As a complement to other
approaches such as proving or analytical modeling, whose applicability and accuracy are significantly restricted for
complex fault tolerant systems, fault injection has been recognized as particularly attractive and valuable. Fault injection
provides a method of assessing the dependability of a system under test. It involves inserting faults into a system and
monitoring the system to determine its behavior in response to each fault. Several fault injection techniques have been
proposed and evaluated in practice. They can be grouped into hardware-based, software-based, simulation-based,
emulation-based, and hybrid fault injection. This paper presents a survey of fault injection techniques, with a comparison
of the different techniques and an overview of the different tools.
Keywords: Fault tolerance, fault injection, fault simulation, VLSI circuits, fault injector, VHDL fault models.
Specification, design, development, manufacturing, assembly, and installation throughout its operational life. Most faults that occur before full system deployment are discovered and eliminated through testing. Faults that are not removed remain embedded in the system and can reduce its dependability.

Hardware (physical) faults that arise during system operation are best classified by their duration: Permanent, transient, or intermittent.

• Permanent faults: Caused by irreversible component damage, such as a semiconductor junction that has shorted out because of thermal aging, improper manufacture, or misuse. For example, a chip in a network card may burn out and cause the card to stop working. Recovery can only be accomplished by replacing or repairing the damaged component or subsystem.
• Transient faults: Triggered by environmental conditions such as power-line fluctuation, electromagnetic interference, or radiation. These faults rarely do any lasting damage to the component affected, although they can induce an erroneous state in the system. According to several studies, transient faults occur far more often than permanent ones, and are also far harder to detect.
• Intermittent faults: Caused by unstable hardware or varying hardware states. They can be repaired by replacement or redesign.

Hardware faults of almost all types are easily injected by the devices available for the task. Dedicated hardware tools are available to flip bits instantaneously at the pins of a chip, vary the power supply, or even bombard the system or its chips with heavy ions, methods believed to cause faults close to real transient hardware faults. An increasingly popular software tool is the software-implemented fault injector, which changes bits in processor registers or memory, in this way producing the same effects as transient hardware faults. All these techniques require that a system, or at least a prototype, actually be built in order to perform the fault testing.

Software faults are always the consequence of incorrect design, at specification or at coding time. Every software engineer knows that a software product is bug free only until the next bug is found. Many of these faults are latent in the code and show up only during operation, especially under heavy or unusual workloads and timing contexts.

Since they are a result of bad design, it might be supposed that all software faults would be permanent. Interestingly, practice shows that despite their permanent nature, their behavior is transient; that is, when a bad behavior of the system occurs, it cannot be observed again, even if great care is taken to repeat the situation in which it occurred. Such behavior is commonly called a failure of the system. The subtleties of the system state may mask the fault, as when the bug is triggered by very particular timing relationships between several system components, or by some other rare and irreproducible situation.

Curiously, most computer failures are blamed on either software faults or permanent hardware faults, to the exclusion of the transient and intermittent hardware types. Yet many studies show these types are much more frequent than permanent faults. The problem is that they are much harder to track down.

During the process of software development, faults can be created in every step: Requirement definition, requirement specification, design, implementation, testing, and deployment. These faults can be categorized as:

• Function faults: Incorrect or missing implementation that requires a design change to be corrected.
• Algorithm faults: Incorrect or missing implementation that can be fixed without the need of a design change.
• Timing/serialization faults: Missing or incorrect serialization of shared resources.
• Checking faults: Missing or incorrect validation of data, or an incorrect loop, or an incorrect conditional statement.
• Assignment faults: Values assigned incorrectly or not assigned.

2. An Overview of Fault Injection

Fault injection is defined by Arlat [3] as the validation technique of the dependability of fault tolerant systems which consists in the accomplishment of controlled experiments where the observation of the system’s behavior in presence of faults is induced explicitly by the deliberate introduction (injection) of faults in the system.

Fault injection techniques have long been recognized as necessary to validate the dependability of a system by analyzing the behavior of the devices when a fault occurs. Several efforts have been made to develop techniques for injecting faults into a system prototype or model. Most of the developed techniques fall into five main categories:

• Hardware-based fault injection: It is accomplished at the physical level, disturbing the hardware with parameters of the environment (heavy ion radiation, electromagnetic interference, etc.), injecting voltage sags on the power rails of the hardware (power supply disturbances), laser fault injection, or modifying the value of the pins of the circuit.
• Software-based fault injection (software implemented fault injection): The objective of this technique is to reproduce at the software level the errors that would have been produced by faults occurring in the hardware.
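
The software-based category can be illustrated with a minimal sketch. It assumes that a plain byte buffer stands in for the target's memory and that a single parity bit acts as a toy error detection mechanism; the names `flip_bit`, `parity`, and `inject_random_fault` are hypothetical and do not come from any tool surveyed in this paper.

```python
import random
from typing import Tuple


def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Invert one bit in place, emulating a transient single-bit fault."""
    if not 0 <= bit_index < 8:
        raise ValueError("bit_index must be in 0..7")
    memory[byte_index] ^= 1 << bit_index


def parity(memory: bytearray) -> int:
    """Even-parity bit over the whole buffer (toy error detection)."""
    return sum(bin(b).count("1") for b in memory) % 2


def inject_random_fault(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit and report where it was injected."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


if __name__ == "__main__":
    rng = random.Random(2004)            # fixed seed so the experiment is repeatable
    memory = bytearray(b"\x0f\xa5\x00\xff")
    golden = bytes(memory)               # fault-free reference copy
    stored_parity = parity(memory)       # computed before the fault is injected

    location = inject_random_fault(memory, rng)

    # A single-bit flip always changes the parity, so this fault is detected.
    detected = parity(memory) != stored_parity
    print("injected at (byte, bit):", location)
    print("state corrupted:", bytes(memory) != golden)
    print("detected by parity check:", detected)
```

Seeding the random generator keeps the injection reproducible. A real injector would target processor registers or live memory rather than a Python buffer.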
isolation and determining the coverage of a given set of tests.

In practice, fault removal and fault forecasting are frequently not used separately; one follows the other. For instance, after a system is rejected by fault forecasting testing, several fault removal tests should be applied. These new tests provide actions that help the designer improve the system. The improved system is then submitted to another fault forecasting test, and so on.

3. Hardware-Based Fault Injection

Hardware-based fault injection involves augmenting the system under analysis with specially designed test hardware that allows faults to be injected into the system and their effects examined. It uses additional hardware to introduce faults into the target system’s hardware. Depending on the faults and their locations, hardware-implemented fault injection methods fall into two categories:

• Hardware fault injection with contact: The injector has direct physical contact with the target system, producing voltage or current changes externally to the target chip. Examples are methods that use pin-level active probes and socket insertion. The probe method is usually limited to stuck-at faults, although it is possible to attain bridging faults by placing a probe across two or more pins. The socket insertion technique inserts a socket between the target hardware and its circuit board. The inserted socket injects stuck-at, open, or more complex logic faults into the target hardware by forcing the analog signals that represent desired logic values onto the pins of the target hardware. The pin signals can be inverted, ANDed, or ORed with adjacent pin signals or even with previous signals on the same pin.
• Hardware fault injection without contact: The injector has no direct physical contact with the target system. Instead, an external source produces some physical phenomenon, such as heavy ion radiation or electromagnetic interference, causing spurious currents inside the target chip.

Hardware simulations typically occur in a high level description of the circuit. This high level description is turned into a transistor level description of the circuit, and faults are injected into the circuit. Software simulation is most often used to detect the response to manufacturing defects. The system is then simulated to evaluate the response of the circuit to that particular fault. Since this is a simulation, a new fault can then be easily injected, and the simulation is rerun to measure the response to the new fault. It takes time to construct the model, insert the faults, and then simulate the circuit, but modifications to the circuit are easier to make at this stage than later in the design cycle. This sort of testing would be used to check a circuit early in the design cycle. These simulations are non-intrusive, since the simulation functions normally other than the introduction of the fault.

Hardware fault injections occur in actual instances of the circuit after fabrication. The circuit is subjected to some sort of interference to produce the fault, and the resulting behavior is examined. So far, this has been done with transient faults, as the difficulty and expense of introducing stuck-at and bridging faults in the circuit have not been overcome. The circuit is attached to testing equipment which operates it and examines the behavior after the fault is injected. It takes time to prepare the circuit and test it, but such tests generally proceed faster than simulation does. This approach is, rather obviously, used to test a circuit just before or in production. These experiments are non-intrusive as long as they do not alter the behavior of the circuit other than to introduce the fault. If special circuitry must be included to cause or simulate faults in the finished circuit, it would most likely affect the timing or other characteristics of the circuit, and therefore be intrusive.

Suppositions:

• The fault injector should have no interference with the exercised system.
• Faults should be injected at locations internal to the ICs in the exercised system.
• Faults that are injected into the system are representative of the actual faults that occur within the system. This means that both randomly generated and non-randomly generated faults can be injected into the system, and both permanent and transient faults can be injected into the system.

Benefits:

• Hardware fault injection can access locations that are hard to access by other means. For example, the heavy-ion radiation method can inject faults into VLSI circuits at locations which are impossible to reach by other methods.
• This technique works well for systems which need high time-resolution for hardware triggering and monitoring.
• Experimental evaluation by injection into actual hardware is in many cases the only practical way to estimate coverage and latency accurately.
• This technique injects faults which have low perturbation.
• This technique is better suited for the low-level fault models.
• Not intrusive: No modification of the target system is required to inject faults.
• Experiments are fast.
• Experiments can be run in near real-time, allowing for the possibility of running a large number of fault injection experiments.
• Running the fault injection experiments on the real hardware that is executing the real software has the advantage of including any design faults that might be present in the actual hardware and software design.
• Fault injection experiments are performed using the same software that will run in the field.
• No model development or validation required.
• Ability to model permanent faults at the pin level.

Drawbacks:

• Hardware fault injection can introduce a high risk of damage to the injected system.
• High levels of device integration, multiple-chip hybrid circuits, and dense packaging technologies limit accessibility to injection.
• Some hardware fault injection methods, such as state mutation, require stopping and restarting the processor to inject a fault, so they are not always effective for measuring latencies in physical systems.
• Low portability and observability.
• Limited set of injection points and limited set of injectable faults.
• A recent paper indicates that the setup time for each experiment might, in fact, offset the time gained by the ability to perform the experiments in near real-time.
• Requires special-purpose hardware in order to perform the fault injection experiments. This hardware is used to inject faults into the processor by applying the rail voltages (representing logic one and zero) to the Input/Output (I/O) pins of the processor. Also, if the processor contains appropriate special-purpose hardware known as scan chains, then the external hardware could also be used to inject stuck-at-1 and stuck-at-0 faults into the internal registers of the processor. In general, this hardware can be very difficult and costly to build.
• Limited observability and controllability. At best, one would be able to corrupt the I/O pins of the processor and the internal processor registers.

Tools:

• RIFLE: A pin-level fault injection system for dependability validation developed at the University of Coimbra, Portugal [22]. This system can be adapted to a wide range of target systems, and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected, and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. Fault injection results showing the coverage and latency achieved with a set of simple behavior based error detection mechanisms are presented in [22]. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
• FOCUS: A design automation environment developed at the University of Illinois at Urbana-Champaign [9], used for analyzing a microprocessor-based jet-engine controller used in the Boeing 747 and 757 aircraft. FOCUS uses a hierarchical simulation environment based on SPLICE for tracing the impact of transient faults. The fault from the simulation is automatically fed into the analysis software in order to quantify the fault tolerance of the system under test. In the controller, fault detection and reconfiguration are performed by transactions over the communication link. The simulation consists of the instructions specifically designed to exercise this cross-channel communication. The level of effectiveness of the dual configuration of the system against single and multiple transient faults is measured. The results are used to identify critical design aspects from a fault tolerance viewpoint. The usefulness of state transition models which describe the error propagation within the chip, enabling identification of critical fault propagation paths and the modules most sensitive to fault propagation, is shown using the tool.
• MESSALINE: A pin-level fault forcing system developed at LAAS-CNRS [3]. MESSALINE uses both active probes and sockets to conduct pin-level fault injection. It can inject stuck-at, open, bridging, and complex logical faults, among others. It can also control the length of fault existence and the frequency. It is made up of four modules: Injection module, activation module, collection module, and management module. The injection module enables injection on up to 32 injection points by means of injecting elements that support two different fault injection techniques: Forcing and insertion. The activation module ensures the proper initialization of the target system according to the elements of the A set. The readout collection module is used to collect the elements of the R set. The management module is responsible for the automatic and parameterizable generation of the test sequence, for the run-time control of its execution, and for result archiving for post-test analysis.
• FIST (Fault Injection System for Study of Transient Fault Effect): Developed at the Chalmers University of Technology in Sweden [14], employs both
176                                The International Arab Journal of Information Technology, Vol. 1, No. 2, July 2004
contact and contactless methods to create transient faults inside the target system. This tool uses heavy-ion radiation to create transient faults at random locations inside a chip when the chip is exposed to the radiation, and can thus cause single- or multiple-bit flips. FIST can inject faults directly inside a chip, which cannot be done with pin-level injections. It can produce transient faults at random locations distributed evenly over a chip, which leads to a large variation in the errors seen on the output pins. In addition to radiation, FIST allows for the injection of power disturbance faults.
• MARS (Maintainable Real-time System): Developed at the Technical University of Vienna, Austria [13]. The MARS system is a time-triggered, fault-tolerant, distributed system. It consists of several computer nodes communicating by means of a synchronous time division multiple access strategy. The nodes contain extra hardware and software for fault tolerance and can be configured to operate in redundancy, i.e. with two nodes executing the same task. The fundamental fault tolerance property of each processing node in the MARS system is to be fail-silent. The implementation of the fail silence property relies on numerous Error Detection Mechanisms (EDMs) at three levels: The hardware level, the system software level, and the application software level.

4. Software-Based Fault Injection

Software faults are probably the major cause of system outages. Fault injection is a possible way to assess the consequences of hidden bugs. Traditionally, software-based fault injection involves the modification of the software executing on the system under analysis in order to provide the capability to modify the system state according to the programmer’s modeling view of the system. This is generally used on code that has communicative or cooperative functions, so that there is enough interaction to make fault injection useful. All sorts of faults may be injected, from register and memory faults, to dropped or replicated network packets, to erroneous error conditions and flags. These faults may be injected into simulations of complex systems where the interactions are understood though not the details of implementation, or they may be injected into operating systems to examine the effects.

Software fault injections are more oriented towards implementation details, and can address program state as well as communication and interactions. Faults can be mis-timings, missing messages, replays, corrupted memory or registers, faulty disk reads, and almost any other state the hardware provides access to. The system is then run with the fault to examine its behavior. These simulations tend to take longer because they encapsulate all of the operation and detail of the system, but they will more accurately capture the timing aspects of the system. This testing is performed to verify the system's reaction to introduced faults and to catalog the faults successfully dealt with. It is done later in the design cycle to show performance for a final or near-final design. These simulations can be non-intrusive, especially if timing is not a concern, but if timing is at all involved, the time required for the injection mechanism to inject the faults can disrupt the activity of the system and cause timing results that are not representative of the system without the fault injection mechanism deployed. This occurs because the injection mechanism runs on the same system as the software being tested.

Suppositions:

• Faults that are injected into the system are representative of the actual faults that occur within the system.
• The additional software required to inject the faults does not affect the functional behavior of the system in response to the injected fault. Essentially, the assumption states that the software that is used to inject the fault is independent of the rest of the system, and that any faults present in the fault injection software will not affect the system under analysis.

Benefits:

• This technique can be targeted to applications and operating systems, which is difficult to do using hardware fault injection.
• Experiments can be run in near real-time, allowing for the possibility of running a large number of fault injection experiments.
• Running the fault injection experiments on the real hardware that is executing the real software has the advantage of including any design faults that might be present in the actual hardware and software design.
• Does not require any special-purpose hardware; low complexity, low development cost, and low implementation cost.
• No model development or validation required.
• Can be expanded for new classes of faults.

Drawbacks:

• Limited set of injection instants: At assembly instruction level only.
• It cannot inject faults into locations that are inaccessible to software.
• Requires a modification of the source code to support the fault injection, which means that the code that is executing during the fault experiment is not the same code that will run in the field.
• Limited observability and controllability. At best, one would be able to corrupt the internal processor registers (as well as locations within the memory map) that are visible to the programmer, traditionally referred to as the programmer’s model of the processor. So faults cannot be injected in the processor pipeline or instruction queue, for example.
• Very difficult to model permanent faults.
• Related to four, execution of the fault injection software could affect the scheduling of the system tasks in such a way as to cause hard real-time deadlines to be missed, which violates the second supposition.

We can categorize software injection methods on the basis of when the faults are injected: During compile-time or during run-time.

To inject faults at compile-time, the program instructions must be modified before the program image is loaded and executed. Rather than injecting faults into the hardware of the target system, this method injects errors into the source code or assembly code of the target program to emulate the effect of hardware, software, and transient faults. The modified code alters the target program instructions, causing injection. Injection generates an erroneous software image, and when the system executes the faulty image, it activates the fault.

This method requires the modification of the program that will evaluate the fault effect, and it requires no additional software during runtime. In addition, it causes no perturbation to the target system during execution. Because the fault effect is hard-coded, engineers can use it to emulate permanent faults. This method’s implementation is very simple, but it does not allow the injection of faults as the workload program runs.

During run-time, a mechanism is needed to trigger fault injection. Commonly used triggering mechanisms include:

• Time-out: In this simplest of techniques, a timer expires at a predetermined time, triggering injection. Specifically, the time-out event generates an interrupt to invoke fault injection. The timer can be a hardware or software timer.
• Exception/trap: In this case, a hardware exception or a software trap transfers control to the fault injector. Unlike time-out, exception/trap can inject the fault whenever certain events or conditions occur. For example, a software trap instruction inserted into a target program will invoke the fault injection before the program executes a particular instruction. A hardware exception invokes injection when a hardware observed event occurs (when a particular memory location is accessed, for example). Both mechanisms must be linked to the interrupt handler
• Code insertion: In this technique, instructions are added to the target program that allow fault injection to occur before particular instructions, much like the code-modification method. Unlike code modification, code insertion performs fault injection during runtime and adds instructions rather than changing original instructions. Unlike the trap method, the fault injector may exist as part of the target program and run in user mode rather than system mode.

Tools:

• FERRARI (Fault and Error Automatic Real-Time Injection): Developed at the University of Texas at Austin [19], uses software traps to inject CPU, memory, and bus faults. FERRARI consists of four components: The initializer and activator, the user information, the fault-and-error injector, and the data collector and analyzer. The fault-and-error injector uses software trap and trap handling routines. Software traps are triggered either by the program counter when it points to the desired program locations or by a timer. When the traps are triggered, the trap handling routines inject faults at the specific fault locations, typically by changing the content of selected registers or memory locations to emulate actual data corruptions. The faults injected can be those permanent or transient faults that result in an address line error, a data line error, or a condition bit error.
• FTAPE (Fault Tolerance and Performance Evaluator): Developed at the University of Illinois [30]. Engineers can inject faults into user-accessible registers in CPU modules, memory locations, and the disk subsystem. The faults are injected as bit-flips to emulate errors resulting from faults. Disk system faults are injected by executing a routine in the driver code that emulates I/O errors (bus error and timer error, for example). Fault injection drivers added to the operating system inject the faults, so no additional hardware or modification of application code is needed. A synthetic workload generator creates a workload containing specified amounts of CPU, memory, and I/O activity, and faults are injected with a strategy that considers the characteristics of the workload at the time of injection (which components are experiencing the greatest amount of workload activity, for example).
• FIAT (Fault Injection-based Automated Testing): Environment developed at Carnegie Mellon University [26]. FIAT is an automated real-time distributed accelerated fault injection environment. The FIAT environment provides experimenters with facilities for defining fault classes (relationships between faults and the error patterns that they
  vector.                                                     cause); for specifying (e.g., relative to the source
                                                              code of an application) where, when, and for how
178                                  The International Arab Journal of Information Technology, Vol. 1, No. 2, July 2004
  long errors will strike; and how they will interact with executing object code or data. In its initial version, the FIAT software can inject faults into user application code and data, messages (corrupted, lost, delayed), tasks (delayed, abnormal termination), and timers. Later versions will extend these fault injection capabilities into operating systems.
• XCEPTION: Developed at the University of Coimbra, Portugal [6], Xception uses the advanced debugging and performance monitoring features present in many modern processors to inject faults. It also uses the processor's own exceptions to trigger the faults. It requires no modification of the application software and no insertion of software traps. The fault injector is implemented as an exception handler and requires modification of the interrupt handler vector. Xception faults are triggered by accesses to specific addresses, which makes the experiments reproducible. Xception uses a fault mask when injecting a fault into a location in the system. The mask is compared with the memory/register/data, and the bits that are set to one in the mask are changed in the system by using bit-level operations such as stuck-at-zero, stuck-at-one, bit-flip and bridging.
• DOCTOR: An integrated software fault injection environment developed at the University of Michigan [16] that allows the injection of CPU, memory and network-communication faults. DOCTOR uses a more sophisticated method than the basic technique of modifying memory contents. Memory modification is a powerful fault injection method because almost every fault results, sooner or later, in some kind of contamination in the memory. Although it is a powerful method, some faults may infect the memory in a very subtle and non-deterministic way, so it can be very difficult to emulate such faults with basic memory modification. DOCTOR can use three different triggering mechanisms. For memory faults, a time-out triggers the fault injector, which overwrites memory contents to emulate the fault. For non-permanent CPU faults, traps are used. For permanent CPU faults, program instructions are changed during compilation to emulate instruction and data corruptions.
• EXFI: A fault injection system for embedded microprocessor-based boards developed at Politecnico di Torino, Italy [5]. The kernel of the EXFI system is based on the trace exception mode available in most microprocessors. During the fault injection experiment, the trace exception handler routine is in charge of computing the fault injection time, executing the injection of the fault, and triggering a possible time-out condition. The tool is able to inject single bit-flip transient faults both in the memory image of the process (data and code) and in the user registers of the processor. The approach can be easily extended to support different fault models, such as permanent stuck-at, coupling, temporal and spatial multiple bit-flip, etc. The main characteristics of EXFI are the low cost (it does not require any hardware device), the high speed (which allows a higher number of faults to be considered), the low requirements in terms of features provided by the operating system, the flexibility (it supports different fault types), and the high portability (it can be easily migrated to address different target systems).
• NFTAPE: Developed at the Center for Reliable and High-Performance Computing at the University of Illinois at Urbana-Champaign [29]. The objective of NFTAPE is to support several different types of fault injection, providing the capability of targeting several heterogeneous systems concurrently. This is accomplished through the use of a common control mechanism and common triggers. NFTAPE supports an arbitrary fault model. It can support a hardware fault injector to inject network faults, a SWIFI fault injector to inject communication faults, and a second SWIFI injector to target a distributed application. The first two injectors share an event-based trigger to coordinate communication faults, and the other uses a path-based trigger. Other fault injectors typically use only one method of fault injection (say SWIFI or HWIFI), let alone multiple injectors at the same time or shared triggers. In addition, NFTAPE contains a new driver-based fault injection scheme which, unlike other SWIFI fault injectors, can inject faults into both kernel and user space with minimal modifications for different operating systems.
• GOOFI (Generic Object-Oriented Fault Injection): Developed at the Department of Computer Engineering at Chalmers University of Technology in Sweden [1]. GOOFI can perform fault injection campaigns using different fault injection techniques on different target systems. A major objective of the tool is to provide a user-friendly fault injection environment with a graphical user interface and an underlying generic architecture that assists the user when adapting the tool for new target systems and new fault injection techniques. The GOOFI tool is highly portable between different host platforms since it was implemented using the Java programming language and all data is saved in an SQL-compatible database. Furthermore, an object-oriented approach was chosen, which increases the extensibility and maintainability of the tool. The current version of GOOFI supports pre-runtime Software Implemented Fault Injection (SWIFI) and Scan-Chain Implemented Fault Injection (SCIFI). The SCIFI technique injects faults via the built-in test logic, i.e. boundary scan-chains and internal scan-chains, present in many modern VLSI circuits.
  This enables faults to be injected into the pins and many of the internal state elements of an integrated circuit, as well as observation of the internal state. In pre-runtime SWIFI, faults are injected into the program and data areas of the target system before it starts to execute. GOOFI is capable of injecting single or multiple transient bit-flip faults.

5. Simulation-Based Fault Injection

Simulation-based fault injection [18] involves the construction of a simulation model of the system under analysis, including a detailed simulation model of the processor in use. It means that the errors or failures of the simulated system occur according to a predetermined distribution. The simulation models are developed using a hardware description language such as the Very high speed integrated circuit Hardware Description Language (VHDL). Faults are injected into VHDL models of the design and excited by a set of input patterns. It is important to note that VHDL is particularly well suited to the goals of fault injection for the following reasons:
• Its widespread use in detailed design.
• Its inherent hierarchical abstraction description capabilities.
• Its ability to describe both the structure and behavior of a system in a single syntactic framework.
• Its recognition as a viable framework for developing high-level models of digital systems.
• Its recognition as a viable framework for driving test activities.

An elementary fault injection experiment corresponds to one simulation run of the target system during which any number of faults can be injected at single or multiple locations of the model and at one or several points in time during the simulation. A series of experiments consists of a sequence of elementary fault injection experiments.

Several techniques have been proposed in the past to efficiently implement simulation-based fault injection. Two main categories can be identified: those that require modification of the VHDL code and those that use the built-in commands of the simulator. A first approach, based on VHDL code modification, modifies the system description either by adding dedicated fault injection components called saboteurs or by mutating existing component descriptions in the VHDL model, which generates modified component descriptions called mutants, so that faults can be injected where and when desired, and their effects observed, both inside and on the outputs of the system.

A saboteur is a component added to the VHDL model for the sole purpose of fault injection. It is inactive during normal system operation, while altering the value or timing characteristics of one or more signals when active, i.e. when a fault is being injected. Saboteurs are inserted, in series or in parallel, either interactively at the schematic editor level or manually/automatically directly into the VHDL source code. Serial insertion, in its simplest form, consists of breaking up the signal path between a driver (output) and its corresponding receiver (input) and placing a saboteur in between. In its more complex form, it is possible to break up the signal paths between a set of drivers and its corresponding set of receivers and insert a saboteur. For parallel insertion, a saboteur is simply added as an additional driver for a resolved signal (a signal that has several drivers, or signal sources, provided that a resolution function is supplied to resolve the values generated by the multiple sources into a single value). Saboteurs can be used to model most faults and to simulate environmental conditions such as noise or ESD. However, because they have no input pattern discrimination, saboteurs cannot model faults below the gate level of abstraction.

A mutant is a model which contains dormant code blocks within the normal gate description. These blocks of code are activated by injecting faults, altering the operation of the logic device itself. Because the fault response is generated internally within the model, any level of abstraction for fault injection is possible. However, the use of mutants requires that the original gate models be replaced by the new mutant models. This method's main advantage is its complete independence of the adopted simulator, but it normally provides very low performance, due to the high cost of modification and possibly recompilation for every fault.

A second approach uses modified simulation tools (built-in commands of the VHDL simulators), which support the injection and observation features. This approach normally provides the best performance (it does not require the modification of the VHDL code), but it can only be followed when the code of the simulation tools is available and easily modifiable, e.g., when fault injection is performed on zero-delay gate-level models. Its adoption when higher-level descriptions (e.g., RT-level VHDL descriptions) are used is much more complex. The applicability of these techniques depends strongly on the existing (commercial) simulators and on the functionality of their commands. Two techniques based on the use of simulator commands have been identified: VHDL signal manipulation (faults are injected by altering the value of the signals that link the components making up the VHDL model; this is done by disconnecting a signal from its driver(s) and forcing it to a new value) and VHDL variable manipulation (faults are injected into behavioral models by altering the values of variables defined in VHDL processes).
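The difference between signal manipulation and variable manipulation can be illustrated with a minimal sketch. It is written in Python rather than as VHDL simulator commands, and everything in it is an assumption made for illustration: the 4-bit counter model, the names `Counter`, `simulate` and `diverging_cycles`, and the choice of fault times. It is not taken from any of the tools surveyed here. The attribute `count` plays the role of a process variable, and the returned trace plays the role of the signal `q` as seen by its receivers.

```python
class Counter:
    """Toy behavioral model of a 4-bit counter.

    'count' stands in for a variable declared inside a VHDL process.
    """

    def __init__(self):
        self.count = 0

    def clock(self):
        # One clock cycle: update the internal variable and
        # return the value driven onto the output signal q.
        self.count = (self.count + 1) & 0xF
        return self.count


def simulate(cycles, forced=None, var_flip=None):
    """Run one elementary experiment and return the trace of signal q.

    forced:   dict {cycle: value}; signal manipulation. In those cycles q is
              disconnected from its driver and forced to the given value.
    var_flip: tuple (cycle, bit); variable manipulation. One bit of the
              internal variable 'count' is flipped before that cycle.
    """
    forced = forced or {}
    dut = Counter()
    trace = []
    for t in range(cycles):
        if var_flip is not None and var_flip[0] == t:
            dut.count ^= 1 << var_flip[1]  # corrupt the process variable
        driven = dut.clock()               # value produced by the driver
        q = forced.get(t, driven)          # forced value overrides the driver
        trace.append(q)
    return trace


def diverging_cycles(golden, faulty):
    """Cycles in which the faulty run differs from the fault-free run."""
    return [t for t, (g, f) in enumerate(zip(golden, faulty)) if g != f]


if __name__ == "__main__":
    golden = simulate(6)
    forced_run = simulate(6, forced={2: 0})
    flipped_run = simulate(6, var_flip=(2, 3))
    print("golden :", golden)
    print("forced :", forced_run, diverging_cycles(golden, forced_run))
    print("flipped:", flipped_run, diverging_cycles(golden, flipped_run))
```

In this toy model, forcing the signal corrupts only what the receivers observe while the force is active, whereas flipping a bit of the internal variable changes the state of the model, so the error persists in later cycles. Comparing each faulty trace with the fault-free (golden) trace is the observation step of the experiment.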
A third approach relies on the simulation command language and interface provided by some specific simulator. The main advantage of this approach lies in the relatively low cost of its implementation, while the obtained performance is normally intermediate between those of the first and second approaches. It must be noted that it is now increasingly common for the new releases of most commercial simulation environments to support some procedural interface, thus allowing an efficient and portable interaction with the simulation engine and with its data structures. Several approaches have been presented for speeding up the simulation process. Fault injection techniques are compared in terms of fault modeling capacity, effort required for setting up an experiment, and simulation time overhead.

Mutants offer the highest fault modeling capacity of the fault injection techniques presented. Saboteurs are generally less powerful, signal manipulation is suited for implementing simple fault models, and variable manipulation offers a simple way of injecting behavioral faults.

The effort for setting up an experiment is small using signal and variable manipulation, as modification of the VHDL model is not required. More effort is needed for mutants and saboteurs (creation/generation, inclusion in the model, recompilation of the VHDL model).

The simulation time overhead imposed by signal and variable manipulation is only due to fault injection control, as the simulation must be stopped and started again for each fault injected. It is important to note that the simulation time overhead imposed by saboteurs and mutants depends on the amount of additional generated events, the amount of code to execute per event, and the complexity of the fault injection control.

When considering a series of fault injection experiments, two ways can be distinguished. One way is to generate a new configuration for each fault location (this requires recompilation of the VHDL model for each fault location and may also require manual intervention to start up a simulation using the new model). Another way is to generate only one configuration in which all required faults are included and then activate these one at a time (this may increase the simulation time). Thus, there is a trade-off between the overhead in simulation time and the overhead in compilation time.

Suppositions:

The model is an accurate representation of the actual system under analysis.

Benefits:
• Simulated fault injection can support all system abstraction levels: electrical, logical, functional, and architectural. It provides the maximum flexibility in terms of supported fault models.
• Not intrusive.
• Full control of both fault models and injection mechanisms.
• Low cost computer automation; does not require any special-purpose hardware.
• It provides timely feedback to system design engineers.
• Fault injection experiments are performed using the same software that will run in the field. Simulated fault injection can normally be rather easily integrated into already existing design flows.
• Maximum amount of observability and controllability. Essentially, given sufficient detail in the model, any signal value can be corrupted in any desired way, with the results of the corruption easily observable regardless of the location of the corrupted signal within the model. This flexibility allows any potential failure mode to be accurately modeled.
• Allows performing reliability assessment at different stages in the design process, well before a prototype is available.
• Able to model both transient and permanent faults.
• Allows modeling of timing-related faults since the amount of simulation time required to inject the fault is effectively zero.

Drawbacks:
• Large development efforts.
• Time consuming (experiment length): it is based on the simulation of the system in its fault-free version as well as in the presence of the enormous number of possible faults.
• Models are not readily available; results rely on model accuracy.
• Accuracy of the results depends on the quality of the model used.
• No real-time fault injection is possible in a prototype.
• The model may not include any of the design faults that may be present in the real hardware.

Tools:
• VERIFY (VHDL-based Evaluation of Reliability by Injecting Faults Efficiently): Developed at University of Erlangen-Nurnberg, Germany [27]. VERIFY uses an extension of VHDL for describing faults correlated to a component, enabling hardware manufacturers, which provide the design libraries, to express their knowledge of the fault behavior of their components. Multi-threaded fault injection, which utilizes checkpoints and comparison with a golden run, is used for faster simulation of faulty runs. The proposed extension to the VHDL language is very interesting but unfortunately
  requires modification of the VHDL language itself. VERIFY uses an integrated fault model; the dependability evaluation is very close to that of the actual hardware.
• MEFISTO-C: A VHDL-based fault injection tool developed at Chalmers University of Technology, Sweden [12] that conducts fault injection experiments using VHDL simulation models. The tool is an improved version of the MEFISTO tool, which was developed jointly by LAAS-CNRS and Chalmers. (A similar tool called MEFISTO-L has been developed at LAAS-CNRS.) MEFISTO-C uses the Vantage Optimum VHDL simulator and injects faults via simulator commands into variables and signals defined in the VHDL model. It offers the user a variety of predefined fault models as well as other features to set up and automatically conduct fault injection campaigns on a network of UNIX workstations.
• HEARTLESS: A hierarchical register-transfer-level fault simulator for permanent and transient faults, developed by the CE Group at BTU Cottbus in Germany to simulate the fault behavior of complex sequential designs like processor cores [25]. Furthermore, it serves for the validation of on-line test units for embedded processors. HEARTLESS accepts structural VHDL and ISCAS as input formats. It supports permanent stuck-at faults, transient bit-flips and delay faults. HEARTLESS was developed in ANSI C++. The whole design or parts of it (macros) can be selected for fault simulation based on fault list generation. Fault lists are collapsed according to special rules derived from the logic-level structure and signal traces. HEARTLESS can be enhanced by propagation over macros described in a C function.
• GSTF: A VHDL-based fault injection tool developed by the Fault Tolerance Systems Group at the Polytechnic University of Valencia, Spain [4]. This tool is presented as an automatic and model-independent fault injection tool for use on an IBM-PC or compatible system to inject faults into VHDL models (at gate, register and chip level). The tool has been built around a commercial VHDL simulator (V-System by Model Technology) and can implement the main injection techniques: simulator commands, saboteurs and mutants. Both transient and permanent faults, of a wide range of types, can be injected into medium-complexity models. The tool can inject a wide range of fault models, surpassing the classical models of stuck-at and bit-flip, and it is able to analyze the results obtained from the injection campaigns in order to study the error syndrome of the system model and/or validate its fault-tolerance mechanisms.
• FTI (Fault Tolerance Injection): Developed at Universidad Carlos III de Madrid in Spain, for fault-tolerant digital integrated circuits at the RT abstraction level [11]. The main objective of FTI is to generate a fault-tolerant VHDL design description. The designer provides an original VHDL design description and some guidelines about the type of fault-tolerant techniques to be used and their location in the design. The FTI tool processes the original VHDL descriptions by automatic insertion of hardware and information redundancy. Therefore, a unified format to deal with descriptions is needed. There are several intermediate formats that represent, by means of a database, the VHDL description in a formal way that can be accessed and processed through some procedural interface. Fault-tolerant components to be included in the original VHDL descriptions are already described and stored in a special library called the FT library. These components come from previous research on fault tolerance, and the designer just uses them. FTI uses an intermediate format for VHDL descriptions (FTL/TAURI), and it works only with synthesizable IEEE 1076 descriptions.
• [24, 28] present new techniques and a platform, developed at Politecnico di Torino, Italy, for speeding up simulation-based fault injection in VHDL descriptions, and show how simulation time can be significantly shortened. The techniques analyze the faults to be injected in order to identify the final fault effects as early as possible, and exploit the features provided by modern commercial VHDL simulators to speed up injection operations. The ideas proposed in [23] were extended by making them more general and applying them dynamically during fault injection campaigns. The purpose of this approach is to minimize the time required for performing fault injection campaigns. This problem is addressed by performing fault analysis (before and during the fault injection campaign) and resorting to simulator commands that can be used to minimize the simulation time required to drive the system to the injection time. A prototypical version of the fault injection platform has been written in ANSI C and consists of about 3,000 lines. Circuit analysis exploits FTL Systems' Tauri (a new version of the fault injector will be closely tied to Auriga), fault-list generation takes advantage of the Synopsys VHDL simulator, while the fault injector is currently based on the ModelSim software.
• [10] presents a fault injection technique, developed at the University of Virginia, USA, that allows faults to be injected at the ISA (Instruction Set Architecture) level, where actual machine code is executed on a behavioral model of a processor written in VHDL. The idea of this technique is based on the use of a Bus Resolution Function (BRF) and the ability to communicate to the BRF when a fault is to be injected. This allows the BRF to corrupt the new value being assigned to a signal. A BRF is a
  function associated with a signal type that will           the targeted faults can be injected into the prototype.
  resolve the value of a signal declared to be of said       This is sometimes called “instrumenting” the circuit
  signal type when the signal is being updated by two        description. As previously mentioned, the emulator
  different sources at the same real time. This              characteristics can preclude generating a single
  technique can be used with existing models with            instrumented description allowing injecting all the
  minimal changes to the existing code and it uses           targeted faults. This may be due to the limited number
  standard VHDL types to perform the fault injection         of available I/Os, or to the amount of hardware
  (it is simulator-independent method). The                  overhead induced by the logic elements added in the
  simulation time is reduced because the level of            circuit for fault injection. In that case, each version of
  modeled detail is reduced. However, this method is         the instrumented description targets a given subset of
  limited to processor fault-injection modeled in the        faults and has to be separately synthesized, placed,
  ISA level.                                                 routed and downloaded onto the emulator at different
                                                             phases of the injection campaign.
6. Emulation-Based Fault Injection                              To avoid any instrumentation of the circuit
                                                             description, another approach, called run-time
To cope with the time limitations imposed by                 reconfiguration emulation-based fault injection has
simulation and take into account the effects due to the      proposed in [21]. Instead of injecting the faults by
circuit environment in the application, in system            means of specific external signals controlling
emulation using hardware prototyping on FPGA-based           additional logic, these approaches rely on built-in
logic emulation systems has been proposed [9, 20].           reconfiguration capabilities of the FPGA devices. This
The circuit to analyze is implemented onto the FPGA          means that some run-time reconfiguration has to be
using a classical synthesis, placement and routing           done for each fault to inject; however, this avoids the
design flow starting from the high-level circuit             extra time spent in preparing the instrumented
description. The development board is connected to a         versions. The bit stream modification necessary to
host computer, used to define the fault injection            perform the reconfigurations is a very quick process
campaign, control the injection experiments and              compared for example with synthesis. Also, the
display the results.                                         reconfiguration time globally spent when running a
    In some limited cases, the approaches developed for      fault injection campaign on the hardware emulator
fault grading using emulators (for example [7]) may be       (FPGA) can be reduced by means of a partial
used to inject faults. However, such approaches are          reconfiguration of the emulator when such capabilities
classically limited to stuck-at fault injection. In most     are available.
cases, modifications must therefore be introduced in            The initial VHDL description is therefore
the circuit description taking into account that the         synthesized, placed & routed and a bit file is generated,
description must remain synthesizable and satisfy a          corresponding to the targeted circuit without any
set of constraints related to the emulator hardware. The     additional elements. The generated file is downloaded
modifications are therefore not easy and furthermore it      onto the FPGA and the injection campaign begins by
is often necessary to generate several modified              an execution of the studied workload (or test bench) on
descriptions, each of them allowing the injection of a       the implemented prototype. The result of this execution
given subset of faults. In such a case, the hardware         is later used as reference for analyzing the effects of
emulator has in general to be completely reconfigured        faults. Then, the same workload is run again as many
several times, which is quite time-consuming and reduces     times as there are faults (or fault configurations) to
the gain in execution time compared with simulation. It      inject. Run-Time Reconfiguration (RTR) has been
also implies additional synthesis, place and route           proposed as a technique to inject the faults. This
phases since the whole design flow has to be executed        methodology proposes to inject the faults at
for each modified description.                               “low-level”, directly in the reconfigurable hardware, by
    FPGAs have already been used to accelerate fault-        modification of the design previously implemented in
injection in a number of cases. In general, these            the FPGA. So any fault injection can be realized
approaches aim at using the high running speed of a          without changing the initial description and without
hardware prototype to reduce the fault injection             additional hardware. The first advantage is to avoid
experiment time with respect to simulations. New             any hardware overhead for fault injection, which may
methodologies were also introduced combining                 allow the designer to perform the emulation on a
hardware-based and software-based techniques in              smaller FPGA. Also, carrying out the modifications
order to exploit the speed of hardware-based                 directly in the reconfigurable device can take only a
techniques and at the same time take advantage of the        fraction of a second if partial reconfiguration can be
flexibility of software-based techniques. In general,        achieved. So noticeable time gains can be expected
additional control inputs and specific elements are          with respect to “classical” fault injection techniques,
introduced by modifying either the initial high-level        although a reconfiguration is required for each fault
circuit description or the gate-level description so that    configuration to inject. Extra time is then needed in
each fault injection experiment, since a partial read-back and a partial
reconfiguration are needed to inject a fault. This extra time can, however,
be relatively low compared with a classical simulation cycle time.
   Noticeable gains can be expected compared with simulation-based injection
experiments, provided that the configuration of the FPGA is quick enough.
This requires optimizing several implementation characteristics:
• Intrinsic reconfiguration time of the reconfigurable device (related to
  its architecture and to the place and route algorithms used); a good
  solution would be to use a device not only with partial reconfigurability
  but also with some kind of random access to the configuration data.
• High configuration bandwidth on the development board (high frequency
  configuration clock and/or configuration data sent in parallel mode onto
  the FPGA).
• High bandwidth interface between the development board and the host
  computer.
In conclusion, let us summarize the advantages and disadvantages of this
technique.

Benefits:
• Injection is faster than with simulation-based techniques.
• Possibility of in-system emulation, allowing the designer to evaluate much
  more precisely the behavior that can be expected in the final circuit
  environment.
• The technique is especially interesting in the context of system-on-chip
  development, since it may lead to efficient but low-cost dependability
  analysis of re-usable components (most often called IP blocks) before they
  are used in a given circuit.
• The experimentation time can be reduced by implementing the input pattern
  generation partially or totally in the FPGA. These patterns are already
  known when the circuit to analyze is synthesized.

Drawbacks:
• The initial VHDL description must be synthesizable and optimized, to avoid
  requiring too large and costly an emulator and to reduce the total running
  time of the injection campaign.
• The cost of a general hardware emulation system and/or the implementation
  complexity of a dedicated FPGA-based emulation board. A low cost can be
  reached, but at the expense of a slower fault injection campaign.
• The emulation is only used to analyze the functional consequences of a
  fault; the temporal impact of the faults is not considered. Only the steady
  states of the signals at some particular moments (in general just before
  the rising and/or falling edge of the clock) are observed.
• Since algorithmic descriptions are not yet widely accepted by synthesis
  tools in classical industrial design flows, the emulation approach can
  often only be applied starting from RT-level descriptions.
• I/O problems: when using an FPGA-based development board, the main
  limitation becomes the number of I/Os of the programmable hardware that
  can be connected between the FPGA and the host computer, which can
  restrict the number of fault injection signals and the number of monitored
  signals.
• Necessity of a high-speed communication link between the host computer and
  the emulation board: this is the actual critical part of the emulation
  set-up.

7. Hybrid Fault Injection

A hybrid approach combines two or more of the other fault injection
techniques to more fully exercise the system under analysis. For instance,
performing hardware-based or software-based fault injection experiments can
provide significant benefit in terms of the time needed to perform the fault
injection experiments, can reduce the initial amount of setup time before
beginning the experiments, and so forth. The hybrid approach combines the
versatility of software fault injection and the accuracy of hardware
monitoring. The hybrid approach is well suited for measuring extremely short
latencies. However, given the significant gain in controllability and
observability with a simulation-based approach, it might be useful to combine
a simulation-based approach with one of the others in order to more fully
exercise the system under analysis. For instance, most researchers and
practitioners might choose to model a portion of the system under analysis,
such as the Arithmetic and Logic Unit (ALU) within the microprocessor, at a
very detailed level, and perform simulation-based fault injection experiments
because the internal nodes of an ALU are not accessible using a
hardware-based or software-based approach.

Tools:

• LIVE: Experimental evaluation of computer-based railway control systems,
  developed at Ansaldo-Cris, Italy. It integrates fault injection and
  software testing techniques to achieve an accurate and non-intrusive
  analysis of a system prototype [2]. It uses pin-level forcing or generates
  interrupts to activate software fault injection procedures.
• A method combining software-based and simulation-based fault injection was
  developed at Chalmers University of Technology, Sweden [15]. This hybrid
  fault injection technique, also known as mixed-mode fault injection, allows
  the advantages of both SWIFI (Software
184                                          The International Arab Journal of Information Technology, Vol. 1, No. 2, July 2004
  Implemented Fault Injection tool) and simulation-based fault injection to
  be utilized, i.e. the actual target system may be executed at full speed
  except during the injection of a fault, when a simulator providing detailed
  access to the target system is used instead. The technique is combined with
  operational-profile-based fault injection, which only injects faults in
  those parts (e.g. registers) which contain live data, i.e. which will not
  be overwritten.

8. Conclusions

Recent years have seen a growing demand for new techniques to be applied in
the design of fault tolerant electronic systems, and for new tools to support
the designers of these systems. The increased interest in the design of fault
tolerant electronic systems stems primarily from the extension of their use
to many new areas. At the same time, the cost and time-to-market minimization
constraints obviously affect the design of fault tolerant systems, and new
techniques and new tools are continuously needed to face these constraints.
   Fault injection is an important technique for the evaluation of design
metrics such as reliability, safety and fault coverage. Fault injection
involves inserting faults into a system and monitoring the system to
determine its behavior in response to the fault.
   In this paper we have described several efforts that have been made to
develop techniques for injecting faults into a system prototype or model.
These techniques fall into five categories: hardware-based fault injection,
software-based fault injection, simulation-based fault injection,
emulation-based fault injection and hybrid fault injection. In Table 1, we
summarize the main advantages and disadvantages of these techniques.
   Most recent research in this area is converging towards hybrid fault
injection combining the benefits of both hardware and software fault
injection techniques, while avoiding most of their disadvantages. This is
becoming feasible due to the latest advancements in FPGA technology. Modern
FPGA devices can be fruitfully exploited to emulate systems composed of
hundreds of thousands of gates at a reasonable cost.
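
As an illustration of the campaign flow described in this survey (one fault-free reference run, then the same workload run once per fault and compared against the reference), the following minimal Python sketch may help. It is not taken from any of the surveyed tools: the toy accumulator circuit, the Fault record, the function names and the three outcome labels are all assumptions made for this example, and a software model stands in for the FPGA or target system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

WIDTH = 8                  # register width of the toy circuit (assumption)
MASK = (1 << WIDTH) - 1


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record: one transient bit flip in the state register."""
    cycle: int  # workload cycle at which the flip is injected
    bit: int    # index of the register bit to flip


def run_workload(inputs: List[int], fault: Optional[Fault] = None) -> List[int]:
    """Run the toy circuit and return the register value observed each cycle.

    The circuit is an accumulator that is cleared whenever the input is 0.
    """
    acc = 0
    trace = []
    for cycle, value in enumerate(inputs):
        if fault is not None and cycle == fault.cycle:
            acc ^= 1 << fault.bit          # inject the bit flip into the state
        acc = 0 if value == 0 else (acc + value) & MASK
        trace.append(acc)
    return trace


def classify(golden: List[int], observed: List[int]) -> str:
    """Compare a faulty run with the reference run (labels are assumptions)."""
    if observed == golden:
        return "silent"      # the fault never reached the observed signal
    if observed[-1] == golden[-1]:
        return "recovered"   # an error was visible but was later overwritten
    return "failure"         # the final result is wrong


def run_campaign(inputs: List[int], faults: List[Fault]) -> Dict[Fault, str]:
    """Reference run first, then one run of the same workload per fault."""
    golden = run_workload(inputs)
    return {f: classify(golden, run_workload(inputs, f)) for f in faults}


if __name__ == "__main__":
    workload = [3, 5, 0, 7, 9]
    campaign = [Fault(cycle=1, bit=0), Fault(cycle=2, bit=3), Fault(cycle=4, bit=1)]
    for fault, outcome in run_campaign(workload, campaign).items():
        print(fault, outcome)
```

In a real emulation-based set-up, the call to run_workload with a fault would correspond to reconfiguring the device (or driving the injection signals) and re-running the workload on the prototype; only the reference-then-compare structure is meant to carry over.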