
M. Y. Hsiao
W. C. Carter
J. W. Thomas
W. R. Stringfellow

Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress

Computer systems have achieved significant progress in the areas of technology, performance, capability, and RAS (reliability/availability/serviceability) during the last quarter century. In this paper, we shall review the advances of IBM computer systems in the RAS area. This progress has for the most part been evolutionary; however, in some cases it has been revolutionary. RAS developments have been driven primarily by technological advances and by increases in functional capability and complexity, but RAS considerations have also played a leading role and have improved technological and functional capability. The paper briefly reviews the progress of computer technology. It points out how IBM has maintained or improved its systems RAS capabilities in the face of the greatly increased number of components and system complexity by improved system recovery and serviceability capability, as well as by basic improvements in intrinsic component failure rate. The paper also covers the CPU, tape, and disk areas and shows how RAS improvements in these areas have been significant. The main objective is to provide a comprehensive view of significant developments in the RAS characteristics of IBM computer systems over the past twenty-five years.

Introduction and general concepts
Reliability is a measure of the consistency with which a system successfully provides its specified services. Serviceability is a measure of the ease with which the system is restored to its specified state. Availability is the percentage of the time during which the system is providing that specified service [1]. The characteristics of and the effect on the system with regard to these three interrelated quantities are referred to as the system RAS.

The central issue in designing systems with good RAS characteristics is recovery: reduction of fault occurrence, detection and counteraction of errors [2], and efficient repair procedures. Recovery implies resumption of operation with data integrity. Figure 1 illustrates the basic relationship between faults and system RAS for a unified hardware/system of RAS. In the center circle, system faults may be caused by the intrinsic device failure rate, by design faults, or by outside interference. When the fault causes an error, the first line of defense is error detection, followed by error correction (usually with error-correction codes) or by retry. If the erroneous effect of the fault no longer exists, operation continues without repair in the reliable state. If these mechanisms do not work, the effects of the error propagate to a subsystem, and error recovery usually proceeds using an error-recovery program, with deletion of the offending subsystem. If the error cannot be contained within a subsystem, other methods of correction are needed, possibly with human intervention. However, the system still provides some service. Finally, the system may not be able to proceed at all and immediate repair is necessary. In this case, serviceability is important for efficient restoration of service.
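
The definition of availability given above (the percentage of time during which the system provides its specified service) can be made concrete with a short Python sketch. The function name and the sample figures are assumptions chosen for illustration and do not come from the paper.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of total time during which the system delivered its specified service."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must lie between 0 and total_hours")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# Hypothetical month: 720 scheduled hours, of which 9 were lost to failures and repair.
print(availability_percent(720, 9))  # prints 98.75
```

In this reading, better reliability lowers the number of outages and better serviceability shortens each one; both reduce the downtime term.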

Copyright 1981 by International Business Machines Corporation. Copying is permitted without payment of royalty provided that (1)
each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page.
The title and abstract may be used without further permission in computer-based and other information-service systems. Permission
to republish other excerpts should be obtained from the Editor.

IBM J. RES. DEVELOP. • VOL. 25 • NO. 5 • SEPTEMBER 1981 M. Y. HSIAO ET AL.

Figure 1 System reliability, availability, and serviceability. (Surviving figure label: "User sees uncorrectable error".)

In the following sections of this paper, we will first briefly examine the trends of hardware technology and their effect on RAS. Then we shall treat system RAS and show how the basic ideas just described have been implemented. We will then examine the CPU and discuss the special RAS features that have been added. We will also consider magnetic storage and show how the reliability of tape and disk devices has been markedly improved. Although this paper includes the more significant advancements and trends associated with RAS in IBM computer systems, it is not intended to be a complete history or to include the comprehensive progress of all systems, boxes, and devices. For example, the RAS advances of the IBM Federal Systems Division's products/systems are not covered here. The papers by James [3] and by Olsen and Orrange [4] touch on this area, while that by Jarema and Sussenguth [5] includes various error detection/correction codes as well as checking and diagnostic features of data communications systems.

Technology, system, and service trends
During the last twenty-five years, the computer industry has made tremendous progress in its technology and functional capability. Technological progress not only offers cost and performance improvement, but also provides significant advances in RAS. In the past decade, progress in technology from discrete components to LSI has been so rapid that it is frequently referred to as a revolution.

The earliest commercially available products for data processing used vacuum tubes to perform logic, and these products packaged one or more vacuum tubes along with associated passive components in a field-replaceable unit (FRU). Logic functions were performed entirely by the vacuum tubes. As product architectures increased in complexity, germanium diodes replaced the vacuum tubes for much of the logic switching. The FRUs increased in size and housed several vacuum tubes along with associated diode switches and passive components.

Maintenance for these systems followed the tradition of the time in isolating faults. The customer engineer (CE) was expected to have a comprehensive knowledge of the logic of the system as well as an understanding of the logic circuits. He was provided a complete set of logic diagrams and detailed electrical diagrams for each of the FRUs. Since there was minimal hardware checking in these early products, the errors usually had to be recreated with exercise programs so that fault isolation could take place. Two methods of fault isolation were used; the first was usually substitution of a spare FRU for the suspected unit. If this was unsuccessful, the use of logic diagrams and an oscilloscope for signal tracing was invoked to the failing condition. The most difficult problems to resolve were those that could not be recreated and would only occur during the customer's operation. Analysis depended on the ingenuity of the CE along with a very detailed knowledge of the product and the customer's application.

The vacuum-tube FRU was usually repaired in the field by either replacing the vacuum tube or one of the passive components. The use of transistors initiated the pluggable printed-circuit card as the new field-replaceable unit. For reliability reasons, no sockets were provided for any of the components on these cards, including the transistors; therefore, the CE was no longer expected to repair the FRUs. This was the first step in a trend that was continued with each new technology generation. The separation between the logical elements and the CE has widened with each significant increase in logic density. The early transistor technologies were made up entirely of discrete components housed on printed cards that were plugged into a back panel and interconnected by means of wire-wrapped connections. As a result of the higher density and lower cost of semiconductors, error checking was becoming more prevalent, and we began to see use of error-correcting codes in our most powerful system, the 7030 (Stretch).

Accompanying these improvements was a significant reduction in the failure rates of the logic and storage elements. These trends are shown in Figs. 2(a) and (b). The
reduction in the number of interconnections required contributed significantly to this reliability improvement; the environment in which the semiconductors reside is more protective as levels of integration increase.

Serviceability is enhanced by the inclusion of low-cost error-detection hardware, which becomes economically feasible with higher levels of integration. With error detection and logout, error re-creation is avoided; this is an important advantage in the fault-isolation process. Fault isolation is further enhanced by increases in the number of circuits contained in the FRUs. The number of logic elements on the FRU has gone from one circuit to ten thousand on current products. Replacement of one FRU has a higher probability of a successful repair as the number of FRUs making up a product decreases.

Figure 2 Intrinsic failure rate improvement trends for IBM technology. (a) Logic circuit reliability, where the per-circuit percent failure rate is given per 10' hours. The numbers noted on the curve refer to the number of circuits per chip. (b) Bipolar memory reliability, where the per-bit percent failure rate is given per 10' hours and the numbers noted on the line refer to the number of bits per chip (K = 1024). (Horizontal axis in both panels: year, 1965 to 1985.)

Higher levels of integration and larger numbers of logic elements on FRUs have decreased the number of test points available. As the availability of test points has decreased, using an oscilloscope and logic diagrams to understand and verify the operation of a FRU has become impractical. This has minimized the need for discretionary judgment on the part of service personnel and has shifted greater responsibility to the product developer to provide a highly serviceable product.

Technological advances have made practical the use of service processors either integrated into the product or as portable units brought to the product by service personnel. These are highly flexible minicomputers that are connectable via a service interface on the using product. Use of these service processors can significantly enhance the effectiveness of on-site personnel. It is also possible to provide similar visibility by way of a teleprocessing link into the system from a remote support capability, allowing a more experienced person or an engineer to assist in the solution of a problem without the delay of travel.

In the early 1960s, the IBM Field Engineering Division, while taking into account the growing complexity of hardware and software, concluded that a technical data access system could be of significant value. It was proposed that much of the rediscovery time by the CE could be saved if data files were compiled on difficult hardware and software problems and their associated symptoms. This would be accomplished by way of a terminal and teleprocessing network with access to a central data bank. Access to the network was to be made available to technical information center personnel contacted by CEs with unresolved problems. Using the product type as an index, it was possible for the information center people to access all of the symptom faults in the data bank for the product in question. This system was to be called Remote Technical Access Information Network (RETAIN).

At the same time that the planning and development work on RETAIN had been progressing, experimentation with technical instruction via the computer had also been underway. It offered advantages in rapid course update and evaluation. This project was to be called Computer Assisted Instruction (CAI). In 1965, both of these projects were initiated in selected locations on an experimental basis. By September 1967, the two applications were integrated under a single software system and supported 112 terminals.

The use of a remote data bank increased in importance as a maintenance aid. A test center for teleprocessing products was brought on line in 1969. This center was developed to provide the CE with a fast, efficient means of testing and verifying the performance of teleprocessing products by providing an alternate host capability. This allowed "off-line" maintenance of teleprocessing equipment without interruption of other data processing operations.

Two important improvements were made available in 1970, in a new version of RETAIN. A capability for structuring a search argument for a particular product symptom was developed to provide more efficient and rapid data location. This capability reduced the teleprocessing data loading for each problem inquiry. The second improvement was the provision for a data link between the customer's system and the RETAIN system. This data link would now allow a specialist at a remote location to operate the customer's system while utilizing the full library of diagnostic tests available at the customer's site.
Results of the diagnostic runs could be displayed on a terminal for analysis by the specialist, with the CE maintaining telephone contact to assist in solving the problem. By 1976, there were a series of on-line data systems in various locations around the world providing RETAIN services for our CEs on a worldwide basis. In the following year, a worldwide data-link capability was provided.

Progress in CPU and system RAS techniques

• System operation in the presence of faults
The basic treatment of faults is shown in Fig. 3. In a computer system, faults are caused by the intrinsic hardware failure rate, by external influences, by design errors (timing, circuit input pattern sensitivity, circuit overloading, mistakes in logic, programs, etc.) and by operator miscues. The first obvious step for good RAS is to reduce the occurrence of faults. If the fault causes an error, the first step in recovery is detection of the error. If the error is not detected, computer system operation continues and the erroneous results propagate. Information is destroyed or modified erroneously and must be corrected at some time in the future.

Figure 3 CPU recovery from faults. (Surviving figure labels: intrinsic failure rate; electromagnetic radiation, RF interference, etc.)

When an error is detected, the procedure is to perform information-damage assessment, counteract the effects of the fault, isolate it, and treat it properly. This treatment may range from ignoring the fault to immediate repair. Finally, system recovery must be effected and system operation resumed. As technologies and the use of computer systems changed, the necessary technical innovations needed for successful operation in the presence of faults were implemented in IBM systems.

• Very early computer systems (early 1950s)
In very early systems, each IBM installation had CEs either resident or on quick call. Error detection was normally delayed, and the CEs isolated system faults through re-creation of the failure. The CE then used the logic drawings of the system, detailed electrical diagrams of the pluggable units, simple diagnostic programs, and the console and an oscilloscope to complete his analysis [6].

At this time, the concept of using the computer to test itself and help locate faults was born. The idea of having specialists prepare a general set of tests to try to ensure the absence of faults (test routines) and to locate faults (diagnostics) was proposed and implemented [6]. However, re-creating the failure symptoms for a wide variety of cases proved to be unexpectedly difficult. The procedures were temporarily successful because of the ability and dedication of the CEs, but better technology was clearly needed. Error detection was needed since considerable computer time was wasted using invalid data and diagnosing computer faults. The first diagnostic routines, while making an improvement, were not adequate because error re-creation using software was difficult. The idea of backward error recovery, specifically checkpoint-restart, was invented early on. Enough information to restart the program was stored on an external medium and, after an error was discovered, the program could be resumed at such a point. However, this also was satisfactory only for the less complex programs and systems.

• Early computers for defense
Early computers designed for defense applications required good RAS and, because they were constructed from relatively unreliable components, were carefully self-checked. An impressive example is the AN/FSQ-7 (SAGE) computer [7], used for real-time processing of radar scans for air defense as well as for potential missile launching. It was designed with fault-tolerant features to achieve high availability and accuracy. The central processor (still using vacuum tubes) contained one operating computer and a second acting as backup. Each computer had elaborate fault-detection schemes using parity and software testing and diagnostic programs. The I/O had standby spares. The standby computer executed self-tests frequently and its memory was periodically updated with the current state of the operating computer. After error detection, the switchover and recovery were executed by software. This effort showed that reliable operation could be attained, although the cost was high; and much valuable experience resulting in improved RAS technology was gained. Error detection was increased,
Table 1 RAS features of some early IBM computers (checking schemes, by machine and unit).

IBM 650
  Dataflow paths: 1. 2/5 code (single error detection)
  Control units: 1. clocking; 2. proper sequence of control signals; 3. duplicate circuitry; 4. accumulator interlock
  Arith. units: 1. bi-quinary codes; 2. sign-agreement checking; 3. correct complement add and true add
  Memory units: 1. 2/5 codes (SED)
  I/O units: 1. card reading; 2. card punch; 3. tape system parity check (BCD) codes
  Misc.: Program checks include: 1. invalid operation code; 2. invalid addresses; 3. overflow; 4. branch distributor check; 5. interlocks

IBM 7070
  Dataflow paths: 1. 2/5 code (validity check)
  Control units: 1. validity checks; 2. clocking
  Arith. units: 1. 2/5 code; 2. sign-agreement checking
  Memory units: 1. address checks; 2. data in and out are validity-checked
  I/O units: 1. BCD one-bit parity check; 2. card reading and punching; 3. two-gap head, dual-level sensing in tape system
  Misc.: Program checks include: 1. invalid operation code; 2. invalid addresses; 3. use of instruction counter for error routine

IBM 7030 (Stretch)
  Dataflow paths: 1. parity check
  Control units: 1. validity checks; 2. clocking; 3. parity checks on various register fields
  Arith. units: 1. modulo-3 residue check for floating point arithmetic; 2. parity
  Memory units: 1. modified Hamming code for single error correction and double error detection
  I/O units: 1. 720 tape, bit error detection; 2. 7302 disk usage SEC-DED code
  Misc.: 1. some duplication in circuitry
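
Two of the checking schemes named in Table 1 are simple enough to sketch in a few lines of Python: the 2-out-of-5 validity check and the modulo-3 residue check. This is a minimal illustration only. The function names are hypothetical, the residue check is shown on integer addition rather than on the floating-point arithmetic of the 7030, and nothing here reproduces the actual IBM circuits.

```python
def is_valid_two_of_five(bits):
    """A 2-out-of-5 code word is valid only if exactly two of its five bits are 1."""
    return len(bits) == 5 and all(b in (0, 1) for b in bits) and sum(bits) == 2


def residue_check_add(a, b, reported_sum):
    """Modulo-3 residue check for addition.

    The residue of a correct sum equals the sum of the operand residues (mod 3),
    so a mismatch signals an arithmetic error.
    """
    return (a % 3 + b % 3) % 3 == reported_sum % 3


# A single-bit error changes the number of 1s, so the word becomes invalid.
word = [0, 1, 0, 0, 1]
damaged = word.copy()
damaged[0] ^= 1
print(is_valid_two_of_five(word))     # prints True
print(is_valid_two_of_five(damaged))  # prints False

# A single-bit error changes the result by a power of two, which is never a
# multiple of 3, so the residue check catches it.
print(residue_check_add(1234, 5678, 6912))      # prints True
print(residue_check_add(1234, 5678, 6912 + 4))  # prints False
```

Both checks detect errors without duplicating the circuit being checked. Neither can say which bit failed, which is where the correcting codes listed for the 7030 memory go further.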

redundant information was stored routinely for recovery, computer self-tests and diagnostics were improved, and the resulting serviceability was improved.

• Batch data processing
The first stored-program computers introduced by IBM in the field of commercial computer processing, the 701 and 702 systems, used vacuum tubes as the active devices and germanium diodes for logic switching. Field-replaceable units for these machines were made up of eight vacuum tubes and associated passive components. These pluggable packages aided in fault diagnosis since they were larger than previous pluggable packages.

When the 650 was introduced, more RAS features were included in its design (see Table 1). In addition, the 701/702 evolved to the 704/705 with floating-point arithmetic (and thus a more complex structure). To cope with this, extended CE education was begun. The first general RAS support attempted was the production of self-test routines and diagnostics to aid in re-creating the effect of errors in controlled situations, and to facilitate the logical analysis necessary to locate the fault that caused the error. These programs applied sequences of input vectors to logic circuits by loading registers accessible to the program and then applying standard computer instructions.

The early IBM work for AN/FSQ-7 emphasized the necessity for good diagnostics [8]. To test for possible faults the "start large" approach was used. The system was stressed as much as possible and difficult CPU operations were run concurrently. The idea was to obtain maximum fault coverage as quickly as possible. If a fault were indicated or if diagnostics were being run because of known difficulties, the "start small" approach was used. The simplest operations were performed and then tests using small incremental amounts of circuitry were run. Specialized routines were available for suspected units. These efforts were supported by increased CE educational efforts.

However, analysis indicated that software tests have serious deficiencies [9]. The first is control of the elements being tested. The fraction of circuits accessible to
direct control is small, so writing a software program to produce a desired test pattern is difficult. The second difficulty is the state-resolution problem: the interrogation of the results of the test-pattern application. Several instructions and many intermediate states and timing cycles have to be used. Finally, test coverage is unmeasurable, and control of the system of programs is difficult.

The 702 had parity checking in its memory [10]. The 705 also had parity checking for each character. This improved data integrity and also helped RAS because of the additional assurance such error detection gave to commercial data processing, which differs from scientific processing in that checking without duplication is difficult. These systems were followed by the 7090 and 7094, which benefited from improved components (transistors and magnetic cores) first introduced in the 7030 system (described next).

• Unique systems
The 7030 (Stretch) computer [11] used two innovations that have had a long-range impact on RAS. First, it used single-error-correction, double-error-detection (SEC-DED) codes for main storage, and parity, duplication, and modulo-3 codes for error detection. These characteristics allowed a single error in the memory to be automatically corrected and the fault to be located and later removed during a scheduled maintenance period. Another benefit was the ability to provide fault isolation, since the error-correction process requires identification of the failing bit. The second innovation was the storing on punched cards of the status of all processor latches immediately following the detection of an error. Table 1 compares various checking/correction techniques implemented in the IBM 650, 7070, and 7030 systems.

The 7030 computer was one of the earliest systems to use the standard modular system (SMS) cards, which were the field-replaceable units (FRUs). The 7030 also used transistors and magnetic cores as its logic and storage components, which provided a great advantage over tubes and electrostatic storage. The circuit logic was shown by printed logic drawings, produced by IBM's first design automation system. These drawings, one page per FRU, aided fault determination by error analysis.

The 7030 had these RAS innovations because it was the most complex (as well as the fastest) computer system built up to that time by IBM, and new concepts were needed to achieve the required reliability. Error detection made error analysis much easier, although there were still difficulties with producing good diagnostics. Re-creating the environment that produced the error from the fault was aided by the logout; however, this was awkward in its early form.

The SABRE system (American Airlines Electronic Reservations System) [12] did the first real-time processing for airline reservations using two IBM 7090s in on-line and standby roles supporting a network of terminals and communications processors. In order to make this system successful, not only was there redundant equipment, as in SAGE, but much work was done on improving the techniques available for the recovery programs.

• System/360
With improved capabilities emerging in both hardware and software, companies began to use computers interactively in their day-to-day affairs. At this point, it was becoming clear that in computer system RAS design primary emphasis should be placed on the error-detection capability of the system, since without this first step recovery is greatly hindered.

Error detection is required for all types of errors, whether caused by solid or intermittent faults, so that when an error is detected, a recovery technique can be invoked to ensure data integrity and availability of a valid system. Furthermore, when hardware checkers for error detection are designed and placed at a suitable location, the error indication will provide maximum capability for fault isolation, so that the failing FRU can be identified quickly and service can be accomplished in a minimum time. System/360 implemented many novel checking circuits [13], such as parity-predict adders, carry-dependent-sum adders, and two-rail logic checking.

Another new idea was to reduce diagnostic time and improve the precision of fault location by using fault-locating tests [14]. These tests used hardware to implement a process called scan-in, scan-out. Scan-in, scan-out allowed testing of combinational circuits by adding the hardware necessary to read a pattern from tape, insert the bits in latches, step the computer one or more cycles, then examine an output bit. Test patterns were generated by computer from circuit diagrams, using the single-stuck-fault model, and patterns sequenced using sequential testing techniques. One combinational logic step was satisfactory for the CPU, but the sequential nature of the channels required several steps: sequential scan [15].

These tests were controlled by the microprograms, and this led to the idea of microdiagnostics [16]; i.e., since the control and interrogation in a few cycles were done by the microprogram, the diagnostic programs were put in ROM [14]. This improved flexibility and allowed very good tests to be written for storage in System 360/50 as well as automatic testing of the System 360/40 CPU whenever the console START button was pressed. The System 360/30
used microdiagnostics for circuit-level tests and for tests to assist debugging. The System 360/25 was the first IBM CPU with loadable control store; this eliminated the storage space constraint for microdiagnostics.

System/360 computers were designed so that the logged-out information about CPU latch status was stored in main storage. This could then be analyzed by software routines and the pertinent information preserved or printed for the customer engineer. This idea was improved and joined with wide use of microdiagnostics for the later models of System/360 [17, 18].

A standard diagnostic monitor was written to control the software programs written at many locations, and interfacing with the standard supervision program was begun so that maintenance began to be integrated with the rest of the system functions. Special diagnostic hardware ("channel wraparound" and storage, channel and I/O control unit state latches accessible to CPU interrogation) was added [14].

Maintenance analysis procedures (MAPs) were introduced to aid the CE in his diagnostic task. These MAPs provided a step-by-step process to isolate the cause of a failure and tended to reduce the need to teach CEs how a product worked, permitting emphasis of a "how-to-fix" maintenance philosophy [19]. MAPs were first applied to I/O products and small systems.

With the advent of System/360 and its supervisor program (OS), IBM was trying to do the whole difficult job of recovery and multiprocessing for the first time. Based on practical experience, the seminal papers on recovery were written early [20, 21]. Ideas described in those papers have been expanded and incorporated into standard IBM products [22-24].

• System/370
The ideas used in System/360 were continued and improved with the introduction of System/370; the concept of a unified hardware/software system for RAS, as shown in Fig. 1 in the introductory section, has become a reality. A hierarchy of features now exist to aid recovery from intermittent and solid faults; these features are implemented in microcode, hardware, and software and they reside in the box, subsystem, and system levels.

• Error-correction codes for main memory
In 1968, a new class of single-error-correction, double-error-detection codes was invented by two IBM engineers [25]. These were called odd-weight-column codes [26]. With the same coding efficiency, this new class of codes made improvements over the standard Hamming code implementation in cost, performance, and reliability. Examples of these new codes were implemented in the IBM System 370/158 and 168, and later in the 3031, 3032, 3033, and 4300 series as well as in many other IBM systems. Since its first publication in 1970 [26], this class of codes has been widely used in the U.S. and abroad.

• Hardware retry
CPU instruction retry was first commercially implemented on the System 370/155 and 165 [17], and followed by the System 370/145. The techniques of checkpoint-restart were designed for a single instruction to overcome the effect of an intermittent error in the CPU. Data used by the instruction are stored at appropriate points during instruction execution, and the instruction progress is charted by a set of states. When an error is detected, the microcode interrogates the state to determine the last valid data, restores the operands from these data, and begins from the last valid state. A count is kept in case the error is caused by a permanent fault, in which case, when the count reaches a predetermined value, the retry is terminated and a permanent fault is signaled. Since the basic IBM patents in 1968 [27-29], the computer industry has widely implemented the instruction-retry mechanism.

A method of retrying catastrophic channel-check errors by OS/360 software augmented by hardware was implemented on the System 370/155 in 1971. A hardware and microcode channel/control unit retry was also introduced on the 370/155. When errors in a particular class, e.g., during the Direct Access Storage Device (DASD) SEEK time, were detected by a control unit, a unique ending response was sent to the channel. The channel recognized this and reissued the last channel command. This retry was transparent to the software since no interrupt was issued at the program level.

Studies in the early System/360 design revealed a number of occasions when a second channel interface or a DASD controller would allow both channel overloading and channel interface errors to be bypassed; therefore, in 1971 I/O path switching was introduced on the 2314 DASD family. Previous design had allowed sharing between processors; this was now applied to a single processor with multiple program-switchable I/O paths [30, 31].

Such abilities are part of the standard MVS supervisor [32], and the improved availability MVS offers stems mainly from its ability to automatically reconfigure hardware components.

• New diagnosis and testing techniques
For effective self-repair features, all faults must be efficiently located. In the early 1960s, K. Maling, M. Evans,
IBM J. RES. DEVELOP. • VOL. 25 • NO. 5 • SEPTEMBER 1981 M. Y. HSIAO ET AL.

and others under R. J. Preiss' direction wrote a test generator and deductive fault simulator program that was used to generate test patterns for the larger models of System/360 [15]. These programs were heuristic and not completely satisfactory. In 1966, Roth published a basic paper showing a new technique called the D-calculus for generating test patterns [33]. The importance of this technique was quickly recognized [34] and has been widely used to find test patterns for combinational circuits since. Another technique, using the Boolean difference [35] for test-pattern generation, has been an important tool for analyzing testing theory.

It was with System/370 that the idea of an autonomous diagnostic processor was begun as a supplement to the CE control. A universal system service adapter was incorporated in the System/370 Model 155 which provided a standard interface to external equipment for testing the 155 when it was in a stopped or disabled condition [36]. This idea was extended to an independent processor which could control execution of diagnostic routines as the CE did at the console. Program trace facilities, the ability to capture logout data, continuous monitoring of selected logic points, and programs to analyze the captured data have been added.

MVS software supervisor
Multiprocessing (MP) is one means of increasing availability; another is eliminating the need for unscheduled shutdowns. When an error occurred in previous systems, the system could not do any work until the installation reinitialized the system. One way of carrying out these procedures is by using a software supervisor which makes good use of the available hardware features. The standard IBM supervisor for System/370 is MVS; when an error occurs in the MVS system, the system attempts to continue operating. MVS attempts to retain availability through error-recovery routines that isolate the record, clean up and repair, and retry and reconfigure. Processing continues while the system carries out these tasks. Primarily recovery management support and the recovery termination manager perform these functions.

The improved availability MVS offers derives from the ability to automatically switch from a failing unit to an alternate. In addition, it is possible to reconfigure hardware components to fit an installation's needs or to reconfigure hardware components allowing service personnel to perform concurrent maintenance. Thus, over a period of time the system does more work because it loses less time due to failing hardware. A multiprocessor installation can be divided into two systems so that only the hardware components actually required for the special system are allocated to one processor, leaving the balance of the hardware resources available for normal work on the other processor. Thus, MP not only does more work in the sense of doing two things at one time, but also is more available, responding to the different needs of an installation at different times.

New RAS features in recent IBM systems
New features to improve RAS have added innovations in the IBM 4300 systems. For example, the difficulties of test generation are made more acute by large-scale integration (LSI) with the large number of circuits that can now be placed on a single chip, and the impossibility of probing to find faults. A method of overcoming these difficulties makes use of level-sensitive scan design (LSSD) [37]. In this method, all latches in a chip are connected in a shift register as well as with their normal interconnections. The shift registers can easily be tested, and the other circuits are tested as combinational circuits.

As discussed previously, instantaneous error detection and efficient failure isolation (ED/FI) are essential to system RAS. In order to obtain a quantitative measure, a basic evaluation method of a system's ED/FI capability is required as the system is being designed, so that efficient ED/FI capability can be achieved. Beginning in 1972, an ED/FI evaluation technique was developed in IBM, was tested, and now is widely used while IBM products are being designed [38]. With ED/FI, the fault model is defined in terms of failure probabilities associated with checkers and syndromes. Failure likelihood and probability of error detection are calculated from the circuit count, failure rates, and check placement. This technique is especially significant for LSI designs; it is extremely important to have error-detection and fault-isolation capability built into the early logic design phase. The evaluation and design ideas go hand in hand with a maintenance philosophy which relies primarily on instantaneous error detection and isolation of faults in the operating environment, rather than on conventional diagnostic methods of error re-creation. The result of the ED/FI design evaluation determines the capability of the system for diagnostic problem determination, remote diagnostics, and customer service.

This approach led to designing some IBM systems so that service tasks normally executed by CEs can be performed by the customer's operators. Built-in error-detection and isolation circuitry isolate the failure automatically or semi-automatically, and the customer is instructed to run additional tests by activating diagnostics housed in the unit. Test results point the customer to the correct place in the documentation, which instructs the operator as to the corrective action. Customer-replaceable elements are easily accessible and are also guided,
keyed, and color-coded to ensure correct positioning. The 3101 terminal announced in 1979 was the first IBM product to incorporate CPAR (Customer Problem Analysis Resolution).

To reduce costs, eliminate delay, and allow independent customer ability to set up or relocate a product, starting with the IBM 3767 in 1975, a practice called customer set-up (CSU), whereby a customer operator can perform all of the procedures required to get a product into operation on his system without IBM assistance, was initiated. The final test of a product, after CSU is complete, is activated by a simple key action which causes an automatic repeat of the final test performed in manufacturing prior to shipment. This concept has now been implemented on the IBM 3270 display system, including the IBM 3287 and 3289 printers in 1978, the IBM 8130 and 8140 systems in 1979, and many other display/printer products.

In addition, portable service processors are available for special system problem determination, channel monitoring, and line/link monitoring. A portable service processor (the IBM Maintenance Device) was first introduced in 1979 with the IBM 8130, 8140, 3370, and 3380. It includes a microprocessor, file, keyboard/display, communications port for remote support, and various ports to connect to a variety of products. A diskette containing a unique maintenance package, including MAPs-Diagnostic Integrated (MDI) error analysis programs and symptom/failure indexes, is shipped with each product and is loaded at the start of a CE call.

The next step was to use a communications network for remote control and diagnosis of the ailing computer [16] so that the logout patterns of a good computer can be compared with those of a failing computer. Patterns making faults appear as errors are captured and used for remote diagnosis and remotely controlled testing hardware.

These extended facilities improved hardware RAS sufficiently so that the main difficulties arose from the more complicated software which was attempting to make good use of the expanded hardware functionality. Early programs were so small that they could be easily comprehended and fixed, but with the advent of higher-level languages, large main storage, and multiprogramming, software errors have increased. IBM effort in software RAS has been concentrated in three fields: conventional fault avoidance, support for hardware features, and basic recovery with operating systems and the problem of large data base recovery. The major RAS enhancements of the last twenty-five years are exemplified in Fig. 4.

[Figure 4 chart: RAS features by system series, plotted against year (1950 to 1980).
700, 7000, 1400 series. Error detection: incorrect results, limited parity checking, diagnostic routines. Isolation: re-create error, manual scope. Recovery: error correcting code (ECC), rerun.
System/360 series. Error detection: error checking, error status logging. Isolation: fault-location tests, microdiagnostics, logout analysis, Maintenance Analysis Procedures (MAPs). Recovery: checkpoint-restart, hardware retry, limited software recovery.
System/370, 303X, 4300, 3081, etc., series. Error detection: extensive error checking. Isolation: automatic ED/FI, remote maintenance, dedicated service processor, maintenance device. Recovery: reconfiguration, multiprocessing, automatic software recovery, channel retry, instruction retry.]

Figure 4 RAS enhancements.

RAS progress in magnetic storage devices

• Magnetic tape area
Since the early 1950s, IBM has made very significant progress in magnetic tape technology. The paper by Harris et al. [39] in this issue has more detailed discussion on this subject. In this section, we shall only highlight the RAS progress and assess its impact on the overall technology progress.

To assess the impact of RAS progress on tape technology, we shall use areal density as a key parameter to measure progress. As areal density increases, the cost per
bit goes down and total storage size and data rate increase, which means a better figure of merit to end users. As shown in Fig. 5, the 726 was the first IBM parallel track tape unit to use the nonreturn-to-zero inverted (NRZI) code, which is 100% efficient in recording density. (Here the efficiency is defined as the data bits divided by flux reversals within a unit recording space.) The key problem in the NRZI code is that it does not have a self-clocking capability; when bits from different tracks arrive at a variable time due to mechanical and electrical skew, an error occurs. Therefore, improvements were made to reduce the mechanical tolerance in the tape transport by tuning the delay line or buffer register to the read/write process. The NRZI technique reached its limit in the 1960s when the linear density in the IBM 729 Models 5 and 6 and the early models of the 2400 tape series was up to 800 bits per inch (bpi).

Figure 5 Progress in IBM tape technology. The following abbreviations have been used: BED = bit error detection, NSC = non-self-clocking, SC = self-clocking, PE = phase encoding, EC = erasure correction, 1 (2) TC = 1-(2-) track correction, MI = mechanical improvements, SCT = self-clocking tracks, DS = de-skew, LBEC = long-burst error correction.

Meanwhile, the capability of error-correcting codes, matched with the NRZI code, was greatly enhanced. Both the 727 and 729 tape systems had a vertical redundancy check (VRC) and a longitudinal redundancy check (LRC) for error detection. Besides error-detection codes in the half-inch tape system, it had read-after-write check to make sure the recorded data were correct. As the density increased, a stronger correction was needed. In the 2400 tape series, a combination of cyclic redundancy check (CRC) and VRC made a clever single-track correction system [40, 41] which can correct a total track failure within a block of data in the nine-parallel-track system. This contribution by IBM was a major step beyond the state of the art and later became the industry standard. Before the 2400 tape system, the IBM 7340 tape system used the phase-encoding (PE) technique for data encoding; this technique has a lower efficiency in recording density than the NRZI code for representing a data bit. However, the PE code has a self-clocking feature, and its dual-level sensing provides a binary erasure channel property which, by using a (10,8) code [42], can correct all single errors and 33 out of 45 double errors within a codeword. IBM soon went back to the standard nine-track system because the ten tracks for the 7340 were non-conventional. No following systems have used ten tracks.

The PE data encoding method was used, however, in the 2400 tape series because of its self-clocking property, which improved the linear density up to 1600 bpi. The erasure-channel correction was not used, but the single-track correction system was employed. The PE code was used until the introduction in 1973 of the 3420 tape series Models IV, VI, and VIII, which used a (4,5) run-length code [43] for data encoding. This code has better (69% vs. 50%) recording density efficiency than the PE code. For these models a new code called group-coded recording (GCR), with a more powerful error-correction system, was also invented by IBM engineers [44, 45]. Significant progress has been made with this code, pushing the linear density from 1600 to 6250 bpi. The GCR code became the U.S. industry standard as well.

GCR enables the nine-track system to correct any two-track failure with a real-time pointer [46] and any single-track failure without a pointer. The GCR system has also helped move the data rate up to 1.25 megabytes per second. The error-recovery technique of the tape control unit also provides error-detection and re-read capability to take care of intermittent errors during the read process. In addition to the GCR code, there is a two-8-bit CRC at the end of each record for extra detection capability [47].

Along with the coding progress, improvements made in tape oxides and substrates, motion controls, direct-drive dc motors, and solid state electronic technology, improved contour-on-head design, and more electronic buffering to trade off with mechanical tolerance have pushed tape technology up to 9042 flux changes per inch, i.e., 6250 bpi. While progress was being made in the half-inch tape area, IBM devoted great resource to develop a mass storage system (MSS), in which the data storage is even greater in size with extremely low-cost data, for replacing the half-inch tape library. Because the IBM MSS 3850 uses a cartridge and a rotating head concept, a new data encoding method without a dc component was required. This requirement led to the invention of zero modulation
(ZM) code. The cartridge concept provides a different data format from the conventional parallel-track system, thus abandoning the track correction scheme. Instead, it uses an interleaved subfield code to correct a single burst of 128-bit-long or nonoverlapped 8-bit burst in 16 different sections [48]. The ZM code not only has a self-clocking feature and 100% recording efficiency, but also provides a powerful pointer to pick up extra error-detection capability in addition to what ECC provides.

• Magnetic disk area
Since the introduction of the RAMAC 305, IBM disk technology has made great progress in density, storage capacity, data rate, cost, and reliability [49]. This section will highlight only the RAS aspects of IBM disk technology. Using areal density as a figure of merit to discuss the progress, Fig. 6 shows that we have made an improvement of about four orders of magnitude, i.e., from 10^3 to 10^7 bits per square inch (bpsi).

[Figure 6 plot: areal density (bits per square inch, logarithmic scale) versus year, 1960 to 1980, for IBM disk products including the 305, 1301, 2314, 3370, and 3380. Annotations: parity check; multi-error-detection code; burst error-correction and detection code; re-read; alternative data block; defect skip.]

Figure 6 Progress in IBM disk technology.

As compared with magnetic tapes, the disk is of a better, hence more costly and more rigid, material and is a more precisely controlled device; it is therefore inherently more reliable. Because it is a serial-access device with respect to one track at a time, a skew problem of bits in multi-track arrival at different times does not exist. The disk has a higher data rate due to its rotation speed than tape. Its sensing circuit design has to allow greater tolerance, and hence the clocking window for sending data is narrower. The data encoding requirements for disk are not completely the same as for tape. In the early low-to-medium-density systems, the data-encoding method used was mainly to achieve specified density. In the 1301 and 1311 disks, the NRZI code was used for data encoding. As the areal density increased, a code called "Double Frequency" was used for data encoding in the 2302, 2311, and 2314. This code has a structure similar to that of the phase-encoding method; it is simply to insert a clocking pulse in between ones. In the very early disks, i.e., RAMAC 305 I and II, only simple parity checks were used for checking data integrity. The IBM 1301 was the first commercially available mass-produced disk using a 16-bit polynomial code for improved error-detection capability. This burst-error-detecting code was also used in the follow-on products, making significant progress in moving away from parity check only.

With the introduction of the IBM high-density and high-data-rate disk (3330) in 1971, basic disk technology made significant progress. A different data encoding method called modified frequency modulation (MFM) was used. This code provided a synchronized capability without requiring insertion of extra bits as in the double-frequency code. It also had a better capability to handle off-track interference. After nearly eight years in the burst-error-detecting mode, the large disk commercial product made a great leap forward in reliability improvement. The IBM 3330 used a Fire code [50] to correct a burst error up to 11 bits and had a very high error-detection capability up to a burst length of 45 bits. Besides error-correcting codes, this device could perform alternative track selection for avoiding a bad track and an echo check to make sure data had been written on the disk.

The 3330 disk technology and its RAS features set a milestone in the I/O industry. Soon its features were widely adopted within the U.S. and abroad. As the disk areal density moved upward in the 1970s, MFM continuously performed well in data-encoding requirements. Error-correcting codes were used basically as the Fire code but varied in burst-correction capability. In 1979 the IBM 3370 disk design used interleaved b-adjacent code [51] to replace Fire code for error correction.

Besides ECC, other important recovery techniques used were defect skip, alternative data block, and re-read in the IBM disk product family to enhance reliability and data integrity. It should be pointed out that significant basic technological progress has been made to improve device reliability in spite of increased areal density.

Additionally, in the early 1960s IBM was the first company to implement the parallel linear-feedback shift-register scheme in the 2820 drum control unit. Since the first disclosure of the theory behind it [52], industry has applied it widely in parallel CRC implementation for tape and for communication networks as well.

The improvements in recording-media quality and servo and head designs, and progress in error-correcting codes, error-recovery techniques, and data-encoding methods have made the high-density tape/disk product
even more reliable than before. In summary, the improvement in areal density of magnetic recording devices has brought great improvement in cost per bit, data rate, and total on-line storage capacity.

Conclusions
In spite of increased product and system complexity, advances in reliability, availability, and serviceability have made it possible for users to place additional reliance on computers, so much so that many users have committed much of their business to data processing systems. This has been possible because of the innovations described in this and other articles in this issue.

The beginning was made with improvements in the area of error detection, which enables dynamic fault isolation and is the basis for recovery and data integrity. Additionally, technology has produced significant gains in reliability. This has been augmented by the use of error-correction codes and redundancy. Availability is, in part, achieved through a recovery hierarchy that forces recovery at the lowest possible level. Serviceability has been significantly enhanced with automatic real-time fault isolation and, when necessary, remote assistance.

The future holds many challenges and the demand for system integrity will continue to increase, which dictates the need to continue to improve the state of the art in the field of RAS.

Acknowledgments
The authors thank E. C. Byman for his support of this paper, as well as P. E. Barshinger and R. C. Williams for their assistance, and G. R. Santana for his comments on the disk section. We would also like to thank all of the many IBMers who have contributed to the realization of the actual progress in IBM RAS technology.

References
1. Computing Systems Reliability, an Advanced Course, T. Anderson and B. Randell, Eds., Cambridge University Press, Cambridge, England, 1979.
2. J. von Neumann, "Probabilistic Logic and the Synthesis of Reliable Organisms from Unreliable Components," Automata Studies, C. E. Shannon and J. McCarthy, Eds., Princeton University Press, Princeton, NJ, 1956, pp. 43-98.
3. S. E. James, "Evolution of Real-Time Computer Systems for Manned Spaceflight," IBM J. Res. Develop. 25, 417-428 (1981, this issue).
4. P. E. Olsen and R. J. Orrange, "Real-Time Systems for Federal Applications: A Review of Significant Technological Developments," IBM J. Res. Develop. 25, 405-416 (1981, this issue).
5. David R. Jarema and Edward H. Sussenguth, "IBM Data Communications: A Quarter Century of Evolution and Progress," IBM J. Res. Develop. 25, 391-404 (1981, this issue).
6. L. R. Walters, "Diagnostics Programming Techniques for the IBM Type 701 E.D.P.M.," Convention Records of the IRE, 1953 National Convention, New York, NY.
7. R. R. Everett, C. A. Zraket, and H. D. Benington, "SAGE - A Data-Processing System for Air Defense," Proceedings of the Eastern Joint Computer Conference (EJCC), Washington, DC, 1957, p. 148.
8. J. J. Dent, "Diagnostic Engineering Requirements," AFIPS Conference Proceedings 32 (1968 Spring Joint Computer Conference, Atlantic City), 503-507 (1968).
9. M. Ball and F. Hardie, "Effects and Detection of Intermittent Failures in Digital Systems," AFIPS Conference Proceedings 35 (1969 Fall Joint Computer Conference, Las Vegas), 329-335 (1969).
10. C. J. Bashe, P. W. Jackson, H. A. Mussell, and W. D. Winger, "The Design of the IBM Type 702 System," paper no. 55-719, AIEE Transactions 74 (Part I, Communication and Electronics), 695-704 (1956).
11. W. Buchholz, Ed., Planning a Computer System (Project Stretch), McGraw-Hill Book Co., Inc., New York, 1962.
12. M. N. Perry and W. P. Plugge, "American Airlines 'SABRE' Electronic Reservations System," Proceedings of the Western Joint Computer Conference, Los Angeles, 1961, pp. 593-601.
13. F. F. Sellers, M. Y. Hsiao, and L. W. Beamson, Error Detecting Logic for Digital Computers, McGraw-Hill Book Co., Inc., New York, 1968.
14. W. C. Carter, H. C. Montgomery, R. J. Preiss, and H. J. Reinheimer, "Design of Serviceability Features for the IBM System/360," IBM J. Res. Develop. 8, 115-126 (1964).
15. R. J. Preiss, Chapter 7, Design Automation of Digital Systems, Vol. 1, M. A. Breuer, Ed., Prentice-Hall, Inc., Englewood Cliffs, NJ, 1972, pp. 335-410.
16. F. J. Hackl and R. W. Shirk, "An Integrated Approach to Automated Computer Maintenance," IEEE Conference Record on Switching Theory and Logical Design 16-C-13, 289-302 (1965). See also A. M. Johnson, Jr., "The Microdiagnostics for the IBM System 360/Model 30," IEEE Trans. Computers C-20, 798-803 (1971).
17. J. Fox, "Availability Design of the System/370 Model 168 Multiprocessor," Second USA-Japan Computer Conference Proceedings, Tokyo, August 1975, pp. 52-57.
18. A. N. Higgins, "Error Recovery through Programming," AFIPS Conference Proceedings 33 (1968 Fall Joint Computer Conference, San Francisco), 39-43 (1968).
19. D. C. Burnstine and W. H. Eppard, "Maintenance Strategy Diagramming Techniques," Proceedings of 1966 Annual Symposium on Reliability, San Francisco, pp. 497-506.
20. L. A. Bjork, "Recovery Scenario for a DB/DC System," Proceedings of the ACM Annual Conference, Atlanta, 1973, pp. 142-146.
21. C. T. Davies, "Recovery Semantics for a DB/DC System," op. cit., Ref. 20, pp. 136-141.
22. Fast Path Feature General Information Manual, Order No. GH20-9069-1 (1976), available through IBM branch offices.
23. Information Management System/Virtual Storage (IMS/VS), System Programming Reference Manual, Order No. SH20-9027-2 (1975), available through IBM branch offices.
24. OS/VS2 MVS Overview, Order No. GC20-0954-0, available through IBM branch offices.
25. M. Y. Hsiao and E. Kolankowsky, "Optimum Apparatus and Method for Check Bits Generation and Error Detection, Location, and Correction," U.S. Patent 3,623,155, November 23, 1971.
26. M. Y. Hsiao, "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes," IBM J. Res. Develop. 14, 395-401 (1970).
27. M. Bee, D. J. Lang, and A. D. Snyder, "Data Processing Machine Function Indicator," U.S. Patent 3,539,996, January 15, 1966.
28. M. Bee and D. J. Lang, "Instruction Retry Byte Counter," U.S. Patent 3,564,506, January 12, 1968.
29. B. McGilvray, D. J. Lang, W. E. Boehner, and M. W. Bee, "Data Processing System - Execution Retry Control," U.S. Patent 27,485, January 15, 1968.
30. R. A. Bell, "(An) I/O Switching (Scheme) for Multiprocessors," Technical Report TR00.2105 (1970); available from IBM Data Systems Division laboratory, Poughkeepsie, NY.
31. J. F. Thompson and C. A. Zito, "Channel Status Checking and Switching System," U.S. Patent 3,286,240, March 3, 1971; W. Clark, K. A. Salmond, and T. S. Stafford, "Input/Output Control," U.S. Patent 3,725,864, March 3, 1971; E. W. Devore, R. J. Smith, and J. M. Tyrrell, "Input/Output Unit Switch," U.S. Patent 3,372,378, March 5, 1968.
32. MVS Diagnostic Techniques, Order No. GC25-0725-2, available through IBM branch offices.
33. J. P. Roth, "Diagnosis of Automata Failures: A Calculus and a Method," IBM J. Res. Develop. 10, 278-291 (1966).
34. H. Y. Chang, E. Manning, and G. Metze, Fault Diagnosis of Digital Systems, Wiley-Interscience Publishers, New York, 1970.
35. F. F. Sellers, M. Y. Hsiao, and L. W. Beamson, "Analyzing Errors with the Boolean Difference," IEEE Trans. Computers C-17, 676 (1968).
36. D. C. Hitt and R. J. Woessner, "Universal System Service Adapter," U.S. Patent 3,585,599, June 15, 1971.
37. E. B. Eichelberger and T. W. Williams, "A Logic Design Structure for LSI Testability," Proceedings Workshop on Design Automation, New Orleans, 1977, p. 462.
38. M. Y. Hsiao, "Hardware Error Detection and Failure Isolation Design Evaluation Technique," invited talk, Fault Tolerant Computing Symposium-9, June 20-22, 1979, Madison, WI, and IFIPS-TC, September 1979, London, England.
39. J. P. Harris, W. B. Phillips, J. F. Wells, and W. D. Winger, "Innovations in the Design of Magnetic Tape Subsystems," IBM J. Res. Develop. 25, 691-699 (1981, this issue).
40. D. T. Brown and F. F. Sellers, "Error Detection and Correction Features," U.S. Patents 3,508,194, 3,508,195, and 3,508,196, April 21, 1970.
41. D. T. Brown and F. F. Sellers, Jr., "Error Correction for IBM 800 bit-per-inch Magnetic Tape," IBM J. Res. Develop. 14, 384-389 (1970).
42. M. Y. Hsiao and J. T. Tou, "Application of Error Correcting Codes in Computer Reliability Studies," IEEE Trans. Reliability R-18, 108-118 (1969).
43. P. A. Franaszek, "Sequence-state Methods for Run-length-limited Coding," IBM J. Res. Develop. 14, 376-383 (1970).
44. A. M. Patel and S. J. Hong, "Plural Channel Error Correcting Apparatus and Methods," reissue patent, U.S. Patent Re 30,187, January 8, 1980.
45. A. M. Patel and S. J. Hong, "Optimal Rectangular Code for High Density Magnetic Tapes," IBM J. Res. Develop. 18, 579-588 (1974).
46. H. C. Hinz, Jr., "Enhanced Error Detection & Correction for Data Systems," U.S. Patent 3,639,900, February 1, 1972.
47. E. G. McDonald and A. M. Patel, "Error Detection Systems," U.S. Patent 3,786,439, January 19, 1974.
48. Arvind M. Patel, "Error Recovery Scheme for the IBM 3850 Mass Storage System," IBM J. Res. Develop. 24, 32-42 (1980).
49. J. M. Harker, D. W. Brede, R. E. Pattison, G. R. Santana, and L. G. Taft, "A Quarter Century of Disk File Innovation," IBM J. Res. Develop. 25, 677-689 (1981, this issue).
50. W. W. Peterson, Error Correction Codes, MIT Press, Cambridge, MA, 1961, Ch. 10.
51. D. C. Bossen, "b-Adjacent Error Correction," IBM J. Res. Develop. 14, 402-408 (1970).
52. M. Y. Hsiao and K. Y. Sih, "Serial-to-Parallel Transformation of Linear Feedback Shift Register Circuits," IEEE Trans. Electron. Computers EC-13, 738-740 (1964).

Received May 2, 1980; revised August 25, 1980

M. Y. Hsiao is located at the IBM Data Systems Division laboratory, Poughkeepsie, New York 12602. W. C. Carter is located at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598. J. W. Thomas is with the Data Processing Products Group at the IBM laboratory in Poughkeepsie, New York 12602. W. R. Stringfellow is located at the IBM Field Engineering Center, Research Triangle Park, North Carolina 27709.
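
The odd-weight-column SEC-DED codes discussed under error-correction codes for main memory [25, 26] can be illustrated with a toy example. The following is a minimal sketch, not the code used in any IBM product: it assumes a small (8,4) code whose parity-check columns were chosen here only because each has an odd number of ones, and the names H_COLS, syndrome, encode, and decode are hypothetical. Because every column has odd weight, a single-bit error yields an odd-weight syndrome and any double-bit error yields a nonzero even-weight syndrome, so the two cases can be separated by syndrome parity alone.

```python
# Toy (8,4) single-error-correcting, double-error-detecting code.
# Assumption: these parity-check columns are illustrative; each has odd weight.
DATA_COLS = [0b0111, 0b1011, 0b1101, 0b1110]   # weight-3 columns (data bits)
CHECK_COLS = [0b1000, 0b0100, 0b0010, 0b0001]  # weight-1 columns (check bits)
H_COLS = DATA_COLS + CHECK_COLS


def syndrome(word):
    """XOR together the columns of H selected by the 1 bits of the word."""
    s = 0
    for bit, col in zip(word, H_COLS):
        if bit:
            s ^= col
    return s


def encode(data):
    """Append 4 check bits so that the codeword has a zero syndrome."""
    s = syndrome(list(data) + [0, 0, 0, 0])
    checks = [(s >> shift) & 1 for shift in (3, 2, 1, 0)]
    return list(data) + checks


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'double'."""
    word = list(word)
    s = syndrome(word)
    if s == 0:
        return word[:4], "ok"
    if bin(s).count("1") % 2 == 1:
        # Odd-weight syndrome: single error at the matching column.
        word[H_COLS.index(s)] ^= 1
        return word[:4], "corrected"
    # Nonzero even-weight syndrome: two errors, detectable but not correctable.
    return word[:4], "double"
```

In this sketch, flipping any one bit of an encoded word is repaired by decode, while flipping any two bits is reported as "double" rather than miscorrected.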

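
The parallel linear-feedback shift-register scheme mentioned in the disk section [52] rests on the fact that one shift-register step is linear over GF(2), so eight serial steps can be folded into a single byte-wide update. The sketch below illustrates that idea under stated assumptions and is not the 2820 design: the generator polynomial 0x1021 (x^16 + x^12 + x^5 + 1), the zero initial register, and the function names are choices made here, and a lookup table stands in for the XOR network that hardware would use.

```python
POLY = 0x1021   # assumed generator polynomial x^16 + x^12 + x^5 + 1
MASK = 0xFFFF   # 16-bit register


def crc_serial(data):
    """Bit-at-a-time LFSR: one shift per input bit, most significant bit first."""
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):
            feedback = ((reg >> 15) & 1) ^ ((byte >> i) & 1)
            reg = (reg << 1) & MASK
            if feedback:
                reg ^= POLY
    return reg


def _eight_steps(top_byte):
    """Effect of eight serial shifts on a register holding top_byte in bits 15-8."""
    reg = top_byte << 8
    for _ in range(8):
        reg = ((reg << 1) ^ POLY) & MASK if reg & 0x8000 else (reg << 1) & MASK
    return reg


TABLE = [_eight_steps(b) for b in range(256)]


def crc_parallel(data):
    """Byte-at-a-time update: eight LFSR steps folded into one table lookup."""
    reg = 0
    for byte in data:
        reg = ((reg << 8) & MASK) ^ TABLE[(reg >> 8) ^ byte]
    return reg
```

Both functions take a bytes object and should return the same 16-bit check value for any input; the parallel form simply does one update per byte instead of one per bit.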
No ratings yet
Reliability, Availability, Serviceability (RAS) : The Ibm
25 pages
Reliable Computer Systems (Design and Evaluatuion) (2nd Edition) Siewiorek
No ratings yet
Reliable Computer Systems (Design and Evaluatuion) (2nd Edition) Siewiorek
10 pages
Design Patterns For High Availability
No ratings yet
Design Patterns For High Availability
10 pages
Reliability, Availability Maintainability New
No ratings yet
Reliability, Availability Maintainability New
28 pages
Dependability for IT Professionals
No ratings yet
Dependability for IT Professionals
21 pages
Mostafa Abd-El-Barr Design and Analysis of Reliabookfi
No ratings yet
Mostafa Abd-El-Barr Design and Analysis of Reliabookfi
463 pages
Design For Reliability Information and Computer Based Systems 1st Edition Eric Bauer Download
100% (1)
Design For Reliability Information and Computer Based Systems 1st Edition Eric Bauer Download
56 pages
BulletProof: Defect-Tolerant CMP Switch
No ratings yet
BulletProof: Defect-Tolerant CMP Switch
12 pages
CPE 508 Assignment
No ratings yet
CPE 508 Assignment
5 pages
CSC 308 Fault Tolerant Computing
No ratings yet
CSC 308 Fault Tolerant Computing
24 pages
Reliability, Availability, and Maintainability: Probability Models For Populations
No ratings yet
Reliability, Availability, and Maintainability: Probability Models For Populations
8 pages
Chapter I
No ratings yet
Chapter I
7 pages
Designing A Control System For High Availability
No ratings yet
Designing A Control System For High Availability
10 pages
Matrix Approach To Perform Dependent Failure Analysis in Compliance With Functional Safety Standards
No ratings yet
Matrix Approach To Perform Dependent Failure Analysis in Compliance With Functional Safety Standards
6 pages
1 Storage-150927084723-Lva1-App6892
No ratings yet
1 Storage-150927084723-Lva1-App6892
26 pages
Fault Avoidance and Tolerance Technique
No ratings yet
Fault Avoidance and Tolerance Technique
15 pages
Reliability Engineering Essentials
No ratings yet
Reliability Engineering Essentials
20 pages
Comprehensive RAS for Engineers
No ratings yet
Comprehensive RAS for Engineers
12 pages
Industrial Computing Systems: A Case Study of Fault Tolerance Analysis
No ratings yet
Industrial Computing Systems: A Case Study of Fault Tolerance Analysis
6 pages
Ece425 L25
No ratings yet
Ece425 L25
21 pages
M Chapter 7 and 8
No ratings yet
M Chapter 7 and 8
16 pages
Rts
No ratings yet
Rts
44 pages
4-Embedded System Design Issues
No ratings yet
4-Embedded System Design Issues
41 pages
Unit-II: Characteristics of Embedded Systems
No ratings yet
Unit-II: Characteristics of Embedded Systems
25 pages
Review of Memory RAS For Data Centers
No ratings yet
Review of Memory RAS For Data Centers
15 pages
Design of Fault Tolerant Systems
No ratings yet
Design of Fault Tolerant Systems
7 pages
CMP 321 Lecture Note
No ratings yet
CMP 321 Lecture Note
45 pages
Assignment 8
No ratings yet
Assignment 8
5 pages
Redundancy in Instrumentations.
No ratings yet
Redundancy in Instrumentations.
3 pages
Ch-4-Fault Tularance - Naming-SM
No ratings yet
Ch-4-Fault Tularance - Naming-SM
42 pages
Meth en Anglair
No ratings yet
Meth en Anglair
54 pages
Week09-Fault Tolerant System
No ratings yet
Week09-Fault Tolerant System
26 pages
Functional Safety: A Practical Approach For End-Users and System Integrators
No ratings yet
Functional Safety: A Practical Approach For End-Users and System Integrators
11 pages
II Fault Tolerant Techniques
No ratings yet
II Fault Tolerant Techniques
101 pages
Ram Additional Notes 2 Look at The Calculations
No ratings yet
Ram Additional Notes 2 Look at The Calculations
25 pages
Beyond Fault Tolerance: Third Generation SIS Approaches For Optimizing Safety Integrity and Operational Availability
No ratings yet
Beyond Fault Tolerance: Third Generation SIS Approaches For Optimizing Safety Integrity and Operational Availability
5 pages
Revision Notes - 02 Reliability in Computer Systems
No ratings yet
Revision Notes - 02 Reliability in Computer Systems
12 pages
? 50 Smart SEO Prompts For ChatGPT
No ratings yet
? 50 Smart SEO Prompts For ChatGPT
6 pages
PGP 2 Case Study
No ratings yet
PGP 2 Case Study
3 pages
Wa0035.
No ratings yet
Wa0035.
1 page
A Project Report: in Partial Fulfillment For The Award of The Degree
No ratings yet
A Project Report: in Partial Fulfillment For The Award of The Degree
50 pages
ERP Audit for Chartered Accountants
No ratings yet
ERP Audit for Chartered Accountants
30 pages
NCERT Solutions For Class 4 April 5 EVS Looking Around Chapter 14 Basvas Farm
No ratings yet
NCERT Solutions For Class 4 April 5 EVS Looking Around Chapter 14 Basvas Farm
6 pages
Jurnal Internasional
No ratings yet
Jurnal Internasional
18 pages
Financial For BYR
No ratings yet
Financial For BYR
8 pages
Teacher Performance Review Form
No ratings yet
Teacher Performance Review Form
13 pages
Al Bustan South Bridge Design and Built Project, Doha Qatar
No ratings yet
Al Bustan South Bridge Design and Built Project, Doha Qatar
11 pages
THC 8 Syllabus 2022 1
No ratings yet
THC 8 Syllabus 2022 1
14 pages
Hsslive Xi Sociology PDF Notes English Alphonsa PDF
No ratings yet
Hsslive Xi Sociology PDF Notes English Alphonsa PDF
80 pages
P L D 2000 Lahore 461 Section 90
No ratings yet
P L D 2000 Lahore 461 Section 90
42 pages
Classified Advertisement
No ratings yet
Classified Advertisement
5 pages
Assignment Inventory Management
No ratings yet
Assignment Inventory Management
7 pages
Occupations Vocabulary Esl Crossword Puzzle Worksheets For Kids
No ratings yet
Occupations Vocabulary Esl Crossword Puzzle Worksheets For Kids
10 pages
Untitled
No ratings yet
Untitled
116 pages
BS Accountancy Sample Thesis
78% (9)
BS Accountancy Sample Thesis
8 pages
Legal Precedents in Transport Liability
No ratings yet
Legal Precedents in Transport Liability
33 pages
Activity: 4.1 Separating The Components of A Mixture
No ratings yet
Activity: 4.1 Separating The Components of A Mixture
10 pages
CP4291 IOT LAb MANUAL-1
No ratings yet
CP4291 IOT LAb MANUAL-1
37 pages
Aquaprobe Fea100/Fea200: Electromagnetic Flowmeter Insertion-Type Flow Sensors
No ratings yet
Aquaprobe Fea100/Fea200: Electromagnetic Flowmeter Insertion-Type Flow Sensors
36 pages
Public Wifi Seminar Report
No ratings yet
Public Wifi Seminar Report
24 pages
Stainless Steel Metric Bolts, Screws, and Studs: Standard Specification For
No ratings yet
Stainless Steel Metric Bolts, Screws, and Studs: Standard Specification For
9 pages
DeKalb County Commissioners Chief of Staff Morris Williams P-Card Activity
No ratings yet
DeKalb County Commissioners Chief of Staff Morris Williams P-Card Activity
37 pages
Introduction to Index Numbers
No ratings yet
Introduction to Index Numbers
22 pages
BM8601 Dte-1 Question Bank
No ratings yet
BM8601 Dte-1 Question Bank
8 pages
Institution Chapter 1-2
No ratings yet
Institution Chapter 1-2
193 pages
FULL Version Testbank PointSet Topology With Topics Basic General Topology For Graduate Studies Robert Andre Multiple Formats
No ratings yet
FULL Version Testbank PointSet Topology With Topics Basic General Topology For Graduate Studies Robert Andre Multiple Formats
405 pages
KAIBEL CHEM Alisa
No ratings yet
KAIBEL CHEM Alisa
13 pages