IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 46, NO. 1, JANUARY 2011
A 45 nm Resilient Microprocessor Core for Dynamic
               Variation Tolerance
  Keith A. Bowman, Member, IEEE, James W. Tschanz, Member, IEEE, Shih-Lien L. Lu, Senior Member, IEEE,
        Paolo A. Aseron, Muhammad M. Khellah, Member, IEEE, Arijit Raychowdhury, Member, IEEE,
          Bibiche M. Geuskens, Member, IEEE, Carlos Tokunaga, Member, IEEE, Chris B. Wilkerson,
                  Tanay Karnik, Senior Member, IEEE, and Vivek K. De, Senior Member, IEEE
   Abstract—A 45 nm microprocessor core integrates resilient error-detection and recovery circuits to mitigate the clock frequency (FCLK) guardbands for dynamic parameter variations to improve throughput and energy efficiency. The core supports two distinct error-detection designs, allowing a direct comparison of the relative trade-offs. The first design embeds error-detection sequential (EDS) circuits in critical paths to detect late timing transitions. In addition to reducing the FCLK guardbands for dynamic variations, the embedded EDS design can exploit path-activation rates to operate the microprocessor faster than infrequently-activated critical paths. The second error-detection design offers a less-intrusive approach for dynamic timing-error detection by placing a tunable replica circuit (TRC) per pipeline stage to monitor worst-case delays. Although the TRCs require a delay guardband to ensure the TRC delay is always slower than critical-path delays, the TRC design captures most of the benefits from the embedded EDS design with less implementation overhead. Furthermore, while core min-delay constraints limit the potential benefits of the embedded EDS design, a salient advantage of the TRC design is the ability to detect a wider range of dynamic delay variation, as demonstrated through low supply voltage (VCC) measurements. Both error-detection designs interface with error-recovery techniques, enabling the detection and correction of timing errors from fast-changing variations such as high-frequency VCC droops.
   The microprocessor core also supports two separate error-recovery techniques to guarantee correct execution even if dynamic variations persist. The first technique requires clock control to replay errant instructions at 1/2 FCLK. In comparison, the second technique is a new multiple-issue instruction replay design that corrects errant instructions with a lower performance penalty and without requiring clock control. Silicon measurements demonstrate that resilient circuits enable a 41% throughput gain at equal energy or a 22% energy reduction at equal throughput, as compared to a conventional design when executing a benchmark program with a 10% VCC droop. In addition, the microprocessor includes a new adaptive clock control circuit that interfaces with the resilient circuits and a phase-locked loop (PLL) to track recovery cycles and adapt to persistent errors by dynamically changing FCLK for maximum efficiency.

   Index Terms—Resilient microprocessor, resilient design, resilient circuit, dynamic variation, timing error, error detection, error-detection sequential circuit, tunable replica circuit, error correction, error recovery, multiple-issue instruction replay, variation tolerance, adaptive circuit, adaptive clocking.

  Manuscript received July 28, 2010; revised October 01, 2010; accepted October 14, 2010. Date of publication December 03, 2010; date of current version December 27, 2010. This paper was approved by Associate Editor Bevan Baas.
  The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: keith.a.bowman@intel.com).
  Digital Object Identifier 10.1109/JSSC.2010.2089657
  0018-9200/$26.00 © 2010 IEEE

                       I. INTRODUCTION
   VARIABILITY in device and circuit parameters adversely affects the performance and energy efficiency of microprocessors across all market segments, ranging from the small embedded core in a system-on-chip (SoC) to large multi-core servers. A dynamic parameter variation occurs in time during the microprocessor operation, resulting from environmental and workload changes. Examples of dynamic variations include supply voltage (VCC) droops, temperature changes, and transistor aging degradation. VCC droops result from abrupt changes in switching activity, inducing large current transients in the power delivery system. The VCC droop magnitude and duration depend on the interaction of capacitive and inductive parasitics at the board, package, and die levels with changes in current demand [1]. VCC droops contain high-frequency (i.e., fast changing) and low-frequency (i.e., slow changing) components and occur locally and globally across the die. Temperature variations occur at a relatively slow time scale with local hot spots on the die, depending on environmental and workload conditions as well as the heat-removal capability of the package. Transistor aging slowly degrades the drive current over time as a function of gate bias and temperature conditions. Conventional microprocessor designs build in clock frequency (FCLK) guardbands to ensure correct functionality within the presence of worst-case dynamic variations. Consequently, these inflexible designs cannot exploit the opportunities for higher performance by increasing FCLK or lower energy by reducing VCC during favorable operating conditions and lack of aging degradation. Since most systems usually operate at nominal conditions where worst-case scenarios rarely occur, the necessary guardbands for these infrequent dynamic variations severely limit the performance and energy efficiency of conventional designs.
   On-die variation sensors coupled with adaptive circuit techniques have been demonstrated to adjust FCLK, VCC, or body bias in response to slow-changing VCC, temperature, and aging variations [2]–[4]. Since these designs require time to detect and respond to dynamic variations to avoid actual timing violations, these circuit techniques reduce the FCLK guardbands for slow-changing global variations, resulting in higher average FCLK. Alternatively, the average FCLK benefits may be converted to lower average energy by decreasing VCC. A disadvantage of on-die sensors and adaptive circuits is the inability to respond to fast-changing variations such as high-frequency VCC droops or local path-level variations. Thus, guardbands for fast-changing variations are still required.
   In contrast to sensors and adaptive circuits that avoid timing errors, a resilient design contains error-detection and recovery capabilities [5]–[12] to maintain correct system functionality while in the presence of internal errors. Resilient circuits enable the microprocessor to operate at a higher FCLK as compared to a conventional design. When a dynamic parameter variation induces a timing error, the resilient circuits detect and correct the error. The key advantage of a resilient design over sensors and adaptive circuits is the relaxed response-time constraint. As long as the resilient design prevents the timing error from corrupting the architectural state of the microprocessor, error correction can occur over multiple clock cycles. Thus, resilient circuits detect and correct timing errors from both fast- and slow-changing variations. Although error correction requires additional clock cycles, the FCLK gains from mitigating the guardbands for infrequent dynamic variations far outweigh the recovery overhead, resulting in higher overall throughput [11]. Alternatively, the throughput benefit can be traded-off for lower energy by reducing VCC. The disadvantage of resiliency is the design complexity of the error-detection and recovery circuits.
   This paper presents a 45 nm resilient microprocessor core that mitigates the FCLK guardbands for dynamic variations to maximize throughput or energy efficiency [13]. Section II provides an overview of the microprocessor design, including the resilient core and the adaptive clock control to dynamically adjust FCLK based on the operating environment for maximum efficiency. Sections III and IV describe two separate error-detection designs and two separate error-recovery techniques, respectively, allowing a direct comparison of the relative trade-offs. Section V gives a detailed description of the design methodology for integrating the error-detection and recovery circuits into the microprocessor. Section VI provides the testing infrastructure for compiling and executing benchmark programs. Section VII presents the silicon measurements, highlighting the advantages and disadvantages of the two error-detection designs and the two error-recovery techniques. Section VIII concludes by summarizing the key results and insights.

           II. MICROPROCESSOR DESIGN OVERVIEW
   The microprocessor implementation allows a comparison of resilient and conventional designs, including an analysis of resiliency overheads and silicon measurements of throughput and energy while executing benchmark programs. As illustrated in Fig. 1, the research microprocessor consists of a resilient core with an error-control unit (ECU), a 16 KB instruction cache, a 16 KB data cache, a register file (RF), and a clock generator with adaptive clock control. The microprocessor also contains on-die noise injectors to induce VCC droop events. Microprocessor features are programmed through an IEEE 1149.1 JTAG scan controller. Subsections A–E describe each of these components. The micrograph and characteristics are given in Fig. 2. The microprocessor is manufactured in a 45 nm logic technology [14] on a 4.4 × 3.1 mm packaged die.

Fig. 1. Resilient microprocessor block diagram. Error-detection circuits are integrated into the first five pipeline stages. Errors are pipelined to the write-back (WB) stage to invalidate errant instructions and to the error-control unit for recovery. Adaptive clock control monitors the recovery rate to dynamically change clock frequency (FCLK) during a persistent variation.

Fig. 2. Resilient microprocessor micrograph and characteristics.

A. Resilient Core Design
   The research microprocessor core is based on the open-source, synthesizable, LEON-3 design [15], where the 32-bit, RISC, in-order pipeline is modified to incorporate resiliency features. As described in Fig. 1, the seven-stage pipeline consists of instruction fetch (IF), decode (DE), register access (RA), execute (EX), memory (MEM), exception (X), and write-back (WB) stages. The core only supports integer operations since the floating-point unit (FPU) and hardware multiplier are omitted. Data is written into the register file at the WB stage, and the register file is accessed at the RA stage. Data cache writes, which occur in the MEM stage, are locally buffered for one cycle to ensure the instruction is valid before committing the write.
   The modified core integrates resilient error-detection and recovery features. Error-detection circuits protect the first five pipeline stages (IF, DE, RA, EX, and MEM) by detecting late timing transitions. As discussed further in Section V, the X and WB stages are designed with additional timing guardband to ensure dynamic-variation timing failures do not occur in these two stages. If a dynamic variation induces a timing failure in any of the first five pipeline stages, the error-detection circuits identify the error and generate a single pipeline-error signal (e.g., the error signal for the DE stage). This error signal is pipelined to the WB stage to invalidate the errant instruction and to the ECU for error recovery. At the WB stage, control logic also prevents subsequent instructions from corrupting the architectural state. The scan-programmable ECU implements two distinct error-recovery techniques based on replaying the errant instruction. If the errant instruction executes correctly during the replay, the instruction commits data to the architectural state, and then subsequent instructions continue normal operation.
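
The error-signal flow of Section II.A (detect in one of the first five stages, carry the flag with the instruction to WB, invalidate, then replay) can be illustrated with a small behavioral model. The sketch below is an illustration only, not the paper's implementation: the function name `run`, the `faults` input format, and the flush-and-refetch replay are assumptions made for this example, and neither of the paper's two recovery techniques is modeled.

```python
STAGES = ["IF", "DE", "RA", "EX", "MEM", "X", "WB"]
PROTECTED = {"IF", "DE", "RA", "EX", "MEM"}  # stages with error detection


def run(num_instructions, faults):
    """Simulate the pipeline; return (commit order, total cycles).

    faults is a set of (cycle, stage) pairs at which a late timing
    transition occurs (hypothetical input format for this sketch).
    """
    pipe = [None] * len(STAGES)  # each slot holds [instr_id, error_flag]
    committed = []
    next_fetch = 0
    cycle = 0
    while len(committed) < num_instructions:
        # Advance every instruction by one stage and fetch the next one.
        pipe = [None] + pipe[:-1]
        if next_fetch < num_instructions:
            pipe[0] = [next_fetch, False]
            next_fetch += 1
        # Error detection: the flag travels with the instruction toward WB.
        for i, stage in enumerate(STAGES):
            slot = pipe[i]
            if slot is not None and stage in PROTECTED and (cycle, stage) in faults:
                slot[1] = True
        # Write-back: commit valid instructions, invalidate errant ones.
        wb = pipe[-1]
        if wb is not None:
            instr, error = wb
            if error:
                # Assumed recovery: flush younger instructions and refetch
                # starting from the errant instruction.
                pipe = [None] * len(STAGES)
                next_fetch = instr
            else:
                committed.append(instr)
        cycle += 1
    return committed, cycle
```

In this model an error flagged anywhere in the protected stages never reaches the architectural state, because the only commit point is WB and the flag arrives there together with the instruction. Younger instructions are flushed before they can commit, which mirrors the role of the WB control logic described above.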
B. Instruction and Data Caches and Register File
   The instruction and data caches are 16 KB each. The cache memory cell is based on the 8T L1 cache cell from the 45 nm Intel® Core™ i7 microprocessor [16]. The register file contains 40 entries and three ports, supporting one write and two read operations per cycle. The core and memory structures are on separate supply voltages with level shifters in between to ensure the minimum operating voltage for the caches and register file does not limit the core functionality at low voltages. In addition, separate supply voltages allow independent testing of the core and memory.

C. Clock Generator with Adaptive Clock Control
   The clock generator contains a phase-locked loop (PLL) based on the PLL in the 45 nm Intel® Core™ i7 microprocessor [17]. In order to quickly generate the 1/2 FCLK signal for one of the error-recovery techniques, a clock-divider circuit skips alternate clock pulses to reduce FCLK in half without requiring PLL relock [11]. The ECU enables either the full-frequency or half-frequency clock signal. This clock signal then drives a scan-tunable duty-cycle control circuit [11] to maintain a constant high-phase delay for the core clock at both high and low FCLK values, providing min-delay protection for the embedded error-detection sequential circuits as discussed further in Section III.A.
   An adaptive clock control circuit interfaces between the ECU and the PLL to track recovery cycles and adjust the PLL divide ratio to dynamically adapt FCLK to slow-changing variations. The adaptive clock controller consists of two counters and a finite-state machine. The counters are based on a cascaded design to ensure path delays are not timing critical. The adaptive clock controller receives two inputs from the ECU: (i) Replay signal and (ii) Half-frequency signal. When replaying an errant instruction, the replay signal is logically-high for the duration of the replay. If the ECU replays an errant instruction at 1/2 FCLK, the half-frequency signal is also logically-high for the duration of the replay. The adaptive clock controller counts the number of replay cycles by incrementing a counter for every clock cycle that the replay signal is logically-high. If the half-frequency signal is also logically-high, this replay counter increments by two, which accounts for the 2X replay penalty for each half-frequency clock cycle. The replay counter accumulates the number of recovery cycles over a programmable sampling period (e.g., 1 ms), which is monitored with a separate counter. The adaptive clock controller compares the output of the replay counter to a set of programmable thresholds. As discussed further in Section VII, an optimum recovery rate exists to maximize the throughput of a resilient design. The upper and lower thresholds are based on the optimum recovery rate. The adaptive clock controller only initiates an FCLK change if the number of recovery cycles either exceeds the upper threshold or remains below the lower threshold for two consecutive sampling periods, which provides a long-duration measurement of the current operating environment. If an FCLK change is desired, the adaptive clock controller changes the PLL divide ratio. After the PLL relocks to the new divide ratio, the adaptive clock controller monitors the recovery cycles at the new FCLK value to ensure optimal operation. Maximum, minimum, and intermediate FCLK values are scan programmable. This adaptive clock controller enables the microprocessor to adjust to the operating environment to maximize throughput.

D. Noise Injector Circuits
   Programmable noise injector circuits are inserted at multiple locations in the microprocessor to generate VCC droops as observed during the normal operation of a larger microprocessor. Noise injector circuits consist of scan-configurable current-sink transistors activated by an external noise clock. By programming the external noise clock and the number of current-sink transistors, the timing of the VCC droop as well as the droop magnitude and frequency are well-controlled. An on-die dynamic variation monitor [18] provides a cycle-accurate measurement of the droop magnitude and frequency to guide the noise injector settings.

E. JTAG Scan Controller
   An on-die JTAG scan controller coordinates the programming of all scan-enabled features in the microprocessor. Separate JTAG scan chains are provided for the core, the instruction and data caches, and the clock generator. The on-die JTAG controller interfaces with a host computer, which runs custom testing software written in Perl and C programming languages. The testing software and JTAG scan controller load the binary for the C-compiled benchmarks and the program input data into the instruction and data caches, respectively, for execution. During program execution, data is written to the register file and data cache. After the program completes, the register file and data cache contents are scanned out through the JTAG scan controller and testing software to validate program functionality.

                 III. ERROR-DETECTION CIRCUITS
   This section presents two separate designs for timing-error detection: (i) Embedded error-detection sequential (EDS) circuit and (ii) Tunable replica circuit (TRC). Both error-detection designs interface with error recovery to mitigate the FCLK guardbands for dynamic variations. Scan bits can mask EDS and/or TRC error signals, allowing separate testing of either technique and for a conventional design without error detection.

A. Embedded Error-Detection Sequential (EDS) Circuit
   Fig. 3 describes the concept of timing-error detection for dynamic variation tolerance. Fig. 3(a) represents a conventional design, consisting of a critical path with driving and receiving flip-flops (FF). In Fig. 3(b), conceptual timing diagrams illustrate the arrival times of the input data (D) to the receiving FF during worst-case dynamic variations and nominal conditions. Within the presence of worst-case dynamic variations, the input data to the receiving FF must arrive a setup time prior to the rising clock edge to guarantee correct functionality. In comparison, the input data for the same path arrives much earlier during nominal conditions. The difference between the input data arrival times for these two cases represents the effective timing guardband required for dynamic variations. A resilient design is created by replacing the receiving FF of the conventional design with an EDS circuit [11] as described in Fig. 3(c).
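
As an aside on the adaptive clock controller of Section II.C, its counting and threshold behavior can be sketched in software. This is a minimal behavioral sketch under stated assumptions, not the circuit itself: the class and method names are hypothetical, and FCLK moves one scan-programmed setting per change. The direction of each change is also assumed: too many recovery cycles lowers FCLK and too few raises it. The two-consecutive-period rule is applied to both thresholds, and PLL relock time is not modeled.

```python
class AdaptiveClockController:
    """Behavioral sketch: a replay counter, a sampling-period counter, and a small FSM."""

    def __init__(self, lower, upper, period_cycles, fclk_settings, start_index):
        self.lower = lower                  # lower threshold (recovery cycles per period)
        self.upper = upper                  # upper threshold (recovery cycles per period)
        self.period_cycles = period_cycles  # sampling period length in clock cycles
        self.fclk_settings = fclk_settings  # ordered from minimum to maximum FCLK
        self.index = start_index
        self.replay_count = 0
        self.period_count = 0
        self.last_verdict = "hold"

    @property
    def fclk(self):
        return self.fclk_settings[self.index]

    def clock(self, replay, half_frequency=False):
        """Call once per clock cycle with the two ECU inputs."""
        if replay:
            # A half-frequency replay cycle counts twice (2X replay penalty).
            self.replay_count += 2 if half_frequency else 1
        self.period_count += 1
        if self.period_count == self.period_cycles:
            self._end_of_period()

    def _end_of_period(self):
        if self.replay_count > self.upper:
            verdict = "down"   # assumed: too many recovery cycles, lower FCLK
        elif self.replay_count < self.lower:
            verdict = "up"     # assumed: too few recovery cycles, raise FCLK
        else:
            verdict = "hold"
        if verdict != "hold" and verdict == self.last_verdict:
            # Same verdict in two consecutive sampling periods: change the setting.
            step = 1 if verdict == "up" else -1
            self.index = max(0, min(len(self.fclk_settings) - 1, self.index + step))
            self.last_verdict = "hold"  # require two fresh periods before the next change
        else:
            self.last_verdict = verdict
        self.replay_count = 0
        self.period_count = 0


def run_period(ctrl, replay_cycles, half_frequency=False):
    """Drive one full sampling period with the given number of replay cycles."""
    for cycle in range(ctrl.period_cycles):
        ctrl.clock(cycle < replay_cycles, half_frequency)
```

Requiring the same verdict twice before acting keeps a single noisy sampling period from triggering a PLL relock, which is the stated purpose of the two-period rule. The FCLK values passed in are arbitrary placeholders; in the design they are scan programmable.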
Fig. 3. (a) Conventional design and (b) conceptual timing diagrams for worst-case dynamic variations and nominal conditions. (c) Resilient design employing
a double-sampling with time-borrowing (DSTB) error-detection sequential (EDS) circuit and (d) conceptual timing diagram for late arriving input data. For the
resilient design, CLK is duty-cycle controlled to satisfy min-delay requirements.
In Fig. 3(d), a conceptual timing diagram illustrates the EDS circuit operation when the input data arrives late. The EDS circuit in Fig. 3(c) is a double-sampling with a time-borrowing latch (DSTB) design [11]. The shadow FF and datapath latch sample the input data on the rising and falling clock edges, respectively. An XOR logic gate compares the latch and FF outputs to generate the error signal (ERROR). If the input data transitions late as described in Fig. 3(d), latch and FF outputs differ, resulting in a logically-high error signal. The error signals from each EDS circuit in a pipeline stage are inputs to an OR logic tree to generate the single pipeline-error signal. The EDS circuit only detects late timing transitions during the high clock phase. During the low clock phase, latch and FF outputs remain at constant logic values. The propagation delay through the OR logic tree is designed less than the delay of the low clock phase to guarantee proper error detection. As described in Section II.A, the pipeline-error signal propagates to the WB stage to invalidate the errant instruction and to the ECU for error recovery. By detecting and correcting late arriving data, the resilient design reduces the timing guardband for infrequent dynamic variations, enabling a higher FCLK as compared to a conventional design.
   A critical issue for some previous EDS circuits [5]–[9] is the susceptibility to datapath metastability when the input data arrives close to a rising clock edge, resulting in the possibility of undetected errors. For the DSTB EDS circuit, the datapath latch operates as a pulsed latch, thus eliminating datapath metastability during a rising clock edge. Although datapath metastability is removed, the shadow FF output can become metastable during a rising clock edge. In contrast to the datapath, the error path does not fan-out and behaves similar to a traditional synchronizer circuit, thus drastically simplifying the metastability problem. For a microprocessor design, the mean time between failures (MTBF) from error-signal metastability is over ten orders of magnitude larger than the MTBF targets for radiation-induced soft errors [11].
   Although the DSTB design employs a datapath latch, path timing constraints are still based on a FF design with an error-detection window as illustrated in Fig. 3. The purpose of the transparency window in the datapath latch is to eliminate datapath metastability while detecting timing errors. When input data arrives late, the DSTB design generates an error signal even though the input data traverses to the latch output. The error signal ensures that late arriving data from the path in the current pipeline stage does not affect the maximum path delay (max-delay) constraint for adjoining fan-out paths in subsequent pipeline stages. If ample max-delay margin is available for the adjoining paths in the subsequent pipeline stage, then a pulsed latch may replace the DSTB EDS circuit at the current pipeline stage. This would enable traditional time borrowing between the path in the current pipeline stage and the adjoining paths in the subsequent pipeline stage.
   For the DSTB EDS circuit, the high clock phase defines the error-detection window ($T_W$) as illustrated in Fig. 3(d). The max-delay constraint within the presence of worst-case dynamic conditions for max-delay is defined as

$$D_{MAX} \le T_{CYCLE} + T_W - T_{SETUP} \tag{1}$$

$D_{MAX}$ is the maximum path delay, including the clock-to-output delay of the driving sequential circuit and the clock skew and jitter delays, $T_{CYCLE}$ is the cycle time (1/FCLK), and $T_{SETUP}$ is the datapath latch setup time based on the falling clock edge. The minimum path delay (min-delay) constraint during worst-case dynamic conditions for min-delay is calculated as

$$D_{MIN} \ge T_W + T_{HOLD} \tag{2}$$

$D_{MIN}$ is the minimum path delay, accounting for the clock-to-output delay of the driving sequential circuit and the clock skew and jitter delays, and $T_{HOLD}$ is the hold time based on the falling clock edge. The max-delay and min-delay constraints in (1)–(2) only apply to paths with an EDS circuit as the receiving sequential circuit. For a target $T_W$, min-delay requirements are satisfied in pre-silicon design by buffer insertion and sizing. As
the error-detection window ($T_W$) increases, the number of buffers increases, leading to larger area and power. From (1) and (2), the fundamental trade-off in the DSTB EDS circuit is max-delay versus min-delay. As $T_W$ increases, the cycle time ($T_{CYCLE}$) may decrease to enable a higher FCLK while satisfying the max-delay constraint in (1) at the cost of a larger min-delay penalty in (2). For microprocessors with deep pipelines (i.e., small number of logic stages between sequential circuits), this trade-off may not be advantageous due to the stringent min-delay requirements. In recent technology generations, however, the microarchitecture for microprocessors has moved towards shallow pipelines (i.e., large number of logic stages between sequential circuits) to improve energy efficiency [19], [20]. Microprocessors with shallow pipelines greatly relax the min-delay requirements as compared to a deep-pipeline design, enabling a more effective trade-off of max-delay improvement for min-delay penalty.

Fig. 4. Implementation of the error-detection sequential (EDS) circuit. Scan-configured mode signal enables either an EDS circuit (mode = 0), where the initial datapath latch remains transparent, or a traditional master-slave flip-flop (mode = 1), where ERROR is ignored.
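
The max-delay versus min-delay trade-off can be made concrete with a small numeric sketch. It assumes the reading of the constraints given by the surrounding definitions: a path meets max-delay if its delay is at most the cycle time plus the error-detection window minus the latch setup time, and meets min-delay if its delay is at least the window plus the hold time. The function names, the separate shadow-FF setup parameter, and all numbers (in picoseconds) are hypothetical, and metastability near the clock edges is ignored.

```python
def classify_arrival(d_path, t_cycle, t_w, t_setup_ff, t_setup_latch):
    """Classify a path delay seen by a DSTB-style EDS circuit (all times in ps)."""
    if d_path <= t_cycle - t_setup_ff:
        # Data settles before the rising edge: shadow FF and latch agree.
        return "on time"
    if d_path <= t_cycle + t_w - t_setup_latch:
        # Data settles inside the high clock phase: latch captures it,
        # the shadow FF does not, so the XOR raises ERROR.
        return "late, detected"
    # Data misses the falling edge: the max-delay constraint is violated.
    return "max-delay violation"


def min_delay_ok(d_min, t_w, t_hold):
    """Check the min-delay constraint for a short path (all times in ps)."""
    return d_min >= t_w + t_hold


# Hypothetical numbers: 500 ps cycle, 100 ps window, 20 ps setup, 10 ps hold.
examples = [classify_arrival(d, 500, 100, 20, 20) for d in (470, 550, 600)]
```

With these numbers a 550 ps path is tolerated as a detected late arrival at a 500 ps cycle, whereas a conventional flip-flop design would need a cycle of at least 570 ps for the same path. The cost appears on the short paths: every path into the EDS circuit must now be at least 110 ps long, which is what the buffer insertion pays for.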
   To ensure protection from min-delay violations, the high
clock phase (i.e., $T_W$) is tuned at post-silicon with a duty-cycle   ical or near critical timing are converted to the configurable EDS
control circuit. The duty-cycle control circuit maintains a            circuits in Fig. 4. In comparison to a FF, the DSTB EDS circuit
constant high-phase delay for the clock at low and high FCLK           provides an error-detection capability at a cost in power and
values. The duty cycle is calibrated at the highest VCC and            area. These overheads are quantified at the microprocessor level
lowest temperature specifications to provide a worst-case              in Section V.
measurement for min-delay. At these conditions, the maximum
error-detection window is measured via functional                      B. Tunable Replica Circuit (TRC)
testing (i.e., running programs). As a trade-off to reduce the            In comparison to the embedded EDS design, the tunable
calibration time at the cost of less potential benefits, the           replica circuit (TRC) design is a less-intrusive error-detection
tuning can target a delay equal to this maximum window minus a         technique [12] that does not affect critical-path timing. As
guardband, where the guardband satisfies min-delay constraints         described in Fig. 5(a), the TRC consists of a toggle FF and a
across die-to-die and within-die process variations.                   scan-configurable buffer delay chain. The toggle FF switches
   In previous work, a variety of EDS circuits have been pro-          the input to the buffer delay chain every cycle. The TRC
posed [5]–[11]. The transition detector with time borrowing            output drives an EDS circuit to detect timing failures due to
(TDTB) is the lowest clock energy EDS circuit known [11].              dynamic variations. As illustrated in Fig. 5(b), a TRC with an
The TDTB circuit, however, is a complex design since the               EDS circuit is placed adjacent to each pipeline stage in the
dynamic transition detector is sensitive to within-die process         first five stages. At test time, the TRC delays are calibrated to
variations. DSTB is the lowest clock energy static-CMOS EDS            track critical-path delays per pipeline stage. The TRC and the
circuit known [11]. In comparison to TDTB, DSTB allows for             pipeline stage use the same local VCC and clock, enabling the
a simpler implementation at the cost of higher clocking energy.        TRC to detect VCC droops at fine granularity and to capture
Both TDTB and DSTB eliminate datapath metastability [11].              clock-to-data correlations per pipeline stage [12]. If a dynamic
For these reasons, the DSTB circuit is chosen as the embedded          variation induces a late timing transition in the TRC, the EDS
EDS circuit for the resilient microprocessor core.                     circuit generates an error signal, which represents the single
   Fig. 4 provides the schematic of the actual EDS circuit im-         pipeline-error signal as discussed in Section II.A. Although an
plementation, where a scan-enabled latch precedes the datapath         actual timing error may not have occurred in the pipeline if the
latch in the DSTB design from Fig. 3(c). If the scan-input mode        critical paths are not activated, this design inherently assumes a
signal is logically-low, the initial latch remains transparent and     critical-path error did occur and initiates recovery. As described
the circuit logically operates as the DSTB design in Fig. 3(c). A      for the embedded EDS circuits, the single pipeline-error signal
logically-high mode signal disables the EDS circuit, where the         propagates to the WB stage to prevent the potentially errant
error signal is ignored and the two datapath latches behave as         instruction from committing data to the architectural state and
a standard master-slave FF. Since the two datapath latches are         to the ECU to enable recovery.
designed with an equal CLK-to-Q delay and setup time as com-              The key insight of the TRC design is the integration with
pared to a standard FF library cell with equivalent output drive       error recovery. Previous designs with on-die sensors and adap-
strength, the configurable design in Fig. 4 allows for a direct        tive circuits (e.g., canary-based designs) [2]–[4] must detect the
comparison between a resilient design with embedded EDS cir-           path-delay change, communicate the delay change to the adap-
cuits and a conventional design with FFs. Moreover, the mode           tive circuit, and respond by adjusting the operating environment
signal assists silicon debug where groups of EDS circuits per          (e.g., VCC or FCLK) to avoid an actual timing error. The com-
pipeline stage are either enabled or disabled for critical-path        munication and response-time constraints prohibit these designs
analysis. All non-critical paths in the core and memory con-           from detecting and responding to a sudden increase in path delay
troller use standard FF library cells as the receiving sequential      due to fast-changing variations such as a high-frequency VCC
circuit. Only the receiving sequential circuits for paths with crit-   droop. On-die sensors and adaptive circuits primarily reduce the
FCLK guardband for slow-changing variations only, where an FCLK guardband for fast-changing variations is still required. In contrast, the TRC with error recovery eliminates the communication and response-time constraints imposed on canary-based techniques, thus mitigating the FCLK guardbands for both fast- and slow-changing variations. From this perspective, the TRC calibration only requires the TRC to always fail if any critical path fails in the pipeline due to a dynamic variation. In guaranteeing this constraint, the TRC delay is tuned slower than the critical-path delays, hence replacing large delay guardbands for dynamic variations with a much smaller TRC delay guardband.

Fig. 5. (a) Tunable replica circuit (TRC) with an EDS circuit. (b) TRC design integrates error recovery to detect and correct timing errors from fast-changing variations such as high-frequency voltage droops.

   The TRC delays are calibrated at nominal VCC and the highest temperature specification. At these conditions with the TRCs disabled, the microprocessor maximum clock frequency (FMAX) is measured via functional testing. Next, the core executes a no-operation (NOP) program at an FCLK slightly less than FMAX, which is referred to as the calibration FCLK. Scan bit settings enable one TRC and disable the other four TRCs in the core. As illustrated in Fig. 1, the core error signal, which interfaces with the WB pipeline stage and the ECU, is driven off-chip. By observing the core error signal, the TRC delay is tuned to the corresponding cycle time. The delay calibration is then repeated for each TRC in the core. Separately tuning each TRC mitigates the delay variations between critical paths and the TRC due to within-die process variations [21] and allows the TRC to detect VCC droops at fine granularity and capture clock-to-data correlations per pipeline stage. For microprocessor designs with individual pipeline stages spread over a large area, additional TRCs would improve the accuracy of monitoring critical-path delays at a cost of longer calibration time. As a trade-off to reduce the calibration time, only the last TRC in the pipeline (i.e., TRC in MEM stage) requires tuning while disabling the other four TRCs during operation, resulting in a larger TRC delay guardband, and consequently, fewer potential benefits. TRC settings are validated with functional testing while injecting VCC droops as described in Section II.D. Since transistor delay is more sensitive to VCC than interconnect delay, the TRC contains a minimum amount of interconnect to ensure the TRC delay degradation from a VCC droop is either larger or nearly equal to the delay degradation of critical paths in the core. Therefore, the TRC delay, which is adjusted slower than the critical-path delays at nominal VCC, should remain slower than the critical-path delays during a VCC droop. By calibrating the TRC delay at the highest temperature specification, the TRC remains slower than the critical paths as temperature reduces, where the interconnect delay improves faster than the transistor delay at a VCC of 1.0 V. From silicon measurements at 1.0 V, the TRC frequency change tracks the microprocessor FMAX change to within 0.5% from 90 °C to 30 °C. At the cost of additional test time, the guardband between the TRC delay and critical-path delay is further reduced by repeating the calibration steps with a higher calibration FCLK while continuing to validate the TRC settings with functional testing.

TABLE I
ADVANTAGES AND DISADVANTAGES OF EDS AND TRC ERROR-DETECTION DESIGNS

C. Advantages and Disadvantages of EDS and TRC Designs
   Table I lists the key trade-offs between the embedded EDS and TRC designs. The EDS design detects critical-path timing failures for fast and slow as well as long-range and local dynamic variations. In contrast, the TRC design cannot detect path-specific or highly-localized dynamic variations (e.g., delay push-out from cross-coupling capacitance or multiple-inputs switching). Although transistor aging degradation affects the individual transistors in a path depending on the gate voltage and temperature conditions, a separate DC-stressed TRC with a periodically-toggled input can track the worst-case delay of aging and recovery for critical paths and clocks, while capturing the effects of power cycling and sleep modes [12]. As discussed earlier, the TRC requires a delay guardband to ensure the TRC delay is always slower than critical-path delays, thus preventing the possibility of exploiting path-activation rates for higher
Fig. 6. Multiple-issue (MI) instruction replay example with N = 3: After flushing the pipeline, issue the errant instruction N times; N − 1 issues are replica instructions to set up pipeline registers and do not affect the architectural state; the Nth issue is a valid instruction and is allowed to commit data to the architectural state.
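
The issue sequence in Fig. 6 can be sketched in a few lines of Python. This is an illustrative model only, not the ECU logic: the function names and the `valid_issue_ok` callback are hypothetical, and the retry with N = 8 mirrors the fallback of seven replica instructions plus one valid instruction described for this replay scheme.

```python
def mi_replay_issues(n):
    """Commit-enable bit for each of the N issues of the errant instruction.

    The first N-1 issues are replicas (commit bit low); only the Nth issue
    is valid and may commit data to the architectural state.
    """
    if n < 1:
        raise ValueError("N must be at least 1")
    return [False] * (n - 1) + [True]


def recover(valid_issue_ok, first_n=3, fallback_n=8):
    """Try a replay with first_n issues, then retry once with fallback_n.

    valid_issue_ok(n) is a hypothetical stand-in for the hardware check:
    it returns True if the valid (Nth) issue executed without a timing error.
    Returns the N that succeeded and the commit bits that were used.
    """
    for n in (first_n, fallback_n):
        commit_bits = mi_replay_issues(n)
        if valid_issue_ok(n):
            return n, commit_bits
    raise RuntimeError("replay failed even with the fallback N")


if __name__ == "__main__":
    print(mi_replay_issues(3))           # [False, False, True]
    print(recover(lambda n: n >= 8)[0])  # 8
```

Keeping the commit bit separate from the issue count reflects the point that errors during replica issues are ignored; only the final issue is checked.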
performance as with embedded EDS circuits. Furthermore, the TRC design may initiate an error recovery when an actual error did not occur, resulting in unnecessary recovery cycles. In comparison to the EDS design, the TRC design significantly reduces the design complexity overhead. In particular, the TRC design does not affect the min-delay paths in the core, has lower clocking energy, and does not require a duty-cycle control circuit. Moreover, since the core min-delay constraints limit the error-detection window for the EDS design, and consequently, the maximum potential benefits as described in (1)–(2), the TRC design provides a larger error-detection window to detect a wider range of dynamic delay variation. Both designs require post-silicon calibration, which affects testing costs.

                IV. ERROR-RECOVERY TECHNIQUES

   This section describes two separate techniques for error recovery: (i) Instruction replay at 1/2 FCLK and (ii) Multiple-issue instruction replay at FCLK. The core issues an instruction and the corresponding program counter (PC) value at the IF pipeline stage. The PC value then propagates down the pipeline with the instruction. The ECU locally stores the PC of an errant instruction to perform error recovery. Since the original core pipeline already sends the PC to most of the pipeline stages for exception handling, the additional overhead for propagating the PC to the remaining pipeline stages is low. Both error-recovery techniques replay errant instructions, which is similar to the approach for recovering from a branch misprediction. The error-detection circuits prevent the errant instruction from corrupting the architectural state of the microprocessor. Prior to replaying the errant instruction, the ECU initially flushes the pipeline to resolve any complex bypass register issues. After flushing the pipeline, the ECU reissues the errant instruction. If the replayed instruction executes without an error, control logic allows the instruction to commit data to the architectural state, and then subsequent instructions continue normal operation. If an error occurs for the replayed instruction, then the ECU replays the errant instruction again. The error-recovery design and corresponding algorithm settings are programmed in the ECU through scan. When testing a conventional design without error correction, ECU scan bits disable the error-recovery circuits.

A. Instruction Replay at 1/2 FCLK
   In this error-recovery design, FCLK is reduced by half while replaying the errant instruction [10], [11]. As illustrated in Fig. 1 and described in Section II.C, the PLL drives a clock-divider circuit to generate the 1/2 FCLK signal. When the ECU initiates an error recovery, the ECU signals the clock generator to reduce FCLK in half, while the duty-cycle control circuit maintains a constant high-phase delay for the clock to provide min-delay protection for the embedded EDS circuits. This design allows fast clock control without requiring PLL relock. As described earlier, the ECU flushes the pipeline and then reissues the errant instruction. Reducing FCLK in half ensures the replayed instruction executes correctly even if dynamic variations persist. After the replayed instruction finishes, the ECU signals the clock generator to resume at the target FCLK. Since FCLK is halved for all of the recovery cycles, the number of actual and effective recovery cycles per error is 14 and 28, respectively.

B. Multiple-Issue Instruction Replay at FCLK
   The motivation for the multiple-issue instruction replay design is to guarantee correct execution for the replayed instruction without changing FCLK. As illustrated in the example in Fig. 6, the instruction replay starts after the detected error reaches the WB stage. After flushing the pipeline, the ECU issues the errant instruction multiple (N) times without changing FCLK. The first N − 1 issues are replica instructions, which do not affect the architectural state. The Nth issue is a valid instruction, which is allowed to commit data to the architectural state. The replica instructions flow through the pipeline to set up the register input nodes for the valid instruction. Any error that occurs in the execution of these replica instructions is ignored and if the number of replica instructions is sufficient, the register inputs for each pipeline stage statically settle to the correct value, allowing the valid instruction to execute correctly. For the example in Fig. 6, the case of N = 3 corresponds to two replica instructions and one valid instruction. The number of recovery cycles equals          . If the ECU issues an insufficient number of replica instructions such that an error occurs during the execution of the valid instruction, then the ECU replays the errant instruction a second time with N = 8 (i.e., seven replica instructions and one valid instruction). With N = 8, the number
Fig. 7. Design methodology for integrating the resilient error-detection and correction circuits into a standard microprocessor synthesis flow.
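
The sorting of receiving FFs that this flow performs can be sketched as follows. The slack thresholds and all names below are hypothetical placeholders chosen for illustration, since the bucket boundaries are not given; only the three outcomes (FF with extra margin, EDS circuit in bucket A, B, or C, and standard FF) and the X and WB stages as examples of unrecoverable circuits come from the text.

```python
# Stages named in the text as examples of unrecoverable circuits.
UNRECOVERABLE_STAGES = {"X", "WB"}

# Hypothetical slack limits (fraction of cycle time) for buckets A, B, C,
# ordered from most to least timing-critical.
BUCKET_LIMITS = [("A", 0.05), ("B", 0.10), ("C", 0.20)]


def classify_ff(stage, slack_fraction):
    """Map a receiving FF to its sequential-circuit type.

    Returns 'FF+margin' for unrecoverable paths, 'EDS-A', 'EDS-B', or
    'EDS-C' for recoverable critical paths, and 'FF' for recoverable
    non-critical paths.
    """
    if stage in UNRECOVERABLE_STAGES:
        return "FF+margin"
    for bucket, limit in BUCKET_LIMITS:
        if slack_fraction <= limit:
            return f"EDS-{bucket}"
    return "FF"


if __name__ == "__main__":
    for stage, slack in [("WB", 0.01), ("EX", 0.03), ("EX", 0.08), ("DE", 0.50)]:
        print(stage, slack, classify_ff(stage, slack))
```

For a TRC-only design, only the first test (recoverable versus unrecoverable) would apply, and every recoverable path would keep a standard FF.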
of replica instructions equals the number of pipeline stages to ensure the register inputs for each pipeline stage are set to the appropriate value, thus guaranteeing correct execution of the valid instruction.
   In implementing the multiple-issue replay, an additional bit is added to the microprocessor pipeline to denote whether an instruction is allowed to commit data to the architectural state. The ECU sets this bit to a logic-low for all replica instructions and to a logic-high for the valid instruction. Since this error-recovery design relies on setting up path nodes, this technique is directly applicable to static-CMOS circuit designs and would not correct timing errors in dynamic logic circuits.

                      V. DESIGN METHODOLOGY
   The integration of resilient error-detection and correction circuits into a microprocessor core requires two additional steps beyond the typical design flow. First, the design is separated into two categories: (i) Recoverable circuits and (ii) Unrecoverable circuits. Error recovery for some paths in the design is too expensive to implement. For these unrecoverable circuits, extra timing margin is added during design and timing analysis to prevent these circuits from being susceptible to dynamic-variation timing errors. For the error-detection designs in Section III, examples of unrecoverable circuits for the core pipeline in Fig. 1 include any operations in the X or WB pipeline stages. When an error occurs in the core pipeline, the resilient design must prevent the erroneous data from corrupting the architectural state of the microprocessor. As described in Section III and illustrated in Fig. 3(d), the timing-error detection for a path in a given clock cycle occurs during the next cycle. As an example, an error in the DE stage is not identified until the corresponding errant instruction has already started execution in the RA stage. Thus, the error-detection latency prevents these circuits from protecting the X or WB stages since an error in either of these stages would be identified after erroneous data had already started writing to the register file. For this reason, the X and WB stages are designed with additional timing margin to ensure dynamic-variation timing failures do not occur in these two stages. From the original core design, the paths in these two stages are not timing critical, resulting in a low overhead for applying the extra timing guardband.
   Second, the recoverable circuits are further subdivided into critical and non-critical paths for the embedded EDS circuits. After timing analysis, paths with the least timing margin are classified as critical, and consequently, could limit the core performance under worst-case dynamic variations. An EDS circuit replaces the receiving FF for these critical paths to detect potential timing errors from dynamic variations. Non-critical paths have sufficient timing margin and should not limit performance even with worst-case dynamic variations. An EDS circuit does not replace the receiving FF for non-critical paths. This second step is unnecessary when only considering the less-intrusive TRC error-detection design. Rather, post-silicon tuning guarantees that the TRC detects any critical-path error in the pipeline.
   As described in Fig. 7, these two additional steps are inserted into a standard register-transfer-level (RTL) to layout synthesis flow. The flow consists of RTL synthesis, timing analysis, and automatic place and route (APR) with extraction and timing convergence. The design methodology starts with the structural RTL of the microprocessor core, consisting of both VHDL and Verilog code. Next, the FFs in the core are manually separated into two lists: (i) Receiving FFs for recoverable paths and (ii) Receiving FFs for unrecoverable paths. The RTL is then updated with these two lists of FFs. As described earlier, additional timing margin is enforced on receiving FFs for unrecoverable paths to ensure correct timing even in the presence of dynamic variations. These FFs map to a unique timing model, which contains extra setup-time margin as illustrated in Fig. 8(a). At this point in the design flow, the receiving FFs for recoverable paths use standard library FFs as provided in Fig. 8(b). The updated RTL is run through the synthesis and timing analysis flow, including the physical compiler for floor-plan generation. The FFs are appropriately sized during synthesis for timing analysis while maintaining the distinction between recoverable and unrecoverable paths. Static-timing analysis generates a timing report specifying all of the critical paths.
   After timing analysis, the timing report specifies the minimum timing margin for each receiving FF to separate the recoverable paths into critical and non-critical. The non-critical receiving FFs should not limit performance even under worst-case variations, so these sequential circuits remain as standard library FFs. Next, EDS circuits replace the critical receiving FFs. In assigning the EDS circuits, the critical FFs are separated into three timing buckets in order of criticality. Bucket A represents the most timing-critical FFs in the design. Bucket B FFs contain better timing margins than bucket A, although these paths could potentially fail under worst-case dynamic variations. In addition, path reordering after APR and manufacturing could result in these sequential circuits becoming more critical. Bucket C FFs are significantly less critical than the FFs in either bucket A or bucket B. Since these paths could potentially limit performance under the most severe dynamic-variation events,
Fig. 8. Illustration of timing constraints for various sequential designs. (a) Unrecoverable paths apply an additional setup-time margin on the standard flip-flop.
(b) Recoverable non-critical paths retain standard flip-flop timing. (c) Recoverable critical paths insert EDS circuits as the receiving sequential circuit with the
setup-time margin based on the shadow FF and the hold-time margin based on the error-detection window.
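
The three cases in Fig. 8 can be read as simple max-delay and min-delay checks. The sketch below is an illustration under stated assumptions: the parameter names, the additive form of the margins, and the example numbers (in picoseconds) are all hypothetical and are not taken from the timing models used in the design flow.

```python
from dataclasses import dataclass


@dataclass
class SeqTiming:
    setup: float                 # setup time of the FF (shadow FF for an EDS circuit)
    hold: float                  # hold time of the receiving sequential circuit
    extra_setup: float = 0.0     # added margin for unrecoverable paths, Fig. 8(a)
    detect_window: float = 0.0   # error-detection window for EDS circuits, Fig. 8(c)


def max_delay_ok(path_max, t_cycle, t):
    # Late-arrival check: data must settle before the capturing edge,
    # including any extra setup-time margin.
    return path_max + t.setup + t.extra_setup <= t_cycle


def min_delay_ok(path_min, t):
    # Early-arrival check: data must stay stable through the hold time plus
    # the error-detection window (the window is zero for a standard FF).
    return path_min >= t.hold + t.detect_window


if __name__ == "__main__":
    t_cycle = 690.0                                            # assumed cycle time
    standard = SeqTiming(setup=30.0, hold=20.0)                # Fig. 8(b)
    unrecoverable = SeqTiming(30.0, 20.0, extra_setup=60.0)    # Fig. 8(a)
    eds = SeqTiming(30.0, 20.0, detect_window=100.0)           # Fig. 8(c)
    print(max_delay_ok(650.0, t_cycle, standard))       # True
    print(max_delay_ok(650.0, t_cycle, unrecoverable))  # False
    print(min_delay_ok(80.0, standard))                 # True
    print(min_delay_ok(80.0, eds))                      # False
```

The last two lines show why short paths that are safe with a standard FF may need min-delay fixes once an EDS circuit with a nonzero error-detection window is inserted.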
TABLE II
MICROPROCESSOR CORE AREA AND POWER OVERHEADS WITH VCC = 1.0 V FOR EDS AND TRC DESIGNS

bucket C FFs provide a safety option to ensure critical-path coverage. For each pipeline stage, each EDS circuit in a particular timing bucket receives the same scan-enable mode signal as described in Fig. 4. As an example, the same scan bit enables all bucket A EDS circuits in the EX stage. With five pipeline stages containing embedded EDS circuits with three bucket options, the microprocessor core supports fifteen scan bits for enabling/disabling EDS circuits. Grouping the EDS circuits into multiple buckets enhances the observability during speed-path silicon debug by isolating a failing path to a particular pipeline stage and timing bucket. Moreover, this approach provides an option for disabling EDS circuits in post-silicon in case of a min-delay violation.
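
One way to picture the fifteen enable bits is as a small table keyed by pipeline stage and timing bucket. In this sketch the stage list (IF, DE, RA, EX, MEM) is inferred from the stages named in the text, and the bit ordering and helper names are hypothetical; the actual scan-chain layout is not given.

```python
from itertools import product

STAGES = ["IF", "DE", "RA", "EX", "MEM"]   # assumed: the five stages with EDS circuits
BUCKETS = ["A", "B", "C"]                  # most to least timing-critical

# Hypothetical bit ordering: stage-major, bucket-minor.
SCAN_BIT = {key: i for i, key in enumerate(product(STAGES, BUCKETS))}


def enable_mask(enabled):
    """Return an integer with one bit set per enabled (stage, bucket) group.

    The polarity here (1 = enabled) is illustrative only; the text describes
    a logic-high mode signal as disabling the EDS circuit.
    """
    mask = 0
    for key in enabled:
        mask |= 1 << SCAN_BIT[key]
    return mask


def isolate(stage, bucket):
    """Mask that enables a single group, e.g. to localize a failing path."""
    return enable_mask([(stage, bucket)])


if __name__ == "__main__":
    print(len(SCAN_BIT))             # 15
    print(bin(isolate("EX", "A")))   # 0b1000000000
```

Enabling one group at a time narrows a speed-path failure to a stage and bucket, and clearing a group's bit models disabling its EDS circuits after a min-delay violation is found.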
   The RTL is now updated again with the new EDS circuit assignments. Although EDS circuits contain a datapath latch, the setup-time margin is based on the shadow FF. Since the transparency window of the latch defines the error-detection window, traditional time-borrowing is not allowed and FF-based timing is maintained as discussed in Section III.A and illustrated in Fig. 8(c). The EDS circuit requires a longer hold-time margin based on the target error-detection window. The target error-detection window is designed as a specific fraction of the target cycle time, which determines the maximum potential benefits for the EDS design. The updated RTL is re-synthesized to resize logic gates in both critical and non-critical paths to minimize power for specific cycle-time and error-detection-window targets. After running timing analysis again, the timing report is verified to ensure every unrecoverable path contains sufficient max-delay margin, every recoverable critical path is assigned an EDS circuit, and min-delay margins are satisfied. If there is a discrepancy in the timing report, this portion of the design flow is repeated. Once the design is validated with the timing report, the standard APR flow is performed.
   In comparison to the EDS design, the TRC design would only require the first step of separating the core into recoverable and unrecoverable circuits. For the TRC design only, every unrecoverable path must contain sufficient max-delay margin, while all recoverable paths (i.e., non-critical and critical) would use standard library FFs and min-delay constraints would remain identical to a conventional design. The APR design flow generally places the TRC with an EDS circuit close to the desired pipeline stage to reduce error-signal routing. When needed, the APR flow provides an option for placing the TRCs at fixed coordinates in the design.
   Table II lists the area and power overheads for the resilient error-detection and correction circuits. For the embedded EDS design, 12% of the core sequential circuits are converted to EDS circuits, resulting in a 2.2% area penalty. The area and power overheads for resolving min-delay violations with EDS circuits are small due to the shallow-pipeline architecture. These overheads account for EDS circuits in all three timing buckets. The area penalty for the TRC design is 0.8%. The total area overheads for EDS and TRC designs are 3.8% and 2.2%, respectively, including a 1.4% area increase for the ECU and clock control. The total power overhead is less than 1% for the EDS or TRC designs. The area and power overheads for the ECU and clock control circuits are expected to amortize further for a larger core design. Although a power overhead exists when comparing a resilient design to a conventional design at equal FCLK and VCC, the resilient design enables significant performance or energy efficiency benefits from mitigating FCLK guardbands for dynamic variations as discussed further in Section VII.

                   VI. TESTING METHODOLOGY
   The 45 nm resilient microprocessor mounts on a 478-pin flip-chip ball-grid-array (FC-BGA) package, which is socketed
                                                                 TABLE III
                                THREE BENCHMARKS TO MEASURE THE BENEFITS OF THE RESILIENT MICROPROCESSOR CORE
Fig. 10. Demonstration of resilient microprocessor while executing the edgedetect benchmark at an FCLK of 1.5 GHz and a VCC of 1.0 V. (a) Input bitmap image. (b) Correct output of edge-detected image with resilient circuits enabled. (c) Output of edge-detected image when resilient circuits are disabled halfway through the image processing.
Fig. 9. Resilient microprocessor die package and testing board.
                                                                        the majority of the measurement results in Section VII, pos-
in a custom testing board as shown in Fig. 9. The testing               sesses the property that almost any processing error in the core
board interfaces with a logic analyzer and an oscilloscope for          produces an error in the final output image, thus reducing the
silicon debug as well as a host computer with C and Perl-based          probability of error masking.
testing software that communicates with the on-die JTAG                    In Fig. 10, the resilient microprocessor demonstrates the
scan controller for configuration and program execution. After          ability to detect and correct timing errors while executing the
compiling programs, the software and JTAG scan controller               edgedetect benchmark in conjunction with I/O code to send and
load the binary into the instruction cache and the input data           receive images over a universal serial bus (USB) connection
into the data cache. After resetting the microprocessor, pro-           to a host computer. Fig. 10(a) shows the original input bitmap
gram execution starts from the reset address. During program            image. The microprocessor is set at an FCLK of 1.5 GHz and
execution, data writes to the register file and data cache. After       a VCC of 1.0 V. When the resilient EDS and multiple-issue
the program finishes, the contents of the register file and data        replay circuits are enabled, the edgedetect program executes
cache are scanned-out via JTAG scan and testing software                correctly to generate the expected output image as provided in
to verify proper functionality. In addition, general-purpose            Fig. 10(b). During this measurement, an error counter monitors
memory-mapped input and output data ports support external              the number of corrected errors as the resilient core detects
device control and visibility into program execution.                   and corrects more than one million errors per second while
   Since the resilient microprocessor core is based on a public-        maintaining 100% correct output. In Fig. 10(c), the resilient
domain design [15], assemblers and compilers are available.             circuits are disabled halfway through the image processing,
The only additional compiler configurations are disabling the           resulting in erroneous output for the bottom half of the image.
FPU and hardware multiplier instructions and setting the ap-            Thus, the core resiliency features allow correct operation at an
propriate memory locations for the reset address, stack pointers,       FCLK that is impossible for the conventional design.
and data locations. Original debug code is written in assembly
language to target specific features in the microprocessor. Ad-                             VII. MEASUREMENT RESULTS
vanced benchmark programs are written in the C programming                 The microprocessor core without error detection and cor-
language.                                                               rection (i.e., conventional design) executes the edgedetect
   Although many different programs have been successfully              benchmark at an FCLK of 1.45 GHz at 1.0 V and consumes
compiled and executed on the resilient microprocessor, the three        135 mW of power. When a dynamic parameter variation in the
benchmarks in Table III are evaluated to measure the benefits           form of a 10% supply-voltage droop is injected during program execution,
of resilient design in Section VII. The three benchmarks con-           the FCLK reduces to 1.26 GHz, corresponding to a normalized
sist of an edgedetect image-processing algorithm, a linkedlist          throughput of one, as described in Fig. 11. As illustrated by
pointer-following sorting routine, and a bubble data-sorting pro-       the shaded region in Fig. 11, the difference between these two
gram based on the bubble-sort algorithm. These three bench-             FCLK values represents the FCLK guardband for a 10% supply-voltage
marks exercise a variety of common microprocessor instruc-              droop in the conventional design. EDS and TRC designs are
tions. In particular, the edgedetect program, which is used for
                                                                                   Fig. 12. Measured throughput gain of EDS and TRC designs relative to a con-
                                                                                   ventional design for the applications in Table III.
Fig. 11. Measured throughput (TP), as normalized to the conventional max-
imum TP, and recovery cycles, as a percentage of total cycles, versus clock fre-
quency (FCLK) for the edgedetect benchmark.
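As a rough cross-check of the values reported with Fig. 11, normalized throughput can be approximated as the clock-frequency ratio scaled by the fraction of cycles that are not recovery cycles. The Python sketch below assumes this simplified model; the function name `normalized_throughput` and the model itself are illustrative assumptions, not the measurement method used for the silicon data. The frequencies and recovery rates are the ones quoted in Section VII.

```python
def normalized_throughput(f_clk_ghz, recovery_pct, f_ref_ghz=1.26):
    """Approximate throughput relative to the guardbanded conventional design.

    Assumed model (not taken from the paper): useful work scales with clock
    frequency times the fraction of total cycles that are not recovery cycles.
    """
    useful_fraction = 1.0 - recovery_pct / 100.0
    return (f_clk_ghz / f_ref_ghz) * useful_fraction


# Values quoted in the text for the edgedetect benchmark at 1.0 V.
guardband = 1.45 / 1.26 - 1.0             # conventional FCLK guardband for the droop
eds = normalized_throughput(1.46, 0.25)   # EDS design at its optimum FCLK
trc = normalized_throughput(1.42, 0.15)   # TRC design operating point

print(f"guardband: {guardband:.1%}")
print(f"EDS: {eds:.3f}  TRC: {trc:.3f}")
```

Under this model the EDS point evaluates to about 1.156 and the TRC point to about 1.125, which is close to the reported 16% and 12% throughput gains.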
circuit and the error-recovery technique that replays instructions
at 1/2 FCLK. Throughput increases linearly as FCLK increases
with no errors. Once errors are detected and corrected with
either the EDS or TRC designs, instructions per cycle (IPC)
reduce as a function of the recovery rate. As FCLK increases,                      Fig. 13. Measured throughput gain of EDS and TRC designs relative to a con-
the number of timing errors increases, corresponding to a higher                   ventional design for the edgedetect benchmark versus supply voltage.
recovery rate. For the EDS design, throughput gains continue as
FCLK increases throughout the entire FCLK guardband region,                        of resilient circuits to improve throughput by reducing the
where the recovery rate remains low. Once FCLK increases be-                       impact of a high-frequency supply-voltage droop on FCLK. The inclusion
yond 1.45 GHz for the EDS design, timing failures occur even                       of additional dynamic-variation sources (e.g., temperature
at nominal conditions, resulting in a sharp increase in recovery                   change) would further increase the FCLK guardband for the
rate. Since the slowest paths are infrequently activated during                    conventional design, resulting in larger potential benefits for
the edgedetect benchmark, throughput continues to increase                         the resilient designs [11].
for higher FCLK values. The maximum normalized throughput                             In Fig. 12, the throughput benefits for EDS and TRC de-
of 1.16 corresponds to an optimal FCLK and recovery rate of                        signs relative to a conventional design are measured across the
1.46 GHz and 0.25%, respectively. Pushing FCLK beyond this                         three benchmarks in Table III at 1.0 V. Throughput gains for
optimum reduces throughput since the IPC reduction from a                          the EDS design range from 15% to 20%, demonstrating the dif-
larger recovery rate outweighs the FCLK gains. In Fig. 11, the                     ferent activation rates for critical paths among these three pro-
resilient EDS design enables a 16% throughput benefit over                         grams. Since the TRC cannot exploit path-activation rates, the
the conventional design by eliminating the FCLK guardband                          throughput benefit for the TRC design remains at 12% across
for a 10% supply-voltage droop and by exploiting the activation rates for          all three benchmarks.
critical paths. In comparison, the resilient TRC design achieves                      Fig. 13 elucidates a key distinction between the EDS and
a 12% throughput gain at an FCLK of 1.42 GHz and recovery                          TRC designs. In Fig. 13, the throughput gain as compared to
rate of 0.15% by mitigating most of the FCLK guardband.                            a conventional design is measured for EDS and TRC designs
The TRC design provides a smaller throughput advantage than                        across supply voltage while executing the edgedetect benchmark with a
the EDS design at 1.0 V for two reasons: (i) The TRC design                        10% supply-voltage droop. For each supply voltage, the clock duty cycle and TRC
requires a delay guardband to ensure the TRC always fails                          delays are calibrated for EDS and TRC designs, respectively, as
if any critical path in the pipeline stage fails; consequently                     described in Section III. As the supply voltage reduces, the path-delay sensi-
the slowest path limits the TRC performance. (ii) The TRC                          tivity to supply voltage amplifies, resulting in a larger FCLK guardband
design results in unnecessary recovery cycles since a dynamic                      and higher potential benefits for resilient circuits. From Fig. 13,
variation may induce a TRC failure while an actual timing                          the EDS design throughput gain increases from 16% at 1.0 V to
error in the pipeline does not occur if the critical paths are                     28% at 0.8 V and then saturates at 28% from 0.8 V to 0.6 V. Al-
not activated. These measurements demonstrate the ability                          though the EDS design provides a larger benefit than the TRC
Fig. 14. Measured average recovery cycles per error for the edgedetect benchmark for instruction replay at 1/2 FCLK and multiple-issue (MI) instruction replay
at FCLK with the number of issues (N) ranging from 2 to 8 (N − 1 replica instructions and 1 valid instruction).
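The issue sequence behind the multiple-issue replay in Fig. 14 can be sketched as follows. This is a minimal Python illustration of the N − 1 replica issues followed by one valid issue; the function name `multiple_issue_replay` and the dictionary fields are hypothetical and do not describe the hardware implementation.

```python
def multiple_issue_replay(n_issues):
    """List the issues for one errant instruction under multiple-issue replay.

    The first N - 1 issues are replicas that must not commit; only the
    N-th issue is valid and may update the architectural state.
    """
    if n_issues < 2:
        raise ValueError("multiple-issue replay needs at least one replica")
    return [
        {
            "issue": i,
            "kind": "replica" if i < n_issues else "valid",
            "commits": i == n_issues,
        }
        for i in range(1, n_issues + 1)
    ]


# Example with N = 3: two replica issues followed by the valid issue.
for entry in multiple_issue_replay(3):
    print(entry)
```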
design at 1.0 V, the core min-delay constraints limit the max-
imum error-detection window for EDS circuits, and the corre-
sponding maximum potential throughput gain, as described in
(1)–(2). At 0.8 V and 0.6 V, the dynamic delay variation ex-
ceeds the maximum error-detection window for EDS circuits.
Consequently, the EDS design can only mitigate a portion of the
FCLK guardband at 0.8 V and 0.6 V, resulting in a throughput
benefit based on the maximum error-detection window. Since
the throughput gain for the EDS design remains constant from
0.8 V to 0.6 V, the maximum error-detection window for EDS
circuits as a percentage of the minimum cycle time
for the conventional design also remains constant. In contrast,
the core min-delay constraints do not limit the error-detection
window for the TRC design, allowing the TRC design to cap-                      Fig. 15. Measured total energy consumption versus throughput for the edgede-
ture a wider range of dynamic delay variation. From Fig. 13,                    tect benchmark.
the TRC design enables throughput benefits of 12%, 30%, and
51% at 1.0 V, 0.8 V, and 0.6 V, respectively, thus highlighting
the opportunity of providing larger benefits than the EDS de-                   the core pipeline demonstrates that issuing only one replica
sign at lower supply-voltage values. In Fig. 13 at 0.8 V and 0.6 V, the         instruction incurs the fewest recovery cycles per
throughput gains for the EDS and TRC designs are primarily                      error, resulting in a 46% reduction as compared to replaying at
independent of the benchmark. As described in Figs. 11 and 12                   1/2 FCLK. While the reduced number of recovery cycles per
at 1.0 V, the TRC design cannot benefit from infrequently acti-                 error improves performance, the salient advantage of the mul-
vated critical paths. For the EDS design at 0.8 V and 0.6 V, the                tiple-issue instruction replay is correcting errant instructions
larger dynamic delay variation consumes the entire error-detec-                 without requiring clock control.
tion window for EDS circuits, thus preventing the possibility of                   In Fig. 15, the total energy to execute the edgedetect bench-
exploiting path-activation rates across different programs.                     mark with a 10% supply-voltage droop is measured at 0.6 V, 0.8 V, and
   In comparing the two error-recovery techniques in                            1.0 V, and then plotted across the measured throughput data
Section IV, the average number of recovery cycles per error                     in Fig. 13 for the same supply-voltage values. The change in total en-
is measured in Fig. 14 for the instruction replay at 1/2 FCLK                   ergy and throughput corresponds to a change in supply voltage (i.e., higher
and the multiple-issue (MI) instruction replay at FCLK with                     total energy and throughput correspond to higher supply voltage). For a
the number of issues N ranging from two to eight (N − 1                         given supply voltage, the EDS and TRC designs provide larger throughput
replica instructions and one valid instruction). As described in                and smaller energy as compared to the conventional design.
Section IV.B for the multiple-issue replay design with a small                  The total energy reduction directly results from executing the
N, an error may occur during the execution of the Nth issued                    program faster, which decreases leakage energy. In comparing
instruction if an insufficient number of replica instructions are               the EDS and TRC designs to a conventional design, silicon
issued. In this scenario, the errant instruction is replayed a                  measurements demonstrate that resilient circuits enable a 41%
second time with           to guarantee correct operation. Silicon              throughput gain at equal energy or a 22% energy reduction at
measurements are collected while executing the edgedetect                       equal throughput.
benchmark with the EDS design, a 10% supply-voltage droop, a supply voltage        Persistent parameter variations, such as longer-term supply-voltage
of 1.0 V, and an FCLK of 1.46 GHz, which corresponds to the                     droops, temperature changes, or transistor aging, can result in
maximum throughput in Fig. 11. Measured performance for                         long bursts of timing errors that degrade throughput. For these
Fig. 16. Demonstration of adaptive clock control to dynamically optimize clock frequency (FCLK) based on recovery cycles for maximum efficiency. Recovery
cycle count is accumulated and compared to a set of thresholds per sampling period. During a persistent variation, the recovery cycle count exceeds the upper
threshold for two consecutive sampling periods, resulting in a lower FCLK for the duration of the variation.
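The threshold behavior summarized in Fig. 16 can be sketched in software as follows. This is a minimal Python model under stated assumptions: the function name `adapt_clock`, the list of available frequencies, the threshold values, and the recovery-cycle counts are all hypothetical, and the on-die controller changes FCLK through the PLL divide ratio rather than through a lookup list.

```python
def adapt_clock(recovery_counts, upper, lower, f_levels, start_index):
    """Return the clock frequency in effect after each sampling period.

    Assumptions (illustrative only): f_levels is an ascending list of
    selectable frequencies, and FCLK changes by one level after two
    consecutive sampling periods beyond a threshold.
    """
    index = start_index
    above = below = 0          # consecutive periods beyond each threshold
    history = []
    for count in recovery_counts:
        above = above + 1 if count > upper else 0
        below = below + 1 if count < lower else 0
        if above >= 2:
            index = max(index - 1, 0)                   # persistent errors: lower FCLK
            above = 0
        elif below >= 2:
            index = min(index + 1, len(f_levels) - 1)   # nominal again: raise FCLK
            below = 0
        history.append(f_levels[index])
    return history


# Hypothetical frequencies (GHz), thresholds, and recovery-cycle counts.
levels_ghz = [1.30, 1.38, 1.46]
counts = [5, 5, 40, 40, 12, 12, 1, 1]
history = adapt_clock(counts, upper=20, lower=3, f_levels=levels_ghz, start_index=2)
print(history)
```

In this example the frequency drops one level after the second period above the upper threshold and returns to the original level after two periods below the lower threshold, mirroring the sequence in Fig. 16.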
types of dynamic variations, the core resiliency features guide                 embeds error-detection sequential (EDS) circuits into actual
an adaptive clock controller to dynamically change FCLK.                        critical paths to identify late timing transitions. In addition to
Silicon measurements demonstrate this capability in Fig. 16.                    reducing the FCLK guardbands for dynamic variations, the EDS
As discussed in Section II.C, counters in the adaptive clock                    design can enable the microprocessor to operate faster than
control circuit track the number of recovery cycles over a pro-                 infrequently-activated critical paths during nominal conditions.
grammable sampling period and compare to a set of thresholds.                   The second design places a tunable replica circuit (TRC) per
As highlighted in Fig. 11, the maximum throughput directly                      pipeline stage to monitor critical-path delays. Although the
corresponds to an optimum recovery rate, which determines                       TRCs require a delay guardband to ensure the TRC delay is al-
the upper and lower threshold values. After encountering the                    ways slower than critical-path delays, the TRC design captures
persistent variation in Fig. 16, the number of recovery cycles is               most of the benefits from the embedded EDS design with less
greater than the upper threshold for two consecutive sampling                   implementation overhead. In contrast to the embedded EDS
periods. Consequently, the adaptive clock controller changes                    design where core min-delay constraints limit the error-detec-
the PLL divide ratio to reduce FCLK. After the PLL relocks, the                 tion window, and corresponding potential benefits, the TRC
adaptive clock controller monitors the recovery cycles at the                   design is independent of core min-delay constraints, resulting
lower FCLK value. During the persistent variation, the lower                    in a wider error-detection window and higher potential benefits
FCLK value is optimal for maximum throughput. After nominal                     as demonstrated with measurements at low supply voltages.
conditions are restored, the number of recovery cycles is less                  The combination of either error-detection design with
than the lower threshold for two consecutive sampling periods,                  error recovery enables the detection and correction of timing
resulting in an FCLK increase.                                                  errors from fast-changing variations (e.g., high-frequency supply-voltage
   In a conventional design, changing the PLL divide ratio re-                  droops).
quires a pipeline hold while the PLL relocks. In contrast, the                     The microprocessor core also integrates two techniques for
resilient error-detection and recovery circuits allow the micro-                error recovery to guarantee correct execution even if dynamic
processor core to continue operation during PLL relock, where                   variations persist. The first technique replays errant instructions
timing errors due to short clock cycles are detected and cor-                   at 1/2 FCLK. In comparison, the second technique introduces a
rected. This requires a sufficiently large error-detection window               multiple-issue instruction replay design to correct errant instruc-
for either EDS or TRC designs to detect timing violations from                  tions with a lower performance penalty and without requiring
dynamic variations and short cycles during PLL relock. Com-                     clock control. This recovery technique issues the errant instruc-
bining error-detection and recovery circuits with dynamic adap-                 tion multiple (N) times. The first N − 1 issues are replica in-
tation enables the microprocessor to adapt to the operating en-                 structions, which do not affect the architectural state. The Nth
vironment to deliver maximum efficiency.                                        issue is a valid instruction, which is allowed to commit data to
                                                                                the architectural state. The replica instructions set up the register
                          VIII. CONCLUSION                                      input nodes for each pipeline stage, allowing the valid instruc-
   A 45 nm microprocessor core employs resilient error-de-                      tion to execute correctly.
tection and recovery circuits to improve performance and                           A description of the design methodology for integrating the
energy efficiency by mitigating the clock frequency (FCLK)                      error-detection and recovery circuits into a microprocessor core
guardbands for dynamic parameter variations. The core inte-                     clarifies the necessary steps beyond a standard design flow. Fur-
grates two separate designs for error detection. The first design               thermore, discussions of the post-silicon calibration for EDS
and TRC designs provide insight into the trade-off between po-                         [15] LEON-3 [Online]. Available: http://www.gaisler.com/cms/index.
tential benefits and testing cost. Silicon measurements from the                            php?option=com_content&task=section&id=4&Itemid=33
                                                                                       [16] R. Kumar and G. Hinton, “A family of 45 nm IA processors,” in IEEE
45 nm microprocessor demonstrate that resilient circuits enable                             ISSCC Dig. Tech. Papers, Feb. 2009, pp. 58–59.
a 41% throughput benefit at iso-energy or a 22% energy re-                             [17] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas, and R. Kumar,
duction at iso-throughput, as compared to a conventional de-                                “Next generation Intel® Core™ micro-architecture (Nehalem)
                                                                                            clocking,” IEEE J. Solid State Circuits, pp. 1121–1129, Apr. 2009.
sign when executing a benchmark program with a 10% supply-voltage                      [18] K. Bowman et al., “Dynamic variation monitor for measuring the im-
droop. In addition, the resilient circuits in the microprocessor                            pact of voltage droops on microprocessor clock frequency,” in Proc.
core guide a new adaptive clock control circuit that tracks re-                             IEEE CICC, Sep. 2010, no. 17-1.
                                                                                       [19] V. Srinivasan et al., “Optimizing pipelines for power and per-
covery cycles and adapts to persistent errors by changing FCLK.                             formance,” in Proc. IEEE/ACM Int. Symp. Microarchitecture
The combination of error-detection and recovery circuits with                               (MICRO-35), Nov. 2002, pp. 333–344.
dynamic adaptation enables the microprocessor to adapt to the                          [20] A. Hartstein and T. R. Puzak, “The optimum pipeline depth considering
                                                                                            both power and performance,” ACM Trans. Arch. Code Opt. (TACO),
operating environment to deliver maximum efficiency.                                        pp. 369–388, Dec. 2004.
                                                                                       [21] K. A. Bowman, S. G. Duvall, and J. D. Meindl, “Impact of die-to-die
                                                                                            and within-die parameter fluctuations on the maximum clock frequency
                                                                                            distribution for gigascale integration,” IEEE J. Solid-State Circuits, pp.
                          ACKNOWLEDGMENT                                                    183–190, Feb. 2002.
   The authors express sincere appreciation to Ken Ikeda and
Pavan Karidi for mask design, Saurabh Dighe, Jason Howard,
Greg Ruhl, David Jenkins, and David Finan for design assis-
tance, Trang Nguyen for lab support, and Nitin Borkar and Greg                                                  Keith A. Bowman (S’97–M’02) received the B.S.
                                                                                                                degree in electrical engineering from North Carolina
Taylor for encouragement and support.                                                                           State University, Raleigh, NC, in 1994 and the M.S.
                                                                                                                and Ph.D. degrees in electrical engineering from
                                                                                                                the Georgia Institute of Technology, Atlanta, GA, in
                                                                                                                1995 and 2001, respectively.
                               REFERENCES                                                                         He is currently a Staff Research Scientist in the
                                                                                                                Circuit Research Lab (CRL) at Intel Corporation,
                                                                                                                Hillsboro, OR. From 2001 to 2004, he worked as a
   [1] A. Muhtaroglu, G. Taylor, and T. R. Arabi, “On-die droop detector for                                    Senior Computer-Aided Design (CAD) Engineer in
       analog sensing of power supply noise,” IEEE J. Solid-State Circuits,                                     the Technology-CAD Division at Intel, Hillsboro,
       pp. 651–660, Apr. 2004.                                                      to develop and support statistical-based models, methodologies, and software
   [2] T. Fischer, J. Desai, B. Doyle, S. Naffziger, and B. Patella, “A 90-nm       tools to predict microprocessor performance and power variability. Since
       variable frequency clock system for a power-managed itanium archi-           joining CRL in 2004, his research has focused on the development of circuit de-
       tecture processor,” IEEE J. Solid-State Circuits, pp. 218–228, Jan.          sign solutions to mitigate the impact of parameter variations on microprocessor
       2006.                                                                        performance and power. He has published over 50 technical papers in refereed
   [3] R. McGowen et al., “Power and temperature control on a 90-nm ita-            conferences and journals and presented 15 tutorials on variation-tolerant circuit
       nium family processor,” IEEE J. Solid-State Circuits, pp. 229–237, Jan.      designs.
       2006.
   [4] J. Tschanz et al., “Adaptive frequency and biasing techniques for tol-
       erance to dynamic temperature-voltage variations and aging,” in IEEE
       ISSCC Dig. Tech. Papers, Feb. 2007, pp. 292–293.
   [5] P. Franco and E. J. McCluskey, “Delay testing of digital circuits by                                   James W. Tschanz (M’99) received the B.S. degree
       output waveform analysis,” in Proc. IEEE Int. Test Conf., Oct. 1991,                                   in computer engineering and the M.S. degree in elec-
       pp. 798–807.                                                                                           trical engineering from the University of Illinois at
   [6] P. Franco and E. J. McCluskey, “On-line testing of digital circuits,” in                               Urbana-Champaign in 1997 and 1999, respectively.
       Proc. IEEE VLSI Test Symp., Apr. 1994, pp. 167–173.                                                       Since 1999, he has been a circuits researcher
   [7] M. Nicolaidis, “Time redundancy based soft-error tolerance to rescue                                   with the Intel Circuit Research Lab, Hillsboro, OR.
       nanometer technologies,” in Proc. IEEE VLSI Test Symp., Apr. 1999,                                     He also taught VLSI design for seven years as an
       pp. 86–94.                                                                                             adjunct faculty member at the Oregon Graduate
   [8] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level                                   Institute, Beaverton, OR. His research interests in-
       timing speculation,” in Proc. IEEE/ACM Int. Symp. Microarchitecture                                    clude low-power digital circuits, design techniques,
       (MICRO-36), Dec. 2003, pp. 7–18.                                                                       and methods for tolerating parameter variations. He
   [9] S. Das et al., “A self-tuning DVS processor using delay-error detection      holds 41 issued patents in those areas.
       and correction,” IEEE J. Solid-State Circuits, pp. 792–804, Apr.
       2006.
  [10] S. Das et al., “Razor II: In situ error detection and correction for PVT
       and SER tolerance,” IEEE J. Solid-State Circuits, pp. 32–48, Jan. 2009.                                Shih-Lien L. Lu (M’89–SM’10) received the B.S.
  [11] K. A. Bowman et al., “Energy-efficient and metastability-immune re-                                    degree in EECS from the University of California at
       silient circuits for dynamic variation tolerance,” IEEE J. Solid-State                                 Berkeley in 1980, and the M.S. and Ph.D. degrees in
       Circuits, pp. 49–63, Jan. 2009.                                                                        CSE from the University of California at Los Angeles
  [12] J. Tschanz et al., “Tunable replica circuits and adaptive voltage-fre-                                 (UCLA) in 1984 and 1991, respectively.
       quency techniques for dynamic voltage, temperature, and aging vari-                                       He worked on the MOSIS project at USC/ISI
       ation tolerance,” in IEEE Symp. VLSI Circuits Dig., Jun. 2009, pp.                                     which provides the research and education commu-
       112–113.                                                                                               nity VLSI fabrication services from 1984 to 1991
  [13] J. Tschanz et al., “A 45 nm resilient and adaptive microprocessor core                                 and served on the faculty of the ECE Department at
       for dynamic variation tolerance,” in IEEE ISSCC Dig. Tech. Papers,                                     Oregon State University (OSU) from 1991 to 1999.
       Feb. 2010, pp. 282–283.                                                                                While at OSU, he received the College of Engi-
  [14] K. Mistry et al., “A 45 nm logic technology with high-k+metal gate           neering Carter Award for outstanding and inspirational teaching in 1995 and the
       transistors, strained silicon, 9 Cu interconnect layers, 193 nm dry pat-     College of Engineering Engelbrecht Young Faculty Award in 1996. Currently,
       terning, and 100% Pb-free packaging,” in IEEE IEDM Tech. Dig., Dec.          he is a Principal Researcher and leads a research group on microarchitecture
       2007, pp. 247–250.                                                           in Intel Labs.
                           Paolo A. Aseron received the B.S. degree in com-                                      Carlos Tokunaga (S’98–M’08) received the B.S. de-
                           puter engineering from the University of the Philip-                                  gree in electronics engineering from the University
                           pines in 2001.                                                                        of Los Andes, Bogota, Colombia, in 2001, and the
                              He has been with Intel Labs, Hillsboro, OR, since                                  M.S. and Ph.D. degrees in electrical engineering from
                           2006. Prior to joining Intel, he worked for Canon on                                  the University of Michigan, Ann Arbor, in 2005 and
                           Systems-on-a-Chip platform development from 2001                                      2008, respectively.
                           to 2003. His interests include high-performance low-                                     He is currently a Research Scientist at the Circuit
                           power architecture and circuits, memory, and inte-                                    Research Lab, Intel, Hillsboro, OR. His research in-
                           grated power delivery.                                                                terests include VLSI design with particular emphasis
                                                                                                                 on energy-efficient resilient circuits and security
                                                                                                                 based circuit design.
Muhammad M. Khellah (M’99) received the Ph.D. in electrical and computer engineering from the University of Waterloo, Ontario, Canada, in 1999.
   He is a Research Scientist at Intel Labs, Hillsboro, OR, where he does research on low-power circuits. He first joined Intel in 1999 and was involved in the design of L1/L2 SRAM caches for the P3 and P4 microprocessor products. He has published about 70 technical papers in refereed international conferences and journals and has 60 patents granted and 10 pending.
   Dr. Khellah is a regular reviewer for JSSC, TCAD, TVLSI, and TCAS I and II. He currently serves on the technical program committees of the IEEE CICC and the IEEE ISLPED.

Chris B. Wilkerson graduated from Carnegie Mellon University with his master’s degree in 1996.
   He has published papers on a number of microarchitectural topics including value prediction, branch prediction, cache organization, runahead, and advanced speculative execution. Recently, he has focused on low-power design including microarchitectural mechanisms to enable low-voltage operation for microprocessors.

Arijit Raychowdhury (S’00–M’07) received the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, in 2007.
   He is currently a research scientist in the Circuits Research Lab, Intel Corporation, Hillsboro, OR. Previously he worked as an Analog Circuit Designer with Texas Instruments Inc., India (2002 to 2003) and as a summer intern with Intel Corporation (2005 and 2006). His research interests include low power and high performance digital circuit design, design of on-chip sensors, and memory.
   Dr. Raychowdhury has received academic excellence awards in 1997, 2000, and 2001, the Meissner Fellowship from Purdue University in 2002, the Intel Ph.D. Fellowship Award in 2005, and the Dimitri N. Chorafas Award for the best doctoral thesis in 2007. He received Best Paper Awards at the IEEE Nanotechnology Conference 2003 and ISLPED 2006. He has served on the Technical Program Committees of ICCAD, VLSI Conference, and ISQED.

Tanay Karnik (M’88–SM’04) received the Ph.D. in computer engineering from the University of Illinois at Urbana-Champaign in 1995.
   He is a Principal Engineer and Program Director in Intel Labs’ Academic Research Office. His research interests are in the areas of variation tolerance, power delivery, soft errors and physical design. He has published over 45 technical papers, and has 44 issued and 33 pending patents in these areas. He received an Intel Achievement Award for the pioneering work on integrated power delivery. He has presented several invited talks and tutorials, and has served on five Ph.D. students’ committees.
   Dr. Karnik was a member of ISSCC, DAC, ICCAD, ICICDT and ISQED program committees and JSSC, TCAD, TVLSI, TCAS review committees. He was the General Chair of ASQED’10, ISQED’08, ISQED’09 and ICICDT’08. He is an ISQED Fellow and has been a Guest Editor for JSSC.

Vivek K. De (SM’07) received the Bachelor’s degree
in electrical engineering from the Indian Institute of Technology, Madras, India, in 1985 and the Master’s degree in electrical engineering from Duke University, Durham, NC, in 1986. He received the Ph.D. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1992.
   He is an Intel Fellow and Director of Circuit Technology Research in Intel Labs, Hillsboro, OR. He joined Intel in 1996 as a staff engineer in the Circuits Research Lab (CRL) in Hillsboro. Since that time he has led research teams in CRL focused on developing advanced circuits and design techniques for low-power and high-performance processors. In his current role, he provides strategic direction for future circuit technologies and is responsible for aligning CRL’s circuit research with technology scaling challenges. Prior to joining Intel, he was engaged in semiconductor devices and circuits research at Rensselaer Polytechnic Institute and Georgia Institute of Technology, and was a visiting researcher at Texas Instruments.
   Dr. De has published 167 technical papers in refereed conferences and journals, and 6 book chapters on low power circuits. He holds 154 patents, with 40 more patents filed (pending). He received an Intel Achievement Award for his contributions to a novel integrated voltage regulator technology.

Bibiche M. Geuskens (M’07) received the B.S. degree in electrical engineering from the Vrije Universiteit Brussel, Brussels, Belgium, in 1992 and the M.S. and Ph.D. degrees in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1993 and 1997, respectively.
   In 2006, she joined the Circuit Research Lab (CRL) at Intel Corporation, Hillsboro, OR, as a Staff Research Scientist. From 1999 to 2006, she worked as a Staff/Senior Component Design Engineer in the Memory Design Unit of the Microprocessor Design Division at Intel, Hillsboro. She was responsible for the design, implementation and validation of numerous circuit blocks. Since joining CRL in 2006, her research has focused on the development of low power circuit design techniques for on-chip memories. Her current research interests include CMOS biosensor design applications and on-chip power delivery circuit solutions.