Introduction to Intel® Core™ Duo Processor Architecture
INTRODUCTION

The Intel® Core™ Duo processor is a new member of the Intel® mobile processor product line. It is the first Intel® mobile microarchitecture that uses CMP (Core Multi-Processor; i.e., multiple cores on die) technology. Targeted to the market of general-purpose mobile systems, the Intel Core Duo core was built to achieve high performance, while consuming low power and fitting into different thermal envelopes.

To achieve the required performance, a CMP-based microarchitecture was designed. To keep the architecture power efficient, each performance improvement was evaluated against its power cost, and only the power-efficient performance features were implemented. On top of that, special hardware mechanisms were added to better control the static and the dynamic power consumption. As a result, the Intel Core Duo processor provides higher performance in the same form factors without needing to increase the cooling capability.

Building a general-purpose mobile core is a challenging task since, on the one hand, the system needs to maintain the highest level of performance, while on the other hand, the system must fit into different thermal envelopes, as illustrated by Figure 1, and improve power efficiency.

Figure 1: Products using different thermal envelopes*

Intel Core Duo is based on the Pentium M processor 755/745 core microarchitecture with a few performance improvements at the level of each single core. The major performance boost is achieved from the integration of dual cores on the die (CMP architecture). This agrees with our assessment that continuing to improve single-thread performance is rather costly in terms of power and may achieve diminishing returns in terms of efficiency, if major microarchitecture enhancements are not made. The big potential for improved performance is through exploring parallelism between threads. However, the CMP architecture presents many challenges for power and thermal control to still fit into the mobility constraints.

In this paper we present the new Intel Core Duo microarchitecture and show how the need to target power-efficient general-purpose processors has affected many of our decisions. We provide a general overview of the different ingredients of the Intel Core Duo system, while the other papers in this issue of the Intel Technology Journal focus on more specific aspects of the system such as the CMP microarchitecture and the power and thermal control methods.
Figure 2: Intel Core Duo processor floor plan

As Figure 2 shows, Intel Core Duo technology is based on two enhanced Pentium M cores that were integrated and use a shared L2 cache. The way we integrated the dual core in the system had a major impact on our design and implementation process. In order to meet the performance and power targets we aimed to do the following:

   • Keep single-thread performance similar to or better than that of the previous generation of processors in the Pentium M family (that use the same-size L2 cache).

   • Significantly improve the performance for multithreaded and multi-process software environments.

   • Keep the average power consumption of the dual core the same as previous generations of mobile processors (that use a single core).

   • Ensure that this processor fits in all the different thermal envelopes the processor is targeted to.

In this paper we provide a high-level description of the main Intel Core Duo features and discuss how each feature fits into the targets of the various projects.

THE IMPROVED PENTIUM M PROCESSOR-BASED CORES

The core of the Intel Core Duo processor-based technology is an enhanced Pentium M processor 755/745¹ core converted to 65nm process technology.

¹ Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families.

The main focus of the core enhancements was to do the following:

   • Support virtualization (Virtualization Technology²) [3].

   • Support the new Streaming SIMD Extension (SSE3) [4].

   • Address performance inefficiencies mainly in the handling of SSE/SSE2, FP (x87) and some long latency integer instructions.

² Intel® Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.

Intel Core Duo Processor-based Technology Core Performance Improvements

Intel Core Duo processor-based technology introduces performance improvements in the following areas:

   • Streaming SIMD Extensions (SSE/2/3)

   • Floating Point (x87)

   • Integer

The main difficulty with SSE implementation in Pentium M is caused by the fact that SSE/2/3 is a 128-bit wide architecture while the Pentium M execution core is 64 bits wide (in order to meet power and energy constraints). Making the machine twice as wide may produce more heat and so will have a significant impact on the Thermal Design Point (TDP) of the system as well as some impact on battery life. Since the Pentium M was primarily designed for mobility we preferred to make it relatively narrow and cope with the SSE performance issues. The by-product of this tradeoff is that each SSE vector operation is “broken” into 64-bit wide micro-operation (uOp) pairs. Such instructions suffer from several performance bottlenecks in the Pentium M pipeline, mainly in the Front End (FE) of the pipeline. For example, the Instruction Decoder in the Pentium M processor can potentially handle three instructions per cycle but only the first decoder in a row is capable of handling complex instructions. The other two decoders are limited to single uOp instructions only. This works fine in most cases since the most frequent instructions are single uOp. However, this is not the case with SSE instructions: only scalar SSE operations are single uOps while the vector operations are typically 2-4 uOps. This results in several potential bottlenecks in the
FE: the Instruction Decoder in the Pentium M can only handle one SSE vector operation per cycle, causing starvation in the rest of the machine. This bottleneck was addressed in the Intel Core Duo core: a new mechanism was introduced that allows lamination of pairs of similar uOps. This mechanism along with enhanced uOp fusion allows handling of the SSE/2/3 vector operation by a single laminated uOp. The instruction decoders were modified to handle three such instructions per cycle, significantly increasing the decode bandwidth of SSE vector operations. The laminated uOps streaming down the pipe are at a certain point un-laminated, reproducing again the 64-bit wide uOp pairs to feed the machine. These changes not only improve performance of vector operations but also save some energy since the FE, no longer a bottleneck, can be clock gated whenever its uOp buffer is filled beyond a certain watermark.

Another bottleneck that was discovered was the handling of the floating point (FP) Control Word (CW). The FP CW is part of the x87 state and was usually viewed as “constant”; namely it is loaded once at the beginning and stays constant throughout the program. This is indeed the way the FP CW is used by most programs. However, there are some FP applications that manipulate the “rounding control” which is located in this register: the default rounding mode is “rounding to nearest even” but before converting results to fixed point, some applications change the round control to “chop” (this is the rule with C programs for example). Such behavior was treated rather inefficiently by the Pentium M core: each manipulation of the FP CW was effectively stalling the pipeline until its completion. The Intel Core Duo core introduced a new renaming mechanism for the FP CW so that four different versions of this register can coexist on the fly without stalling the machine.

Intel Core Duo also improved the latency of some long latency integer operations such as Integer Divide (IDIV). Although these instructions are not very frequent, because of their extremely long latencies, their cumulative effect on integer benchmark scores has been shown to be very significant. The basic Divide algorithm has remained unchanged; however, Intel Core Duo Divide logic exploits opportunities for “early exit.” The Divide logic calculates in advance the number of iterations that are required to accomplish the operation. This is indeed data dependent; however, it is often significantly smaller than the maximal number of iterations. Once the required number of iterations is accomplished the divider wraps up the results. This does not impact the maximal Integer Divide latency; however, on average it is much faster.

Another enhancement that benefits different kinds of applications is the introduction of a new hardware (H/W) prefetcher mechanism. This mechanism identifies streaming loads at a very early stage in the machine and speculatively predicts the future incarnation of these loads. These speculative requests are looked up in the shared L2 cache and, if they miss, they are speculatively prefetched from the external memory. This mechanism is dynamically deactivated whenever there are many demand requests pending (a watermark mechanism). The benefit of this change is a reduction in average load latency.

The performance implications of these enhancements on single-threaded (ST) applications as well as on multithreaded (MT) applications are discussed in [1].

CMP–GENERAL STRUCTURE

Intel Core Duo processor-based technology implements a shared cache-based CMP microarchitecture in order to maximize the performance of both ST and MT applications (assuming the same L2 cache size). Figure 3 describes the general structure of our implementation. The figure shows the following:

   • Each core is assumed to have an independent APIC unit to be presented to the OS as a “separate logical processor.”

   • From an external point of view the system behaves like a Dual Processor (DP) system.

   • From the software point of view, it is fully compatible with Intel Pentium 4 processors with Hyper-Threading (HT) Technology³ [6], and DP-based systems. However, special optimizations could be applied to improve the performance of the shared-cache organization.

   • Each core has an independent thermal control unit (discussed later in this paper and also covered in [2]).

   • The system combines per-core power state together with package-level power state.

³ Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting HT Technology and a HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use.

The paper CMP Implementation in Intel Core Duo Systems [1] extends the discussion on the CMP implementation and compares its performance with other configurations such as the use of split cache architecture. The results shown there indicate that the new proposed
microarchitecture maximizes the performance benefits of both ST and MT execution at a given cache size. The enhancements we implemented in each of the cores allow us to improve both the ST performance (in specific cases) as well as the MT execution. They also allow us to improve the power and the thermal control of the system, and to achieve average power consumption similar to that of the single-core Pentium M processor.

Figure 3: The general structure of the Intel Core Duo implementation

POWER CONTROL

Extending the battery life, while improving the performance, was one of the main goals in designing the Intel Core Duo processor. Battery life is affected by dynamic power, caused when the processor is active, and by static power, which is the power wasted when a unit or the entire processor is not active. Intel Core Duo microarchitecture saves both types of power.

Figure 4 describes the general process we followed in order to reduce the power during the development cycle of the Intel Core Duo processor. As can be seen, the average power consumption was reduced by handling the problem at all different levels of the design, starting with adjusting the process technology through all the design stages of production.

Figure 4: Low-power processor design process

In order to save leakage power, the Intel Core Duo system uses mainly two techniques: enhanced sleep states control and Dynamic Intel® Smart cache sizing. In order to control the active power consumption, Intel Core Duo technology uses a technique based on Intel SpeedStep® technology.

The traditional way to control the power and thermal behavior of the system is via a software/hardware interface. One of the most common schemes to achieve this is called ACPI [5], where the system defines different levels of sleep modes, and each of the states represents a more efficient way to save power, at the expense of a longer time to bring the system back into operational mode. (For more details on this method, please see [2].) The challenge of adding a second core on die while improving the overall power consumption demands an improvement to the power states of the system in order to avoid power being wasted whenever a core is not active. We face two main problems: (1) since only a single power plane is used, it forces us to run all cores with the same voltage and frequency, and (2) the chipset and the OS see both cores as a single entity that has the same state at the same time. Thus, the Intel Core Duo processor presents two separate views on the power state of the system; internally we manage the states of each core independently (we call it per-core power state) and externally we view the system as having a single, synchronized power state. Figure 5 provides an overview of this approach.
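The two views of the power state (independent per-core states internally, one synchronized state externally) can be sketched in a few lines of Python. This is an illustration only, not the processor's actual logic: the rule that the externally visible state is the shallowest state requested by any core is an assumption made for this sketch, and the names CState and package_state are hypothetical. The state names follow Figure 5.

```python
from enum import IntEnum


class CState(IntEnum):
    """Sleep states ordered from shallowest (C0, active) to deepest."""
    C0 = 0
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4


def package_state(core_requests):
    """Return the single, synchronized state presented externally.

    Assumption for illustration: the package can go no deeper than the
    shallowest state requested by any of its cores.
    """
    return min(core_requests)


# Core 0 asks for C4 while core 1 is only halted (C1):
# under the assumed rule the package as a whole stays at C1.
print(package_state([CState.C4, CState.C1]).name)
```

Under this assumed rule a deep, voltage-reducing state is reached only when both cores request it, which matches the constraint that the two cores share a single power plane.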
  –    CPU/package sleep states:
  –    C0  – Active          CPU is on
  –    C1  – Auto Halt       Core clock is off
  –    C2  – Stop clock      Core and bus clock are off
  –    C3  – Deep sleep      Clock generator is OFF
  –    C4  – Deeper sleep    Reduced VCC
  –    DC4 – Deeper C4       Further reduced VCC

Figure 5: Power states of the Intel Core Duo processor

As we can see, the Intel Core Duo processor defines five different sleep states of the system. The first three states allow local power-saving measures to be activated individually per core, while the last two states require coordination of the entire package for the power-saving measures to be activated.

A core that is in the C0 power state is assumed to be in running mode. When the core has nothing to do, the OS issues a halt command that moves it to CC1 (the per-core C1 state), where execution is halted and clocks are stopped. When it detects even lower levels of activity (via the ACPI mechanisms [2]), the OS will further promote the idle state of each of the cores beyond CC1 to CC2, CC3, or CC4 states, based on the core activity history. In the CC2 and CC3 states, additional core-level power-saving measures can be activated, achieving a lower average power consumption. Starting from the C4 state, core voltage reduction is applied to further increase average power savings. Since the cores are connected to the same power plane, this must be done in coordination between the two cores, and this is known as package-level C4 and package-level DC4.

While in a sleep state, the system still consumes static power (leakage). In Intel Core Duo technology, we implement an advanced algorithm that tries to anticipate the effective cache memory footprint that the system needs when moving from a deep sleep state to an active mode. The new mechanism keeps active only the minimum cache memory size needed, and it uses special circuit techniques to keep the rest of the cache memory in a state that consumes only a minimal amount of leakage power.

In order to control the active power consumption, Intel Core Duo technology uses Intel SpeedStep technology. A set of working points is defined, each with a different frequency and voltage and therefore a different power consumption. The system selects the working point at which it operates in order to strike a balance between the performance needs and the dynamic power consumption. This is usually done via the OS, using ACPI.

Figure 6: Changing working point in Intel Core Duo processor

The way the system moves from one working point to another is described in Figure 6. As illustrated, in order to move from a “high” working point to a lower one, the system can switch the frequency almost immediately, but it will take the system some time to lower the voltage. When moving from a low working point to a higher one, we need to increase the voltage first (slow operation) and only then can we increase the frequency.

By extending the hardware mechanisms to better support advanced power states and sleep states, the Intel Core Duo processor achieves improved power performance efficiency. The power-efficiency improvement over processor generations is shown in Figure 7. As a result, the Intel Core Duo processor provides higher performance in the same form factor without needing to increase the cooling capability.
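The ordering constraint on working-point transitions (frequency first when going down, voltage first when going up) can be illustrated with a short Python sketch. It is a hedged illustration only: the WorkingPoint type, the transition_steps function, and the frequency and voltage values are hypothetical and are not taken from the processor specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkingPoint:
    freq_mhz: int      # hypothetical core frequency
    voltage_mv: int    # hypothetical supply voltage


def transition_steps(current, target):
    """Return the ordered steps for moving between two working points.

    Moving up: raise the voltage first (slow), then the frequency.
    Moving down: drop the frequency first (fast), then the voltage.
    """
    if target.freq_mhz > current.freq_mhz:
        return [("set_voltage", target.voltage_mv),
                ("set_frequency", target.freq_mhz)]
    if target.freq_mhz < current.freq_mhz:
        return [("set_frequency", target.freq_mhz),
                ("set_voltage", target.voltage_mv)]
    return []  # already at the requested working point


high = WorkingPoint(freq_mhz=2000, voltage_mv=1300)
low = WorkingPoint(freq_mhz=1000, voltage_mv=950)

for step in transition_steps(low, high):
    print(step)
```

In both directions the sketch never runs the higher frequency at the lower voltage, which is the point of the ordering described in the text.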
Figure 7: Power performance efficiency
[Bar chart: Perf/W, performance optimized and power optimized, for Pentium-M, Pentium-M 700, and Core Duo; vertical axis 0 to 200]

THERMAL DESIGN POINT
Thermal management is another fundamental capability of all mobile platforms. Managing the platform thermals enables us to maximize CPU and platform performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise.

In order to better control the thermal conditions of the system, the Intel Core Duo processor presents two new concepts: the use of digital sensors for high accuracy die temperature measurements and dual-core multiple-level thermal control.

In the previous Pentium M processor, a single analog thermal diode was used to measure die temperature. The thermal diode cannot be located at the hottest spot of the die and therefore some offset was applied to keep the CPU within specifications. For these systems it was sufficient, since the die had a single hot spot. In the Intel Core Duo processor, there are several hot spots that change position as a function of the combined workload of both cores. Figure 8 shows the differences between the use of the traditional analog sensor and the use of the new digital sensors.

Figure 8: Analog vs. digital sensors in Intel Core Duo processors

As we can see, the use of multiple sensing points provides high accuracy and close proximity to the hot spot at any time. An analog thermal diode is still available on the Intel Core Duo processor. The use of a digital thermometer allows tighter thermal control functions, allowing higher performance in the same form factor. The improved capability also allows us to achieve better ergonomic systems that do not get too hot, can operate more quietly, and are more reliable. Unlike diode-based thermal management algorithms that require some temperature guard band (or activating the self throttle mechanism as a safety net), the digital thermometer is tested and calibrated against specifications. Full functionality and reliability of the processor are guaranteed, as long as the reported temperature is equal to or below the maximum specified temperature. Any inaccuracy or offset is programmed into the device and already accounted for.

The thermal measurement function provides interfaces to power-management software such as the industry-standard ACPI. Each core can be defined as an independent thermal zone, or a single thermal zone can be defined for the entire chip. The maximum temperature for each thermal zone is reported separately via dedicated registers that can be polled by the software.

In addition to the polling capability, the digital thermometer implements event-based reporting. Control software programs temperature thresholds that require actions. Such actions can be fan activation or passive control policy such as dynamic voltage and frequency scaling. Upon temperature crossing of the threshold, an APIC-defined interrupt is generated and it initiates the requested action.
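The event-based reporting scheme can be modeled in software as a set of programmed thresholds, each with an action that runs when the temperature crosses it. The Python sketch below is a behavioral illustration only, not the hardware interface: the DigitalThermometer class, its methods, the threshold values, and the choice to react only to upward crossings are all assumptions made for this example, and the callback stands in for the APIC-defined interrupt.

```python
class DigitalThermometer:
    """Toy model of threshold-based (event-driven) thermal reporting."""

    def __init__(self):
        self._thresholds = []   # list of (threshold_c, action) pairs
        self._last_c = None     # previous reading, None before the first one

    def add_threshold(self, threshold_c, action):
        """Program a threshold; action(temp_c) models the interrupt handler."""
        self._thresholds.append((threshold_c, action))

    def update(self, temp_c):
        """Feed a new reading and run actions for thresholds crossed upward.

        Returns the list of thresholds that fired for this reading.
        """
        fired = []
        previous = self._last_c
        self._last_c = temp_c
        if previous is None:
            return fired        # first reading only establishes a baseline
        for threshold_c, action in self._thresholds:
            if previous < threshold_c <= temp_c:
                action(temp_c)
                fired.append(threshold_c)
        return fired


log = []
sensor = DigitalThermometer()
sensor.add_threshold(70, lambda t: log.append(("fan_on", t)))      # active cooling
sensor.add_threshold(90, lambda t: log.append(("scale_down", t)))  # passive policy

for reading in (60, 75, 80, 95):
    sensor.update(reading)
print(log)
```

Compared with polling the temperature registers, this style lets the control software stay idle until a programmed threshold is actually crossed.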
Intel Core Duo technology implemented a dual-core power monitor capability. Power monitor functionality is provided in order to prevent thermal exceptions, and it can throttle the CPU once the temperature exceeds specifications. The overview of the power monitoring logic is described in Figure 9.

Figure 9: Thermal control overview
[Block diagram labels: External controls; Policy definition; Power and thermal management; Control; Core #1 and Core #2, each with Temperature and P/C state request]

The power monitor continuously tracks the die temperature. If the temperature reaches the maximum allowed value, a throttle mechanism is initiated. A multi-level tracking algorithm is implemented. Throttling starts with the more efficient dynamic voltage scaling policy and, if that is not sufficient, the power monitor algorithm continues by lowering the frequency. If an extreme cooling malfunction occurs, an Out of Spec notification will be initiated, requesting controlled shutdown. Lastly, the CPU can initiate a thermal shutdown and turn off the system.

Power and thermal management activities in notebook computers are usually performed by the OS and platform control functions. These thermal management features are designed to best serve user preferences under notebook constraint conditions. The thermal monitor function is not expected to be activated under these normal operating conditions. The thermal monitor mechanism ensures that the CPU will never exceed the CPU-specified parameters and guarantees functionality and reliability at any time.

The use of high accuracy temperature reading together with thermal monitoring protection enables high performance in thermally limited form factors, while allowing improved ergonomics and high reliability.

[...] deliver high current at quick response times. Intel Core Duo processors implemented a feedback mechanism to the voltage regulator (VR). The CPU tracks its activity at any time. If utilization goes down, the CPU communicates a signal to the VR, allowing it to switch to a lower power consumption. A lower power state can be either a reduced number of phases or asynchronous operation. The communication is done using the voltage ID lines and PSI signal as described in Figure 10.

Figure 10: Voltage regulator interface
[Diagram labels: PSI-2 / VID; VR]

The CPU has internal knowledge of the activity demand and it communicates a request to go to higher power early enough for the VR to get ready for the increased demand.

Another power optimization is load line control. At low CPU activity, the voltage drop on the load line is smaller, resulting in higher voltage and power at the CPU. At low workloads, the CPU therefore reduces its voltage request; then, early enough before power consumption increases, it sends a voltage increase request to the VR.

Using utilization knowledge, available in the CPU, Intel Core Duo technology made it possible to reduce platform power, increase battery life, and improve form factor ergonomics.

INTEL® CORE™ SOLO PROCESSOR

In order to fit into very limited thermal constraints and power consumption, the Intel Core Duo processor has a derivative that contains a single core only. This can be achieved either by disabling one of the cores at the OS level or as a BIOS option, or at the architecture level, where one core is disconnected from the power grid.

The first option is a user or OS decision. If you run a
PLATFORM POWER MANAGEMENT                                          single-core OS on an Intel Core Duo system, it will keep
Intel Core Duo processor technology closely interacts              the second core idle, at CC4 sleep state. Please note that
with other components on the platform. One such                    due to the way the BIOS is set, each time an interrupt is
component is the Voltage Regulator (VR). VR power                  received or a broadcast IPI is sent, this core may need to
losses at low CPU utilization may get as high as the               wake up and go immediately back to a sleep state,
CPU power. The losses of the VR are due to the                     consuming small amounts of dynamic power.
need to
The user can disable the second core via a BIOS option
as well. In this case, the system does not recognize the
other core and so it is kept in CC4 state all the time,
consuming no dynamic power at all.
The disadvantage of the two methods described above is
that the core still consumes static power. In order to
avoid this and reduce the power consumption of the core
even further, Intel introduces the single-core version of
Intel Core Duo technology, called the Intel® Core™ Solo
processor, which either disconnects the inactive core from
the power grid or saves the area by not fabricating this
part at all.
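The trade-off among the three single-core options can be
illustrated with a small back-of-the-envelope model. The
sketch below is only an illustration: the function name, the
option labels, and all power, energy, and wake-rate values are
hypothetical placeholders, not measured Intel Core Duo data.
It encodes only the qualitative claims made above: an
OS-disabled core pays static power plus a small dynamic cost
on each wakeup, a BIOS-disabled core pays static power only,
and the disconnected (or unfabricated) core of the Intel Core
Solo processor pays neither.

```python
# Illustrative model of the power drawn by the unused second core.
# All numbers are hypothetical placeholders, not Intel measurements.

STATIC_POWER_W = 0.5      # assumed leakage of an idle core still on the power grid
WAKE_ENERGY_J = 0.0001    # assumed energy of one wake-and-resleep event


def idle_core_power(option, wakeups_per_second=0.0):
    """Return the modeled power (in watts) drawn by the unused core."""
    if option == "os_disabled":
        # Core idles in CC4 but wakes briefly on interrupts and broadcast IPIs.
        return STATIC_POWER_W + WAKE_ENERGY_J * wakeups_per_second
    if option == "bios_disabled":
        # Core stays in CC4 all the time: static power only.
        return STATIC_POWER_W
    if option == "core_solo":
        # Core is off the power grid or not fabricated: no power at all.
        return 0.0
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    for option in ("os_disabled", "bios_disabled", "core_solo"):
        watts = idle_core_power(option, wakeups_per_second=1000)
        print(f"{option}: {watts:.3f} W")
```

Whatever values are substituted, the ordering is the point:
the OS-level option never draws less than the BIOS option,
and only the Intel Core Solo approach removes the static
component.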
CONCLUSION
The Intel Core Duo processor is the first Intel mobile
processor that implements dual cores on die. The processor
addresses new challenges in providing the best
performance under power and thermal constraints.
This paper described the main architectural features of
the new processor focusing on the different
performance, power, and thermal control features of the
processor and of the system.
By carefully coordinating the performance, power, and
thermal features implemented in the Intel Core Duo
system, we achieved a significant improvement in
performance at the same power consumption, with
improved thermal control mechanisms.