History of Superscalar Processors
1 Superscalar processor
Emergence of Superscalar Processors
Superscalar processors emerged in three consecutive phases: first the idea was conceived, then a few architecture
proposals and prototype machines appeared, and finally commercial products reached the market.
The concept of superscalar issue was first developed as early as 1970 (Tjaden and Flynn, 1970). It was later
reformulated more precisely in the 1980s (Torng, 1982; Acosta et al., 1986).
Superscalar processor proposals and prototype machines followed as shown in the figure.
As far as prototype machines are concerned, IBM was first, with two significant superscalar developments called
the Cheetah and America projects. The Cheetah project (1982-83) and the subsequent America project (from 1985 on)
were the testbeds for IBM to study superscalar execution.
The four-way Cheetah machine served as a base for the America processor, which in turn spawned the RS/6000 (1990),
later renamed the Power1. The Power1 is almost identical to the America machine (Grohoski, 1990).
The term superscalar processor is assumed to have first appeared in connection with these developments, in an internal
IBM Technical Report (Agerwala, T. and Cocke, J., High-Performance Reduced Instruction Set Processors, 1987).
A second early player in the area of superscalar developments was DEC with its Multititan project, carried out from
1985 to 1987. Although the Multititan project continued the Titan project (1984), whose goal was to construct a
very high-speed RISC processor, it did not contribute much to the development of the α line of processors.
The Intel 960CA embedded RISC processor, introduced in 1989, was the first commercial superscalar machine. To
boost performance, all major manufacturers were subsequently forced to introduce superscalar issue in their
commercial processor lines.
Superscalar RISC processors emerged in two different ways. Some resulted from converting an existing (scalar)
RISC line into a superscalar one. Examples are the Intel 960, MC 88000, HP PA (Precision Architecture),
Sun Sparc, MIPS R, and AMD Am29000 RISC lines. The other significant approach was to conceive a new
architecture and implement it from the very beginning as a superscalar line. This happened when IBM
announced its RS/6000 processor in 1990, later renamed the Power1.
7.7 Preserving sequential consistency of instruction execution
We use the term 'complete' as explained in the previous section. We use the term processor consistency to
indicate the consistency of instruction completion with sequential instruction execution.
As far as processor consistency is concerned, superscalar processors preserve either a weak or a strong
consistency. A weak processor consistency means that instructions may complete out-of-order, provided that no
data dependencies are sacrificed. In this case instructions may be reordered by the processor only if no
dependencies are violated. In order to achieve this, data dependencies have to be detected and resolved.
[Figure: Sequential consistency of instruction execution, split into the sequence of instruction completions
(processor consistency) and the sequence of memory accesses (memory consistency).
Weak processor consistency: instructions may complete out-of-order, provided that no dependencies are adversely
affected; detection and resolution of dependencies ensures weak processor consistency.
Strong processor consistency: instructions complete strictly in program order; the ROB ensures strong processor
consistency.
Weak memory consistency: memory accesses due to load and store instructions may be out-of-order, provided that no
dependencies are adversely affected; load/store reordering is allowed; detection and resolution of memory data
dependencies ensures weak memory consistency.
Strong memory consistency: memory is accessed due to load and store instructions strictly in program order; no
load/store reordering is allowed; the ROB may be used to ensure strong memory consistency.
Processors named in the figure: Power1 (1990), Power2 (1993), ES/9000 (1992p), MC88110 (1993), PowerPC 601 (1993),
PentiumPro (1995), UltraSparc (1995), PM1 (1995), α line up to the 21164, R8000 (1994), Am29000 sup (1995),
K5 (1995), PA 8000 (1996), R10000 (1996).]
[...] used in superscalar processors.
Most recent processors guarantee strong processor consistency, since it is easy to implement.
The other aspect of superscalar instruction execution is whether memory accesses are performed in the same order
as in a sequential processor. This aspect is termed memory consistency. Here again, we can distinguish between
weak and strong memory consistency.
We say that memory consistency is weak if memory accesses may be out of order compared with a strict
sequential program execution. However, data dependencies must not be violated. In other words, weak consistency
allows load/store reordering provided that dependencies, particularly memory data dependencies, are
detected and resolved.
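The rule above can be illustrated with a minimal sketch. It assumes a simplified model in which each memory
access is a (kind, address) pair and two accesses are dependent when they touch the same address and at least one
of them is a store. The names Access, conflicts and reordering_is_legal are hypothetical and exist only for this
illustration.

```python
from typing import List, Tuple

# Assumed toy representation: (kind, address), kind is "load" or "store".
Access = Tuple[str, int]


def conflicts(a: Access, b: Access) -> bool:
    """Two accesses are dependent if they use the same address and one is a store."""
    return a[1] == b[1] and "store" in (a[0], b[0])


def reordering_is_legal(program: List[Access], executed: List[int]) -> bool:
    """Check an execution order against program order.

    `executed` lists indices into `program` in the order the accesses were
    actually performed. The order is legal under weak memory consistency
    if every pair of dependent accesses keeps its program order.
    """
    position = {index: pos for pos, index in enumerate(executed)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            if conflicts(program[i], program[j]) and position[i] > position[j]:
                return False
    return True


program = [("store", 0x100), ("load", 0x200), ("load", 0x100)]
print(reordering_is_legal(program, [1, 0, 2]))  # True: independent load bypasses the store
print(reordering_is_legal(program, [2, 0, 1]))  # False: load of 0x100 overtakes the store to 0x100
```

Under strong memory consistency only the order [0, 1, 2] would be accepted, whatever the addresses are.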
As we shall discuss in the following section, weak memory consistency is a means to increase processor
performance, so most up-to-date superscalar processors rely on it.
The other alternative is strong memory consistency, in which memory accesses occur strictly in program order.
Strong memory consistency forbids any load/store reordering.
So far, we have discussed processor and memory consistency separately. The sequential consistency model of a
processor integrates both aspects. It specifies the kind of consistency maintained by the processor and by the
memory. Thus, by taking into account both aspects of processor and memory consistency, we arrive at four
possible sequential consistency models (Figure 7.58). These are the WW, WS, SW and SS consistency models, where
the first character refers to the type of processor consistency (Weak/Strong) and the second to the type of
memory consistency (Weak/Strong).
As indicated earlier, strong processor consistency and weak memory consistency have advantages. Consequently,
recent processors tend to maintain the SW consistency model.
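The naming scheme of the four models can be written down as a short sketch. The enum Consistency and the helpers
model_name and describe are hypothetical names chosen for this illustration; only the two-letter codes come from
the text.

```python
from enum import Enum


class Consistency(Enum):
    WEAK = "W"
    STRONG = "S"


def model_name(processor: Consistency, memory: Consistency) -> str:
    """First letter: processor consistency. Second letter: memory consistency."""
    return processor.value + memory.value


def describe(model: str) -> dict:
    """Say what a two-letter model permits."""
    processor, memory = model
    return {
        "out_of_order_completion": processor == "W",
        "load_store_reordering": memory == "W",
    }


ALL_MODELS = [model_name(p, m) for p in Consistency for m in Consistency]
print(ALL_MODELS)       # ['WW', 'WS', 'SW', 'SS']
print(describe("SW"))   # in-order completion, load/store reordering allowed
```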
In the following section we will discuss some aspects of the weak memory consistency model.
[Figure: Sequential consistency models (with regard to the order in which instructions are completed and memory
is accessed).
WW: weak processor consistency, weak memory consistency.
WS: weak processor consistency, strong memory consistency.
SW: strong processor consistency, weak memory consistency.
SS: strong processor consistency, strong memory consistency.
Processors named in the figure: ES/9000 (1992p), PowerPC 601 (1993), PowerPC 603 (1993), PowerPC 604 (1994),
MC88110 (1993), PowerPC 620 (1995), PM1 (1995), UltraSparc (1995), PA 8000 (1996), R10000 (1996).]
Load and store instructions involve actions affecting both the processor and the memory. While executing, both
loads and stores first need their addresses to be computed by an ALU or an address unit. Then loads can access
the data cache to fetch the requested memory data, which is then made available in a register. This is when a
load instruction is said to be finished. The load is then completed, usually by writing the fetched data into
the specified architectural register.
Stores have a different execution pattern. After receiving their generated addresses, stores have to wait for
their operands to be available. Unlike other instructions, a store is considered to be finished when its operands
become available. Now, let us assume an ROB is in use. When the ROB indicates that the store comes next in
sequential execution, the memory address and the data to be stored are forwarded to the cache and a cache store
operation is initiated.
A processor that supports weak memory consistency allows the reordering of memory accesses. This is advantageous
for at least three reasons:
[Figure: Reordering due to load/store bypassing; reordering in case of cache misses.]
comparisons there. To reduce the complexity of the required circuitry, the address check is often restricted to
a part of the full effective address. For instance, the PowerPC 604 and the PowerPC 620 store and use only the
low-order 12 bits of the effective address for address checks.
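A partial address check of this kind can be sketched as follows. The sketch assumes that the check decides whether
a load may bypass pending stores, which is an assumption about the surrounding context. The names CHECK_BITS,
may_conflict and load_may_bypass are hypothetical. Only the figure of 12 low-order bits is taken from the text.

```python
CHECK_BITS = 12                  # low-order bits compared, as quoted for the PowerPC 604/620
MASK = (1 << CHECK_BITS) - 1     # 0xFFF


def may_conflict(load_addr: int, store_addr: int) -> bool:
    """Compare only the low-order bits of the two effective addresses."""
    return (load_addr & MASK) == (store_addr & MASK)


def load_may_bypass(load_addr: int, pending_store_addrs: list) -> bool:
    """A load may bypass only if no pending store matches in the compared bits."""
    return not any(may_conflict(load_addr, s) for s in pending_store_addrs)


print(load_may_bypass(0x1234, [0x5678]))  # True: low-order bits differ
print(load_may_bypass(0x1234, [0x1234]))  # False: same address
print(load_may_bypass(0x1234, [0x9234]))  # False: different address, same low-order 12 bits
```

The last call shows the price of the cheaper circuitry: addresses that differ only in their high-order bits are
treated as conflicting, so the check is conservative but never misses a true dependency.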
Cache misses are another source of performance impediment which can be reduced by load/store reordering.
Usually, a cache miss causes a blockage of all subsequent operations of the same type. In other words, a load
miss blocks subsequent loads and a store miss blocks subsequent stores. The resulting performance degradation
can be reduced if loads are allowed to bypass pending loads, as has been implemented in the UltraSparc, the
PowerPC 620 and the R10000. For instance, the PowerPC 620 can service loads in spite of up to three pending
loads; the pending loads are stored in a load miss register (one entry) and in the three load/store
reservation stations (three entries).
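The benefit of letting loads bypass pending loads can be illustrated with a toy timing model. Everything in it is
an assumption made for illustration: one load is presented per cycle, a hit takes one cycle, a miss takes
miss_latency cycles, and a load may start only while at most max_pending misses are outstanding. The function
name total_cycles is hypothetical, and the model does not describe any of the processors named above.

```python
def total_cycles(hits, miss_latency, max_pending):
    """Return the cycle in which the last load's data becomes available.

    hits:         list of booleans in program order, True for a cache hit
    miss_latency: cycles needed to service a miss (assumed constant)
    max_pending:  outstanding misses a new load may bypass (0 = blocking)
    """
    t = 0                # current cycle
    outstanding = []     # completion cycles of misses still in flight
    last_done = 0
    for hit in hits:
        # Drop misses that have already been serviced.
        outstanding = [c for c in outstanding if c > t]
        # Stall while too many misses are pending.
        while len(outstanding) > max_pending:
            t = min(outstanding)
            outstanding = [c for c in outstanding if c > t]
        done = t + 1 if hit else t + miss_latency
        if not hit:
            outstanding.append(done)
        last_done = max(last_done, done)
        t += 1
    return last_done


loads = [False, True, True, True]        # one miss followed by three hits
print(total_cycles(loads, 10, 0))        # 13: the miss blocks the following loads
print(total_cycles(loads, 10, 3))        # 10: the hits are serviced under the miss
```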
[Figure: Reorder buffer (ROB). Issued instructions are allocated subsequent entries in order; the buffer holds
active entries and free entries. Instruction states: i: Issued, x: In execution, f: Finished. An instruction may
retire if it has finished and all prior instructions have already retired. Processors named in the figure
include the PowerPC 603 (1993), PowerPC 604, K5 (1995), PentiumPro (1995) and R10000 (1996).]
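The retirement rule of the ROB (an instruction may retire if it has finished and all prior instructions have
already retired) can be sketched as a small queue. The class ReorderBuffer, its methods and the retire_rate
parameter are hypothetical names for this illustration. The state letters i, x and f follow the figure.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()   # oldest entry first; each entry is [name, state]

    def issue(self, name: str) -> bool:
        """Allocate the next free entry in program order; fail if the ROB is full."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append([name, "i"])
        return True

    def set_state(self, name: str, state: str) -> None:
        """Mark an instruction as in execution ('x') or finished ('f')."""
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = state
                return
        raise KeyError(name)

    def retire(self, retire_rate: int) -> list:
        """Retire finished instructions from the head, at most retire_rate per call."""
        retired = []
        while (self.entries and len(retired) < retire_rate
               and self.entries[0][1] == "f"):
            retired.append(self.entries.popleft()[0])
        return retired


rob = ReorderBuffer(size=4)
for name in ("a", "b", "c"):
    rob.issue(name)
rob.set_state("b", "f")
rob.set_state("c", "f")
print(rob.retire(2))   # []: 'a' has not finished, so nothing behind it may retire
rob.set_state("a", "f")
print(rob.retire(2))   # ['a', 'b']
print(rob.retire(2))   # ['c']
```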
Table 7.5 ROB implementation details.
Columns: Designation, ROB size, Issue rate, Retire rate, Intermediate results stored
[Figure: Sequential consistency of exception processing: weak consistency versus strong consistency. Processors
named in the figure: ES/9000 (1992p), PowerPC line, PA 8000 (1996), R10000 (1996).]
instruction causes an interrupt, for example because of an overflow, at the time when the interrupt request of
the 'div' is accepted the processor state is already 'corrupted' by the result of a later instruction. Thus,
without any additional measures, it becomes impossible to reconstruct the correct state of the processor at the
time when it accepted the interrupt request caused by the 'div' instruction. In a number of ILP-processors,
including early superscalar processors such as the Power2 or R8000, interrupts can be imprecise. In both these
processors, the FP interrupts are imprecise. A further example is the α architecture, where all arithmetic
exceptions are imprecise.
Most advanced superscalar processors maintain strong sequential consistency with respect to exception
processing, so that after interrupts the state of the processor remains consistent with the state a sequential
processor would have. An obvious way to achieve precise interrupts is to maintain in-order instruction
completion, for instance by using an ROB, and to accept interrupts caused by an instruction only when the
related instruction is completed. A few earlier and most recent superscalar processors [...] indicated in
Figure 7.64.
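The difference between imprecise and precise interrupts can be sketched with a toy model. It assumes that each
instruction is a (destination register, result, raises exception) triple. The function names run_imprecise and
run_precise, the register names and the values are hypothetical. The first function writes results in the order
in which instructions finish. The second writes them in program order, as an ROB would.

```python
def run_imprecise(program, finish_order, registers):
    """Write results in finish order; stop when the faulting instruction finishes."""
    state = dict(registers)
    for index in finish_order:
        dest, value, raises = program[index]
        if raises:
            return state, index
        state[dest] = value
    return state, None


def run_precise(program, registers):
    """Write results in program order; stop at the first faulting instruction."""
    state = dict(registers)
    for index, (dest, value, raises) in enumerate(program):
        if raises:
            return state, index
        state[dest] = value
    return state, None


program = [
    ("r1", 5, False),     # earlier instruction
    ("r2", None, True),   # 'div' that overflows
    ("r3", 7, False),     # later instruction that finishes before the 'div'
]
registers = {"r1": 0, "r2": 0, "r3": 0}

print(run_imprecise(program, [0, 2, 1], registers))
# ({'r1': 5, 'r2': 0, 'r3': 7}, 1): r3 already holds the result of a later instruction
print(run_precise(program, registers))
# ({'r1': 5, 'r2': 0, 'r3': 0}, 1): state matches sequential execution up to the 'div'
```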
VLIW: A VLIW architecture receives a single multi-operation instruction.
Superscalar: A superscalar processor accepts a traditional sequential flow of instructions, but it can issue
multiple instructions.

VLIW: The VLIW approach needs very long instruction words to specify what each execution unit should do.
Superscalar: The superscalar processor receives a sequential instruction stream, which is decoded, and the issue
unit then issues multiple instructions to multiple execution units. In superscalar processors, the instruction
unit can issue 2 to 6 instructions per cycle.

VLIW: VLIW is less complex.
Superscalar: Superscalar processors with the same degree of parallelism are more complex than a VLIW
architecture.