History of Superscalar Processors

Uploaded by

developerads134

Unit-2.

1 Superscalar processor
Emergence of Superscalar Processors

Superscalar processors emerged in three consecutive phases: first the idea was conceived, then a few architecture
proposals and prototype machines appeared, and finally commercial products reached the market.

The concept of superscalar issue was first developed as early as 1970 (Tjaden and Flynn, 1970). It was later
reformulated more precisely in the 1980s (Torng, 1982; Acosta et al., 1986).

Superscalar processor proposals and prototype machines followed as shown in the figure.

As far as prototype machines are concerned, IBM was first, with two significant superscalar developments called
the Cheetah and America projects. The Cheetah project (1982-83) and the subsequent America project (from 1985 on)
were the testbeds for IBM to study superscalar execution.

The four-way Cheetah machine served as a base for the America processor, which spawned the RS/6000 (1990),
later renamed the Power1. The Power1 is almost identical to the America machine (Grohoski, 1990).

The term superscalar processor is assumed to have first appeared in connection with these developments in an internal
IBM Technical Report (Agerwala, T. and Cocke, J., High-Performance Reduced Instruction Set Processors, 1987).

A second early player in the area of superscalar developments was DEC with its Multititan project, carried out from
1985 to 1987. While the Multititan project was the continuation of project Titan (1984), whose goal was to construct a
very high-speed RISC processor, this project did not contribute much to the development of the α line of processors.

The Intel 960CA embedded RISC processor, introduced in 1989, was the first commercial superscalar machine.
Subsequently, in order to boost performance, all major manufacturers were forced to introduce superscalar issue in
their commercial processor lines.

Superscalar RISC processors emerged according to two different approaches. Some appeared as the result of
converting an existing (scalar) RISC line into a superscalar one. Examples of this are the Intel 960, MC 88000, HP PA
(Precision Architecture), Sun Sparc, MIPS R and AMD Am29000 RISC lines. The other significant approach was
to conceive a new architecture and to implement it from the very start as a superscalar line. This happened when IBM
announced its RS/6000 processor in 1990, later renamed the Power1.

7.7 Preserving sequential consistency of instruction execution

7.7.1 Sequential consistency models

As discussed above, in superscalar processors, or in a more general sense in processors with multiple EUs operating in parallel, instructions generally finish out of order. Nevertheless, overall instruction execution should mimic sequential execution, that is, it should preserve sequential consistency. Although, as stated before, the problem of preserving sequential consistency relates to a broader class of processors than superscalar ones, for the sake of consistency with the present chapter, in the following we confine our discussion to superscalar processors.

Sequential consistency of instruction execution relates to two aspects: first, to the order in which instructions are completed; and second, to the order in which memory is accessed due to load and store instructions or memory references of other instructions, as indicated in Figure 7.57.

Concerning the first aspect, we are interested in whether instructions in a superscalar processor complete in the same order as in a sequential processor. Here, we use the term 'complete' as explained in the previous section. We use the term processor consistency to indicate the consistency of instruction completion with sequential instruction execution.

As far as processor consistency is concerned, superscalar processors preserve either a weak or a strong consistency. Weak processor consistency means that instructions may complete out of order, provided that no data dependencies are sacrificed. In this case instructions may be reordered by the processor only if no dependencies are violated. In order to achieve this, data dependencies have to be detected and appropriately resolved during superscalar execution. Early superscalar processors usually provided weak processor consistency, as indicated in Figure 7.57.

Figure 7.57 Interpretation of the concept of sequential consistency of instruction execution.

  Processor consistency (consistency of the sequence of instruction completions)
    Weak processor consistency: instructions may complete out of order, provided that no dependencies are adversely affected. Detection and resolution of dependencies ensures weak processor consistency.
      Power1 (1990), Power2 (1993), MC88110 (1993), PowerPC 601 (1993), α line up to α 21164, R8000 (1994)
    Strong processor consistency: instructions complete strictly in program order. The ROB ensures strong processor consistency.
      ES/9000 (1992p), PowerPC 602-620, PentiumPro (1995), UltraSparc (1995), PM1 (1995), Am29000 sup (1995), K5 (1995), PA 8000 (1996), R10000 (1996)
  Memory consistency (consistency of the sequence of memory accesses)
    Weak memory consistency: memory accesses due to load and store instructions may be out of order, provided that no dependencies are violated. Load/store reordering is allowed. Detection and resolution of memory data dependencies ensures weak memory consistency.
      MC88110 (1993), PowerPC 602-620, UltraSparc (1995), PM1 (1995), PA 8000 (1996), R10000 (1996)
    Strong memory consistency: memory is accessed due to load and store instructions strictly in program order. No load/store reordering is allowed. The ROB may be used to ensure strong memory consistency.
      ES/9000 (1992p), PowerPC 601 (1993)

In the case of strong processor consistency, instructions are forced to complete in strict program order. Usually, this is achieved by employing a reorder buffer (ROB). The ROB is a very practical tool as it can also be used to implement renaming and shelving, as emphasized earlier, and ROBs are now widely used in superscalar processors. Most recent processors guarantee strong processor consistency, since it is easy to implement.

The other aspect of superscalar instruction execution is whether memory accesses are performed in the same order as in a sequential processor. This aspect is termed memory consistency. Here again, we can distinguish between weak and strong memory consistency.

We say that memory consistency is weak if memory accesses may be out of order compared with a strict sequential program execution. However, data dependencies must not be violated. In other words, weak consistency allows load/store reordering provided that dependencies, particularly memory data dependencies, are detected and resolved.

As we shall discuss in the following section, weak memory consistency is a means to increase processor performance, so most up-to-date superscalar processors rely on it.

The other alternative is strong memory consistency, in which memory accesses occur strictly in program order. Strong memory consistency forbids any load/store reordering.

So far, we have discussed processor and memory consistency separately. The sequential consistency model of a processor integrates both aspects. It specifies the kind of consistency maintained by the processor and by the memory. Thus, by taking into account both aspects of processor and memory consistency, we arrive at four possible sequential consistency models (Figure 7.58). These are the WW, WS, SW and SS consistency models, where the first character refers to the type of processor consistency (Weak/Strong) and the second to the type of memory consistency (Weak/Strong).

As indicated earlier, strong processor consistency and weak memory consistency have advantages. Consequently, recent processors tend to maintain the SW consistency model.

In the following section we will discuss some aspects of the weak memory consistency model.

Figure 7.58 Sequential consistency models of instruction execution (with regard to the order in which instructions are completed and memory is accessed).

  Model  Processor consistency  Memory consistency  Examples
  WW     Weak                   Weak                MC88110 (1993)
  WS     Weak                   Strong              PowerPC 601 (1993)
  SW     Strong                 Weak                PowerPC 603 (1993), PowerPC 604 (1994), PowerPC 620 (1995), PM1 (1995), UltraSparc (1995), PA 8000 (1996), R10000 (1996)
  SS     Strong                 Strong              ES/9000 (1992p)
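
The two-letter naming scheme can be illustrated with a small sketch. The function names and the trace format below are our own assumptions for illustration, not part of the text. Note also that a consistency model is a guarantee over all executions, whereas this sketch only inspects one observed trace, so it shows which behaviour a single run exhibits rather than what a processor guarantees.

```python
def in_program_order(indices):
    """True if the observed sequence of program positions is strictly increasing."""
    return all(a < b for a, b in zip(indices, indices[1:]))


def classify_trace(completion_order, memory_access_order):
    """Return 'WW', 'WS', 'SW' or 'SS' for one observed execution trace.

    Both arguments list instructions by their position in the program
    (0 = first instruction). The first letter describes processor
    consistency, the second memory consistency, as in the text.
    """
    processor = "S" if in_program_order(completion_order) else "W"
    memory = "S" if in_program_order(memory_access_order) else "W"
    return processor + memory


# Instructions complete in program order, but the load at position 3
# accesses memory before the store at position 1.
print(classify_trace([0, 1, 2, 3], [3, 1]))  # SW
```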

7.7.2 Load/store reordering

Load and store instructions involve actions affecting both the processor and the memory. While executing, both loads and stores first have to wait for their addresses to be computed by an ALU or address unit. Then loads can access the data cache to fetch the requested memory data, which is then made available in a register. This is when a load instruction is said to be finished. The load is then completed, usually by writing the fetched data into the specified architectural register.

Stores have a different execution pattern. After receiving their generated addresses, stores have to wait for their operands to be available. Unlike other instructions, a store is considered to be finished when its operands become available. Now, let us assume an ROB is in use. When the ROB indicates that the store comes next in sequential execution, the memory address and the data to be stored are forwarded to the cache and a cache store operation is initiated.

A processor that supports weak memory consistency allows the reordering of memory accesses. This is advantageous for at least three reasons:

• it permits load/store bypassing,

• it makes speculative loads or stores feasible, and

• it allows cache misses to be hidden.

Below we discuss these points on the basis of Figure 7.59.

Load/store bypassing means that either loads can bypass pending stores or vice versa, provided that no memory data dependencies are violated. As Figure 7.59 indicates, a number of recent processors allow loads to bypass stores (either non-speculatively or speculatively, as will be explained later) but not vice versa.

Permitting loads to bypass stores has the advantage of allowing the runtime overlapping of tight loops. The overlapping is achieved by allowing loads at the beginning of an iteration to access memory without having to wait until stores at the end of the previous iteration are completed. Notice that this runtime overlapping of cycles is comparable with software pipelining (see Section 9.3.3).

Evidently, in order to avoid fetching a false data value, a load can bypass pending stores only if none of the preceding stores has the same target address as the load. In order to check this requirement, the address of the load has to be compared against the addresses of all pending stores. It may be that the addresses of pending stores are not yet available, in which case no decision can yet be made as to whether the load is dependent on the pending stores. There are two possible ways to handle this situation. The simpler scheme is to delay permission for the load to bypass until all pending store addresses are computed and a decision can be reached. We term this the non-speculative execution of bypasses.
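
The non-speculative check can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the use of `None` for a store address that has not yet been computed, and the three return values are hypothetical and do not describe any particular processor.

```python
def may_bypass(load_address, pending_store_addresses):
    """Decide whether a load may bypass the pending (older) stores.

    pending_store_addresses holds one entry per older store that has not
    yet been performed; None means the address is not yet computed.
    Returns 'bypass', 'wait' or 'conflict'.
    """
    if load_address in pending_store_addresses:
        return "conflict"  # an older store writes the same address
    if None in pending_store_addresses:
        return "wait"      # non-speculative scheme: no decision possible yet
    return "bypass"        # every address is known and none matches


print(may_bypass(0x100, [0x200, 0x300]))  # bypass
print(may_bypass(0x100, [0x200, None]))   # wait
print(may_bypass(0x100, [0x100, None]))   # conflict
```

A load that receives 'wait' is simply retried once further store addresses become available, which is exactly the delay that speculative schemes try to avoid.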

The more advanced handling of this situation is to let loads bypass stores speculatively, that is, to allow speculative loads. Speculative loads avoid delaying memory accesses until all required addresses have been computed and clashing addresses can be ruled out. Instead, a memory access will be started in spite of unresolved address checks. For the vast majority of bypassed loads the addresses of subsequently computed preceding stores are not the same as the load address and this speculative behaviour is justified. The correctness of speculative loads must be checked in any case and, if necessary, the speculative load must be undone. This is performed as follows. When store addresses have been computed, they are compared against the addresses of all younger loads. If a hit is found, the corresponding speculative load has fetched an incorrect data value. At this point the load instruction and all subsequent instructions are cancelled and the load is re-executed.

We note that speculative loads are quite similar to speculative branches. Speculative branches (see Section 8.4.4) react in exactly the same way to unresolved condition checks as speculative loads do to unresolved dependency checks. A number of recent processors allow speculative execution of loads, as indicated in Figure 7.59.

Figure 7.59 Reordering of memory accesses of load/store instructions.

  Reordering due to load/store bypassing
    Loads bypass stores
      Non-speculative execution of bypasses
      Speculative execution of bypasses (speculative loads)
    Stores bypass loads
  Reordering in case of cache misses
    Loads bypass loads
    Stores bypass stores
  Processors listed in the figure: IBM 360/91 (1967), MC88110 (1993), PowerPC 602 (1995), PowerPC 603 (1993), PowerPC 604 (1995), PowerPC 620 (1996), PM1 (1995), UltraSparc (1995), PA 8000 (1996), R10000 (1996)
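
The check-and-undo step for speculative loads can be sketched as follows. This is only an illustration under stated assumptions: the `SpeculativeLoad` record, the sequence numbers and the `resolve_store` function are hypothetical names, and the sketch tracks loads only, whereas a real processor would also cancel every other instruction after the offending load.

```python
from dataclasses import dataclass


@dataclass
class SpeculativeLoad:
    seq: int                # position of the load in program order
    address: int            # address the load has already read from
    squashed: bool = False  # True once the load has been cancelled


def resolve_store(store_seq, store_address, speculative_loads):
    """Check a newly computed store address against younger speculative loads.

    Returns the sequence number from which execution must restart,
    or None if no speculative load was violated.
    """
    hits = [ld for ld in speculative_loads
            if ld.seq > store_seq
            and ld.address == store_address
            and not ld.squashed]
    if not hits:
        return None
    restart = min(ld.seq for ld in hits)
    for ld in speculative_loads:
        if ld.seq >= restart:
            ld.squashed = True  # cancel the load and everything after it
    return restart


loads = [SpeculativeLoad(5, 0x40), SpeculativeLoad(7, 0x80), SpeculativeLoad(9, 0x40)]
print(resolve_store(3, 0x80, loads))     # 7
print([ld.squashed for ld in loads])     # [False, True, True]
```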

The address checks are usually carried out by writing the computed target addresses of loads and stores into the ROB (or DRIS) and performing the address comparisons there. To reduce the complexity of the required circuitry, the address check is often restricted to a part of the full effective address. For instance, the PowerPC 604 and the PowerPC 620 store and use only the low-order 12 bits of the effective address for address checks.
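
A partial address check of this kind can be sketched as below. The constant and function names are our own; only the 12-bit width comes from the text. Two identical addresses always agree in their low-order bits, so the partial check reports every comparison of equal addresses as a match, but it may also report a match for two different addresses that merely share their low-order bits.

```python
LOW_ORDER_BITS = 12
MASK = (1 << LOW_ORDER_BITS) - 1  # 0xFFF


def may_conflict(load_address, store_address, mask=MASK):
    """Conservative check: compare only the low-order bits of two addresses."""
    return (load_address & mask) == (store_address & mask)


print(may_conflict(0x1234, 0x1234))  # True: identical addresses
print(may_conflict(0x1234, 0x5234))  # True: different addresses, same low 12 bits
print(may_conflict(0x1234, 0x1238))  # False: low 12 bits differ
```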

Cache misses are another source of performance impediment which can be reduced by load/store reordering. Usually, a cache miss causes a blockage of all subsequent operations of the same type. In other words, a load miss blocks subsequent loads and a store miss blocks subsequent stores. The resulting performance degradation can be reduced if loads are allowed to bypass pending loads, as has been implemented in the UltraSparc, the PowerPC 620 and the R10000. For instance, the PowerPC 620 can service loads in spite of up to three pending loads; the pending loads are stored in a load miss register (one entry) and in the three load/store reservation stations (three entries).

7.7.3 The reorder buffer (ROB)

The ROB was first described by Smith and Pleszkun in 1988. Originally, they conceived the ROB to solve the precise interrupt problem. Today, an ROB is understood as a tool which assures sequential consistency of execution in the case of multiple EUs operating in parallel.

Basically, the ROB is a circular buffer with head and tail pointers, as shown in Figure 7.60. The head pointer indicates the location of the next free entry. Instructions are written into the ROB in strict program order. As instructions are issued, a new entry is allocated to each in sequence. Each entry indicates the status of the corresponding instruction: whether the instruction is issued (i), in execution (x) or already finished (f). The tail pointer marks the instruction which will retire, that is, leave the ROB, next. An instruction is allowed to retire only if it has finished and all previous instructions are already retired. This mechanism ensures that instructions retire strictly in order. Sequential consistency is preserved in that only retiring instructions are permitted to complete, that is, to update the program state by writing their result into the referenced architectural register(s) or memory.
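
The mechanism can be sketched in a few lines. The class and method names below are hypothetical, and the sketch models only the status field and the two pointers (head = next free entry, tail = next instruction to retire, following the convention of the text), not results, renaming or shelving.

```python
class ReorderBuffer:
    """Circular buffer; head = next free entry, tail = next entry to retire."""

    def __init__(self, size):
        self.entries = [None] * size  # each entry: [name, state], state in 'i', 'x', 'f'
        self.head = 0
        self.tail = 0
        self.count = 0

    def issue(self, name):
        """Allocate the next entry in program order and return its index."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: issue must stall")
        index = self.head
        self.entries[index] = [name, "i"]
        self.head = (self.head + 1) % len(self.entries)
        self.count += 1
        return index

    def set_state(self, index, state):
        """Record that an instruction is in execution ('x') or finished ('f')."""
        self.entries[index][1] = state

    def retire(self, max_per_cycle):
        """Retire finished instructions from the tail, strictly in order."""
        retired = []
        while (self.count > 0 and len(retired) < max_per_cycle
               and self.entries[self.tail][1] == "f"):
            retired.append(self.entries[self.tail][0])
            self.entries[self.tail] = None
            self.tail = (self.tail + 1) % len(self.entries)
            self.count -= 1
        return retired


rob = ReorderBuffer(size=4)
div, add = rob.issue("div"), rob.issue("add")
rob.set_state(add, "f")              # the younger add finishes first
print(rob.retire(max_per_cycle=2))   # [] because the older div has not finished
rob.set_state(div, "f")
print(rob.retire(max_per_cycle=2))   # ['div', 'add']
```

The `max_per_cycle` argument plays the role of the retire rate discussed later in this section.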

Here, we note that an ROB can effectively support both speculative execution and interrupt handling.

As we know, in speculative execution the processor carries on executing instructions in spite of an unresolved condition such as an unresolved conditional branch or memory address check. Later, when the condition is resolved, it becomes clear whether the speculatively executed instructions can be affirmed; if not, they have to be cancelled and the correct instructions executed. An ROB easily supports speculative execution. Each ROB entry is extended to include a speculative status field, indicating whether the corresponding instruction has been executed speculatively. In addition, finished instructions may not retire as long as they are in the speculative state. The whole operation is then as follows. Speculatively executed instructions are marked as such and are not eligible for retirement. Later, when the related condition is resolved, speculative execution is either affirmed or not. For affirmed instructions, in-order completion is maintained as described before. If the speculative execution turns out to be incorrect, the corresponding instructions which are marked as being speculative have to be cancelled and instruction execution continued with the correct instructions.

Figure 7.60 Principle of the ROB.

  Head: first free entry. Subsequently issued instructions are allocated subsequent entries, in order.
  Tail: next instruction to be retired. An instruction may retire if it has finished and all prior instructions have already retired.
  The buffer is divided into free entries and active entries.
  Instruction states: i: issued, x: in execution, f: finished.

An ROB also assists interrupt handling in a natural way. Interrupts generated in connection with instruction execution can easily be made precise (that is, handled at the correct point in the execution), by accepting interrupt requests only when the related instruction becomes the next to retire, as discussed in the next section.

ROBs were introduced primarily in connection with shelving and register renaming. Figure 7.61 shows the introduction of ROBs compared with the introduction of renaming and shelving. It can easily be seen that, with a few exceptions, ROBs, renaming and shelving appeared at the same time in superscalar processors.

Next, we will discuss the design space of ROBs. Here, we are concerned with their basic layout, the ROB size and retire rate, as shown in Figure 7.62.

Before discussing possible basic layouts of ROBs we introduce the term 'reordering of instructions'. We use this term to designate the basic function of an ROB, that is, to provide strict in-order instruction completion by reordering out-of-order finished instructions.

Figure 7.62 Design space of ROBs. Reorder buffer (ROB): basic layout, ROB size, retire rate.

However, as mentioned before, ROBs are more versatile. As Figure 7.63 shows, they can be employed in three different ways. In the simplest layout, an ROB provides just reordering, as discussed earlier.

In the next layout, an ROB is used for both reordering and renaming, as discussed in Section 7.5. In this case, each ROB entry has to provide space to hold the result of the corresponding instruction as well. Here, we note that in Smith and Pleszkun's original proposal (1988) the ROB also contained the instruction results.

In the third alternative, the ROB is used for shelving as well, and is frequently referred to as the DRIS (Deferred scheduling, Register renaming Instruction Shelf). In this case, the ROB also has to provide space for shelving, which means either space for the source register numbers or space for source operands, depending on the operand fetch policy. Although the DRIS is an attractive construct, it is a fairly complex one. So far the DRIS has only been proposed for one processor, the Lightning. We recall that this ambitious processor never reached the market.

Figure 7.63 Basic layout of ROBs.

  Reordering alone: ES/9000 (1992p), PowerPC 603 (1993), PowerPC 604 (1995), PowerPC 620 (1996), PM1 (1995), R10000 (1996)
  Reordering and renaming: Am29000 sup (1995), K5 (1995), PentiumPro (1995)
  Reordering, renaming and shelving (DRIS): Lightning (1991p)

We will briefly discuss two further aspects of the design space. In particular, we are concerned with two implementation details, the ROB size and the retire rate. The ROB size determines the number of entries in an ROB. This parameter limits the number of active, that is, issued but not yet completed, instructions in the processor. As Table 7.5 shows, recent powerful superscalar processors provide as many as 16-64 entries in the ROB. This means that a maximum of 16-64 instructions may be active in these processors.

We note that the total number of shelves and the number of reorder buffer entries must be in balance. The reorder buffer holds all pending, that is, not yet completed, instructions. Some of the pending instructions are waiting in shelving buffers for their operands and/or for dispatch. Others are in the process of execution. Therefore, we expect more reorder buffer entries than shelves in recent superscalar processors. As the data in Table 7.6 demonstrates, this is true for most processors listed. Exceptions are the Nx586 and the R10000, where more shelves than reorder buffer entries are available.

The retire rate specifies the maximum number of instructions that can be completed by the ROB in each cycle. This value indicates the maximum throughput of the processor.

As far as the maximum throughput is concerned, typically, the processor may be considered as consisting of a number of sequentially linked subsystems which are decoupled from each other by some kind of buffers. These subsystems are fetch, issue, dispatch (if applicable), execute and retire (or completion). The maximum throughput of the processor as a whole is determined by the 'weakest' subsystem. In a good design, the throughput of all sequentially linked subsystems is balanced. This means that the fetch width, the maximum issue rate, the maximum dispatch rate (if applicable) and the maximum retire rate should be more or less the same. For fine-tuning of these parameters, extensive simulations are needed using benchmark programs.

Table 7.5 ROB implementation details.

  Processor             ROB size  Issue rate  Retire rate  Intermediate results stored  Designation
  ES/9000 (1992p)       32        2           2            No                           Completion control logic
  PowerPC 602 (1995)    4         2           1            n.a.                         Completion unit
  PowerPC 603 (1993)    5         3           2            No                           Completion buffer
  PowerPC 604 (1995)    16        4           4            No                           ROB
  PowerPC 620 (1996)    16        4           4            No                           ROB
  PentiumPro (1995)     40        3           3            Yes                          ROB
  Am29000 sup (1995)    10        4           2            Yes                          ROB
  K5 (1995)             16        4           4            Yes                          ROB
  PM1 (Sparc64, 1995)   64        4           4            No                           Precise state unit
  UltraSparc (1995)     n.a.      4           n.a.         n.a.                         Completion unit
  PA 8000 (1996)        56        4           4            Yes                          Instruction reorder buffer
  R10000 (1996)         32        4           4            No                           Active list
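
The 'weakest subsystem' argument made above for the retire rate can be written as a one-line calculation. The sketch below is a simplification under our own assumptions: the function name is hypothetical, buffering and stalls are ignored, and the result is only an upper bound on sustained throughput. The issue and retire rates are those listed for the Am29000 sup in Table 7.5, while the fetch and execute figures are made-up placeholders.

```python
def bottleneck(stage_rates):
    """Return (stage, rate) of the slowest stage, which bounds overall throughput."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]


# Issue and retire rates as listed for the Am29000 sup in Table 7.5;
# the fetch and execute rates are illustrative placeholders.
rates = {"fetch": 4, "issue": 4, "execute": 4, "retire": 2}
print(bottleneck(rates))  # ('retire', 2)
```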

7.8 Preserving the sequential consistency of exception processing

When instructions are executed in parallel, interrupt requests, which are caused by exceptions arising in instruction execution, are also generated out of order. If these requests are acted upon immediately, interrupts occur out of order, that is, in a different order than in a sequentially operating processor. In this situation we say that the sequential consistency of the interrupts is weak, or in other words that we have to deal with imprecise interrupts, as shown in Figure 7.64.

Figure 7.64 Sequential consistency of exception processing.

  Weak consistency (imprecise interrupts): Power1 (1990), Power2 (1993), α processors
  Strong consistency (precise interrupts): MC88110 (1993), Pentium (1993) and processors usually making use of an ROB, such as ES/9000 (1992p), PowerPC line, PA 8000 (1996), R10000 (1996)

When an imprecise interrupt occurs, the processor is unable to reconstruct the correct state unless appropriate additional mechanisms are employed. For instance, let us assume that the processor executes two subsequent instructions in parallel, an 'older' division (div) and a 'younger' addition (ad). It can happen that the younger 'ad' finishes first and updates the processor state. If subsequently the older 'div' instruction causes an interrupt, for example because of an overflow, at the time when the interrupt request of the 'div' is accepted the processor state is already 'corrupted' by the result of a later instruction. Thus, without any additional measures, it becomes impossible to reconstruct the correct state of the processor at the time when it accepted the interrupt request caused by the 'div' instruction. In a number of ILP-processors, including early superscalar processors, interrupts can be imprecise, such as in the Power2 or R8000. In both these processors, the FP interrupts are imprecise. A further example is the α architecture, where all arithmetic exceptions are imprecise.

Most advanced superscalar processors maintain strong sequential consistency with respect to exception processing, so that after interrupts the state of the processor remains consistent with the state a sequential processor would have. An obvious way to achieve precise interrupts is to maintain in-order instruction completion, for instance by using an ROB, and to accept interrupts caused by an instruction only when the related instruction becomes the next to retire. A few earlier and most recent superscalar processors provide precise interrupts, as indicated in Figure 7.64.
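
The div/ad scenario can be replayed with in-order retirement to show why the interrupt becomes precise. This is a sketch under our own assumptions: the function name and the tuple format are hypothetical, and the model commits results only at retirement, accepting an exception only when the faulting instruction is the oldest one not yet retired.

```python
def run_with_precise_interrupts(program, finish_order):
    """Retire instructions in program order; accept an exception only at retirement.

    program: list of (name, raises_exception) in program order.
    finish_order: names in the order in which execution finishes.
    Returns (names whose results were committed, faulting instruction or None).
    """
    finished = set()
    committed = []
    next_to_retire = 0
    for name in finish_order:
        finished.add(name)
        # Retire as many instructions as possible from the oldest end.
        while (next_to_retire < len(program)
               and program[next_to_retire][0] in finished):
            instr, raises = program[next_to_retire]
            if raises:
                # The faulting instruction is now the oldest one: every older
                # instruction has committed and no younger one has.
                return committed, instr
            committed.append(instr)
            next_to_retire += 1
    return committed, None


program = [("load", False), ("div", True), ("ad", False)]
# The younger 'ad' finishes before the older 'div', which overflows.
print(run_with_precise_interrupts(program, ["ad", "load", "div"]))
# (['load'], 'div')
```

Although 'ad' finished first, its result was never committed, so the state seen by the interrupt handler is the one a sequential processor would have at the 'div'.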

Comparison of VLIW and Superscalar Processors

VLIW: A VLIW architecture receives a single multi-operation instruction.
Superscalar: A superscalar processor accepts a traditional sequential flow of instructions but can issue multiple instructions.

VLIW: The VLIW approach needs very long instruction words to specify what each execution unit should do.
Superscalar: The superscalar processor receives a sequential instruction stream, decodes it, and its issue unit issues multiple instructions to multiple execution units. In superscalar processors, the issue unit can issue 2 to 6 instructions per cycle.

VLIW: VLIW processors expect dependency-free code.
Superscalar: Superscalar processors do not expect dependency-free code; they deal with dependencies using special hardware.

VLIW: VLIW is less complex.
Superscalar: Superscalar processors with the same degree of parallelism are more complex than VLIW architectures.

VLIW: VLIW relies on static scheduling.
Superscalar: Superscalar processors rely on dynamic scheduling.

15 pages
Static Interconnection Networks
No ratings yet
Static Interconnection Networks
10 pages
Distributed vs Shared Memory Systems
No ratings yet
Distributed vs Shared Memory Systems
11 pages
Au Aix6tuning PDF
No ratings yet
Au Aix6tuning PDF
20 pages
Decoder Vs Demultiplexer
No ratings yet
Decoder Vs Demultiplexer
2 pages
AMI BIOS Survival Guide
No ratings yet
AMI BIOS Survival Guide
22 pages
S7-Plcsim Advanced Function Manual
No ratings yet
S7-Plcsim Advanced Function Manual
93 pages
TDC 3000 Dcs
No ratings yet
TDC 3000 Dcs
7 pages
Section - A - Unit-2 STORED PROGRAM CONCEPT
No ratings yet
Section - A - Unit-2 STORED PROGRAM CONCEPT
100 pages
The Common Concepts of Laptop
No ratings yet
The Common Concepts of Laptop
22 pages
Understanding Computer Hardware Lesson Plan
No ratings yet
Understanding Computer Hardware Lesson Plan
2 pages
Multiprocessor Systems Overview
No ratings yet
Multiprocessor Systems Overview
26 pages
DLCO. 5th Unit Questions Wise
No ratings yet
DLCO. 5th Unit Questions Wise
39 pages
Cao Newest - Computer Architecture and Organization
No ratings yet
Cao Newest - Computer Architecture and Organization
124 pages
8382 Optimod 1.1.0 Manual Press
No ratings yet
8382 Optimod 1.1.0 Manual Press
244 pages
Algorithm Design and Problem Solving
No ratings yet
Algorithm Design and Problem Solving
51 pages
UNIT-V - Input-Output Organization
No ratings yet
UNIT-V - Input-Output Organization
14 pages
Lectures5 14
No ratings yet
Lectures5 14
85 pages
Clock Homework Ks1
100% (1)
Clock Homework Ks1
6 pages
Computer Architecture
No ratings yet
Computer Architecture
7 pages
FINAL Revised XI CS Computer Systems and Organisation Unit1 Part1 2020-21
No ratings yet
FINAL Revised XI CS Computer Systems and Organisation Unit1 Part1 2020-21
60 pages
CFD Assessment of Railway Tunnel Train Fire
No ratings yet
CFD Assessment of Railway Tunnel Train Fire
16 pages
80286
No ratings yet
80286
28 pages
8086 Processor
No ratings yet
8086 Processor
49 pages
Engineering
No ratings yet
Engineering
139 pages
Computer Science Exam Questions
No ratings yet
Computer Science Exam Questions
15 pages
Simulation of A Virtual CPU Executing Mathematical Functions in Python
No ratings yet
Simulation of A Virtual CPU Executing Mathematical Functions in Python
38 pages
Ca Lab Programs
No ratings yet
Ca Lab Programs
12 pages
Microprocessor vs. Microcontroller
No ratings yet
Microprocessor vs. Microcontroller
4 pages
HPE - A50004307enw - HPE ProLiant DL380 Gen11
No ratings yet
HPE - A50004307enw - HPE ProLiant DL380 Gen11
86 pages
Eh Cache User Guide
No ratings yet
Eh Cache User Guide
233 pages
Assembleur - Intel 64 and IA-32 Architectures Application Note TLBS, Paging Structure Caches and Their Invalidation (2008) - (Intel)
No ratings yet
Assembleur - Intel 64 and IA-32 Architectures Application Note TLBS, Paging Structure Caches and Their Invalidation (2008) - (Intel)
34 pages

7.7.1 Sequential consistency models

Figure 7.57 Interpretation of the concept of sequential consistency of instruction execution.

detected and appropriately resolved during superscalar execution.

Preserving sequential consistency

Sequential consistency models of instruction execution.

7.7.2 Load/store reordering

• it permits load/store bypassing,

Below we discuss these points on the basis of Figure 7.59.

Figure 7.59 Reordering of memory accesses of load/store instructions.
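
The condition under which a load may bypass earlier stores can be sketched in a few lines. This is only an illustrative model, not the mechanism of any particular processor: the function name `load_may_bypass` and the representation of pending stores as `(address, value)` pairs are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical model: a pending (not yet completed) store is an
# (address, value) pair; the list is ordered oldest first.
PendingStore = Tuple[int, int]


def load_may_bypass(load_addr: int, pending_stores: List[PendingStore]) -> bool:
    """Return True if the load address matches no pending store address.

    Only in that case can the load safely execute ahead of the stores
    that precede it in program order.
    """
    return all(store_addr != load_addr for store_addr, _ in pending_stores)


# Usage: two older stores are still waiting to be written to memory.
stores = [(0x100, 7), (0x200, 9)]
independent_load = load_may_bypass(0x300, stores)  # no address match
dependent_load = load_may_bypass(0x200, stores)    # matches the second store
```

In this sketch a load whose address matches a pending store simply may not bypass; what happens to it next is left out of the model.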

this speculative behaviour is justified. The correctness of speculative loads must be

7.7.3 The reorder buffer (ROB)

Figure 7.62 Design space of ROBs.

Basic layout of ROBs

P ES/9000 Lightning (1991p)

Figure 7.63 Basic layout of ROBs.
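
The basic idea of a reorder buffer, in which instructions may finish out of order but are retired strictly in program order, can be sketched as a small queue. This is a minimal illustration under assumed names (`ReorderBuffer`, `allocate`, `retire`, and the `finished` flag are all hypothetical), not the layout of any of the processors named above.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: a bounded queue with its head on the left, tail on the right."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = deque()

    def allocate(self, name: str) -> dict:
        """Add an instruction at the tail, in program order."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("ROB full: instruction issue must stall")
        entry = {"name": name, "finished": False}
        self.entries.append(entry)
        return entry

    def retire(self) -> list:
        """Remove finished instructions from the head, stopping at the
        first one that has not finished yet."""
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["name"])
        return retired


# Usage: i3 finishes first but cannot retire before i1 and i2.
rob = ReorderBuffer(capacity=4)
i1, i2, i3 = (rob.allocate(n) for n in ("i1", "i2", "i3"))
i3["finished"] = True
step1 = rob.retire()   # nothing retires: the head (i1) is unfinished
i1["finished"] = True
step2 = rob.retire()   # only i1 retires; i2 still blocks i3
i2["finished"] = True
step3 = rob.retire()   # i2 and i3 retire together, in program order
```

Retiring only from the head is what keeps the architectural state consistent with sequential execution, even though the finish order was i3, i1, i2.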

ES/9000 (1992p) 32 2 2 No Completion control logic

Imprecise interrupts Precise interrupts

Power1 (1990) MC88110 (1993)

Figure 7.64 Sequential consistency of exception processing.

VLIW architecture | Superscalar processor
VLIW processors expect dependency-free code | Superscalar processors do not expect dependency-free code
VLIW uses static scheduling | Superscalar uses dynamic scheduling
