Pipelining. We have seen that, in general, pipelining increases the performance of execution, but at the same time it can create problems. We saw some of these problems starting from yesterday, but I can assure you there are more, and I will go a bit further. We have seen, for example, how the different parts of the pipeline work; this is something I also explained to you during the labs, where you saw how we can obtain a pipelined datapath. Hazards are only, or mainly, present with pipelined architectures: on a non-pipelined architecture most of the problems we have here are not present, because you have only one instruction in execution at a time, so you will never have a data hazard; one instruction starts only after the previous one ends. Instead, when you have concurrency, you can have data hazards. For example, if you have programmed with threads, or when you start programming with threads, you will learn that threads have concurrency problems: if you have shared variables, shared memory, pipes, semaphores or whatever, you will see that these structures are essential for allowing the correct use of shared data. But this depends on you: you must understand how these structures work, and then you can create correct code.
Let's return to our discussion. Hazards are problems that prevent the pipeline from working at full speed, forcing it to stall. First, the structural hazard, which, I remind you, is due to some limitation in the hardware that prevents the pipeline from reaching the best performance it could achieve. This was the case where, for example, we had a limitation of one memory read at a time: I have a load instruction, so I need to read from memory, but at the same time I need to read memory in order to perform the fetch of another instruction. To avoid this conflict, one of the two accesses has to wait.
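To make the cost concrete, here is a minimal sketch with numbers of my own (a classic five-stage pipeline, one shared memory port, one bubble per load/fetch conflict); these are illustrative assumptions, not figures from the slides:

```python
# Sketch of a structural hazard: IF and MEM share a single memory
# port, so (by assumption) every load's MEM cycle blocks one fetch.

def total_cycles(n_instructions, n_loads, stages=5):
    """Cycles to run n_instructions: pipeline fill time, one cycle
    per instruction, plus one stall per memory-port conflict."""
    fill = stages - 1                       # cycles before the first completion
    return fill + n_instructions + n_loads  # each conflict adds a bubble

print(total_cycles(10, 0))  # 14: ideal pipelined execution
print(total_cycles(10, 3))  # 17: three loads -> three extra bubbles
```

With separate instruction and data memories (as in the real MIPS datapath) the `n_loads` term disappears, which is exactly why that split is made.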
Data hazards, in general, are due to the fact that you are pipelining instructions, and these instructions can be deeply connected, like in the case we have here: the same register is used by multiple instructions. So we have the problem that we want to access that register and take its value, but this value is not yet ready to be used; we still see the old value. Only after the write-back cycle is the value of the register ready to be used. Remember that there are other cases too: for example, load instructions have the operand available after the memory stage, but then it must still be written to the final register in the write-back stage. So there are different things to remember, but in general the problem is that we need to wait for a certain stage. To maintain the correctness of the code we can stall the execution: we can block the pipeline by inserting some sort of bubbles.
Stalling preserves correctness. The second solution, forwarding, is smarter, but each solution that increases performance normally has a cost: here, the hardware becomes more complex. In general, forwarding allows access not only to the final register, the register file itself, but also to the temporary pipeline registers: from a temporary register I can take the operand. This works for results produced by the execution stage, which is the one in the middle; but for the load instruction, for example, as I anticipated, the operand is available only after the memory stage has finished: the value is available in the LMD, the Load Memory Data register. So for a load instruction the consumer needs to wait until the fourth cycle, while for all the other instructions the result can be forwarded one cycle earlier.
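A small sketch of that timing, under the usual textbook assumptions (write-back in the first half-cycle, register read in the second); the distances below are my own illustration, not taken from WinMIPS64:

```python
# Bubbles between a producer and a consumer `distance` instructions
# later (distance 1 = the very next instruction).

def stalls(producer, distance, forwarding):
    if forwarding:
        ready = 2 if producer == "load" else 1  # after MEM vs. after EX
    else:
        ready = 3                               # must reach write-back
    return max(0, ready - distance)

print(stalls("alu", 1, forwarding=False))   # 2: wait for write-back
print(stalls("alu", 1, forwarding=True))    # 0: forwarded out of EX
print(stalls("load", 1, forwarding=True))   # 1: the load-use case (LMD)
```

The one remaining load-use bubble is the "fourth cycle" wait mentioned above: even with forwarding, the LMD value simply does not exist before the memory stage ends.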
Okay, then there is something that we already discussed; this is a recap of this example here. Since there are no data dependencies between the other rows, we can possibly move some instructions later. But this depends on the level at which you are writing the code: if you are writing at the assembly level, with assembly and machine code, then unless you use a third-party program that performs the movement, or you modify the program yourself to make it performant, there is no solution to this problem. Instead, if you are in a high-level language you have the compiler, and the compiler normally performs as much optimization as possible. If you have tried GCC, for example, GCC has a setting for the optimization level that goes from zero to three, if I remember correctly. Okay, here is the same thing; so now we can move to the new part of
today's lecture. We have seen both structural and data hazards; now we will discuss control hazards. In general, the problem here is that I write the first instruction, then the second instruction, and so on, and I expect the flow to go from one to the next; but if I have an instruction like X here, this instruction can jump back. The sooner I understand that this instruction is a branch, the sooner I can modify the program counter; otherwise, as we discussed yesterday, I will have some stalls, due to the fact that other instructions are fetched, but they are not the correct instructions to be executed.
This is the basic solution for the MIPS pipeline that I discussed yesterday: the idea is that we need to anticipate as much as possible the calculation of the program counter. This is the solution actually adopted by WinMIPS64; the emulator uses the scheme that I am introducing to you now. As mentioned before, in the previous case, if I had a branch instruction, I calculated the condition and the address where to jump during the execution stage; then, during the memory stage, I modified the program counter depending on whether the branch was taken or not. This was the previous flow, the basic pipeline that we have seen before. The problem here is that I need to wait for this stage to have the calculation, and for this stage to have the modification of the program counter. If the branch must be taken, the program counter changes; if not, I will continue the execution
sequentially. So why not have something like this instead: perform the evaluation of the condition and the calculation of the target address together, in the decode stage, and then modify the program counter right there. In this way the penalty will surely be reduced, because (there should be an example later) the longer I need to wait to have the updated program counter, the more instructions are wrongly fetched and continue through the pipeline. In this way, instead, I have only one instruction, at least in this architecture, that is wrongly fetched: when the branch instruction is in the decode stage there will be a fetch of a new instruction, but if this instruction is the wrong one it will be removed from the pipeline. This is something that we discussed yesterday. Maybe we can open WinMIPS64 for a moment just to see it.
So this is what is happening here: this BNEZ will have the result of the condition calculated here, so I have this loss of performance, and the next instruction requires more clock cycles. Here there is in reality another cause of stall, because I have two nearby instructions: the DADDI that modifies R6, and the BNEZ that needs to read R6 right away. So when the execution stage is finished, there is a forwarding of this value directly here. Here I want to point out that the notation of WinMIPS64 is a bit strange, because when a new block appears the name of the block is written, so ID, but then you see the writing RAW, read after write. In reality, I want you to note that the read after write does not happen here: the read after write happens here, at this point. This is because here I try to read R6, but R6 is not ready yet; at this point here R6 is available, so I can get it. So the RAW label is not shown where the read after write really is: it happened in the cycle before the one where it is written, because, remember, in the decode stage I need to read all the operands before I can go into the execution stage. What happens with the first data hazard here is that these two stalls cost more than one clock cycle: first the wrong instruction is fetched, but I immediately realize that it is the wrong instruction, so at this point I will modify the program counter and at the next cycle I will fetch the correct instruction.
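The bookkeeping behind this can be sketched in a couple of lines; the stage numbering IF=1, ID=2, EX=3, MEM=4 is my own convention here:

```python
# Wrong-path instructions fetched while a taken branch waits to
# update the program counter in stage `resolve_stage`.

def wrong_fetches(resolve_stage):
    return resolve_stage - 1   # one per cycle between IF and resolution

print(wrong_fetches(2))  # 1: resolved in ID, as in WinMIPS64
print(wrong_fetches(4))  # 3: resolved in MEM, as in the basic pipeline
```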
So this is what is happening in this code; it is again lab one, so now you should understand a bit better what is happening. But let's look at the situation without forwarding: it is different. In general you can see that the time required by each instruction is longer. Even if we look at the last instructions that we see here, the DADDI on R6 and then the BNEZ that checks the same register, you see that we cannot perform the decode stage because we need to wait for the write-back: only from the write-back is the value visible, so I cannot move the value earlier even if I already have it. Let's perform a reset, enable forwarding, and execute 24
instructions. So you can see a few more instructions completed, actually: we have increased the performance of execution. We have only two cases where the situation is not as expected: this one is due to the load, and this one is due to the data hazard here. For the load, since I need to use its result here, the execution stage of the dependent instruction can only happen here. So again, when you see this case where first you have EX and then read after write, probably the read after write happened in the cycle before, not in the one where it is written; WinMIPS64 is not perfect in this notation. The sooner we understand the correct program counter, the program counter of the next instruction, the sooner we can execute that instruction; and in WinMIPS64 we have the optimization I just discussed. So the question was: why do we perform this
check in the decode stage? In the architecture that I mentioned, just to repeat, previously we calculated the condition and the address in the execution stage, in the old design, and then we performed the modification of the program counter, according to the condition, in the memory stage. In that case the wait for the new program counter would have been longer. What could have happened in this case? Obviously this code is not exactly right for the case I want to show, but if you need to wait for the memory stage, this means that while you are in M, look just below: you have one instruction in the execution stage, one instruction in the decode stage, and one instruction in the fetch stage. So if, like in the previous design, I modify the program counter in the memory stage, I have three instructions in the pipeline, and this means that I need to flush all these three instructions to maintain correctness. If you fetch instructions which are malicious, and you are able to execute these instructions before you understand that the program counter is wrong, obviously you can create problems in your system. So this is something connected even with security.
Speculation, for example: this is an argument that probably you will see with another professor. Here one instruction is executed each clock cycle; in today's computers, instead, we are able to execute more than one instruction per clock cycle. This is because we have multiple functional units that can work effectively in parallel, so the diagram is not simply skewed like the pipeline here; you have multiple cores too, you have multiple threads too, so it is a mess inside today's processors. We are in a simpler situation, because you first need to understand where we are now; how we reached the situation with multiple cores and multiple functional units working in parallel is something we will get to later. And if you for some reason execute the wrong instruction, this is the general idea, you can have security problems. This is something that was actually exploited in the past: examples are the leaks caused by Spectre and Meltdown, which were two major faults in Intel processors. Here you are not really executing the wrong instruction, but what I want to say is: if you were able to execute an instruction before you understand that the program counter is wrong, it is also a security problem, and for those of you who are in cybersecurity this is quite important.
Okay, the question is: can we do all the operations together? But the point is not performing all the operations together, because the operations are already done together in the decode stage. The better thing would actually be to do these operations in the fetch stage: the sooner you are able to determine the next program counter, the better, because you will not fetch the wrong instruction. So it is not a matter of performing all the operations in the M stage or in the EX stage: the sooner you perform them, the better. This is, in general, what I can say to answer your question. When you see branch prediction, which is another argument that will deepen this, you will see that there is the possibility of trying to predict whether you need to take the branch or not, as soon as possible. But if I am not able to do this prediction correctly, I will lose performance. I will also deepen part of this argument in the next slides. Being able to improve this matters, because in Intel processors there are not five stages: Intel processors nowadays have 14 stages, if I remember correctly, something like that. So if you resolve the branch at the ninth stage, for example, this means that in the meantime you have probably already executed some instructions and you have filled the pipeline of one core, and you even have multiple cores that work in parallel. The more stages I have, the worse it is for branches. So the possible solution would be: I need to do it as soon as possible, and then possibly block instructions which could be malicious, which is what we will try to see in a moment with this code.
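A back-of-the-envelope sketch of why this gets worse in deep, wide pipelines; the stage count and issue width below are illustrative assumptions, not the real microarchitecture:

```python
# Wrong-path issue slots filled before a branch resolves, assuming
# `width` instructions enter the pipeline per cycle.

def misspeculated_slots(resolve_stage, width):
    return (resolve_stage - 1) * width

print(misspeculated_slots(2, 1))  # 1: five-stage scalar, resolved in ID
print(misspeculated_slots(9, 4))  # 32: deep 4-wide pipe, resolved in stage 9
```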
Let me take the lab, maybe the second exercise; let me copy and paste the code from the second exercise, with floating point. Okay, let's try this one, because I don't actually remember if this problem shows up here, since there is a strong data dependency; but if there were no data dependency, some instructions could finish earlier, which is connected to the security issue I was referring to. Maybe later; I will open this and let you see it later. Okay, so: pipelining.
Perfect. Okay, so: control hazards. We have already seen part of the problem: the sooner I am able to understand whether I need to take the branch or not, the smaller the penalty that I have. I fetch the wrong instruction, in this case as if the branch were not taken, but then I fetch the right one; this is a loss of clock cycles. For managing the branches, the simplest approach is the most naive one: if we have a branch, I can simply stall the pipeline. Obviously this grants me no advantage whether the branch is taken or not: I have a branch instruction, so I block the pipeline until the next program counter is calculated. This is a worse situation even with respect to the WinMIPS64 behaviour that you have seen, because of what WinMIPS64 does in that last clock cycle instead. The second technique is the one I will show you, the one that WinMIPS64 actually implements: predict untaken. I assume that the branch is not taken and I simply continue executing the next instructions. If the branch is indeed untaken, it's like the last case of the exercises in the lab, and in this case I don't have any penalty induced by the branch. Instead, if the branch must be taken, this is the case that I discussed: I lose one clock cycle, because in reality instruction i+1 is not the correct instruction to be executed; the branch target is. So with predict untaken I will have a penalty only if the branch must be taken. It's very simple to implement.
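The trade-off can be written down directly; the 60% taken fraction below is just an assumed number, and the one-cycle penalty matches a branch resolved in the decode stage:

```python
# Average stall cycles per branch for two simple policies.

def avg_branch_penalty(policy, taken_fraction, penalty=1):
    if policy == "stall":
        return penalty                   # always pay, taken or not
    if policy == "predict_untaken":
        return taken_fraction * penalty  # pay only on taken branches
    raise ValueError(policy)

print(avg_branch_penalty("stall", 0.6))            # 1
print(avg_branch_penalty("predict_untaken", 0.6))  # 0.6
```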
This is also what happens in the MIPS64 version that I will discuss during the lectures. The third technique is predict taken: if I have a branch instruction, I predict that the branch will be taken. But the problem is that this scheme works only if I am able to compute the condition and the address, and modify the program counter, in the fetch stage: in the fetch stage I would need to do all of this. In our pipeline, as we will see, it cannot be implemented, because we know the condition, and with it the updated program counter, only after the decode stage. So in WinMIPS64 what happens is that you fetch the next sequential instruction, and we fall back into predict untaken. An observation that is actually correct at this point: if I am able to perform the calculation directly in the fetch stage, why would I need predict taken at all? This is why it is actually not implemented; it is a secondary thing, but obviously if you are able to perform the calculation directly in the fetch stage, you don't need to perform a prediction: you already have the correct program counter to execute the next instruction.
Making some assumption, instead, can be useful especially when I cannot do the operations in the fetch stage. I repeat why this is normally not done in the fetch stage: the problem is that the fetch stage is slow; the fetch stage needs to access the memory to get the instruction. Since the fetch stage is slow, we need in some way to move the logic out of it. Moreover, during the fetch I don't know which instruction it is, so I cannot compute the condition: I am just taking 32 bits from memory, that's it; they are aligned, but I don't yet know what they mean. Since doing this in the fetch stage is problematic, I will do it in the decode stage, for example, but then I want to increase the probability of guessing whether I need to take the branch or not; for this I need, in some way, hardware that supports both predict taken and predict untaken. This is difficult due to the situation that we will see in the next pack of slides; just to anticipate: we are assuming that our processor is divided into five stages, but this is not the normal situation. What I want to transmit is that the fetch stage, since it is slower, is not where this happens; predict taken or untaken can be implemented, but normally not in the first cycle, maybe in the second, because there I already have the instruction and I can do some hardware calculation for every instruction, since I don't yet know whether it is a branch or not; I will check, for example, the opcode value.
In the future you will deepen this, because there are tables that can store the result of the jump, or the prediction: they are called the branch history table and the branch target buffer, and there are even more, but for the sake of this discussion the idea of hardware that somehow supports predict taken or untaken is enough. Normally, what was observed is that branches are taken with higher probability, and this is the case of loops: when you have loops, like in the case of lab one, we have five iterations actually, so four times I will take the branch and the last time I will not take it. So if I were able to predict the branch as taken, the penalty would obviously be smaller with respect to the case we have seen. The compiler, obviously, can be built in a way that makes it possible to guide the prediction; it depends also on the architecture: if the architecture allows it, it is possible to have a smart compiler that increases the probability of guessing correctly whether we need to take a branch or not.
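For the lab-one loop this is easy to count; a sketch, assuming one lost cycle per misprediction:

```python
# A loop branch is taken on every iteration except the final exit.

def loop_penalty(iterations, predict, cost=1):
    taken, not_taken = iterations - 1, 1
    wrong = not_taken if predict == "taken" else taken
    return wrong * cost

print(loop_penalty(5, "untaken"))  # 4: mispredict every back-edge
print(loop_penalty(5, "taken"))    # 1: mispredict only the exit
```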
not. Then there is a fourth technique that I want to discuss, which is called the branch delay slot, and the idea is the following. Let's imagine that I have a branch, and an instruction which is not in some way related to it. With respect to the previous instruction here, for example, I have a read-after-read case: there is a data dependency, but it is of the type we are not scared of, because the register is only read. So these two have a data dependency, but I'm not worried, since I know it is harmless. An instruction that normally comes before the branch can be moved here, into the slot after the branch: it must be an instruction that actually needs to be executed in any case, and not, for example, the HALT that we have seen, because the HALT must be executed only at the very end. So, instead of the stall that you have seen just before, after the BNEZ, for the five iterations I will try to fill the slot with an instruction which comes before it: this instruction, for example.
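A sketch of the idea, with made-up instruction strings (the registers and opcodes are illustrative, not the lab code) and one cycle per instruction, filler NOPs included:

```python
# With a delay slot, the instruction after the branch always runs, so
# the compiler tries to put useful, branch-independent work there.

def cycles(schedule):
    return len(schedule)       # one cycle each; a NOP is a wasted one

naive     = ["DADDI R6,R6,-1", "ADD R3,R1,R2", "BNEZ R6", "NOP"]
scheduled = ["DADDI R6,R6,-1", "BNEZ R6", "ADD R3,R1,R2"]  # ADD fills the slot

print(cycles(naive))      # 4
print(cycles(scheduled))  # 3: same useful work, one cycle saved
```

The ADD can move because it neither writes R6 nor depends on the branch outcome; an instruction like HALT could not, since it must run only after the loop exits.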