Abstract

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old.

This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and explaining how GPUs leverage hardware-compiler techniques where the compiler guides the hardware during execution. In particular, it reveals how the issue logic works, including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyses how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance.

By modeling all these newly discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average of 13.98% MAPE with respect to real hardware (NVIDIA RTX A6000). Also, we demonstrate that this new model holds for other NVIDIA architectures, such as Turing.

Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of performance and area.

1 Introduction

In recent years, GPUs have become very popular for executing general-purpose workloads [21] in addition to graphics. GPUs' architecture provides massive parallelism that can be leveraged by many modern applications such as bioinformatics [23, 59], physics [52, 87], and chemistry [39, 89], to name a few. Nowadays, GPUs are the main candidates to accelerate modern machine learning workloads, which have high memory bandwidth and computation demands [27]. Over the last years, there have been significant innovations in GPUs' microarchitecture, their interconnection technologies (NVLink [63]), and their communication frameworks (NCCL [64]). All these advances have enabled inference and training of Large Language Models, which require clusters with thousands of GPUs [56].

However, there is scarce information on the microarchitecture design of modern commercial GPUs, and current academic studies [17, 49] take the Tesla microarchitecture [51] as the baseline, which was launched in 2006. GPU architectures have undergone significant changes since Tesla; hence, a model based on that architecture can skew the reported findings. This work aims to unveil different features and details of various components that modern NVIDIA GPU architectures use, in order to improve the accuracy of academic microarchitecture models. The model and details explained in this work allow researchers to better identify challenges and opportunities for improving future GPUs. In summary, this paper makes the following contributions:

• Describes the operation of the issue stage, including dependence handling, readiness conditions of warps, and the issue scheduler policy.
• Describes a plausible operation of the fetch stage and its scheduler that coordinates with the issue stage.
• Provides important details of the register file and explains the behavior of the register file cache. Moreover, it shows that modern NVIDIA GPUs do not use an operand collection stage or collector units.
• Reveals multiple details of the components of the memory pipeline.
• Redesigns the SM/core model used in the Accel-sim simulator [49] from scratch and integrates all the details we revealed in this paper into the model.
• Validates the new model against real hardware and compares it against the Accel-sim simulator [49]. Our new model achieves a mean absolute percentage error (MAPE) of execution cycles against real hardware of 13.98% for the NVIDIA RTX A6000 (Ampere), which is 18.24% better than the previous simulator model.
• Demonstrates that a naive stream buffer for instruction prefetching provides greater performance accuracy, and its performance is similar to that of a perfect instruction cache.
• Shows how the register file cache and the number of register file read ports affect simulation accuracy and performance.
• Compares the performance, area, and simulation accuracy of the dependence management system that we unveil in this paper against the traditional scoreboard. The comparison reveals that this novel software-hardware co-design is a more efficient alternative than handling dependencies with traditional scoreboards.
• Shows the portability of the model to other NVIDIA architectures such as Turing.

The rest of this paper is organized as follows. First, we introduce the background and motivation of this work in section 2. In section 3, we explain the reverse engineering methodology that we have employed. We describe the control bits in modern NVIDIA GPU architectures and their detailed behavior in section 4. Later, we present the core microarchitecture of these GPUs in section 5.
Next, we describe the features we have modeled in our simulator in section 6. Section 7 evaluates the accuracy of our model against real hardware and compares it to the Accel-sim framework simulator, analyzes the impact of a stream buffer for instruction prefetching, studies the effect of the register file cache and the number of register file read ports, compares different dependence management mechanisms, and discusses how the model holds for other NVIDIA architectures. Section 8 reviews previous related work. Finally, section 9 summarizes the main lessons of this work.

2 Background and Motivation

Most GPU microarchitecture research in academia relies on the microarchitecture that the GPGPU-Sim simulator employs [1, 17]. Recently, this simulator was updated to include the sub-core (Processing Block in NVIDIA terminology) approach that started in Volta. Figure 1 shows a block diagram of the architecture modeled in this simulator. We can observe that it comprises four sub-cores and some shared components, such as the L1 instruction cache, L1 data cache, shared memory, and texture units.

[Figure 1: SM/Core academia design — block diagram of the SM modeled in Accel-sim: four sub-cores, each with Fetch, Decode, an Instruction Buffer, RAW/WAW and WAR scoreboards, a GTO issue stage, operand collection with collector units, a two-bank register file, dispatch, write-back, and local execution units (INT32, FP32, Tensor cores, Special function); shared L1 instruction cache, memory unit, L1 data cache, and shared memory.]

In the Fetch stage of this GPU pipeline, a round-robin scheduler selects a warp whose next instruction is in the L1 instruction cache and that has empty slots in its Instruction Buffer. These buffers are dedicated per warp and store the consecutive instructions of a warp after they are fetched and decoded. Instructions stay in this buffer until they are ready and selected to be issued.

In the Issue stage, a Greedy Then Oldest (GTO) [77] scheduler selects a warp if it is not waiting on a barrier and its oldest instruction does not have a data dependence with other in-flight instructions in the pipeline. Previous works assume that each warp has two scoreboards for checking data dependences. The first one marks pending writes to the registers to track WAW and RAW dependencies. An instruction can be issued only when all its operands are cleared in this scoreboard. The second scoreboard counts the number of in-flight consumers of registers to prevent WAR hazards [57]. The second scoreboard is necessary because, although instructions are issued in order, their operands might be fetched out of order. This happens for variable-latency instructions such as memory instructions. These instructions are queued after being issued and may read their source operands after a younger arithmetic instruction writes its result, causing a WAR hazard if the source operand of the former is the same as the destination operand of the latter.

Once an instruction is issued, it is placed in a Collector Unit (CU) and waits until all its source register operands are retrieved. Each sub-core has a private register file with multiple banks and a few ports per bank, allowing for multiple accesses in a single cycle at a low cost. An arbiter deals with the possible conflicts among several requests to the same bank. When all source operands of an instruction are in the CU, the instruction moves to the Dispatch stage, where it is dispatched to the proper execution unit (e.g., memory, single-precision, special function), whose latencies differ depending on the unit type and instruction. Once the instruction reaches the write-back stage, the result is written in the register file.

This GPU microarchitecture modeled in Accel-sim [49] resembles NVIDIA GPUs based on Tesla [51], which was released in 2006, updated with a few modern features, mainly a sub-core model and sectored caches with IPOLY [75] indexing similar to Volta [49]. However, it lacks some important components that are present in modern NVIDIA GPUs, such as the L0 instruction cache [20, 28, 65–69] and the uniform register file [20]. Moreover, some main components of the sub-cores, such as the issue logic, register file, or register file caches, among others, are not updated to reflect current designs.

This work aims to reverse engineer the microarchitecture of the core in modern NVIDIA GPUs and update Accel-sim to incorporate the unveiled features. This will allow users of this updated Accel-sim simulator to make their work more relevant by starting with baselines closer to those proven successful by the industry in commercial designs.

3 Reverse Engineering Methodology

This section explains our research methodology for discovering the microarchitecture of the cores (SMs) in NVIDIA Ampere GPUs. Our approach is based on writing small microbenchmarks that consist of a few instructions and measuring the execution time of a particular small sequence of instructions. The elapsed cycles are obtained by surrounding a region of the code with instructions that save the CLOCK counter of the GPU into a register and store it in main memory for later post-processing. The evaluated sequence of instructions typically consists of hand-written SASS instructions (and their control bits). Depending on the test, we visualize the recorded cycles to confirm or refute a particular hypothesis about the semantics of the control bits or a particular feature in the microarchitecture. Two examples to illustrate this methodology are given below:

• We have used the code in Listing 1 to unveil the conflicts of the multi-banked register file (subsection 5.3). Replacing R_X and R_Y with odd numbers (e.g., R19 and R21), we get an elapsed time of five cycles (the minimum, since each sub-core can issue one instruction per cycle). If we change R_X to an even number (e.g., R18) while maintaining R_Y
odd (e.g., R21), the reported number of cycles is six. Finally, the elapsed time is seven cycles if we use an even number for both operands (e.g., R18 and R20). In summary, two consecutive instructions can have from 0 to 2 cycles of bubbles in between, depending on which registers they use.
• Figure 4 is an example of a graphical representation of multiple recorded time marks, which in this case has been employed for discovering the issue policy of warps, as explained in section 5.1.2.

CLOCK
NOP
FFMA R11, R10, R12, R14
FFMA R13, R16, R_X, R_Y
NOP
CLOCK

Listing 1: Code used to check Register File read conflicts.

Although NVIDIA has no official tool to write SASS code (NVIDIA assembly language) directly, various third-party tools allow programmers to rearrange and modify assembly instructions (including control bits). These tools are used, for example, to optimize performance in critical kernels when the compiler-generated code is not optimal. MaxAS [36] was the first tool for modifying SASS binaries. Later on, other tools such as KeplerAS [93, 94] were developed for the Kepler architecture. Then, TuringAS [90] and CUAssembler [29] appeared to support more recent architectures. We have decided to use CUAssembler due to its flexibility, extensibility, and support for the latest hardware.
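As a reference for how the elapsed cycles are collected, the snippet below shows a CUDA-level version of the timing harness: the SM clock is read (via the %clock special register) before and after a short instruction sequence, and the delta is written to memory for post-processing. This is only an illustration of the idea; the actual experiments patch the SASS and its control bits directly with CUAssembler, so the kernel, its parameters, and the measured sequence here are ours and merely stand in for the hand-written SASS.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative timing harness: read %clock around the region under test and
    // store the elapsed cycles. The real experiments hand-edit the generated SASS
    // (instructions and control bits) with CUAssembler, so treat this as a sketch
    // of the methodology, not the exact code used.
    __global__ void time_sequence(unsigned int *out, float a, float b, float c) {
        unsigned int start, stop;
        float r = a;
        asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
        // Region under test: a couple of dependent FFMA-like operations.
        r = fmaf(r, b, c);
        r = fmaf(r, b, c);
        asm volatile("mov.u32 %0, %%clock;" : "=r"(stop));
        if (r < 0.0f) out[32] = 0;           // keep the FMAs from being optimized away
        out[threadIdx.x] = stop - start;
    }

    int main() {
        unsigned int *d_out;
        cudaMalloc(&d_out, 33 * sizeof(unsigned int));
        time_sequence<<<1, 32>>>(d_out, 1.0f, 2.0f, 3.0f);
        unsigned int elapsed[32];
        cudaMemcpy(elapsed, d_out, sizeof(elapsed), cudaMemcpyDeviceToHost);
        printf("elapsed cycles (lane 0): %u\n", elapsed[0]);
        cudaFree(d_out);
        return 0;
    }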
4 Control Bits in Modern NVIDIA GPU Architectures

The ISA of modern NVIDIA GPU architectures contains control bits and information that the compiler provides to maintain correctness. Unlike the GPU architectures assumed in previous works, which check data dependences by tracking register reads and writes at run time (see section 2), these GPU architectures rely on the compiler to handle register data dependencies [62]. For this purpose, all assembly instructions include some control bits to properly manage dependencies, in addition to improving performance and energy consumption.

Below, we describe the behavior of these control bits included in every instruction. The explanation is based on some documents [29, 36, 43, 44], but these documents are often ambiguous or incomplete, so we use the methodology described in section 3 to uncover the semantics of these control bits and verify that they act as described below.

Sub-cores can issue a single instruction per cycle. By default, the Issue Scheduler tries to issue instructions of the same warp if the oldest instruction in program order of that warp is ready. The compiler indicates when an instruction will be ready for issue using the control bits. If the oldest instruction of the warp that issued an instruction in the previous cycle is not ready, the issue logic selects an instruction from another warp, following the policy described in subsection 5.1.

To handle producer-consumer dependencies of fixed-latency instructions, each warp has a counter that is referred to as the Stall counter. If this counter is not zero, this warp is not a candidate to issue instructions. The compiler sets this counter to the latency of the producing instruction minus the number of instructions between the producer and the first consumer. All these per-warp Stall counters are decreased by one every cycle until they reach 0. The issue logic simply checks this counter and will not consider issuing another instruction of the same warp until its value is zero.

For example, an addition whose latency is four cycles and whose first consumer is the following instruction encodes a four in the Stall counter. Using the methodology explained in section 3, we have verified that if the Stall counter is not properly set, the result of the program is incorrect, since the hardware does not check for RAW hazards and simply relies on these compiler-set counters. In addition, this mechanism has area and energy benefits in wiring: wires from the fixed-latency units to the dependence handling components are not needed, in contrast to a traditional scoreboard approach, where they are required.

Another control bit is called Yield, and it is used to indicate to the hardware that in the next cycle it should not issue an instruction of the same warp. If the rest of the warps of the sub-core are not ready in the next cycle, no instruction is issued.

Each instruction sets the Stall counter and the Yield bit. If the Stall counter is greater than one, the warp will stall for at least one cycle, so in this case it does not matter whether Yield is set or not.

On the other hand, some instructions (e.g., memory, special functions) have variable latency, and the compiler does not know their execution time. Therefore, the compiler cannot handle these hazards through the Stall counter. These hazards are resolved through the Dependence counter bits. Each warp has six special registers to store these counters, which are referred to as SBx with x taking a value in the range [0-5]. Each of these counters can count up to 63.

These counters are initialized to zero when a warp starts. To handle a producer-consumer dependence, the producer increases a particular counter after issue and decreases it at write-back. The consumer instruction is instructed to stall until this counter is zero. For WAR hazards, the mechanism is similar, with the only difference being that the counter is decreased after the instruction reads its source operands, instead of at write-back.

In each instruction, there are some control bits to indicate up to two counters that are increased at issue. One of these counters will be decreased at write-back (to handle RAW and WAW dependencies) and the other at register read (to handle WAR dependencies). For this purpose, every instruction has two fields of 3 bits each to indicate these two counters. Besides, every instruction has a mask of 6 bits to indicate which Dependence counters it has to check to determine if it is ready for issue. Note that an instruction can check up to all six counters.
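To make the interplay of these fields concrete, the following sketch models the per-warp control state and the issue-readiness check as we understand them. The structure and field names (stall_counter, yield, sb, wait_mask, inc_wb, inc_rd) are our own illustrative choices, not NVIDIA's.

    // Illustrative model of the per-warp state driven by the compiler-set control bits.
    struct WarpControlState {
        unsigned stall_counter = 0;                   // decremented by one every cycle until 0
        bool     yield         = false;
        unsigned sb[6]         = {0, 0, 0, 0, 0, 0};  // Dependence counters SB0..SB5 (count up to 63)
    };

    struct ControlBits {            // fields carried by every instruction
        unsigned stall;             // value loaded into the Stall counter at issue
        bool     yield;
        int      inc_wb;            // counter increased at issue, decreased at write-back (-1: none)
        int      inc_rd;            // counter increased at issue, decreased at register read (-1: none)
        unsigned wait_mask;         // 6-bit mask of Dependence counters that must be zero to issue
    };

    // A warp may issue its oldest instruction only if these checks pass.
    bool ready_to_issue(const WarpControlState &w, const ControlBits &cb) {
        if (w.stall_counter != 0) return false;                   // fixed-latency hazard pending
        for (int i = 0; i < 6; ++i)
            if (((cb.wait_mask >> i) & 1) && w.sb[i] != 0)        // selected counters must be zero
                return false;
        return true;
    }

    // At issue: load the new Stall counter and Yield bit, and bump the selected counters.
    void on_issue(WarpControlState &w, const ControlBits &cb) {
        w.stall_counter = cb.stall;
        w.yield = cb.yield;
        if (cb.inc_wb >= 0) ++w.sb[cb.inc_wb];  // released later, at write-back
        if (cb.inc_rd >= 0) ++w.sb[cb.inc_rd];  // released after the source operands are read
    }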
Consider that if an instruction has multiple source operands whose producers have variable latency, the same Dependence counter can be used by all these producers without losing any parallelism. It is important to note that this mechanism may encounter limits of parallelism in scenarios where there are more than six consumer instructions with different variable-latency producers. In such cases, the compiler must choose between two alternatives to manage the situation: 1) group more instructions under the same Dependence counter, or 2) reorder the instructions differently.

The incrementing of the Dependence counters is performed the cycle after issuing the producer instruction, so it is not effective until
one cycle later. Therefore, if the consumer is the next instruction, the producer has to set the Stall counter to 2 to avoid issuing the consumer instruction in the following cycle.

An example of handling dependencies with variable-latency producers can be found in Figure 2. This code shows a sequence of four instructions (three loads and one addition) with its associated encoding. As the addition has dependencies with the loads (variable-latency instructions), Dependence counters are used to prevent data hazards. The instruction at PC 0x80 has RAW dependencies with the instructions at 0x50 and 0x60. Thus, SB3 is incremented by instructions 0x50 and 0x60 at issue and decremented at write-back. On the other hand, the addition has WAR dependencies with the instructions at 0x60 and 0x70. In consequence, SB0 is increased by instructions 0x60 and 0x70 at issue and decreased after reading their respective register source operands. Finally, the Dependence counters mask of the addition encodes that, before being issued, SB0 and SB3 must be 0. Note that instruction 0x70 also uses SB4 to control RAW/WAR hazards with future instructions, but instruction 0x80 does not wait for this Dependence counter since it does not have any dependence with that load. Clearing WAR dependencies after reading the source operands is an important optimization, since source operands are sometimes read much earlier than the result is produced, especially for memory instructions. For instance, in this example, instruction 0x80 waits until instruction 0x70 reads R2 to clear this WAR dependence, instead of waiting until instruction 0x70 performs its write-back, which may happen hundreds of cycles later.

An alternative way of checking the readiness of these counters is through the DEPBAR.LE instruction. As an example, DEPBAR.LE SB1, 0x3, {4,3,2} requires the Dependence counter SB1 to have a value less than or equal to 3 to continue with the execution. The last argument ({4,3,2}) is optional, and if used, the instruction cannot be issued until the values of the Dependence counters specified by those IDs (4, 3, 2 in this example) are equal to 0.

DEPBAR.LE can be especially useful in some particular scenarios. For instance, it allows the use of the same Dependence counter for a sequence of N variable-latency instructions that perform their write-back in order (e.g., memory instructions with the STRONG.SM modifier) when a consumer needs to wait only for the first M instructions. Using a DEPBAR.LE with its argument equal to N - M makes this instruction wait for the M first instructions of the sequence. Another example is reusing the same Dependence counter to protect RAW/WAW and WAR hazards. If an instruction uses the same Dependence counter for both types of hazards, as WAR hazards are resolved earlier than RAW/WAW ones, a following DEPBAR.LE SBx, 0x1 will wait until the WAR is solved and allow the warp to continue its execution. A later instruction that consumes its result needs to wait until this Dependence counter becomes zero, which means that the results have been written.
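Summarizing the DEPBAR.LE semantics described above, the readiness condition it imposes can be stated as in the sketch below (the names are ours, and the optional counter list may be empty).

    #include <vector>

    // Readiness condition imposed by DEPBAR.LE SBi, imm, {list}, as we interpret it.
    // 'sb' holds the warp's six Dependence counters.
    bool depbar_le_ready(const unsigned sb[6], int i, unsigned imm,
                         const std::vector<int> &extra) {
        if (sb[i] > imm) return false;      // main counter must be <= the immediate
        for (int k : extra)                 // optional counters must be exactly zero
            if (sb[k] != 0) return false;
        return true;
    }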
Additionally, GPUs have a register file cache used to save energy and reduce contention in the register file read ports. This structure is software-managed by adding a control bit to each source operand, the reuse bit, which indicates to the hardware whether or not to cache the content of the register. More details about the organization of the register file cache are explained in section 5.3.1.

Finally, although this paper focuses on NVIDIA architectures, exploring the AMD GPU ISA documentation [6–15] reveals that AMD also relies on a hardware-software codesign to manage dependencies and boost performance. Similar to NVIDIA's DEPBAR.LE instruction, AMD employs a waitcnt instruction; depending on the architecture, each wavefront (warp) has three or four counters, with each counter dedicated to specific instruction types, and its use is required to protect against data hazards created by those instructions. AMD does not allow regular instructions to wait for a counter to reach zero using control bits, requiring an explicit waitcnt instruction instead, which increases the number of instructions. This design reduces the decoding overhead, yet increases the overall instruction count. In contrast, NVIDIA's alternative enables more concurrent dependence chains even within the same instruction type, as it has up to two more counters per warp, and the counters are not tied to any instruction type. Although AMD does not need software or compiler intervention to avoid data hazards with ALU instructions, it introduced the DELAY_ALU instruction in the RDNA 3/3.5 architectures [13, 15] to mitigate pipeline stalls caused by dependencies. Conversely, NVIDIA depends on the compiler to correctly handle data dependencies by setting the Stall counter for fixed-latency producers, resulting in a lower instruction count but higher decoding overhead.

5 GPU Cores Microarchitecture

In this section, we describe our findings regarding the microarchitecture of the GPU cores of modern commercial NVIDIA GPUs, using the methodology explained in section 3. Figure 3 shows the main components of the GPU cores' microarchitecture. Below, we describe in detail the microarchitecture of the issue scheduler, the front-end, the register file, and the memory pipeline.

5.1 Issue Scheduler

In this subsection, we dissect the Issue Scheduler of modern NVIDIA GPUs. First, we describe which warps are considered candidates for issue every cycle in subsubsection 5.1.1. Then, we present the selection policy in subsubsection 5.1.2.

5.1.1 Warp Readiness. A warp is considered a candidate for issuing its oldest instruction in a given cycle if some conditions are met. Some of these conditions depend on previous instructions of the same warp, while others rely on the global state of the core.

An obvious condition is having a valid instruction in the Instruction Buffer. Another condition is that the oldest instruction of the warp must not have any data dependence hazard with older instructions of the same warp that have not yet completed. Dependencies among instructions are handled through software support by means of the control bits described above in section 4.

Besides, for fixed-latency instructions, a warp is a candidate to issue its oldest instruction in a given cycle only if it can be guaranteed that all needed resources for its execution will be available once issued.

One of these resources is the execution unit. Execution units have an input latch that must be free when the instruction reaches the execution stage. This latch is occupied for two cycles if the width of the execution unit is just half a warp, and for one cycle if its width is a full warp.

For instructions that have a source operand in the Constant Cache, the tag look-up is performed in the issue stage. When the
oldest instruction of the selected warp requires an operand from the constant cache and the operand is not in the cache, the scheduler does not issue any instruction until the miss is serviced. However, if the miss has not been served after four cycles, the scheduler switches to a different warp (the youngest with a ready instruction).

[Figure 3: Modern NVIDIA GPU SM/Core design — the SM comprises four sub-cores plus a shared L1 instruction/constant cache with arbiters, a memory arbiter, a shared memory unit, the L1 data cache, the L0 variable-latency (VL) constant cache, the texture cache, and shared memory. Each sub-core contains a Fetch stage with an L0 Icache and stream buffer, Decode, the Instruction Buffer, the software-hardware dependence handler, the CGGTY issue stage, the Control and Allocate stages, an L0 fixed-latency (FL) constant cache, fixed-latency register file reads, the register file (Regular banks 0 and 1, Uniform, and the register file cache), fixed-latency execution units (FP32, INT32, Tensor cores), a special function unit, a local memory unit, the result queue, and the register file write priority arbiter.]

As for the availability of read ports in the register file, the issue scheduler is unaware of whether the evaluated instructions have enough ports for reading without stalls in the following cycles. We have reached this conclusion after observing that the conflicts in the code shown in Listing 1 do not stall the issue of the second CLOCK if we remove the NOP between the last FFMA and the last CLOCK instruction. We performed a multitude of experiments to unveil the pipeline structure between issue and execute, and we could not find a model that perfectly fits all the experiments. However, the model that we describe below is correct for almost all the cases, so it is the model that we assume. In this model, fixed-latency instructions have two intermediate stages between the Issue stage and the stage(s) for reading source operands. The first stage, which we call Control, is common to fixed- and variable-latency instructions, and its duty is to increase the Dependence counters or read the value of the clock counter if needed. As corroborated by our experiments, this causes an instruction that increases a Dependence counter and an instruction that waits until that Dependence counter is 0 to require at least one cycle in between to make that increase visible, so two consecutive instructions cannot use Dependence counters to avoid data dependence hazards unless the first instruction sets the Yield bit or a Stall counter bigger than one.

The second stage only exists for fixed-latency instructions. In this stage, the availability of register file read ports is checked, and the instruction is stalled in this stage until it is guaranteed that it can proceed without any register file port conflict. We call this stage Allocate. More details about the register file read and write pipeline and its cache are provided in subsection 5.3.

Variable-latency instructions (e.g., memory instructions) are delivered directly to a queue after going through the Control stage (without going through the Allocate stage). Instructions in this queue are allowed to proceed to the register file read pipeline when they are guaranteed not to have any conflict. Fixed-latency instructions are given priority over variable-latency instructions to allocate register file ports, as they need to be completed in a fixed number of cycles after issue to guarantee the correctness of the code, since dependencies are handled by software as described above.

5.1.2 Scheduling Policy. To discover the policy of the issue scheduler, we developed many different test cases involving multiple warps and recorded which warp was chosen for issue in each cycle by the issue scheduler. This information was gathered through instructions that allow saving the current CLOCK cycle of the GPU. However, as the hardware does not allow issuing two of these instructions consecutively, we employed a controlled number of other instructions in between (normally NOPs). We also varied the specific values in the Yield and Stall counter control bits.

Our experiments allowed us to conclude that the warp scheduler uses a greedy policy that selects an instruction from the same warp if it meets the eligibility criteria described above. When switching to a different warp, the youngest one that meets the eligibility criteria is selected.

This issue scheduler policy is illustrated with some examples of our experiments in Figure 4. This figure depicts the issue of instructions when four warps are executed in the same sub-core for three different cases. Each warp executes the same code, composed of 32 independent instructions that can be issued one per cycle.

In the first case, Figure 4 (a), all Stall counters, Dependence masks and Yield bits are set to zero. The scheduler starts issuing instructions from the youngest warp, which is W3, until it misses in the Icache. As a result of the miss, W3 does not have any valid instruction, so the scheduler switches to issue instructions from W2. W2 hits in the Icache since it reuses the instructions brought by W3, and when it reaches the point where W3 missed, the miss has already been served and all remaining instructions are found in the Icache, so the scheduler greedily issues that warp until the end. Later, the scheduler proceeds to issue instructions from W3 (the youngest warp) until the end, since now all instructions are present in the Icache. Then, the scheduler switches to issue instructions
from W1 from beginning to end, and finally, it does the same for W0 (the oldest warp).

[Figure 4: Issue timeline (cycles 0 to 140) of warps W0–W3 running in the same sub-core for the three cases (a), (b), and (c) discussed in the text.]

Figure 4 (b) shows the timeline of when instructions are issued when the second instruction of each warp sets its Stall counter to four. We can observe that the scheduler swaps from W3 to W2 after two cycles, to W1 after another two cycles, and then back to W3 after another two cycles (since W3's Stall counter has become zero). Once W3, W2, and W1 have finished, the scheduler starts issuing from W0. After issuing the second instruction of W0, the scheduler generates four bubbles because it has no other warp to hide the latency imposed by the Stall counter.

Figure 4 (c) shows the scheduler's behavior when Yield is set in the second instruction of each warp. We can see that the scheduler switches to the youngest among the rest of the warps after issuing the second instruction of each warp. For instance, W3 switches to W2, and W2 switches back to W3. We also tested a scenario where Yield is set and no more warps are available (not shown in this figure), and we observed that the scheduler generates a bubble of one cycle.

We call this issue scheduler policy Compiler Guided Greedy Then Youngest (CGGTY), since the compiler assists the scheduler by means of the control bits: the Stall counter, Yield, and the Dependence counters. A sketch of this selection policy is shown below.

However, we have only confirmed this behavior for warps within the same CTA, as we have not yet devised a reliable methodology to analyze interactions among warps from different CTAs.
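The sketch below summarizes the CGGTY selection as we model it: stay on the warp that issued last while it remains eligible, otherwise switch to the youngest eligible warp. The function and its parameters are illustrative; eligible() is assumed to encode the readiness rules of subsubsection 5.1.1 and the control bits of section 4.

    #include <functional>
    #include <vector>

    // Illustrative CGGTY warp selection for one sub-core and one cycle.
    // 'warps_by_age' lists warp IDs from oldest to youngest; 'last_issued' is -1 if none.
    int cggty_select(const std::vector<int> &warps_by_age, int last_issued,
                     const std::function<bool(int)> &eligible) {
        if (last_issued != -1 && eligible(last_issued))
            return last_issued;                              // greedy: keep the same warp
        for (auto it = warps_by_age.rbegin(); it != warps_by_age.rend(); ++it)
            if (eligible(*it))
                return *it;                                  // otherwise: youngest eligible warp
        return -1;                                           // nothing can be issued this cycle
    }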
5.2 Front-end

According to diagrams in multiple NVIDIA documents [20, 28, 65–69], SMs have four different sub-cores, and warps are evenly distributed among sub-cores in a round-robin manner (i.e., warp ID % 4) [43, 44]. Each sub-core has a private L0 instruction cache that is connected to an L1 instruction cache shared among all four sub-cores of the SM. We assume there is an arbiter for dealing with the multiple requests from the different sub-cores.

Each L0 Icache has an instruction prefetcher [60]. Our experiments corroborate previous studies by Cao et al. [24] that demonstrated that instruction prefetching is effective in GPUs. Although we have not been able to confirm the concrete design used in NVIDIA GPUs, we suspect it is a simple scheme like a stream buffer [47] that prefetches successive memory blocks when a miss occurs. We assume that the stream buffer size is 16 based on our analysis, as detailed in subsection 7.3.
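For reference, the sequential-prefetch behavior we assume for this stream buffer is sketched below: a small FIFO of upcoming line addresses that is flushed and refilled on a miss. The class, its DEPTH parameter, and the interface are our own illustration, not a confirmed NVIDIA design.

    #include <cstdint>
    #include <deque>

    // Minimal sequential stream buffer (we assume 16 entries).
    class StreamBuffer {
        std::deque<uint64_t> lines;            // addresses of prefetched cache lines
        static constexpr int DEPTH = 16;
    public:
        bool lookup(uint64_t line_addr) const {
            for (uint64_t a : lines)
                if (a == line_addr) return true;     // hit: the line can be moved into the L0
            return false;
        }
        void refill(uint64_t miss_line_addr) {       // called on an L0 + stream-buffer miss
            lines.clear();
            for (int i = 1; i <= DEPTH; ++i)
                lines.push_back(miss_line_addr + i); // prefetch the next DEPTH consecutive lines
        }
    };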
We could not confirm the exact instruction fetch policy with our experiments, but it has to be similar to the issue policy; otherwise, the condition of not finding a valid instruction in the Instruction Buffer would happen relatively often, and we have not observed this in our experiments. Based on that, we assume that each sub-core can fetch and decode one instruction per cycle. The fetch scheduler tries to fetch an instruction from the same warp that was issued in the previous cycle (or the latest cycle in which an instruction was issued), unless it detects that the number of instructions already in the Instruction Buffer plus its in-flight fetches equals the Instruction Buffer size. In this case, it switches to the youngest warp with free entries in its Instruction Buffer. We assume an Instruction Buffer with three entries per warp, since this is enough to support the greedy nature given that there are two pipeline stages from fetch to issue. Were the Instruction Buffer of size two, the greedy policy of the issue scheduler would fail. For example, assume a scenario in which the Instruction Buffer has a size of two, all the requests hit in the Icache, all warps have their Instruction Buffer full, and in cycle 1 a sub-core is issuing instructions from warp W1 and fetching from W0. In cycle 2, the second instruction of W1 will be issued, and the third instruction will be fetched. In cycle 3, W1 will have no instructions in its Instruction Buffer because instruction 3 is still in decode. Therefore, its greedy behavior would fail, and it would have to switch to issue from another warp. Note that this would not happen with three entries in the Instruction Buffer, as corroborated by our experiments. Note that most previous designs in the literature normally assume a fetch and decode width of two instructions and an Instruction Buffer of two entries per warp. In addition, those designs only fetch instructions when the Instruction Buffer is empty. Thus, the greedy warp always changes at least after two consecutive instructions, which does not match our experimental observations.

5.3 Register File

We have performed a multitude of experiments by running different combinations of SASS assembly instructions to unveil the register file organization. For example, we wrote codes with different pressure on the register file ports, with and without using the register file cache.

Modern NVIDIA GPUs have various register files:

• Regular: Recent NVIDIA architectures have 65536 32-bit registers per SM [65–69], used to store the values operated on by threads. The registers are arranged in groups of 32, each group corresponding to the registers of the 32 threads in a warp, resulting in 2048 warp registers. These registers are evenly distributed between sub-cores, and the registers in each sub-core are organized in two banks [43, 44]. The number of registers used by a particular warp can vary from 1 to 256, and it is decided at compile time. The more
registers used per warp, the fewer warps can run in parallel in the SM.
• Uniform: Each warp has 64 private 32-bit registers that store values shared by all the threads of the warp [43].
• Predicate: Each warp has eight 32-bit registers, each bit being used by a different thread of the warp. These predicates are used by warp instructions to indicate which threads must execute the instruction and, in the case of branches, which threads must take the branch and which ones must not.
• Uniform Predicate: Each warp has eight 1-bit registers that store a predicate shared by all the threads in the warp.
• SB Registers: As described in section 4, each warp has six registers, called Dependence counters, that are used to track variable-latency dependencies.
• B Registers: Each warp has at least 16 B registers for managing control-flow re-convergence [79].
• Special Registers: Various other registers are used to store special values, such as the thread or block IDs.

Unlike previous works [1, 18, 49] that assume the presence of an operand collector to deal with conflicts in the register file ports, modern NVIDIA GPUs do not make use of it. Operand collector units would introduce variability in the elapsed time between issue and write-back, making it impossible to have fixed-latency instructions in the NVIDIA ISA, whose latency must be known at compile time to handle dependencies correctly, as explained in section 4. We have confirmed the absence of operand collectors by checking the correctness of specific producer-consumer sequences of instructions while varying the number of operands that go to the same bank. We observed that, regardless of the number of register file port conflicts, the value required in the Stall counter field of instructions to avoid data hazards and the elapsed time to execute the instruction remain constant.

Our experiments revealed that each register file bank has a dedicated write port of 1024 bits. Besides, when a load instruction and a fixed-latency instruction finish in the same cycle, the one that is delayed one cycle is the load instruction. On the other hand, when there is a conflict between two fixed-latency instructions, for instance an IADD3 followed by an IMAD that uses the same destination bank, neither of them is delayed. This implies the use of a result queue like the one introduced in Fermi [61] for fixed-latency instructions. The consumers of these instructions are not delayed, which implies the use of bypassing to forward the results to the consumer before they are written in the register file.

Regarding reads, we have observed a bandwidth of 1024 bits per bank. The measurements were obtained through various tests that recorded the elapsed time of consecutive FADD, FMUL, and FFMA instructions¹. For instance, FMUL instructions with both source operands in the same bank create a 1-cycle bubble, whereas if the two operands are in different banks, there is no bubble. FFMA instructions with all three source operands in the same bank generate a 2-cycle bubble.

¹ Ampere allows the execution of FP32 operations in both the FP32 and INT32 execution units [67]. Therefore, there are no bubbles between two consecutive FP32 instructions due to execution unit conflicts.

Unfortunately, we could not find a read policy that matches all the cases we have studied, as we observed that the generation of bubbles depends on the type of the instructions and the role of each operand in the instructions. We found that the best approximation that matches almost all the experiments we tested is a scheme with two intermediate stages between the instruction issue and the operand read of fixed-latency instructions, which we call Control and Allocate. The former has been explained in subsubsection 5.1.1. The latter is in charge of reserving the register file read ports.

Each bank of the register file has one read port of 1024 bits, and read conflicts are alleviated by using a register file cache (more details later). Our experiments showed that all the fixed-latency instructions spend three cycles reading source operands, even if the instruction is idle during some of these cycles (for instance, when there are only two source operands): FADD and FMUL have the same latency as FFMA despite having one operand less, and FFMA always has the same latency regardless of whether its three operands are in the same bank or not. If the instruction in the Allocate stage realizes that it cannot read all its operands in the next three cycles, it is held in this stage (stalling the pipeline upwards) and generates bubbles until it can reserve all the ports needed to read the source operands in the next three cycles.
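A first-order approximation of the read bubbles we measured is sketched below, assuming that the bank of a register is its index modulo 2 (consistent with the odd/even behavior observed with Listing 1) and one 1024-bit read port per bank. It ignores the register file cache and overlap with neighboring instructions, so it only illustrates the conflict rule.

    #include <algorithm>
    #include <vector>

    int bank_of(int reg) { return reg % 2; }   // assumption: two banks, selected by register parity

    // Extra cycles needed to read the source operands of one instruction.
    int read_bubbles(const std::vector<int> &src_regs) {
        int per_bank[2] = {0, 0};
        for (int r : src_regs) ++per_bank[bank_of(r)];
        // With one read port per bank, each extra same-bank operand costs one cycle.
        return std::max(0, std::max(per_bank[0], per_bank[1]) - 1);
    }

    // Consistent with the measurements above:
    //   read_bubbles({18, 20})     == 1   (FMUL, both operands in the same bank)
    //   read_bubbles({19, 20})     == 0   (FMUL, operands in different banks)
    //   read_bubbles({18, 20, 22}) == 2   (FFMA, three operands in the same bank)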
5.3.1 Register File Cache. The use of a register file cache (RFC) in GPUs has been investigated for relieving contention in the register file ports and saving energy [2, 30, 32, 33, 78].

Through our experiments, we observed that the NVIDIA design is similar to the work of Gebhart et al. [33]. In line with that design, the RFC is controlled by the compiler and is only used by instructions that have operands in the Regular register file. Regarding the Last Result File structure, what we call the result queue behaves similarly. However, unlike the above-cited paper, a two-level issue scheduler is not used, as explained above in subsubsection 5.1.2.

Regarding the organization of the RFC, our experiments showed that it has one entry for each of the two register file banks in each sub-core. Each entry stores three 1024-bit values, each corresponding to one of the three regular register source operands that instructions may have. Overall, the RFC's total capacity is six 1024-bit operand values (sub-entries). Note that there are instructions that have some operands that require two consecutive registers (e.g., tensor core instructions). In this case, each of these two registers comes from a different bank, and they are cached in their corresponding entries.

The compiler manages the allocation policy. When an instruction is issued and reads its operands, each operand is stored in the RFC if the compiler has set its reuse bit for that operand. A subsequent instruction will obtain its register source operand from the RFC if the instruction is from the same warp, the register ID coincides with the one stored in the RFC, and the operand position in the instruction is the same as in the instruction that triggered the caching. A cached value becomes unavailable after a read request arrives to the same bank and operand position, regardless of whether it hits in the RFC. This is illustrated in example 2 of Listing 2; to allow the third instruction to find R2 in the cache, the second instruction must set the reuse bit of R2, in spite of R2 already being in the cache for the second instruction. Listing 2 shows three other examples to illustrate the RFC behavior.

# Example 1
IADD3 R1, R2.reuse, R3, R4  # Allocates R2
FFMA R5, R2, R7, R8         # R2 hits and becomes unavailable
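The RFC rules described above (entries indexed by bank and operand position, compiler-set reuse bits, and invalidation on any read to the same slot) can be captured by the following sketch; the data structure and its names are our illustrative model, not NVIDIA's implementation.

    // Illustrative per-sub-core RFC: one entry per bank, three operand-position slots each.
    struct RfcSlot { int warp = -1; int reg = -1; bool valid = false; };

    struct Rfc {
        RfcSlot slot[2][3];   // [bank][operand position]

        // Returns true if the operand is served from the RFC. Whether it hits or not,
        // a read to this (bank, position) consumes the cached value; it is cached
        // (again) only if the compiler set the reuse bit for this operand.
        bool read(int bank, int pos, int warp, int reg, bool reuse_bit) {
            RfcSlot &s = slot[bank][pos];
            bool hit = s.valid && s.warp == warp && s.reg == reg;
            s.valid = false;                 // cached value becomes unavailable after the read
            if (reuse_bit) {
                s.warp = warp;
                s.reg = reg;
                s.valid = true;              // value read from the bank is kept for reuse
            }
            return hit;
        }
    };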
Memory instruction        Address register type    WAR    RAW/WAW
Load Global 32 bit        Uniform                    9       29
Load Global 64 bit        Uniform                    9       31
Load Global 128 bit       Uniform                    9       35
Load Global 32 bit        Regular                   11       32
Load Global 64 bit        Regular                   11       34
Load Global 128 bit       Regular                   11       38
Store Global 32 bit       Uniform                   10        -
Store Global 64 bit       Uniform                   12*       -
Store Global 128 bit      Uniform                   16*       -
Store Global 32 bit       Regular                   14        -
Store Global 64 bit       Regular                   16        -
Store Global 128 bit      Regular                   20        -
Load Shared 32 bit        Uniform                    9       23
Load Shared 64 bit        Uniform                    9       23
Load Shared 128 bit       Uniform                    9       25
Load Shared 32 bit        Regular                    9       24
Load Shared 64 bit        Regular                    9       24
Load Shared 128 bit       Regular                    9       26
Store Shared 32 bit       Uniform                   10        -
Store Shared 64 bit       Uniform                   12        -
Store Shared 128 bit      Uniform                   16        -
Store Shared 32 bit       Regular                   12        -
Store Shared 64 bit       Regular                   14        -
Store Shared 128 bit      Regular                   18        -
Load Constant 32 bit      Immediate                 10       26
Load Constant 32 bit      Regular                   29       29
Load Constant 64 bit      Regular                   29       29
LDGSTS 32 bit             Regular                   13       39
LDGSTS 64 bit             Regular                   13       39
LDGSTS 128 bit            Regular                   13       39

Table 2: Memory instruction dependency-release latencies in cycles (WAR and RAW/WAW). Values with * are approximations, as we were unable to gather these data.

from the memory to the register file. We have measured that the bandwidth for this transfer is 512 bits per cycle.

In addition, we can observe that the constant cache's WAR latency is significantly greater than that of loads to global memory, whereas the RAW/WAW latencies are a bit lower. We could not confirm any hypothesis that explains this observation. However, we discovered that accesses to the constant memory done by fixed-latency instructions go to a different cache level than load constant instructions. We confirmed this by preloading a given address into the constant cache through an LDC instruction and waiting for that instruction to complete. Then, we issued a fixed-latency instruction using the same constant memory address and measured that its issue was delayed 79 cycles, which corresponds to a miss, instead of causing no delay, which would correspond to a hit. Therefore, fixed-latency instructions accessing the constant address space use the L0 FL (fixed latency) constant cache, while accesses made through LDC instructions utilize the L0 VL (variable latency) constant cache.

Finally, we analyze the LDGSTS instruction, which was introduced to reduce the pressure on the register file and improve the efficiency of transferring data to the GPU [38]. It loads data from global memory and stores it directly into shared memory without going through the register file, which saves instructions and registers. We can see that its latency is the same regardless of the granularity of the instruction. WAR dependencies have the same latency for all granularities, since they are released when the address is computed. The RAW/WAW dependency is released when the instruction's read step has finished, regardless of the instruction's granularity.
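In our simulator, the release points reported in Table 2 can be captured with a simple lookup keyed by instruction class, access width, and address register type. The sketch below (with a string key format of our own and only a few of the rows filled in) is a minimal illustration of that idea rather than the actual Accel-sim data structure.

    #include <map>
    #include <string>

    // WAR and RAW/WAW release latencies in cycles; raw_waw < 0 means "not applicable".
    struct DepLatency { int war; int raw_waw; };

    // Key format (our choice): "<class>.<width in bits>.<address register type>".
    std::map<std::string, DepLatency> latency_table = {
        {"LoadGlobal.32.Uniform",  {9, 29}},
        {"LoadGlobal.32.Regular",  {11, 32}},
        {"StoreGlobal.32.Regular", {14, -1}},
        {"LDGSTS.128.Regular",     {13, 39}},
        // ... remaining rows of Table 2
    };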
6 Modeling

We have designed from scratch the SM/core model of the Accel-sim framework simulator [49], modifying the pipeline to implement all the details explained in section 4 and section 5 and depicted in Figure 3. The main new components are outlined below.

First, we have added an L0 instruction cache with a stream buffer prefetcher for each sub-core. The L0 instruction and constant caches are connected with a parameterized latency to an L1 instruction/constant cache. We have chosen the cache sizes, hierarchy, and latencies according to previous measurements on the Ampere architecture described by Jia et al. [45].

We have modified the Issue stage to support the control bits, the tag look-up in the newly added L0 constant caches for fixed-latency instructions, and the new CGGTY issue scheduler. We included the Control stage, in which instructions increase the Dependence counters, and the Allocate stage, in which fixed-latency instructions check for conflicts in the access to the register file and the register file cache.

Regarding memory instructions, we have modeled a new unit per sub-core and a unit shared among sub-cores, with the latencies presented in the previous section.

Additionally, as Abdelkhalik et al. [3] have demonstrated that the latency of a tensor core instruction depends on its operands' numeric types and sizes, we have adapted the model to use the correct latency for each operand type and size.

Other details that we have modeled include the execution pipeline shared by all sub-cores for double-precision instructions in architectures without dedicated double-precision execution units in each sub-core. Moreover, we accurately model the timing for reading/writing operands that use multiple registers, which was previously approximated by using just one register per operand. Furthermore, we fixed some inaccuracies in instruction addresses reported in a previous work [40].

Apart from implementing the new SM/core model in the simulator, we have extended the tracer tool. The tool has been extended to dump the IDs of all types of operands (regular registers, uniform registers, predication registers, immediates, etc.). Another important extension is the capability to obtain the control bits of all instructions, since NVBit does not provide access to them. This is done by obtaining the SASS through the CUDA binary utilities [70] at compile time. This implies modifying the compilation of the applications to generate microarchitecture-dependent code at compile time instead of using a just-in-time compilation approach. Unfortunately, for a few kernels (all of them belonging to Deepbench), the NVIDIA tools do not provide the SASS code, which prevents obtaining the control bits for these instructions. To simulate these applications, we use a hybrid mode for dependencies where traditional scoreboards are employed in kernels that do not have the SASS code; otherwise, control bits are used.

We have also extended the tool to capture memory accesses to the constant cache or to global memory through descriptors. Despite the latter
type of accesses being claimed to be introduced in Hopper [70], we have noticed that Ampere already uses them. Descriptors are a new way of encoding memory references that uses two operands. The first operand is a uniform register that encodes the semantics of the memory instruction, while the second encodes the address. We have extended the tracer to capture the address. Unfortunately, the behavior encoded in the uniform register remains untracked.

We plan to make public all the simulator and tracer changes made to the Accel-sim framework.

7 Validation

In this section, we evaluate the accuracy of our proposed microarchitecture for the GPU cores. First, we describe the methodology that we have followed in subsection 7.1. Then, we validate the design in subsection 7.2. Next, we examine the impact of the register file cache and the number of register file read ports on accuracy and performance in subsection 7.4. We also study the impact of the instruction prefetcher in subsection 7.3 and analyze the dependence checking mechanisms in subsection 7.5. Finally, we discuss how the model can be seamlessly adapted to NVIDIA architectures other than Ampere in subsection 7.6.

7.1 Methodology

We validate the accuracy of the proposed GPU core by comparing the results of our version of the simulator against hardware counter metrics obtained on a real GPU. We use four different Ampere [67] GPUs, whose specifications are shown in Table 4. All the GPUs use CUDA 11.4 and NVBit 1.5.5. We also compare our model/simulator with the vanilla Accel-sim simulator framework, since our model is built on top of it.

We use a wide variety of benchmarks from 12 different suites. A list of the suites that have been employed and the number of applications and different input data sets can be found in Table 3. In total, we use 143 benchmarks, 83 of those being different applications and the rest corresponding to just changing the input parameters.

Suite                       Applications   Input sets
Cutlass [71]                      1             17
Deepbench [58]                    3             27
Dragon [86]                       4              6
GPU Microbenchmark [49]          15             15
ISPASS 2009 [17]                  8              8
Lonestargpu [22]                  3              6
Pannotia [25]                     7             11
Parboil [81]                      6              6
Polybench [35]                    8              8
Proxy Apps DOE [84]               3              4
Rodinia 2 [26]                   10             10
Rodinia 3 [26]                   15             25
Total                            83            143

Table 3: Benchmark suites.

7.2 Performance Accuracy

Table 4 shows the mean absolute percentage error (MAPE) of both models (our model and Accel-sim) with respect to the real hardware for each of the GPUs. We can see that our model is significantly more accurate than Accel-sim on all the evaluated GPUs, and for the biggest GPU, the NVIDIA RTX A6000, the MAPE is less than half that of Accel-sim. Regarding the correlation, both models are very similar, but our model is slightly better.

[Figure 5: NVIDIA RTX A6000 percentage absolute error for each benchmark (y-axis: Error (%), 0–500; x-axis: Workload, 0–140; series: Accel-sim and Our paper). Benchmarks are sorted by error in ascending order in each of the configurations.]

Figure 5 shows the APE of both models for the NVIDIA RTX A6000 and each of the 143 benchmarks, sorted in increasing error for each of the models. We can see that our model consistently has a lower absolute percentage error than Accel-sim for all applications, and the difference is quite significant for half of the applications. Moreover, we can observe that Accel-sim has an absolute percentage error greater than or equal to 100% for 10 applications, and it reaches 543% in the worst case, whereas our model never has an absolute percentage error greater than 62%. If we look at the 90th percentile as an indication of the tail accuracy, Accel-sim's absolute percentage error is 82.64%, whereas it is 31.47% for our model. This proves that our model is significantly more accurate and robust than the Accel-sim model.

7.3 Sensitivity Analysis of Instruction Prefetching

The characteristics of the stream buffer instruction prefetcher have a high impact on the global model accuracy. In this section, we analyze the error of different configurations, including disabling the prefetcher, having a perfect instruction cache, and a stream buffer prefetcher with sizes 1, 2, 4, 8, 16, and 32 entries. All the configurations are based on the NVIDIA RTX A6000. The MAPE for each configuration is shown in Table 5. We can see that the best accuracy is obtained with a stream buffer of size 16.

We can draw an additional conclusion from the speed-up results shown in Table 5: a straightforward prefetcher, such as a stream buffer, behaves close to a perfect instruction cache in GPUs. This is because the different warps in each sub-core usually execute the same code region and the code of typical GPGPU applications does not have a complex control flow, so prefetching N subsequent lines usually performs well. Note that since GPUs do not predict branches, it is not worth implementing a Fetch Directed Instruction prefetcher [76], because it would require the addition of a branch predictor.

Regarding simulation accuracy, we conclude that when instruction cache enhancements are not under investigation, using a perfect instruction cache usually yields comparable accuracy with faster simulation speeds. However, for benchmarks where control flow is relevant, such as dwt2d [26], lud [26], or nw [26], employing a perfect instruction cache or omitting stream buffers results in significant inaccuracies (more than 20% difference compared to a
perfect instruction cache and more than 200% with respect to an instruction cache without prefetching). This inaccuracy arises because a perfect instruction cache fails to capture the performance penalties incurred by frequent jumps between different code segments, while the absence of prefetching overly penalizes the execution of other parts of the program, which also reveals that there is an opportunity for improvement in those benchmarks.

                                Ampere                                              Turing
                                RTX 3080    RTX 3080 Ti   RTX 3090    RTX A6000    RTX 2080 Ti
Specifications
Core Clock                      1710 MHz    1365 MHz      1395 MHz    1800 MHz     1350 MHz
Mem. Clock                      9500 MHz    9500 MHz      9750 MHz    8000 MHz     7000 MHz
# SM                            68          80            82          84           68
# Warps per SM                  48          48            48          48           32
Total Shared mem./L1D per SM    128 KB      128 KB        128 KB      128 KB       96 KB
# Mem. part.                    20          24            24          24           22
Total L2 cache                  5 MB        6 MB          6 MB        6 MB         5.5 MB
Validation
Our model MAPE                  17.15%      18%           17.93%      13.98%       19.73%
Accel-sim MAPE                  27.95%      28.19%        28.5%       32.22%       26.67%
Our model Correl.               0.99        0.99          0.98        0.98         0.98
Accel-sim Correl.               0.98        0.98          0.98        0.97         0.95

Table 4: GPU specifications and performance accuracy.
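For clarity, the MAPE values reported in Table 4 and in the rest of this section follow the standard definition, computed over the per-benchmark executed cycles, as sketched below.

    #include <cmath>
    #include <vector>

    // Mean absolute percentage error of simulated vs. measured execution cycles.
    double mape(const std::vector<double> &sim_cycles,
                const std::vector<double> &hw_cycles) {
        double sum = 0.0;
        for (size_t i = 0; i < hw_cycles.size(); ++i)
            sum += std::fabs(sim_cycles[i] - hw_cycles[i]) / hw_cycles[i];
        return 100.0 * sum / hw_cycles.size();
    }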
1R RFC on 1R RFC off 2R RFC off Ideal For MaxFlops, performance is identical regardless of whether
MAPE 13.98% 16.05% 13.38% 13.57% the RFC is present, since only one static instruction makes use of
Speed-up 1x 0.984x 1.012x 1.013x
it. Notably, performance improves dramatically by approximately
MaxFlops APE 2.82% 2.82% 28.97% 28.97%
44% when two read ports per register file bank are employed. This
MaxFlops speed-up 1x 1x 1.44x 1.44x
improvement is logical given that three operands per instruction are
Cutlass APE 9.72% 39.35% 0.97% 2.3%
Cutlass speed-up 1x 0.79x 1.11x 1.12x common and four read ports (two per bank) are sufficient to meet
demand. In contrast, from a simulation accuracy perspective, the
Table 6: MAPE of different RF configurations in the NVIDIA
configuration of the two read ports exhibits a significant deviation.
RTX A6000 and speed-up of each configuration with respect
In the case of Cutlass with sgemm, a single-port configuration
to the baseline (1 read port and RFC enabled).
without a register file cache leads to a substantial performance
degradation (0.78x). This performance drop is consistent with the
perfect instruction cache and more than 200% respect to an instruc-
observation that 35.9% of the static instructions in the program
tion cache without prefetching). This inaccuracy arises because a
make use of the register file cache in at least one of their operands.
perfect instruction cache fails to capture the performance penalties
However, introducing a register file with two read ports per bank
incurred by frequent jumps between different code segments, while
yields a 12% performance improvement, which suggests that there
the absence of prefetching overly penalizes the execution of other
is room for improvement in the organization of the register file and
parts of the program, which also reveals that there is an opportunity
its cache.
for improvement in those benchmarks.
Configuration:       1R RFC on | 1R RFC off | 2R RFC off | Ideal
MAPE:                13.98% | 16.05% | 13.38% | 13.57%
Speed-up:            1x | 0.984x | 1.012x | 1.013x
MaxFlops APE:        2.82% | 2.82% | 28.97% | 28.97%
MaxFlops speed-up:   1x | 1x | 1.44x | 1.44x
Cutlass APE:         9.72% | 39.35% | 0.97% | 2.3%
Cutlass speed-up:    1x | 0.79x | 1.11x | 1.12x

Table 6: MAPE of different RF configurations in the NVIDIA RTX A6000 and speed-up of each configuration with respect to the baseline (1 read port and RFC enabled).

For MaxFlops, performance is identical regardless of whether the RFC is present, since only one static instruction makes use of it. Notably, performance improves dramatically, by approximately 44%, when two read ports per register file bank are employed. This improvement is logical given that three operands per instruction are common and four read ports (two per bank) are sufficient to meet demand. In contrast, from a simulation accuracy perspective, the two-read-port configuration exhibits a significant deviation.

In the case of Cutlass with sgemm, a single-port configuration without a register file cache leads to a substantial performance degradation (0.78x). This performance drop is consistent with the observation that 35.9% of the static instructions in the program make use of the register file cache in at least one of their operands. However, introducing a register file with two read ports per bank yields a 12% performance improvement, which suggests that there is room for improvement in the organization of the register file and its cache.

In summary, the register file architecture, including its cache, has an important effect on individual benchmarks, so its accurate modeling is important. A single port per bank plus a simple cache performs close to a register file with an unbounded number of ports on average, but for some individual benchmarks the gap is significant, which suggests that this may be an interesting area for research.
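To make the port-pressure argument above concrete, the sketch below counts the cycles one instruction needs to read its source operands when each register file bank has a limited number of read ports, with operands that hit in the register file cache (RFC) skipping the banks entirely. The bank mapping, port counts, and RFC lookup are simplifying assumptions for illustration, not the exact organization revealed in this paper.

#include <algorithm>
#include <map>
#include <set>
#include <vector>

// Cycles spent reading the source operands of one instruction when every
// register file bank exposes 'ports_per_bank' read ports. Operands found in
// the RFC do not access the banks. The reg-to-bank mapping (reg % num_banks)
// is an illustrative assumption.
int operand_read_cycles(const std::vector<int>& src_regs,
                        const std::set<int>& rfc_contents,
                        int num_banks, int ports_per_bank) {
    std::map<int, int> reads_per_bank;
    for (int reg : src_regs)
        if (rfc_contents.count(reg) == 0)      // RFC hit: no bank access needed
            reads_per_bank[reg % num_banks]++;
    int cycles = 1;                            // at least one cycle to collect operands
    for (const auto& kv : reads_per_bank)
        cycles = std::max(cycles, (kv.second + ports_per_bank - 1) / ports_per_bank);
    return cycles;
}

// Example: an FFMA-like instruction with three sources, two of them mapping
// to the same bank. With one read port per bank the reads take 2 cycles;
// with two read ports per bank they take a single cycle.
//   operand_read_cycles({4, 6, 9}, {}, 2, 1);  // -> 2
//   operand_read_cycles({4, 6, 9}, {}, 2, 2);  // -> 1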
7.5 Analyzing Dependence Management Mechanisms

In this subsection, we analyze the impact on performance and area of the software-hardware dependence handling mechanism explained in this paper and compare it with the traditional scoreboard method used by former GPUs. Table 7 shows the results for both metrics. Area overhead is reported relative to the area of the regular register file of an SM, which is 256 KB.
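As background for this comparison, the sketch below contrasts the two readiness checks performed at issue time: a classic scoreboard that tracks pending writes per register, and the software-based scheme in which the compiler encodes a stall count and dependence-barrier waits in the instruction itself. The field names and widths are illustrative, loosely following the control bits publicly documented for earlier architectures [36, 43, 44]; they are not a definitive description of the hardware.

#include <bitset>
#include <cstdint>
#include <vector>

// (a) Hardware scoreboard: one pending-write bit per architectural register.
struct Scoreboard {
    std::bitset<256> pending_write;  // illustrative register count
    bool can_issue(const std::vector<int>& srcs, const std::vector<int>& dsts) const {
        for (int r : srcs) if (pending_write[r]) return false;  // RAW hazard
        for (int r : dsts) if (pending_write[r]) return false;  // WAW hazard
        return true;
    }
};

// (b) Compiler-managed control bits carried by each instruction
//     (illustrative layout, in the spirit of [36, 43, 44]).
struct ControlBits {
    uint8_t stall_cycles;   // cycles to wait before issuing the next instruction
    uint8_t wait_mask;      // dependence barriers this instruction waits on
    int8_t  write_barrier;  // barrier released when a variable-latency result arrives (-1: none)
    int8_t  read_barrier;   // barrier released when the source operands have been read (-1: none)
};

// With control bits, the issue logic only needs a per-warp stall counter and a
// handful of barrier-ready bits; no per-register state is required.
struct WarpIssueState {
    uint8_t stall_counter = 0;
    uint8_t barrier_ready = 0x3F;  // six dependence barriers, all initially ready
    bool can_issue(const ControlBits& cb) const {
        return stall_counter == 0 && (cb.wait_mask & ~barrier_ready) == 0;
    }
};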
Compiler hints (a.k.a. control bits) have been used in GPUs at least since the NVIDIA Kepler architecture [93, 94]. A description of these control bits was published by Gray et al. [36]. In the Kepler, Maxwell, and Pascal architectures, one out of every 3 to 7 instructions is usually a compiler-inserted hint instruction. Jia et al. [43, 44] revealed that newer architectures such as Volta or Turing have increased the instruction size from 64 to 128 bits. As a result, newer architectures have shifted from having specific instructions for hints to including the hint bits in each instruction. These bits are intended not only to improve hardware performance but also to ensure the program's correctness by preventing data hazards. CPUs have used compiler hints to help hardware make better decisions about how to use the different resources [80]. These hints have been used to support different components of CPUs, such as data bypass, branch prediction, and caches, among others.
To the best of our knowledge, our work is the first one to unveil the GPU core microarchitecture of modern NVIDIA GPUs and develop an accurate microarchitecture simulation model. Some novel features discovered in our work are: a complete definition of the semantics of the control bits and the microarchitectural changes to support them, the behavior of the issue scheduler, the microarchitecture of the register file and its associated cache, and various aspects of the memory pipeline. These aspects are critical for accurate modeling of modern NVIDIA GPUs.
9 Conclusion

This paper unveils the microarchitecture of modern NVIDIA GPUs by reverse engineering it on real hardware. We dissect the issue stage logic, including analyzing warp readiness conditions and discovering that the issue scheduler among warps follows a CGGTY policy. In addition, we unveil different details of the register file, such as the number of ports and their width. Also, we reveal how the register file cache works. Moreover, this paper exhibits some important characteristics of the memory pipeline, like the size of the load/store queues, contention between sub-cores, and how latencies are affected by the access granularity of memory instructions. Furthermore, we analyze the fetch stage and propose one that meets the requirements of modern NVIDIA GPUs.

Additionally, the paper compiles previous public information about control bits, organizing it, explaining it in detail, and extending it.

In addition, we model all these details in a simulator and compare this new model against real hardware, demonstrating that it is closer to reality than the previous models by improving its accuracy in execution cycles by 18.24%.

Besides, we demonstrate that instruction prefetching with a simple stream buffer in GPUs performs well in terms of simulation accuracy and performance, approaching a perfect instruction cache. Also, we show how the dependence management mechanism based on control bits used in modern NVIDIA GPUs outperforms other alternatives, such as traditional scoreboarding.

Finally, we investigate how the register file cache and the number of register file read ports affect simulation accuracy and performance.

Overall, we can conclude that GPUs are a hardware-compiler co-design where the compiler guides the hardware in handling dependencies and introduces hints that can improve performance and energy efficiency.
References

[1] Tor M. Aamodt, Wilson Wai Lun Fung, and Timothy G. Rogers. 2018. General-Purpose Graphics Processor Architectures. Morgan & Claypool Publishers.
[2] Mojtaba Abaie Shoushtary, Jose Maria Arnau, Jordi Tubella Murgadas, and Antonio Gonzalez. 2024. Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA).
[3] Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A. Badawy. 2022. Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. In 2022 IEEE High Performance Extreme Computing Conference (HPEC). 1-8. https://doi.org/10.1109/HPEC55821.2022.9926299
[4] Jaeguk Ahn, Jiho Kim, Hans Kasan, Leila Delshadtehrani, Wonjun Song, Ajay Joshi, and John Kim. 2021. Network-on-Chip Microarchitecture-based Covert Channel in GPUs. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO '21). Association for Computing Machinery, New York, NY, USA, 565-577. https://doi.org/10.1145/3466752.3480093
[5] Ayaz Akram and Lina Sawalha. 2019. A Survey of Computer Architecture Simulation Techniques and Tools. IEEE Access 7 (2019), 78120-78145. https://doi.org/10.1109/ACCESS.2019.2917698
[6] AMD. 2016. AMD Graphics Core Next Architecture, Generation 3. Reference Guide. Technical Report. AMD.
[7] AMD. 2020. "AMD Instinct MI100" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[8] AMD. 2020. "RDNA 1.0" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[9] AMD. 2020. "RDNA 2" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[10] AMD. 2020. Vega 7nm Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[11] AMD. 2020. Vega Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[12] AMD. 2022. "AMD Instinct MI200" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[13] AMD. 2023. "RDNA 3" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[14] AMD. 2024. "AMD Instinct MI300" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[15] AMD. 2024. "RDNA 3.5" Instruction Set Architecture. Reference Guide. Technical Report. AMD.
[16] Tanya Amert, Nathan Otterness, Ming Yang, James H. Anderson, and F. Donelson Smith. 2017. GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed. In 2017 IEEE Real-Time Systems Symposium (RTSS). 104-115. https://doi.org/10.1109/RTSS.2017.00017
[17] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 163-174. https://doi.org/10.1109/ISPASS.2009.4919648
[18] Aaron Barnes, Fangjia Shen, and Timothy G. Rogers. 2023. Mitigating GPU Core Partitioning Performance Effects. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 530-542. https://doi.org/10.1109/HPCA56546.2023.10070957
[19] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1-7. https://doi.org/10.1145/2024716.2024718
[20] John Burgess. 2020. RTX on—The NVIDIA Turing GPU. IEEE Micro 40, 2 (2020), 36-44. https://doi.org/10.1109/MM.2020.2971677
[21] Martin Burtscher, Rupesh Nasre, and Keshav Pingali. 2012. A quantitative study of irregular programs on GPUs. In Proceedings of the 2012 IEEE International Symposium on Workload Characterization, IISWC 2012. 141-151. https://doi.org/10.1109/IISWC.2012.6402918
[22] Martin Burtscher, Rupesh Nasre, and Keshav Pingali. 2012. A quantitative study of irregular programs on GPUs. In 2012 IEEE International Symposium on Workload Characterization (IISWC). 141-151. https://doi.org/10.1109/IISWC.2012.6402918
[23] Alhadi Bustamam, Kevin Burrage, and Nicholas A. Hamilton. 2012. Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9, 3 (2012), 679-692. https://doi.org/10.1109/TCBB.2011.68
[24] Jianli Cao, Zhikui Chen, Yuxin Wang, He Guo, and Pengcheng Wang. 2021. Instruction prefetch for improving GPGPU performance. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E104A, 5 (2021), 773-785. https://doi.org/10.1587/TRANSFUN.2020EAP1105
[25] Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt, and Kevin Skadron. 2013. Pannotia: Understanding irregular GPGPU graph applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC). 185-195. https://doi.org/10.1109/IISWC.2013.6704684
[26] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC 2009. 44-54. https://doi.org/10.1109/IISWC.2009.5306797
[27] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. (10 2014). https://doi.org/10.48550/arxiv.1410.0759 arXiv:1410.0759
[28] Jack Choquette, Olivier Giroux, and Denis Foley. 2018. Volta: Performance and Programmability. IEEE Micro 38, 2 (2018), 42-52. https://doi.org/10.1109/MM.2018.022071134
[29] Cloudcores. [n. d.]. CuAssembler: An unofficial CUDA assembler, for all generations of SASS. https://github.com/cloudcores/CuAssembler
[30] Hodjat Asghari Esfeden, Amirali Abdolrashidi, Shafiur Rahman, Daniel Wong, and Nael Abu-Ghazaleh. 2020. BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 996-1008. https://doi.org/10.1109/MICRO50266.2020.00084
[31] Massimiliano Fasi, Nicholas J. Higham, Mantas Mikaitis, and Srikara Pranesh. 2021. Numerical behavior of NVIDIA tensor cores. PeerJ Computer Science 7 (2 2021), 1-19. https://doi.org/10.7717/PEERJ-CS.330/FIG-1
[32] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. 2011. Energy-efficient mechanisms for managing thread context in throughput processors. In 2011 38th Annual International Symposium on Computer Architecture (ISCA). 235-246. https://doi.org/10.1145/2000064.2000093
[33] Mark Gebhart, Stephen W. Keckler, and William J. Dally. 2011. A compile-time managed multi-level register file hierarchy. In 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 465-476.
[34] Prasun Gera, Hyojong Kim, Hyesoon Kim, Sunpyo Hong, Vinod George, and Chi-Keung Luk. 2018. Performance Characterisation and Simulation of Intel's Integrated GPU Architecture. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE Computer Society, Los Alamitos, CA, USA, 139-148. https://doi.org/10.1109/ISPASS.2018.00027
[35] Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). 1-10. https://doi.org/10.1109/InPar.2012.6339595
[36] Scott Gray. [n. d.]. MaxAS: Assembler for NVIDIA Maxwell architecture. https://github.com/NervanaSystems/maxas
[37] Anthony Gutierrez, Bradford M. Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, and Timothy Rogers. 2018. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 608-619. https://doi.org/10.1109/HPCA.2018.00058
[38] Steven J. Heinrich and A. L. Madison. 2019. Techniques for efficiently transferring data to a processor. 417 (2019). Issue 62.
[39] Francisco E. Hernández Pérez, Nurzhan Mukhadiyev, Xiao Xu, Aliou Sow, Bok Jik Lee, Ramanan Sankaran, and Hong G. Im. 2018. Direct numerical simulations of reacting flows with detailed chemistry using many-core/GPU acceleration. Computers & Fluids 173 (2018), 73-79. https://doi.org/10.1016/j.compfluid.2018.03.074
[40] Rodrigo Huerta, Mojtaba Abaie Shoushtary, and Antonio González. 2024. Analyzing and Improving Hardware Modeling of Accel-Sim. arXiv:2401.10082 [cs.AR] https://arxiv.org/abs/2401.10082
[41] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2015. Systematic Reverse Engineering of Cache Slice Selection in Intel Processors. In 2015 Euromicro Conference on Digital System Design. 629-636. https://doi.org/10.1109/DSD.2015.56
[42] Charles Jamieson, Anushka Chandrashekar, Ian McDougall, and Matthew D. Sinclair. 2022. gem5 GPU Accuracy Profiler (GAP). In 4th gem5 Users' Workshop.
[43] Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo Scarpazza. 2019. Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking. Technical Report (2019).
[44] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs/1804.06826 (2018). arXiv:1804.06826 http://arxiv.org/abs/1804.06826
[45] Zhe Jia and Peter Van Sandt. 2021. Dissecting the Ampere GPU Architecture through Microbenchmarking. In NVIDIA GTC 2021. NVIDIA. https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/
[46] Zhixian Jin, Christopher Rocca, Jiho Kim, Hans Kasan, Minsoo Rhu, Ali Bakhoda, Tor M. Aamodt, and John Kim. [n. d.]. Uncovering Real GPU NoC Characteristics: Implications on Interconnect Architecture. https://doi.org/10.1109/MICRO61859.2024.00070
[47] N. P. Jouppi. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In 1990 17th Annual International Symposium on Computer Architecture. 364-373. https://doi.org/10.1109/ISCA.1990.134547
[48] Mahmoud Khairy, Jain Akshay, Tor Aamodt, and Timothy G. Rogers. 2018. Exploring Modern GPU Memory System Design Challenges through Accurate Modeling. (2018). http://arxiv.org/abs/1810.07269
[49] Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 473-486. https://doi.org/10.1109/ISCA45697.2020.00047
[50] Ahmad Lashgar, Ebad Salehi, and Amirali Baniasadi. 2016. A Case Study in Reverse Engineering GPGPUs: Outstanding Memory Handling Resources. SIGARCH Comput. Archit. News 43, 4 (4 2016), 15-21. https://doi.org/10.1145/2927964.2927968
[51] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (2008), 39-55. https://doi.org/10.1109/MM.2008.31
[52] Weiguo Liu, Bertil Schmidt, Gerrit Voss, and Wolfgang Müller-Wittig. 2008. Accelerating molecular dynamics simulations using Graphics Processing Units with CUDA. Computer Physics Communications 179, 9 (2008), 634-641. https://doi.org/10.1016/j.cpc.2008.05.008
[53] S. Markidis, S. Chien, E. Laure, I. Peng, and J. S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE Computer Society, Los Alamitos, CA, USA, 522-531. https://doi.org/10.1109/IPDPSW.2018.00091
[54] Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. 2019. Benchmarking the NVIDIA V100 GPU and Tensor Cores. In Euro-Par 2018: Parallel Processing Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers. Springer-Verlag, Berlin, Heidelberg, 444-455. https://doi.org/10.1007/978-3-030-10549-5_35
[55] Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters. In Research in Attacks, Intrusions, and Defenses, Herbert Bos, Fabian Monrose, and Gregory Blanc (Eds.). Springer International Publishing, Cham, 48-65.
[56] Microsoft. 2023. How Microsoft's bet on Azure unlocked an AI revolution. https://news.microsoft.com/source/features/ai/how-microsofts-bet-on-azure-unlocked-an-ai-revolution/
[57] Michael Mishkin. 2016. Write-after-Read Hazard Prevention in GPGPUsim. (2016).
[58] S. Narang and G. Diamos. 2016. DeepBench: Benchmarking Deep Learning operations on different hardware. https://github.com/baidu-research/DeepBench
[59] Marco S. Nobile, Paolo Cazzaniga, Andrea Tangherloni, and Daniela Besozzi. 2016. Graphics processing units in bioinformatics, computational biology and systems biology. Briefings in Bioinformatics 18, 5 (07 2016), 870-885. https://doi.org/10.1093/bib/bbw058
[60] NVIDIA. [n. d.]. NVIDIA Developer Forums: Instruction cache and instruction fetch stalls. https://forums.developer.nvidia.com/t/instruction-cache-and-instruction-fetch-stalls/76883
[61] NVIDIA. 2009. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Technical Report. NVIDIA.
[62] NVIDIA. 2012. Technology Overview NVIDIA GeForce GTX 680. Technical Report. NVIDIA.
[63] NVIDIA. 2014. NVIDIA NVLink High-Speed Interconnect: Application Performance. Technical Report. NVIDIA.
[64] NVIDIA. 2016. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl
[65] NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture: The World's Most Advanced Data Center GPU. Technical Report. NVIDIA.
[66] NVIDIA. 2018. NVIDIA Turing GPU Architecture: Graphics Reinvented. Technical Report. NVIDIA.
[67] NVIDIA. 2020. NVIDIA Ampere GA102 GPU Architecture: Second-Generation RTX. Technical Report. NVIDIA.
[68] NVIDIA. 2022. NVIDIA Ada GPU Architecture. Technical Report. NVIDIA.
[69] NVIDIA. 2022. NVIDIA H100 Tensor Core GPU Architecture. Technical Report. NVIDIA.
[70] NVIDIA. [n. d.]. CUDA binary utilities documentation. https://docs.nvidia.com/cuda/cuda-binary-utilities/
[71] NVIDIA. [n. d.]. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass
[72] Md Aamir Raihan, Negar Goli, and Tor M. Aamodt. 2019. Modeling Deep Learning Accelerator Enabled GPUs. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 79-92. https://doi.org/10.1109/ISPASS.2019.00016
[73] Vishnu Ramadas, Daniel Kouchekinia, Ndubuisi Osuji, and Matthew D. Sinclair. 2023. Closing the Gap: Improving the Accuracy of gem5's GPU Models. In 5th gem5 Users' Workshop.
[74] Vishnu Ramadas, Daniel Kouchekinia, and Matthew D. Sinclair. 2024. Further Closing the GAP: Improving the Accuracy of gem5's GPU Models. In 6th Young Architects' (YArch) Workshop.
[75] B. Ramakrishna Rau. 1991. Pseudo-randomly interleaved memory. In Proceedings of the 18th Annual International Symposium on Computer Architecture (Toronto, Ontario, Canada) (ISCA '91). Association for Computing Machinery, New York, NY, USA, 74-83. https://doi.org/10.1145/115952.115961
[76] G. Reinman, B. Calder, and T. Austin. 1999. Fetch directed instruction prefetching. In MICRO-32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. 16-27. https://doi.org/10.1109/MICRO.1999.809439
[77] Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 IEEE/ACM 45th International Symposium on Microarchitecture (MICRO 2012). IEEE Computer Society, 72-83. https://doi.org/10.1109/MICRO.2012.16
[78] Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, and Onur Mutlu. 2018. LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (Williamsburg, VA, USA) (ASPLOS '18). Association for Computing Machinery, New York, NY, USA, 489-502. https://doi.org/10.1145/3173162.3173211
[79] Mojtaba Abaie Shoushtary, Jordi Tubella Murgadas, and Antonio Gonzalez. 2024. Control Flow Management in Modern GPUs. arXiv:2407.02944 [cs.AR] https://arxiv.org/abs/2407.02944
[80] Aviral Shrivastava and Jian Cai. 2017. Hardware-aware compilation. Springer Netherlands, Netherlands, 795-827. https://doi.org/10.1007/978-94-017-7267-9_26
[81] J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D. Liu, and W. W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Center for Reliable and High-Performance Computing (2012).
[82] Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2023. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2023), 246-261. https://doi.org/10.1109/TPDS.2022.3217824
[83] Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, Zhongliang Chen, Rafael Ubal, José L. Abellán, John Kim, Ajay Joshi, and David Kaeli. 2019. MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 197-209.
[84] Oreste Villa, Daniel R. Johnson, Mike O'Connor, Evgeny Bolotin, David Nellans, Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, Stephen W. Keckler, and William J. Dally. 2014. Scaling the Power Wall: A Path to Exascale. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 830-841. https://doi.org/10.1109/SC.2014.73
[85] Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, and David Nellans. 2021. Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 868-880. https://doi.org/10.1109/HPCA51647.2021.00077
[86] Jin Wang and Sudhakar Yalamanchili. 2014. Characterization and analysis of dynamic parallelism in unstructured GPU applications. In 2014 IEEE International Symposium on Workload Characterization (IISWC). 51-60. https://doi.org/10.1109/IISWC.2014.6983039
[87] Craig Warren, Antonios Giannopoulos, Alan Gray, Iraklis Giannakis, Alan Patterson, Laura Wetter, and Andre Hamrah. 2019. A CUDA-based GPU engine for gprMax: Open source FDTD electromagnetic simulation software. Computer Physics Communications 237 (2019), 208-218. https://doi.org/10.1016/j.cpc.2018.11.007
[88] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. 2010. Demystifying GPU microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 235-246. https://doi.org/10.1109/ISPASS.2010.5452013
[89] Jian Liu, Xiaoxia Li, Zheng Mo, and Li Guo. [n. d.]. Revealing chemical reactions of coal pyrolysis with GPU-enabled ReaxFF molecular dynamics and cheminformatics analysis. ([n. d.]).
[90] Da Yan. [n. d.]. TuringAS: Assembler for NVIDIA Volta and Turing GPUs. https://github.com/daadaada/turingas
[91] Da Yan, Wei Wang, and Xiaowen Chu. 2020. Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 634-643. https://doi.org/10.1109/IPDPS47924.2020.00071
[92] Hosein Yavarzadeh, Mohammadkazem Taram, Shravan Narayan, Deian Stefan, and Dean Tullsen. 2023. Half&Half: Demystifying Intel's Directional Branch Predictors for Fast, Secure Partitioned Execution. In 2023 IEEE Symposium on Security and Privacy (SP). 1220-1237. https://doi.org/10.1109/SP46215.2023.10179415
[93] Xiuxia Zhang. [n. d.]. KeplerAs: An Open Source Kepler GPU Assembler. https://github.com/xiuxiazhang/KeplerAs
[94] Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Austin, Texas, USA) (PPoPP '17). Association for Computing Machinery, New York, NY, USA, 31-43. https://doi.org/10.1145/3018743.3018755
[95] Zhenkai Zhang, Tyler Allen, Fan Yao, Xing Gao, and Rong Ge. 2023. TunneLs for Bootlegging: Fully Reverse-Engineering GPU TLBs for Challenging Isolation Guarantees of NVIDIA MIG. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (Copenhagen, Denmark) (CCS '23). Association for Computing Machinery, New York, NY, USA, 960-974. https://doi.org/10.1145/3576915.3616672