
AHA Unit - 4

The document discusses architectures for hardware acceleration, focusing on the role of Graphics Processing Units (GPUs) in enhancing computational performance for tasks like linear algebra and graphics rendering. It highlights the limitations of general-purpose processors and the advantages of GPUs, such as their ability to handle high instruction-level parallelism (ILP) and perform numerous arithmetic operations simultaneously. The evolution of GPUs from graphics-specific tasks to general-purpose computations is also outlined, emphasizing the development of shaders and the graphics pipeline for efficient image synthesis.


Architectures for Hardware Acceleration

Dr. Sudeendra kumar K


Department of Electronics and Communication
Engineering
ARCHITECTURES FOR HARDWARE ACCELERATION
Graphics Processors

Unit-5: Lecture 1

Sudeendra kumar K
Department of Electronics and Communication Engineering
Architectures for Hardware Acceleration
Contents

• Introduction to Graphics Processors


• Applications
• More details
Architectures for Hardware Acceleration
Why GPU

• General purpose processors, which are used to run all kinds of code, have their limitations. They can at best achieve an IPC of 4 or 6. However, this
is the best case and in practice to reach an IPC of 4 or 6, we need a near-
perfect branch predictor, a very accurate prefetcher, and copious
amounts of ILP.
• It is very hard to find all of these highly desirable traits in regular integer
programs; however, it is often easier to find them in numerical programs
that use a lot of floating point operations.
• We also have issues with power consumption and complexity. Having
large instruction windows, rename tables, and elaborate wakeup-select
logic requires a lot of power, and since power is one of the largest
bottlenecks in modern processors, it is often very difficult to scale the
issue width beyond 6 instructions per cycle.
Architectures for Hardware Acceleration
Why GPU

• Many programs such as airplane wing simulation, weather prediction, and image recognition have far more ILP and far higher performance
requirements.
• Linear algebra forms a core of these methods.
Architectures for Hardware Acceleration
Why GPU

• Many linear algebra-based programs have a high ILP


• Existing multicore processors are constrained in terms of their
fetch and issue width
• Solution: create a new processor that has 100s of ALUs for
computing such instructions in parallel
• Method:
• Have a very simple ISA
• Limit irregular memory accesses, conditional branches, and other complex control flow
Architectures for Hardware Acceleration
Linear algebra in GPU
• There are two loops: one for rows and one for columns. These loops are
used to traverse large matrices. In other words, the parameter N can be
very large (let’s say > 1000), and thus the sophistication of the branch
predictor does not matter because most branches are very predictable in
this case.
• Secondly, in a lot of cases N is known in advance (during compile time).
Thus, the programmer can manually optimize the program, and even
break a large computation into several disjoint parts such that these
individual parts can be run on different cores of a multiprocessor.
• It is true that the portability of code is an issue: a piece of code
optimised for one computer will not run efficiently on another computer.
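A minimal sketch of the kind of doubly nested loop being described is shown below (plain C++-style code; the matrix names A, B, C and the use of element-wise addition are assumptions for illustration):

const int N = 1024;                       // N is typically known at compile time
static float A[N][N], B[N][N], C[N][N];   // illustrative N x N matrices

void addMatrices() {
    // One loop for the rows and one for the columns; the loop-closing branches
    // are taken N-1 times out of N, so they are almost perfectly predictable.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];  // the N*N additions are all independent
}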
Architectures for Hardware Acceleration
Linear algebra in GPU
• We further observe that we have N² additions, where there are no
dependences between these individual addition operations. They can
be done in parallel, and thus we have a massive amount of ILP in this
program.
• If we were to run this program on an aggressive out-of-order (OOO)
processor, we will get a very high IPC (close to the fetch or issue width
(minimum of the two)).
• However, to get even larger values of IPC, we need to create a
processor that has many adders that can operate in parallel. If we have
a processor with 100 adders, then we can perform 100 additions in
parallel. If we have 1000 adders, then we can do 1000 additions in
parallel.

• In principle, this sounds like a very enticing idea. If we have a different kind of processor that has hundreds of simple ALUs, then we can perform hundreds of arithmetic operations per cycle. This will give us the required throughput to execute large scientific applications very quickly.
Architectures for Hardware Acceleration
Linear algebra in GPU
• Now, for such programs, we would definitely benefit if we have a
processor with, let’s say, 100 ALUs. This means that at least in theory we can execute 100 operations simultaneously, which is way better than an
aggressive OOO processor that can at best provide a throughput of 4-6
operations per cycle.
• There is a flip side, which is that once we dedicate all our resources to
these ALUs, and math processing engines, we are left with little area and
energy to implement traditional functionalities of a processor such as
fetch, decode, schedule, memory access, and commit.
• This implies that we need to simplify all of this logic, and create a
processor that can execute a limited set of operations very efficiently. We
essentially need to constrain the scope of the programs, and provide
extremely high instruction throughput for the programs in this limited set.
Architectures for Hardware Acceleration
Linear algebra in GPU
Architectures for Hardware Acceleration
Computation Problem

• If we consider a 3 GHz processor with an IPC equal to 2 (floating point + memory operations), then we are only allowed to execute 7633
instructions per pixel per second.
• If we consider a 60 Hz monitor that displays 60 frames per second, it
means that we can only run 127 instructions per pixel per frame. This
might be enough for minimising a window; however, this is clearly not
enough for all the games that we play.
• We need to calculate a lot of things such as scenes, animations,
shadows, illumination, and even the way that the hair of the
characters will bounce when they jump.
• Sadly, the resources of a general purpose processor fall short of the requirements of modern games, which run on extremely high
resolution displays (e.g. 3840×2160 UHD displays). Even if we can
barely process the scenes in a game, we cannot do anything else like
simultaneously browsing the web.
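As a rough sanity check of these numbers (assuming a 1024×768 display, which is an assumption on our part; the slide does not state the resolution used):

\[
\frac{3\times 10^{9}\ \text{cycles/s} \times 2\ \text{instructions/cycle}}{1024 \times 768\ \text{pixels}} \approx 7630\ \text{instructions per pixel per second},
\qquad
\frac{7630}{60\ \text{frames/s}} \approx 127\ \text{instructions per pixel per frame}.
\]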
Architectures for Hardware Acceleration
Computation Problem

• In response to such requirements, processor manufacturers increasingly


started to ship their systems with an additional graphics processor, abbreviated as a GPU (graphics processing unit), along with regular processors. The job of the GPU was to process graphics operations. This
required support at multiple levels.
Architectures for Hardware Acceleration
Required Support at Multiple Levels

• It is important to note that till 2007 GPUs were predominantly used for
graphics intensive tasks. There was no explicit thought of using GPUs for
other generic tasks as well. This actually came gradually after 2010, when
the community realized that GPUs could be re-purposed for general
purpose computations as well. Thus the idea of a GPGPU was born.
Architectures for Hardware Acceleration
Early days of GPU
• Graphics applications can trace their origin to the 70s when the main
applications of computer graphics were in CAD (computer aided design) and
flight simulation.
• Gradually, with an increase in computing power, many of these applications
started migrating to workstations and desktops. By the late eighties, some new
applications such as video editing and computer games had also arrived, and
there was a fairly large market for them as well.
• This is when a gradual transition from CPUs to GPUs began. The modern GPU
can trace its origin to early designs proposed by NVIDIA and ATI in the late nineties.
Architectures for Hardware Acceleration
Early days of GPU
• Before we proceed forward, it is important to define two terms here – vector
and raster. These concepts can be explained as follows. There are two ways to
define a rectangle.
• First, we can define it by the coordinates of its top-left corner, its height and
width. In this case, each rectangle can be defined by 4 floating point values.
This is a vector graphics system.
• We can optionally specify the colour and width of the boundary of the
rectangle, and maybe the colour that is used to fill the inside of the rectangle.
• In comparison, in a raster system we store a rectangle as a matrix of pixels.
Each value in the matrix represents the colour of the pixel. This means that if a
rectangle contains 10,000 pixels, we need to store 10,000 values.

Vector Graphics vs. Raster Graphics


Architectures for Hardware Acceleration
Early days of GPU

• Both the systems have their relative advantages and disadvantages. If we


take an image (drawn using vector graphics software) and enlarge it,
then it will still retain its visual appeal.
• On the enlarged display, the system will still be able to draw the rectangle
correctly. However, if we do the same with raster graphics, then it is
possible that the image might actually look very grainy.
• Also, for stretching and transforming an object, vector graphics is much
better. In comparison, if we need to add effects such as illumination,
shadows, or blurring, then raster graphics is preferable. In general,
we prefer raster graphics when we work with photographs.
Architectures for Hardware Acceleration
Early days of GPU
• Given the fact that there is no clear winner, it is advisable to actually
support both the methods while creating a graphics processing
engine.
• This engine needs to have multiple types of units for performing
different kinds of tasks. Early systems were multi-pass systems [Blythe,
2008] – multiple computational passes were made on the same
image.
• Each pass added a particular kind of transformation. This becomes particularly difficult when images require a large amount of memory.
Hence, a single-pass method is desired, where we can conceptually
create a graphics pipeline. Images, scenes, or videos enter the
pipeline at one end, undergo a process of transformation, and then
exit from the other end. The end of the pipeline is typically connected
to a display device such as a monitor.
Architectures for Hardware Acceleration
Early days of GPU
• By the beginning of this millennium the idea of creating a graphics
pipeline with many specialized units, as well as many general purpose
units that could be programmed, started to take shape.
• Also by that time a lot of graphics applications had emerged, and the
space of applications was just not limited to a few well defined tasks.
• Fixed function pipelines that consisted of a set of simple units that
could just perform basic vertex and pixel transformation tasks were not
powerful enough for these new classes of applications.
Architectures for Hardware Acceleration
Early days of GPU
• As a result, the best option was to provide
users much more flexibility in terms of
what they could do. Thus the idea of a
shader program was born.
• A shader is a small program that processes
a fixed set of vertices or pixels. It is
typically used to apply transformations to
images, and add specialized effects such as
rotation, shading, or illumination.
• The conceptual diagram of a shader is
shown in Figure. Researchers further
started working on custom languages for
writing shaders. There was a need to make
these languages independent of the
underlying hardware such that they could run on different kinds of GPUs.
Conceptual Diagram of a Shader
Architectures for Hardware Acceleration
Early days of GPU

• Shaders have matured over the years. Starting from very rudimentary
vertex and pixel processing operations, they have become significantly
sophisticated. They are used for all kinds of applications: - motion
detection, adding texture, lighting, shadows, edge detection, and
blurring.
• Much of today’s research into the graphics aspect of GPUs is focused on
designing and supporting more sophisticated shaders. Note that even
though a shader is a program written in software, to run efficiently, it
needs elaborate hardware support. This is where architectural
techniques become important.
Architectures for Hardware Acceleration
High Level View of a Graphics Pipeline

• A graphics pipeline is meant to synthesize images; it is not meant to only display images. Most of the units in a graphics pipeline are dedicated to image synthesis. This process is also known as rendering.

A basic rendering Pipeline


Architectures for Hardware Acceleration
A Basic Rendering Pipeline
• A basic rendering pipeline is shown in Figure. We have
four units. The programmer specifies a scene as a set
of objects and a set of rules to manipulate those
objects. These objects correspond to different distinct
parts of a scene. These rules are written in high level
graphics programming languages or APIs such as
DirectX or OpenGL.
• Graphics processors have their own assembly language
and instruction formats. The NVIDIA family of graphics
processors compile programs written in C/C++ (using the CUDA library) to a virtual instruction set called PTX (Parallel Thread Execution).
• PTX is a generic instruction set that is compatible with a broad line of NVIDIA processors. At runtime PTX is
compiled to SASS (Shader ASSembler). SASS is a device
specific ISA and is typically not compatible across
different lines of processors, even from the same
vendor.
Architectures for Hardware Acceleration
Vertex Processor
• We start out by dispatching a set of
instructions and a set of objects (to be
manipulated) from the CPU to the GPU.
It accepts several objects as inputs
along with rules for manipulating them.
• Even though it is more intuitive to represent objects as polygons, in the world of computer graphics it is often much easier to represent them as a set of triangles.
• For each triangle, we simply need to
specify the coordinates of its three
vertices, and optionally the colour of its
interior.
Architectures for Hardware Acceleration
Vertex Processor

• There are several advantages of using triangles: -


• When we want to represent a 3D surface, decomposing it into
triangles is always preferred. This is because the three vertices of
a triangle will always be on the same plane. However, this is not
true for a quadrilateral. Hence, even if the surface has many
bumps, twists, bends, and holes, it can still be efficiently
represented by a set of triangles.
• The technique of using triangles is the simplest method to
represent a surface. Hence, we can design many fast algorithms in
hardware to quickly process them.
Architectures for Hardware Acceleration
Vertex Processor
• To generate realistic images it is often necessary to treat a light source
as consisting of many rays of light. For each ray, we need to find its
point of intersection with a surface.
• If a surface is decomposed into triangles, then this is very easy
because there are very efficient methods for computing the point
of intersection between a ray and a triangle.
• A lot of algorithms to add colour and texture to a surface assume a
very simple surface that is easy to work with.
• The surface of a triangle is very easy to work with in this regard.
• Moving, rotating, and scaling triangles can be represented as matrix operations. If we can quickly process such matrix operations, then we can quickly do many complex operations with sets of triangles.
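As a small illustration (a standard result, not taken from the slides), rotating a triangle vertex (x, y) about the origin by an angle θ is a single matrix-vector product; applying the same matrix to all three vertices rotates the whole triangle, and scaling works the same way with a diagonal matrix:

\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
\]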
Architectures for Hardware Acceleration
Vertex Processor

• Since complex rendering tasks can be achieved by manipulating the


humble triangle, we can design the vertex processor by having a lot of
small triangle processing engines.
• Each such engine can further support primitives such as translating,
rotating, scaling, and re-shaping triangles. Such geometrical operations
can be done in parallel for different objects and even for different
triangles within an object.
• Each triangle can additionally be augmented with more information
such as the colour associated with each vertex, and the depth of the
triangle
Architectures for Hardware Acceleration
Polymorph Engine
• In modern GPUs [2011], the Vertex Processor has been replaced by a more sophisticated Polymorph Engine. This is because the demands of modern applications such as 3D scene rendering and virtual reality require complex 3D operations that are well beyond the capabilities of traditional Vertex Processors. Hence, there is a need to create a new pipeline with new functional units.
Stages in the Polymorph Engine
• The key idea of this engine is to focus on what is called
“geometry processing”. We divide the surface of an
object into a set of polygons (often triangles). This
process is known as tessellation. Subsequently, we
perform very complex operations on these polygons in
parallel.
• Figure shows the 5-stage pipeline of the Polymorph Engine (source: NVIDIA GF100). Let us proceed to describe the Polymorph Engine that is a part of most modern GPUs today. Specifically, let us look at the pipeline of the NVIDIA Fermi processor.
Architectures for Hardware Acceleration
Vertex Fetch

• The input to this stage is a set of objects with 3D coordinates. The


coordinates are in the object space, where the coordinates are local to the
object. At the end of this stage, all the vertices are in world coordinates;
this means that all of them use the same reference axes and the same 3D
coordinate system.
• We start out with fetching the vertex data from memory. Subsequently we
perform two actions: - Vertex shading and Hull shading.
• GPUs consist of groups of cores known as streaming multiprocessors (SMs).
The Polymorph Engine delegates a lot of its work to different SMs.
• Specifically, SMs perform two tasks in this stage: vertex shading and hull
shading. Vertex shaders are particularly useful in 3D scenes. They are used
to add visual effects to a scene. For example, a vertex shader can be used
to compute the effect of lighting a surface, or to simulate bones in a lifelike
character.
Architectures for Hardware Acceleration
Vertex Fetch

• In the latter case, we need to compute the new position of each vertex in
the bone as the arm that contains the bones moves. This can be done by
the vertex position translation feature of a vertex shader. Thus to
summarize, the vertex shader works at the level of vertices, and can
primarily change the coordinates, colour, and the texture associated with
each vertex.
• The hull shader divides polygons into several smaller polygons. This is because we want different degrees of granularity at different points in the
generated image. The objects that are closer to the viewpoint need a
finer granularity as compared to objects that are far away.
Architectures for Hardware Acceleration
Tessellation

• The process of tessellation involves breaking down every polygon in the


image into several smaller structures: triangles and line segments.
• The tessellation stage uses inputs from the hull shader. The main reason
for doing tessellation is to create more detail on the surface and to also
enable later stages of the pipeline to create an elaborate surface texture.
Architectures for Hardware Acceleration
Viewport Transformation
• In general, when we create a scene we are more interested in the objects and
the rules that govern the interaction between them. For example, if we are
rendering a scene in a game, we care about the position of the characters,
and how they interact with their environment.
• However, we do not want to show the entire scene on the screen. It might be
too large, and also all the parts of the scene may not be relevant. Let us refer
to the scene that we have worked with up till now as the window. Let us
define a viewport, which is a portion of the coordinate space that we would
like to show.
• There is thus a need to transform the 3D scene in the window to the scene
that will be visible in the viewport. We first need to clip the scene (select a
portion of it) and then perform scaling if the aspect ratio (width/height) of the
viewport is different from that of the display device.
Architectures for Hardware Acceleration
Attribute Setup

• Now that we have created a scene in the viewport, we need to ensure


that it renders correctly, particularly when we create the final 2D
image.
• We do not want the backs of objects to be visible. Furthermore, if there
is a light source, the illumination depends on the direction of the light
rays, and the outward normal of the surface at each point.
• The dot product between the outward normal and the light rays
determines the degree of illumination. For each constituent triangle in
the image, we compute the equation of the plane (known as the plane
equation) that it belongs to, and annotate each triangle with this
information.

Stream Output: -
• The list of triangles is finally written to memory such that it can be used by
subsequent stages. We typically do not have sufficient storage on the GPU
to store all this information
Architectures for Hardware Acceleration
Rasterization
• This process converts all the triangles to sets of pixels. Each such set of
pixels is known as a fragment. This can be achieved by overlaying a uniform grid over each graphical object.
• Each cell of this grid consists of multiple pixels and can be considered
as the fragment. In this stage, we can optionally compute a colour for
the fragment by considering its centre point. We can interpolate its
colour by considering the colours of the vertices of the triangle, which
this point is a part of
• Note that we do not discard all the information about triangles that
comes from the Vertex Processor. Often all of this information is
passed to the subsequent stage (Fragment Processor).
• Since the process of rasterisation typically is not associated with a lot
of flexibility, we can have a dedicated unit for rasterisation. It need not
be very programmable.
Architectures for Hardware Acceleration
Rasterization
• Furthermore, there is some degree of variance in the rasterisation
stage among different processors. In earlier processors such as
NVIDIA Tesla this stage was relatively smaller.
• However, in NVIDIA Fermi and beyond, this unit does visibility
calculations as well. This means that we compute which parts of
objects are visible in a scene.
• There is no hard and fast rule on which action needs to be
performed in which stage as long as all the actions that it is dependent upon are done.
Architectures for Hardware Acceleration
Fragment Processor
• Subsequently, we need to compute the final colour value of each pixel
fragment. We need to take all kinds of visual effects and textures into account before computing this value.
• The job of the fragment processor is to perform all these
computations. The simplest method is to use the interpolated colour
value of the centroid of the entire fragment.
• However, this produces fairly grainy and uneven images. Instead, we
can use a more elaborate process. This often requires solving complex
equations and performing a lot of linear algebra operations.
Architectures for Hardware Acceleration
Fragment Processor: Interpolation
• How do we compute the value of the colour at each pixel?
• Some of the common techniques in this space are Gouraud shading and Phong shading.
• Gouraud shading is a simple linear interpolation based model where we can compute colour values based on the colours of the vertices, the nature of the ambient light source, and a model of reflectivity of the surface.
• It assumes that a triangle is a flat surface, whereas Phong shading, which is a more involved technique, does not make this assumption. It assumes a smoothly varying normal vector (perpendicular to the surface) across the surface of the triangle, and has a much more complex model for reflectivity.
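As a sketch of the idea (a standard formulation, not taken from the slides): with Gouraud shading, the colour at a point P inside a triangle with vertex colours C1, C2, and C3 is a barycentric (linear) interpolation,

\[
C(P) = \lambda_1 C_1 + \lambda_2 C_2 + \lambda_3 C_3,
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\ \lambda_i \ge 0,
\]

where the λᵢ are the barycentric coordinates of P. Phong shading instead interpolates the normal vectors across the triangle in the same way and evaluates the lighting model at every pixel.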
Architectures for Hardware Acceleration
Fragment Processor: Texture Mapping
• Consider a realistic image. It is very unlikely that its surface will be a single
colour, or even be a gradient. For example, the colour of a tree’s surface is not
exactly brown, nor does the colour vary uniformly between different
shades of brown.
• A tree’s colour has a texture. Refer to the figure for examples of different kinds
of textures. Based on rules specified by the programmer, the GPU maps
textures to triangles such that the surface of a tree looks realistic. We show
the effect of adding a wooden texture to an object in the following figure. In
modern graphics processors it is possible to apply several textures and nicely
blend them to produce a combined effect.
Architectures for Hardware Acceleration
Fragment Processor: Fog Computation
• Distance Fog is a 3D rendering technique where pixels that have a
greater depth (further away from the eye) are shaded
(coloured) differently to give a perception of distance. This information can be used here to colour objects farther away slightly differently.
Architectures for Hardware Acceleration
Pixel Engine
• The job of the Pixel Engine is to take the output of the Fragment Processor
and populate the frame buffer. The frame buffer is a simple 2D matrix that
holds only the colour information for each pixel on the screen. The frame
buffer is directly sent to the monitor for display

• Depth and colour buffering: Fragments have different depths (known as the
Z-depth). While rendering 3D images, one fragment might block the view of
another fragment. It is possible to find out if this is happening by comparing
their coordinates and Z-depths.
• Once this computation is done, we can find the fragments that should be
visible and the fragments that should not be visible. We can then look at
their colours (computed from the previous stage) and use them to create a
2D image out of the visible fragments.
Architectures for Hardware Acceleration
Pixel Engine

• Transparency effects: Modern colouring systems are based on three colours: red, green, and blue (RGB).
• In addition, they take an additional value called alpha that specifies the degree of transparency. It varies from 0.0 (fully transparent) to 1.0
(fully opaque).
• If a translucent object (semi-transparent) is in front of an opaque
object, then we should actually be able to see both.
• This part of the graphics pipeline ensures that this is indeed the case.
Once the frame buffer is populated, it is ready to be sent to the display
device
Architectures for Hardware Acceleration
Applications of GPU
• We have discussed the different structures inside a traditional GPU. The crux of the discussion
centered around the concept of breaking complex structures into triangles
and then operating on sets of triangles.
• This was achieved by converting operations on triangles to different kinds
of matrix operations.
• Note that many such linear algebra based operations form the backbone of
a lot of scientific code starting from weather simulation to finding the drag
experienced by a wing of an aircraft.
Architectures for Hardware Acceleration
Applications of GPU

• Starting from 2006, the scientific community gradually woke up to the


fact that graphics processors are almost as powerful as large servers if we
simply compare the number of floating point operations that they can
execute per second (FLOPs).
• For example, in April 2004, high end desktop processors had a peak
throughput of roughly 20 GFLOPs (giga FLOPs), whereas some of NVIDIA’s
GPUs had a peak throughput of 50 GFLOPs [Geer, 2005].
• By May 2005, the peak GPU throughput had increased to 170 GFLOPs
with the CPU performance remaining more or less the same. This is where
programmers sensed an opportunity. They started thinking on how to
commandeer the resources of a GPU to perform their numerical
calculations.
Architectures for Hardware Acceleration
Applications of GPU
• Gradually, GPU designers understood that it is better to make the GPU far
more flexible and programmable. This is beneficial because it opens up new markets: a GPU can now be used for many purposes other than rendering scenes.
• Vendors of GPUs also realized that many of the operations in vertex and
fragment processing are in reality massively parallel computations mostly
based on linear algebra.
• To increase the degree of programmability, it is wiser to make these units
more generic in character. Furthermore, since engineers working on high
performance computing also require such capabilities, it is advisable to
create features such that they can write and run their algorithms on a GPU.
Architectures for Hardware Acceleration
Applications of GPU
• Anticipating such trends, NVIDIA released the CUDA (Compute Unified
Device Architecture) framework in February 2007, which was the first
widely available software development kit (SDK) that could be used to
program a GPU.
• Thus, the modern GPGPU (general purpose GPU) was born. It is a very
programmable and flexible processor that can be used to perform almost
all kinds of high performance numerical computations.
• GPUs today are no longer limited to just processing and creating computer graphics; instead, graphics is just one of the applications that is supported
by a GPU.
Architectures for Hardware Acceleration
Programming GPGPUs

• Graphics processors have elaborate structures to perform graphics intensive


tasks. However, they can also be effectively used for regular scientific code such
as weather simulation, or computing the distribution of temperature in an
object using finite element analysis techniques. Such computations more or less
share the same characteristics.
• In fact, today GPUs can do most of the heavy lifting when it comes to
computational work.
Architectures for Hardware Acceleration
GPU ISAs
• The process of compiling a program on a GPU is a two-step process. The
user writes her program using a variant of C++, which can be one of the popular GPU programming languages: CUDA or OpenCL.
• Let us describe NVIDIA’s CUDA toolkit. CUDA is an abbreviation for Compute Unified Device Architecture. It is an extension of C++ where a user writes a program for NVIDIA’s entire line of GPUs.
• A typical CUDA program looks almost like a C++ program with some
additional directives that specify which part of the code needs to run on
the CPU, and which part of the code is meant for the GPU
Architectures for Hardware Acceleration
GPU Programming
• NVIDIA provides a dedicated compiler called nvcc that
can be used to compile CUDA code. It uses separate
tools to compile the CPU code and the GPU code. The
GPU code is first processed by a C++ preprocessor that
replaces macros.
• Then a compilation pass compiles the code meant for
the GPU into the PTX instruction set, which is a virtual
instruction set. PTX stands for Parallel Thread
eXecution.
• It is a RISC-like ISA with an infinite set of registers,
where the compiler generates code mostly in the single
assignment form – each variable is assigned a value
exactly once.
Architectures for Hardware Acceleration
GPU Programming
• Using a virtual instruction set is preferable because
there is a large diversity in the underlying
hardware and if we generate code for one kind of
hardware, the code will lose its portability.
• Subsequently, we generate a fat binary, which
contains the PTX code, and also contains the
machine code for different popular models of
GPUs. In other words, a single binary contains
different images, where each image corresponds
to a specific GPU ISA.
• Simultaneously, we compile the C++ part of the code using standard C++ compilers, embed
references to GPU functions, and link functions
provided by the CUDA library. nvcc finally produces
one executable that contains both the CPU and the
GPU code.
Architectures for Hardware Acceleration
GPU Programming
• When the program is run, the runtime dispatches the PTX code to the GPU
driver that also contains a compiler. If we are not using a pre-compiled
binary, then this compiler performs just-in-time (JIT) compilation.
• The advantages of just-in-time compilation is that the code can be
optimised for the specific GPU. Given that PTX assumes a virtual machine,
specific optimizations need to be made at a later stage to generate the
final machine code.
• Furthermore, unlike general purpose processors, GPGPUs are still not
completely standardized; fairly invasive changes are happening every
generation.
• Hence, ensuring that code written in the past runs efficiently is a challenge, and this necessitates compilation at runtime. The PTX code is compiled to SASS (Shader ASSembler) code, which is native to the machine. It can be generated separately from the PTX binary using the CUDA utility ptxas as well.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids
• A basic function that is to be executed on a GPU is known as a kernel.
It typically represents a function that needs to be invoked for each
item in a list of items.
• For example, if we are adding two vectors, then the kernel can be a
simple add function that adds two elements in a vector. Each kernel
is called by a CUDA thread, where a thread is defined as a process
(running instance of a program) that can share a part of its address
space with other threads.
• In a GPU we can think of the threads as separate programs that
execute the same code, share data via shared memory structures,
and execute in parallel.
• Each such thread has a unique thread id, which can be obtained by
accessing the threadIdx variable.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids

• The __global__ directive precedes every kernel, indicating that it should


run on a GPU. Let us now explain the built-in threadIdx variable.
• In the CUDA programming language threads are grouped into blocks of
threads. A block of threads contains a set of threads, where each thread
operates on a set of variables assigned to it.
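The kernel that the slides refer to appears only as a figure in the original; a minimal hedged sketch of such a kernel (the names vecAdd, A, B, and C are illustrative) is:

// Sketch of a vector addition kernel; each of the N threads handles one element.
__global__ void vecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;   // index of this thread within its (1D) block
    C[i] = A[i] + B[i];
}

// Host-side invocation with 1 block of N threads (N must not exceed the block limit):
// vecAdd<<<1, N>>>(g_A, g_B, g_C);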
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids

• For example if we are working on a cube, it makes sense to arrange the threads
as per their x, y, and z coordinates. threadIdx in this case has three components:
threadIdx.x, threadIdx.y, and threadIdx.z.
• For the code that we have shown, threads are arranged along a single
dimension, hence we only use threadIdx.x to get the index of the thread. We
further assume that if we have N threads, then each vector also has N elements:
assign one element to each thread.
• For each thread we read the corresponding array index, get the data values, add
them, and write the result to the array C. If all of the threads work in parallel,
then we can potentially get an N times speedup.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids
• The main advantage of arranging threads as a 1D chain, 2D matrix, or a
3D cuboid is that it is easy to partition the data among the threads
because the arrangement of the threads mimics the structure of the data.
• Thread blocks typically cannot contain more than 768 or 1024 threads
(depending on the architecture). On similar lines we can group blocks of
threads into a grid
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids

• Let us now show an example of a matrix addition kernel that uses a 2D block of threads. In this case we shall show a part of the main function that invokes the kernel (a sketch is shown after this list).
• In the main function we define an object called threadsPerBlock of type dim3.
• dim3 is a built-in type, which is a 3-tuple that can contain up to 3 integers: x, y, and z.
• If any integer is unspecified, its value defaults to 1. In this case, we are defining threadsPerBlock to be a pair of integers, which has two elements: N and N. The third element has a default value of 1.
• Thus the value of the variable threadsPerBlock is (N, N, 1). Subsequently, we invoke the kernel function matAdd that will be executed on the GPU.
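A hedged sketch of this example, modelled on the standard CUDA matrix addition pattern (the exact listing in the slides is not reproduced; N is assumed to be small enough that N×N stays within the per-block thread limit):

#define N 16   // assumed; N*N threads must fit in one block

// Kernel that uses a 2D block of threads: thread (i, j) adds one matrix element.
__global__ void matAdd(const float A[N][N], const float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

// Part of the main function that invokes the kernel:
// dim3 threadsPerBlock(N, N);              // i.e. (N, N, 1)
// matAdd<<<1, threadsPerBlock>>>(A, B, C); // one block of N x N threads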
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids
• Let’s say we want to add two N×N matrices, where N = 1024. Furthermore, assume that we cannot create more than 768 threads per block. In this case let us limit ourselves to 16 threads per dimension (assuming 2 dimensions). We can then create N/16 × N/16 blocks, where each block’s dimensions are 16×16.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids

• Similar to threadIdx, blockIdx stores the coordinates of


the block. The variable blockDim stores the
dimensions of each block. It has an x, y, and z
component, which are represented as blockDim.x,
blockDim.y, and blockDim.z respectively.
• Blocks of threads are meant to execute completely
independently on the GPU. They can be scheduled in
any order. However, threads within a block can
synchronise between themselves and share data.
• For synchronizing threads, we can call the
__syncthreads() function, which acts as a barrier for all
the threads in the block.
• A barrier is a point in the code where all the threads
must reach before any of the threads is allowed to
proceed past it.
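Putting these pieces together, a hedged sketch of the blocked matrix addition described above (16×16 threads per block and an N/16 × N/16 grid; the flat array layout and names are assumptions):

#define N 1024

__global__ void matAdd(const float *A, const float *B, float *C) {
    // Global row and column indices, built from the block and thread coordinates
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i * N + j] = A[i * N + j] + B[i * N + j];
    // If the threads of a block had to share intermediate results, a
    // __syncthreads() barrier would be placed after the shared writes.
}

// Host-side launch configuration:
// dim3 threadsPerBlock(16, 16);
// dim3 numBlocks(N / 16, N / 16);
// matAdd<<<numBlocks, threadsPerBlock>>>(g_A, g_B, g_C);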
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids: Summary
• A kernel is a function in CUDA code that executes on a GPU. It is invoked by
the host (CPU) code.
• A thread in the context of a GPU is a process running on the GPU that is
spawned at runtime. Similar to CPU threads, different GPU threads can
share data amongst each other.
• However, the rules for sharing data are far more complex. While invoking a
kernel we typically specify the number of threads that need to be created.
They are created by the runtime.
• Each thread is assigned some data for performing its computation. This
depends on its position within the block. In general, the threads execute in
parallel, and this is why GPUs provide very large speedups.
• A block is a group of threads, where the threads can be organised in a 1D,
2D, or 3D arrangement. The threads in a block can share data, and can
synchronise with each other.
• Similar to threads in a block, blocks are arranged in a grid in a 1D, 2D, or 3D
arrangement. Blocks within a grid are supposed to execute independently
without any form of synchronization.
Architectures for Hardware Acceleration
Memory Access
• In GPUs there are two separate memory spaces: one on the
CPU, and one on the GPU. We need to explicitly manage the
flow of data between these spaces, and ensure that the
process of transferring data is overlapped with computation
as much as possible.
• GPUs have many kinds of memories. Each thread has a per-
thread local memory, then it has a shared memory that is
visible to the rest of the threads in the block, and the lowest
level is called the global memory, which is visible to all the
threads.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids
• Additionally, many GPUs provide access to different read-only
memories. Two of the popular memories in this class are the constant
memory (for read-only constants) and the texture memory.

• The latter is used to store the details of textures in graphics oriented


tasks. In CUDA while allocating a region in memory, we can optionally
add a memory specifier, which indicates where the bytes will be stored.
Architectures for Hardware Acceleration
CUDA code to add two vectors

• First we allocate memory on the host for


the three arrays (Lines 10 to 12). Then we
create three arrays on the device (GPU)
with the same dimensions. These arrays
need to be allocated space on the GPU
(Lines 18 to 20).
• We use the function cudaMalloc for this
purpose, which allocates space in the
global memory. Then, we need to copy the
contents of the arrays from the host’s
memory space to the device’s (GPU’s)
memory space.
Architectures for Hardware Acceleration
CUDA code to add two vectors
• We use the cudaMemcpy function to copy
the arrays; this function takes four arguments: the destination array, the source array, the number of bytes to copy, and the direction of the transfer.
• The third argument (direction of the
transfer) specifies whether we are
transferring data from the host to the
device or from the device to the host. It
thus can have two values:
cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost.
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids

• Then we invoke the kernel in Line 27. We are using N


threads and 1 block. Furthermore, this is a synchronous
call, which means that we wait till the GPU has
computed the result.
• Once this is done, we transfer the contents of the array
g_C from the device to the host. We again call the
cudaMemcpy function; however, this time data is
transferred in the reverse direction. Finally, we free the
arrays on the device in Lines 33-35 using the function
cudaFree.
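The listing that these line numbers refer to appears only as a figure in the original slides. A self-contained hedged reconstruction of the same host-side flow is given below (the array names g_A, g_B, g_C follow the slides; the size N, the initialisation, and the omission of error checking are assumptions, and the line numbers will not match the original listing):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define N 1024   // assumed vector length (fits in one block of N threads)

__global__ void vecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    size_t size = N * sizeof(float);

    // Allocate and initialise the three arrays on the host
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }

    // Create the corresponding arrays in the GPU's global memory
    float *g_A, *g_B, *g_C;
    cudaMalloc((void **)&g_A, size);
    cudaMalloc((void **)&g_B, size);
    cudaMalloc((void **)&g_C, size);

    // Copy the inputs from the host's memory space to the device's memory space
    cudaMemcpy(g_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(g_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke the kernel with 1 block of N threads
    vecAdd<<<1, N>>>(g_A, g_B, g_C);

    // Transfer the result back in the reverse direction
    cudaMemcpy(h_C, g_C, size, cudaMemcpyDeviceToHost);
    printf("h_C[10] = %f\n", h_C[10]);

    // Free the arrays on the device, and then on the host
    cudaFree(g_A); cudaFree(g_B); cudaFree(g_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}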
Architectures for Hardware Acceleration
Kernels, Threads, Blocks, and Grids
• For multi-dimensional arrays we can use the function calls
cudaMallocPitch() and cudaMalloc3D(), which are used to allocate
2D and 3D arrays respectively.
• It is recommended to use these functions rather than using
cudaMalloc because these functions take care of the alignment
issues in data.
• Additionally, we have similar functions to copy data from the device
to the host and vice versa: cudaMemcpy2D and cudaMemcpy3D.
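A hedged sketch of how these calls fit together (the dimensions and the host array h_mat are illustrative assumptions; the calls are those of the CUDA runtime API):

#include <cuda_runtime.h>

// Allocate a pitched 2D array on the device and copy a dense host array into it.
void copy2DToDevice(const float *h_mat, int width, int height) {
    float *d_mat;
    size_t pitch;   // row stride in bytes, chosen by the runtime for proper alignment

    // The width passed to cudaMallocPitch is specified in bytes
    cudaMallocPitch((void **)&d_mat, &pitch, width * sizeof(float), height);

    // Copy the host rows (stride = width * sizeof(float)) into the pitched device rows
    cudaMemcpy2D(d_mat, pitch, h_mat, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    /* ... launch kernels that use d_mat and pitch ... */

    cudaFree(d_mat);
}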
Architectures for Hardware Acceleration
Streams, Graphs, and Events

• The code discussed is to a certain degree inefficient because it has three


well defined phases – transfer the data to the GPU, perform the
computation, and transfer the results back to the CPU – that execute
sequentially even when they are not required to.
• It would have been much better if we could pipeline these phases such
that the GPU’s compute engines are not idle when the data is being
transferred to them.
• Additionally, if we consider multiple kernels, then one kernel needs to
wait till all the outputs of the previously executed kernel have been
transferred to the CPU. This is again inefficient.
• To solve problems of such a nature, the designers of CUDA introduced the
concept of streams. A stream is defined as a sequence of commands that
are meant to execute in order. The commands can be initiated by
different threads on the host
Architectures for Hardware Acceleration
Streams, Graphs, and Events
• We can create different streams. There is no ordering of
commands across the streams – they are independent.
Depending on the compute capabilities of the device it might
allow different kinds of overlaps to occur or it might preclude
them.
• Consider the code once again. It has three distinct phases:
transfer to the GPU (ToGPU), compute the result (Compute),
and transfer to the host (ToCPU).
• Without any streams, their execution is as shown in Figure (a).
There is no overlap between the phases. Let us now create
two streams where we partition the arrays into two parts.
• If the arrays have N elements each, let the two parts cover the indices [0, N/2] and [N/2+1, N−1] respectively. Stream 1 is assigned the first half of indices, and stream 2 is assigned the second half. In our code that performs simple addition, stream 1 and stream 2 do not have any dependences between them.
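A hedged sketch of this two-stream version is shown below (vecAdd and the array names are the ones assumed in the earlier sketches; for the copies to actually overlap with computation, the host arrays must be allocated as pinned memory, e.g. with cudaMallocHost, and N is assumed to be even):

#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C);  // kernel from the earlier sketch

void addWithTwoStreams(const float *h_A, const float *h_B, float *h_C,
                       float *g_A, float *g_B, float *g_C, int N) {
    cudaStream_t s[2];
    int half = N / 2;

    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        int off = i * half;   // stream 0 handles the first half, stream 1 the second half

        // ToGPU phase: asynchronous copies issued on stream i
        cudaMemcpyAsync(g_A + off, h_A + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        cudaMemcpyAsync(g_B + off, h_B + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);

        // Compute phase: the kernel is launched on the same stream
        vecAdd<<<1, half, 0, s[i]>>>(g_A + off, g_B + off, g_C + off);

        // ToCPU phase: copy this half of the result back
        cudaMemcpyAsync(h_C + off, g_C + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);   // wait for both streams to drain
        cudaStreamDestroy(s[i]);
    }
}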
Architectures for Hardware Acceleration
Streams, Graphs, and Events

• Figure (b) shows the timeline of the execution with


streams. We divide each phase into two parts, and
distribute it among the two streams.
• Let us quantitatively compare the difference in the
execution time. Assume that the three phases take 1 unit of
time each.
• In Figure (b), each phase takes 0.5 units of time. Stream 1 starts at time t = 0 and ends at t = 1.5. Stream 2 starts at t = 0.5, and ends at t = 2.0. We thus observe that by using streams, we speed up the execution from 3 units of time to 2 units of time.
• We reduced the time of execution by one-third. Depending on the amount of resources, and the nature of overlaps that are allowed, we can increase the speedup even more by having more streams.
A sample execution with two streams
• For example, some devices allow the concurrent execution
of kernels and some allow concurrent data transfers
Architectures for Hardware Acceleration
Graphs
• In the CUDA framework, the costs of launching a kernel and managing the data
transfers between the CPU and GPU are high. This cost is even more
pronounced when we have a lot of kernels that run for a short duration.
• Even if we group kernels into streams, the static overhead of setting up the
kernels in the GPU, loading their instructions, and initialising the GPU’s
hardware structures, does not reduce. In such scenarios CUDA graphs are useful

A graph in CUDA (the example shows a graph where node C is a subgraph)
Architectures for Hardware Acceleration
Graphs
• A graph is a data structure that contains a set of vertices (nodes) and edges.
As shown in Figure, each edge joins two nodes.
• In CUDA, a graph can contain the following types of nodes:
• Kernel (runs on the GPU)
• Function call on the CPU.
• Memory copy and initialization
• Another child graph
Architectures for Hardware Acceleration
Graphs
• The programmer creates a graph by specifying the kernels, the
nature of the data transfer, the CPU function calls, and the
dependences between them. If there is an edge from nodeA to
nodeB, then it means that taskB starts only after taskA executes
completely.
• This graph is first validated and then the CUDA runtime prepares
itself for running the tasks associated with the graph. The
advantage of this approach is that when an ensemble of tasks is
presented to the CUDA runtime, it can reduce the cost of
launching kernels, and setting up data transfers, significantly.
• This is done by pre-computing the schedule of actions,
prefetching data and code, and allocating resources for
intermediate values and prefetched data.
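A hedged sketch of one common way to build and launch such a graph – capturing the commands issued on a stream – is shown below (the captured work reuses the vecAdd kernel assumed earlier; note that the exact signature of cudaGraphInstantiate has changed across CUDA versions, so this follows the older five-argument form):

#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C);  // kernel from the earlier sketch

void runAsGraph(cudaStream_t stream, float *g_A, const float *h_A,
                float *g_B, float *g_C, int N) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the sequence of commands issued on the stream into a graph
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemcpyAsync(g_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice, stream);  // memcpy node
    vecAdd<<<1, N, 0, stream>>>(g_A, g_B, g_C);                                    // kernel node
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once: this is where the setup cost is paid ...
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // ... and then the whole ensemble of tasks can be launched cheaply, possibly many times
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}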
Architectures for Hardware Acceleration
General Purpose Graphics Processors
• Over the last 10-15 years (as of 2020) many new designs of GPUs have
come up; however, the broad architecture of a GPU has more or less
stayed the same barring small modifications and improvements made
every new generation.

Architecture of a GPU (adapted from NVIDIA)


Architectures for Hardware Acceleration
General Purpose Graphics Processors
• Figure shows the reference architecture of a GPU. Let us start
with the interface. There are several ways in which the GPU
can be connected with the CPU. It can be on the same die as
the CPU like Intel Sandy Bridge or it can be housed separately.
In the latter case, the CPU and GPU need to be connected
with a high bandwidth interconnect such as PCI Express.
• The interconnect is connected to the Thread Engine (Giga
Thread Engine in Figure) whose role is to schedule the
computation (graphics and general purpose) on different
compute units. Given that a GPU has a lot of compute units,
programmers typically run a lot of multithreaded code on the
GPU.
• It is too expensive to let the operating system or other
software units schedule the threads as is the case in a normal
CPU. It is best to have a hardware scheduler that can very
quickly schedule the threads among the compute units
Architectures for Hardware Acceleration
General Purpose Graphics Processors
• In the architecture shown in Figure, we divide the
compute units in a GPU into six separate clusters –
each cluster is known as a GPC (Graphics processing
cluster).
• Since a GPU is expected to have a lot of cores, we
shall quickly see that it is a wise idea to divide it into
clusters of cores for more efficient management of
the tasks assigned to the cores.
• Finally, in the high level picture, we have a large L2
cache, which as of 2020 is between 4-6 MB.
• The GPU has multiple on-chip memory controllers
to read and write to off-chip DRAM modules. In this
case the High Bandwidth Memory 2 (HBM2)
technology is used.
Architectures for Hardware Acceleration
General Purpose Graphics Processors
• Most high performance systems today are multi-GPU
systems. A large problem is split into multiple parts
and each part is assigned to a separate GPU.
• The GPUs need to coordinate among themselves to
execute the problem. As a result a very high
bandwidth interconnect is required to connect the
GPUs.
• NVIDIA created the NVLink interconnect that can be
used to create such multi-GPU systems. The
architecture shown in Figure has six NVLink
controllers that can be used to communicate with
other sister GPUs
Architectures for Hardware Acceleration
Structure of GPC
• Let us now look at the structure of a GPC in Figure.
Each GPC has a rasterisation engine (referred to as
the Raster Engine), which does the job of pixel
rasterisation.
• This unit is connected to seven Texture Processing
Clusters (TPCs). Even though the TPC has maintained
its historical name – as of today it consists of two
distinct parts.
• The first is a vertex processor called the Polymorph
Engine and the second is a set of two compute
engines called Streaming Multiprocessors (SMs)
Architecture of a GPU (NVIDIA)
Architectures for Hardware Acceleration
Structure of GPC
• In our reference architecture the GPU has six GPCs
• Each GPC has a large Raster Engine for rasterisation, and seven TPCs.
• Each TPC has a vertex processor called the Polymorph Engine, and two
SMs (Streaming Multiprocessors).
• We thus have a total of 84 SMs
Architectures for Hardware Acceleration
Structure of SM
• Figure shows the structure of an SM. An SM can be further subdivided into two parts: memory structures and groups of simple
cores. In GPUs, we have the following types of memory structures
within an SM. Some designs use the same structure for multiple
functions.
• Instruction Cache: This is very similar to an i-cache in a regular processor. It contains the instructions that need to be executed.
• L1 Cache: This cache stores regular data that cores read and write (similar to a
regular data cache).
• Texture Cache: This cache contains texture information. This
information is provided to the texture units that colour the fragment
with a given texture.
• Constant Cache: Stores read-only constants.
• Shared Memory: This is a small piece of memory (64-128 KB) that all the cores in an SM can access. This can explicitly be used to store data that all the cores can quickly reference. CUDA programs can be directed to store arrays in the shared memory by using the __shared__ specifier.
Layout of an SM (adapted from NVIDIA)
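A hedged sketch of how a kernel can use the shared memory of an SM (the block size of 256 threads and the reduction pattern are illustrative; blockDim.x is assumed to be 256 and a power of two):

__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float buf[256];     // one entry per thread, resident in the SM's shared memory
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();               // ensure every thread has written its value

    // Tree reduction within the block, entirely inside shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockResults[blockIdx.x] = buf[0];   // one partial sum per block
}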
Architectures for Hardware Acceleration
Structure of SM

• SM has four processing blocks (PBs) that contain cores and special
computing units. They are used to do all the mathematical processing in
a GPU.
• Additionally, each SM has four texture processing units that process
texture information. The job of each such unit is to fetch, process, and
add textures to the rendered image.

Layout of an SM (adapted from NVIDIA)


Architectures for Hardware Acceleration
The Compute Part of an SM: Processing Block
• Figure shows the detailed structure of a processing
block (PB) in an SM.
• It is a simple in-order pipeline. Instructions pass
through an instruction buffer (L0 i-cache), a
scheduler based on a scoreboard, a dispatch unit, a
large register file, and then they enter the compute
cores.
• Other than the scheduler, the rest is similar to a
regular in-order pipeline.

Layout of a Processing Block (NVIDIA)


Architectures for Hardware Acceleration
The Compute Part of an SM: Processing Block
• In our reference architecture, a PB consists of 16 integer cores,
16 single precision floating point cores, 8 double precision
floating point cores, and two tensor processing cores (for
matrix operations).
• Each core is very simple: slightly more sophisticated than a
regular ALU. When we have so many cores, they are bound to
have a large memory requirement. We thus need 8 LD/ST units
for accessing memory.
• In addition, scientific programs use a lot of transcendental
and trigonometric functions. As a result, we need a special function unit (SFU) to compute the results of all of these functions.
• Additionally, SFUs have special support for interpolating the colour of pixels.
• Recall that we had discussed that in the fragment processing stage, we need to interpolate the colour of pixels based on the colours of adjoining pixels or the colours of the vertices of the triangle that the pixel is a part of. SFUs have special hardware to support these operations.
Layout of a Processing Block (NVIDIA)
Architectures for Hardware Acceleration
The Compute Part of an SM: Processing Block

• The cores are typically referred to as SPs (streaming processors). They


contain fully pipelined functional units. Additionally, most GPUs support
the multiply-and-add (MAD) instruction, which is of the form a = a + b ∗ c.
• Such instructions are very useful while performing linear algebra
operations such as matrix multiplication.
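As a hedged illustration, the inner loop of a naive matrix multiplication kernel maps directly onto such multiply-and-add operations (the names and the row-major layout are assumptions):

__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc = fmaf(A[row * N + k], B[k * N + col], acc);   // one MAD: acc = acc + a*b
        C[row * N + col] = acc;
    }
}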
Architectures for Hardware Acceleration
Concept of a Warp

• We need to understand that we are not running a single pipeline. Instead


we are running a complex system with 40+ computing units.
• If we make this a free-for-all system, where any instruction from any
thread can execute, then the entire system of threads will become very
complex.
• We will need complex logic to handle branch statements, memory
accesses, and dependences between the threads. This extra logic will
increase the area of each core, and also increase its power consumption.
In a system with hundreds of cores, power consumption will be high,
which is not recommended. As a result, some order needs to be imposed
on the threads
Architectures for Hardware Acceleration
Concept of a Warp

• Modern GPUs (notably NVIDIA’s GPUs) follow the SIMT (single instruction, multiple thread) model. Here, we group a set of threads into
warps, where a warp typically contains 32 threads.
• Each thread has the same set of instructions (this is where single instruction
comes from). When the threads in a warp execute, all of them start from the
same point (same program counter).
• The scheduler maps each thread in the warp to an individual core, and then
the threads start executing. However, note that the execution takes place in a
special way. Threads do not run uncoordinated; they run in lockstep.
• This means that after a warp begins, the PB executes the first instruction in
the warp for all the threads.
• Once the first instruction has finished executing for all the threads, it executes
the second instruction for all the threads, and so on. The threads, of course,
work on different pieces of data.
Architectures for Hardware Acceleration
Concept of a Warp

• The SIMT model – single instruction, multiple thread – is followed in most


modern GPUs. Here, the threads run in lockstep. Conceptually, this is like all
the threads executing an instruction, waiting till all the threads complete, and
then moving on to the next instruction.
• The concept of the warp is integral to the SIMT model. It is a group of threads
that are scheduled together on a PB and executed in lockstep.
Architectures for Hardware Acceleration
Concept of a Warp
• This is a very simple model of execution and we do not need to have
sophisticated hardware to ensure that all the threads are synchronized.
• The scheduler simply picks the next instruction in a warp and sends it
to all the cores. Since all the cores execute the same instruction –
albeit on different pieces of data –instruction fetch and dispatch are
not difficult tasks.
• After all the cores finish an instruction, we send the next instruction.
Architectures for Hardware Acceleration
Concept of a Warp : Varying Execution Times
• First, instructions might take different amounts of time to execute. For
example, it is possible that a memory instruction might not find its data
in the L1 cache, and hence it needs to access the L2 cache.
• In this case it will take more time to complete than the rest of the
instructions that possibly find their data in the L1 cache. Instead of letting
the rest of the threads continue as in an OOO processor, we make all the
threads wait till all the lagging instructions have completed.
• In general, this is not a very serious problem because all the data for
graphical and numerical tasks is typically co-located. We normally do not
deal with a lot of irregular memory accesses where data can be present
at arbitrary locations.
• Nevertheless, it is always possible that accesses straddle cache lines and
a few of these lines are not present in the cache. In such cases, the SIMT
model will lead to a severe performance degradation.
Architectures for Hardware Acceleration
Concept of Warp: Branch Divergence and Predicated Execution
• There is another problem that is even more pernicious and difficult to solve.
What if we have branches? In this case, different threads will clearly
execute different sets of instructions. It will be hard to maintain the
SIMT and lockstep properties.
• It is possible that based on the data one thread might execute the if
portion of the code, and the other thread might execute the else
portion of the code. In this case the lockstep property will be violated.
Furthermore, we need to have a mechanism to wait for the threads to
converge
Architectures for Hardware Acceleration
Concept of Warp: Branch Divergence and Predicated Execution
• A normal processor would make thread1 execute Lines 2-4 and then
directly jump to Line 9. Similarly, it would make thread2 execute Lines 6
and 7 and then proceed to Line 9.
• However, this requires complex logic to compute branch targets, and add
offsets to program counters. The two threads will follow divergent paths
till they reconverge at Line 9.
• Given the fact that thread 1 needs to execute one more instruction as
compared to thread 2, we need to make thread 2 wait for the time it
takes to execute one instruction.
• Subsequently, both the threads can start to execute the instruction at
Line 9 (point of reconvergence) at the same point of time in lockstep. This
is a very complex mechanism and is expensive.
If-else statement
• Furthermore, it is possible to have nested branches within the if portion
or the else portion. This will further complicate matters.
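• To make the discussion concrete, the following hypothetical CUDA kernel (our own sketch; the
variable names are assumptions) diverges within a warp: even-numbered threads take the if path,
odd-numbered threads take the else path, and all the threads reconverge at the final statement.

__global__ void divergent_kernel(int *out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = 2 * tid;      // committed only by even threads
        out[tid] += 1;           // the even threads execute one extra instruction
    } else {
        out[tid] = 3 * tid;      // committed only by odd threads
    }
    out[tid] += 10;              // point of reconvergence: executed by all threads
}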
Architectures for Hardware Acceleration
Concept of Warp: Branch Divergence and Predicated Execution
• To keep things simple, GPUs use predicated execution. In this model, all the
threads follow the same path. This means that thread 1 processes all the
instructions – Lines 2 to 9 – and so does thread 2. However, processing an
instruction does not mean executing it:
• Thread 1 executes the instructions at Lines 2 – 4.
• However, when it comes to the instructions at Lines 6 and 7, it ignores
them. Nevertheless it waits for other threads to finish executing them,
if they have to.
• On similar lines, thread 2 ignores the instructions at Lines 2 – 4. It is
not the case that it ignores these instructions and moves ahead.
Instead it waits for thread 1 to finish executing these instructions. The
threads still move in lockstep.
• Thread 2 executes the instructions at Lines 6 and 7.
• Finally, both the threads reconverge at the instruction at Line 9.
Architectures for Hardware Acceleration
Concept of Warp: Branching (if-else)

• Predicated execution refers to a paradigm where a thread executes


(processes) instructions belonging to both the paths of a branch instruction.
The instructions on the correct path are fully executed, and they are allowed
to modify the architectural state.
• However, the instructions on the wrong path are discarded, and not allowed
to modify the architectural state. In the context of GPUs, predicated
execution is heavily used in the SIMT model. All the threads execute all the
paths of branches in lockstep.
Architectures for Hardware Acceleration
Concept of Warp: Branching (if-else)
• They pretty much treat the instructions in a warp as a
sequential piece of code. However, some of these instructions
are on the wrong path. The hardware keeps track of these
instructions and does not allow them to modify the
architectural state.
• However, this does not mean that the thread moves ahead.
The thread still follows the lockstep property, and waits for all
the other threads to finish executing that instruction.
• The instruction scheduler maintains a mask for threads in a
warp. Only those threads whose bit in the mask is 1 execute
the instruction and the rest ignore it.
• Alternatively, we can say that if the ith bit in the mask is 1, then
it means that the current instruction is in the correct branch
path for thread i
Architectures for Hardware Acceleration
Concept of Warp

• Let us associate a stack with every thread. Every time we enter the code of
a branch (branch path) we push an entry on the stack.
• If we are on the correct path we push a 1, otherwise we push a 0. If we
have nested branches (branches within branches), we do the same.
• Similarly, when we exit a branch path, we pop the stack. This means that
for every line of code in a warp, we maintain a small stack with 1-bit
entries.
Architectures for Hardware Acceleration
Concept of Warp:

• We execute a given instruction and commit it if all the entries in its


associated stack are 1. This means that we are on the correct path of all
the branches encountered so far.
• However, if this is not the case, then we ignore the instruction because
we are on the wrong path of at least one branch.
• Note that if the stack is empty, then we execute and commit the
instruction because this corresponds to the case, where the code is
outside the scope of any branch.
• However, it is not practical to read the contents of the entire stack and
compute a logical AND of the bits before executing every instruction.
Architectures for Hardware Acceleration
Concept of Warp:

• We can succinctly store this information in a bit mask that


contains 32 bits – one for each thread. If the ith bit is 1, then
it means that thread i can correctly execute the instruction.
• This bit mask is updated when we either enter the body of a
conditional statement or exit it.
• Here, we consider three threads: 1A, 1B, and 2. We modify
the code to add another nested if statement in the body of
the first if statement.
• Threads 1A and 1B execute the body of the first if statement,
whereas thread 2 does not. Inside the body of the first if
statement, thread 1A executes the body of the second if
statement, whereas thread 1B does not.
• The stack associated with the branch paths is shown in the
figure beside the tick/cross marks. Please note how we push
and pop entries into the stack as we enter and exit a group
of conditional statements.
• Finally, we reach the point of reconvergence for all threads.
Using a stack for predicated execution in GPUs
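• The following host-side sketch (our own simplified illustration with a hypothetical 4-thread
warp; real GPUs implement this in hardware) mimics how such a mask can be pushed when entering a
branch scope, restricted on the taken path, and popped at the point of reconvergence.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> maskStack;   // saved masks, one bit per thread
    uint8_t active = 0xF;             // all 4 threads active outside any branch

    uint8_t takenOuter = 0x3;         // suppose threads 0 and 1 take the outer if
    maskStack.push_back(active);      // push on entering the branch scope
    active &= takenOuter;

    uint8_t takenInner = 0x1;         // suppose only thread 0 takes the nested if
    maskStack.push_back(active);
    active &= takenInner;
    printf("inside nested if, mask = 0x%x\n", (unsigned)active);     // 0x1

    active = maskStack.back(); maskStack.pop_back();                 // exit nested if -> 0x3
    active = maskStack.back(); maskStack.pop_back();                 // exit outer if  -> 0xF
    printf("after reconvergence, mask = 0x%x\n", (unsigned)active);
    return 0;
}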
Architectures for Hardware Acceleration
Concept of Warp:

• A point of reconvergence is an instruction (point in the program) that is


executed by all the threads in the warp and is just outside the scope of all
previous conditional statements.
Architectures for Hardware Acceleration
Warp Scheduling
• We need to appreciate that it is necessary to group computations into warps
in a GPU. This keeps things simple and manageable.
• Otherwise, if we schedule every instruction independently, the overheads
will be prohibitive; it will simply be impractical to do so.
• Hence, we have the concept of warps. However, we need to schedule warps
and this requires a scheduler.
• It is the job of the warp scheduler (typically a part of the PB or SM) to
schedule the warps. It keeps a set of warps in a buffer. Every few cycles it
selects a new warp and executes a few instructions from it.
Architectures for Hardware Acceleration
Warp Scheduling
• For example in the NVIDIA Tesla GPU, the warp scheduler stores a
maximum of 24 warps.
• Every cycle it can choose one of the warps and make it run. Later designs
have modified this basic design, and have made the warp scheduler more
complicated.
• For example, the NVIDIA Fermi GPU can select two warps at a time, and
execute them simultaneously – each warp has 16 cores, 16 load/store units,
or 4 SFUs at its disposal.
• Later designs such as NVIDIA Kepler have four warp schedulers per SM. In
our reference architecture inspired by NVIDIA Volta, we divide an SM into
four PBs, and we have one warp scheduler per PB
Architectures for Hardware Acceleration
Warp Scheduling
• The simplest strategy is to run a single warp at a time. However, running multiple
warps at a time has some inherent advantages. Let us explain with an example.
Consider an SM with 16 load/store units, and 16 ALUs.
• Furthermore, assume that a warp has 32 threads. Given that we execute the
instructions in lockstep, all the instructions will be of the same type. Let us assume
that we can either have memory instructions (load/store) or ALU instructions.
• This means that we can keep only half the number of functional units busy: either
16 ALUs or 16 load/store units. However, if we are able to schedule two unrelated
warps at the same time, then we can do better. It is possible to make full use of
the resources if we can overlap the execution of ALU instructions of one warp with
the memory instructions of the other warp.
• In this case, one warp will use 16 ALUs, and the other warp will use the 16
load/store units. We will thus have 100% utilization.
Architectures for Hardware Acceleration
Warp Scheduling:
• Another possible option is to have a single warp scheduler with 32 ALUs
and 32 load/store units. From the point of view of execution latency, this
is a good idea; however, this is wasteful in terms of resources. It is best to
have a heterogeneous set of units, and have the capability to schedule
threads from multiple unrelated warps in the same cycle.
• If we have a good scheduler it will be able to ensure a high utilization
rate of the functional units, and thus increase the overall execution
throughput.
Architectures for Hardware Acceleration
Independent Scheduling of Threads in a Warp
• Till now we have treated a warp as a group of 32 threads that have a
single program counter. Furthermore, we have a 32-bit mask associated
with each warp.
• For an instruction, if the ith bit is set, then only thread i executes it and
commits the results to the architectural state. Otherwise, for thread i
this instruction is on the wrong path, and it is not executed. This
architecture unfortunately has a problem.
Architectures for Hardware Acceleration
Independent Scheduling of Threads in a Warp
• Consider the code below in a system with a hypothetical 4-thread warp

• We have four threads with ids 0, 1, 2, and 3


respectively. Two of the threads will execute the while
loop (Line 3), and two threads will execute the code in
Lines 5 and 6.
• If we run the code on a regular multicore processor,
then there will be no deadlock. This is because first
threads 0 and 1 will wait at the while loop.
• Then either thread 2 or thread 3 will set x equal to 1
in Line 5. This will release threads 0 and 1. However, in
the case of a GPU with our lockstepped threading
model, this code will have a deadlock.
Deadlocks in a CUDA program
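• A hedged reconstruction of the kind of code being discussed is shown below (our own sketch;
the exact listing is in the figure). x and y are assumed to be global variables initialised to 0,
and the warp hypothetically has only 4 threads.

__global__ void deadlock_example(volatile int *x, volatile int *y) {
    int tid = threadIdx.x;
    if (tid < 2) {
        while (*x == 0) { }   // Line 3: threads 0 and 1 spin until x becomes 1
    } else {
        *x = 1;               // Line 5: thread 2 or 3 would release the spinning threads
        *y = 1;               // Line 6
    }
}

• Under strict lockstep execution the spinning path is processed first and the warp never
reaches Line 5, which is exactly the deadlock described above; on a regular multicore CPU the
same logic completes without any problem.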
Architectures for Hardware Acceleration
Independent Scheduling of Threads in a Warp

• All the four threads will first arrive at the while loop in
Line 3. For two threads (0 and 1) the while loop is on the
correct path, and for the other two threads (2 and 3), it
is on the wrong path.
• Threads 2 and 3 will not execute the loop; however,
they will wait for threads 0 and 1 to finish executing the
while loop.
• Unfortunately, this is where the problem lies. Threads 0
and 1 will never come out of the loop. They will be stuck
forever because x = 0. Threads 2 and 3 will never reach
Line 5 where they can set x to 1.
Deadlocks in a CUDA program
• This is an unacceptable situation. We can perform such
synchronizing accesses between threads across
different warps but not between threads in the same
warp!
Architectures for Hardware Acceleration
Independent Scheduling of Threads in a Warp

• The solution is to maintain separate execution state for each thread. This
includes a separate thread specific program counter, and a call stack. This
however does break the notion of threads executing in lockstep, and has
the potential for increasing the overheads significantly.
• Let us look at how the designers of NVIDIA Volta solved this problem. They
introduced the notion of restricted lockstep execution. This is shown in
Figure. In the figure we define three blocks of instructions: W (while loop),
X (x = 1), and Y (y = 1).

SIMT execution with per thread state


Architectures for Hardware Acceleration
Independent Scheduling of Threads in a Warp

• As we can see, the execution model is still SIMT. In any cycle, all the active threads in the
warp still execute the same instruction. However, unlike our previous model, we do not
proceed sequentially. We increase the degree of concurrency by executing the code blocks
X and Y concurrently with the code block W. Let us follow the timeline. We first execute
the code block W. We are not able to make progress because x = 0. Then we execute the
code blocks X and Y. Subsequently, we execute the code block W once again. This time we
are able to make progress because x has been set to 1.
• We thus leave the if-else block and our divergent threads reconverge. We can also force
reconvergence between the threads by calling the CUDA function __syncwarp(). In general,
the role of the GPU is to ensure as much SIMT execution as possible. This means that it
needs to group together as many active threads as it can per instruction, and also ensure
that all threads make forward progress. The latter ensures that we do not have deadlocks
like the one shown in the earlier example. The reader needs to convince herself that this
method allows us to use lock and unlock functions in GPU threads similar to regular CPU threads.
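• A minimal illustration of forcing reconvergence (our own sketch, assuming CUDA 9 or later,
where __syncwarp() is available):

__global__ void syncwarp_example(int *out) {
    __shared__ int flag;
    int tid = threadIdx.x;
    if (tid == 0) {
        flag = 42;            // only thread 0 produces the value
    }
    __syncwarp();             // the divergent threads of the warp reconverge here
    out[tid] = flag;          // every thread can now safely read it
}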
Architectures for Hardware Acceleration
The GPU Pipeline
• In our reference architecture, the GPU consists of a set of 6 GPCs, a large
6MB L2 cache, 8 memory controllers, and 6 NVLink controllers.
• Each GPC consists of 14 SMs (streaming multiprocessors).
• Each SM consists of 4 processing blocks, an L1 instruction cache, and a
128 KB data cache.
• Each processing block (PB) contains 16 integer cores, 16 single precision
FP cores, and 8 double precision FP cores. It additionally contains two
tensor cores for matrix operations, 8 load/store units, and a dedicated
special function unit.
• Each PB executes a warp of threads in parallel. The threads in the warp
access the large 64 KB register file, and the L1 cache of the SM most of
the time. If they record misses in these top-level memories, then they
access the L2 cache and finally the off-chip DRAM.
Architectures for Hardware Acceleration
A GPGPU’s pipeline

• Figure shows the pipeline of a


GPGPU core. Once we have
decided to schedule a warp, we
read its instructions from the
i-cache.
• We decode the instructions and
while dispatching the
instructions we check for
dependences.
• We typically do not use
expensive mechanisms like the
rename table or reservation
stations.
• They consume excessive
amounts of power and are also
not efficient in terms of area
Architectures for Hardware Acceleration
A GPGPU’s pipeline
• We use the simple scoreboard based mechanism. Recall that a
scoreboard is a simple table that we use to track dependences between
instructions. Once the instructions in a warp are ready to be issued, we
send them to the register file. Unlike a CPU’s register file, a register file
in a GPU is a very large structure. It is almost as large as a cache – 64 KB
in our reference architecture.
• To support lockstep execution of all the active threads, we need to read
all their data at once. This requires a very high throughput register file.
• Once we have read all the data, we send it to the functional units. They
compute the result, access memory (if required), and finally write the
results back to the register file or the relevant memory structure.
Architectures for Hardware Acceleration
The Register File in a GPU
• The PTX ISA assumes an infinite number of registers.
The advantage of this is that the PTX code can remain
platform independent, and the code can be written in
terms of virtual registers, which improves the
effectiveness of a host of compiler optimisations.
• While generating the binary code we can assign real
registers to the virtual registers. This can be done by
the PTX assembler, ptxas, or in the process of JIT
(just-in-time) compilation of the binary.

• In a GPU we need a very high-throughput register file given the bandwidth requirements. It is
impractical to read and write data at the granularity of 4-byte words given that we have at least 32
threads running at the same time in a processing block.
• Consider a piece of code where all the threads in a warp use a 32-bit local variable. We need to
create 32 copies of this variable. The total number of bits that we need to allocate is 32×32 =
1024 bits. We thus set the block size in the register file to 1024 bits (or 128 bytes). This is shown in
Figure (b) that shows a bank in a register file with a 1024-bit block size.
Architectures for Hardware Acceleration
The Register File in a GPU

• Let us now design a register file (Figure(a)). Assume


we have a 64KB register file.
• We can divide it into 16 banks, where the size of each
bank is 4 KB. If the size of a single entry is 128 bytes,
we shall have 32 such entries in each register file
bank (see the worked calculation below).
• In a lot of modern GPUs that have many outstanding
instructions and frequent warp switches, there are
many memory instructions in flight.
• There is thus an elevated chance of bank conflicts–
conflicting accesses to the same bank. This will further
cause delays because we can process only one
memory request at a time per bank.
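• The sizing discussed above can be summarised with a short back-of-the-envelope calculation
(restating the numbers already given in the text):
  32 threads × 32 bits = 1024 bits = 128 bytes (one register file entry)
  64 KB ÷ 16 banks = 4 KB per bank;  4 KB ÷ 128 bytes = 32 entries per bank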
Architectures for Hardware Acceleration
The Register File in a GPU
• In addition, we cannot read and transfer 1024 bits at the
same time; we need to read 1024 bits over several cycles.
Moreover, we may have wider operands such as double
precision values.
• In this case, we need to store the set of 32 values in
multiple banks. All of these values have to be read,
collected, and then sent to the execution units within the
PB.
Architectures for Hardware Acceleration
The Register File in a GPU

• We thus create a set of buffers known as operand collectors, where each entry
is associated with an instruction. It stores the values of all the source
operands.
• We connect the banks and the operand collectors with a crossbar switch,
which is an N×M switch. We have N ports for the banks in the register file, and
M ports for the operand collectors.
• We can route data from any bank to any operand collector (all-to-all traffic).
Once each entry in the operand collector receives all its source operands in
entirety, the values are sent to the arrays of execution units within the PB
Architectures for Hardware Acceleration
L1 Caches

• Similar to the register file, the memory bandwidth demands are very high
for the caches particularly the L1 cache that sees all the accesses.
• If the L1 hit rate is high, then the traffic that reaches the L2 cache is
significantly reduced. Furthermore, since the L2 cache is much larger, we can
afford to create many more banks to sustain more parallelism.
• Hence, out of all the memory structures the L1 cache is the most critical.
Architectures for Hardware Acceleration
L1 Caches
• In the case of memory accesses, we will have a set of accesses: one for
each thread. Since we have 8 load/store units in our reference
architecture, we can issue 8 or 16 memory accesses per cycle depending
upon the parallelism in the load/store units.
• The second stage is an arbiter, which figures out the bank conflicts
between the accesses. It splits the set of accesses into two subsets. The
first subset does not have any bank conflicts between the accesses, and
the second subset contains accesses that have bank conflicts with the first
subset, and might have more conflicts between them.
• The addresses in the first subset are then routed to the L1 cache. If the
accesses are writes, then the process ends here. If the accesses are reads,
we read the data and route them to the operand collectors associated with
the register file. Subsequently, we send the requests from the second set
that do not have bank conflicts, and so on.
Architectures for Hardware Acceleration
L1 Caches

• Let us now look at some special cases. Assume we have a miss in the
cache. We then use a structure like an MSHR (miss status handling
register) to record the miss.
• Similar to a traditional MSHR, we can merge requests if they are for
different words in the same block.
• Once the block arrives into the cache, we lock the corresponding line to
ensure that it is not evicted, and replay the load/store accesses for
different words within that block from the MSHR.
• Similarly, we replay the instructions that access words within the block
and could not be sent to the cache because of bank conflicts. Once the
accesses are done, we unlock the line
Architectures for Hardware Acceleration
References

• Textbook: Advanced Computer Architecture, Smruti Sarangi, Chapter: GPU


THANK YOU

Sudeendra kumar K
Department of Electronics and Communication
Engineering
sudeendrakumark@pes.edu
Architectures for Hardware Acceleration

Dr. Sudeendra kumar K


Department of Electronics and Communication
Engineering
ARCHITECTURES FOR HARDWARE ACCELERATION
Architectures for Machine Learning

Unit-5: Lecture

Sudeendra kumar K
Department of Electronics and Communication Engineering
Architectures for Hardware Acceleration
Contents

• In this chapter, we introduce the basics of machine learning and deep learning, and discuss hardware architectures that accelerate them.
Architectures for Hardware Acceleration
Introduction to Machine Learning
• We still do not have good algorithms for complex tasks such as face
recognition, feature recognition in images, and speech synthesis. These
tasks are traditionally associated with the human brain; training machines
to do them using current techniques is very difficult.
• Such tasks were almost impossible to successfully complete in the 2010-
2015 time frame with old school artificial intelligence technologies.
However, of late the scenario has changed.
• With the advent of technologies like deep learning that mimic the
processes in the human brain like never before, it is possible to solve
many of these complex problems to some degree.
Architectures for Hardware Acceleration
Introduction to Machine Learning

• All of these extremely interesting things are possible because


instead of using traditional algorithms based on data structures and
graphs, such speech recognition software uses methods known as
deep learning or deep neural networks.
• A deep learning system is nothing but a hierarchy of simple learners
that increasingly learn more and more complex concepts – this is
inspired by the way the human brain learns new concepts
Architectures for Hardware Acceleration
Introduction to Machine Learning
• Any deep learning system is divided into a set of layers. Each layer
processes the inputs by computing a function over the set of inputs.
They can either be linear or nonlinear functions.
• Designers typically alternate the linear and nonlinear layers to learn
extremely complex functions. The layers learn increasingly complex
concepts, and ultimately they are able to recognize faces or
transcribe speech into text.
• Note that this kind of computation is very different from the kind of
computation that happens in normal integer or floating point
programs. Furthermore, the computation is massively parallel and
requires a very high memory bandwidth
Architectures for Hardware Acceleration
Basics of Deep Learning

• The aim of any learning system is to learn a function that is hidden.


The learner is given a set of inputs and their corresponding outputs.
Based on them it needs to estimate the function that computes the
outputs given the inputs.
• The process of trying to learn this function is known as training.
Once the learner (also known as the model) has been trained, it can
be used to predict the output of a hitherto unseen input. This
process is known as testing.
Architectures for Hardware Acceleration
Basics of Deep Learning

• Since we do not know the actual process that converts the inputs to
the outputs, the function that is estimated will be an approximation
of the real function.
• Hence, it is expected that there will be some error in the output.
Better learners minimize this error over a set of test inputs.
• In the learning literature there are several ways to measure the
error: absolute value of the difference, mean square error, and so
on.
• Regardless of the way that the error is measured, the main aim of
any learning process is to minimize the error for unseen test inputs.
Architectures for Hardware Acceleration
Formal Model of the Learning Problem

• A learning system typically takes an n-element vector x as input


and returns an output y, which can be a floating point value or an
integer. In addition, let the real output be ŷ in this case. The error
is thus a function of y and ŷ.
• Furthermore, following convention let x be a column vector. We
traditionally write x ∈ R^n, which means that x is a vector that
contains n real numbers. In general, the training algorithm is given
a set of inputs and a set of outputs.
Architectures for Hardware Acceleration
Formal Model of the Learning Problem

• Let us represent the set of inputs as X, where the ith column is the ith
input vector, and the set of outputs as a column vector y (the ith entry
is the ith output). We want to find the relationship – function f∗ –
between the set of inputs and the set of outputs.
• Given that we will never get to know what the real relationship
actually is, we have to make an intelligent guess from the training
data that has been provided to us: X and y.
• It is important to note here that the hidden function f∗ does not take
into account the order of inputs. It is in a sense memoryless – does
not remember the last input.
Architectures for Hardware Acceleration
Formal Model of the Learning Problem

• The main aim of the learning problem is to find a good estimate for
f∗. The number of possible functions is very large, and unless we
simplify the problem, we will not arrive at a good function.
• The simplifying assumption is that in this case we desire a universal
approximator. A universal approximator is an algorithm that does not
rely on any a priori estimate of f∗.
• It can be used to approximate any continuous function with inputs in
R^n. Every such approximator takes in a list of parameters that
completely specify its behaviour; it is possible to approximate
different hidden functions just by changing the parameters.
• This method simplifies the learning problem significantly; all that we
need to do is simply estimate the parameters of a universal
approximator.
Architectures for Hardware Acceleration
Formal Model of the Learning Problem
• A universal approximator is an algorithm that does not rely on
any a priori estimate of the function to be learnt (f∗). Moreover,
it can be used to approximate any continuous function with
inputs in R^n, and its behaviour can be fully controlled by a list of
parameters.
• Most of the initial learning algorithms were not universal
approximators; hence, they failed for many classes of learning
problems.
Architectures for Hardware Acceleration
Linear Regression

• The simplest approach is to assume that f∗ is a linear function. We


can thus estimate y as follows:
y = wᵀx + b
• Here, w is a weight vector and b is a bias parameter. Even though


this approach is very simple, its main shortcoming is that
if f∗ is not linear then the estimate can turn out to be horribly wrong.

• For example if f∗ consists of sine, cos, tan, and other transcendental


functions then a linear estimate will give us a very low accuracy. Such a
linear approach is not a universal approximator because of this issue.
Architectures for Hardware Acceleration
ML using Nonlinear Models
• Given that linear models are associated with significant limitations,
much of the research in machine learning has moved towards
nonlinear models.
• Consider complex nonlinear functions that are parameterised by a
set of constants. The task of the learning algorithm is to find an
appropriate set of constants that minimise the error.
Architectures for Hardware Acceleration
Modelling Nonlinearity with Neural Networks

• Given that the space of functions that we want to approximate is


potentially infinite, the learning model should be general enough
such that we can achieve a low error with almost any data set.
• Neural networks (inspired by the human brain) are universal
approximators and also generalise very well.
• The key idea here is to introduce nonlinear transformations along
with linear transformations such that the relationship between
the input and the output can be captured accurately.
• We introduce a function g such that the output can be represented
as:
g(wᵀx + b)
• The function g is typically a sigmoid function.
Architectures for Hardware Acceleration
Modelling Nonlinearity with Neural Networks
Architectures for Hardware Acceleration
Modelling Nonlinearity with Neural Networks

The Sigmoid, tanh, and ReLU activation functions
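• For reference, the standard definitions of these activation functions (well-known formulas,
not reproduced from the figure) are:
  sigmoid(x) = 1 / (1 + e^{-x}),  tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x}),  ReLU(x) = max(0, x)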


Architectures for Hardware Acceleration
Modelling XOR Function

• Consider the input to be a column vector [a, b], where the output is
a⊕b.
• Simply by multiplying weights with a and b and adding the results, it is
not possible to realize a XOR function.
• Note that for the sake of readability we will be writing the column
vectors horizontally. For example, as per our representation [a, b] is a
column vector and [a, b]T is a row vector.
Architectures for Hardware Acceleration
Modelling XOR Function

• The aim is to identify and “somehow nullify” the inputs


when a=b. Let us compute the vector product [1,−1]T
[a, b] (dot product of [1,−1] and [a, b]) where [1,−1] is
the weight vector.
• This is arithmetically the same as computing a−b. For
the inputs (0,0) and (1,1), the result of this operation is
0. For the inputs where a is not equal to b, the result is
non-zero (1 and -1).
Karnaugh map

Outputs after computing a dot product with [1,-1]


Architectures for Hardware Acceleration
Modelling XOR Function

• The final output needs to be the modulus of this result.


Computing |x| is easy. It is equal to ReLU(x) + ReLU(−x).
• Note that we have two functional units in the first
linear layer. Each unit computes a dot product between
a weight vector and the input vector.
• For the first functional unit the weight vector is [1,−1],
and for the second functional unit it is [−1,1]. The
second weight vector is generated by multiplying
[1,−1] with -1 because we wish to compute ReLU(−x) in
the next nonlinear layer.
Structure of the network
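• A small code sketch of this network (our own illustration; the function names are assumptions):
the first linear layer computes dot products with [1,−1] and [−1,1], the ReLU layer rectifies
them, and the final linear layer adds the two results, giving |a − b| = a ⊕ b for 0/1 inputs.

#include <algorithm>
#include <cstdio>

static float relu(float x) { return std::max(0.0f, x); }

static float xor_net(float a, float b) {
    float u = 1.0f * a + (-1.0f) * b;   // linear layer 1: dot product with [1, -1]
    float v = (-1.0f) * a + 1.0f * b;   // linear layer 1: dot product with [-1, 1]
    float ru = relu(u), rv = relu(v);   // nonlinear (ReLU) layer
    return ru + rv;                     // linear layer 2: weights [1, 1]
}

int main() {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d XOR %d = %g\n", a, b, xor_net((float)a, (float)b));
    return 0;
}

• Running it prints 0, 1, 1, 0 for the four input combinations, which is exactly the XOR truth table.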
Architectures for Hardware Acceleration
Modelling XOR Function
• The structure of the network has an interesting property: it has
alternating linear and nonlinear layers. The inputs are fed into the
first linear layer, which computes dot products with different weight
vectors.
• Subsequently, the outputs of this layer are passed to a nonlinear
layer that uses the ReLU function. Finally, the outputs of the
nonlinear layer are passed to a linear layer that generates the final
output of the neural network.
Architectures for Hardware Acceleration
Modelling with Neural Networks
• In general, it is possible to take such neural networks and train the weights
to compute any given function. The architecture is fairly generic, and we
can learn functions by simply changing the weight vectors.
• There are two important terms that we need to introduce here:
• Training and Inferencing: While training a neural network we are
provided a set of known inputs and outputs, and then we try to
compute the weights such that the outputs of the network match the
given outputs as far as possible. Almost all neural networks use the
backpropagation algorithm for computing the weights in the training
phase.
• It is important to note that the architecture of a neural network in
terms of the number and type of layers, and the nature of functional
units within the layers is decided a priori. Subsequently, these
parameters do not change. The training phase is used to only compute
the weights in the network. This phase is carried out offline and there
is typically no need to accelerate this phase in hardware.
Architectures for Hardware Acceleration
Modelling with Neural Networks

• However, for hardware designers, the inferencing part where


given an unknown input we try to predict the output, is far more
important.
• Almost all neural architectures as of 2020 focus exclusively on
accelerating the inferencing. Note that the expressive power of a
neural network is dependent on the number of layers and the
number of functional units within each layer.
• Hence, for learning complex functions it is typically necessary to
have deeper neural networks with more layers.
• With an increase in compute power, the rise of GPUs and FPGAs,
and also a concomitant increase in memory capacity, we can now
afford to have large neural networks with a few hundred layers.
This has spawned the revolution in the design of large deep
neural networks.
Architectures for Hardware Acceleration
Deep Neural Networks

• The simple example that we saw in the XOR example had three layers: two
linear layers and one nonlinear layer. Such small networks are good for
learning small and simple functions.
• However, to identify far more complex patterns such as the number of
faces in an image or perform face recognition, we need many more
layers.
• Neural networks with a large number of layers are known as deep
neural networks or simply DNNs. It is not uncommon for a modern
neural network to have 100+ layers with millions of weights.
• Such deep neural networks have a series of layers that progressively
learn more and more complex concepts.
Architectures for Hardware Acceleration
Why CNN

• In each functional unit of a linear layer we compute a dot


product between the ifmaps (input feature maps) and weight
vectors. In most modern neural networks that process complex
images or analyse speech, the ifmaps are very large.
• In each linear layer in particular, we need to store large weight
vectors and then compute the dot products. If we have millions
of pixels in an ifmap, we also need to store millions of weights
just for each layer.
• The storage complexity, memory access overheads, and
compute time will make the process of training and inferencing
intractable. Hence, we typically create two kinds of linear layers:
one that has a lot of weights, and a layer that uses a small set of
weights.
Architectures for Hardware Acceleration
Fully Connected Layer
• This is a traditional linear layer where we simply multiply each
element in an ifmap with a weight. If the ifmap has N elements, then
we also need N weights.
• Typically the last layer in a DNN is a fully connected layer. This layer is
presented with a lot of high level concepts identified by earlier layers,
and it simply needs to make the final decision.
• This layer has very high memory requirements. Given that there are a
very few such layers, their total contribution to the computational
time is small (≈10%).
Architectures for Hardware Acceleration
Convolutional Layer
• For intermediate linear layers we do not store large weight vectors.
Instead, we store a very small set of weights known as a filter. We
compute a convolution between typically two-dimensional ifmaps and
the filter to compute an ofmap
• For detecting edges, and simple shapes, we do not need to compute a
dot product with a weight vector that is as large as the ifmap.
Conceptually, this is a local operation, and computing a localized dot
product with a small weight vector should suffice.
• We consider a small filter with R rows and S columns and a portion of
the ifmap with the same dimensions and just compute a dot product.
This is known as the convolution operation.
• We can extend our definition of a dot product of two vectors to a dot
product of two n-dimensional matrices. Here, we multiply
corresponding elements, and the value of the final dot product is a
sum of the individual element-wise products.
Architectures for Hardware Acceleration
The Convolution Operation
• Consider an ifmap with H rows and W columns, and a given position within it
(h, w). The convention that we adopt is that we list the row number first and
then the column number (similar to addressing a matrix).
• We draw an R×S element shaded rectangle whose left top is at (h, w). We
then compute a dot product between the elements of the filter and the
shaded rectangle in the ifmap.
• Let us represent the ifmap by the matrix I, the ofmap by the matrix O, and
the filter by the matrix F. We can then write this formally as
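• A reconstruction of the standard form of this equation (assuming a unit stride; the original
equation image is not reproduced here):

O[h][w] = \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I[h + r][w + s] \cdot F[r][s]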
Architectures for Hardware Acceleration
The Convolution Operation

• This is not a traditional convolution operation that we learn in a signal


processing course – this is an element-wise dot product. However, we
can modify the classical convolution equations to become equivalent to
the equation above by changing the sign of some variables.
Architectures for Hardware Acceleration
The Convolution Operation

• The main advantage here is that we need not store very large
weight vectors. We can just store small filters.
• First, an ofmap is typically dependent on multiple ifmaps, and there
is a unique filter for each ifmap-ofmap pair.
• Second, we typically compute a set of ofmaps in each layer, and
finally to maximize the reuse of inputs and filter weights, we
process a batch of input images in one go.
• Let us thus introduce some additional terminology. Consider a layer
that takes as input C ifmaps, and produces K output ofmaps.
Additionally, the entire neural network processes N input images in
a batch.
• For the sake of simplicity, let us assume that all the ifmaps and
ofmaps have the same dimensions: H×W (row-column format).
Architectures for Hardware Acceleration
The Convolution Operation

• Moreover, we observe that for each input image, we compute a convolution.


Each pixel of an ofmap is dependent on all the input ifmaps, and for each ifmap-
ofmap pair we have a filter.
• This is the basic equation for a convolution, which allows us to avoid heavy
computations with large weight vectors. Such convolutional neural networks are
known as CNNs, and they have proved to be extremely useful in very diverse
fields.
Architectures for Hardware Acceleration
Design of a CNN
• We have four kinds of layers in a CNN: convolutional layer, fully
connected layer, ReLU layer, and pooling layer. The latter two layers are
nonlinear layers.
Architectures for Hardware Acceleration
Design of a CNN

• The order of the loops does not matter.


Since there are no dependences
between the loop variables, we can
reorder the loops in any way we wish.

• For each output pixel, we perform C × R × S multiplications and the same number of
additions. We thus perform a total of 2C × R × S operations.
• The number of operations in such algorithms is typically mentioned in terms of the number
of MAC operations. We thus observe that per output pixel we perform C×R×S MAC
operations where we add partial sums to the output pixel (initialized to 0). Here, the product
I[n][c][h+r][w+s]×F[k][c][r][s] is referred to as a partial sum.
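• The seven nested loops implied by the indices n, k, c, h, w, r, and s can be sketched as follows
(our own reconstruction, assuming a unit stride, a zero-initialised output, and ifmaps stored with
(H+R−1)×(W+S−1) elements per channel so that the indexing stays in bounds; the flattened-index
helpers are assumptions):

void conv_layer(const float *I, const float *F, float *O,
                int N, int K, int C, int H, int W, int R, int S) {
    auto iI = [&](int n, int c, int y, int x) { return ((n * C + c) * (H + R - 1) + y) * (W + S - 1) + x; };
    auto iF = [&](int k, int c, int r, int s) { return ((k * C + c) * R + r) * S + s; };
    auto iO = [&](int n, int k, int h, int w) { return ((n * K + k) * H + h) * W + w; };
    for (int n = 0; n < N; n++)                 // images in the batch
      for (int k = 0; k < K; k++)               // ofmaps
        for (int c = 0; c < C; c++)             // ifmaps (channels)
          for (int h = 0; h < H; h++)           // output rows
            for (int w = 0; w < W; w++)         // output columns
              for (int r = 0; r < R; r++)       // filter rows
                for (int s = 0; s < S; s++)     // filter columns
                  O[iO(n,k,h,w)] += I[iI(n,c,h+r,w+s)] * F[iF(k,c,r,s)];
}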
Architectures for Hardware Acceleration
Design of a CNN

• We are essentially defining a 7-dimensional space where the


dimensions are independent. This space can be tiled – broken
down into subspaces.
• Assume that we change the increment for the loop iterators w
and h from 1 to 3. It means that we are considering 3×3 tiles of
output pixels. Then we need to add two inner loops that traverse
each tile (3×3 space) and compute the corresponding partial
sums and add them to obtain the ofmaps (see the sketch after this list).
• Furthermore, since these processors only have to perform MAC
operations and iterate through a loop, we do not need regular
processors. Akin to a GPU, we can create an array of very small
and simple processors.
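• A fragment showing only the restructured spatial loops (our own sketch, meant to be read in the
context of the conv_layer code above, whose h and w loops it replaces; it assumes <algorithm> is
included for std::min): the h and w loops now step by 3, and two inner loops traverse each 3×3 tile.

    for (int h0 = 0; h0 < H; h0 += 3)                        // tile origin (rows)
      for (int w0 = 0; w0 < W; w0 += 3)                      // tile origin (columns)
        for (int h = h0; h < std::min(h0 + 3, H); h++)       // traverse the tile's rows
          for (int w = w0; w < std::min(w0 + 3, W); w++)     // traverse the tile's columns
            for (int r = 0; r < R; r++)
              for (int s = 0; s < S; s++)
                O[iO(n,k,h,w)] += I[iI(n,c,h+r,w+s)] * F[iF(k,c,r,s)];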
Architectures for Hardware Acceleration
A Reference Architecture

• A high level reference architecture


is presented in Figure that will
allow us to achieve these
objectives.
• Such an architecture typically has
a 1D array or a 2D matrix of PEs,
some local storage in each PE
(akin to the L1 cache), a large L2
cache, and an off-chip main
memory.
• In the figure, the local buffer (LB)
in each PE is analogous to the L1
cache and the global buffer (GB) is
analogous to the L2 cache. The
PEs are interconnected with an
Network on Chip (NoC). Reference Architecture with a matrix of PEs
Architectures for Hardware Acceleration
A Reference Architecture

• From the point of view of software we need at least one complex


processor that we refer to as the dispatcher.
• It controls and orchestrates the entire computation. This includes
dividing the work amongst the PEs, sending fetch and prefetch
commands to memory, and moving outputs from the global buffer
to the main memory.
• We can reorder, tile, and parallelize the loops in many different
ways. There are implications in terms of data locality in the GB and
LBs.
• Additionally, we need to consider the overhead of moving data over
the NoC, sending multicast messages, and computing the final
output value for a pixel after adding the partial sums. The latter
operation is also known as reduction.
Architectures for Hardware Acceleration
Formal Representation of the Nested Loops

• Let us represent them by a notation that captures the temporal
order of loops. Let us refer


to this as a mapping because we are in essence mapping
the computations to PEs.
• In this case it is a single PE; however, very soon we shall
introduce directives to parallelize this computation.
• Here, the operator indicates a temporal relationship. The
loops on the right hand side are strictly nested within the
loops on the left hand side. In the context of this notation,
it means that we process one input image at a time. Then
for an input image, we process the ofmaps one after the
other, and so on
Architectures for Hardware Acceleration
Formal Representation of the Nested Loops

• Let us introduce another operator to denote the possible parallel


execution of a loop (distributed across the PEs) with the symbol ‖
where the notation n ‖ means that we can process all the N input
images in parallel.
• It does not mean that we necessarily have adequate hardware to
actually run all the iterations in parallel; it just says that it is possible
to do so if we have enough hardware.
• If we do not have enough hardware then each PE needs to run
several iterations. Furthermore, if we have two consecutive ‖
symbols then it means that both the corresponding loops run in
parallel.
• For example h ‖ w ‖ means that each (row, column) pair is processed
in parallel. We can also enclose it in square brackets for readability
such as h ‖ w ‖ . This notation denotes a single level of parallelism.
We create H×W parallel copies of loops and map them to the PEs.
Architectures for Hardware Acceleration
Formal Representation of the Nested Loops

• We might however wish to implement hierarchical parallelism. This


means that we might first want to parallelize the loop with iterator
h, map each iteration to a group of PEs, and then assign one PE in
each group to an iteration of w. This would be represented by:

• This provides a structure to the parallel execution. The rule of


thumb here is that the nesting operator represents a sequential order and ‖ represents
parallelism.
Architectures for Hardware Acceleration
Formal Representation of the Nested Loops

• The ‖ operator indicates that we “can parallelize a loop” by


distributing its iterations among the PEs. Assume a case where the
loop has 1024 iterations and we just have 256 PEs.
• In this case, we can run the first 256 iterations in parallel, then
schedule the next 256 iterations and so on.
• The way that we interpret the ‖ operator is that the loop
corresponding to the loop iterator preceding the operator can be
executed in parallel if we have enough PEs.
• However, if we do not have enough PEs then the execution will be
a hybrid of parallel and sequential execution. In this case, we say
that the execution is folded.
Architectures for Hardware Acceleration
Points to remember
Architectures for Hardware Acceleration
Software Model
• Our formalism for describing the nested loops predominantly
captures the control flow of the CNN program, and the nature
of parallelism. For the same control flow we can have
different types of data flow. For example, we can cache some
data in the local buffer (LB) of the PE, move some data from
the GB to the PEs, and also move some data values between
PE’s.
• The only extension that we propose is to encompass some
loops in a shaded box to indicate that the corresponding data
is cached locally within a PE. For example the mapping:

• indicates that the entire filter (R×S elements) is cached


locally within a PE. This is an example of temporal reuse
where a given piece of data is cached within a PE
Architectures for Hardware Acceleration
Software Model

• Each PE runs a thread that computes partial sums according


to the mapping. We split the iteration space among the
threads, and each PE executes a set of threads
(iterations) mapped to it.
• Secondly, we assume very coarse grained synchronization
among the threads. For the mapping:

• where we parallelize the loops on the basis of input


channels, after every iteration of k, we insert a barrier.
• This means that after we have processed all the channels for
a given value of k, we encounter a barrier (a synchronization
point for all the threads).
• The threads otherwise need not run in lockstep, and we
assume that they do not suffer from data races while
computing the values of output pixels.
Architectures for Hardware Acceleration
Design Space of Loop Transformations

• The aim is to consider different types of architectures where we


keep different types of data cached in each PE. Caching data in local
buffers is known as stationarity.
• The key idea behind stationarity of any data is to use it as frequently
as possible when it is resident in the local cache, and then minimise
the number of times the data needs to be reloaded into the cache.
• This means that the stationary data should be reused as much as
possible and then replaced. Note that it is possible to design
architectures where there is no stationarity. Such architectures are
known as No Local Reuse (NLR) architectures.
Architectures for Hardware Acceleration
Weight Stationary (WS) Architecture

• Consider the example:

• This means that we compute the results for each ifmap-ofmap


pair and image sequentially. Next, we divide an ofmap (and
corresponding ifmap) into small tiles and distribute them across
the 2D PE array.
• Each PE computes the output pixels for each tile. It reads the
corresponding inputs, and repeatedly accesses the filter weights.
• In this case, it makes sense to cache the filter weights for a given
ifmap-ofmap pair in each PE. This is an example of filter reuse or
weight reuse.
• We can keep the filter weights stationary at each PE and this will
ensure that we only need to stream the inputs and outputs to and
from each PE. This is an example of a weight stationary or WS
architecture.
Architectures for Hardware Acceleration
Making it more efficient: Weight Stationary (WS) Architecture
• The aim is to reuse the filter weights as much as possible. This
means that we need to load them once, finish all the
computations that require them, and then load the next filter.
• This is equivalent to maximizing the distance between the
iterations that change the filter; this also maximizes the number of
iterations that use the filter in the mapping. One such mapping is
as follows:

• The parameters k and c change the filter and thus they are the
iterators of the outermost loops. The loops traversing the filter are
the innermost. Note the shaded box in the mapping. In this case it
indicates that the entire filter (dimensions R × S) is cached within a
PE.
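• A self-contained sketch of this loop order for a single PE (our own interpretation; all names,
the flat indexing, and the tile bounds h0..h1, w0..w1 are assumptions): the filter for a given
(k, c) pair is copied once into the local buffer and then reused across all the images and all the
output pixels of the tile assigned to this PE.

void ws_pe(const float *I, const float *F, float *O,
           int N, int K, int C, int H, int W, int R, int S,
           int h0, int h1, int w0, int w1) {                 // output tile assigned to this PE
    float LB[64];                                            // local buffer; assumes R*S <= 64
    auto iI = [&](int n, int c, int y, int x) { return ((n * C + c) * (H + R - 1) + y) * (W + S - 1) + x; };
    auto iF = [&](int k, int c, int r, int s) { return ((k * C + c) * R + r) * S + s; };
    auto iO = [&](int n, int k, int h, int w) { return ((n * K + k) * H + h) * W + w; };
    for (int k = 0; k < K; k++)
      for (int c = 0; c < C; c++) {
        for (int r = 0; r < R; r++)                          // load the filter once: it stays stationary
          for (int s = 0; s < S; s++)
            LB[r * S + s] = F[iF(k, c, r, s)];
        for (int n = 0; n < N; n++)                          // reuse it across images and the whole tile
          for (int h = h0; h < h1; h++)
            for (int w = w0; w < w1; w++)
              for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                  O[iO(n, k, h, w)] += I[iI(n, c, h + r, w + s)] * LB[r * S + s];
      }
}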
Architectures for Hardware Acceleration
Making it more efficient: Weight Stationary (WS) Architecture
• We cache the entire filter (dimensions: R×S) in the LBs (Local
Buffer) of the PEs. The filter depends on k and c, which are
incremented in the two outermost loops. Each PE is assigned a tile of the ofmap.

• At runtime, each PE can read the part of the ifmap that


corresponds to its tile, and keep computing the convolution. Note
that to compute the partial sums for a tile with dimensions Th×Tw,
we actually need to read (Th+R−1)×(Tw+S−1) input pixels. We
assume that all of these reads are performed by the system.
Architectures for Hardware Acceleration
Input Stationary (IS) Architecture

• Consider another kind of mapping where we distribute parts of


the ifmaps (inputs) to every PE, and keep them stationary in the LB
there. We tile the loops with iterators h and w.
• We thus break the ifmap into tiles of dimensions
(Th+R−1)×(Tw+S−1). Given that there are C input channels, we
can store C such tiles in every PE assuming we have the space for
them. Note that in this case, the tiles stored across the PEs have
an overlap.
• This had to be done to ensure that we can efficiently compute the
convolutions for pixels at the right and bottom edges of each tile.
• Otherwise, we need to communicate values between the tiles.
Such pixels at the bottom and right edges of the tile for which we
need to store or communicate extra information are known as
halo pixels.
Architectures for Hardware Acceleration
Input Stationary (IS) Architecture
• In this case, the mapping is as follows for such an input
stationary or IS architecture.

• The input ifmaps are stationary. In each PE we store C ifmap


tiles. Each PE reads the relevant filter weights and computes
the corresponding partial sums.
• Finally, it adds the partial sums for the corresponding output
tile and computes the output; then it moves to the next
image. This limits the number of partial sums that need to be
stored.
• Given that a PE stores tiles for all the C channels, it can
compute all the convolutions locally. There is no need to
handle halo pixels in a special manner.
Architectures for Hardware Acceleration
Output Stationary (OS) Architecture
• On similar lines we can define an output stationary or OS
architecture. Here we distribute the output pixels across the PEs.
They read the relevant inputs and filter weights, and then compute
the partial sums.

Input stationary
Output stationary

Weight stationary
Architectures for Hardware Acceleration
Row Stationary (RS) Architecture
• We can alternatively distribute rows of the ifmap and the filter
across the PEs.
• They can compute the relevant partial sums. For a given ofmap
row, one of the PEs can take up the role of aggregating the partial
sums.
• The rest of the PEs that have computed partial sums for that
ofmap row, albeit with different filter weights and ifmap rows, can
send the partial sums to the aggregating PE. A possible mapping is
as follows:

The important point to note here is that there are


many ways of creating stationary architectures.
Even for a given type of architecture, there are
many ways of distributing the computations and
organizing the data flow.
Architectures for Hardware Acceleration
Hardware Architecture for ML
• We described temporal reuse, where we keep some data stationary in
the PEs such that it can be used by later computations. However, for
building an efficient hardware implementation, we also need to consider
efficient spatial reuse, which means that we read a block of data once
from memory and try to reuse it as much as possible.
• For example, this can be done by multicasting it to a set of PEs. The PEs
do not have to issue separate reads to memory.
• Alternatively, data values can flow from one PE to the next along the
same row or column. Again this reduces the memory bandwidth
requirement.
• Any hardware implementation has to consider such kinds of spatial
reuse to reduce the pressure on the global buffer.
Architectures for Hardware Acceleration
Hardware Architecture (HA): Weight Stationary Architecture
• We need to realize this on a 2D array of PEs, which is the most
generic architecture in this space. Each PE has a filter stored for an
ofmap-ifmap(k, c) pair.

• For a 2D-matrix of PEs, we need to structure the matrix in such a way


that it allows us to aggregate the outputs. We thus assign the rows to
channels, and the columns to ofmaps. The mapping thus becomes
Architectures for Hardware Acceleration
Hardware Architecture (HA): Weight Stationary Architecture

• In Figure, there are no connections between


the vertical and horizontal wires. Note that
there are no connection symbols (filled dark
circles) at their intersections. For systolic
transfer between the PEs, the vertical links
are used. Note that the arrows between PEs
(denoting systolic transfer) are conceptual:
they are only showing the direction of the
flow of data.

A Weight Stationary Architecture


Architectures for Hardware Acceleration
Hardware Architecture (HA): Weight Stationary Architecture

• Phase I: First for a given (k, c) pair we need to load


the filter weights in the PEs. Each PE can issue
reads to the GB via the NoC.
• Each row of the 2D-array corresponds to one
channel (c1. . . c4), and each column of this array
corresponds to a given ofmap (total of K (= 4) such
ofmaps). We thus have C rows and K columns.
• Phase II: For each channel, we send a block of
values from the corresponding ifmap along the
rows. We have a choice here, either we can send
data byte by byte, pixel by pixel, or as a tile of
pixels.
• Phase III: Each PE computes the convolutions
between the input data and the filter data.
A Weight Stationary Architecture
Architectures for Hardware Acceleration
Hardware Architecture (HA): Weight Stationary Architecture
• Phase IV: Note that all the partial sums computed
in each column need to be added to get the value
of the corresponding output pixels.
• We need to sum up the values column wise. This
can be done in two ways. We can either have a tree
of adders at the end of each column.
• All the values can be sent to the adder tree via the
NoC. This is known as parallel reduction. The other
option is to opt for a systolic transfer.
• A PE in the highest row transfers its partial sums to
the PE below it (in the same column). This PE adds
the received partial sums with the partial sums it
has computed, and transfers the result to the PE
below.
• This process continues and finally the result leaves A Weight Stationary Architecture
the PE array via the last row of PEs (bottom row in
Figure).
Architectures for Hardware Acceleration
References

• Textbook: Advanced Computer Architecture, Smruti Sarangi, Chapter:


Architectures for Machine Learning
THANK YOU

Sudeendra kumar K
Department of Electronics and Communication
Engineering
sudeendrakumark@pes.edu
