Tiling/Performance
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on the device
is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple
threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread
can efficiently make multiple passes over any data element
• Copying results from shared memory back to global memory (see the sketch below)
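A minimal sketch of this load/compute/store pattern: a hypothetical kernel (tileAverageKernel, TILE and the averaging computation are illustrative only, not from these slides) in which each block stages one tile in shared memory and every thread then reads the whole tile from on-chip memory instead of from DRAM.

#define TILE 256

// Each block loads one tile of the input into shared memory, then every
// thread re-reads the whole tile from shared memory instead of global memory.
__global__ void tileAverageKernel(const float* in, float* out, int n)
{
  __shared__ float tile[TILE];

  int idx = blockIdx.x * TILE + threadIdx.x;

  // 1. Cooperative load: one element per thread
  tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
  __syncthreads();                     // tile fully loaded

  // 2. Compute from shared memory; each element is re-read TILE times
  float sum = 0.0f;
  for (int k = 0; k < TILE; ++k)
    sum += tile[k];

  // 3. Copy the result back to global memory
  if (idx < n)
    out[idx] = sum / TILE;
}

A launch of the form tileAverageKernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n) gives one thread per element.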
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM)
- much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns (illustrated in the sketch after this list)
– R/Only -> constant memory (very fast if in cache)
– R/W shared within Block -> shared memory (very fast)
– R/W within each thread -> registers (very fast)
– R/W inputs/results -> global memory (very slow)
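A hypothetical kernel illustrating this mapping; the filter coefficients, the 256-thread block size and all names here are illustrative assumptions, not from the slides.

#define N_COEF 8
#define BLOCK  256

// Read-only data used by every thread: constant memory (cached, very fast on hit)
__constant__ float coef[N_COEF];

// Assumes blocks of exactly BLOCK threads.
__global__ void applyFilter(const float* in, float* out, int n)
{
  // Data shared (R/W) within the block: shared memory
  __shared__ float window[BLOCK];

  int i = blockIdx.x * BLOCK + threadIdx.x;

  // Per-thread working value: a register
  float acc = 0.0f;

  // Inputs arrive through global memory
  window[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // Toy computation mixing constant-memory and shared-memory reads
  for (int k = 0; k < N_COEF; ++k)
    acc += coef[k] * window[(threadIdx.x + k) % BLOCK];

  // Results go back out through global memory
  if (i < n)
    out[i] = acc;
}

The host would fill coef with cudaMemcpyToSymbol before launching the kernel.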
Idea: Use Shared Memory to reuse global memory data
• Each input element is read by Width threads.
• Load each element into Shared Memory and have several threads use the local
version to reduce the demand on global memory bandwidth
– Tiled algorithms
[Figure: WIDTH x WIDTH matrix multiplication, showing how thread (tx, ty) reads a full
row and a full column of the input matrices]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in
each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) and
thread (tx, ty) identify the Pdsub tile and the element within it]
Breaking Md and Nd into Tiles
[Figure: a small example with Md, Nd and Pd broken into 2x2 tiles of elements
Md0,0 … Md3,1, Nd0,0 … Nd1,3 and Pd0,0 … Pd3,3]
Each phase of a Thread Block uses
one tile from Md and one from Nd
Phases of each thread (time runs left to right within each phase):

T0,0: Phase 1 – loads Md0,0 into Mds0,0 and Nd0,0 into Nds0,0;
              PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
      Phase 2 – loads Md2,0 into Mds0,0 and Nd0,2 into Nds0,0;
              PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1

T1,0: Phase 1 – loads Md1,0 into Mds1,0 and Nd1,0 into Nds1,0;
              PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
      Phase 2 – loads Md3,0 into Mds1,0 and Nd1,2 into Nds1,0;
              PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1

T0,1: Phase 1 – loads Md0,1 into Mds0,1 and Nd0,1 into Nds0,1;
              PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
      Phase 2 – loads Md2,1 into Mds0,1 and Nd0,3 into Nds0,1;
              PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1

T1,1: Phase 1 – loads Md1,1 into Mds1,1 and Nd1,1 into Nds1,1;
              PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
      Phase 2 – loads Md3,1 into Mds1,1 and Nd1,3 into Nds1,1;
              PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
Threads, Warps, Blocks
• There are (up to) 32 threads in a Warp
– Only <32 when there are fewer than 32 total
threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory) allow, more than 1
Block can occupy each SM
First-order Size Considerations in G80
• Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
– A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
• Each thread block performs 2*256 = 512 float loads from global
memory for 256 * (2*16) = 8,192 mul/add operations (worked out in the sketch below).
– Memory bandwidth is no longer a limiting factor
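The size arithmetic above, written out as a small host-side calculation (plain C; the constants are the ones from this slide):

#include <stdio.h>

int main(void)
{
  const int TILE_WIDTH = 16;
  const int Width      = 1024;

  int threads_per_block = TILE_WIDTH * TILE_WIDTH;                     /* 256        */
  int blocks            = (Width / TILE_WIDTH) * (Width / TILE_WIDTH); /* 4096       */
  int loads_per_block   = 2 * threads_per_block;                       /* 512 floats */
  int ops_per_block     = threads_per_block * 2 * TILE_WIDTH;          /* 8192       */

  printf("%d blocks of %d threads\n", blocks, threads_per_block);
  printf("%d float loads for %d mul/add ops => %.0f ops per loaded float\n",
         loads_per_block, ops_per_block,
         (double)ops_per_block / loads_per_block);                     /* 16         */
  return 0;
}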
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point addition
– 4B/s of memory bandwidth needed per FLOPS
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits the execution to 200/4 = 50 GFLOPS
– This limits the execution rate to 3.3% (50/1500) of the peak
floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to
the 1,500 GFLOPS (see the calculation below)
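The same bound as a small calculation (plain C; 1,500 GFLOPS and 200 GB/s are the assumed GPU figures from this slide):

#include <stdio.h>

int main(void)
{
  const double peak_gflops  = 1500.0; /* assumed peak floating-point rate    */
  const double dram_gbs     = 200.0;  /* assumed DRAM bandwidth (GB/s)       */
  const double bytes_per_op = 4.0;    /* one 4-byte access per FP operation  */

  double required_gbs   = bytes_per_op * peak_gflops; /* 6,000 GB/s needed   */
  double limited_gflops = dram_gbs / bytes_per_op;    /* 50 GFLOPS achievable */

  printf("Bandwidth needed for peak: %.0f GB/s\n", required_gbs);
  printf("Achievable without data reuse: %.0f GFLOPS (%.1f%% of peak)\n",
         limited_gflops, 100.0 * limited_gflops / peak_gflops);
  return 0;
}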
Outline of Tiling Technique
– Identify a tile of global memory contents that are accessed by multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready to start the phase
– Have the multiple threads access their data from the on-chip memory
– Use barrier synchronization to make sure that all threads have completed the
current phase
– Move on to the next tile (a skeleton of this phase structure follows below)
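A condensed skeleton of that phase/barrier structure. It is shown with a toy row-sum computation only so that it compiles; the index arithmetic and the computation are placeholders, and the real tiled matrix-multiplication kernel follows on the next slides. Assumes a single block of TILE_WIDTH x TILE_WIDTH threads.

#define TILE_WIDTH 16

// Toy kernel: sums each of the first TILE_WIDTH rows of a width-column matrix,
// staging one TILE_WIDTH x TILE_WIDTH tile in shared memory per phase.
__global__ void tiledSkeleton(const float* in, float* rowSums, int width)
{
  __shared__ float tile[TILE_WIDTH][TILE_WIDTH];   // on-chip copy of one tile
  float result = 0.0f;

  for (int t = 0; t < width / TILE_WIDTH; ++t) {
    // 1. Cooperative load: each thread fetches one element of tile t
    tile[threadIdx.y][threadIdx.x] =
        in[threadIdx.y * width + t * TILE_WIDTH + threadIdx.x];

    __syncthreads();   // 2. all loads done before this phase's computation

    // 3. Compute using the tile held in shared memory
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += tile[threadIdx.y][k];

    __syncthreads();   // 4. all threads finished before the tile is overwritten
  }

  // 5. Move the result back to global memory (one writer per row)
  if (threadIdx.x == 0)
    rowSums[threadIdx.y] = result;
}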
Objective
– To understand the design of a tiled parallel algorithm
for matrix multiplication
– Loading a tile
– Phased execution
– Barrier Synchronization
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled
code
CUDA Code – Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
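With this configuration, the kernel on the next slide is launched as follows; note that it assumes Width is a multiple of TILE_WIDTH, since the kernel has no boundary checks.

// e.g. Width = 1024, TILE_WIDTH = 16  ->  dimGrid = (64, 64), dimBlock = (16, 16),
// one thread per Pd element
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);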
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of one Md tile and one Nd tile into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();   // wait until the whole tile is loaded

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();   // wait before the tile is overwritten in the next phase
  }

  Pd[Row*Width + Col] = Pvalue;
}
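A minimal host-side driver for the kernel above, as a sketch (error checking omitted; MatrixMulOnDevice is an assumed wrapper name, and Width is assumed to be a multiple of TILE_WIDTH):

#include <cuda_runtime.h>

void MatrixMulOnDevice(const float* M, const float* N, float* P, int Width)
{
  size_t size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // Allocate device memory and copy the input matrices over
  cudaMalloc(&Md, size);  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMalloc(&Nd, size);  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
  cudaMalloc(&Pd, size);

  // One TILE_WIDTH x TILE_WIDTH block per output tile
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

  // Copy the result back and release device memory
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
  cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);
}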
Tiled Multiply
bx
0 1
• Each block computes one tx
012
square sub-matrix Pdsub of size Nd
TILE_WIDTH TILE_WIDTH
TILE_W
m IDTH-1
TILE_WIDTH
WIDTH
• Each thread computes one bx k
element of Pdsub
Md Pd
by
0
m
0
TILE_WIDTHE
1
WIDTH
Pdsub
by 1
ty 2
k
TILE_WIDTH-1
TILE_WIDTH TILE_WIDTH TILE_WIDTH
2 WIDTH WIDTH
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 Thread Blocks actively executing
• This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block,
allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
– The 86.4GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS! (see the calculation below)
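The shared-memory occupancy numbers above as a small calculation (plain C; 16KB per SM is the G80 figure from this slide, and only the shared-memory limit is considered):

#include <stdio.h>

int main(void)
{
  const int sm_shared_bytes = 16 * 1024;        /* G80: 16KB shared memory per SM */

  for (int tw = 16; tw <= 32; tw += 16) {
    int bytes_per_block = 2 * tw * tw * 4;      /* Mds + Nds tiles of 4B floats   */
    int max_blocks      = sm_shared_bytes / bytes_per_block;
    printf("TILE_WIDTH %2d: %dKB per block, shared memory allows %d blocks per SM\n",
           tw, bytes_per_block / 1024, max_blocks);
  }
  return 0;
}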
Tiling Size Effects
[Figure: measured GFLOPS (0–100) for the untiled kernel and for 4x4, 8x8, 12x12 and
16x16 tiles, each shown "tiled only" and "tiled & unrolled"]