Unroll this CUDA_KERNEL_LOOP(index, n) loop to optimize its execution?

Hello,

While profiling yolov4_darknet with NVIDIA Nsight Compute, we found non-negligible instruction dependencies within this loop: [CUDA_KERNEL_LOOP](https://github.com/kiyoshiiriemon/yolov4_darknet/blob/0f58f6bd1c432b84948ea00d500d4e747c1bdf9a/src/im2col_kernels.cu#L2240). In our evaluation, unrolling the loop (see the code below) mitigates the performance issue, because unrolling gives the loop body more instructions, which increases the likelihood of hiding dependency-related GPU stalls.

-CUDA_KERNEL_LOOP(index, n) {
+#pragma unroll 4 // or another suitable unroll factor
+for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n; index += blockDim.x * gridDim.x) {
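To make the proposal easier to try outside of darknet, here is a minimal, self-contained sketch of the same pattern. The kernel `scale_bias_unrolled`, its `a * x + b` body, and the host helper `unrolled_matches_reference` are hypothetical names invented for illustration; they are not the im2col kernel from the repository. The sketch only assumes that the loop has the grid-stride form shown in the diff above.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel (NOT darknet's im2col body): y[i] = a * x[i] + b.
// The grid-stride loop is written out explicitly so that the pragma sits
// directly in front of the `for` statement, as in the diff above.
__global__ void scale_bias_unrolled(const float* x, float* y, int n, float a, float b) {
#pragma unroll 4  // a hint only; the compiler may choose a different factor
    for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
         index += blockDim.x * gridDim.x) {
        y[index] = a * x[index] + b;
    }
}

// Hypothetical host helper: launches the kernel with a deliberately small grid
// so that each thread executes several loop iterations, then compares the
// result against a CPU reference. Returns true when all elements match.
bool unrolled_matches_reference(int n) {
    std::vector<float> h_x(n), h_y(n);
    for (int i = 0; i < n; ++i) h_x[i] = 0.001f * static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_x = nullptr;
    float* d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = 32;  // 8192 threads in total: fewer than n for large n
    scale_bias_unrolled<<<blocks, threads>>>(d_x, d_y, n, 2.0f, 1.0f);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);

    // Verify against the same computation done on the host.
    for (int i = 0; ok && i < n; ++i) {
        const float expected = 2.0f * h_x[i] + 1.0f;
        if (std::fabs(h_y[i] - expected) > 1e-5f) ok = false;
    }
    return ok;
}
```

Calling `unrolled_matches_reference(1 << 20)` from a host `main` and compiling with `nvcc` checks that the unrolled loop still produces correct results. Note that unrolling can only help when a thread runs more than one iteration: if the launch already provides at least `n` threads, each thread executes the loop body at most once. Whether a given factor actually reduces stalls should be confirmed by profiling the kernel again.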
