Unroll this CUDA_KERNEL_LOOP(index, n) loop to optimize its execution? #8928

@YawesomeM

Hello,

While profiling yolov4_darknet with NVIDIA Nsight Compute, we found non-negligible instruction dependencies inside the CUDA_KERNEL_LOOP loop. In our evaluation, unrolling this loop (see the code below) mitigates the performance issue: unrolling gives the loop body more instructions to schedule between dependent ones, which increases the likelihood of hiding dependency-related GPU stalls.

-CUDA_KERNEL_LOOP(index, n) {
+#pragma unroll 4 // or another suitable unroll factor
+for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n; index += blockDim.x * gridDim.x) {
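
For anyone who wants to try the change in isolation, here is a minimal sketch of the same pattern outside darknet. The kernel `scale_kernel` and the host helper `run_scale_example` are hypothetical names made up for this illustration and are not part of the darknet code base. The unroll factor of 4 is just the value from the diff above. Unrolling can also raise register usage, so whether it helps a given kernel should be confirmed by profiling.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical element-wise kernel, used only to illustrate the pattern.
__global__ void scale_kernel(const float* in, float* out, float alpha, int n) {
    // Grid-stride loop written out explicitly so the unroll pragma
    // can be attached directly to the for statement.
    #pragma unroll 4
    for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
         index += blockDim.x * gridDim.x) {
        out[index] = alpha * in[index];
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns
// true only if every output element equals 2 * input.
bool run_scale_example(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Deliberately launch fewer threads than elements so that each
    // thread executes several iterations of the grid-stride loop.
    scale_kernel<<<4, 128>>>(d_in, d_out, 2.0f, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * h_in[i]) return false;
    }
    return true;
}
```

Calling `run_scale_example(10000)` from a `main` function and comparing Nsight Compute stall metrics with and without the pragma is one way to check whether the unroll factor pays off on a particular GPU.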
