Hello,
Using NVIDIA Nsight Compute to profile yolov4_darknet, we found non-negligible instruction dependencies inside the CUDA_KERNEL_LOOP loop. In our evaluation, unrolling this loop (see the change below) mitigates the performance issue: unrolling gives the loop body more instructions to schedule, which makes it more likely that dependency-related GPU stalls can be hidden.
-CUDA_KERNEL_LOOP(index, n) {
+#pragma unroll 4 // or another suitable unroll factor
+for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n; index += blockDim.x * gridDim.x) {
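
For anyone who wants to try the idea outside of darknet, here is a minimal standalone sketch of the same pattern. The kernel `scale_kernel` and the host helper `scale_on_gpu` are hypothetical names made up for this illustration and are not taken from the darknet sources. The unroll factor of 4 and the launch configuration are assumptions as well and would need tuning per kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration only; not part of darknet.
__global__ void scale_kernel(const float* in, float* out, float alpha, int n) {
    // Grid-stride loop written out by hand, as in the diff above,
    // with the unroll pragma attached directly to the for statement.
    #pragma unroll 4  // assumed factor; other values may work better
    for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
         index += blockDim.x * gridDim.x) {
        out[index] = alpha * in[index];
    }
}

// Hypothetical host helper: computes out[i] = alpha * in[i] on the GPU.
// Returns true if every CUDA call succeeded.
bool scale_on_gpu(const std::vector<float>& in, std::vector<float>& out, float alpha) {
    const int n = static_cast<int>(in.size());
    out.resize(in.size());
    if (n == 0) return true;

    const size_t bytes = in.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    bool ok = cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Deliberately small grid (2 blocks of 64 threads) so that each
        // thread runs several loop iterations for moderately sized inputs.
        scale_kernel<<<2, 64>>>(d_in, d_out, alpha, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `scale_on_gpu(in, out, 2.0f)` on a vector of 1000 floats makes each of the 128 threads run the loop about eight times, so the unrolled body is actually exercised. Whether unrolling helps in a given kernel is best checked by comparing the stall reasons in Nsight Compute before and after the change.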