Skip to content

Conversation

@antonysigma
Copy link
Contributor

Tile the image with TailStrategy::GuardWithIf whenever can_parallelize(c, 3) is encountered and when the split factor (3) is smaller than GPU's warp size (=32).

For 2D/3D tile schedules having mixed tail strategies, pick the most conservative one, i.e. TailStrategy::GuardWithIf.

Enable iir_blur tests on the Github Actions/Buildbot.

Before:

iir_blur_auto_schedule failure:
Error: Input buffer input is accessed at 5, which is beyond the max (3) in dimension 2

After:

Manually-tuned time: 4.09081ms
Auto-scheduled time: 1769.91ms

Note: We are not trying to compete with manual GPU schedules. We just want it to pass correctness tests.

cc'ed @alexreinking and @mcourteaux .

@antonysigma antonysigma force-pushed the mullapudi-tail-strategy branch from ac74c4b to 53b1567 Compare June 24, 2025 03:23
Tile the image with `TailStrategy::GuardWithIf` whenever
`can_parallelize(c, 3)` is encountered and when the split factor is
smaller than GPU's warp size (=32).

For 2D/3D tile schedules having mixed tail strategies, pick the most
conservative one, i.e. `TailStrategy::GuardWithIf`.

Enable `iir_blur` tests on the Github Actions/Buildbot.
@antonysigma antonysigma force-pushed the mullapudi-tail-strategy branch from 53b1567 to 1452961 Compare June 24, 2025 03:31
@antonysigma
Copy link
Contributor Author

The Testbench passes for targets metal and cuda for all OS'es, except OpenCL. I will modify the test code to skip OpenCL targets then.

Error message:

OpenCL error: CL_INVALID_COMMAND_QUEUE clFinish failed

Copy link
Contributor Author

@antonysigma antonysigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Help wanted to troubleshoot the buildbot failure for host-metal target.

@abadams
Copy link
Member

abadams commented Jun 30, 2025

The Halide buildbots are physical machines we have sitting in various offices and homes, so the cost is just electricity. Provided you aren't starving other PRs, it's fine to add experimental commits just to get more debugging info from the tests.

@antonysigma antonysigma force-pushed the mullapudi-tail-strategy branch from 93076ae to 38b75b4 Compare June 30, 2025 22:39
@antonysigma antonysigma force-pushed the mullapudi-tail-strategy branch from 38b75b4 to 988e83f Compare July 1, 2025 02:27
Copy link
Contributor Author

@antonysigma antonysigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mcourteaux and @alexreinking ,

Hardcoding the input and output image sizes for the pipeline iir_blur seemed to resolve the OSX Metal error. I also found that the channel count estimates (=3) was inaccurate; the generated pipeline was fed with rgba.png having 4 channels. So I also modified the test code to feed a 3-channel image rgb.png.

The buildbot now reports separate errors regarding the max GPU threads per group error for other pipelines. I will resolve that in the next PR.

I have already documented the thread count limit at #8640 . I will study after this tail-strategy PR.

New errors:

/Users/halidenightly/build_bot/worker/halide-testbranch-main-llvm20-x86-64-osx-cmake/halide-build/apps/bilateral_grid/bilateral_grid_process "gray.png" "out.png" "0.1" "10"
2025-06-30 23:21:11.717 bilateral_grid_process[32169:21028558] Metal API Validation Enabled
-[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1267: failed assertion `(threadsPerThreadgroup.width(64) * threadsPerThreadgroup.height(16) * threadsPerThreadgroup.depth(1))(1024) must be <= 896. (kernel threadgroup size limit)

/Users/halidenightly/build_bot/worker/halide-testbranch-main-llvm20-x86-64-osx-cmake/halide-build/apps/lens_blur/lens_blur_filter "rgb_small.png" "32" "13" "0.5" "32" "3" "out.png"
2025-06-30 23:26:02.260 lens_blur_filter[32272:21031150] Metal API Validation Enabled
-[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1267: failed assertion `(threadsPerThreadgroup.width(32) * threadsPerThreadgroup.height(32) * threadsPerThreadgroup.depth(1))(1024) must be <= 896. (kernel threadgroup size limit)'
Manually-tuned time: 40.1912ms

-Antony

Comment on lines +158 to +165
// Hardcode all CPU/GPU kernel bounds via Halide's constant bound propagation.
input.dim(0).set_bounds(0, 1536).set_stride(1);
input.dim(1).set_bounds(0, 2560).set_stride(1536);
input.dim(2).set_bounds(0, 3).set_stride(1536 * 2560);
output.dim(0).set_bounds(0, 1536).set_stride(1);
output.dim(1).set_bounds(0, 2560).set_stride(1536);
output.dim(2).set_bounds(0, 3).set_stride(1536 * 2560);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: OSX Metal buffer allocation error.

Hardcoding the input and output image sizes in the iir_blur pipeline appears to resolve the Metal GPU buffer size limit error.

@antonysigma antonysigma force-pushed the mullapudi-tail-strategy branch from bb0534d to 25c27b2 Compare July 1, 2025 23:30
@antonysigma
Copy link
Contributor Author

Hi @alexreinking , I am done with the bugfixes. There is one minor buildbot error though. What does it mean? How do I help resolving CMake errors?

[2466/4259] Building CXX object test/correctness/CMakeFiles/correctness_fuzz_simplify.dir/fuzz_simplify.cpp.o
FAILED: test/correctness/CMakeFiles/correctness_fuzz_simplify.dir/fuzz_simplify.cpp.o 
...
fatal error: file '/Users/halidenightly/build_bot/worker/halide-testbranch-main-llvm20-arm-64-osx-cmake/halide-build/include/Halide.h' has been modified since the precompiled header '/Users/halidenightly/build_bot/worker/halide-testbranch-main-llvm20-arm-64-osx-cmake/halide-build/test/CMakeFiles/_test_internal.dir/cmake_pch.hxx.pch' was built: size changed (was 1338765, now 1338767)
note: please rebuild precompiled header '/Users/halidenightly/build_bot/worker/halide-testbranch-main-llvm20-arm-64-osx-cmake/halide-build/test/CMakeFiles/_test_internal.dir/cmake_pch.hxx.pch'
1 error generated.

@alexreinking
Copy link
Member

I think that's a ccache bug... I'll clear that out.

@alexreinking alexreinking merged commit 4df93ec into halide:main Jul 2, 2025
19 checks passed
@alexreinking
Copy link
Member

@antonysigma - it was indeed ccache. Merged! 🙂

@antonysigma antonysigma deleted the mullapudi-tail-strategy branch July 3, 2025 05:58
@antonysigma
Copy link
Contributor Author

@alexreinking Thanks for merging it! The rest of the issues (e.g. metal GPU having smaller register count than Nvidia GPU on buildbot) are somewhat minor, and they should not be blocking the next Halide release. Let me know if you have other concerns related to bugfixes and/or scheduled releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants