Conversation

@antonysigma
Contributor

@antonysigma antonysigma commented Jun 13, 2025

(GitHub, unlike Gerrit, does not support daisy-chaining PRs, nor does it support a git-diff between parent and child PRs. Marking this as a draft PR to emulate the daisy-chaining feature.)

Depends on PR #8647.

Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016 benchmarking tests under `./apps/*/`. Override the GPU shared-memory size estimates for specific algorithm pipelines to match the GPU and hardware specifications of the GitHub buildbot machines:

  • bilateral_grid;
  • camera_pipe;
  • local_laplacian;
  • stencil_chain;
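For reference, one way these autoscheduler parameters can be passed is through the app's CMake build. This is a hypothetical sketch, not taken from the Halide repository: the target and generator names (`my_pipeline`, `my_pipeline.generator`) are illustrative, and the parameter spelling follows this PR's description.

```cmake
# Hypothetical sketch: enabling the experimental Mullapudi2016 GPU schedule
# for one app target. Names here are illustrative placeholders.
add_halide_library(my_pipeline
                   FROM my_pipeline.generator
                   AUTOSCHEDULER Halide::Mullapudi2016
                   # Opt in to the experimental GPU schedule; additional
                   # PARAMS entries could override per-pipeline estimates.
                   PARAMS autoschedule.experimental_gpu_schedule=1)
```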

Manually tuning `autoscheduler.parallelism` from 128 up to 4096 for the following pipelines would speed up the benchmarking tests, improving the auto-scheduled GPU algorithms by more than 5X. The default value (=128) is the one recommended in the original publication; it made sense for 2016-era GPUs (e.g. Tesla K40), but GPU hardware has improved since then. We nevertheless keep the default value (=128) to honor the original publication.

  • conv_layer;
  • lens_blur;
  • local_laplacian;

Skip experimental GPU schedules for the following pipelines:

  • iir_blur: Out-of-bounds access at index 5 for images that have only 3 channels (likely a TailStrategy issue);
  • nl_means: More than 3 nested levels of gpu_threads and/or gpu_blocks detected.

@alexreinking
Member

Github, unlike Gerrit, does not support daisy-chaining of PRs.

Have you heard of gh-stack? That might support this workflow better, though we've never used it on this repository before.


@alexreinking alexreinking left a comment


One little thing to fix, then rebase on main when the dependent PR is merged and I'll approve.

@alexreinking
Member

@antonysigma - I just merged #8647. Please update this PR.

Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016
tests in the path `./apps/*/`. Manually adjust L2/L3 cache size for
specific algorithm pipelines. Skip experimental GPU schedules for the
following cases:

- iir_blur: Problems about accessing index 5 for images having only
3 channels (likely the TailStrategy issues);

- nl_means: More than 3 nested levels of `gpu_threads` and/or
`gpu_blocks` detected.
@antonysigma antonysigma force-pushed the mullapudi2016-gpu-testbench branch from 5b6a90b to 9a249e7 Compare June 13, 2025 22:10
@antonysigma antonysigma marked this pull request as ready for review June 13, 2025 22:10
@antonysigma antonysigma requested a review from alexreinking June 13, 2025 22:10
@alexreinking
Member

Thanks very much for your responsiveness @antonysigma -- I've just requested a review from the buildbots. If they don't pick up on this PR within an hour, just push an empty commit to this branch.

@antonysigma
Contributor Author

antonysigma commented Jun 14, 2025

Summary of the test-bench failures:

On x64-osx-metal:

  • camera_pipe: Failed assertion parallelism <= 832.
  • interpolate_filter: Failed assertion parallelism <= 640.
  • stencil_chain_process: Failed assertion parallelism <= 832.
  • unsharp_filter: Failed assertion parallelism <= 768.

On llvm20-x64-Linux-cuda:

  • bgu_filter: CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS cuCtxSynchronize failed
  • interpolate_filter: Error: OpenCL error: CL_OUT_OF_RESOURCES clEnqueueNDRangeKernel failed

On llvm20-x64-windows-opencl:

  • interpolate_filter: Error: OpenCL error: CL_OUT_OF_RESOURCES clEnqueueNDRangeKernel failed

I plan to scale back the benchmarking test coverage in this PR. After all, the GPU autoscheduler is an experimental feature that does not yet cover the modern Metal/OpenCL APIs.

@alexreinking
Member

So what's the path forward here? Do you want to add parallelism upper bounds for some targets?

We should open issues for the other three. It isn't clear why those schedules are failing.


@antonysigma antonysigma left a comment


Hi @alexreinking,

So what's the path forward here? Do you want to add parallelism upper bounds for some targets?

I disabled the experimental GPU scheduling feature for the failing pipelines.

We should open issues for the other three. It isn't clear why those schedules are failing.

These errors may all be related: the autoscheduler may be double-counting the `threads_budget` at two separate places in the code:

```
src/autoschedulers/mullapudi2016/AutoSchedule.cpp:1422:            threads_budget = simplify(max(threads_budget / new_entry.factor, 1));
src/autoschedulers/mullapudi2016/AutoSchedule.cpp:3438:    if (can_prove(def_par < arch_params.parallelism)) {
```

I will bundle the bugfix with the next draft PR: antonysigma@91babf2

-Antony

@alexreinking
Member

Sounds good - I'll merge this when green.

@alexreinking alexreinking merged commit 08c3357 into halide:main Jun 18, 2025
14 checks passed
@alexreinking
Member

@antonysigma do you have a draft PR to open?

@antonysigma
Contributor Author

do you have a draft PR to open?

@alexreinking Not for the coming week. I am still configuring my machine to simulate the host-metal environment. I need an environment matching that of the Halide buildbot so I can reproduce the `threads_budget` error.
