Conversation

@antonysigma
Contributor

@antonysigma antonysigma commented Jun 13, 2025

(GitHub, unlike Gerrit, does not support daisy-chaining PRs, nor does it support a git-diff between parent and child PRs. Marking this as a draft PR to emulate the daisy-chaining feature.)

Depends on PR #8647.

Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016 benchmarking tests under `./apps/*/`. Override the GPU shared-memory size estimates for specific algorithm pipelines to match the GPU and hardware specifications of the GitHub buildbot machines:

  • bilateral_grid;
  • camera_pipe;
  • local_laplacian;
  • stencil_chain;
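For reference, one way these autoscheduler parameters can be passed is through the app's CMake build. This is a hypothetical sketch, not taken from the Halide repository: the target and generator names (`my_pipeline`, `my_pipeline.generator`) are illustrative, and the parameter spelling follows this PR's description.

```cmake
# Hypothetical sketch: enabling the experimental Mullapudi2016 GPU schedule
# for one app target. Names here are illustrative placeholders.
add_halide_library(my_pipeline
                   FROM my_pipeline.generator
                   AUTOSCHEDULER Halide::Mullapudi2016
                   # Opt in to the experimental GPU schedule; additional
                   # PARAMS entries could override per-pipeline estimates.
                   PARAMS autoschedule.experimental_gpu_schedule=1)
```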

Manually tuning `autoscheduler.parallelism` from 128 up to 4096 for the following pipelines would speed up the benchmarking tests, improving the auto-scheduled GPU algorithms by more than 5X. The default value (=128) is the one recommended in the original publication; it made sense for 2016-era GPUs (e.g. Tesla K40), but GPU hardware has improved since then. We nevertheless keep the default value (=128) to honor the original publication.

  • conv_layer;
  • lens_blur;
  • local_laplacian;

Skip experimental GPU schedules for the following pipelines:

  • iir_blur: Out-of-bounds access at index 5 for images that have only 3 channels (likely a TailStrategy issue);
  • nl_means: More than 3 nested levels of gpu_threads and/or gpu_blocks detected.

@alexreinking
Member

Github, unlike Gerrit, does not support daisy-chaining of PRs.

Have you heard of gh-stack? That might support this workflow better, though we've never used it on this repository before.


@alexreinking alexreinking left a comment


One little thing to fix, then rebase on main when the dependent PR is merged and I'll approve.

@alexreinking
Member

@antonysigma - I just merged #8647. Please update this PR.

Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016
tests in the path `./apps/*/`. Manually adjust L2/L3 cache size for
specific algorithm pipelines. Skip experimental GPU schedules for the
following cases:

- iir_blur: Problems about accessing index 5 for images having only
3 channels (likely the TailStrategy issues);

- nl_means: More than 3 nested levels of `gpu_threads` and/or
`gpu_blocks` detected.
@antonysigma antonysigma force-pushed the mullapudi2016-gpu-testbench branch from 5b6a90b to 9a249e7 Compare June 13, 2025 22:10
@antonysigma antonysigma marked this pull request as ready for review June 13, 2025 22:10
@antonysigma antonysigma requested a review from alexreinking June 13, 2025 22:10
@alexreinking
Member

Thanks very much for your responsiveness @antonysigma -- I've just requested a review from the buildbots. If they don't pick up on this PR within an hour, just push an empty commit to this branch.

@antonysigma
Contributor Author

antonysigma commented Jun 14, 2025

Summary of the test-bench failures:

On x64-osx-metal:

  • camera_pipe: Failed assertion parallelism <= 832.
  • interpolate_filter: Failed assertion parallelism <= 640.
  • stencil_chain_process: Failed assertion parallelism <= 832.
  • unsharp_filter: Failed assertion parallelism <= 768.

On llvm20-x64-Linux-cuda:

  • bgu_filter: CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS cuCtxSynchronize failed
  • interpolate_filter: Error: OpenCL error: CL_OUT_OF_RESOURCES clEnqueueNDRangeKernel failed

On llvm20-x64-windows-opencl:

  • interpolate_filter: Error: OpenCL error: CL_OUT_OF_RESOURCES clEnqueueNDRangeKernel failed

I plan to scale back the benchmarking test coverage in this PR. After all, the GPU autoscheduler is an experimental feature that does not yet cover the modern Metal/OpenCL APIs.

@alexreinking
Member

So what's the path forward here? Do you want to add parallelism upper bounds for some targets?

We should open issues for the other three. It isn't clear why those schedules are failing.


@antonysigma antonysigma left a comment


Hi @alexreinking,

So what's the path forward here? Do you want to add parallelism upper bounds for some targets?

I disabled the experimental GPU scheduling feature for the failing pipelines.

We should open issues for the other three. It isn't clear why those schedules are failing.

These errors may all be related: the autoscheduler may be double-counting the `threads_budget` at two separate places in the code:

```
src/autoschedulers/mullapudi2016/AutoSchedule.cpp:1422:            threads_budget = simplify(max(threads_budget / new_entry.factor, 1));
src/autoschedulers/mullapudi2016/AutoSchedule.cpp:3438:    if (can_prove(def_par < arch_params.parallelism)) {
```

I will bundle the bugfix with the next draft PR: antonysigma@91babf2

-Antony

@alexreinking
Member

Sounds good - I'll merge this when green.

@alexreinking alexreinking merged commit 08c3357 into halide:main Jun 18, 2025
14 checks passed
@alexreinking
Member

@antonysigma do you have a draft PR to open?

@antonysigma
Contributor Author

do you have a draft PR to open?

@alexreinking Not for the coming week. I am still configuring my machine to simulate the host-metal environment. I need an environment matching that of the Halide buildbot so I can reproduce the `threads_budget` error.
