Enable experimental Mullapudi2016 GPU scheduler for test-bench #8650
Conversation
Force-pushed from 2fc6fd8 to 5b6a90b
alexreinking left a comment
One little thing to fix, then rebase on main when the dependent PR is merged and I'll approve.
@antonysigma - I just merged #8647. Please update this PR.
Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016 tests in the path `./apps/*/`. Manually adjust the L2/L3 cache size for specific algorithm pipelines.

Skip experimental GPU schedules for the following cases:
- iir_blur: problems accessing index 5 for images that have only 3 channels (likely a TailStrategy issue);
- nl_means: more than 3 nested levels of `gpu_threads` and/or `gpu_blocks` detected.
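Below is a minimal sketch, assuming Halide's C++ `AutoschedulerParams`/`apply_autoscheduler` API, of how a single test pipeline could opt into the experimental GPU schedule. The pipeline, estimates, target, and plugin path are illustrative assumptions, not code from this PR.

```cpp
// Sketch only: opting one pipeline into the experimental Mullapudi2016 GPU
// schedule. The pipeline, estimates, target, and plugin path below are
// assumptions for illustration.
#include "Halide.h"

using namespace Halide;

int main() {
    ImageParam in(Float(32), 2, "in");
    Func blur("blur");
    Var x("x"), y("y");
    blur(x, y) = (in(x, y) + in(x + 1, y) + in(x, y + 1)) / 3.0f;

    // The autoscheduler needs size estimates for inputs and outputs.
    in.set_estimates({{0, 1537}, {0, 2561}});
    blur.set_estimate(x, 0, 1536).set_estimate(y, 0, 2560);

    // Plugin library name/path is platform-dependent (assumption).
    load_plugin("libautoschedule_mullapudi2016.so");

    // Parameter spelling mirrors the flag quoted in this PR.
    AutoschedulerParams params{"Mullapudi2016",
                               {{"experimental_gpu_schedule", "1"}}};
    Target t = get_host_target().with_feature(Target::CUDA);
    Pipeline(blur).apply_autoscheduler(t, params);
    return 0;
}
```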
Force-pushed from 5b6a90b to 9a249e7
Thanks very much for your responsiveness @antonysigma -- I've just requested a review from the buildbots. If they don't pick this PR up within an hour, just push an empty commit to this branch.
Summary of test-bench failures:
- On x64-osx-metal:
- On llvm20-x64-Linux-cuda:
- On llvm20-x64-windows-opencl:

I plan to scale back the benchmarking test coverage of the PR. After all, the GPU autoscheduler is an experimental feature that does not cover the modern Metal/OpenCL APIs.
So what's the path forward here? Do you want to add parallelism upper bounds for some targets? We should open issues for the other three. It isn't clear why those schedules are failing.
antonysigma left a comment
Hi @alexreinking,

> So what's the path forward here? Do you want to add parallelism upper bounds for some targets?
I disabled the experimental GPU scheduling feature for the failing pipelines.
> We should open issues for the other three. It isn't clear why those schedules are failing.
These errors may all be related; the autoscheduler may be double-counting the threads_budget in separate places in the code:
- `src/autoschedulers/mullapudi2016/AutoSchedule.cpp:1422`: `threads_budget = simplify(max(threads_budget / new_entry.factor, 1));`
- `src/autoschedulers/mullapudi2016/AutoSchedule.cpp:3438`: `if (can_prove(def_par < arch_params.parallelism)) {`

I will bundle the bugfix with the next draft PR: antonysigma@91babf2
-Antony
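To make the suspected double counting concrete, here is a small standalone C++ illustration; it is not the actual autoscheduler code, and the budget and factor values are made up.

```cpp
// Standalone illustration (not Halide code): charging the same tiling factor
// against one shared thread budget in two different places shrinks the
// remaining budget quadratically instead of linearly.
#include <algorithm>
#include <cstdio>

int main() {
    const int parallelism = 128;   // stands in for arch_params.parallelism
    const int factor = 32;         // tiling factor chosen for one stage

    int threads_budget = parallelism;

    // First charge, e.g. where the split factor is applied:
    threads_budget = std::max(threads_budget / factor, 1);  // 128 -> 4

    // Second, redundant charge for the same stage elsewhere:
    threads_budget = std::max(threads_budget / factor, 1);  // 4 -> 1

    // A later test along the lines of `def_par < arch_params.parallelism`
    // now sees far less headroom than the schedule really has.
    std::printf("budget after double counting: %d\n", threads_budget);
    return 0;
}
```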
Sounds good - I'll merge this when green.
@antonysigma do you have a draft PR to open?
@alexreinking Not for the coming week. I am still configuring my machine to simulate the
(Github, unlike Gerrit, does not support daisy-chaining of PRs. Also, it doesn't support `git-diff` of parent & child PRs. Marking this as "Draft PR" to emulate the daisy-chaining feature.)

Depends on PR: #8647.

Set `autoschedule.experimental_gpu_schedule = 1` for selected Mullapudi2016 benchmarking tests in the path `./apps/*/`.

Override the GPU shared memory size estimates for specific algorithm pipelines to satisfy the Github/Buildbot's choice of GPU and hardware specifications:
- bilateral_grid;
- camera_pipe;
- local_laplacian;
- stencil_chain;

Manually tune `autoscheduler.parallelism` from 128 to 4096 for the following pipelines to speed up the benchmarking tests, i.e. to improve the auto-scheduled GPU algorithm by >5X. The default value (=128) is recommended in the original publication; it makes sense for 2016-era GPUs (e.g. Tesla K40), but GPU hardware has improved since then. We are keeping the default value (=128) so as to honor the original publication. (A sketch of passing such overrides is shown after this description.)
- conv_layer;
- lens_blur;
- local_laplacian;

Skip experimental GPU schedules for the following pipelines:
- iir_blur: problems accessing index 5 for images that have only 3 channels (likely a TailStrategy issue);
- nl_means: more than 3 nested levels of `gpu_threads` and/or `gpu_blocks` detected.
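The sketch below, assuming the same C++ `AutoschedulerParams` API as above, shows one way such per-pipeline overrides could be passed. `last_level_cache_size` is an assumption about which knob stands in for the GPU shared-memory estimate, and the values are placeholders rather than the ones used in this PR.

```cpp
// Sketch (assumptions noted in the comments): building per-pipeline
// Mullapudi2016 overrides through the AutoschedulerParams::extra map.
#include "Halide.h"

#include <cstdio>
#include <map>
#include <string>

Halide::AutoschedulerParams make_mullapudi_params(bool experimental_gpu,
                                                  int parallelism,
                                                  long long cache_bytes) {
    std::map<std::string, std::string> extra = {
        // Thread/core budget; 128 is the paper's default discussed above.
        {"parallelism", std::to_string(parallelism)},
        // Assumption: this knob stands in for the GPU shared-memory size
        // estimate being overridden per pipeline.
        {"last_level_cache_size", std::to_string(cache_bytes)},
    };
    if (experimental_gpu) {
        extra["experimental_gpu_schedule"] = "1";
    }
    return Halide::AutoschedulerParams{"Mullapudi2016", extra};
}

int main() {
    // Placeholder values: a larger thread budget and a 48 KiB
    // shared-memory-sized cache estimate.
    Halide::AutoschedulerParams params =
        make_mullapudi_params(/*experimental_gpu=*/true,
                              /*parallelism=*/4096,
                              /*cache_bytes=*/48 * 1024);
    std::printf("autoscheduler: %s, %zu extra params\n",
                params.name.c_str(), params.extra.size());
    return 0;
}
```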