Mullapudi2016-GPU: Reorder to avoid for-loops to be sandwiched between gpu_blocks.
#8647
Conversation
Force-pushed `4a59239` to `893815c` (compare)
After committing `gpu_tiles`, reorder all axes such that the for-loops are inside all `gpu_blocks`. Also limit the `gpu_blocks` count to no more than 3.
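As a rough illustration of what this commit message describes, here is a minimal stand-alone sketch with made-up types (`Kind`, `Dim`, `reorder_blocks_outermost` are all hypothetical, not the PR's implementation): move the `gpu_blocks` dimensions outward so no serial for-loop is sandwiched between them, and cap the block count at 3.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model of a loop-dimension list, outermost first. This is
// not Halide's actual data structure; it only illustrates the intent of
// the commit message.
enum class Kind { Block, Thread, Serial };
struct Dim {
    std::string name;
    Kind kind;
};

// Move every gpu_blocks dimension outward so that no serial for-loop is
// sandwiched between gpu_blocks, then demote any gpu_blocks beyond the
// first three to plain serial loops (GPU APIs expose at most 3 block axes).
void reorder_blocks_outermost(std::vector<Dim> &dims) {
    // stable_partition keeps the relative order within each group.
    std::stable_partition(dims.begin(), dims.end(),
                          [](const Dim &d) { return d.kind == Kind::Block; });
    int block_count = 0;
    for (Dim &d : dims) {
        if (d.kind == Kind::Block && ++block_count > 3) {
            d.kind = Kind::Serial;
        }
    }
}
```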
Force-pushed `893815c` to `0bdd227` (compare)
Avoid force pushing when it's not necessary; it's easier to follow up on your changes when you don't. GitHub knows when somebody last reviewed your code, and shows the diff from then until now. Force pushing rewrites history and makes that feature unavailable.
alexreinking left a comment
Can we figure out what's going on with this call?
```diff
 VarOrRVar seq(seq_var, (rvars.find(seq_var) != rvars.end()));
 if (arch_params.is_gpu_schedule) {
-    gpu_tiling.canReorder({seq, v});
+    gpu_tiling.can_reorder({seq, v});
```
This call is highly suspicious. The other calls set an order for every dimension. But this only reorders two vars within the existing order in the else branch below. When this is called here, it discards other dimensions that might be present.
I realize that I probably should have caught this in the last PR's review.
Yes, even before these changes, the reorder calls inside the `if (nested_parallelism) {...}` block looked suspicious to me. It feels like they perform a bubble sort by abusing Halide's native API and IR: for bubble sorting to work, the Halide IR must provide the list container data structure and a `std::swap`-like operation.
So, if we were to intercept the bubble-sorting behavior in `GPUTilingDedup`, should I implement a class like `ExplicitBubbleSortingHelper`?
Note that the issue exists only inside the `if (nested_parallelism) {...}` block. The rest of the code calls `can_reorder` / `require_ordering` correctly, as intended.
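To make the suspicion concrete, here is a minimal stand-alone model (not the autoscheduler's code; the dimension list and `bubble_to_order` are illustrative) in which each adjacent `std::swap` stands in for one pairwise reorder request:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative model only: sort a dimension list toward a desired order
// using nothing but adjacent swaps, i.e. a bubble sort. Each swap stands
// in for one pairwise reorder(inner, outer) call.
void bubble_to_order(std::vector<std::string> &dims,
                     const std::vector<std::string> &desired) {
    auto rank = [&](const std::string &d) {
        return std::find(desired.begin(), desired.end(), d) - desired.begin();
    };
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
            if (rank(dims[i]) > rank(dims[i + 1])) {
                std::swap(dims[i], dims[i + 1]);  // one pairwise reorder
                swapped = true;
            }
        }
    }
}
```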
@antonysigma - I don't want to prescribe an implementation strategy. At a minimum, I want to be confident the behaviors match and don't regress, even if the actual behavior is less useful or efficient than it could be.
Moving the discussion to #8647 (comment) . Closing.
Can we figure out what's going on with this call?
@alexreinking Sure. The code looked suspicious even before my changes: it feels like the autoscheduler abuses the Halide IR and the reorder function to perform a bubble sort.
Please refer to the inline comments for details.
Changing the PR to draft.
> Avoid force pushing when it's not necessary.
@mcourteaux Ah, sorry about that. I am so used to the Gerrit-style code review process: contributors must always squash-rebase-force-push changes locally for the server to accept new changes. I will avoid force-pushing from now on.
1. Rename `GPUTilingDedup::can_reorder` -> `reorder`.
2. Write a new function `GPUTilingDedup::ensure_ordering` to implement the bubble-sorting helper function.
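A deferred-ordering helper along these lines might look like the following sketch. Only the name `ensure_ordering` comes from the commit message; the class name `OrderingDedup`, the `final_order` accessor, and the merging behavior are assumptions, not the PR's actual code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of a deferred-ordering helper: instead of issuing
// many pairwise reorder calls, record the required relative order and
// emit a single reorder at commit time.
class OrderingDedup {
    std::vector<std::string> required_;  // outermost-first target ordering
public:
    // Record the required relative order; repeated or overlapping calls
    // are deduplicated rather than triggering immediate reorders.
    void ensure_ordering(const std::vector<std::string> &vars) {
        for (const auto &v : vars) {
            if (std::find(required_.begin(), required_.end(), v) ==
                required_.end()) {
                required_.push_back(v);
            }
        }
    }

    // At commit time, one reorder call would be emitted with this list.
    const std::vector<std::string> &final_order() const { return required_; }
};
```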
Force-pushed `51843d7` to `6aad2c8` (compare)
antonysigma left a comment
Thank you for the quick feedback regarding the `nested_parallelism` algorithm's behavior. I implemented a new deferred reorder function. Please refer to the inline comments for details.
alexreinking left a comment
This looks much, much better, thank you! Just a couple outstanding questions and we can get this in.
```cpp
// The nested parallelism implements a bubble sorting algorithm, which
// ensures the inner and outer variables are adjacent to each other.
// Assert the requirement here.
internal_assert(std::abs(std::distance(inner_iter, outer_iter)) == 1);
```
I don't understand fully why this should be true from the code. I think it would be sufficient simply to swap the two dims in place if they're out of order (equivalent to just dropping the assertion).
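The suggestion above can be sketched as follows. This is a stand-alone model, not the PR's code: `fix_pair_order` and the outermost-first `std::vector<std::string>` representation are assumptions made for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the suggestion: instead of asserting that inner and outer
// are adjacent, simply swap the two dims in place when out of order.
// Assumes an outermost-first list, so outer must come before inner.
void fix_pair_order(std::vector<std::string> &dims,
                    const std::string &inner, const std::string &outer) {
    auto inner_iter = std::find(dims.begin(), dims.end(), inner);
    auto outer_iter = std::find(dims.begin(), dims.end(), outer);
    if (inner_iter < outer_iter) {
        std::iter_swap(inner_iter, outer_iter);  // no adjacency required
    }
}
```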
It's to make sure that they are adjacent.
Let me know when you all reach a consensus; I will revise the code accordingly. Again, I prefer small-scale, frequent PRs to move us forward.
Kindly asking for another round of code review. Please refer to the inline comments.
alexreinking left a comment
Looks good to me pending green.
After committing `gpu_tiles`, reorder all axes such that the for-loops are inside all `gpu_blocks`. Also limit the `gpu_blocks` count to no more than 3.
Refer to #8640 for the list of pending actions to robustify Mullapudi2016's experimental GPU schedules.