Mullapudi2016-GPU: Reorder to avoid for-loops to be sandwiched between gpu_blocks.
#8647
Conversation
Force-pushed `4a59239` to `893815c` (compare)
After committing `gpu_tiles`, reorder all axes such that the for-loops are inside all `gpu_blocks`. Also limit the `gpu_blocks` count to no more than 3.
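As a rough illustration of what this commit message describes, here is a minimal stand-alone sketch with made-up types (`Kind`, `Dim`, `reorder_blocks_outermost` are all hypothetical, not the PR's implementation): move the `gpu_blocks` dimensions outward so no serial for-loop is sandwiched between them, and cap the block count at 3.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model of a loop-dimension list, outermost first. This is
// not Halide's actual data structure; it only illustrates the intent of
// the commit message.
enum class Kind { Block, Thread, Serial };
struct Dim {
    std::string name;
    Kind kind;
};

// Move every gpu_blocks dimension outward so that no serial for-loop is
// sandwiched between gpu_blocks, then demote any gpu_blocks beyond the
// first three to plain serial loops (GPU APIs expose at most 3 block axes).
void reorder_blocks_outermost(std::vector<Dim> &dims) {
    // stable_partition keeps the relative order within each group.
    std::stable_partition(dims.begin(), dims.end(),
                          [](const Dim &d) { return d.kind == Kind::Block; });
    int block_count = 0;
    for (Dim &d : dims) {
        if (d.kind == Kind::Block && ++block_count > 3) {
            d.kind = Kind::Serial;
        }
    }
}
```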
Force-pushed `893815c` to `0bdd227` (compare)
Avoid force pushing when it's not necessary; it's easier to follow up on your changes when you don't. GitHub knows when somebody last reviewed your code, and shows the diff from then until now. Force pushing rewrites history and makes that feature unavailable.
alexreinking left a comment
Can we figure out what's going on with this call?
```diff
 VarOrRVar seq(seq_var, (rvars.find(seq_var) != rvars.end()));
 if (arch_params.is_gpu_schedule) {
-    gpu_tiling.canReorder({seq, v});
+    gpu_tiling.can_reorder({seq, v});
```
This call is highly suspicious. The other calls set an order for every dimension. But this only reorders two vars within the existing order in the else branch below. When this is called here, it discards other dimensions that might be present.
I realize that I probably should have caught this in the last PR's review.
Yes, even before these changes, the reorder calls inside the `if (nested_parallelism) {...}` block looked suspicious to me. It feels like they perform a bubble sort by abusing Halide's native API and IR: for bubble sorting to work, the Halide IR must provide the list container data structure and a `std::swap`-like operation.
So, if we were to intercept the bubble-sorting behavior in `GPUTilingDedup`, should I implement a class like `ExplicitBubbleSortingHelper`?
Note that the issue exists only inside the `if (nested_parallelism) {...}` block. The rest of the code calls `can_reorder` / `require_ordering` correctly, as intended.
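To make the suspicion concrete, here is a minimal stand-alone model (not the autoscheduler's code; the dimension list and `bubble_to_order` are illustrative) in which each adjacent `std::swap` stands in for one pairwise reorder request:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative model only: sort a dimension list toward a desired order
// using nothing but adjacent swaps, i.e. a bubble sort. Each swap stands
// in for one pairwise reorder(inner, outer) call.
void bubble_to_order(std::vector<std::string> &dims,
                     const std::vector<std::string> &desired) {
    auto rank = [&](const std::string &d) {
        return std::find(desired.begin(), desired.end(), d) - desired.begin();
    };
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
            if (rank(dims[i]) > rank(dims[i + 1])) {
                std::swap(dims[i], dims[i + 1]);  // one pairwise reorder
                swapped = true;
            }
        }
    }
}
```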
@antonysigma - I don't want to prescribe an implementation strategy. At a minimum, I want to be confident the behaviors match and don't regress, even if the actual behavior is less useful or efficient than it could be.
Moving the discussion to #8647 (comment) . Closing.
Can we figure out what's going on with this call?
@alexreinking Sure. The code looked suspicious even before my changes: it feels like the autoscheduler abuses the Halide IR and the reorder function to perform a bubble sort.
Please refer to the inline comments for details.
Changing the PR to draft.
> Avoid force pushing when it's not necessary.
@mcourteaux Ah, sorry about that. I am so used to the Gerrit-style code review process: contributors must always squash-rebase-force-push changes locally for the server to accept new changes. I will avoid force-pushing from now on.
1. Rename `GPUTilingDedup::can_reorder` -> `reorder`.
2. Write a new function `GPUTilingDedup::ensure_ordering` to implement the bubble-sorting helper function.
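A deferred-ordering helper along these lines might look like the following sketch. Only the name `ensure_ordering` comes from the commit message; the class name `OrderingDedup`, the `final_order` accessor, and the merging behavior are assumptions, not the PR's actual code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of a deferred-ordering helper: instead of issuing
// many pairwise reorder calls, record the required relative order and
// emit a single reorder at commit time.
class OrderingDedup {
    std::vector<std::string> required_;  // outermost-first target ordering
public:
    // Record the required relative order; repeated or overlapping calls
    // are deduplicated rather than triggering immediate reorders.
    void ensure_ordering(const std::vector<std::string> &vars) {
        for (const auto &v : vars) {
            if (std::find(required_.begin(), required_.end(), v) ==
                required_.end()) {
                required_.push_back(v);
            }
        }
    }

    // At commit time, one reorder call would be emitted with this list.
    const std::vector<std::string> &final_order() const { return required_; }
};
```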
Force-pushed `51843d7` to `6aad2c8` (compare)
antonysigma left a comment
Thank you for the quick feedback regarding the `nested_parallelism` algorithm's behavior. I implemented a new deferred reorder function. Please refer to the inline comments for details.
alexreinking left a comment
This looks much, much better, thank you! Just a couple outstanding questions and we can get this in.
```cpp
// The nested parallelism implements a bubble sorting algorithm, which
// ensures the inner and outer variables are adjacent to each other.
// Assert the requirement here.
internal_assert(std::abs(std::distance(inner_iter, outer_iter)) == 1);
```
I don't understand fully why this should be true from the code. I think it would be sufficient simply to swap the two dims in place if they're out of order (equivalent to just dropping the assertion).
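The suggestion above can be sketched as follows. This is a stand-alone model, not the PR's code: `fix_pair_order` and the outermost-first `std::vector<std::string>` representation are assumptions made for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the suggestion: instead of asserting that inner and outer
// are adjacent, simply swap the two dims in place when out of order.
// Assumes an outermost-first list, so outer must come before inner.
void fix_pair_order(std::vector<std::string> &dims,
                    const std::string &inner, const std::string &outer) {
    auto inner_iter = std::find(dims.begin(), dims.end(), inner);
    auto outer_iter = std::find(dims.begin(), dims.end(), outer);
    if (inner_iter < outer_iter) {
        std::iter_swap(inner_iter, outer_iter);  // no adjacency required
    }
}
```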
It's to make sure that they are adjacent.
Let me know when you all reach a consensus; I will revise the code accordingly. Again, I prefer small-scale, frequent PRs to move us forward.
Kindly asking for another round of code review. Please refer to the inline comments.
alexreinking left a comment
Looks good to me pending green.
After committing `gpu_tiles`, reorder all axes such that the for-loops are inside all `gpu_blocks`. Also limit the `gpu_blocks` count to no more than 3.
Refer to #8640 for the list of pending actions to robustify Mullapudi2016's experimental GPU schedules.