[inductor] refine loop split logic #128812
base: gh/zhuhaozhe/39/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128812
Note: Links to docs will display an error until the docs builds have been completed.
❌ 11 New Failures. As of commit 36213d7 with merge base 32a3dbc, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: a0ffb42b1c0b2159b72f278aa4184ab75325cd03 Pull Request resolved: #128812
ghstack-source-id: ae8e67d681d811c0cd0ed703d186ddbe8e39f854 Pull Request resolved: #128812
ghstack-source-id: ff1dcca4bbb2cf3100f86bf622b492f73df3ad16 Pull Request resolved: #128812
ghstack-source-id: 39d237a5cf04be275029125ef488469b2f430dda Pull Request resolved: #128812
ghstack-source-id: 6baf7b0426bbcc1ea0c06180b393ecb4619bb59d Pull Request resolved: #128812
ghstack-source-id: 8254f219519f68724f941713938b04d9d44c53ac Pull Request resolved: #128812
ghstack-source-id: 470238141e894f1cd0ea1c798987c229020dccf4 Pull Request resolved: #128812
ghstack-source-id: ceb03c79c58a58e489f216df12556cc559db904d Pull Request resolved: #128812
ghstack-source-id: af9d8dc8e5e77cfa9c203081e20cafd17569b38c Pull Request resolved: #128812
ghstack-source-id: 8cb091acab68b47a147e84d64a1d22bfa203ad02 Pull Request resolved: #128812
I left some early comments on the LoopNest and LoopLevel changes. I still need more time to review the rest.
@@ -4872,7 +5049,7 @@ def lines(self):

 @dataclasses.dataclass
-class LoopNestWithSplit:
+class LoopNest:
     """
     A loop-nest like structure but with some loop level split along
     the loop range into the main tiling loop and the tail. It is built
Does this need amendment?
Updated
else:
    loop_nest.kernel = kernel

loop_nest = LoopNest(loops)
Do we want to set the `kernel` field of `LoopNest` here too?
torch/_inductor/codegen/cpp.py
parent: Optional["LoopLevel"] = None
# the next inner level of the loop, empty if it is inner-most
# contains >1 LoopLevel if the inner level of loop is split
inner: List["LoopLevel"] = dataclasses.field(default_factory=list)
# kernel assigned to this loop level, only valid when it is a leaf
kernel: Optional[CppKernel] = None
Do we still need it considering each LoopNest only has a single kernel now?
Yes. In this PR, we maintain all kinds of kernels (`CppKernel`, `CppVecKernel`, `CppTile2dKernel`) in one `CppKernelProxy`, and assert that the kernel is a `CppKernelProxy` here: https://github.com/pytorch/pytorch/pull/128812/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2077
I mean, why can't we just use the `kernel` object from `LoopNest`? Why do we still have to keep a `kernel` object in the `LoopLevel`?
torch/_inductor/codegen/cpp.py
inner_loop_clone.parent = loop
loop.inner.append(inner_loop_clone)
loop.kernel = deepcopy(self.kernel)
def split_with_tiling(self, factor):
"split" is an op that splits the loop into two but this function doesn't seem to do so, right? It seems to create a vectorized loop level?
Thanks for the reminder; renamed to `vectorized_with_tiling`.
Did you commit the change? Also, would the name `tile` sound better? Please also amend the code doc for the related functions.
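As a rough illustration of the distinction being discussed (the loop bounds of 14010 with a factor of 16 and the function names are assumptions, not taken from this PR), a conventional "split" versus re-stepping a single level looks like:

```cpp
// "Split" in the usual sense: one loop becomes two loops over disjoint ranges.
void split_version(float *a) {
    for (long i = 0; i < 14000; i += 16) { a[i] += 1.f; }  // main part, vector step
    for (long i = 14000; i < 14010; ++i) { a[i] += 1.f; }  // tail part, scalar
}

// What the helper appears to do instead: keep a single loop level and only change
// its step (i.e. tile/vectorize it); the tail is handled elsewhere.
void tiled_version(float *a) {
    for (long i = 0; i < 14010; i += 16) { a[i] += 1.f; }  // one level, step = factor
}
```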
torch/_inductor/codegen/cpp.py
dtype: torch.dtype,
init_fn,
):
# gen preduction prefix
- # gen preduction prefix
+ # gen reduction prefix
torch/_inductor/codegen/cpp.py
stack.enter_context(code.indent())
-if loop_nest.root:
+if loop_nest.loops:
Do we still need this check here?
torch/_inductor/codegen/cpp.py
kernel.gen_body(code)

def get_reduction_prefix_suffix(kernel, parallel=False, buffer="prefix"):
    if buffer == "suffix":
Why not use a boolean flag here?
torch/_inductor/codegen/cpp.py
    gen_loops(loop.inner, loop.is_reduction)
else:
    gen_loop_kernel(loop)
gen_loop_nest(_loop_nest, depth, loop.is_reduction)
nit: you don't have to do `depth += 1` separately.
- gen_loop_nest(_loop_nest, depth, loop.is_reduction)
+ gen_loop_nest(_loop_nest, depth + 1, loop.is_reduction)
torch/_inductor/codegen/cpp.py
tiling_idx = FloorDiv(loop.size, sympy_factor) * sympy_factor
loop.steps = sympy_factor
loop.simd_vec = True
loop.tiling_offset = tiling_idx
Suggest using `loop.tiled_size`.
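As a quick illustration of the quantity being named (the numbers below are assumptions): `tiling_offset`/`tiled_size` is `floor(size / factor) * factor`, the point where the vectorized main range ends and the tail begins.

```cpp
#include <cstdio>

int main() {
    long size = 14010, factor = 16;
    long tiled_size = (size / factor) * factor;  // FloorDiv(size, factor) * factor = 14000
    // main (vectorized) range: [0, tiled_size), tail range: [tiled_size, size)
    std::printf("main: [0, %ld)  tail: [%ld, %ld)\n", tiled_size, tiled_size, size);
    return 0;
}
```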
@@ -1714,11 +1782,14 @@ class CppKernel(Kernel):

    def __init__(self, args, num_threads):
        super().__init__(args)
        self.active_ranges: dict[sympy.Expr, Tuple[sympy.Expr, ...]] = {}
        self.inner_itervars: List[sympy.Symbol] = []
ditto
torch/_inductor/codegen/cpp.py
self.call_ranges: Optional[Tuple[sympy.Expr, ...]] = None
self.ranges: List[sympy.Expr] = []
self.itervars: List[sympy.Symbol] = []
self.reduction_depth = None
self.reduction_prefix = IndentedBuffer()
self.reduction_prefix_fn: List[Callable] = []  # type: ignore[type-arg]
- self.reduction_prefix_fn: List[Callable] = []  # type: ignore[type-arg]
+ self.reduction_prefix_generators: List[Callable] = []  # type: ignore[type-arg]
torch/_inductor/codegen/cpp.py
@@ -239,6 +238,101 @@ def reduction_project(reduction_type, acc):
    return acc


def transform_kernel_codes_under_inner_loop(
Is `move_code_under_inner_loop` simpler?
reduction_vars = tail_loop_kernel.reduction_var_names
for name in reduction_vars:
    new_name = f"{name}_arr[{outer_loop.var}_tail - {cexpr_index(outer_loop.tiling_offset)}]"
    replace_acc_name(tail_loop_kernel.stores, name, new_name)
Do we still need the logic of replacing a generated buffer after introducing the design of lazy generation with callbacks?
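For context, a generic sketch of what "lazy generation with callbacks" can mean here (the C++ types and strings are made up to show the idea; the PR's actual mechanism is the `reduction_prefix_fn` list in Python): the prefix is rendered from stored callables only once the final accumulator form is known, instead of patching an eagerly generated buffer afterwards.

```cpp
#include <functional>
#include <string>
#include <vector>

int main() {
    // Store generators instead of pre-rendered text; decide scalar vs. tail-array
    // accumulators at render time.
    std::vector<std::function<std::string(bool)>> prefix_generators;
    prefix_generators.push_back([](bool tail) {
        return tail ? std::string("float tmp_acc_arr[16] = {0};")
                    : std::string("float tmp_acc = 0;");
    });

    std::string prefix;
    for (const auto &gen : prefix_generators) prefix += gen(/*tail=*/true) + "\n";
    return 0;
}
```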
ghstack-source-id: d888c7594d9013e594f1e317cb1d2486acb481e6 Pull Request resolved: #128812
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
ghstack-source-id: e47cfd0541b61da2496bbfdd74ea1420035de280 Pull Request resolved: #128812
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
ghstack-source-id: f9e59d934c7bda4fbf166e70d444772d0b6ca1b7 Pull Request resolved: #128812
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
ghstack-source-id: f9193cb0534abf568ae32986ca7f4e7817c3bd9b Pull Request resolved: #128812
This PR aims to improve parallelization by collapsing the vectorized loop. #122281
For such a case, the parallel level is only 2, and the vectorized loop cannot be collapsed. After this PR, we will generate code like the sketch below.
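Since the generated-code listings from the original description are not reproduced above, here is a minimal sketch of the idea. It assumes an outer loop of 2 iterations and an inner loop of 14010 elements vectorized by 16; the names (x0, x1) and the elementwise body are made up, so this is an illustration rather than the actual Inductor output.

```cpp
// Before: the vectorized inner loop is emitted as a main loop plus a separate
// tail loop, so only the outer loop (2 iterations) can be parallelized:
//
//   #pragma omp parallel for
//   for (long x0 = 0; x0 < 2; ++x0) {
//       for (long x1 = 0; x1 < 14000; x1 += 16) { /* vectorized body */ }
//       for (long x1 = 14000; x1 < 14010; ++x1) { /* scalar tail */ }
//   }
//
// After: the inner level stays a single loop, so both levels can be collapsed
// and the tail is selected inside the body:
void kernel_sketch(float *out, const float *in) {
    #pragma omp parallel for collapse(2)
    for (long x0 = 0; x0 < 2; ++x0) {
        for (long x1 = 0; x1 < 14010; x1 += 16) {
            if (x1 < 14000) {
                for (long i = 0; i < 16; ++i)  // stands in for one vector operation
                    out[x0 * 14010 + x1 + i] = in[x0 * 14010 + x1 + i] + 1.f;
            } else {
                for (long t = x1; t < 14010; ++t)  // scalar tail
                    out[x0 * 14010 + t] = in[x0 * 14010 + t] + 1.f;
            }
        }
    }
}
```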
Highlight
For the reduction case, we have some side effect here.
For the case sketched below, we vectorize the `x1` dim and reduce along the `x2` dim. After the collapse, the loop order becomes `x1 -> x2 -> x1_tail_part`, so we need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. For the `reduction_stores`, we also need to check `x1`'s value, just as we do in the loop body, since the `reduction_stores` happen between the `x1` and `x2` loops.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang