Make warp size dynamic#43
Conversation
|
@tpopp @jeffdaily this isn't intended to be submitted as is, but I'd be interested in your thoughts |
|
For the navi compilation failures due to static_assert. It might seem a bit hacky, but consider the following. In the gpu cache implementation, what if you conditionalize the various methods with #if defined(__GFX9__) || WARP_SIZE == 32
<method body>
#endifThis would have the effect of gfx9 targets like gfx90a/gfx942 would compile actual code when you instantiate the gpu cache templates with either 32 and 64 WARP_SIZE, but for all other (navi) gfx targets any gpu cache template with 64 WARP_SIZE would ifdef away as a no-op empty implementation. |
|
It seems like then we're just back to warp size being a compile-time constant with extra steps though? We could just make RDNA seems to be the main reason for that, since it can have a dynamic warp size. From the docs:
https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html#id13 AFAICT this is something that's decided at runtime
https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size
|
|
It's I haven't fully read this code so I don't have a full understanding of nv_gpu_cache yet. Is there any way to write it without using a WARP_SIZE template parameter at all? Can you query the device props for warpSize inside of nv_gpu_cache and calculate any sizes or variables dynamically, including any memory allocations? |
|
But |
|
I'm inclined to file an issue in https://github.com/ROCm/clr asking how this is supposed to work, unless you see something I'm missing. It looks to me like someone has declared dynamic warp sizes the way-you-should-do-things, but much of HIP/ROCm itself is actually not ready for it. Maybe just an ifdef on architectures is the way to go for now until that's resolved. |
|
Please file the issue, thanks. |
|
Filed ROCm/clr#154 |
|
Oh dear, I just read all of ROCm/ROCm#4121 and it appears that this is a case where there is no good replacement. It also seems like maybe HIP doesn't support RDNA in wave-64 mode at all, which I guess explains the compile-time error. But the official recommendation is somehow that every user of ROCm should define their own macro for this, which doesn't really make a whole lot of sense to me |
|
I think I also see the value now of your suggestion to ifdef the kernel implementation. That keeps it dynamic on the host, but also avoids compiling incorrect warp sizes on device. That still seems pretty gross though... |
We hit static asserts if we do this, in addition to it extending compile times. It turns out that RDNA with warp size 64 isn't even supported with HIP also. I tried various methods and landed on `if constexpr` comparing against the `warpSize` builtin. This is allegedly going to become not constexpr, but it remains quite unclear to me what exactly is going on. For instance `::rocm::device_warp_size()` may also become not constexpr. Until there's some clarity, I'm going to use the established `warpSize` builtin rather than adding an additional macro that may be no better. I also preserved the ability to use a warp size smaller than the device native warp size. The README and some checks in the code imply that this is allowed. I haven't actually seen that work and in the original codes templates with a different warp size aren't even instantiated (and therefore fail to link), but leaving that functionality in place just in case. I ended up not running clang-format because it did a lot of extraneous reformatting. I think keeping the diff here with upstream small is a higher priority than strict line length limit adherence.
|
Alrighty, I got something working, although it's using |
tpopp
left a comment
There was a problem hiding this comment.
Just approving what's here. It looks clean to me other than some small amounts of possible code reuse, but 2 warp sizes doesn't necessitate it I think.
Regarding whitespace changes, please double check that you're formatting with any configuration that the repository might have to avoid future merge-conflict annoyances.
| key = d_keys[key_idx]; | ||
| src_set = set_hasher::hash(key) % capacity_in_set; | ||
| src_slab = slab_hasher::hash(key) % set_associativity; | ||
| assert(warp_size <= warpSize); |
There was a problem hiding this comment.
Nit: One of these should go it seems like. Also static_assert if the assert stays I believe.
There was a problem hiding this comment.
Unfortunately not. The reason for this weirdness is that HIP has a static_assert that the tile size is not greater than the device warp size (e.g. for cg::thread_block_tile). We need the if constexpr to avoid compiling code where that is the case. On the host, we have to ensure we're not calling a kernel with the wrong warp_size for the device, which we do by looking it up at runtime and using a switch, so this should never get called with warp_size != warpSize, but the assert is there to make it more debuggable if something gets messed up. If it was a static_assert we'd immediately get a compilation error again.
The issue is that I introduced if conditions around a bunch of things, so they got indented |
|
@jeffdaily any further thoughts? |
|
Unless I missed them, I didn't see whitespace changes in other files. I only saw it in third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu which was pretty extensive. If it doesn't break some sort of linter that's part of CI, perhaps you could add the if constexpr at the same indentation level as the rest which would resolve the whitespace-only-ish changes. Since this file is in third_party, is a it a git submodule? Do we have any hope of filing a PR against HugeCTR to resolve this? |
There's no CI, so that's not going to be a problem 😆 I guess it's just a question of how much we want to prioritize minimizing the diff vs making the code readable and avoiding accidental reformatting in the future.
No, it's vendored from an Nvidia repo: https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/gpu_cache. The diff here is vs our hipified port. I think this is only an issue for ROCm, so I doubt they'd be interested in a PR. |
Rather than using a compile-time constant macro, we use a runtime variable, per https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_porting_guide.html#warpsize, https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size, and https://github.com/ROCm/ROCm/releases/tag/rocm-6.4.0#:~:text=AMDGPU%20wavefront%20size%20compiler%20macro%20deprecation.
DO NOT SUBMIT: This is a draft on top of the already hipified code. It would need to be converted into the necessary pre-hip changes and rebased on the hip-ready branch before merging.
There are a few places, particularly around the GPU cache, that really want this to be a compile time constant and use it (or derived values) as array sizes and such. Here I basically used switches to select at runtime one of the two supported warp sizes. There are some neat tricks for doing this demonstrated in https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/reduction.html, but unfortunately they require C++20 features. I did at some point create a wrapper macro to avoid duplicating code on each branch of the switch, but this ended up only being used in one place, so I dropped it.
Another tricky thing is that DGL's objects that are exposed in Python really want to be a single type so using a templated class creates issues. There were some existing workarounds for this for the GPU cache key type, but adding another template parameter would have again doubled the branches in every method and also it turns out to be quite difficult to correctly pass a templated class name to a function-like macro because it splits the argument at the comma (and you can't have class declarations inside parentheses). So instead I created non-templated virtual base classes and used containers for those and then a single switch to create the correct type on construction.
Since I was already markedly changing the type declarations for the nv_gpu_cache, I went ahead and changed the key types from
unsigned intandlong longtouint32_tanduint64_t. This allowed removing some static asserts DGL had to do about the size of the types and it seems much more sensible to use fixed-size types here.After all this, I tried compiling for gfx1100 and it still fails. There are apparently multiple places in HIP that static assert that the warp size isn't too big:
Compilation errors for gfx1100
So even instantiating these templates with warp size 64 causes failures. That code is just checking it against the now-deprecated
__AMDGCN_WAVEFRONT_SIZE. I think that means that it is actually not possible to make warp size a runtime decision right now as constructed by HIP itself 🤦 I'm also confused because I thought RDNA was supposed to support warp sizes of 32 or 64 depending on some mode. If I drop the explicit template instantiations in nv_gpu_cache.cu then I can get compilation to complete, but get later linking errors because the DGL code can't find those templates that were never actually created.As a side note, there's a compilation warning that appears to be triggered by just compiling libhipcxx for gfx1100 at all, which means it can't be compiled with
-Werror. I'm not sure how to get around this.But on the MI210 gfx90a, all the C++ and Python unit tests pass.