Skip to content

Make warp size dynamic#43

Open
GMNGeoffrey wants to merge 2 commits into
nod-ai:hipify-inplacefrom
GMNGeoffrey:dynamic-warp-size-hipified
Open

Make warp size dynamic#43
GMNGeoffrey wants to merge 2 commits into
nod-ai:hipify-inplacefrom
GMNGeoffrey:dynamic-warp-size-hipified

Conversation

@GMNGeoffrey

@GMNGeoffrey GMNGeoffrey commented Apr 17, 2025

Copy link
Copy Markdown
Collaborator

Rather than using a compile-time constant macro, we use a runtime variable, per https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_porting_guide.html#warpsize, https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size, and https://github.com/ROCm/ROCm/releases/tag/rocm-6.4.0#:~:text=AMDGPU%20wavefront%20size%20compiler%20macro%20deprecation.

DO NOT SUBMIT: This is a draft on top of the already hipified code. It would need to be converted into the necessary pre-hip changes and rebased on the hip-ready branch before merging.

There are a few places, particularly around the GPU cache, that really want this to be a compile time constant and use it (or derived values) as array sizes and such. Here I basically used switches to select at runtime one of the two supported warp sizes. There are some neat tricks for doing this demonstrated in https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/reduction.html, but unfortunately they require C++20 features. I did at some point create a wrapper macro to avoid duplicating code on each branch of the switch, but this ended up only being used in one place, so I dropped it.

Another tricky thing is that DGL's objects that are exposed in Python really want to be a single type so using a templated class creates issues. There were some existing workarounds for this for the GPU cache key type, but adding another template parameter would have again doubled the branches in every method and also it turns out to be quite difficult to correctly pass a templated class name to a function-like macro because it splits the argument at the comma (and you can't have class declarations inside parentheses). So instead I created non-templated virtual base classes and used containers for those and then a single switch to create the correct type on construction.

Since I was already markedly changing the type declarations for the nv_gpu_cache, I went ahead and changed the key types from unsigned int and long long to uint32_t and uint64_t. This allowed removing some static asserts DGL had to do about the size of the types and it seems much more sensible to use fixed-size types here.

After all this, I tried compiling for gfx1100 and it still fails. There are apparently multiple places in HIP that static assert that the warp size isn't too big:

Compilation errors for gfx1100
In file included from ../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:18:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:660:17: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
  660 |   static_assert(is_valid_tile_size<size>::value,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:746:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here
  746 | class thread_block_tile_type : public thread_block_tile_base<tileSize>,
      |                                       ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:843:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64, cooperative_groups::thread_block>' requested here
  843 | class thread_block_tile_internal : public thread_block_tile_type<size, ParentCGTy> {
      |                                           ^
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:856:34: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, cooperative_groups::thread_block>' requested here
  856 | class thread_block_tile : public impl::thread_block_tile_internal<size, ParentCGTy> {
      |                                  ^
../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:267:7: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, cooperative_groups::thread_block>' requested here
  267 |       cg::tiled_partition<warp_size>(cg::this_thread_block());
      |       ^
../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:1431:3: note: in instantiation of function template specialization 'gpu_cache::get_kernel<unsigned int, unsigned long, cuda::atomic<unsigned long, cuda::std::__detail::thread_scope_device>, gpu_cache::slab_set<2, unsigned int, 64>, MurmurHash3_32<unsigned int>, Mod_Hash<unsigned int, unsigned long>, cuda::counting_semaphore<cuda::std::__detail::thread_scope_device, 1>, 4294967295U, 2, 64>' requested here
 1431 |   get_kernel<key_type, ref_counter_type, atomic_ref_counter_type, slabset, set_hasher, slab_hasher,
      |   ^
In file included from ../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:18:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/hip_cooperative_groups.h:38:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:899:17: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tiled partition with size > wavefront size. Currently not supported
  899 |   static_assert(is_valid_tile_size<size>::value,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:267:11: note: in instantiation of function template specialization 'cooperative_groups::tiled_partition<64U, cooperative_groups::thread_block>' requested here
  267 |       cg::tiled_partition<warp_size>(cg::this_thread_block());
      |           ^
../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:1431:3: note: in instantiation of function template specialization 'gpu_cache::get_kernel<unsigned int, unsigned long, cuda::atomic<unsigned long, cuda::std::__detail::thread_scope_device>, gpu_cache::slab_set<2, unsigned int, 64>, MurmurHash3_32<unsigned int>, Mod_Hash<unsigned int, unsigned long>, cuda::counting_semaphore<cuda::std::__detail::thread_scope_device, 1>, 4294967295U, 2, 64>' requested here
 1431 |   get_kernel<key_type, ref_counter_type, atomic_ref_counter_type, slabset, set_hasher, slab_hasher,
      |   ^

So even instantiating these templates with warp size 64 causes failures. That code is just checking it against the now-deprecated __AMDGCN_WAVEFRONT_SIZE. I think that means that it is actually not possible to make warp size a runtime decision right now as constructed by HIP itself 🤦 I'm also confused because I thought RDNA was supposed to support warp sizes of 32 or 64 depending on some mode. If I drop the explicit template instantiations in nv_gpu_cache.cu then I can get compilation to complete, but get later linking errors because the DGL code can't find those templates that were never actually created.

As a side note, there's a compilation warning that appears to be triggered by just compiling libhipcxx for gfx1100 at all, which means it can't be compiled with -Werror. I'm not sure how to get around this.

In file included from ../../../third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu:21:
In file included from ../../../third_party/HugeCTR/gpu_cache/include/nv_gpu_cache.hpp:26:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/cuda/atomic:14:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/cuda/std/atomic:14:
In file included from /opt/rocm-6.4.0/lib/llvm/bin/../../../include/cuda/std/detail/__config:43:
/opt/rocm-6.4.0/lib/llvm/bin/../../../include/cuda/std/detail/libcxx/include/__config:2232:2: error: Assuming 100 MHz realtime clock rate (TSC) for gfx1100/gfx1101 (according to the RDNA3 ISA). Timing-related APIs (e.g., chrono) or sleep instructions may behave incorrectly! [-Werror,-W#warnings]
 2232 | #warning Assuming 100 MHz realtime clock rate (TSC) for gfx1100/gfx1101 (according to the RDNA3 ISA). Timing-related APIs (e.g., chrono) or sleep instructions may behave incorrectly!
      |  ^
1 error generated when compiling for gfx1100

But on the MI210 gfx90a, all the C++ and Python unit tests pass.

@GMNGeoffrey GMNGeoffrey requested a review from jeffdaily April 17, 2025 22:49
@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

@tpopp @jeffdaily this isn't intended to be submitted as is, but I'd be interested in your thoughts

@jeffdaily

jeffdaily commented Apr 18, 2025

Copy link
Copy Markdown
Collaborator

For the navi compilation failures due to static_assert. It might seem a bit hacky, but consider the following. In the gpu cache implementation, what if you conditionalize the various methods with

   #if defined(__GFX9__) || WARP_SIZE == 32
   <method body>
   #endif

This would have the effect of gfx9 targets like gfx90a/gfx942 would compile actual code when you instantiate the gpu cache templates with either 32 and 64 WARP_SIZE, but for all other (navi) gfx targets any gpu cache template with 64 WARP_SIZE would ifdef away as a no-op empty implementation.

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

It seems like then we're just back to warp size being a compile-time constant with extra steps though? We could just make DGL_WARP_SIZE a slightly more sophisticated macro that checks the specific architecture, but the HIP/ROCm docs say that you should stop assuming warp size is a compile-time constant at all.

RDNA seems to be the main reason for that, since it can have a dynamic warp size. From the docs:

RDNA architectures have a configurable wavefront size. The native wavefront size is 32, but they can run in “CU mode”, which has an effective wavefront size of 64. This affects the number of resident wavefronts and blocks per compute Unit.

https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html#id13

AFAICT this is something that's decided at runtime

RDNA architectures [warp size] can even differ between kernel launches, depending on whether they run in CU or WGP mode... Since warpSize can differ between devices, it can not be assumed to be a compile-time constant on the host

https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-size

warpSize will only be available as a non-constexpr variable. Where required,
the wavefront size should be queried via the warpSize variable in device code,
or via hipGetDeviceProperties in host code. Neither of these will result in a compile-time constant.

https://github.com/ROCm/ROCm/releases/tag/rocm-6.4.0#:~:text=AMDGPU%20wavefront%20size%20compiler%20macro%20deprecation

@jeffdaily

Copy link
Copy Markdown
Collaborator

It's warpSize which is no longer going to be constexpr. It's definition today is here. warpSize should really only be used in device code because it is dynamic there. You cannot use it outside of global or device code blocks. So if you have some host/device structure like nv_gpu_cache that was written to depend on or assume a compile-time constant value for warp size that's where you run into trouble like this.

I haven't fully read this code so I don't have a full understanding of nv_gpu_cache yet. Is there any way to write it without using a WARP_SIZE template parameter at all? Can you query the device props for warpSize inside of nv_gpu_cache and calculate any sizes or variables dynamically, including any memory allocations?

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

But cg::tiled_partition takes the tile size as a template parameter and that's where the static assert is. I don't understand how it would even be possible to use with a dynamic warp size if there's a static assert saying it has to be <=32. I was following the example in https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/reduction.html where they promote warp size to a compile-time constant via a switch on the host. The gpu cache also uses warp size for the size of an array and there's another place in DGL that uses something derived from warp size for the same.

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

I'm inclined to file an issue in https://github.com/ROCm/clr asking how this is supposed to work, unless you see something I'm missing. It looks to me like someone has declared dynamic warp sizes the way-you-should-do-things, but much of HIP/ROCm itself is actually not ready for it. Maybe just an ifdef on architectures is the way to go for now until that's resolved.

@jeffdaily

Copy link
Copy Markdown
Collaborator

Please file the issue, thanks.

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

Filed ROCm/clr#154

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

Oh dear, I just read all of ROCm/ROCm#4121 and it appears that this is a case where there is no good replacement. It also seems like maybe HIP doesn't support RDNA in wave-64 mode at all, which I guess explains the compile-time error. But the official recommendation is somehow that every user of ROCm should define their own macro for this, which doesn't really make a whole lot of sense to me

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

I think I also see the value now of your suggestion to ifdef the kernel implementation. That keeps it dynamic on the host, but also avoids compiling incorrect warp sizes on device. That still seems pretty gross though...

We hit static asserts if we do this, in addition to it extending compile
times. It turns out that RDNA with warp size 64 isn't even supported
with HIP also. I tried various methods and landed on `if constexpr`
comparing against the `warpSize` builtin. This is allegedly going to
become not constexpr, but it remains quite unclear to me what exactly is
going on. For instance `::rocm::device_warp_size()` may also become not
constexpr. Until there's some clarity, I'm going to use the established
`warpSize` builtin rather than adding an additional macro that may be no
better.

I also preserved the ability to use a warp size smaller than the device
native warp size. The README and some checks in the code imply that this
is allowed. I haven't actually seen that work and in the original codes
templates with a different warp size aren't even instantiated (and
therefore fail to link), but leaving that functionality in place just in
case.

I ended up not running clang-format because it did a lot of extraneous
reformatting. I think keeping the diff here with upstream small is a
higher priority than strict line length limit adherence.
@GMNGeoffrey

GMNGeoffrey commented Apr 21, 2025

Copy link
Copy Markdown
Collaborator Author

Alrighty, I got something working, although it's using warpSize as a compile-time constant on device, which is apparently deprecated. I'm waiting for more clarity on what isn't deprecated before I switch. But this now compiles for gfx1100 modulo that one annoying warning directive. Strongly recommend hiding whitespace changes for review: https://github.com/nod-ai/dgl/pull/43/files?w=1

@tpopp tpopp left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just approving what's here. It looks clean to me other than some small amounts of possible code reuse, but 2 warp sizes doesn't necessitate it I think.

Regarding whitespace changes, please double check that you're formatting with any configuration that the repository might have to avoid future merge-conflict annoyances.

key = d_keys[key_idx];
src_set = set_hasher::hash(key) % capacity_in_set;
src_slab = slab_hasher::hash(key) % set_associativity;
assert(warp_size <= warpSize);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: One of these should go it seems like. Also static_assert if the assert stays I believe.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. The reason for this weirdness is that HIP has a static_assert that the tile size is not greater than the device warp size (e.g. for cg::thread_block_tile). We need the if constexpr to avoid compiling code where that is the case. On the host, we have to ensure we're not calling a kernel with the wrong warp_size for the device, which we do by looking it up at runtime and using a switch, so this should never get called with warp_size != warpSize, but the assert is there to make it more debuggable if something gets messed up. If it was a static_assert we'd immediately get a compilation error again.

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

Regarding whitespace changes, please double check that you're formatting with any configuration that the repository might have to avoid future merge-conflict annoyances.

The issue is that I introduced if conditions around a bunch of things, so they got indented

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

@jeffdaily any further thoughts?

@GMNGeoffrey GMNGeoffrey marked this pull request as ready for review May 7, 2025 21:12
@jeffdaily

Copy link
Copy Markdown
Collaborator

Unless I missed them, I didn't see whitespace changes in other files. I only saw it in third_party/HugeCTR/gpu_cache/src/nv_gpu_cache.cu which was pretty extensive. If it doesn't break some sort of linter that's part of CI, perhaps you could add the if constexpr at the same indentation level as the rest which would resolve the whitespace-only-ish changes.

Since this file is in third_party, is a it a git submodule? Do we have any hope of filing a PR against HugeCTR to resolve this?

@GMNGeoffrey

Copy link
Copy Markdown
Collaborator Author

If it doesn't break some sort of linter that's part of CI, perhaps you could add the if constexpr at the same indentation level as the rest which would resolve the whitespace-only-ish changes.

There's no CI, so that's not going to be a problem 😆 I guess it's just a question of how much we want to prioritize minimizing the diff vs making the code readable and avoiding accidental reformatting in the future.

Since this file is in third_party, is a it a git submodule? Do we have any hope of filing a PR against HugeCTR to resolve this?

No, it's vendored from an Nvidia repo: https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/gpu_cache. The diff here is vs our hipified port. I think this is only an issue for ROCm, so I doubt they'd be interested in a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants