ch4/ipc/gpu: fix broken IPC mapping cache retrieval#7821
Conversation
|
The handle validation check is supposed to verify that the unique buffer id stored with the IPC handle is the same as the one being used in the operation. Are you saying that in your case you have 2 different allocations that have the same CUDA buffer id? bool MPL_gpu_ipc_handle_is_valid(MPL_gpu_ipc_mem_handle_t * handle, void *ptr)
{
CUresult ret;
unsigned long long buffer_id;
ret = cuPointerGetAttribute(&buffer_id, CU_POINTER_ATTRIBUTE_BUFFER_ID, (CUdeviceptr) ptr);
assert(ret == cudaSuccess);
return buffer_id == handle->id;
} |
Hm, maybe I am not fully understanding the issue. Is the problem that the receiver is not mapping the entire allocation but only what is needed by the copy operation? The cache hit on the base address is correct but the mapped range is insufficient? |
Imagine the sender has had to deal with some allocation at some previous time at a given base address.
The receiver is relying on its previous map (from that smaller initial allocation) because it sees |
I'm still not grasping this totally. The first call using the second allocation produces an invalid cache hit at the sender. The sender evicts the invalid cache entry and creates a new entry/handle for the receiver with |
What if it's two different receivers getting the handle marked |
That's what I was missing! Yes, that is definitely a flaw in this implementation. Thanks for your patience in getting there. I wonder if we should expand the receiver-side key to encompass base ptr, remote rank, and remote buffer id. Then we can eliminate the valid/remap field since the receiver has everything they need to determine whether mapping is needed. In fact, it may allow us to eliminate the sender-side cache entirely. At least with CUDA, getting the IPC handle is a lightweight operation. |
After writing it, I realised I probably should have led with that x)
I expect the remote buffer ID would be enough to disambiguate, yes.
As you probably noticed, I took the liberty to do just that even with the current implementation. You reckon |
My worry is that it could be unreliable. Say if you cudaMalloc(1MB)/cudaFree()/cudaMalloc(1MB). You may get a cache hit and matching length, but the buffer ids will be different. Could still lead to a crash. Buffer id seems safest to me. I don't know the cost of the bounds or id checking, but I would guess they are similar. |
|
From the CUDA docs: """ ptr must be a pointer to memory obtained from a CUDA memory allocation API. Every memory allocation from any of the CUDA memory allocation APIs will have a unique ID over a process lifetime. Subsequent allocations do not reuse IDs from previous freed allocations. IDs are only unique within a single process. So I think we can just replace the base address with this id, instead of adding another field to the key. I suppose this would be safe for CUDA and HIP, dunno about ZE. |
Yes, this is even better. IIRC, ZE is unable to use the default cache so it might be moot whether it works for them. |
Just adding some more comments after looking at the code. It doesn't all have to get solved in this PR.
|
One issue is that eventually open IPC handles need to be closed/unmapped. Otherwise they hold a reference to the allocation and keep the memory from being released. The special ZE cache imposes a limit and evicts based on LRU policy. The generic cache used by CUDA/HIP does not have this capability at the moment. |
|
Hi @raffenet, I ended up implementing something a bit different from what we discussed, here's why:
No need, since the MPIDI handler includes the MPL handler which includes the buffer id.
No can do. We still need to key on the address, so we know we've mapped a given remote address before, so we can unmap a previous mapping that's now invalid.
For the same reason as above, we need both. We need to check if we have a remote base pointer hit and then check for a buffer id hit, in which case I'm done, but otherwise we need to unmap using the local base pointer. |
aa038c9 to
98577a8
Compare
Yes, I agree that it is good to unmap previous entries in the cache. However our existing unmap logic is only best-effort. We only unmap if the same base address is used, which we cannot guarantee. It would be an improvement just limit the cache size, but that will be additional work to implement. I would be fine leaving the hash lookup as-is and using the buffer id to validate the region for now.
Sorry, I meant the reuse check on the send-side. We only go down the IPC path once we detect that a send buffer has been reused some number of times. See |
I'm afraid it's not simply because it's good, it is a necessity. If you try to map an allocation on the same base address you get |
|
@raffanet, let me know if you're expecting anything else from me here. I think we agree that for the reason above it's also impossible to set a cache size limit and evict entries receiver-side, because you never know if you're going to need to unmap that address later. I've deleted the cache sender-side anyway, so the reuse check there is gone. |
I'll carve out some time to do a full review tonight or tomorrow. I think your approach is correct. |
| *vaddr = (void *) ((char *) pbase + handle.offset); | ||
| goto fn_exit; | ||
| /* found the base address in cache, check the buffer id */ | ||
| if (cached_remote_buffer_id == incoming_remote_buffer_id) { |
There was a problem hiding this comment.
Can we repurpose MPL_gpu_ipc_handle_is_valid for use on the recv side? That way we wouldn't have to expose MPL_gpu_buffer_id_t. Just compare to the cached handle and unmap if not "valid".
There was a problem hiding this comment.
Hm, I think we still need to expose the buffer id type. Let me play around with the implementation a bit and we can discuss.
| struct map_data { | ||
| const void *local_base_address; | ||
| MPL_gpu_buffer_id_t remote_buffer_id; | ||
| } mapped[]; /* array of local base addresses and remote buffer ids indexed by device id */ |
There was a problem hiding this comment.
Correct me if I am wrong, but the remote buffer id is a singular value, right? We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once. I think this should simplify some of the changes in this commit.
| struct map_data { | |
| const void *local_base_address; | |
| MPL_gpu_buffer_id_t remote_buffer_id; | |
| } mapped[]; /* array of local base addresses and remote buffer ids indexed by device id */ | |
| MPL_gpu_buffer_id_t remote_buffer_id; | |
| const void *mapped_addrs[]; /* array of base addresses indexed by device id */ |
My other comment is to isolate the changes to MPL to expose the buffer id to their own commit. That should make it easier for review.
There was a problem hiding this comment.
Correct me if I am wrong, but the remote buffer id is a singular value, right?
Right.
We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once.
We need to know the id currently mapped for each device, so we know if we need to unmap on that device.
My other comment is to isolate the changes to MPL to expose the buffer id to their own commit.
Done.
There was a problem hiding this comment.
We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once.
We need to know the id currently mapped for each device, so we know if we need to unmap on that device.
If you have different buffer ids mapped for the same remote base address isn't that a problem? Once you lookup a base address and discover that the buffer id has changed, you need to clear all the mappings. The previous allocation is gone.
There was a problem hiding this comment.
Right now, we have a lazy approach, i.e., we (unmap and) map once MPIDI_GPU_ipc_handle_map() is called for a specific device. To save the buffer id only once, and assuming we change things so that we unmap on all devices just as we get the first mismatch (we'd probably need to rewrite ipc_mapped_cache_search() to return the cache entry so we can loop over base addresses on all devices), how do we know to map back on a given device? Mapping preemptively on all devices sounds unnecessary, so we'd have to wait until MPIDI_GPU_ipc_handle_map() is called for a given device to open the map, just as currently implemented, but how would we know we need to do it in the first place? Do we save flags for each device? I'd argue the id is exactly that flag.
There was a problem hiding this comment.
Yes, we map for each device as needed. Once an entry is invalidated, we have to clear all mappings and remove the entry entirely from the cache. That is the current behavior, just the invalidation logic is flawed as you discovered.
The new invalidation logic is when a remote base address has a new buffer id. Clear all mappings, create the new entry, map it only on the device being used for the newest transfer. We will map again for other devices as needed.
static int ipc_mapped_cache_delete(const void *remote_addr, int remote_rank)
{
int mpi_errno = MPI_SUCCESS;
struct MPIDI_GPUI_map_cache_entry *entry;
struct map_key key;
MPL_DBG_MSG_P(MPIDI_CH4_DBG_IPC, VERBOSE, "removing STALE mapped gpu ipc handle for %p",
remote_addr);
memset(&key, 0, sizeof(key));
key.remote_rank = remote_rank;
key.remote_addr = remote_addr;
HASH_FIND(hh, MPIDI_GPUI_global.ipc_map_cache, &key, sizeof(struct map_key), entry);
if (entry) {
HASH_DEL(MPIDI_GPUI_global.ipc_map_cache, entry);
for (int i = 0; i < MPIDI_GPUI_global.local_device_count; i++) {
if (entry->mapped_addrs[i]) {
int mpl_err = MPL_gpu_ipc_handle_unmap((void *) entry->mapped_addrs[i]);
MPIR_ERR_CHKANDJUMP(mpl_err != MPL_SUCCESS, mpi_errno, MPI_ERR_OTHER,
"**gpu_ipc_handle_unmap");
}
}
MPL_free(entry);
}
fn_exit:
return mpi_errno;
fn_fail:
goto fn_exit;
}There was a problem hiding this comment.
I suppose we don't actually need to delete the cache entry. We could just clear all existing mappings and then map only the new one. The key (remote_rank+remote_addr) remains the same.
There was a problem hiding this comment.
🙈 You're absolutely right, I completely forgot we had ipc_mapped_cache_delete() when I was talking about the looping logic above, it's already there! And yes, the "invalid flag" could just be a nullptr for the base pointer as well, probably most simply implemented by deleting the entry as you had done. Let me put that back.
There was a problem hiding this comment.
Done. EDIT: Jenkins appears to be broken...
b85b58d to
9119156
Compare
5619925 to
0132cbc
Compare
| assert(ret == ZE_RESULT_SUCCESS); | ||
|
|
||
| return handle->data.mem_id == ptr_attr.id; | ||
| return ptr_attr.id == handle->data.mem_id; |
There was a problem hiding this comment.
nitpick: this is the same comparison just reversed. should be dropped.
There was a problem hiding this comment.
Ah, this was so it matched the order of the CUDA/HIP codepaths; it was a nitpick in the first place to reverse it... I can put it back for that commit: this code is gone anyway. EDIT: Done.
| #include "cuda.h" | ||
| #include "cuda_runtime_api.h" | ||
|
|
||
| typedef unsigned long long MPL_gpu_buffer_id_t; |
There was a problem hiding this comment.
Oh I think I forgot that you'll also need a typedef in mpl_gpu_fallback.h to not break the CPU-only build
There was a problem hiding this comment.
No, you didn't actually, but somehow we reverted that change? How strange... https://github.com/pmodels/mpich/compare/fb92eda2b940dd517d155279dd15a6701b0a2a7a..98577a8b548ed8d737ff7e055d5e04310a9a819e
EDIT: Done, it's back.
|
test:mpich/ch4/most |
|
test:mpich/ch4/most |
|
test:mpich/warnings |
|
test:mpich/authorship |
|
Bad news: this is a step in the right direction, but there's still a flaw somewhere. I was testing this with One thing that strikes me is that caching is not optional, we must cache, because we must unmap. So the current design where we allow no cache doesn't feel right. Of course that's not what's happening here though because I'm definitely caching. I've tried ditching the cache entry unconditionally and printing the raw contents of the handle before we call |
|
Took a look at the code. If we only cache the base address, then we won't catch the case when user send in different pointers (at different offset) from a single allocation |
| memset(&key, 0, sizeof(key)); | ||
| key.remote_rank = remote_rank; | ||
| key.remote_addr = remote_addr; | ||
| HASH_FIND(hh, MPIDI_GPUI_global.ipc_map_cache, &key, sizeof(struct map_key), entry); |
There was a problem hiding this comment.
This is insufficient to catch the case when remote_addr are different but from the same allocation. How about use mapped_remote_buffer_id as key instead?
There was a problem hiding this comment.
The offset case should work. We always get the base buffer address and use it as the key. Both base addr and offset are part of the handshake message.
If we are still seeing "already mapped" errors, I do worry that a new allocation looks like an offset into a previous, now freed, mapping. I don't know if the previous AVL-based range lookups handled this or not. With a buffer id key we still need a way to inform the receiver to unmap stale mappings. Just because we get "already mapped" doesn't mean we know which cache entry to unmap 😦.
There was a problem hiding this comment.
I see. If we always use base address, then my worried case is covered.
I do worry that a new allocation looks like an offset into a previous, now freed, mapping.
You mean two separate allocation gets the same virtual address? That's what buffer_id is meant to resolve. @nmnobre Could you try add MPL_gpu_buffer_id_t to the key struct?
There was a problem hiding this comment.
Assuming this is indeed it, a new allocation, somewhere at an offset of an old one (I'll check tomorrow or after), we probably need to store base addresses, ids and ranges and unmap if the new incoming handle is at an address > old address and < old address + old range, assume a cache hit, check the id, and unmap on mismatch.
Pull Request Description
Closes #7819.
In
MPIDI_GPU_ipc_handle_map(), if the receiver had already mapped and cached an IPC memory allocation (as given by the input handler) with the same base address, we were assuming this mapping continued to be valid on a second encounter of the same base address. It just so happens that this second input handler might refer to an allocation with the same base address, but of different size, meaning the allocation needs to be remapped to avoid bounds checks failures later on, e.g. on acudaMemcpyAsync()as in #7819.One would then think that fixing this would just be a matter of more judiciously setting
MPIDI_GPU_IPC_HANDLE_REMAP_REQUIREDon the handles, sender-side. Problem is, the sender doesn't have enough information to do this. On a first call toMPIDI_GPU_fill_ipc_handle()for some device allocation, we were correctly setting the handle status toMPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED(even if this pertains to an allocation of different size on a base address seen before), but a subsequent call with some input attributes referencing a different offset inside the allocation but the same base address, would incorrectly set the handler status toMPIDI_GPU_IPC_HANDLE_VALID. When the receiver sees this handler, it doesn't know to remap, leading to the problem above.Obviously, setting
MPIDI_GPU_IPC_HANDLE_REMAP_REQUIREDunconditionally solves the problem, but that's effectively the same as disabling the cache on the receiver side, so instead I've introduced a check on the allocation sizes.Cheers,
-Nuno
summary of key changes