Skip to content

ch4/ipc/gpu: fix broken IPC mapping cache retrieval#7821

Open
nmnobre wants to merge 5 commits into
pmodels:mainfrom
nmnobre:ipc
Open

ch4/ipc/gpu: fix broken IPC mapping cache retrieval#7821
nmnobre wants to merge 5 commits into
pmodels:mainfrom
nmnobre:ipc

Conversation

@nmnobre

@nmnobre nmnobre commented May 28, 2026

Copy link
Copy Markdown
Contributor

Pull Request Description

Closes #7819.
In MPIDI_GPU_ipc_handle_map(), if the receiver had already mapped and cached an IPC memory allocation (as given by the input handler) with the same base address, we were assuming this mapping continued to be valid on a second encounter of the same base address. It just so happens that this second input handler might refer to an allocation with the same base address, but of different size, meaning the allocation needs to be remapped to avoid bounds checks failures later on, e.g. on a cudaMemcpyAsync() as in #7819.

One would then think that fixing this would just be a matter of more judiciously setting MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED on the handles, sender-side. Problem is, the sender doesn't have enough information to do this. On a first call to MPIDI_GPU_fill_ipc_handle() for some device allocation, we were correctly setting the handle status to MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED (even if this pertains to an allocation of different size on a base address seen before), but a subsequent call with some input attributes referencing a different offset inside the allocation but the same base address, would incorrectly set the handler status to MPIDI_GPU_IPC_HANDLE_VALID. When the receiver sees this handler, it doesn't know to remap, leading to the problem above.

Obviously, setting MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED unconditionally solves the problem, but that's effectively the same as disabling the cache on the receiver side, so instead I've introduced a check on the allocation sizes.

Cheers,
-Nuno

summary of key changes

  • Sender - side handle cache is removed
  • Receiver -side always cache handle map
  • Rely on buffer id for cache validation

@raffenet

Copy link
Copy Markdown
Contributor

The handle validation check is supposed to verify that the unique buffer id stored with the IPC handle is the same as the one being used in the operation. Are you saying that in your case you have 2 different allocations that have the same CUDA buffer id?

bool MPL_gpu_ipc_handle_is_valid(MPL_gpu_ipc_mem_handle_t * handle, void *ptr)
{
    CUresult ret;
    unsigned long long buffer_id;

    ret = cuPointerGetAttribute(&buffer_id, CU_POINTER_ATTRIBUTE_BUFFER_ID, (CUdeviceptr) ptr);
    assert(ret == cudaSuccess);

    return buffer_id == handle->id;
}

@raffenet

Copy link
Copy Markdown
Contributor

but a subsequent call with some input attributes referencing a different offset inside the allocation but the same base address, would incorrectly set the handler status to MPIDI_GPU_IPC_HANDLE_VALID. When the receiver sees this handler, it doesn't know to remap, leading to the problem above.

Hm, maybe I am not fully understanding the issue. Is the problem that the receiver is not mapping the entire allocation but only what is needed by the copy operation? The cache hit on the base address is correct but the mapped range is insufficient?

@nmnobre

nmnobre commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

The handle validation check is supposed to verify that the unique buffer id stored with the IPC handle is the same as the one being used in the operation. Are you saying that in your case you have 2 different allocations that have the same CUDA buffer id?

Imagine the sender has had to deal with some allocation at some previous time at a given base address.
Now think of a second, larger allocation at the same base address, and of two different offsets inside that same allocation. Think of a scatter for example. On a first call to MPIDI_GPU_fill_ipc_handle pertaining to the second allocation,, first offset, we get an invalid cache hit. But, on a second call, for the second offset, to MPIDI_GPU_fill_ipc_handle, we have ourselves a 'valid' cache hit (but that should be marked MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED).

Is the problem that the receiver is not mapping the entire allocation but only what is needed by the copy operation? The cache hit on the base address is correct but the mapped range is insufficient?

The receiver is relying on its previous map (from that smaller initial allocation) because it sees MPIDI_GPU_IPC_HANDLE_VALID and the base address matches, which might indeed be insufficient because that base address now corresponds to a larger range.

@raffenet

Copy link
Copy Markdown
Contributor

The handle validation check is supposed to verify that the unique buffer id stored with the IPC handle is the same as the one being used in the operation. Are you saying that in your case you have 2 different allocations that have the same CUDA buffer id?

Imagine the sender has had to deal with some allocation at some previous time at a given base address. Now think of a second, larger allocation at the same base address, and of two different offsets inside that same allocation. Think of a scatter for example. On a first call to MPIDI_GPU_fill_ipc_handle pertaining to the second allocation,, first offset, we get an invalid cache hit. But, on a second call, for the second offset, to MPIDI_GPU_fill_ipc_handle, we have ourselves a 'valid' cache hit (but that should be marked MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED).

I'm still not grasping this totally. The first call using the second allocation produces an invalid cache hit at the sender. The sender evicts the invalid cache entry and creates a new entry/handle for the receiver with MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED. The receiver maps the new entry and is able to access the memory. The second call using that allocation should indeed get a 'valid' cache hit meaning the receiver can use its cached mapping.

@nmnobre

nmnobre commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

I'm still not grasping this totally. The first call using the second allocation produces an invalid cache hit at the sender. The sender evicts the invalid cache entry and creates a new entry/handle for the receiver with MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED. The receiver maps the new entry and is able to access the memory. The second call using that allocation should indeed get a 'valid' cache hit meaning the receiver can use its cached mapping.

What if it's two different receivers getting the handle marked MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED and the other marked MPIDI_GPU_IPC_HANDLE_VALID?

@raffenet

Copy link
Copy Markdown
Contributor

I'm still not grasping this totally. The first call using the second allocation produces an invalid cache hit at the sender. The sender evicts the invalid cache entry and creates a new entry/handle for the receiver with MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED. The receiver maps the new entry and is able to access the memory. The second call using that allocation should indeed get a 'valid' cache hit meaning the receiver can use its cached mapping.

What if it's two different receivers getting the handle marked MPIDI_GPU_IPC_HANDLE_REMAP_REQUIRED and the other marked MPIDI_GPU_IPC_HANDLE_VALID?

That's what I was missing! Yes, that is definitely a flaw in this implementation. Thanks for your patience in getting there.

I wonder if we should expand the receiver-side key to encompass base ptr, remote rank, and remote buffer id. Then we can eliminate the valid/remap field since the receiver has everything they need to determine whether mapping is needed. In fact, it may allow us to eliminate the sender-side cache entirely. At least with CUDA, getting the IPC handle is a lightweight operation.

@nmnobre

nmnobre commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

That's what I was missing! Yes, that is definitely a flaw in this implementation. Thanks for your patience in getting there.

After writing it, I realised I probably should have led with that x)
This was the hardest bug I ever tracked down though (best part of two days), so I didn't expect it to be easy to explain.

I wonder if we should expand the receiver-side key to encompass base ptr, remote rank, and remote buffer id.

I expect the remote buffer ID would be enough to disambiguate, yes.

Then we can eliminate the valid/remap field

As you probably noticed, I took the liberty to do just that even with the current implementation.

You reckon MPL_gpu_get_buffer_bounds() might still be too expensive?

@raffenet

Copy link
Copy Markdown
Contributor

You reckon MPL_gpu_get_buffer_bounds() might still be too expensive?

My worry is that it could be unreliable. Say if you cudaMalloc(1MB)/cudaFree()/cudaMalloc(1MB). You may get a cache hit and matching length, but the buffer ids will be different. Could still lead to a crash. Buffer id seems safest to me. I don't know the cost of the bounds or id checking, but I would guess they are similar.

@nmnobre

nmnobre commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

From the CUDA docs:

"""
CU_POINTER_ATTRIBUTE_BUFFER_ID:
Returns in *data a buffer ID which is guaranteed to be unique within the process. data must point to an unsigned long long.

ptr must be a pointer to memory obtained from a CUDA memory allocation API. Every memory allocation from any of the CUDA memory allocation APIs will have a unique ID over a process lifetime. Subsequent allocations do not reuse IDs from previous freed allocations. IDs are only unique within a single process.
"""

So I think we can just replace the base address with this id, instead of adding another field to the key. I suppose this would be safe for CUDA and HIP, dunno about ZE.

@raffenet

Copy link
Copy Markdown
Contributor

So I think we can just replace the base address with this id, instead of adding another field to the key. I suppose this would be safe for CUDA and HIP, dunno about ZE.

Yes, this is even better. IIRC, ZE is unable to use the default cache so it might be moot whether it works for them.

@raffenet

raffenet commented May 29, 2026

Copy link
Copy Markdown
Contributor

So I think we can just replace the base address with this id, instead of adding another field to the key. I suppose this would be safe for CUDA and HIP, dunno about ZE.

Yes, this is even better. IIRC, ZE is unable to use the default cache so it might be moot whether it works for them.

Just adding some more comments after looking at the code. It doesn't all have to get solved in this PR.

  1. Either extend MPL_gpu_query_pointer_attr to include a buffer id or add MPL_gpu_get_buffer_id(void *ptr, MPL_gpu_buffer_id *id)
  2. Sender-side cache becomes unnecessary. Get the mem handle + buffer id and send it to the receiver. Key the receiver cache on rank+buffer_id.
  3. Buffer reuse check should use buffer id, not base_ptr.
  4. MPL_gpu_ipc_handle_is_valid can be removed.

@raffenet

Copy link
Copy Markdown
Contributor
  • Sender-side cache becomes unnecessary. Get the mem handle + buffer id and send it to the receiver.

One issue is that eventually open IPC handles need to be closed/unmapped. Otherwise they hold a reference to the allocation and keep the memory from being released. The special ZE cache imposes a limit and evicts based on LRU policy. The generic cache used by CUDA/HIP does not have this capability at the moment.

@nmnobre nmnobre changed the title Fix broken IPC allocation cache retrieval Fix broken IPC mapping cache retrieval Jun 1, 2026
@nmnobre

nmnobre commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Hi @raffenet, I ended up implementing something a bit different from what we discussed, here's why:

  1. Either extend MPL_gpu_query_pointer_attr to include a buffer id or add MPL_gpu_get_buffer_id(void *ptr, MPL_gpu_buffer_id *id)

No need, since the MPIDI handler includes the MPL handler which includes the buffer id. I've instead just implemented a getter for the receiver to get the id given an MPL handle, MPL_gpu_buffer_id_t MPL_gpu_ipc_handle_id(MPL_gpu_ipc_mem_handle_t * handle) to address the differences between CUDA/HIP and ZE EDIT: switched to an ifdef in the body of MPIDI_GPU_ipc_handle_map() as adding to the API is unnecessary imho.

  1. Sender-side cache becomes unnecessary. Get the mem handle + buffer id and send it to the receiver.

I did't touch the sender side, but can do this, because the cache doesn't give us much, if anything now. Nvidia does claim cudaIpcGetMemHandle() is inexpensive... Problem here is I won't have an easy means to test the HIP or ZE codepaths. EDIT: Done.

  1. Key the receiver cache on rank+buffer_id.

No can do. We still need to key on the address, so we know we've mapped a given remote address before, so we can unmap a previous mapping that's now invalid.

  1. Buffer reuse check should use buffer id, not base_ptr.

For the same reason as above, we need both. We need to check if we have a remote base pointer hit and then check for a buffer id hit, in which case I'm done, but otherwise we need to unmap using the local base pointer.

@nmnobre nmnobre force-pushed the ipc branch 4 times, most recently from aa038c9 to 98577a8 Compare June 1, 2026 23:18
@raffenet

raffenet commented Jun 2, 2026

Copy link
Copy Markdown
Contributor
  1. Key the receiver cache on rank+buffer_id.

No can do. We still need to key on the address, so we know we've mapped a given remote address before, so we can unmap a previous mapping that's now invalid.

Yes, I agree that it is good to unmap previous entries in the cache. However our existing unmap logic is only best-effort. We only unmap if the same base address is used, which we cannot guarantee. It would be an improvement just limit the cache size, but that will be additional work to implement. I would be fine leaving the hash lookup as-is and using the buffer id to validate the region for now.

  1. Buffer reuse check should use buffer id, not base_ptr.

For the same reason as above, we need both. We need to check if we have a remote base pointer hit and then check for a buffer id hit, in which case I'm done, but otherwise we need to unmap using the local base pointer.

Sorry, I meant the reuse check on the send-side. We only go down the IPC path once we detect that a send buffer has been reused some number of times. See MPIDI_IPCI_is_repeat_addr. However my suggestion won't work because we use this check for CPU memory and don't have buffer ids for that case. We could split the CPU/GPU cases but again, probably too much work for this PR.

@nmnobre nmnobre changed the title Fix broken IPC mapping cache retrieval ch4/ipc/gpu: fix broken IPC mapping cache retrieval Jun 2, 2026
@nmnobre

nmnobre commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Yes, I agree that it is good to unmap previous entries in the cache.

I'm afraid it's not simply because it's good, it is a necessity. If you try to map an allocation on the same base address you get cudaErrorAlreadyMapped = 208.

@nmnobre

nmnobre commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@raffanet, let me know if you're expecting anything else from me here.

I think we agree that for the reason above it's also impossible to set a cache size limit and evict entries receiver-side, because you never know if you're going to need to unmap that address later. I've deleted the cache sender-side anyway, so the reuse check there is gone.

@raffenet

raffenet commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

@raffanet, let me know if you're expecting anything else from me here.

I think we agree that for the reason above it's also impossible to set a cache size limit and evict entries receiver-side, because you never know if you're going to need to unmap that address later. I've deleted the cache sender-side anyway, so the reuse check there is gone.

I'll carve out some time to do a full review tonight or tomorrow. I think your approach is correct.

*vaddr = (void *) ((char *) pbase + handle.offset);
goto fn_exit;
/* found the base address in cache, check the buffer id */
if (cached_remote_buffer_id == incoming_remote_buffer_id) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we repurpose MPL_gpu_ipc_handle_is_valid for use on the recv side? That way we wouldn't have to expose MPL_gpu_buffer_id_t. Just compare to the cached handle and unmap if not "valid".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think we still need to expose the buffer id type. Let me play around with the implementation a bit and we can discuss.

Comment thread src/mpid/ch4/shm/ipc/gpu/gpu_types.h Outdated
Comment on lines +17 to +20
struct map_data {
const void *local_base_address;
MPL_gpu_buffer_id_t remote_buffer_id;
} mapped[]; /* array of local base addresses and remote buffer ids indexed by device id */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, but the remote buffer id is a singular value, right? We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once. I think this should simplify some of the changes in this commit.

Suggested change
struct map_data {
const void *local_base_address;
MPL_gpu_buffer_id_t remote_buffer_id;
} mapped[]; /* array of local base addresses and remote buffer ids indexed by device id */
MPL_gpu_buffer_id_t remote_buffer_id;
const void *mapped_addrs[]; /* array of base addresses indexed by device id */

My other comment is to isolate the changes to MPL to expose the buffer id to their own commit. That should make it easier for review.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, but the remote buffer id is a singular value, right?

Right.

We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once.

We need to know the id currently mapped for each device, so we know if we need to unmap on that device.

My other comment is to isolate the changes to MPL to expose the buffer id to their own commit.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have multiple recv-side mapped addresses, 1 per each device, but the sender's buffer id only needs to be stored once.

We need to know the id currently mapped for each device, so we know if we need to unmap on that device.

If you have different buffer ids mapped for the same remote base address isn't that a problem? Once you lookup a base address and discover that the buffer id has changed, you need to clear all the mappings. The previous allocation is gone.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, we have a lazy approach, i.e., we (unmap and) map once MPIDI_GPU_ipc_handle_map() is called for a specific device. To save the buffer id only once, and assuming we change things so that we unmap on all devices just as we get the first mismatch (we'd probably need to rewrite ipc_mapped_cache_search() to return the cache entry so we can loop over base addresses on all devices), how do we know to map back on a given device? Mapping preemptively on all devices sounds unnecessary, so we'd have to wait until MPIDI_GPU_ipc_handle_map() is called for a given device to open the map, just as currently implemented, but how would we know we need to do it in the first place? Do we save flags for each device? I'd argue the id is exactly that flag.

@raffenet raffenet Jun 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we map for each device as needed. Once an entry is invalidated, we have to clear all mappings and remove the entry entirely from the cache. That is the current behavior, just the invalidation logic is flawed as you discovered.

The new invalidation logic is when a remote base address has a new buffer id. Clear all mappings, create the new entry, map it only on the device being used for the newest transfer. We will map again for other devices as needed.

static int ipc_mapped_cache_delete(const void *remote_addr, int remote_rank)
{
    int mpi_errno = MPI_SUCCESS;
    struct MPIDI_GPUI_map_cache_entry *entry;
    struct map_key key;

    MPL_DBG_MSG_P(MPIDI_CH4_DBG_IPC, VERBOSE, "removing STALE mapped gpu ipc handle for %p",
                  remote_addr);

    memset(&key, 0, sizeof(key));
    key.remote_rank = remote_rank;
    key.remote_addr = remote_addr;
    HASH_FIND(hh, MPIDI_GPUI_global.ipc_map_cache, &key, sizeof(struct map_key), entry);

    if (entry) {
        HASH_DEL(MPIDI_GPUI_global.ipc_map_cache, entry);
        for (int i = 0; i < MPIDI_GPUI_global.local_device_count; i++) {
            if (entry->mapped_addrs[i]) {
                int mpl_err = MPL_gpu_ipc_handle_unmap((void *) entry->mapped_addrs[i]);
                MPIR_ERR_CHKANDJUMP(mpl_err != MPL_SUCCESS, mpi_errno, MPI_ERR_OTHER,
                                    "**gpu_ipc_handle_unmap");
            }
        }
        MPL_free(entry);
    }

  fn_exit:
    return mpi_errno;
  fn_fail:
    goto fn_exit;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we don't actually need to delete the cache entry. We could just clear all existing mappings and then map only the new one. The key (remote_rank+remote_addr) remains the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈 You're absolutely right, I completely forgot we had ipc_mapped_cache_delete() when I was talking about the looping logic above, it's already there! And yes, the "invalid flag" could just be a nullptr for the base pointer as well, probably most simply implemented by deleting the entry as you had done. Let me put that back.

@nmnobre nmnobre Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. EDIT: Jenkins appears to be broken...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nmnobre I am working with the IT

@nmnobre nmnobre force-pushed the ipc branch 2 times, most recently from b85b58d to 9119156 Compare June 9, 2026 00:06
@nmnobre nmnobre force-pushed the ipc branch 2 times, most recently from 5619925 to 0132cbc Compare June 9, 2026 15:42

@raffenet raffenet left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nmnobre thanks for your report and patch! The fix LGTM. I'll let @hzhou give final approval and deal with Jenkins issues 😄.

Comment thread src/mpl/src/gpu/mpl_gpu_ze.c Outdated
assert(ret == ZE_RESULT_SUCCESS);

return handle->data.mem_id == ptr_attr.id;
return ptr_attr.id == handle->data.mem_id;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: this is the same comparison just reversed. should be dropped.

@nmnobre nmnobre Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was so it matched the order of the CUDA/HIP codepaths; it was a nitpick in the first place to reverse it... I can put it back for that commit: this code is gone anyway. EDIT: Done.

#include "cuda.h"
#include "cuda_runtime_api.h"

typedef unsigned long long MPL_gpu_buffer_id_t;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think I forgot that you'll also need a typedef in mpl_gpu_fallback.h to not break the CPU-only build

@nmnobre nmnobre Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you didn't actually, but somehow we reverted that change? How strange... https://github.com/pmodels/mpich/compare/fb92eda2b940dd517d155279dd15a6701b0a2a7a..98577a8b548ed8d737ff7e055d5e04310a9a819e

EDIT: Done, it's back.

@hzhou

hzhou commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

test:mpich/ch4/most
test:mpich/ch4/gpu/ofi

@hzhou

hzhou commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

test:mpich/ch4/most
test:mpich/ch4/gpu/ofi

@hzhou

hzhou commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

test:mpich/warnings
test:mpich/spellcheck
test:mpich/whitespace

@hzhou

hzhou commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

test:mpich/authorship

@nmnobre

nmnobre commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Bad news: this is a step in the right direction, but there's still a flaw somewhere. I was testing this with CUDA_LAUNCH_BLOCKING=1 and 4 ranks, and though this fixes that and the error in #7819 goes away entirely, for other settings, e.g. CUDA_LAUNCH_BLOCKING=0 and 4 ranks or CUDA_LAUNCH_BLOCKING=1 and 6 ranks, I'm still getting cudaIpcOpenMemHandle: 208.

One thing that strikes me is that caching is not optional, we must cache, because we must unmap. So the current design where we allow no cache doesn't feel right. Of course that's not what's happening here though because I'm definitely caching. I've tried ditching the cache entry unconditionally and printing the raw contents of the handle before we call cudaIpcOpenMemHandle and can't see any reuses, so I'm a bit puzzled...

@hzhou

hzhou commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Took a look at the code. If we only cache the base address, then we won't catch the case when user send in different pointers (at different offset) from a single allocation

memset(&key, 0, sizeof(key));
key.remote_rank = remote_rank;
key.remote_addr = remote_addr;
HASH_FIND(hh, MPIDI_GPUI_global.ipc_map_cache, &key, sizeof(struct map_key), entry);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insufficient to catch the case when remote_addr are different but from the same allocation. How about use mapped_remote_buffer_id as key instead?

@raffenet raffenet Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset case should work. We always get the base buffer address and use it as the key. Both base addr and offset are part of the handshake message.

If we are still seeing "already mapped" errors, I do worry that a new allocation looks like an offset into a previous, now freed, mapping. I don't know if the previous AVL-based range lookups handled this or not. With a buffer id key we still need a way to inform the receiver to unmap stale mappings. Just because we get "already mapped" doesn't mean we know which cache entry to unmap 😦.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If we always use base address, then my worried case is covered.

I do worry that a new allocation looks like an offset into a previous, now freed, mapping.

You mean two separate allocation gets the same virtual address? That's what buffer_id is meant to resolve. @nmnobre Could you try add MPL_gpu_buffer_id_t to the key struct?

@nmnobre nmnobre Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this is indeed it, a new allocation, somewhere at an offset of an old one (I'll check tomorrow or after), we probably need to store base addresses, ids and ranges and unmap if the new incoming handle is at an address > old address and < old address + old range, assume a cache hit, check the id, and unmap on mismatch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible bug due to failing cudaMemcpyAsync

3 participants