@seberg seberg commented Nov 13, 2025

For simple cupy.matmul calls, about half of the overhead is still inside the gufuncs even with the previous PR, so this PR suggests adding a fast-path mechanism.

I would love to avoid the fast-path here, but I doubt it is worthwhile to carefully Cythonize the gufuncs, and a 2-3x speedup is still a bit underwhelming.
(Fast-paths are always an opportunity for bugs after all...)

The core of the proposal is a mechanism of the form:

def try_fastcall(*args, **kwargs):
    if not supports_kwargs_and_args:
        return NotImplemented
    # do the real thing if possible

for "GUFuncs".
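
To make the idea concrete, here is a minimal pure-Python sketch of how such a hook could be dispatched. Everything here is hypothetical and for illustration only: `slow_generic_path`, `call_gufunc`, and the "two positional arguments, no keywords" condition are assumptions, not the code in this PR or CuPy's actual API.

```python
# Hypothetical sketch of a fast-path hook; none of these names are CuPy's.

def slow_generic_path(*args, **kwargs):
    """Stand-in for the full gufunc machinery, which handles every case."""
    return ("generic", args, kwargs)


def try_fastcall(*args, **kwargs):
    """Handle only the simple case; return NotImplemented to decline."""
    # Assumed condition: exactly two positional operands and no keywords.
    if kwargs or len(args) != 2:
        return NotImplemented
    a, b = args
    # A real implementation would do the actual work here.
    return ("fast", a, b)


def call_gufunc(*args, **kwargs):
    """Try the fast path first, then fall back to the generic path."""
    result = try_fastcall(*args, **kwargs)
    if result is not NotImplemented:
        return result
    return slow_generic_path(*args, **kwargs)
```

With this shape, `call_gufunc(x, y)` takes the fast path, while something like `call_gufunc(x, y, out=None)` falls through to the generic code, so the fast path never has to be complete.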


Draft, since this is based on gh-9481, and I assume we can put this in since @emcastillo already had a look. (Whoops, seems I forgot to mark it as a draft...)

This maybe closes gh-8191, although from my timings we are still about 2x slower than torch. That time is lost in various places; the biggest chunk is probably a rollaxis (creating a new array).

This reduces the gufunc overhead of a simple matmul from ~60% of the
operation to probably 10%. I expect the overhead to be around 1/3 of
the time if the matmul core itself is well optimized.

It also fixes a bunch of bugs, unfortunately... This is rather
complex code, so it is hard to review.
(I am sure it fixes a lot more bugs than it opens, but it's hard
to be sure everything is covered...)

It does have one larger change (also a bug fix) that I am aware of:
The old code incorrectly broadcast _core_ dimensions; this is not
allowed by NumPy unless the dimension is specified as `n|1`.
For matmul this just changes the error. If someone actually relied
on it, they would have to switch to `|1` on newer versions.
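
As an illustration of the rule described above, the following pure-Python sketch computes the result shape for a matmul-like signature `(n,k),(k,m)->(n,m)`: loop dimensions broadcast, while core dimensions must match exactly. The helper names are hypothetical and this is not the code from this PR; it only handles inputs with at least two dimensions.

```python
from itertools import zip_longest


def broadcast_loop_dims(a_loop, b_loop):
    """Broadcast the leading (loop) dimensions; a size of 1 may stretch."""
    out = []
    for x, y in zip_longest(reversed(a_loop), reversed(b_loop), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"loop dimensions {x} and {y} do not broadcast")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def matmul_result_shape(a_shape, b_shape):
    """Result shape for (n,k),(k,m)->(n,m), for inputs with ndim >= 2."""
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("this sketch only handles inputs with ndim >= 2")
    *a_loop, n, k_a = a_shape
    *b_loop, k_b, m = b_shape
    # Core dimensions are NOT broadcast: k must agree exactly, even if 1.
    if k_a != k_b:
        raise ValueError(f"core dimension mismatch: {k_a} != {k_b}")
    return broadcast_loop_dims(a_loop, b_loop) + (n, m)
```

For example, `matmul_result_shape((5, 1, 2, 3), (4, 3, 6))` broadcasts the loop dimensions, whereas `matmul_result_shape((2, 3), (1, 4))` raises because the core dimension `k` is 3 on one side and 1 on the other.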
@seberg seberg requested a review from a team as a code owner November 13, 2025 11:16
@leofang leofang added the cat:performance Performance in terms of speed or memory consumption label Nov 13, 2025
@emcastillo emcastillo left a comment

LGTM!

Development

Successfully merging this pull request may close these issues.

Higher Kernel Launch CPU Overhead

3 participants