Add fast-path for gufunc (specifically matmul) #9482

seberg · 2025-11-13T11:16:17Z

For simple `cupy.matmul` calls, half of the overhead is still inside the gufuncs even with the previous PR, which suggests adding a fast-path mechanism.

I would love to avoid the fast-path here, but I doubt it is worthwhile to carefully Cythonize the gufuncs, and a 2-3x speedup is still a bit underwhelming.
(Fast-paths are always an opportunity for bugs after all...)

The core is to propose a mechanism of:

    def try_fastcall(*args, **kwargs):
        if not supports_kwargs_and_args:
            return NotImplemented
        # do the real thing if possible

for "GUFuncs".
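To make the proposed control flow concrete, here is a minimal pure-Python sketch of how a gufunc-like object could consult such a hook before falling back to its general path. Every name in it (`GUFuncSketch`, `matmul_fastcall`, `general_matmul`, the `calls` log) is hypothetical and only illustrates the `NotImplemented` convention; it is not CuPy's actual implementation, and nested lists stand in for arrays.

```python
# Hypothetical sketch: none of these names are CuPy internals.
calls = []  # records which path handled each call, for demonstration only


def _matmul_2d(a, b):
    """Naive 2-D matrix product on nested lists (stand-in for the real kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def general_matmul(a, b, **kwargs):
    """Stand-in for the slow, fully general gufunc path (kwargs are ignored here)."""
    calls.append("general")
    return _matmul_2d(a, b)


def matmul_fastcall(*args, **kwargs):
    """Fast path: handle only the simplest call shape, decline everything else."""
    if kwargs or len(args) != 2:
        return NotImplemented  # let the general machinery take over
    calls.append("fast")
    return _matmul_2d(*args)


class GUFuncSketch:
    """Gufunc-like callable that tries an optional fast path first."""

    def __init__(self, general_impl, try_fastcall=None):
        self._general_impl = general_impl
        self._try_fastcall = try_fastcall

    def __call__(self, *args, **kwargs):
        if self._try_fastcall is not None:
            result = self._try_fastcall(*args, **kwargs)
            if result is not NotImplemented:
                return result
        # The fast path declined (or does not exist): use the general code.
        return self._general_impl(*args, **kwargs)


matmul = GUFuncSketch(general_matmul, try_fastcall=matmul_fastcall)
```

With this shape, `matmul(a, b)` is served by the fast path, while any call carrying keyword arguments makes the hook return `NotImplemented` and runs the general code unchanged. That keeps the fast path small and easy to audit.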

~~Draft as based on gh-9481 and I assume we can put this since @emcastillo already had a look.~~ whoops, seems I forgot the draft...

This may close gh-8191, although by my timings we are still about 2x slower than torch. That time is lost in various places; the biggest chunk is probably a rollaxis (creating a new array).

This reduces the gufunc overhead of a simple matmul from ~60% of the operation to probably 10%; I expect around 1/3 of the time if the matmul core itself is well optimized. It also fixes a bunch of bugs, unfortunately... This is rather complex code, so it is hard to review. (I am sure it fixes a lot more bugs than it opens, but it's hard to be sure everything is covered...) It does have one larger change (also a bug fix) that I am aware of: the old code incorrectly broadcast _core_ dimensions, which NumPy does not allow unless the dimension is specified as `n|1`. For matmul this just changes the error. If someone actually relied on it, they would have to switch to `|1` on newer versions.
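The distinction between loop and core dimensions can be illustrated with a small shape-resolution sketch. It assumes the simplified signature `(n,k),(k,m)->(n,m)` with inputs of at least two dimensions. The helper name `matmul_result_shape` is hypothetical and is not the code in this PR. Loop (batch) dimensions broadcast as usual, while the shared core dimension `k` must match exactly, even when one side has length 1.

```python
from itertools import zip_longest


def matmul_result_shape(a_shape, b_shape):
    """Result shape for (n,k),(k,m)->(n,m); hypothetical helper, inputs >= 2-D."""
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("this sketch only handles inputs with >= 2 dimensions")

    # Split each shape into leading loop dimensions and trailing core dimensions.
    *a_loop, n, k_a = a_shape
    *b_loop, k_b, m = b_shape

    # Core dimensions are never broadcast: a length of 1 does not stretch.
    if k_a != k_b:
        raise ValueError(f"core dimension 'k' mismatch: {k_a} != {k_b}")

    # Loop dimensions follow ordinary broadcasting, aligned from the right.
    loop = []
    for x, y in zip_longest(reversed(a_loop), reversed(b_loop), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"loop dimensions do not broadcast: {x} vs {y}")
        loop.append(y if x == 1 else x)

    return (*reversed(loop), n, m)
```

For example, shapes `(5, 1, 2, 3)` and `(4, 3, 6)` resolve to `(5, 4, 2, 6)` because only the loop dimensions are stretched, whereas `(2, 3)` and `(1, 4)` raise an error instead of silently broadcasting `k`.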


emcastillo

LGTM!

seberg added 3 commits November 12, 2025 07:33

Move transpose for clarity (and correctness although only in theory)

ea1d086

seberg requested a review from a team as a code owner November 13, 2025 11:16

leofang added the cat:performance Performance in terms of speed or memory consumption label Nov 13, 2025

emcastillo approved these changes Nov 14, 2025
