Replies: 1 comment
I think I have found the underlying problem:

Problem

When the TT (and I assume any other) decomposition function calls […]

Possible solutions

I see two possible solutions for this problem, which can be implemented in TensorLy.

In the figure below I show the difference in memory allocation between the current SVD computation (left) and `torch.linalg.svd` using the approximate SVD […]. In this case, the solutions I listed above decrease the memory required to decompose the tensor by a factor of ~2 (!), or in absolute terms by 4–6 GB, adjusting for base load. I would be interested in putting together a PR to migrate TensorLy (on the PyTorch backend) to […][^1]

[^1]: Note: […]
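For reference, here is a hypothetical micro-benchmark along these lines (not from the original reply): it compares the peak CUDA memory of a full thin SVD against `torch.svd_lowrank`, PyTorch's built-in approximate SVD, on a matrix shaped like the first TT unfolding of the tensor from the question below. The exact savings depend on hardware and shapes.

```python
# Hypothetical micro-benchmark (not part of the original reply): peak CUDA
# memory of a full thin SVD vs. torch.svd_lowrank on the first TT unfolding.
import torch

def peak_mem_mb(fn, *args, **kwargs):
    # Reset the peak-memory counter, run fn, and report the peak in MB.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# First TT unfolding of a (50, 50, 100, 100) tensor: 50 x 500000.
A = torch.randn(50, 50 * 100 * 100, device="cuda")

full = peak_mem_mb(torch.linalg.svd, A, full_matrices=False)
approx = peak_mem_mb(torch.svd_lowrank, A, q=20)  # rank-20 approximation
print(f"full SVD peak: {full:.0f} MB, svd_lowrank peak: {approx:.0f} MB")
```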
Context
I am trying to track down a potential issue that leads to OOM (CUDA) errors when decomposing large tensors on the PyTorch backend. For context, in my use case I use TensorLy to compute decompositions of noisy order-4 tensors with a total size of several GB (float32) on the GPU. My tensors have a comparable number of indices in each mode (roughly $\mathcal{T} \in \mathbb{R}^{50 \times 50 \times 100 \times 100}$). I mostly use the TT decomposition.
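For concreteness, here is a minimal sketch of this setup, with random data standing in for my tensors and assuming the `svd` keyword of `tensorly.decomposition.tensor_train` discussed below:

```python
# Minimal sketch of the setup (random data stands in for my actual tensors).
import torch
import tensorly as tl
from tensorly.decomposition import tensor_train

tl.set_backend("pytorch")

# Noisy order-4 tensor on the GPU (float32).
X = torch.randn(50, 50, 100, 100, device="cuda")

# TT decomposition with boundary ranks 1 and internal ranks 20;
# the `svd` keyword selects the SVD routine used for each unfolding.
factors = tensor_train(X, rank=[1, 20, 20, 20, 1], svd="truncated_svd")
```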
Question/Problem
I would like to better understand the difference between `svd="randomized_svd"` and `svd="truncated_svd"` with regard to memory allocation/usage when computing decompositions. I experimentally observe a large difference between the two options in terms of GPU memory allocation, as tracked by PyTorch. The figure below shows the allocated GPU memory over time.

In the figure you can see the large difference between `"truncated_svd"` (left) and `"randomized_svd"` (right) when computing the TT decomposition. With the truncated SVD, ~5.5 GB are allocated over what I expected, roughly 8 times the size of the tensor being decomposed. In the example I use ranks of 5, 10, and 20, respectively, for each of the peaks, decomposing the same tensor each time.

The (largest) spikes are connected to the following calls (read from bottom to top): […]
Is this expected behavior when computing the SVDs? Why is the difference so large between the randomized and truncated versions? Naively, I would have expected the truncated SVD to use less memory, especially since the ranks of the TT factors are so low. Unless this is expected behavior, the large memory-allocation spikes are problematic because they make it difficult to compute low-rank decompositions of large tensors for my downstream application using truncated SVDs. In my case I can simply default to the randomized version; however, I am still curious about how to interpret this difference.
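For reproducibility, here is a simplified sketch of how such a trace can be gathered with PyTorch's CUDA memory statistics (ranks and options as in the figure; exact numbers will of course differ by setup):

```python
# Sketch of the measurement: sample PyTorch's CUDA memory statistics around
# each tensor_train call (ranks 5, 10, 20 as in the figure above).
import torch
import tensorly as tl
from tensorly.decomposition import tensor_train

tl.set_backend("pytorch")
X = torch.randn(50, 50, 100, 100, device="cuda")

for svd in ("truncated_svd", "randomized_svd"):
    for r in (5, 10, 20):
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        baseline = torch.cuda.memory_allocated()
        tensor_train(X, rank=[1, r, r, r, 1], svd=svd)
        torch.cuda.synchronize()
        spike = (torch.cuda.max_memory_allocated() - baseline) / 2**30
        print(f"{svd}, rank {r}: peak +{spike:.2f} GB over baseline")
```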
Thank you for the great package and for any help or insight you can offer on this.
Notes
Maybe related to #554; however, the tensor_train interface does not expose an `init='random'` option. Could also relate to #36.
Edit:
I see that `randomized_svd` calls `truncated_svd` under the hood, and I observe experimentally that, for slightly larger datasets, the same spikes in GPU memory allocation also occur with `randomized_svd`.
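For context, here is a minimal sketch of the standard randomized range-finder (Halko et al.) that randomized SVD implementations follow; this is not TensorLy's actual code, but it shows where an inner truncated SVD call reappears:

```python
# Minimal sketch of the standard randomized range-finder (Halko et al.);
# not TensorLy's actual code, but the same overall shape: the final SVD of
# the small projected matrix B is delegated to an ordinary (truncated) SVD,
# so a Vh factor of shape (rank + n_oversamples, n) is still materialized.
import torch

def randomized_svd_sketch(A, rank, n_oversamples=10):
    m, n = A.shape
    p = rank + n_oversamples
    # Sample the range of A with a Gaussian test matrix.
    Omega = torch.randn(n, p, device=A.device, dtype=A.dtype)
    Q, _ = torch.linalg.qr(A @ Omega)  # (m, p) orthonormal basis for range(A)
    B = Q.T @ A                        # small (p, n) projected matrix
    U_small, S, Vh = torch.linalg.svd(B, full_matrices=False)  # inner SVD
    U = Q @ U_small                    # lift back to the original space
    return U[:, :rank], S[:rank], Vh[:rank]
```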