CUDA errors in inference.ipynb #13

@Jun-Kai-Zhang

I keep hitting the following error when I run the last cell of inference.ipynb:

RuntimeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 output = model.generate(graph, input_tokens)
2 print(output)

File ~/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/3D-MoLM/model/blip2_llama_inference.py:85, in Blip2Llama.generate(self, graph_batch, text_batch, do_sample, num_beams, max_length, min_length, max_new_tokens, min_new_tokens, repetition_penalty, length_penalty, num_captions)
70 @torch.no_grad()
71 def generate(
72 self,
(...)
83 num_captions=1,
84 ):
---> 85 graph_embeds, graph_masks = self.graph_encoder(*graph_batch)
86 graph_embeds = self.ln_graph(graph_embeds)
87 query_tokens = self.query_tokens.expand(graph_embeds.shape[0], -1, -1)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
...
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
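
The hint in the message above can be followed from inside the notebook. This is a minimal sketch, assuming it runs as the first cell after a kernel restart, since the variable only takes effect if it is set before CUDA is initialized. The helper name `enable_blocking_launches` is made up for illustration and is not part of 3D-MoLM or PyTorch.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper: it only sets the CUDA_LAUNCH_BLOCKING environment
    variable named in the error message and returns the value that was set.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Run this before importing torch or touching the GPU in a fresh kernel.
enable_blocking_launches()

# Afterwards: import torch, rebuild the model, and rerun the failing cell.
# The stack trace should then point at the kernel that actually faulted,
# rather than at a later, unrelated API call.
```

With blocking launches enabled, the traceback may land on a different line than `self.graph_encoder(*graph_batch)`, which would narrow down which operation triggers the illegal memory access.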


The CUDA version is 12.2, and the installed packages are:

Package Version


absl-py 2.1.0
accelerate 0.32.1
aiohttp 3.9.5
aiosignal 1.3.1
altair 5.3.0
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
bleach 6.1.0
blinker 1.8.2
blis 0.7.11
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.3.3
catalogue 2.0.10
certifi 2024.7.4
cffi 1.16.0
cfgv 3.4.0
charset-normalizer 3.3.2
click 8.1.7
cloudpathlib 0.18.1
cmake 3.30.0
comm 0.2.2
confection 0.1.5
contextlib2 21.6.0
contexttimer 0.3.3
contourpy 1.2.1
cycler 0.12.1
cymem 2.0.8
debugpy 1.8.2
decorator 5.1.1
decord 0.6.0
deepspeed 0.12.2
distlib 0.3.8
docker-pycreds 0.4.0
einops 0.8.0
exceptiongroup 1.2.0
executing 2.0.1
fairscale 0.4.4
filelock 3.15.4
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.6.1
ftfy 6.2.0
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.5
h2 4.1.0
hjson 3.1.0
hpack 4.0.0
huggingface-hub 0.23.4
hyperframe 6.0.1
identify 2.6.0
idna 3.7
imageio 2.34.2
importlib_metadata 8.0.0
iopath 0.1.10
ipykernel 6.29.5
ipython 8.26.0
jedi 0.19.1
Jinja2 3.1.4
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
jupyter_client 8.6.2
jupyter_core 5.7.2
kaggle 1.6.14
kiwisolver 1.4.5
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
lightning-utilities 0.11.3.post0
lit 18.1.8
lmdb 1.5.1
marisa-trie 1.2.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.9.1
matplotlib-inline 0.1.7
mdurl 0.1.2
ml_collections 0.1.1
mpmath 1.3.0
multidict 6.0.5
murmurhash 1.0.10
nest_asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
nodeenv 1.9.1
numpy 1.26.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.555.43
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-python-headless 4.5.5.64
opendatasets 0.1.22
packaging 24.1
pandas 2.2.2
parso 0.8.4
peft 0.11.1
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.4.0
pip 24.1.2
platformdirs 4.2.2
plotly 5.22.0
portalocker 2.10.0
pre-commit 3.7.1
preshed 3.0.9
prompt_toolkit 3.0.47
protobuf 5.28.0rc1
psutil 6.0.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 16.1.0
pycocoevalcap 1.2
pycocotools 2.0.8
pycparser 2.22
pydantic 2.8.2
pydantic_core 2.20.1
pydeck 0.9.1
Pygments 2.18.0
pynvml 11.5.3
pyparsing 3.1.2
PySocks 1.7.1
python-dateutil 2.9.0
python-magic 0.4.27
python-slugify 8.0.4
pytorch-lightning 2.0.7
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.0.3
rdkit 2024.3.3
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
rich 13.7.1
rpds-py 0.19.0
safetensors 0.4.3
salesforce-lavis 1.0.2
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.0
sentencepiece 0.2.0
sentry-sdk 2.9.0
setproctitle 1.3.3
setuptools 69.5.1
shellingham 1.5.4
six 1.16.0
smart-open 7.0.4
smmap 5.0.1
spacy 3.7.5
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
stack-data 0.6.2
streamlit 1.36.0
sympy 1.13.0
tenacity 8.5.0
tensorboardX 2.6.2.2
text-unidecode 1.3
thinc 8.2.5
threadpoolctl 3.5.0
tifffile 2024.7.2
timm 0.4.12
tokenizers 0.19.1
toml 0.10.2
toolz 0.12.1
torch 2.3.1
torch_geometric 2.5.3
torchaudio 2.3.0
torchmetrics 1.4.0.post0
torchvision 0.18.0
tornado 6.4.1
tqdm 4.66.4
traitlets 5.14.3
transformers 4.42.4
triton 2.3.1
typer 0.12.3
typing_extensions 4.12.2
tzdata 2024.1
unicore 0.0.1
urllib3 2.2.2
virtualenv 20.26.3
wandb 0.17.4
wasabi 1.1.3
watchdog 4.0.1
wcwidth 0.2.13
weasel 0.4.1
webdataset 0.2.86
webencodings 0.5.1
wheel 0.43.0
wrapt 1.16.0
yarl 1.9.4
zipp 3.19.2
zstandard 0.23.0
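
The list above contains both `-cu11` and `-cu12` NVIDIA wheels. Whether that is related to the error is not established here, but it can be checked quickly. The sketch below uses only the standard library to group installed `nvidia-*` packages by their CUDA suffix. The function name `cuda_wheel_families` is an assumption for illustration.

```python
import re
from collections import defaultdict
from importlib import metadata


def cuda_wheel_families(names):
    """Group nvidia-*-cuNN package names by their CUDA suffix (e.g. 'cu11', 'cu12')."""
    families = defaultdict(list)
    for name in names:
        # Normalize case and underscores so both spellings of a name match.
        normalized = name.lower().replace("_", "-")
        match = re.fullmatch(r"nvidia-[a-z0-9-]+-(cu\d+)", normalized)
        if match:
            families[match.group(1)].append(name)
    return dict(families)


# Collect the names of all distributions visible to the current interpreter.
installed = [
    dist.metadata["Name"]
    for dist in metadata.distributions()
    if dist.metadata["Name"]
]

families = cuda_wheel_families(installed)
if len(families) > 1:
    print("Mixed CUDA wheel families installed:", sorted(families))
else:
    print("CUDA wheel families installed:", sorted(families))
```

Run against the package list in this issue, this would report both `cu11` and `cu12`. Packages without a CUDA suffix, such as `nvidia-ml-py`, are ignored.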


I also tried switching to PyTorch 2.3.0 and got this error instead:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)


It works if I run on the CPU instead of the GPU, but generation on the CPU is too slow.

Could you provide some advice on how to resolve this error? Thanks.

By the way, I tried to set up the environment from requirements.txt, but the pip install failed.
