RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16 #32

@Vincento-Wang

Description

[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2054, in forward
[rank3]: loss = self.module(*inputs, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank3]: return inner()
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank3]: result = forward_call(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/diffusers/models/transformers/transformer_cogview4.py", line 713, in forward
[rank3]: hidden_states, encoder_hidden_states = self.patch_embed(hidden_states, encoder_hidden_states)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/diffusers/models/transformers/transformer_cogview4.py", line 59, in forward
[rank3]: hidden_states = self.proj(hidden_states)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank3]: return F.linear(input, self.weight, self.bias)
[rank3]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
Training steps: 0%| | 0/200 [00:04<?, ?it/s]
[rank0]:[W508 15:09:04.137948373 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0508 15:09:05.675000 25220 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 25478 closing signal SIGTERM
W0508 15:09:05.675000 25220 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 25480 closing signal SIGTERM
W0508 15:09:05.677000 25220 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 25481 closing signal SIGTERM
E0508 15:09:06.356000 25220 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 25479) of binary: /home/anaconda3/envs/cogview/bin/python3.10
Traceback (most recent call last):
File "/home/anaconda3/envs/cogview/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1196, in launch_command
deepspeed_launcher(args)
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/accelerate/commands/launch.py", line 878, in deepspeed_launcher
distrib_run.run(args)
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/anaconda3/envs/cogview/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-05-08_15:09:05
host : 36aa4b45f54f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 25479)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
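
For reference, the failing call in the trace above is `self.proj(hidden_states)` inside `patch_embed`, where the input is Float (float32) and the `nn.Linear` weight is BFloat16. Below is a minimal sketch of the same kind of mismatch outside the training script. The layer sizes and variable names are hypothetical, and casting the input to the weight dtype is only one possible workaround; it is an assumption that this matches the actual cause in the DeepSpeed/accelerate setup here.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for patch_embed.proj: a linear layer held in bfloat16.
proj = nn.Linear(8, 4).to(torch.bfloat16)

# Hypothetical stand-in for hidden_states: a float32 tensor.
hidden_states = torch.randn(2, 8)

message = None
try:
    # float32 input against bfloat16 weights raises a dtype RuntimeError.
    proj(hidden_states)
except RuntimeError as err:
    message = str(err)
    print(message)

# Possible workaround (assumption): cast the input to the layer's weight dtype.
out = proj(hidden_states.to(proj.weight.dtype))
print(out.dtype)
```

If the real cause is that the latents stay in float32 while the transformer is cast to bf16, the equivalent change would be to cast the model inputs in the training loop before calling the transformer.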

Labels: bug
