Unable to save checkpoint during multi-GPU training #45

@fzuo1230

Hello, I am training wan-2.1-i2v-14b on 8 A800 GPUs, but saving the checkpoint fails with the following error:

Training failed: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'
[rank1]: Traceback (most recent call last):
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/scripts/train_new.py", line 131, in
[rank1]: trainer.fit(flow, data, ckpt_path=train_config.resume_ckpt)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank1]: results = self._run_stage()
[rank1]: ^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank1]: self.advance()
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank1]: self.epoch_loop.run(self._data_fetcher)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank1]: self.advance(data_fetcher)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 269, in advance
[rank1]: call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 218, in _call_callback_hooks
[rank1]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 99, in on_train_batch_end
[rank1]: self._save_last_checkpoint(trainer, monitor_candidates, pl_module) # only save the last checkpoint
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 133, in _save_last_checkpoint
[rank1]: self._save_checkpoint(trainer, filepath, pl_module)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 146, in _save_checkpoint
[rank1]: self._save_flow_checkpoint(trainer, pl_module, filepath)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 171, in _save_flow_checkpoint
[rank1]: os.makedirs(new_dirpath)
[rank1]: File "", line 225, in makedirs
[rank1]: FileExistsError: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'

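This seems to happen because training runs in parallel and every GPU process (rank) tries to save its own checkpoint, so more than one rank calls os.makedirs on the same checkpoints/flow directory and the later callers fail with FileExistsError. How can I solve this problem?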
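
One possible workaround, sketched under assumptions: the traceback ends at `os.makedirs(new_dirpath)` in `_save_flow_checkpoint`, and `os.makedirs` raises `FileExistsError` when the directory is already there unless `exist_ok=True` is passed. The helper names `ensure_checkpoint_dir` and `simulate_ranks` below are hypothetical and not part of VideoTuna; the threads only stand in for the 8 ranks racing to create the same directory.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_checkpoint_dir(dirpath: str) -> str:
    """Create dirpath if needed; safe to call from several ranks at once."""
    # exist_ok=True turns makedirs into a no-op when the directory already
    # exists, so a rank that loses the race does not raise FileExistsError.
    os.makedirs(dirpath, exist_ok=True)
    return dirpath


def simulate_ranks(dirpath: str, world_size: int = 8) -> list:
    """Hypothetical stand-in for the ranks: threads creating one directory."""
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        return list(pool.map(ensure_checkpoint_dir, [dirpath] * world_size))


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    target = os.path.join(root, "checkpoints", "flow")
    simulate_ranks(target)
    print(os.path.isdir(target))
```

Applied to the callback, this would amount to changing the failing call to `os.makedirs(new_dirpath, exist_ok=True)`. Whether every rank should then write into that directory, or only rank 0 (for example behind a `trainer.is_global_zero` check), depends on how the training strategy shards the model; this sketch only removes the directory-creation race.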
