Hello, I am training wan-2.1-i2v-14b on 8 A800 GPUs, but when the checkpoint is saved I get the following error:
Training failed: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'
[rank1]: Traceback (most recent call last):
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/scripts/train_new.py", line 131, in
[rank1]: trainer.fit(flow, data, ckpt_path=train_config.resume_ckpt)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank1]: results = self._run_stage()
[rank1]: ^^^^^^^^^^^^^^^^^
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank1]: self.advance()
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank1]: self.epoch_loop.run(self._data_fetcher)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank1]: self.advance(data_fetcher)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 269, in advance
[rank1]: call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
[rank1]: File "/hetu_group/zuofei/env/wan_sft/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 218, in _call_callback_hooks
[rank1]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 99, in on_train_batch_end
[rank1]: self._save_last_checkpoint(trainer, monitor_candidates, pl_module) # only save the last checkpoint
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 133, in _save_last_checkpoint
[rank1]: self._save_checkpoint(trainer, filepath, pl_module)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 146, in _save_checkpoint
[rank1]: self._save_flow_checkpoint(trainer, pl_module, filepath)
[rank1]: File "/mmu_mllm_hdd_2/zuofei/VideoTuna/videotuna/utils/callbacks.py", line 171, in _save_flow_checkpoint
[rank1]: os.makedirs(new_dirpath)
[rank1]: File "", line 225, in makedirs
[rank1]: FileExistsError: [Errno 17] File exists: '/mmu_mllm_hdd_2/zuofei/VideoTuna/results/train/train_wanvideo_i2v_fullft_20250612153116/checkpoints/flow'
It seems this happens because the model is trained in parallel, so every GPU rank runs the checkpoint-saving callback and they collide when creating the same checkpoint directory. How can I solve this problem?
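From the traceback, the failure is the bare os.makedirs(new_dirpath) call in videotuna/utils/callbacks.py (line 171): every rank executes the callback, and whichever rank tries to create the directory second raises FileExistsError. I guess a fix might look roughly like the sketch below (the helper name and arguments are placeholders of mine, not VideoTuna's actual API): either make the directory creation idempotent or restrict the filesystem work to rank 0. Would that be the right approach?

```python
import os

def save_flow_checkpoint(trainer, dirpath: str) -> None:
    """Hypothetical helper showing two ways to avoid the rank collision."""
    # Option 1: tolerate the directory already existing, so whichever rank
    # arrives second no longer raises FileExistsError.
    os.makedirs(dirpath, exist_ok=True)

    # Option 2: do the filesystem work on the main process only, then sync.
    if trainer.is_global_zero:
        pass  # ... write the checkpoint files into `dirpath` here ...
    trainer.strategy.barrier()  # keep all ranks in step after the save
```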