Conversation


@zhipuch zhipuch commented Nov 12, 2024

SFT with multi-GPU and gradient accumulation fails with a gradient_norm_before_clip undefined error.
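For context, this is the usual gradient-accumulation pattern: the gradient norm is only computed on micro-steps where accelerator.sync_gradients is true, but the log dict references it on every step. Below is a minimal sketch of the pattern and one possible guard; the model, data, and loss are toy stand-ins rather than the script's actual code, and initializing the variable up front is only one way to avoid the error.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy stand-ins so the sketch runs end to end; the real script trains a
# CogVideoX transformer on video batches.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)
max_grad_norm = 1.0

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# One possible fix: give the variable a value before the loop so the log dict
# can always reference it, even on micro-steps where gradients are not synced.
gradient_norm_before_clip = None

for step, batch in enumerate(dataloader):
    with accelerator.accumulate(model):
        loss = model(batch).pow(2).mean()  # placeholder loss
        accelerator.backward(loss)

        # With gradient accumulation, sync_gradients is False on most
        # micro-steps, so this branch (and the assignment) is skipped.
        if accelerator.sync_gradients:
            gradient_norm_before_clip = accelerator.clip_grad_norm_(
                model.parameters(), max_grad_norm
            )

        optimizer.step()
        optimizer.zero_grad()

    # Without the initialization above (or an is-not-None check here), this
    # reference fails on non-sync steps; inside main() in the training script
    # that is the UnboundLocalError shown in the traceback further down.
    accelerator.log({"gradient_norm_before_clip": gradient_norm_before_clip}, step=step)
```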


@sayakpaul sayakpaul left a comment


Thank you!

@sayakpaul sayakpaul requested a review from a-r-r-o-w November 24, 2024 03:13
@glide-the
Contributor

There seems to be a bug in how accelerator.is_main_process and accelerator.distributed_type are used: the is_main_process checks are scheduled inconsistently between the training-initialization and training stages. Line 540 has the same problem:
tracker_name = args.tracker_name or "cogvideox-sft"
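For reference, tracker setup around that line usually looks something like the sketch below. This assumes the script's existing accelerator and args objects and the standard accelerate API; it is not the exact code at line 540. The question raised above is when this main-process-only work runs relative to the rest of the distributed setup.

```python
# Rough sketch, not the actual script code: tracker setup is main-process-only
# work, and where it runs relative to the rest of the distributed setup is the
# scheduling concern described above.
if accelerator.is_main_process:
    tracker_name = args.tracker_name or "cogvideox-sft"
    accelerator.init_trackers(tracker_name, config=vars(args))

# Keep the other ranks from racing ahead while the main process finishes
# tracker/output-directory initialization.
accelerator.wait_for_everyone()
```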

@glide-the glide-the merged commit 62a82a8 into huggingface:main Nov 27, 2024
glide-the added a commit that referenced this pull request Nov 30, 2024
```
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 1005, in <module>
[rank1]:     main(args)
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 884, in main
[rank1]:     "gradient_norm_before_clip": gradient_norm_before_clip,
[rank1]: UnboundLocalError: local variable 'gradient_norm_before_clip' referenced before assignment
```
ref1: #84
ref2: #100
There seems to be a bug in how accelerator.is_main_process and accelerator.distributed_type are used: the is_main_process checks are scheduled inconsistently between the training-initialization and training stages.
@glide-the glide-the mentioned this pull request Nov 30, 2024
a-r-r-o-w pushed a commit that referenced this pull request Nov 30, 2024
adricwht pushed a commit to adricwht/finetrainers-patches that referenced this pull request May 9, 2025