Conversation


@zhipuch zhipuch commented Nov 12, 2024

SFT with multi-GPU and gradient accumulation fails with a gradient_norm_before_clip undefined error.
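For context, this is the usual gradient-accumulation pattern: the gradient norm is only computed on micro-steps where accelerator.sync_gradients is true, but the log dict references it on every step. Below is a minimal sketch of the pattern and one possible guard; the model, data, and loss are toy stand-ins rather than the script's actual code, and initializing the variable up front is only one way to avoid the error.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy stand-ins so the sketch runs end to end; the real script trains a
# CogVideoX transformer on video batches.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)
max_grad_norm = 1.0

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# One possible fix: give the variable a value before the loop so the log dict
# can always reference it, even on micro-steps where gradients are not synced.
gradient_norm_before_clip = None

for step, batch in enumerate(dataloader):
    with accelerator.accumulate(model):
        loss = model(batch).pow(2).mean()  # placeholder loss
        accelerator.backward(loss)

        # With gradient accumulation, sync_gradients is False on most
        # micro-steps, so this branch (and the assignment) is skipped.
        if accelerator.sync_gradients:
            gradient_norm_before_clip = accelerator.clip_grad_norm_(
                model.parameters(), max_grad_norm
            )

        optimizer.step()
        optimizer.zero_grad()

    # Without the initialization above (or an is-not-None check here), this
    # reference fails on non-sync steps; inside main() in the training script
    # that is the UnboundLocalError shown in the traceback further down.
    accelerator.log({"gradient_norm_before_clip": gradient_norm_before_clip}, step=step)
```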


@sayakpaul sayakpaul left a comment


Thank you!

@sayakpaul sayakpaul requested a review from a-r-r-o-w November 24, 2024 03:13
@glide-the
Contributor

There seems to be a bug in how accelerator.is_main_process and accelerator.distributed_type are used: the is_main_process checks are scheduled inconsistently between the training-initialization and training stages. Line 540 has the same problem:
tracker_name = args.tracker_name or "cogvideox-sft"
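For reference, tracker setup around that line usually looks something like the sketch below. This assumes the script's existing accelerator and args objects and the standard accelerate API; it is not the exact code at line 540. The question raised above is when this main-process-only work runs relative to the rest of the distributed setup.

```python
# Rough sketch, not the actual script code: tracker setup is main-process-only
# work, and where it runs relative to the rest of the distributed setup is the
# scheduling concern described above.
if accelerator.is_main_process:
    tracker_name = args.tracker_name or "cogvideox-sft"
    accelerator.init_trackers(tracker_name, config=vars(args))

# Keep the other ranks from racing ahead while the main process finishes
# tracker/output-directory initialization.
accelerator.wait_for_everyone()
```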

@glide-the glide-the merged commit 62a82a8 into huggingface:main Nov 27, 2024
glide-the added a commit that referenced this pull request Nov 30, 2024
```
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 1005, in <module>
[rank1]:     main(args)
[rank1]:   File "/mnt/ceph/develop/jiawei/cogvideox-distillation/training/cogvideox_image_to_video_lora.py", line 884, in main
[rank1]:     "gradient_norm_before_clip": gradient_norm_before_clip,
[rank1]: UnboundLocalError: local variable 'gradient_norm_before_clip' referenced before assignment
```
ref1: #84
ref2: #100
There seems to be a bug in how accelerator.is_main_process and accelerator.distributed_type are used: the is_main_process checks are scheduled inconsistently between the training-initialization and training stages.
@glide-the glide-the mentioned this pull request Nov 30, 2024
a-r-r-o-w pushed a commit that referenced this pull request Nov 30, 2024
adricwht pushed a commit to adricwht/finetrainers-patches that referenced this pull request May 9, 2025