Skip to content

Bunch of FSDP improvements#3671

Merged
S1ro1 merged 4 commits into
mainfrom
fsdp-split-tests
Jul 9, 2025
Merged

Bunch of FSDP improvements#3671
S1ro1 merged 4 commits into
mainfrom
fsdp-split-tests

Conversation

@S1ro1

@S1ro1 S1ro1 commented Jul 7, 2025

Copy link
Copy Markdown
Contributor

First I intended just to make the tests better, but I found some issues with the underlying code so I'm fixing that here too (can split if needed), most of it related to state loading, which somehow didn't get caught in the CI

  1. remove dynamic FSDP test shenanigans (tests/fsdp/test_fsdp.py), we now just inherit the class and run like that
  2. Scaler loading broken with FSDP2 because scale wasn't set correctly
  3. broadcast_from_rank_0 requires all ranks to get into the distributed section with FSDP2 model loading, that wasn't the case as with sync_module_states which we used in FSDP1 we just exit on non-main ranks.

Comment thread src/accelerate/state.py
if self.backend == "tccl":
local_rank = os.environ.get("LOCAL_RANK", -1)
torch.sdaa.set_device(f"sdaa:{local_rank}")
if (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had to be moved, as it was in a deepspeed branch before, which was wrong

Comment thread src/accelerate/utils/fsdp_utils.py Outdated
if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
if type(model) is not FSDP and accelerator.process_index != 0:
if (
type(model) is not FSDP and not isinstance(model, torch.distributed.fsdp.FSDPModule)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In here, we actually early exit if FSDP1, which shouldn't be the case in FSDP2 as the set_model_state_dict timeouts

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! Thanks for the improvements !

@S1ro1 S1ro1 merged commit d6c986c into main Jul 9, 2025
29 checks passed
@S1ro1 S1ro1 deleted the fsdp-split-tests branch July 9, 2025 14:05
S1ro1 added a commit that referenced this pull request Jul 9, 2025
* Feat: split tests

* Feat: finito

* Fix

* Final, tests pass
S1ro1 added a commit that referenced this pull request Jul 10, 2025
S1ro1 added a commit that referenced this pull request Jul 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants