@SunMarc

FSDP2 Improvements

This release brings a large batch of FSDP2 fixes and quality-of-life improvements: correct dtype handling on load, sharding of embeddings/norms, QLoRA crash prevention, and a more robust auto-wrap policy.

Fsdp2 fully_shard embedding and norm by @SunMarc in #4015
Fix fsdp2 load full state dict dtype mismatch by @SunMarc in #4021
Fix region compilation fsdpv2 by @SunMarc in #4022
[FSDP2] Cast model to uniform dtype before fully_shard to fix mixed-dtype AssertionError by @roycho96 in #3985
[FSDP2] Auto-exclude non-floating frozen Params4bit from fully_shard to prevent QLoRA crash by @roycho96 in #3987
fix(FSDP2): auto-wrap policy ignoring _no_split_modules fallback by @JohnGiorgi in #3999
fix: use key-based matching in fsdp2_load_full_state_dict by @roycho96 in #3982
fix: add missing model_has_params4bit guard to fsdp2_load_full_state_dict call by @roycho96 in #3981
Fix to-fsdp2: drop REMOVED / NOT_YET_IMPLEMENTED FSDP1 keys instead of leaking them by @lollinng in #4065
Prevent double-wrapping models in prepare_model() by @joshuaswanson in #3977

AMD ROCm support

Accelerate now works end-to-end on AMD ROCm devices. Thanks @Abdennacer-Badaoui!

Make accelerate work end-to-end on AMD ROCm by @Abdennacer-Badaoui in #4025

Neuron

Further Neuron improvements to reduce recompilation and cover missing device cases.

Add padded allgather and broadcast for Neuron devices to reduce recompilation by @czkkkkkk in #4000
fix: add missing neuron device case by @michaelbenayoun in #4042

Quantization & Offloading

We improved offloading support for quantized models, including Torchao, int8, and tied-weight handling.

Torchao offload by @SunMarc in #3973
Fix int8 offload hook detachment statistics restoration by @jiqing-feng in #4044
Fix keep_in_fp32_modules not working for tied weights in load_and_quantize_model by @jiqing-feng in #4043
Fix dtype_byte_size for FP8 fnuz / e8m0fnu dtypes by @lollinng in #4063

Data Loading

Feat: Support dynamic batch size in BatchSamplerShard with even_batches by @yuxinyuan in #3969
Fix iterable dataset sharding condition when n_shards == num_processes by @SunMarc in #3958
Fix implicit padding in split_between_processes when apply_padding=False and num_samples < num_processes by @3manifold in #4052

Minor fixes

[DeepSpeed] allow kernels flash-attn in SP by @kashif in #3959
Fix: Conditionally import torch.distributed.algorithms.join in accelerator.py by @0xDELUXA in #3962
Fix is_hf_initialized attribute by @SunMarc in #3976
feat(utils): add max reduction type by @imstevenpmwork in #4027
fix(state): make MLU backend part of the _prepare_backend elif chain by @Anai-Guo in #4057
fix notebook launcher cuda init by @SunMarc in #4059
pytorch-triton-xpu rename to triton-xpu by @sywangyi in #4007
Relax numerical tolerance for XPU in test_big_modeling by @YangKai0616 in #4001
Fix gloo backend error in test_load_checkpoint_and_dispatch_with_broadcast on XPU by @kaixuanliu in #4056
Raise ValueError instead of a bare string in ParallelismConfig.get_device_mesh by @lollinng in #4064
tests: Gracefully handle missing set_device for mps by @booxter in #4028
test: add regression test for no_split_module_classes accepting set type by @UFO0506 in #4048
Fix all tests by @SunMarc in #4072
docs: add aggregate profiler memory example by @aryanputta in #4054
DOC: document missing parameters in load_accelerator_state, find_executable_batch_size, and send_to_device by @kratos0718 in #4051
docs: Fix docstring of fsdp2_prepare_auto_wrap_policy by @slocoro in #4037
Fix DistributedType documentation by @3manifold in #3980
Fix grammar, spelling, and consistency issues across docs and examples by @cihandemir in #3961
docs: fix typos in docstrings, comments, and user docs by @mokashang in #4040
chore: update doc-builder workflow SHA by @rtrompier in #4009
chore: bump doc-builder SHA for main doc build workflow by @rtrompier in #4018
[CI] Bump style-bot SHA + switch to GitHub App by @paulinebm in #4031
Fix TrackioTracker.log() ignoring step parameter by @joshuaswanson in #3975
fix: pass step parameter in TrackioTracker.log() by @liuyun7345 in #3970
fix(tracking): default step=None on tracker.log and accept extra kwargs in MLflowTracker by @1fanwang in #4039
Fix MLflowTracker.store_init_configuration mutating the caller's config dict by @ATOM00blue in #4046
fix(tracker): guard init_trackers and log against None kwargs by @xodn348 in #4026
🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #3992
chore: update build-docker-images-release.yml by @hf-security-analysis[bot] in #4069
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #4049
Bump the actions group with 8 updates by @dependabot[bot] in #4068

Full Changelog: v1.13.0...v1.14.0

@michaelbenayoun

AWS Neuron support

We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.

Neuron integration by @michaelbenayoun in #3935

XPU Improvements

We've removed the IPEX dependency and improved device-agnostic code for XPU.

using spawn instead of fork for XPU device by @kaixuanliu in #3884
Remove ipex by @yao-matrix in #3883
enhance new codes to XPU, and make them be device agnostic by @yao-matrix in #3890
Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in #3912

FSDP2 Improvements

We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

Upcast FSDP2 parameters only if requires_grad by @ojh31 in #3848
Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in #3878
bug: fsdp cannot load optimizer state using dcp by @flymin in #3904
fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in #3905
Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in #3924

DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.

[SP] fix loss computation example by @kashif in #3858
[SP and CP] error out if both CP and SP enabled by @kashif in #3862
DeepSpeed has its own process group by @kashif in #3916
[Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in #3914
Enable evaluation during deepspeed Sequence Parallel by @jp1924 in #3917

FP8

We've enhanced FP8 training. Thanks @shimizust for fixing torchao support.

Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in #3831
Fix execution with Transformer Engine by @ksivaman in #3852
add MS-AMP deprecation warnings by @neha222222 in #3857

Performance

Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.

Faster import by @SunMarc in #3953
lazy compile disable by @SunMarc in #3947
Disable hook compile by @SunMarc in #3888

Minor fixes

Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in #3850
fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in #3845
Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in #3881
Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in #3895
Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in #3931
Fix stateful dataloader DDP by @SunMarc in #3952
Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in #3886
Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in #3872
Fix logging logic when in_order is set to True by @yuxinyuan in #3280
Fix cpu offload check by @SunMarc in #3946
fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in #3910
Fix async compatibility across python versions by @SunMarc in #3901
fix tp only bug by @sywangyi in #3908
fix parallelism_config None error by @jp1924 in #3927
Np parall fix by @sywangyi in #3900
change the default value of fsdp_min_num_params to int by @CodeMan62 in #3902
Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in #3944
Prepare TP fix by @michaelbenayoun in #3945
feat: added fine tuning example focused on TPUs by @tengomucho in #3847
Remove 8bit force hook for bnb by @SunMarc in #3907
docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in #3929
fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in #3922
Updating support of Megatron-LM by @pengdurice in #3842
Update support of Megatron-LM PR 2 by @pengdurice in #3887
Fix RNG state setting for HPU by @michaelbenayoun in #3936
fix: load the HPU RNG state by @michaelbenayoun in #3937

@S1ro1

Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper https://arxiv.org/abs/2506.13996 or this deepspeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.

To enable Deepspeed Ulysses, first create a ParallelismConfig and set the SP-related arguments:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)

Then, make sure to compute the correct loss, as described in our docs:

        ...
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)
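
To see why each rank's loss is weighted by its count of non-ignored tokens, here is a small single-process sketch in plain Python. The per-rank values are made up for illustration and no process group is involved; it only checks that the token-weighted average matches the mean over all valid tokens, while a naive average of per-rank means does not.

```python
# Hypothetical per-rank data: each rank holds one shard of the sequence.
# Each inner list holds per-token losses for tokens whose label is not -100.
per_token_losses_by_rank = [
    [2.0, 4.0, 6.0],  # rank 0: 3 valid tokens
    [1.0],            # rank 1: 1 valid token
    [],               # rank 2: every label is -100, contributes nothing
]


def mean(values):
    return sum(values) / len(values) if values else 0.0


# What each rank computes locally: a mean loss and a valid-token count.
losses_per_rank = [mean(v) for v in per_token_losses_by_rank]
good_tokens_per_rank = [len(v) for v in per_token_losses_by_rank]

# Token-weighted aggregation, mirroring the snippet above.
total_loss = sum(
    loss * n for loss, n in zip(losses_per_rank, good_tokens_per_rank) if n > 0
)
total_good_tokens = sum(good_tokens_per_rank)
weighted_loss = total_loss / max(total_good_tokens, 1)

# Reference: mean over all valid tokens, as if computed on one device.
all_tokens = [x for v in per_token_losses_by_rank for x in v]
global_mean = mean(all_tokens)

# A naive average of per-rank means over-weights the short and empty shards.
naive_mean = mean(losses_per_rank)

print(weighted_loss, global_mean)  # 3.25 3.25
```

In a real run the gathering is done with the differentiable all_gather shown above, so gradients flow back to every rank.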

Thanks @S1ro1 for starting this work and @stas00 for finishing it. Also thanks @kashif for adding docs and reviewing/testing this PR!

This feature will also be available in HF Trainer thanks to this PR from @stas00: huggingface/transformers#41832

Minor changes

Remove warning for cpu_ram_efficient_loading by @SunMarc in #3816
update typo in bnb quantisation 4bit flag docstring by @hbraith in #3828
ArXiv -> HF Papers by @qgallouedec in #3834
Fix typo in broadcast_object_list docstring by @wsntxxn in #3823
[Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by @naomili0924 in #3835
use self hosted runner by @SunMarc in #3841
device type helper by @kashif in #3843

New Contributors

@hbraith made their first contribution in #3828
@wsntxxn made their first contribution in #3823
@naomili0924 made their first contribution in #3835

Full Changelog: v1.11.0...v1.12.0

@pstjohn

TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use it, set use_mxfp8_block_scaling in fp8_config. See the NVIDIA docs here: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling

Add support for TE MXFP8 recipe in accelerate by @pstjohn in #3688

FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass mixed_precision="fp16" or mixed_precision="bf16" when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).

Add bf16/fp16 support for amp with mps device by @SunMarc in #3373

FSDP updates

The following PRs add support for ignored_params and no_sync(), respectively, for FSDPv2:

feat: add ignored_params support for fsdp2 by @kmehant in #3731
fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in #3762

Mixed precision can now be passed as a dtype string, either through the accelerate CLI flag or through fsdp_config in the accelerate config file:

feat: allow mixed precision policy as dtype by @kmehant in #3751

Nd-parallel updates

Some minor updates concerning nd-parallelism.

Context Parallelism docs typos fixed by @sergiopaniego in #3761
Feat: add to_json by @S1ro1 in #3743
make torch_native_parallelism examples device agnostic by @yao-matrix in #3759
[ND Parallel] Update examples, cleanup by @S1ro1 in #3737

Bump to Python 3.10

We've dropped support for Python 3.9, as it reached EOL in October.

Bump to python3.10 + update linter by @SunMarc in #3809

Lots of minor fixes:

fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in #3740
xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in #3756
Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in #3744
fix: specify device for process_tensor in example usage by @qgallouedec in #3755
Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in #3776
Fix (skip) cuda cache flush when origin device is cpu and offloaded to meta by @Qubitium in #3796
Fix convert LayerNorm without bias to fp8 by @mjun0812 in #3725
Add optional typing by @cyyever in #3769
refactor: Use with in Accelerator.autocast() instead of __enter__() and __exit__() for more elegant style. by @EquationWalker in #3767
switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in #3773
fix FSDP2 test case failure on XPU by @yao-matrix in #3771
Fix tests by @SunMarc in #3722
Protect import for device_mesh by @SunMarc in #3742
Fix SWANLAB_MODE by @SunMarc in #3808
Fix tracking swanlab by @SunMarc in #3810
refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in #3815
Remove deprecated FindTiedParametersResult by @cyyever in #3786
remove mlflow from testing by @SunMarc in #3783
enable 2 model hook ut cases on XPU by @yao-matrix in #3774
Added Tip for better rendering by @sergiopaniego in #3781
Fix typos by @cyyever in #3753
fix: torch_npu import error in some envs by @yanyongyu in #3764
Fix: typo makes tests fail by @S1ro1 in #3765
fix Muti node CUDA error: invalid device ordinal #3775 by @RicardoDominguez in #3779
use reset_peak_memory_stats on xpu by @yao-matrix in #3772

New Contributors

@mjun0812 made their first contribution in #3725
@sergiopaniego made their first contribution in #3761
@EquationWalker made their first contribution in #3762
@yanyongyu made their first contribution in #3764
@RicardoDominguez made their first contribution in #3779
@SamuelBarryCS made their first contribution in #3776
@Qubitium made their first contribution in #3796

Full Changelog: v1.10.1...v1.11.0

@S1ro1

Feat: add to_json by @S1ro1 in #3743
Protect import for device_mesh by @SunMarc in #3742.

Full Changelog: v1.10.0...v1.10.1

@salmanmohammadi

N-D Parallelism

Training large models across multiple GPUs can be complex, especially when combining different parallelism strategies (e.g., TP, CP, DP). To simplify this process, we've collaborated with Axolotl to introduce an easy-to-use integration that allows you to apply any combination of parallelism strategies directly in your training script. Just pass a ParallelismConfig specifying the size of each parallelism type; it's that simple.
Learn more about how it works in our latest blog post.

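from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from transformers import AutoModelForCausalLM

# Size of each parallelism dimension
parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
    ...
)
# Load the model onto the device mesh built by Accelerate, then prepare it
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)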
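
As a quick sanity check on the configuration above: each process occupies one coordinate in the device mesh, so the number of processes should equal the product of the parallelism sizes. The sketch below is plain Python and does not call Accelerate; required_world_size is a hypothetical helper introduced only for illustration.

```python
import math

# Parallelism degrees taken from the example above.
sizes = {
    "dp_replicate_size": 2,
    "dp_shard_size": 2,
    "cp_size": 2,
    "tp_size": 2,
}


def required_world_size(sizes):
    """Number of processes needed for a mesh with these degrees (hypothetical helper)."""
    return math.prod(sizes.values())


world_size = required_world_size(sizes)
print(world_size)  # 16
```

With these sizes the example would need 16 processes (for instance, two nodes with 8 GPUs each); shrinking any dimension to 1 halves that requirement.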

Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) by @salmanmohammadi in #3682
Feat: context parallel v2.0 by @S1ro1 in #3700
set default submesh_tp_size to prevent unset local variable error by @winglian in #3687
Add Parallelism getter property to Accelerator class by @WoosungMyung in #3703
Fix: prepare works even if nothing except tp specified (rare) by @S1ro1 in #3707
Set parallelism_config in constructor due to Trainer reset of State by @winglian in #3713
Fix: tp size wouldn't read from env by @S1ro1 in #3716
Remove ParallelismConfig from PartialState by @SunMarc in #3720

FSDP improvements

We've fixed the ignored_modules attribute. With this, it is now possible to train PEFT models with MoE layers that contain q_proj and v_proj parameters. This is especially important for fine-tuning the gpt-oss model.

ENH: Allow FSDP ignored modules to be regex by @BenjaminBossan in #3698
TST Add test for FSDP ignored_modules as str by @BenjaminBossan in #3719
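
To illustrate what passing a regex string for ignored modules might look like, the sketch below filters a list of module names with Python's re module. The module names and the pattern are hypothetical, and matching against the full module name is an assumption for illustration; check the FSDP plugin documentation for the exact matching rule.

```python
import re

# Hypothetical module names, loosely modelled on an MoE transformer.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.experts",
    "model.layers.0.mlp.router",
    "model.layers.1.mlp.experts",
]

# Example pattern: ignore every expert container, in any layer.
pattern = re.compile(r".*\.mlp\.experts")

# Assumption: a module is ignored when its full name matches the pattern.
ignored = [name for name in module_names if pattern.fullmatch(name)]
sharded = [name for name in module_names if name not in ignored]

print(ignored)
```

A single pattern like this avoids listing every layer's expert module by hand, which is the convenience the regex option is meant to provide.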

Minor improvements

feature: CpuOffload pre_forward don't attempt to move if already on device by @JoeGaffney in #3695
Fix: Ensure environment variable values are case-insensitive in Accelerate by @jp1924 in #3712
remove use_ipex by @SunMarc in #3721

New Contributors

@salmanmohammadi made their first contribution in #3682
@WoosungMyung made their first contribution in #3703
@jp1924 made their first contribution in #3712
@JoeGaffney made their first contribution in #3695

Full Changelog: v1.9.0...v1.10.0

@pcuenca

Trackio tracker support

We've added support for trackio, a lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.

Main features are:

Local-first design: dashboard runs locally by default. You can also host it on Spaces by specifying a space_id.
Persists logs locally (or in a private Hugging Face Dataset)
Visualize experiments with a Gradio dashboard locally (or on Hugging Face Spaces)
Everything here, including hosting on Hugging Face, is free!

To use it with accelerate, you need to set log_with and initialize the trackers:

from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
config = {"learning_rate": 0.001, "batch_size": 32}
# init_kwargs in order to host the dashboard on spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)

Thanks @pcuenca for the integration !

trackio by @pcuenca in #3669

Model loading speedup when relying on `set_module_tensor_to_device`

Setting a tensor while clearing the cache is very slow, so we added a clear_cache option to disable it.
Another small optimization is using non_blocking everywhere and syncing just before returning control to the user. This makes loading slightly faster.

Speedup model loading by 4-5x in Diffusers ⚡ by @a-r-r-o-w in #3674

FSDP, Deepspeed, FP8 minor improvements

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in #3640
Fix FP8 tests, enable FP8 to be used without direct Accelerator() configuring by @pstjohn in #3677
Bunch of FSDP improvements by @S1ro1 in #3671
Fix: properly error when DDP + Dtensor model by @S1ro1 in #3629
Fix fsdp2 example typo by @shimizust in #3657
Added a check in no_sync() to avoid errors when using deepspeed zero2/3 by @xliu0105 in #3656

🚨🚨🚨 Breaking changes 🚨🚨🚨

find_executable_batch_size() no longer halves the batch size after every OOM. Instead, it multiplies the batch size by 0.9. This should help users avoid wasting GPU capacity.

“Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by @SunMarc in #3684
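
To make the effect of the new back-off concrete, the toy simulation below compares the two factors. It does not call Accelerate: simulate_backoff is a hypothetical helper, the memory limit is invented, and truncating the scaled batch size to an integer is an assumption made for illustration.

```python
def simulate_backoff(start, largest_that_fits, factor):
    """Return the batch sizes tried until one fits (toy model of OOM retries)."""
    tried = [start]
    batch_size = start
    while batch_size > largest_that_fits:
        batch_size = int(batch_size * factor)  # assumption: truncate to int
        tried.append(batch_size)
    return tried


# Hypothetical setup: we start at 128, but at most 100 samples fit in memory.
old = simulate_backoff(128, 100, 0.5)
new = simulate_backoff(128, 100, 0.9)

print(old)  # [128, 64]
print(new)  # [128, 115, 103, 92]
```

In this made-up scenario, halving settles on 64 and leaves over a third of the usable capacity idle, while the 0.9 factor settles on 92 at the cost of two extra retries.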

What's Changed

[typo] shards instead of shard by @SunMarc in #3645
Docs: Fix typos in gradient accumulation guide by @kilavvy in #3649
xpu enablement on left cases by @yao-matrix in #3654
unpin datasets in examples requirements by @SunMarc in #3681
fix: wandb config not saved in offline mode by @ved1beta in #3648
accelerate/data_loader.py: do not yield if the base_dataloader is empty by @0xnightwind in #3659
warn for invalid keys by @ved1beta in #3613
Update Gaudi runner image to latest SynapseAI and enable previously disabled tests by @IlyasMoutawwakil in #3653

New Contributors

@kilavvy made their first contribution in #3649
@shimizust made their first contribution in #3657
@xliu0105 made their first contribution in #3656
@0xnightwind made their first contribution in #3659

Full Changelog: v1.8.1...v1.9.0

@IlyasMoutawwakil

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in #3640
shards by @SunMarc in #3645

Full Changelog: v1.8.0...v1.8.1

@S1ro1

FSDPv2 refactor + FP8 support

We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!

[FSDP2] Refactor + FP8 by @S1ro1 in #3585

Faster Distributed Training on Intel CPUs

We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.

Set ccl and KMP param in simple launch by @jiqing-feng in #3575

Regional Compilation for DeepSpeed

We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!

Fix deepspeed regional compilation by @IlyasMoutawwakil in #3609

ipex.optimize deprecation

ipex.optimize is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we’ll continue to rely on IPEX for now.

remove ipex.optimize in accelerate by @yao-matrix in #3608

Better XPU Support

We've greatly expanded and stabilized support for Intel XPUs:

enable fsdp2 benchmark on XPU by @yao-matrix in #3590
enable big_model_inference on xpu by @yao-matrix in #3595
enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
enable test_cli & test_example cases on XPU by @yao-matrix in #3578
enable torchao and pippy test cases on XPU by @yao-matrix in #3599
enable regional_compilation benchmark on xpu by @yao-matrix in #3592
fix xpu 8bit value loading by @jiqing-feng in #3623
add device-agnostic GradScaler by @yao-matrix in #3588
add xpu support in TorchTensorParallelPlugin by @yao-matrix in #3627

Trackers

We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution ! We also deferred all tracker initializations to prevent premature setup of distributed environments.

Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in #3581

What's Changed

Fix bf16 training with TP by @SunMarc in #3610
better handle FP8 with and without deepspeed by @IlyasMoutawwakil in #3611
Update Gaudi Runners by @IlyasMoutawwakil in #3593
goodbye torch_ccl by @yao-matrix in #3580
Add support for standalone mode when default port is occupied on single node by @laitifranz in #3576
Resolve logger warnings by @emmanuel-ferdman in #3582
Add kwargs to optimizer, scheduler and dataloader using function accelerator().load_state() by @luiz0992 in #3540
[docs] no hard-coded cuda in the ddp documentation by @faaany in #3589
change to use torch.device by @yao-matrix in #3594
Fix: list object has no attribute keys by @S1ro1 in #3603
Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in #3587
Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in #3619
Add fp8_e5m2 support in dtype_byte_size by @SunMarc in #3625
[Deepspeed] deepspeed auto grad accum by @kashif in #3630
Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in #3631
Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
Fix Typos in Documentation and Comments by @leopardracer in #3621
feat: use datasets.IterableDataset shard if possible by @SunMarc in #3635
[DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in #3632
Feat: add cpu offload by @S1ro1 in #3636
Fix: correct labels for fsdp2 examples by @S1ro1 in #3637
fix grad acc deepspeed by @SunMarc in #3638

New Contributors

@laitifranz made their first contribution in #3576
@emmanuel-ferdman made their first contribution in #3582
@yuanjua made their first contribution in #3581
@sorgfresser made their first contribution in #3587
@ShaohonChen made their first contribution in #3605
@leopardracer made their first contribution in #3621

Full Changelog: v1.7.0...v1.8.0

@IlyasMoutawwakil

Regional compilation

Instead of compiling the entire model at once, regional compilation targets repeated blocks (such as decoder layers) first. This allows the compiler to cache and reuse optimized code for subsequent blocks, significantly reducing the cold start compilation time typically seen during the first inference. Thanks @IlyasMoutawwakil for the feature ! You can view the full benchmark here, and check out our updated compilation guide for more details!

To enable this feature, set use_regional_compilation=True in the TorchDynamoPlugin configuration.

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    ... # other parameters
)
# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
# This will apply compile_regions to your model
model = accelerator.prepare(model)
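
The caching argument behind regional compilation can be illustrated with a toy model in plain Python. ToyCompiler below is a stand-in that memoizes by block type; it is not torch.compile and does not reflect how Dynamo guards or caches work internally. The layer count is hypothetical.

```python
class ToyCompiler:
    """Stand-in for a compiler that caches results by block type (not torch.compile)."""

    def __init__(self):
        self.cache = {}
        self.compilations = 0

    def compile(self, block_type):
        # Only the first block of a given type pays the compilation cost.
        if block_type not in self.cache:
            self.compilations += 1
            self.cache[block_type] = f"compiled<{block_type}>"
        return self.cache[block_type]


# Hypothetical model: an embedding, 32 identical decoder layers, and a head.
blocks = ["Embedding"] + ["DecoderLayer"] * 32 + ["LMHead"]

compiler = ToyCompiler()
compiled_blocks = [compiler.compile(b) for b in blocks]

print(len(blocks), compiler.compilations)  # 34 3
```

Under this simplified view, 34 blocks trigger only 3 compilations because the 32 decoder layers share one cached result, which is the intuition behind the shorter cold start.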

Layerwise casting hook

We've introduced a new hook that enables per-layer upcasting and downcasting (e.g., for Linear layers) during inference. This allows users to run models with separate storage and compute dtypes, resulting in memory savings. The concept was first implemented in diffusers, where downcasting models to FP8 proved effective without major quality degradation. Contributed by @sayakpaul in #3427

model = ...  # any torch.nn.Module
storage_dtype = torch.float8_e4m3fn
compute_dtype = torch.bfloat16
attach_layerwise_casting_hooks(
    model,
    storage_dtype=storage_dtype,
    compute_dtype=compute_dtype,
)
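
A rough back-of-the-envelope calculation shows where the memory savings come from. The sketch below is plain arithmetic, not an Accelerate API: the 7B parameter count is hypothetical, and it assumes every weight is stored in the storage dtype, so it gives an upper bound on the savings when some layers are left uncast.

```python
# Bytes per element for the dtypes used above.
BYTES_PER_ELEMENT = {"float8_e4m3fn": 1, "bfloat16": 2, "float32": 4}


def weight_memory_gib(num_params, dtype):
    """Memory needed to store num_params weights in dtype, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1024**3


num_params = 7_000_000_000  # hypothetical 7B-parameter model

stored_fp8 = weight_memory_gib(num_params, "float8_e4m3fn")
stored_bf16 = weight_memory_gib(num_params, "bfloat16")

print(f"fp8 storage:  {stored_fp8:.1f} GiB")   # 6.5 GiB
print(f"bf16 storage: {stored_bf16:.1f} GiB")  # 13.0 GiB
```

Weights are upcast to the compute dtype one layer at a time, so peak memory stays close to the storage figure plus one layer's worth of upcast weights.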

Better FSDP2 support

This release includes numerous new features and bug fixes. Notably, we’ve added support for FULL_STATE_DICT, a widely used option in FSDP, now enabling .save_pretrained() in transformers to work with FSDP2 wrapped models. QLoRA training is now supported as well but more testing is needed. We have also resolved a backend issue related to parameter offloading to CPU. Additionally, a significant memory spike that occurred when cpu_ram_efficient_loading=True was enabled has been fixed. Several other minor improvements and fixes are also included—see the What’s Changed section for full details.

FULL_STATE_DICT has been enabled by @S1ro1 in #3527
QLoRA support by @winglian in #3546
set backend correctly for CUDA+FSDP2+cpu-offload by @SunMarc in #3574
memory spike fixed when using cpu_ram_efficient_loading=True by @S1ro1 in #3482

Better HPU support:

We have added documentation for Intel Gaudi hardware!
Support has been available since v1.5.0 through this PR.

Add the HPU into accelerate config by @yuanwu2017 in #3495
Add Gaudi doc by @regisss in #3537

Torch.compile breaking change for `dynamic` argument

We've updated the logic for setting self.dynamic to explicitly preserve None rather than defaulting to False when the USE_DYNAMIC environment variable is unset. This change aligns the behavior with the PyTorch documentation for torch.compile. Thanks to @yafshar for contributing this improvement in #3567.
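
The difference matters because torch.compile treats dynamic as a three-state option: None lets PyTorch detect dynamic shapes automatically, False disables dynamic shape tracing, and True tries to trace dynamically from the start. The sketch below shows one way such an environment variable could be parsed while keeping the unset case distinct. parse_dynamic is a hypothetical helper; the variable name is taken from the text above and the accepted spellings are assumptions, not the library's exact parsing.

```python
def parse_dynamic(env, name="USE_DYNAMIC"):
    """Return True/False if the variable is set, or None if it is unset.

    Hypothetical helper: the variable name and accepted spellings are
    assumptions for illustration, not the library's exact parsing.
    """
    raw = env.get(name)
    if raw is None:
        return None  # unset: leave the decision to torch.compile
    return raw.strip().lower() in {"1", "true", "yes", "on"}


print(parse_dynamic({}))                       # None
print(parse_dynamic({"USE_DYNAMIC": "true"}))  # True
print(parse_dynamic({"USE_DYNAMIC": "0"}))     # False
```

Collapsing the unset case to False, as the previous behavior did, would silently turn off automatic dynamic-shape detection for users who never touched the setting.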

What's Changed

use device agnostic torch.OutOfMemoryError from pytorch 2.5.0 by @yao-matrix in #3475
Adds style bot by @zach-huggingface in #3478
Fix a tiny typo in low_precision_training guide by @sadra-barikbin in #3488
Fix check_tied_parameters_in_config for multimodal models by @SunMarc in #3479
Don't create new param for TorchAO sequential offloading due to weak BC guarantees by @a-r-r-o-w in #3444
add support for custom function for reducing the batch size by @winglian in #3071
Fix fp8 deepspeed config by @SunMarc in #3492
fix warning error by @faaany in #3491
[bug] unsafe_serialization option in "merge-weights" doesn't work by @cyr0930 in #3496
Add the HPU into accelerate config by @yuanwu2017 in #3495
Use torch.distributed.checkpoint.state_dict.set_model_state_dict in load_checkpoint_in_model by @ringohoffman in #3432
nit: needed sanity checks for fsdp2 by @kmehant in #3499
(Part 1) fix: make TP training compatible with new transformers by @kmehant in #3457
Fix deepspeed tests by @S1ro1 in #3503
Add FP8 runners + tweak building FP8 image by @zach-huggingface in #3493
fix: apply torchfix to set weights_only=True by @bzhong-solink in #3497
Fix: require transformers version for tp tests by @S1ro1 in #3504
Remove deprecated PyTorch/XLA APIs by @zpcore in #3484
Fix cache issue by upgrading github actions version by @SunMarc in #3513
[Feat] Layerwise casting hook by @sayakpaul in #3427
Add torchao to FP8 error message by @jphme in #3514
Fix unwanted cuda init due to torchao by @SunMarc in #3530
Solve link error in internal_mechanism documentation (#3506) by @alvaro-mazcu in #3507
[FSDP2] Enable FULL_STATE_DICT by @S1ro1 in #3527
[FSDP2] Fix memory spike with cpu_ram_efficient_loading=True by @S1ro1 in #3482
[FSDP2] Issues in Wrap Policy and Mixed Precision by @jhliu17 in #3528
Fix logic in accelerator.prepare + IPEX for 2+ nn.Models and/or optim.Optimizers by @mariusarvinte in #3517
Update Docker builds to align with CI requirements by @matthewdouglas in #3532
Fix CI due to missing package by @SunMarc in #3535
Update big_modeling.md for layerwise casting by @sayakpaul in #3548
[FSDP2] Fix: "..." is not a buffer or a paremeter by @S1ro1 in
fix notebook_launcher for Colab TPU compatibility. by @BogdanDidenko in #3541
Fix typos by @omahs in #3549
Dynamo regional compilation by @IlyasMoutawwakil in #3529
add support for port 0 auto-selection in multi-GPU environments by @hellobiondi in #3501
Fix the issue where set_epoch does not take effect. by @hongjx175 in #3556
[FSDP2] Fix casting in _cast_and_contiguous by @dlvp in #3559
[FSDP] Make env var and dataclass flag consistent for cpu_ram_efficient_loading by @SumanthRH in #3307
canonicalize optimized names before fixing optimizer in fdsp2 by @pstjohn in #3560
[docs] update deepspeed config path by @faaany in #3561
preserve parameter keys when removing prefix by @mjkvaak-amd in #3564
Add Gaudi doc by @regisss in #3537
Update dynamic env handling to preserve None when USE_DYNAMIC is unset by @yafshar in #3567
add a synchronize call for xpu in _gpu_gather by @faaany in #3563
simplify model.to logic by @yao-matrix in #3562
tune env command output by @yao-matrix in #3570
Add regional compilation to cli tools and env vars by @IlyasMoutawwakil in #3572
reenable FSDP2+qlora support by @winglian in #3546
Fix prevent duplicate GPU usage in distributed processing by @ved1beta in #3526
set backend correctly for CUDA+FSDP2+cpu-offload by @SunMarc in #3574
enable test_dispatch_model_tied_weights_memory_with_nested_offload_cpu on xpu by @yao-matrix i...

Releases: huggingface/accelerate

v1.14.0: AMD ROCm support, FSDP2 hardening

v1.13.0: Neuron support, IPEX removal, and distributed training fixes

v1.12.0: Deepspeed Ulysses/ALST

v1.11.0: TE MXFP8, FP16/BF16 with MPS, Python 3.10

v1.10.1: Patchfix

v1.10.0: N-D Parallelism

v1.9.0: Trackio support, Model loading speedup, Minor distributed improvements

v1.8.1: Patchfix

v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation

v1.7.0: Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA
