Releases · huggingface/trl

@AmineDiro

Features

AsyncGRPO rollout worker now runs in a separate process

AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.

Architectural changes:

AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.
_AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.

Two correctness fixes shipped alongside (they would have conflicted otherwise). First, the aiohttp retry is broader (it now catches ClientPayloadError) and uses bounded exponential backoff. Second, all-NaN reward columns are now preserved: np.nansum was silently returning 0, giving unscorable completions a real advantage signal and pushing the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
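To illustrate the all-NaN pitfall, here is a minimal pure-Python sketch. The helper names (`naive_nansum`, `nan_preserving_sum`) are hypothetical and are not TRL functions; `naive_nansum` only mimics the behaviour of np.nansum described above, and the real fix operates on reward tensors rather than Python lists.

```python
import math


def naive_nansum(values):
    """Mimic np.nansum: NaN entries count as zero, so an all-NaN input gives 0.0."""
    return sum((v for v in values if not math.isnan(v)), 0.0)


def nan_preserving_sum(values):
    """Skip NaN entries, but keep NaN when no reward function scored the completion."""
    scored = [v for v in values if not math.isnan(v)]
    if not scored:
        # Nothing could score this completion: propagate NaN instead of a fake 0.0.
        return float("nan")
    return sum(scored, 0.0)


nan = float("nan")
unscorable = [nan, nan]        # every reward function returned NaN
partly_scored = [1.0, nan, 0.5]

print(naive_nansum(unscorable))        # looks like a real reward of zero
print(nan_preserving_sum(unscorable))  # stays NaN, so it can be excluded downstream
print(nan_preserving_sum(partly_scored))
```

Keeping the NaN lets later code exclude the completion from the advantage baseline instead of treating it as a scored failure.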

Note

reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
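A quick way to check the picklability requirement before launching a run is to round-trip the object through pickle, since Python's spawn start method sends objects to the child that way. The sketch below is an illustration only; `is_picklable`, `length_reward`, `inline_reward`, and `make_threshold_reward` are hypothetical names, not part of TRL.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def length_reward(completions, **kwargs):
    """Module-level function: pickled by reference, so it can cross a process boundary."""
    return [float(len(c)) for c in completions]


# A lambda has no importable name, so pickle cannot serialize it.
inline_reward = lambda completions, **kwargs: [0.0 for _ in completions]


def make_threshold_reward(threshold):
    """Closures defined inside another function are not picklable either."""
    def reward(completions, **kwargs):
        return [float(len(c) > threshold) for c in completions]
    return reward


print(is_picklable(inline_reward))
print(is_picklable(make_threshold_reward(10)))
```

If a reward needs configuration, a module-level function wrapped in functools.partial, or a small class with `__call__`, is usually a picklable alternative to a closure.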

by @AmineDiro in #5749

New experimental A2PO trainer (Optimal Advantage Regression)

A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.

from trl.experimental.a2po import A2POConfig, A2POTrainer
from trl.rewards import accuracy_reward

trainer = A2POTrainer(
    model="Qwen/Qwen3-4B",
    args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
    train_dataset=dataset,
    reward_funcs=accuracy_reward,
)
trainer.train()

Designed for binary verifiable rewards (math/code), not open-ended problems.
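The two stages can be sketched numerically as follows. This is a hedged illustration, not the trainer's code: the log-mean-exp estimator for V* and the temperature name `beta1` are assumptions based on a reading of the paper, and all function names are hypothetical.

```python
import math


def estimate_v_star(ref_rewards, beta1):
    """Stage 1 (assumed form): V* ~ beta1 * log(mean(exp(r / beta1))) over reference samples."""
    scaled = [r / beta1 for r in ref_rewards]
    m = max(scaled)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta1 * (m + math.log(mean_exp))


def all_incorrect(ref_rewards):
    """Mirror of the filter_all_incorrect idea: every reference completion failed."""
    return all(r == 0 for r in ref_rewards)


def a2po_loss(logp_policy, logp_ref, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi / pi_ref) and the advantage r - V*."""
    residual = beta2 * (logp_policy - logp_ref) - (reward - v_star)
    return residual ** 2


v_star = estimate_v_star([1.0, 0.0, 0.0, 1.0], beta1=0.5)
print(v_star)
print(a2po_loss(logp_policy=-3.0, logp_ref=-3.0, reward=1.0, v_star=v_star, beta2=0.1))
```

With binary rewards, V* stays between 0 and 1, so a correct completion gets a positive regression target and an incorrect one a negative target, without any group statistics.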

by @raghulchandramouli in #5940

KTO now supports VLMs + big alignment push

The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.

from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=KTOConfig(...),
    train_dataset=vision_kto_dataset,
)

VLM support: by @albertvillanova in #5939. Plus ~20 alignment PRs all by @albertvillanova: #5820, #5849, #5852, #5850, #5866, #5864, #5856, #5872, #5875, #5900, #5901, #5899, #5906, #5909, #5914, #5982, #5936, #5996, #5998, #5999.

Cross-tokenizer alignment in GOLD via byte offsets

The GOLD distillation trainer used to align student/teacher tokens by extending two decoded strings and flushing on equality. It silently broke on any byte-level disagreement — including the common case of one tokenizer prepending BOS while the other doesn't (Llama-3 ↔ Qwen-3). The X-Token paper called this out by name.

Each side now carries (start_byte, end_byte) spans derived once from the fast tokenizer's char offsets, and the walker syncs on cumulative byte boundaries. On the on-policy path, spans come from piece_byte_len over the sampled token ids (not from re-encoding the decoded completion — BPE makes that round-trip non-injective).

Two related fixes shipped: long rows no longer lose the completion (now keeping the last max_length tokens), and the vLLM on-policy original_prompt_text is now decoded from the truncated ids the student actually consumed.
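The byte-boundary walk can be sketched in a few lines. This is a simplified illustration under stated assumptions: token pieces are given as plain strings, a BOS token is modelled as an empty piece, and the names `byte_spans` and `align_by_bytes` are hypothetical rather than the trainer's internals.

```python
def byte_spans(pieces):
    """Turn token pieces into (start_byte, end_byte) spans over the UTF-8 text."""
    spans, pos = [], 0
    for piece in pieces:
        n = len(piece.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n
    return spans


def align_by_bytes(student_spans, teacher_spans):
    """Group token indices from both sides whenever their cumulative end bytes coincide."""
    groups = []
    i = j = 0            # current token on each side
    s_start = t_start = 0  # first token of the group being built
    while i < len(student_spans) and j < len(teacher_spans):
        s_end = student_spans[i][1]
        t_end = teacher_spans[j][1]
        if s_end == t_end:
            # Both sides have consumed the same number of bytes: close the group.
            groups.append((list(range(s_start, i + 1)), list(range(t_start, j + 1))))
            i += 1
            j += 1
            s_start, t_start = i, j
        elif s_end < t_end:
            i += 1  # student is behind, extend its side
        else:
            j += 1  # teacher is behind, extend its side
    return groups


student = byte_spans(["", "Hel", "lo", " wor", "ld"])  # "" stands in for a BOS token
teacher = byte_spans(["Hello", " ", "world"])
print(align_by_bytes(student, teacher))
```

Because the walk compares byte positions rather than decoded strings, a zero-width BOS on one side simply joins the first group instead of breaking the alignment.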

by @kashif in #5885

SDFT / SDPO: live teacher logprobs from the vLLM server

When teacher_model_kind="live" and vllm_mode="server", the vLLM generation server already holds the current student weights (synced every step for rollouts). The new use_teacher_server=True flag scores the teacher's log-probs on that same server instead of running a separate local teacher forward — removing the teacher from the training step entirely.

Supported modes: sampled_token (reverse KL on the realized token) and topk_logits. When buffered batches are reused across steps (num_iterations > 1), weights are re-synced before scoring so the teacher never scores with stale weights.

by @kashif in #5989

Bidirectional masked importance sampling (MIS) for IcePop

vLLM importance sampling in GRPO now uses a two-sided band [C_min, C_max] instead of a single upper cap, aligning TIS/MIS with IcePop's bidirectional handling of train–inference ratio outliers.

from trl import GRPOConfig

config = GRPOConfig(
    vllm_importance_sampling_clip_min=0.5,
    vllm_importance_sampling_clip_max=2.0,
    vllm_importance_sampling_correction="mask",  # or "truncate"
)

The old vllm_importance_sampling_cap is deprecated and maps to vllm_importance_sampling_clip_max.
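The difference between the two correction modes can be sketched per token as below. This is an assumption-labelled illustration: it takes "mask" to mean zeroing the weight outside the band and "truncate" to mean clamping the ratio to the band, and `importance_weight` is a hypothetical helper, not a TRL function.

```python
import math


def importance_weight(train_logp, infer_logp, clip_min, clip_max, mode="mask"):
    """Per-token train/inference importance weight with a two-sided band."""
    ratio = math.exp(train_logp - infer_logp)
    if mode == "mask":
        # Assumed behaviour: tokens outside [clip_min, clip_max] contribute nothing.
        return ratio if clip_min <= ratio <= clip_max else 0.0
    if mode == "truncate":
        # Assumed behaviour: outliers are clamped to the nearest band edge.
        return min(max(ratio, clip_min), clip_max)
    raise ValueError(f"unknown mode: {mode!r}")


for delta in (-2.0, 0.0, 2.0):  # log-prob gap between trainer and vLLM
    print(delta,
          importance_weight(delta, 0.0, 0.5, 2.0, mode="mask"),
          importance_weight(delta, 0.0, 0.5, 2.0, mode="truncate"))
```

A single upper cap only handles the case where the trainer assigns much higher probability than the sampler; the lower edge handles the opposite outlier.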

by @casinca in #4732

NemotronH and Nemotron 3 Ultra support

Day-zero training support for NVIDIA's new model families.

NemotronH integration by @qgallouedec in #5938
Nemotron 3 Ultra support by @qgallouedec in #5942
Enable gradient checkpointing in Nemotron 3 SFT example by @sergiopaniego in #5944

Even more training chat templates

Three more model families with {% generation %} markers (assistant-only loss out of the box):

Qwen2.5-VL by @aazizyan in #5838
Qwen2-VL by @aazizyan in #5839
Llava-Next by @aazizyan in #5959

Distributed backend boilerplate, hidden

A new trl/distributed.py introduces a single DistributedBackend class that detects ZeRO stage and FSDP version once, then exposes two context managers (gather_params, summon_full_params) used everywhere. Replaces the scattered getattr(state, "fsdp_plugin", None) / gather_if_zero3 / summon_full_params if ... else nullcontext() boilerplate spread across vllm_generation.py, models/utils.py, and the main trainers. Future deprecations land in one place.

by @albertvillanova in #6000

Decoupled self-distillation trainers

A two-PR refactor that disentangles SDPO, SDFT, and other self-distillation trainers from their shared base, making each one self-contained and consistent with the rest of the codebase.

by @LeonEricsson in #5862 and #5883

Heads-up: SFT default `loss_type` will change in 1.7

Setting SFTConfig.loss_type is now optional, and leaving it unset emits a FutureWarning: in TRL 1.7 the default will switch from "nll" to "chunked_nll". No action needed — you'll just get the new default automatically on upgrade — unless you want to pin the current behavior (e.g. for custom models) with loss_type="nll".

by @qgallouedec in #5997

Other

Support 'None' as CLI value for Optional[T] fields by @qgallouedec in #5843
Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in #5857
SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in #5897
Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in #5926
Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in #5930
Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in #5902
Support multimodal config in PPO ValueHead by @albertvillanova in #5907
Support vision datasets for Liger in DPO by @albertvillanova in #5943
Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in #5867
🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851
Update vLLM version support to 0.19.0 by @sergiopaniego in #5879
Improve error message when image tokens are truncated by max_length by @lxk8998 in #5927
Padding-free invariance test by @qgallouedec in #5842
Per-field invariance tolerances, calibrated by @qgallouedec in #5844

Fixes

Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in #5873
Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in #5891
async_grpo: don't return on queue.Empty by @AmineDiro in https://github.c...

@qgallouedec

What's Changed

🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851

Full Changelog: v1.5.0...v1.5.1

@DagaBhai

Features

Even more training chat templates

Three more model families gain training-compatible templates with {% generation %} markers (so assistant_only_loss=True just works):

Phi-3.5 by @DagaBhai in #5746
Qwen3-VL by @aazizyan in #5764
Qwen3.5 Think / NoThink by @aazizyan in #5824

Final logits softcapping for async GRPO

The chunked LM-head path used by AsyncGRPOTrainer now supports models that use final_logit_softcapping (notably Gemma 2). _ChunkedLogProbFunction applies logit_scale, optional tanh-based softcapping, and temperature consistently in both forward and backward — softcapped models are no longer rejected.
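For intuition, the per-logit transform and its derivative can be written out as scalars. This is a sketch under assumptions: the order (scale, then softcap, then temperature) follows the sentence above, the softcap form cap * tanh(z / cap) is the Gemma 2 convention, and both function names are hypothetical.

```python
import math


def transform_logit(raw, logit_scale=1.0, softcap=None, temperature=1.0):
    """Forward: scale, optionally softcap with tanh, then divide by temperature."""
    z = raw * logit_scale
    if softcap is not None:
        z = softcap * math.tanh(z / softcap)
    return z / temperature


def transform_logit_grad(raw, logit_scale=1.0, softcap=None, temperature=1.0):
    """Backward: derivative of transform_logit with respect to the raw logit."""
    grad = logit_scale
    if softcap is not None:
        t = math.tanh(raw * logit_scale / softcap)
        grad *= 1.0 - t * t  # d/dz [cap * tanh(z / cap)] = 1 - tanh^2(z / cap)
    return grad / temperature


x = 5.0
eps = 1e-5
numeric = (transform_logit(x + eps, softcap=30.0, temperature=0.7)
           - transform_logit(x - eps, softcap=30.0, temperature=0.7)) / (2 * eps)
print(numeric, transform_logit_grad(x, softcap=30.0, temperature=0.7))
```

The finite-difference check at the end is the kind of consistency test that matters here: if forward and backward apply the three steps in different orders, the two numbers diverge.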

by @mlarnouhet in #5691

KTO ↔ DPO alignment continues

Two more cycles closer to KTO graduation:

Align compute_loss flow by @albertvillanova in #5810
Align _compute_loss_liger flow by @albertvillanova in #5816

Trainer telemetry (opt-out)

_BaseTrainer.__init__ now emits a single anonymous huggingface_hub.send_telemetry ping per trainer instantiation, so we can finally see which trainers / model families / distributed backends are actually being used in practice and prioritize accordingly.

The payload is intentionally minimal: TRL version, trainer class name, model architecture, PEFT yes/no, distributed backend (deepspeed/fsdp/ddp/none), bucketed world size, device type, and GPU model when available. It contains no user data, dataset names, model paths, or hyperparameter values, and nothing is sent in CI, in offline mode, or when HF_HUB_DISABLE_TELEMETRY is set.

See usage_stats.md for what's collected and how to opt out.

by @qgallouedec in #5758

Other

OpenRewardSpec: fix omitting task-scoped tools during rollout binding (fixes #5727) by @rycerzes in #5729
Add OpenReward example to the list of examples by @sergiopaniego in #5752
Add DDP-2 members to invariant test suite by @qgallouedec in #5736
Align and simplify the stable training scripts by @qgallouedec in #5812
Replace uv installation script with setup action by @qgallouedec in #5735

Fixes

Fix exponential backtracking in qwen3 / qwen3_5 / glm4moe response parsing — GRPOTrainer was hanging indefinitely on truncated <tool_call> blocks (a degenerate case that happens naturally when generation hits max_completion_length mid-tool-call). Rewrote the regex to be non-backtracking — worst case goes from O(2ⁿ) to O(n). By @xodn348 in #5798
CUDA memory leak: release BNB dequantization buffers & stale state in OffloadActivations — follow-up to v1.4's activation-offloading leak fix. By @butterwecksolutions in #5730
Invalidate ZeRO-3 param coordinator trace in add_hooks by @roycho96 in #4693
Fix nested vocab_size for DistillationTrainer and GOLDTrainer by @Beichen-Ma in #5592
Fix MPS support in experimental empty_cache() by @jamie-peterson-ml in #5799
Fix metric_for_best_model for trainer-specific eval metrics by @qgallouedec in #5811
Fix generate_batch: inference tensors blocking inplace ops in background thread by @albertvillanova in #5818
Replace deprecated torch_dtype with dtype across examples, docs, notebooks, tests, and experimental distillation / gold trainers by @qgallouedec in #5717
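As an illustration of the truncated <tool_call> case from the response-parsing fix above, the sketch below extracts tool calls in a single linear pass. Note the swap: it uses plain string scanning instead of a regex, because the rewritten pattern itself is not shown in these notes. `extract_tool_calls` is a hypothetical helper, not the TRL parser.

```python
import json

OPEN_TAG, CLOSE_TAG = "<tool_call>", "</tool_call>"


def extract_tool_calls(text):
    """Collect JSON payloads of complete <tool_call> blocks, ignoring a truncated tail."""
    calls, pos = [], 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            break
        end = text.find(CLOSE_TAG, start + len(OPEN_TAG))
        if end == -1:
            # Generation stopped mid-block: give up instead of retrying other splits.
            break
        payload = text[start + len(OPEN_TAG):end].strip()
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # malformed payload: skip this block
        pos = end + len(CLOSE_TAG)  # never rescan text that was already consumed
    return calls


complete = '<tool_call>{"name": "multiply", "arguments": {"a": 2, "b": 3}}</tool_call>'
truncated = '<tool_call>{"name": "multiply", "arguments": {"a": 2' + " " * 10000
print(extract_tool_calls(complete))
print(extract_tool_calls(truncated))
```

Each character is visited a bounded number of times because the scan position only moves forward, which is the property a non-backtracking pattern also provides.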

Documentation and Examples

docs(grpo): align model to Qwen2.5 and add GRPO OOM tab in quickstart by @xodn348 in #5740

CI

Migrate tests to Qwen3.5 Think/NoThink fixtures + tiny-model generation scripts by @aazizyan in #5819 and #5821
Align tiny Glm4MoeForCausalLM / Cohere / Cohere2 / Qwen2.5-VL configs with their reference models by @qgallouedec in #5638, #5706, #5707 and #5739
Fix tiny Qwen3-VL deepstack_visual_indexes and drop the test skip by @qgallouedec in #5779
Fix tiny Qwen2.5-VL fullatt_block_indexes out of range for depth=2 by @albertvillanova in #5805
Remove non-existent params from tiny Qwen2-VL model by @albertvillanova in #5795
Fix vision config num_heads key in Qwen VL tiny model scripts by @matdou in #5792
Drop unjustified model.visual. skip in GRPO/RLOO Qwen2.5-VL tests by @qgallouedec in #5780
Make the LLaVA / LLaVA-Next test guard explicit by @qgallouedec in #5778
Remove obsolete Gemma 3 vision-head guard from VLM training tests by @qgallouedec in #5772
Fix OOM in CI: reduce batch size in VLM SFT / GRPO/RLOO VLM / toolcall tests by @albertvillanova in #5687, #5767, #5801
Fix OOM in CI by clearing chained exception tracebacks by @albertvillanova in #5776
Fix OOM in CI by reducing intermediate_size and image token budget for tiny Gemma 4 by @albertvillanova in #5760
Fix CI errors in response parsing for gpt-oss/llama with transformers v5 by @albertvillanova in #5755
Fix CI AttributeError: 'GptOssConfig' object has no attribute 'num_experts' by @albertvillanova in #5756
Fix CI apply_model_revisions by removing _commit_hash kwarg by @albertvillanova in #5762
Fix CI test to avoid skipping model.visual params by @albertvillanova in #5806
Fix transformers min version for tiny gemma 4 as 5.5.0 by @albertvillanova in #5763
Hotfix CI: pin torch < 2.12.0 (later reverted) by @albertvillanova in #5769
Fix catch-all empty string in Makefile pytest --only-rerun by @albertvillanova in #5784
chore: update tests_latest.yml by @hf-security-analysis[bot] in #5733

New Contributors

@hf-security-analysis[bot] made their first contribution in #5733
@Beichen-Ma made their first contribution in #5592
@DagaBhai made their first contribution in #5746
@xodn348 made their first contribution in #5740
@mlarnouhet made their first contribution in #5691
@matdou made their first contribution in #5792
@jamie-peterson-ml made their first contribution in #5799
@rycerzes made their first contribution in #5729

What's Changed

⬆️ Bump dev version by @qgallouedec in #5734
chore: update tests_latest.yml by @hf-security-analysis[bot] in #5733
fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations by @butterwecksolutions in #5730
fix: invalidate ZeRO-3 param coordinator trace in add_hooks by @roycho96 in #4693
Fix nested vocab_size for DistillationTrainer and GOLDTrainer by @Beichen-Ma in #5592
feat: add Phi-3.5 training chat templates with generation markers by @DagaBhai in #5746
docs(grpo): align model to Qwen2.5 and add GRPO OOM tab in quickstart by @xodn348 in #5740
torch_dtype -> dtype by @qgallouedec in #5717
Add OpenReward example to the list of examples by @sergiopaniego in #5752
Fix CI errors in response parsing for gptoss/llama with transformers v5 by @albertvillanova in #5755
Add DDP-2 members to invariant test suite by @qgallouedec in #5736
Hotfix CI param not updated AssertionError: Pin torch < 2.12.0 by @albertvillanova in #5769
Align tiny-Glm4MoeForCausalLM with GLM-4.5 reference config by @qgallouedec in #5638
Align tiny Cohere config with aya-expanse-8b by @qgallouedec in #5706
Align tiny Cohere2 config with tiny-aya-earth by @qgallouedec in #5707
Fix OOM in CI by reducing intermediate_size and image token budget for tiny Gemma4 by @albertvilla...

@qgallouedec

Features

Chunked cross-entropy loss for SFT (up to –50% VRAM)

A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Ignored-label tokens are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, the sweet spot consistent across model sizes and sequence lengths).

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(loss_type="chunked_nll"),
    train_dataset=dataset,
)
trainer.train()
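The idea behind the chunked path can be shown with a small pure-Python sketch. It only illustrates the arithmetic (drop ignored tokens first, then accumulate the loss chunk by chunk); it does not reproduce the memory savings, which come from checkpointing tensor chunks. The function names are hypothetical, and labels are assumed to be already shifted to line up with the hidden states.

```python
import math

IGNORE_INDEX = -100


def _token_nll(hidden, lm_head, label):
    """Negative log-likelihood of one token: logsumexp(logits) - logits[label]."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def chunked_nll(hidden_states, labels, lm_head, chunk_size=256):
    """Mean NLL over non-ignored tokens, computed one chunk at a time."""
    # Ignored tokens are dropped before any logits are computed.
    kept = [(h, y) for h, y in zip(hidden_states, labels) if y != IGNORE_INDEX]
    if not kept:
        return 0.0
    total = 0.0
    for start in range(0, len(kept), chunk_size):
        chunk = kept[start:start + chunk_size]
        # Only this chunk's logits exist at any one time.
        total += sum(_token_nll(h, lm_head, y) for h, y in chunk)
    return total / len(kept)


lm_head = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]          # vocab of 3, hidden size 2
hidden_states = [[1.0, 0.5], [0.2, -1.0], [0.0, 0.3], [2.0, 1.0], [-1.0, 0.7]]
labels = [IGNORE_INDEX, 2, 0, IGNORE_INDEX, 1]
print(chunked_nll(hidden_states, labels, lm_head, chunk_size=2))
```

The result is independent of `chunk_size`; the chunk size only trades peak memory against the number of matmuls.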

Peak GPU memory, AdamW fp32:

| Model | Hardware | Seq | `nll` | `chunked_nll` |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |

End-to-end, chunked NLL is consistently as fast or faster than nll — and it unlocks sequence lengths that don't fit at all under the standard path.

The chunked path also supports VLMs (#5684).

by @qgallouedec in #5575, #5676 and #5684

OpenReward Standard environment adapter (experimental)

A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv

env = OpenRewardEnv("Eigent/SETA")  # or "http://localhost:8000"

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(...),
    train_dataset=env.dataset,
    environment_factory=env.factory,
    reward_funcs=env.reward_func,
)

Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset auto-derives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally on localhost. A SETA training example is included.

by @adithya-s-k in #5696

Training-invariance test suite

Unit tests don't catch trainer-level numerical drift such as gradient-accumulation normalization bugs or attention-implementation divergence (eager ↔ FA2 / kernels). These silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)

A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. pdb=1, gas=8 ≡ default; eager ≡ FA2 ≡ kernels). Hardware-pinned to H100 80GB, real pretrained model, full_determinism, fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.

by @qgallouedec in #5686, #5688 and #5689

MFU helpers

Three new pure helpers in trl.trainer.utils for measuring training efficiency:

compute_flops_per_token(config, seq_len) — handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
compute_mfu(flops_per_token, tps, world_size, peak_flops) — Model FLOPs Utilization as a percentage
adjusted_mfu(mfu, config, seq_len) — non-causal → causal-corrected (Llama / DS Ulysses convention)
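For orientation, the MFU arithmetic behind the helpers listed above can be sketched as below. This is not the TRL implementation: `mfu_percent` is a hypothetical name, and the sketch assumes that `tps` is the global tokens-per-second across all ranks and that `peak_flops` is the per-device peak.

```python
def mfu_percent(flops_per_token, tps, world_size, peak_flops):
    """Model FLOPs Utilization: achieved FLOP/s divided by aggregate peak FLOP/s, in percent."""
    achieved = flops_per_token * tps          # FLOP/s actually spent on the model
    available = world_size * peak_flops       # FLOP/s the hardware could deliver
    return 100.0 * achieved / available


# Hypothetical numbers: 6 GFLOPs per token, 100k tokens/s on 8 devices of 1 PFLOP/s each.
print(mfu_percent(flops_per_token=6e9, tps=1e5, world_size=8, peak_flops=1e15))
```

If the measured throughput is per device instead, the `world_size` factor drops out of the denominator, so it is worth checking which convention a given log uses.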

by @AmineDiro in #5698

GRPO Liger kernel update (Liger 0.8.0)

GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.

by @kashif in #5690

Length-normalized DPO sigmoid loss

A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.

from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B",
    args=DPOConfig(loss_type="sigmoid_norm"),
    train_dataset=dataset,
)
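The length normalization can be sketched on scalar sequence log-probabilities. This is an illustration based on a reading of the Tülu 3 formulation (each log-ratio divided by its completion length before the sigmoid); the function name and argument names are hypothetical.

```python
import math


def sigmoid_norm_dpo_loss(policy_chosen_logp, ref_chosen_logp,
                          policy_rejected_logp, ref_rejected_logp,
                          chosen_len, rejected_len, beta=0.1):
    """-log(sigmoid(beta * (per-token chosen log-ratio - per-token rejected log-ratio)))."""
    chosen = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen - rejected)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Same per-token log-ratio, different lengths: the loss is unchanged.
short = sigmoid_norm_dpo_loss(-10.0, -12.0, -9.0, -8.0, chosen_len=5, rejected_len=5)
long_ = sigmoid_norm_dpo_loss(-20.0, -24.0, -18.0, -16.0, chosen_len=10, rejected_len=10)
print(short, long_)
```

With the unnormalized sigmoid loss, the longer pair would produce twice the margin, which is the length bias the normalization is meant to remove.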

by @BrownianNotion in #5406

Even more training chat templates

More model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing):

Cohere training template by @dschulmeist in #5627
Cohere2 {% generation %} markers by @qgallouedec in #5675
Gemma 3 training template by @hwanython in #5685
Qwen3-2507 training template by @SwayamInSync in #5574
Qwen2.5 response schema by @aazizyan in #5728

get_training_chat_template now also accepts a processor (not just a tokenizer) — useful for VLMs (#5560).

KTO ↔ DPO alignment: closing in on graduation

Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.

PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.

Other

Reject parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO — fail fast at config init with a clear error instead of mid-training crash. By @kashif in #5699
Fail early for unsupported PEFT + Liger Kernel in DPO by @albertvillanova in #5709
Explicitly set model_accepts_loss_kwargs=False in DPO and Reward by @albertvillanova in #5710
Set _tokenizer attribute in experimental trainers by @albertvillanova in #5566
Simplify peft_config handling in core / experimental trainers by @albertvillanova in #5673 and #5674
Replace isinstance with is_peft_model / drop redundant is_peft_available by @albertvillanova in #5682 and #5683
Reduce inconsistency across trainer test files by @qgallouedec in #5678
Refactor tiny-model generation scripts by @qgallouedec in #5637
Revert VLM support in parse_response by @qgallouedec in #5561

Fixes

5 GB+ CUDA memory leak in activation offloading — OffloadActivations.__exit__ now syncs the compute/offload streams and clears the stash dictionaries, preventing orphaned offload tensors from leaking onto a dead stream (~0.2 GiB/step accumulation observed during QLoRA vision training before the fix). By @butterwecksolutions in #5694 and #5700
Fix reverse-KL server path NaN on variable completion length in DistillationTrainer by @k1064190 in #5594
GKDTrainer: fix return_outputs in the Liger kernel path by @roycho96 in #4688
GKDTrainer: fix seq-KD wasted teacher forward by @roycho96 in #5726
GKDTrainer: fix Liger fused JSD path computing wrong loss by @roycho96 in #5731
Fix missing PEFT validation when passing peft_config to core / experimental trainers by @albertvillanova in #5664 and #5665
Fix peft_config type hint in experimental trainers by @albertvillanova in #5666
Fix discarded assertion message in trainer parameter checks by @qgallouedec in #5677
Fix typo in model name in README by @qgallouedec in #5711

Documentation and Examples

Upload testing suite for DistillationTrainer by @cmpatino in #5615

CI

Fix OOM in CI by reducing batch size in VLM SFT tests by @albertvillanova in #5687
Fix OOM in CI by reducing image size of tiny Gemma 3 model by @albertvillanova in htt...

@qgallouedec

Features

Qwen 3.6 integration


TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but ships a slightly different chat template (adds a preserve_thinking flag, tweaks tool-arg stringification), so exact-string template matching needed updates across the stack.

What landed:

Chat templates: qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving + {% generation %} markers for assistant_only_loss=True)
Response schema: routes to the existing qwen3_5_schema for tool-call parsing — output format unchanged
Tiny test models for VLM training: tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
Test matrix updated across SFT/DPO/GRPO/RLOO test_(train|training)_vlm cases

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),  # works out of the box
    train_dataset=dataset,
)
trainer.train()

Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:

from trl import GRPOConfig, GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b

trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",
    reward_funcs=my_reward_fn,
    args=GRPOConfig(...),
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()

by @qgallouedec in #5642

New experimental TPO trainer

A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.

from datasets import load_dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

trainer = TPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
    train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()

by @kashif in #5506

Speculative decoding in `trl vllm-serve`

A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve — works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc. — without forking the serve script.

# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'

# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
    --speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'

by @Ofir408 in #5605

KTO ↔ DPO alignment: nearing the finish line

Twelve more alignment PRs this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset. The goal — promoting KTO out of experimental and into stable — is now within reach for an upcoming release.

PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635

More `{% generation %}` training chat templates

Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box:

Gemma / Gemma 2 by @ps-abhi in #5523
Phi-3 by @RudrenduPaul in #5526
GLM-4-MoE by @casinca in #5519

Other

Support processor in maybe_apply_chat_template by @albertvillanova in #5567
Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
Check prefix preservation at the token level (not string level) by @qgallouedec in #5559
Drop vLLM 0.11 support by @qgallouedec in #5549
Remove forward_masked_logits by @qgallouedec in #5626
Remove dead token attributes from experimental trainers by @albertvillanova in #5565
Set _tokenizer as trainer attribute by @albertvillanova in #5489
Use PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in #5629
Renaming of internal variables: async_reward_X to async_X by @qgallouedec in #5616

Fixes

Fix entropy calculation in SFT — three bugs at once: misaligned by one position (next-token shift), averaged over the wrong tokens (used attention_mask instead of label != -100), and wrong cross-rank aggregation (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct. Same fix applied to DPO entropy logging. By @qgallouedec in #5620
Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in #5538
Fix generate_tiny_models for gpt-oss by @albertvillanova in #5622
Fix docstring style in vllm-serve script by @albertvillanova in #5628
Replace wrong comment about chat template with EOS by @albertvillanova in #5607
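The three entropy bugs fixed above (next-token shift, label-based masking, weighted cross-rank aggregation) can be illustrated with a small sketch. The function names are hypothetical and the per-rank reduction is simulated with a plain list; this is not the trainer's code.

```python
IGNORE_INDEX = -100


def local_entropy_stats(token_entropies, labels):
    """Per-rank (sum, count) of entropies over positions whose next-token label is trained on."""
    # The distribution at position t predicts token t + 1, so shift by one.
    predictions = token_entropies[:-1]
    targets = labels[1:]
    total = sum(e for e, y in zip(predictions, targets) if y != IGNORE_INDEX)
    count = sum(1 for y in targets if y != IGNORE_INDEX)
    return total, count


def global_mean_entropy(per_rank_stats):
    """Aggregate as sum / count across ranks rather than averaging per-rank means."""
    total = sum(t for t, _ in per_rank_stats)
    count = sum(c for _, c in per_rank_stats)
    return total / count if count else 0.0


rank0 = local_entropy_stats([1.0, 1.0, 1.0, 1.0], [IGNORE_INDEX, 5, 6, 7])  # 3 trained tokens
rank1 = local_entropy_stats([9.0, 4.0, 9.0], [IGNORE_INDEX, IGNORE_INDEX, 8])  # 1 trained token
print(global_mean_entropy([rank0, rank1]))
```

Averaging the two per-rank means here would give 2.5, while the token-weighted value is 1.75, which is why the unweighted mean misreports entropy when ranks hold different numbers of trained tokens.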

Documentation and Examples

Add chat templates page to web docs by @sergiopaniego in #5581
Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
Update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in #5618

CI

Add doc-builder style check to pre-commit and CI by @albertvillanova in #5630
Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in #5631
Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in #5634
Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546

New Contributors

@Ofir408 made their first contribution in #5605
@ps-abhi made their first contribution in #5523

What's Changed

⬆️ Bump dev version by @qgallouedec in #5577
Support processor in maybe_apply_chat_template by @albertvillanova in #5567
Remove dead token attributes from experimental trainers by @albertvillanova in #5565
Support VLM processors in is_chat_template_prefix_preserving by @qgallouedec in #5558
Align KTO with DPO: Align add_model_tags by @albertvillanova in #5582
Align KTO with DPO: Align processing_class initialization by @albertvillanova in #5578
Align KTO with DPO: Align _prepare_dataset by @albertvillanova in #5579
Align KTO with DPO: Align ref_model preparation for distributed training by @albertvillanova in #5583
Align KTO with DPO: Make conditional prompt extraction and unpairing in _prepare_dataset by @albertvillanova in #5587
Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in #5580
[docs] Add chat templates page to web docs by @sergiopaniego in #5581
Add additional model parameters to TestSupportsToolCalling for improved coverage by @qgallouedec in #5537
Fix CI with dev dependencies for Llava models by @albertvillanova in #5499
Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in #5546
Set _tokenizer as trainer attribute by @albertvillanova in #5489
Align KTO ...

@kashif

Features

New `SSDTrainer` — Simple Self-Distillation

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in #5505

Drop, don't truncate, overlong tool results in `GRPOTrainer`

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
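The snapshot-and-rollback pattern is roughly the following. This is a minimal sketch with hypothetical names (`append_tool_results`, `count_tokens`); the real trainer also tracks images and token ids, which are omitted here.

```python
def append_tool_results(messages, tool_messages, used_tokens, max_completion_length, count_tokens):
    """Append tool messages, or roll back to the snapshot if they overflow the budget.

    Returns (new_used_tokens, kept) where kept is False when the results were dropped.
    """
    snapshot = len(messages)          # remember where this iteration started
    messages.extend(tool_messages)
    added = sum(count_tokens(m) for m in tool_messages)
    if used_tokens + added > max_completion_length:
        del messages[snapshot:]       # drop everything added in this iteration
        return used_tokens, False
    return used_tokens + added, True


def count_words(message):
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(message["content"].split())


conversation = [{"role": "assistant", "content": "calling the search tool"}]
overlong = [{"role": "tool", "content": "word " * 50}]
print(append_tool_results(conversation, overlong, used_tokens=4,
                          max_completion_length=16, count_tokens=count_words))
print(len(conversation))
```

Because the rollback restores the list to its earlier length, no message is ever cut in the middle, which is what made the truncation path need image-boundary bookkeeping.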

by @qgallouedec in #5521

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in #5518
DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in #5527

As a result of tightened detection (see Fixes below), the list of templates reported as tool-calling capable is now correct. Notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.
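
One plausible way to catch templates that silently drop assistant tool calls is to render a turn containing a sentinel argument and check that the sentinel survives. The sketch below illustrates that idea only; it is not the implementation of supports_tool_calling. The helper `renders_tool_calls` and both toy templates are hypothetical, with plain Python functions standing in for Jinja chat templates.

```python
import json

SENTINEL = "SENTINEL_CITY"  # can only appear in the output if the tool call is rendered


def renders_tool_calls(render):
    """Return True if `render(messages)` keeps the assistant tool call in its output."""
    messages = [
        {"role": "user", "content": "What is the weather?"},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {"type": "function", "function": {"name": "get_weather", "arguments": {"city": SENTINEL}}}
            ],
        },
        {"role": "tool", "content": "sunny"},
    ]
    try:
        text = render(messages)
    except Exception:
        return False  # a template that cannot render the turn does not support it
    return SENTINEL in text


def content_only_template(messages):
    # Toy template that ignores `tool_calls` (the false-positive case).
    return "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)


def tool_aware_template(messages):
    # Toy template that also serializes assistant tool calls.
    parts = []
    for m in messages:
        calls = "".join(json.dumps(c["function"]) for c in m.get("tool_calls", []))
        parts.append(f"<{m['role']}>{m['content']}{calls}</{m['role']}>")
    return "".join(parts)
```

A check that only verifies the template renders without error would accept `content_only_template`; the sentinel check rejects it.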

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep (same initialization patterns, same config surface, same precompute behavior):

Add precompute_ref_batch_size to KTO (#5530)
Align ref_model initialization (#5534)
Align model initialization (#5533)
Support None args (#5531)
Remove generate_during_eval (#5551)
Remove model and ref adapter names (#5552)
Don't load ref_model when precompute_ref_log_probs is set in DPO/KTO (#5542)

All by @albertvillanova.

Other

Support messages with images in prepare_multimodal_messages by @albertvillanova in #5474
Simplify role handling in prepare_multimodal_messages by @albertvillanova in #5508
Update vLLM version support to 0.18.0 by @qgallouedec in #5547

Fixes

Fix supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in #5517
Fix add_response_schema for VLM processors — the schema was being set on the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of __init__/decode-gate workarounds. By @qgallouedec in #5520
Remove xfail condition for Gemma 4 response_schema regex bug by @qgallouedec in #5510
Remove unused dependencies for judges from dev requirements by @qgallouedec in #5515

Deprecations

Deprecate use_transformers_paged in GRPOConfig and RLOOConfig, and remove it entirely from the experimental OnlineDPOConfig, GOLDConfig, and SelfDistillationConfig. It will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it is also superseded by transformers continuous batching. By @qgallouedec in #5544

Documentation and Examples

Add example script section to experimental trainer docs by @sergiopaniego in #5543
[Docs] Fix formatting in SSD training example script by @kashif in #5548
Nits in SSD docs by @sergiopaniego in #5554
[docs] Add LLaMA 3 / Qwen 2.5 entries to chat_templates/README by @qgallouedec in #5545
Update CARLA VLM example scripts by @sergiopaniego in #5557

CI

Fix CI dependency installs to use a single resolve by @qgallouedec in #5513
Set upper transformers version to skip distributed test_rloo after fixed by @albertvillanova in #5535
Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4 by @albertvillanova in #5541
Bump doc-builder SHA for PR upload workflow by @rtrompier in #5553

What's Changed

⬆️ Bump dev version by @qgallouedec in #5525
Simplify role handling in prepare_multimodal_messages by @albertvillanova in #5508
Fix CI dependency installs to use a single resolve by @qgallouedec in #5513
Fix supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in #5517
feat: add DeepSeek-V3 training chat template with generation markers by @RudrenduPaul in #5527
Drop, don't truncate, overlong tool results in GRPOTrainer by @qgallouedec in #5521
Set upper transformers version to skip distributed test_rloo after fixed by @albertvillanova in #5535
Align KTO with DPO: Add precompute_ref_batch_size by @albertvillanova in #5530
Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4 by @albertvillanova in #5541
Align KTO with DPO: Align ref_model initialization by @albertvillanova in #5534
Align KTO with DPO: Align model initialization by @albertvillanova in #5533
Remove unused dependencies for judges from dev requirements by @qgallouedec in #5515
Remove xfail condition for Gemma4 response_schema regex bug by @qgallouedec in #5510
Align KTO with DPO: Support None args by @albertvillanova in #5531
Add example script section to experimental trainer docs by @sergiopaniego in #5543
[SSD] Added SSD trainer in experimental by @kashif in #5505
[Docs] Fix formatting in SSD training example script by @kashif in #5548
Don't load ref_model when precompute_ref_log_probs in DPO/KTO by @albertvillanova in #5542
chore: bump doc-builder SHA for PR upload workflow by @rtrompier in #5553
Nits in SSD docs by @sergiopaniego in #5554
Deprecate use_transformers_paged by @qgallouedec in #5544
Update vLLM version support to 0.18.0 by @qgallouedec in #5547
Align KTO with DPO: Remove generate_during_eval by @albertvillanova in #5551
Align KTO with DPO: Remove model and ref adapter names by @albertvillanova in #5552
Support messages with images in prepare_multimodal_messages by @albertvillanova in #5474
Update CARLA VLM example scripts by @sergiopaniego in #5557
Fix add_response_schema for VLM processors by @qgallouedec in #5520
[docs] Add LLaMA 3 / Qwen 2.5 entries to chat_templates/README by @qgallouedec in #5545
Add LLaMA 3.1 and 3.2 tool calling support by @qgallouedec in #5518
Release: v1.2 by @qgallouedec in #5576
...

@cmpatino

Features

`DistillationTrainer` for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on training GPUs, and binary-encoded logprob payloads that are ~5x smaller to transfer.

from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,                   # fully on-policy (student generates)
        beta=1.0,                    # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()

by @cmpatino in #5407, #5500 and #5501
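
To see why binary encoding shrinks logprob payloads, the sketch below compares a JSON list of floats with the same values packed as float32 and base64-encoded. The wire format here (little-endian float32 plus base64 inside JSON) and all function names are assumptions for illustration; the notes above do not specify the encoding DistillationTrainer actually uses, and the size ratio of this toy will differ from the quoted figure.

```python
import base64
import json
import struct


def encode_logprobs_json(logprobs):
    # Baseline: every float is written out as decimal text.
    return json.dumps({"logprobs": logprobs})


def encode_logprobs_binary(logprobs):
    # Pack as little-endian float32 (4 bytes each), then base64 so it fits in JSON.
    raw = struct.pack(f"<{len(logprobs)}f", *logprobs)
    return json.dumps({"n": len(logprobs), "logprobs_b64": base64.b64encode(raw).decode("ascii")})


def decode_logprobs_binary(payload):
    data = json.loads(payload)
    raw = base64.b64decode(data["logprobs_b64"])
    return list(struct.unpack(f"<{data['n']}f", raw))


logprobs = [-0.0123456789 * (i + 1) for i in range(1000)]
text_payload = encode_logprobs_json(logprobs)
binary_payload = encode_logprobs_binary(logprobs)
decoded = decode_logprobs_binary(binary_payload)
```

The trade-off is precision: float32 round-trips each value only approximately, which is usually acceptable for log-probabilities.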

Chunked LM head for memory-efficient log-prob computation in `AsyncGRPOTrainer`

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this cuts peak allocated memory by up to 44x on an 8192-token sequence and roughly halves wall time in the benchmark below:

| `chunk_lm_head_size` | Peak Alloc (GB) | Reduction | Wall Time (ms) |
| --- | --- | --- | --- |
| `None` (baseline) | 18.55 | 1.00x | 808.7 |
| `4096` | 0.42 | 44.32x | 459.0 |
| `8192` | 0.76 | 24.34x | 393.0 |

Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:

from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)

Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).

by @AmineDiro in #5349
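
The online logsumexp trick behind the chunked path can be sketched for a single token position in plain Python. This is a didactic version under stated assumptions, not the AsyncGRPOTrainer kernel: the function names are hypothetical, the LM head is a list of weight rows, and the real implementation operates on batched tensors.

```python
import math


def chunked_logprob_and_entropy(hidden, weight, target, chunk_size):
    """Log-prob of `target` and softmax entropy, visiting the vocabulary chunk by chunk."""
    running_max = -math.inf
    sum_exp = 0.0     # sum_j exp(z_j - running_max)
    sum_exp_z = 0.0   # sum_j exp(z_j - running_max) * z_j
    target_logit = None
    for start in range(0, len(weight), chunk_size):
        rows = weight[start:start + chunk_size]
        # Only this chunk's logits exist in memory at any time.
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in rows]
        if start <= target < start + len(rows):
            target_logit = logits[target - start]
        new_max = max(running_max, max(logits))
        scale = math.exp(running_max - new_max)  # rescale old sums to the new max
        sum_exp *= scale
        sum_exp_z *= scale
        for z in logits:
            e = math.exp(z - new_max)
            sum_exp += e
            sum_exp_z += e * z
        running_max = new_max
    log_z = running_max + math.log(sum_exp)
    # Entropy H = log Z - E_p[z], with E_p[z] = sum_exp_z / sum_exp.
    return target_logit - log_z, log_z - sum_exp_z / sum_exp


def full_logprob_and_entropy(hidden, weight, target):
    """Reference that materializes every logit at once."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weight]
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    entropy = -sum(math.exp(z - log_z) * (z - log_z) for z in logits)
    return logits[target] - log_z, entropy


hidden = [0.5, -1.0, 2.0]
weight = [[((i * 7 + j * 3) % 5 - 2) * 0.5 for j in range(3)] for i in range(10)]
reference = full_logprob_and_entropy(hidden, weight, target=4)
chunked = chunked_logprob_and_entropy(hidden, weight, target=4, chunk_size=3)
```

Peak memory scales with the chunk size rather than the vocabulary size, which is the source of the savings.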

`{% generation %}` support in training chat templates

SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5459, #5470, by @RudrenduPaul in #5493 and #5522, and by @casinca in #5484
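
What the generation markers ultimately provide is a per-token assistant mask, which is then used to exclude all other tokens from the loss. The toy sketch below emulates that outcome with a whitespace tokenizer and invented role tags. It does not use Jinja or the real templates, and the names `build_assistant_mask` and `IGNORE_INDEX` are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_assistant_mask(messages):
    """Tokenize a conversation and mark which tokens belong to assistant turns."""
    tokens, mask = [], []
    for message in messages:
        header = [f"<{message['role']}>"]
        body = message["content"].split()
        footer = ["<end>"]
        is_assistant = int(message["role"] == "assistant")
        tokens += header + body + footer
        # The role header is never trained on; the assistant body and end token are.
        mask += [0] * len(header) + [is_assistant] * (len(body) + len(footer))
    return tokens, mask


messages = [
    {"role": "user", "content": "What is 2+2 ?"},
    {"role": "assistant", "content": "It is 4"},
]
tokens, assistant_mask = build_assistant_mask(messages)

# Map tokens to toy ids, then hide every non-assistant token from the loss.
vocab = {}
input_ids = [vocab.setdefault(token, len(vocab)) for token in tokens]
labels = [tid if keep else IGNORE_INDEX for tid, keep in zip(input_ids, assistant_mask)]
```

With a template that lacks the markers, the mask cannot be computed, which is why SFTTrainer swaps in a patched template.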

Expanded tool-calling model support

Agent training now supports a broader family of models via native tool-call response schemas:

GPT-OSS (#5464)
GLM-4-MoE (#5463)
Qwen3-VL (#5469)
Gemma 4 — the first model to natively ship a response schema (#5454)

A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.

by @qgallouedec in #5462, #5464, #5463, #5469 and #5454

Multimodal tool responses for VLM training

environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.

class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]

The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.

by @sergiopaniego in #5323 and #5437, and by @qgallouedec in #5448
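
The difference between the old and new handling of tool results can be illustrated with a small normalizer. This is a sketch of the idea only; `normalize_tool_result` is a hypothetical helper, not a TRL function.

```python
def normalize_tool_result(result):
    """Turn a tool's return value into a list of chat content blocks."""
    is_block_list = isinstance(result, list) and all(
        isinstance(block, dict) and "type" in block for block in result
    )
    if is_block_list:
        return result  # keep images and text as structured blocks
    # Anything else falls back to the old str(result) behavior, wrapped as text.
    return [{"type": "text", "text": str(result)}]


screenshot_blocks = [
    {"type": "image", "image": b"fake-png-bytes"},
    {"type": "text", "text": "Current page state"},
]
multimodal = normalize_tool_result(screenshot_blocks)
plain = normalize_tool_result(42)
```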

Built-in reward functions now log extra columns

accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5308

Other

Align KTO with DPO: precompute reference log probs at init by @albertvillanova in #5447
Align KTO with DPO: reorganize KTOConfig by @albertvillanova in #5477
Use generic VLM key passthrough in DPO by @albertvillanova in #5468
Make images optional in prepare_multimodal_messages by @albertvillanova in #5424
Avoid image deepcopy in prepare_multimodal_messages by @albertvillanova in #5475
Replace pixel_position_ids with image_position_ids for Gemma 4 support by @qgallouedec in #5452
Update vLLM minimum supported version to 0.11.0 by @albertvillanova in #5443
Remove dead token attributes from trainers by @albertvillanova in #5483
Remove unnecessary isinstance(part, dict) checks in image extraction by @qgallouedec in #5439
Simplify _get_tool_suffix_ids by @qgallouedec in #5440
Narrow prefix-preserving check to the actual requirement by @qgallouedec in #5458
Remove duplicated prepare_deepspeed by @albertvillanova in #5414

Fixes

Fix targeting fused parameters with LoRA by @BenjaminBossan in #5430
Fix ImportError with vllm-0.10.2 in OnlineDPO and OpenEnv by @albertvillanova in #5423
Fix _get_per_token_logps_and_entropies return type by @kashif in #5456
Fix SFT deprecation warning by @albertvillanova in #5466
Fix broken validation of user-specified tokens by @albertvillanova in #5482
Fix prepare_multimodal_messages not normalizing empty string content for assistant/tool roles by @albertvillanova in #5496
Remove redundant alignment of pad_token_id by @albertvillanova in #5487
Replace deprecated huggingface-cli references with hf by @hanouticelina in #5486
Remove unused truncation_mode from experimental truncate_dataset by @albertvillanova in #5467
Fix PR template check bot reopen loop by @qgallouedec in #5488
R...

@qgallouedec

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5293
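
The decoupling described above is a producer/consumer pattern. The sketch below shows its shape with a thread and a bounded queue from the standard library. A local thread stands in for the external vLLM server, and `rollout_worker` and `training_loop` are hypothetical names, not AsyncGRPOTrainer internals.

```python
import queue
import threading


def rollout_worker(prompts, rollouts, generate):
    """Producer: generate completions and hand them to the trainer."""
    for prompt in prompts:
        rollouts.put((prompt, generate(prompt)))  # blocks while the queue is full
    rollouts.put(None)  # sentinel: no more rollouts


def training_loop(rollouts, train_step):
    """Consumer: run one update per rollout until the sentinel arrives."""
    steps = 0
    while True:
        item = rollouts.get()
        if item is None:
            return steps
        train_step(*item)
        steps += 1


rollouts = queue.Queue(maxsize=2)  # a small bound limits how stale queued rollouts can get
seen = []
worker = threading.Thread(
    target=rollout_worker,
    args=(["a", "b", "c"], rollouts, lambda prompt: prompt.upper()),
)
worker.start()
num_steps = training_loop(rollouts, lambda prompt, completion: seen.append((prompt, completion)))
worker.join()
```

The queue bound is the main design knob: a larger bound hides more generation latency but lets the trainer consume rollouts produced by an older policy.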

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

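from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Same math dataset as the other examples in these notes
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()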

by @MengAiDev in #4935

Reward functions can now log extra columns and scalar metrics

Reward functions can now log extra values alongside the reward: per-sample columns through the optional log_extra callback and scalar metrics through the optional log_metric callback. This makes it easier to track intermediate signals without writing custom trainer callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233
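
The reward function above can be exercised outside a trainer by passing simple collectors as the two callbacks. In this sketch `extract_answer` is a hypothetical helper (it takes the last integer in the text), completions are plain strings, and the dictionaries stand in for whatever the trainer does with logged values.

```python
import re


def extract_answer(completion):
    # Hypothetical helper: treat the last integer in the completion as the answer.
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None


def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards


# Stand-ins for the trainer: collect logged columns and metrics in dictionaries.
extra_columns, metrics = {}, {}
rewards = my_reward_fn(
    completions=["2 + 2 = 4", "I think it is 5"],
    answer=["4", "4"],
    log_extra=extra_columns.__setitem__,
    log_metric=metrics.__setitem__,
)
```

Because both callbacks default to None, the same function still works with callers that do not supply them.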

Tool calling support in `VLLMClient.chat()`

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

by @mariosasko in #5189
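
For readers unfamiliar with the strategy, BFD stands for best-fit decreasing bin packing. The sketch below is a straightforward quadratic version on sequence lengths, written for clarity rather than speed. It is not TRL's optimized implementation, and `bfd_pack` is a hypothetical name.

```python
def bfd_pack(lengths, max_length):
    """Pack sequence lengths into bins of capacity `max_length` using best-fit decreasing."""
    bins = []  # each bin is [remaining_capacity, [lengths placed in it]]
    for length in sorted(lengths, reverse=True):  # "decreasing": longest first
        if length > max_length:
            raise ValueError(f"sequence of length {length} exceeds max_length={max_length}")
        # "Best fit": among bins with room, pick the one with the least space left.
        best = None
        for candidate in bins:
            if candidate[0] >= length and (best is None or candidate[0] < best[0]):
                best = candidate
        if best is None:
            best = [max_length, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return [contents for _, contents in bins]


packed = bfd_pack([5, 4, 3, 2, 2], max_length=8)
```

Placing long sequences first leaves the short ones to fill gaps, which keeps padding low.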

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in #5137 and #5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Change default vllm_mode to "colocate" by @qgallouedec in #5255
Support truncation_mode in SFT by @albertvillanova in #5306
Support max_length in DPO VLM training by @albertvillanova in #5284
Add pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
Support sequence sampling in Liger Kernel by @michaelroyzen in #5190
Add tool calling support to VLLMClient.chat() by @kansalaman in #4889
Add support for raw token IDs in vLLM client prompts by @qgallouedec in #5225
Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
Enhance print_prompt_completions_sample to include reasoning content by @qgallouedec in #5327
Add support for pixel_position_ids vision key by @qgallouedec in #5374
Add second version of Qwen 3.5 chat template by @apardyl in #5405
Pass tools as None to apply_chat_template when it is an empty list by @rabinadk1 in #5380

Fixes

Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
[GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
[CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
Clean up model update group on worker exit by @AmineDiro in #5325
Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354

Documentation and Examples

Add minimal CARLA example script by @sergiopaniego in #5161
...

@casinca

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Reward functions can now log extra columns and scalar metrics

Reward functions can now log extra values alongside the reward: per-sample columns through the optional log_extra callback and scalar metrics through the optional log_metric callback. This makes it easier to track intermediate signals without writing custom trainer callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233

Tool calling support in `VLLMClient.chat()`

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

by @mariosasko in #5189

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in #5137

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Change default vllm_mode to "colocate" by @qgallouedec in #5255
Support truncation_mode in SFT by @albertvillanova in #5306
Support max_length in DPO VLM training by @albertvillanova in #5284
Add pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
Support sequence sampling in Liger Kernel by @michaelroyzen in #5190
Add tool calling support to VLLMClient.chat() by @kansalaman in #4889
Add support for raw token IDs in vLLM client prompts by @qgallouedec in #5225
Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227

Fixes

Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
[GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
[CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
Clean up model update group on worker exit by @AmineDiro in #5325

Documentation and Examples

Add minimal CARLA example script by @sergiopaniego in #5161
Nemotron 3 examples added by @sergiopaniego in #5272
Align docs about tool calling in trainers with dataset format by @albertvillanova in #5311
Add repository-specific guidance for agents (AGENTS.md) by @qgallouedec in #5236
Align documentation with the intended public API by @qgallouedec in #5162

What's Changed

⬆️ Bump dev version by @qgallouedec in #5182
Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
Document parameters with differing default values in core configs by @albertvillanova in #5168
Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in #5169
Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in #5170
Add minimal CARLA example script by @sergiopaniego in #5161
Align documentation with the intended public API by @qgallouedec in #5162
Fix deprecation warning of create_reference_model by @albertvillanova in #5184
Fix deprecation warning of fork in multi-threaded process by @albertvillanova in #5185
Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in #5186
Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in #5187
Fix CI tests patching BaseTrainer by @albertvillanova in #5192
Add pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
Re-add liger-kernel to dev deps by @qgallouedec in #5164
Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in #5197
Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in #5190
Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in #5202
Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in #5204
Add support for Qwen3.5 f...

@albertvillanova

What's Changed

Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
Simplify logic for structured outputs across vLLM versions by @albertvillanova in #5215
Add support for raw ids in prompts in vLLM client and server by @qgallouedec in #5225
Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
Move rollout_func from _generate_single_turn to _generate by @qgallouedec in #5232
[GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in #5238
Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in #5259
[GRPO/RLOO] Unify tokenization across all generation backends in _generate_single_turn by @qgallouedec in #5239
[GRPO/RLOO] Extract tokenize prompts from _generate_single_turn by @qgallouedec in #5240
[CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in #4639
Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5258
Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in #5242

New Contributors

@davmels made their first contribution in #4639
@falcondai made their first contribution in #5302

Full Changelog: v0.29.0...v0.29.1

Releases: huggingface/trl

v1.6.0

Features

AsyncGRPO rollout worker now runs in a separate process

New experimental A2PO trainer (Optimal Advantage Regression)

KTO now supports VLMs + big alignment push

Cross-tokenizer alignment in GOLD via byte offsets

SDFT / SDPO: live teacher logprobs from the vLLM server

Bidirectional masked importance sampling (MIS) for IcePop

NemotronH and Nemotron 3 Ultra support

Even more training chat templates

Distributed backend boilerplate, hidden

Decoupled self-distillation trainers

Heads-up: SFT default loss_type will change in 1.7

Other

Fixes

Contributors

v1.5.1

What's Changed

Contributors

v1.5.0

Features

Even more training chat templates

Final logits softcapping for async GRPO

KTO ↔ DPO alignment continues

Trainer telemetry (opt-out)

Other

Fixes

Documentation and Examples

CI

New Contributors

What's Changed

Contributors

v1.4.0

Features

Chunked cross-entropy loss for SFT (up to –50% VRAM)

OpenReward Standard environment adapter (experimental)

Training-invariance test suite

MFU helpers

GRPO Liger kernel update (Liger 0.8.0)

Length-normalized DPO sigmoid loss

Even more training chat templates

KTO ↔ DPO alignment: closing in on graduation

Other

Fixes

Documentation and Examples

CI

Contributors

v1.3.0

Features

Qwen 3.6 integration

New experimental TPO trainer

Speculative decoding in trl vllm-serve

KTO ↔ DPO alignment: nearing the finish line

More {% generation %} training chat templates

Other

Fixes

Documentation and Examples

CI

New Contributors

What's Changed

Contributors

v1.2.0

Features

New SSDTrainer — Simple Self-Distillation

Drop, don't truncate, overlong tool results in GRPOTrainer

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

KTO/DPO alignment push

Other

Fixes

Deprecations

Documentation and Examples

CI

What's Changed

Contributors
