Align KTO with DPO: Align tokenization by albertvillanova · Pull Request #5601 · huggingface/trl

albertvillanova · 2026-04-20T08:30:19Z

Align KTO with DPO: Align tokenization.

This PR refactors the tokenization logic in KTOTrainer to improve clarity, modularity, and support for conversational data. The main changes include removing the old _tokenize function, introducing a new instance method _tokenize that handles both conversational and non-conversational inputs, and updating dataset preparation to use this new logic. Additionally, the pull request simplifies and clarifies the token processing pipeline.

Part of:

KTO refactoring #4786

Changes

Tokenization Refactor and Conversational Data Support:

Removed the original _tokenize function, which was previously responsible for tokenizing batches, and replaced it with a new instance method _tokenize that can handle both plain text and conversational (chat) inputs, improving modularity and clarity.
Updated dataset preparation in _prepare_dataset to use the new _tokenize method and added support for conversational data via the is_conversational utility, ensuring correct tokenization for both chat and non-chat datasets.

Code Cleanup and Simplification:

Removed unused imports such as numpy and the type import for PreTrainedTokenizer, streamlining dependencies.
Simplified the _process_tokens function by removing redundant checks and batch construction, focusing it on label processing.

Note

Medium Risk
Changes how training/eval datasets are tokenized (including chat-style inputs), which can subtly affect prompt/answer boundaries and thus loss computation; tests were updated to cover the new behavior but tokenizer edge cases may still surface.

Overview
Refactors KTOTrainer dataset preprocessing to use a new instance _tokenize method and a per-example tokenize_fn, adding explicit handling for conversational inputs via is_conversational/apply_chat_template and emitting a warning when prompt vs prompt+completion tokenization mismatches.

Removes the old batch _tokenize implementation (and its numpy dependency) and simplifies _process_tokens to operate purely on pre-tokenized fields while preserving the label.

Updates test_tokenize_and_process_tokens to validate tokenization through trainer.train_dataset (rather than calling _tokenize directly), adds KL corruption checks against the prepared dataset, and adds a synthetic Dataset.from_dict case to exercise _process_tokens with num_proc.

^{Reviewed by Cursor Bugbot for commit 8780ca1. Bugbot is set up for automated code reviews on this repo. Configure here.}

HuggingFaceDocBuilderDev · 2026-04-20T08:32:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 13db276. Configure here.}

cursor · 2026-04-20T08:37:46Z

+                    ]
+                    prompt_completion_ids = self._tokenize(
+                        processing_class, example["prompt"] + example["completion"]
+                    )["input_ids"]


Conversational branch in tokenize_fn is unreachable dead code

Medium Severity

The is_conversational(example) check in tokenize_fn will always return False because maybe_apply_chat_template is applied to the dataset earlier (line 587–591), which converts all conversational data (lists of message dicts) to plain strings. DPO's equivalent tokenize_fn works because DPO does not call maybe_apply_chat_template before tokenization — data is still in conversational format when DPO's is_conversational check runs. Here, the entire conversational branch (using add_generation_prompt=True and apply_chat_template for tokenization) is dead code that can never execute.

Additional Locations (1)

trl/experimental/kto/kto_trainer.py#L586-L591

^{Triggered by project rule: ../.ai/AGENTS.md}

^{Reviewed by Cursor Bugbot for commit 13db276. Configure here.}

Yes, this is intentional. I am addressing this in my follow-up PR: Remove maybe_apply_chat_template

qgallouedec

lgtm!

qgallouedec · 2026-04-20T14:56:52Z

 from transformers import AutoModelForCausalLM, AutoTokenizer

 from trl.experimental.kto import KTOConfig, KTOTrainer
-from trl.experimental.kto.kto_trainer import _get_kl_dataset, _process_tokens, _tokenize


albertvillanova added 4 commits April 20, 2026 10:02

Add _tokenize method

914175a

Replace _tokenize with in-line tokenize_fn

f4e5fd5

Remove module-level _tokenize

bc1f1a7

Drop text-field preamble in _process_tokens

13db276

cursor Bot reviewed Apr 20, 2026

View reviewed changes

Update test_tokenize_and_process_tokens

8780ca1

albertvillanova mentioned this pull request Apr 20, 2026

Align KTO with DPO: Remove maybe_apply_chat_template #5606

Merged

1 task

qgallouedec approved these changes Apr 20, 2026

View reviewed changes

albertvillanova merged commit 4ca2e9b into huggingface:main Apr 20, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Align KTO with DPO: Align tokenization#5601

Align KTO with DPO: Align tokenization#5601
albertvillanova merged 5 commits into
huggingface:mainfrom
albertvillanova:align-ktp-dpo-tokenize

albertvillanova commented Apr 20, 2026 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Apr 20, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Apr 20, 2026

Uh oh!

albertvillanova Apr 20, 2026

Uh oh!

qgallouedec left a comment

Uh oh!

qgallouedec Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

albertvillanova commented Apr 20, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

HuggingFaceDocBuilderDev commented Apr 20, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 20, 2026

Choose a reason for hiding this comment

Conversational branch in tokenize_fn is unreachable dead code

Uh oh!

albertvillanova Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

qgallouedec left a comment

Choose a reason for hiding this comment

Uh oh!

qgallouedec Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

albertvillanova commented Apr 20, 2026 •

edited by cursor Bot

Loading