Skip to content

Align KTO with DPO: Align tokenization#5601

Merged
albertvillanova merged 5 commits into
huggingface:mainfrom
albertvillanova:align-ktp-dpo-tokenize
Apr 20, 2026
Merged

Align KTO with DPO: Align tokenization#5601
albertvillanova merged 5 commits into
huggingface:mainfrom
albertvillanova:align-ktp-dpo-tokenize

Conversation

@albertvillanova

@albertvillanova albertvillanova commented Apr 20, 2026

Copy link
Copy Markdown
Member

Align KTO with DPO: Align tokenization.

This PR refactors the tokenization logic in KTOTrainer to improve clarity, modularity, and support for conversational data. The main changes include removing the old _tokenize function, introducing a new instance method _tokenize that handles both conversational and non-conversational inputs, and updating dataset preparation to use this new logic. Additionally, the pull request simplifies and clarifies the token processing pipeline.

Part of:

Changes

Tokenization Refactor and Conversational Data Support:

  • Removed the original _tokenize function, which was previously responsible for tokenizing batches, and replaced it with a new instance method _tokenize that can handle both plain text and conversational (chat) inputs, improving modularity and clarity.
  • Updated dataset preparation in _prepare_dataset to use the new _tokenize method and added support for conversational data via the is_conversational utility, ensuring correct tokenization for both chat and non-chat datasets.

Code Cleanup and Simplification:

  • Removed unused imports such as numpy and the type import for PreTrainedTokenizer, streamlining dependencies.
  • Simplified the _process_tokens function by removing redundant checks and batch construction, focusing it on label processing.

Note

Medium Risk
Changes how training/eval datasets are tokenized (including chat-style inputs), which can subtly affect prompt/answer boundaries and thus loss computation; tests were updated to cover the new behavior but tokenizer edge cases may still surface.

Overview
Refactors KTOTrainer dataset preprocessing to use a new instance _tokenize method and a per-example tokenize_fn, adding explicit handling for conversational inputs via is_conversational/apply_chat_template and emitting a warning when prompt vs prompt+completion tokenization mismatches.

Removes the old batch _tokenize implementation (and its numpy dependency) and simplifies _process_tokens to operate purely on pre-tokenized fields while preserving the label.

Updates test_tokenize_and_process_tokens to validate tokenization through trainer.train_dataset (rather than calling _tokenize directly), adds KL corruption checks against the prepared dataset, and adds a synthetic Dataset.from_dict case to exercise _process_tokens with num_proc.

Reviewed by Cursor Bugbot for commit 8780ca1. Bugbot is set up for automated code reviews on this repo. Configure here.

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 13db276. Configure here.

]
prompt_completion_ids = self._tokenize(
processing_class, example["prompt"] + example["completion"]
)["input_ids"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversational branch in tokenize_fn is unreachable dead code

Medium Severity

The is_conversational(example) check in tokenize_fn will always return False because maybe_apply_chat_template is applied to the dataset earlier (line 587–591), which converts all conversational data (lists of message dicts) to plain strings. DPO's equivalent tokenize_fn works because DPO does not call maybe_apply_chat_template before tokenization — data is still in conversational format when DPO's is_conversational check runs. Here, the entire conversational branch (using add_generation_prompt=True and apply_chat_template for tokenization) is dead code that can never execute.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md

Reviewed by Cursor Bugbot for commit 13db276. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is intentional. I am addressing this in my follow-up PR: Remove maybe_apply_chat_template

@qgallouedec qgallouedec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

from transformers import AutoModelForCausalLM, AutoTokenizer

from trl.experimental.kto import KTOConfig, KTOTrainer
from trl.experimental.kto.kto_trainer import _get_kl_dataset, _process_tokens, _tokenize

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@albertvillanova albertvillanova merged commit 4ca2e9b into huggingface:main Apr 20, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants