[None][feature] Add thinking token budget control by tijyojwad · Pull Request #14665 · NVIDIA/TensorRT-LLM

tijyojwad · 2026-05-28T03:44:43Z

Summary by CodeRabbit

New Features
- Added thinking token budget parameter that allows users to control the maximum number of tokens allocated for model reasoning during generation.
- Integrated budget enforcement across OpenAI-compatible API endpoints and LLM API interfaces.
- Supports unlimited reasoning when budget is not specified.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>

tijyojwad · 2026-05-28T03:48:22Z

/bot run --disable-fail-fast

coderabbitai · 2026-05-28T03:51:04Z

📝 Walkthrough

Walkthrough

This PR adds per-request thinking token budget enforcement to TensorRT-LLM. A new thinking_token_budget sampling parameter limits reasoning-phase token generation; when exhausted, a logits processor forces an end-of-reasoning token. The feature integrates across the sampling pipeline, OpenAI protocol, and all generation handlers.

Changes

Thinking Budget Token Limiting

Layer / File(s)	Summary
Sampling parameter definition and validation `tensorrt_llm/sampling_params.py`	New `thinking_token_budget: Optional[int]` field added to `SamplingParams` with `validate_thinking_token_budget(...)` normalizing `-1` as unlimited and rejecting negative values.
Thinking budget logits processor implementation `tensorrt_llm/llmapi/thinking_budget.py`	`ThinkingBudgetLogitsProcessor` class detects reasoning start/end token boundaries in each beam's history, computes remaining budget (accounting for partial end-token suffix/prefix overlaps), and forces the end token when budget expires by masking logits. `add_thinking_budget_logits_processor(...)` conditionally attaches the processor to existing logits-processor chains via `_ChainedLogitsProcessor`. Helper functions encode boundary strings and locate token sequences.
OpenAI request model integration `tensorrt_llm/serve/openai_protocol.py`	`CompletionRequest`, `ChatCompletionRequest`, and `ResponsesRequest` each gain `thinking_token_budget` fields with Pydantic validators delegating to `validate_thinking_token_budget(...)`, and `to_sampling_params(...)` methods forward the parameter into `SamplingParams`.
Server-side processor wiring `tensorrt_llm/llmapi/llm.py`, `tensorrt_llm/serve/openai_server.py`, `tensorrt_llm/serve/responses_utils.py`	`add_thinking_budget_logits_processor(...)` wired into LLM API's `_prepare_sampling_params`, OpenAI handler's `openai_chat` and `openai_completion` paths, and response request preprocessing. `chat_harmony` adds validation to reject thinking budget on that serving path.
Public API exposure and validation tests `tensorrt_llm/llmapi/__init__.py`, `tests/unittest/api_stability/references/sampling_params.yaml`, `tests/unittest/llmapi/test_sampling_params.py`	`ThinkingBudgetLogitsProcessor` and `add_thinking_budget_logits_processor` exported from `tensorrt_llm.llmapi`. API stability schema updated. Tests validate parameter normalization, processor token forcing at budget exhaustion, partial end-token masking, and processor attachment via OpenAI request models using fake tokenizer/reasoning parser.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is entirely the template with no actual content filled in—no explanation of the issue, solution, test coverage, or checklist completion beyond an auto-checked box.	Replace template with actual description explaining what thinking token budget control does, why it's needed, which tests cover the changes, and confirm the PR checklist items are addressed.
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.76% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main feature being added: thinking token budget control. It directly corresponds to the changes across multiple files for implementing this functionality.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

tests/unittest/llmapi/test_sampling_params.py (2)

58-63: ⚡ Quick win

Add explicit return annotations to updated test/helper functions.

Please add -> None to the changed test functions and a concrete return annotation for FakeTokenizer.encode to align with repo typing standards.

Suggested patch

-def test_completion_logprobs_assignment_revalidates():
+def test_completion_logprobs_assignment_revalidates() -> None:
@@
-def test_thinking_token_budget_unlimited_values(value):
+def test_thinking_token_budget_unlimited_values(value: int | None) -> None:
@@
-def test_thinking_token_budget_rejects_invalid_values(value):
+def test_thinking_token_budget_rejects_invalid_values(value: object) -> None:
@@
-def test_chat_thinking_token_budget_request_validation():
+def test_chat_thinking_token_budget_request_validation() -> None:
@@
-def test_thinking_budget_logits_processor_forces_end_token():
+def test_thinking_budget_logits_processor_forces_end_token() -> None:
@@
-def test_thinking_budget_logits_processor_completes_partial_end_sequence():
+def test_thinking_budget_logits_processor_completes_partial_end_sequence() -> None:
@@
-def test_thinking_budget_logits_processor_ignores_partial_end_before_budget():
+def test_thinking_budget_logits_processor_ignores_partial_end_before_budget() -> None:
@@
-def test_thinking_budget_logits_processor_ignores_closed_reasoning_block():
+def test_thinking_budget_logits_processor_ignores_closed_reasoning_block() -> None:
@@
-def test_add_thinking_budget_logits_processor_uses_reasoning_parser_tokens():
+def test_add_thinking_budget_logits_processor_uses_reasoning_parser_tokens() -> None:
     class FakeTokenizer:
-        def encode(self, text, add_special_tokens=False):
+        def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
             del add_special_tokens

As per coding guidelines, “Static type checking with mypy is opt-in by submodule; always annotate functions with return types (use None if function does not return)”.

Also applies to: 65-169

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_sampling_params.py` around lines 58 - 63, Add
explicit return type annotations to the modified test and helper functions:
append "-> None" to any test functions in the updated block (the tests around
request.to_sampling_params checks) and to other test helpers in the 65-169
region, and add a concrete return annotation to FakeTokenizer.encode (e.g., ->
List[int] or -> Sequence[int] consistent with project typing) so all functions
have explicit return types per the repo's mypy guidance; update imports if
needed for typing names.

65-169: QA list updates are unnecessary for this PR scope.

These are unittest/API-stability changes only, so no tests/integration/test_lists/qa/* update is required in this PR.

As per coding guidelines, “If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_sampling_params.py` around lines 65 - 169, The PR
touches only unit tests and API-stability tests (e.g.,
test_thinking_token_budget_unlimited_values,
test_thinking_budget_logits_processor_*, SamplingParams,
add_thinking_budget_logits_processor), so explicitly state in the PR description
(or commit message) that QA list updates are unnecessary for this change; add a
single-line note in the PR body like “QA list updates unnecessary: changes
limited to unittest/API-stability” and optionally add a one-line comment at the
top of the modified test module to record that scope for future reviewers.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unittest/llmapi/test_sampling_params.py`:
- Around line 58-63: Add explicit return type annotations to the modified test
and helper functions: append "-> None" to any test functions in the updated
block (the tests around request.to_sampling_params checks) and to other test
helpers in the 65-169 region, and add a concrete return annotation to
FakeTokenizer.encode (e.g., -> List[int] or -> Sequence[int] consistent with
project typing) so all functions have explicit return types per the repo's mypy
guidance; update imports if needed for typing names.
- Around line 65-169: The PR touches only unit tests and API-stability tests
(e.g., test_thinking_token_budget_unlimited_values,
test_thinking_budget_logits_processor_*, SamplingParams,
add_thinking_budget_logits_processor), so explicitly state in the PR description
(or commit message) that QA list updates are unnecessary for this change; add a
single-line note in the PR body like “QA list updates unnecessary: changes
limited to unittest/API-stability” and optionally add a one-line comment at the
top of the modified test module to record that scope for future reviewers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25fd3653-5f10-4ea2-ae12-a2f2d4e96fb2

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8c739 and d4f3d0f.

📒 Files selected for processing (9)

tensorrt_llm/llmapi/__init__.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/llmapi/thinking_budget.py
tensorrt_llm/sampling_params.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/responses_utils.py
tests/unittest/api_stability/references/sampling_params.yaml
tests/unittest/llmapi/test_sampling_params.py

tensorrt-cicd · 2026-05-28T03:54:01Z

PR_Github #50686 [ run ] triggered by Bot. Commit: 84c1a41 Link to invocation

nv-guomingz

LGTM for llm api part

tensorrt-cicd · 2026-05-28T11:26:56Z

PR_Github #50686 [ run ] completed with state SUCCESS. Commit: 84c1a41
/LLM/main/L0_MergeRequest_PR pipeline #40174 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tijyojwad · 2026-05-28T12:57:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-28T13:03:43Z

PR_Github #50777 [ run ] triggered by Bot. Commit: 84c1a41 Link to invocation

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>

tijyojwad · 2026-05-28T15:09:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-28T15:15:26Z

PR_Github #50815 [ run ] triggered by Bot. Commit: db7b36a Link to invocation

tensorrt-cicd · 2026-05-28T15:20:25Z

PR_Github #50777 [ run ] completed with state ABORTED. Commit: 84c1a41

Link to invocation

tensorrt-cicd · 2026-05-28T23:23:27Z

PR_Github #50815 [ run ] completed with state FAILURE. Commit: db7b36a
/LLM/main/L0_MergeRequest_PR pipeline #40285 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tijyojwad · 2026-05-29T00:53:20Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-29T00:58:51Z

PR_Github #50920 [ run ] triggered by Bot. Commit: db7b36a Link to invocation

tensorrt-cicd · 2026-05-29T04:33:02Z

PR_Github #50920 [ run ] completed with state SUCCESS. Commit: db7b36a
/LLM/main/L0_MergeRequest_PR pipeline #40381 completed with status: 'SUCCESS'

CI Report

Link to invocation

Add thinking token budget control

d4f3d0f

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>

tijyojwad requested review from a team as code owners May 28, 2026 03:44

tijyojwad requested a review from venkywonka May 28, 2026 03:44

github-actions Bot assigned tijyojwad May 28, 2026

Add text-level thinking budget tests

84c1a41

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

nv-guomingz approved these changes May 28, 2026

View reviewed changes

tijyojwad added the api-compatible Accepted LLM API contract change that is backwards-compatible label May 28, 2026

juney-nvidia reviewed May 28, 2026

View reviewed changes

Comment thread tests/unittest/api_stability/references/sampling_params.yaml

Mark thinking budget API as prototype

db7b36a

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>

juney-nvidia approved these changes May 28, 2026

View reviewed changes

Wanli-Jiang merged commit 5421ef9 into NVIDIA:main May 29, 2026
7 checks passed

Conversation

tijyojwad commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

tijyojwad commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

nv-guomingz left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tijyojwad commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

Uh oh!

tijyojwad commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tijyojwad commented May 29, 2026

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tijyojwad commented May 28, 2026 •

edited by coderabbitai Bot

Loading