Skip to content

[None][feature] Add thinking token budget control#14665

Merged
Wanli-Jiang merged 3 commits into
NVIDIA:mainfrom
tijyojwad:feature/thinking-budget-control
May 29, 2026
Merged

[None][feature] Add thinking token budget control#14665
Wanli-Jiang merged 3 commits into
NVIDIA:mainfrom
tijyojwad:feature/thinking-budget-control

Conversation

@tijyojwad

@tijyojwad tijyojwad commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features
    • Added thinking token budget parameter that allows users to control the maximum number of tokens allocated for model reasoning during generation.
    • Integrated budget enforcement across OpenAI-compatible API endpoints and LLM API interfaces.
    • Supports unlimited reasoning when budget is not specified.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
@tijyojwad tijyojwad requested review from a team as code owners May 28, 2026 03:44
@tijyojwad tijyojwad requested a review from venkywonka May 28, 2026 03:44
Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
@tijyojwad

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR adds per-request thinking token budget enforcement to TensorRT-LLM. A new thinking_token_budget sampling parameter limits reasoning-phase token generation; when exhausted, a logits processor forces an end-of-reasoning token. The feature integrates across the sampling pipeline, OpenAI protocol, and all generation handlers.

Changes

Thinking Budget Token Limiting

Layer / File(s) Summary
Sampling parameter definition and validation
tensorrt_llm/sampling_params.py
New thinking_token_budget: Optional[int] field added to SamplingParams with validate_thinking_token_budget(...) normalizing -1 as unlimited and rejecting negative values.
Thinking budget logits processor implementation
tensorrt_llm/llmapi/thinking_budget.py
ThinkingBudgetLogitsProcessor class detects reasoning start/end token boundaries in each beam's history, computes remaining budget (accounting for partial end-token suffix/prefix overlaps), and forces the end token when budget expires by masking logits. add_thinking_budget_logits_processor(...) conditionally attaches the processor to existing logits-processor chains via _ChainedLogitsProcessor. Helper functions encode boundary strings and locate token sequences.
OpenAI request model integration
tensorrt_llm/serve/openai_protocol.py
CompletionRequest, ChatCompletionRequest, and ResponsesRequest each gain thinking_token_budget fields with Pydantic validators delegating to validate_thinking_token_budget(...), and to_sampling_params(...) methods forward the parameter into SamplingParams.
Server-side processor wiring
tensorrt_llm/llmapi/llm.py, tensorrt_llm/serve/openai_server.py, tensorrt_llm/serve/responses_utils.py
add_thinking_budget_logits_processor(...) wired into LLM API's _prepare_sampling_params, OpenAI handler's openai_chat and openai_completion paths, and response request preprocessing. chat_harmony adds validation to reject thinking budget on that serving path.
Public API exposure and validation tests
tensorrt_llm/llmapi/__init__.py, tests/unittest/api_stability/references/sampling_params.yaml, tests/unittest/llmapi/test_sampling_params.py
ThinkingBudgetLogitsProcessor and add_thinking_budget_logits_processor exported from tensorrt_llm.llmapi. API stability schema updated. Tests validate parameter normalization, processor token forcing at budget exhaustion, partial end-token masking, and processor attachment via OpenAI request models using fake tokenizer/reasoning parser.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely the template with no actual content filled in—no explanation of the issue, solution, test coverage, or checklist completion beyond an auto-checked box. Replace template with actual description explaining what thinking token budget control does, why it's needed, which tests cover the changes, and confirm the PR checklist items are addressed.
Docstring Coverage ⚠️ Warning Docstring coverage is 9.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main feature being added: thinking token budget control. It directly corresponds to the changes across multiple files for implementing this functionality.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/unittest/llmapi/test_sampling_params.py (2)

58-63: ⚡ Quick win

Add explicit return annotations to updated test/helper functions.

Please add -> None to the changed test functions and a concrete return annotation for FakeTokenizer.encode to align with repo typing standards.

Suggested patch
-def test_completion_logprobs_assignment_revalidates():
+def test_completion_logprobs_assignment_revalidates() -> None:
@@
-def test_thinking_token_budget_unlimited_values(value):
+def test_thinking_token_budget_unlimited_values(value: int | None) -> None:
@@
-def test_thinking_token_budget_rejects_invalid_values(value):
+def test_thinking_token_budget_rejects_invalid_values(value: object) -> None:
@@
-def test_chat_thinking_token_budget_request_validation():
+def test_chat_thinking_token_budget_request_validation() -> None:
@@
-def test_thinking_budget_logits_processor_forces_end_token():
+def test_thinking_budget_logits_processor_forces_end_token() -> None:
@@
-def test_thinking_budget_logits_processor_completes_partial_end_sequence():
+def test_thinking_budget_logits_processor_completes_partial_end_sequence() -> None:
@@
-def test_thinking_budget_logits_processor_ignores_partial_end_before_budget():
+def test_thinking_budget_logits_processor_ignores_partial_end_before_budget() -> None:
@@
-def test_thinking_budget_logits_processor_ignores_closed_reasoning_block():
+def test_thinking_budget_logits_processor_ignores_closed_reasoning_block() -> None:
@@
-def test_add_thinking_budget_logits_processor_uses_reasoning_parser_tokens():
+def test_add_thinking_budget_logits_processor_uses_reasoning_parser_tokens() -> None:
     class FakeTokenizer:
-        def encode(self, text, add_special_tokens=False):
+        def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
             del add_special_tokens

As per coding guidelines, “Static type checking with mypy is opt-in by submodule; always annotate functions with return types (use None if function does not return)”.

Also applies to: 65-169

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_sampling_params.py` around lines 58 - 63, Add
explicit return type annotations to the modified test and helper functions:
append "-> None" to any test functions in the updated block (the tests around
request.to_sampling_params checks) and to other test helpers in the 65-169
region, and add a concrete return annotation to FakeTokenizer.encode (e.g., ->
List[int] or -> Sequence[int] consistent with project typing) so all functions
have explicit return types per the repo's mypy guidance; update imports if
needed for typing names.

65-169: QA list updates are unnecessary for this PR scope.

These are unittest/API-stability changes only, so no tests/integration/test_lists/qa/* update is required in this PR.

As per coding guidelines, “If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_sampling_params.py` around lines 65 - 169, The PR
touches only unit tests and API-stability tests (e.g.,
test_thinking_token_budget_unlimited_values,
test_thinking_budget_logits_processor_*, SamplingParams,
add_thinking_budget_logits_processor), so explicitly state in the PR description
(or commit message) that QA list updates are unnecessary for this change; add a
single-line note in the PR body like “QA list updates unnecessary: changes
limited to unittest/API-stability” and optionally add a one-line comment at the
top of the modified test module to record that scope for future reviewers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unittest/llmapi/test_sampling_params.py`:
- Around line 58-63: Add explicit return type annotations to the modified test
and helper functions: append "-> None" to any test functions in the updated
block (the tests around request.to_sampling_params checks) and to other test
helpers in the 65-169 region, and add a concrete return annotation to
FakeTokenizer.encode (e.g., -> List[int] or -> Sequence[int] consistent with
project typing) so all functions have explicit return types per the repo's mypy
guidance; update imports if needed for typing names.
- Around line 65-169: The PR touches only unit tests and API-stability tests
(e.g., test_thinking_token_budget_unlimited_values,
test_thinking_budget_logits_processor_*, SamplingParams,
add_thinking_budget_logits_processor), so explicitly state in the PR description
(or commit message) that QA list updates are unnecessary for this change; add a
single-line note in the PR body like “QA list updates unnecessary: changes
limited to unittest/API-stability” and optionally add a one-line comment at the
top of the modified test module to record that scope for future reviewers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25fd3653-5f10-4ea2-ae12-a2f2d4e96fb2

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8c739 and d4f3d0f.

📒 Files selected for processing (9)
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/thinking_budget.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/responses_utils.py
  • tests/unittest/api_stability/references/sampling_params.yaml
  • tests/unittest/llmapi/test_sampling_params.py

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50686 [ run ] triggered by Bot. Commit: 84c1a41 Link to invocation

@nv-guomingz nv-guomingz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for llm api part

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50686 [ run ] completed with state SUCCESS. Commit: 84c1a41
/LLM/main/L0_MergeRequest_PR pipeline #40174 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tijyojwad

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50777 [ run ] triggered by Bot. Commit: 84c1a41 Link to invocation

@tijyojwad tijyojwad added the api-compatible Accepted LLM API contract change that is backwards-compatible label May 28, 2026
Comment thread tests/unittest/api_stability/references/sampling_params.yaml
Signed-off-by: tijyojwad <1127155+tijyojwad@users.noreply.github.com>
@tijyojwad

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50815 [ run ] triggered by Bot. Commit: db7b36a Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50777 [ run ] completed with state ABORTED. Commit: 84c1a41

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50815 [ run ] completed with state FAILURE. Commit: db7b36a
/LLM/main/L0_MergeRequest_PR pipeline #40285 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tijyojwad

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50920 [ run ] triggered by Bot. Commit: db7b36a Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50920 [ run ] completed with state SUCCESS. Commit: db7b36a
/LLM/main/L0_MergeRequest_PR pipeline #40381 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Wanli-Jiang Wanli-Jiang merged commit 5421ef9 into NVIDIA:main May 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-compatible Accepted LLM API contract change that is backwards-compatible

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants