[None][perf] Remove unnecessary ToPIL() from find_mm_token_lengths by yechank-nvidia · Pull Request #11640 · NVIDIA/TensorRT-LLM

yechank-nvidia · 2026-02-23T06:45:00Z

This PR removes ToPIL() from find_mm_token_lengths.
It takes around 5ms per request to convert Tensor to Image.Image object. By removing this, we could obtain around ~3-5% improved TTFT.

Summary by CodeRabbit

Release Notes

Refactor
- Updated image and video input processing to use torch.Tensor instead of PIL Image objects.
- Modified method signatures in multimodal input processors to accept tensor-based inputs.
- Removed automatic PIL image conversion in multimodal length computation.
Tests
- Updated image token count tests to align with tensor-based input handling.

yechank-nvidia · 2026-02-23T06:47:59Z

/bot run

coderabbitai · 2026-02-23T06:48:48Z

📝 Walkthrough

Walkthrough

Type annotations are refactored across multiple model files to replace PIL Image.Image parameters with torch.Tensor parameters in image and video processing methods. Dimension extraction logic is updated to derive height and width from tensor shapes instead of PIL image attributes. Supporting framework code and tests are updated accordingly.

Changes

Cohort / File(s)	Summary
Model implementations `tensorrt_llm/_torch/models/modeling_mistral.py`, `tensorrt_llm/_torch/models/modeling_nemotron_nano.py`, `tensorrt_llm/_torch/models/modeling_phi4mm.py`	Method signatures updated to accept `torch.Tensor` instead of PIL image types in `get_num_tokens_per_image` and `get_num_tokens_per_video` methods. Dimension extraction changed from PIL attributes to tensor `.shape[-2:]`. PIL import removed from phi4mm.py.
Input processing framework `tensorrt_llm/inputs/registry.py`, `tensorrt_llm/inputs/multimodal.py`	Base class and utility code updated to work with tensor-based inputs. Removed PIL conversion logic in multimodal.py token counting. Updated dimension extraction to use tensor shapes in registry.py `BaseMultimodalInputProcessor`.
Tests `tests/unittest/_torch/multimodal/test_find_num_image_tokens.py`	Updated test data loading to use tensor format and extract dimensions from tensor shape instead of PIL image size attributes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The pull request description is incomplete and does not follow the required template structure.	Complete the description section with a clear explanation of what and why, provide test coverage details, and complete the PR Checklist.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main optimization: removing unnecessary ToPIL() conversion from find_mm_token_lengths for performance improvement.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/inputs/registry.py (1)
288-314: ⚠️ Potential issue | 🟠 Major

Type annotation mismatch: video parameter still typed as List[Image.Image].

The type annotation on line 291 specifies video: List[Image.Image], but the implementation on lines 303 and 310 uses tensor-specific attributes (.shape[-2], .shape[-1]). This inconsistency will mislead developers and cause static analysis tools to report false positives.
🐛 Proposed fix
     def get_num_tokens_per_video(
         self,
         *,
-        video: List[Image.Image],
+        video: List[torch.Tensor],
         **kwargs,
     ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/registry.py` around lines 288 - 314, The parameter type
for get_num_tokens_per_video is incorrect: update the annotation for video in
get_num_tokens_per_video from List[Image.Image] to a sequence of tensors (e.g.,
Sequence[torch.Tensor] or Sequence[Any] if torch isn't available) so it matches
the implementation that accesses .shape; add the appropriate typing import
(Sequence) and, if using PyTorch, reference torch.Tensor in the annotation and
update the docstring to reflect that frames are tensors, not PIL Images; keep
the function name get_num_tokens_per_video and usages unchanged.

🧹 Nitpick comments (1)

tensorrt_llm/_torch/models/modeling_mistral.py (1)
281-285: Signature inconsistency with base class BaseMultimodalInputProcessor.

The base class in registry.py (line 270-286) defines the method with keyword-only arguments:
def get_num_tokens_per_image(self, *, image: torch.Tensor, **kwargs):
This implementation lacks the *, marker and **kwargs, which could cause issues if callers pass additional keyword arguments or use the method polymorphically through the base class interface.
♻️ Suggested fix to align with base class signature
-    def get_num_tokens_per_image(self, image: torch.Tensor):
+    def get_num_tokens_per_image(self, *, image: torch.Tensor, **kwargs):
         h, w = image.shape[-2:]
         ncols, nrows = self.image_processor._image_to_num_tokens(
             Image.new("RGB", (w, h)))
         return ncols * nrows + nrows
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_mistral.py` around lines 281 - 285, The
method get_num_tokens_per_image in modeling_mistral.py must match the
BaseMultimodalInputProcessor signature: change its definition to include the
keyword-only marker and accept arbitrary kwargs (i.e., def
get_num_tokens_per_image(self, *, image: torch.Tensor, **kwargs):) so it
supports calls via the base class and extra keywords; keep the current logic
that computes h,w from image.shape[-2:], calls
self.image_processor._image_to_num_tokens(Image.new("RGB", (w, h))) and returns
ncols * nrows + nrows unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/_torch/multimodal/test_find_num_image_tokens.py`:
- Around line 121-122: The test assigns image_width, image_height =
test_image.shape[-2:] but tensor.shape[-2:] yields (height, width), so swap the
assignment to image_height, image_width = test_image.shape[-2:] to match actual
dimensions; locate the usage in the test around load_image and the variables
image_width/image_height (used later in assertions) and update the assignment to
avoid misleading error messages.

---

Outside diff comments:
In `@tensorrt_llm/inputs/registry.py`:
- Around line 288-314: The parameter type for get_num_tokens_per_video is
incorrect: update the annotation for video in get_num_tokens_per_video from
List[Image.Image] to a sequence of tensors (e.g., Sequence[torch.Tensor] or
Sequence[Any] if torch isn't available) so it matches the implementation that
accesses .shape; add the appropriate typing import (Sequence) and, if using
PyTorch, reference torch.Tensor in the annotation and update the docstring to
reflect that frames are tensors, not PIL Images; keep the function name
get_num_tokens_per_video and usages unchanged.

---

Nitpick comments:
In `@tensorrt_llm/_torch/models/modeling_mistral.py`:
- Around line 281-285: The method get_num_tokens_per_image in
modeling_mistral.py must match the BaseMultimodalInputProcessor signature:
change its definition to include the keyword-only marker and accept arbitrary
kwargs (i.e., def get_num_tokens_per_image(self, *, image: torch.Tensor,
**kwargs):) so it supports calls via the base class and extra keywords; keep the
current logic that computes h,w from image.shape[-2:], calls
self.image_processor._image_to_num_tokens(Image.new("RGB", (w, h))) and returns
ncols * nrows + nrows unchanged.

tensorrt-cicd · 2026-02-23T06:54:34Z

PR_Github #36464 [ run ] triggered by Bot. Commit: 26239bd Link to invocation

tensorrt-cicd · 2026-02-23T08:06:02Z

PR_Github #36464 [ run ] completed with state SUCCESS. Commit: 26239bd
/LLM/main/L0_MergeRequest_PR pipeline #28209 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-02-24T02:14:15Z

/bot run

tensorrt-cicd · 2026-02-24T02:20:20Z

PR_Github #36580 [ run ] triggered by Bot. Commit: 26239bd Link to invocation

tensorrt-cicd · 2026-02-24T03:38:20Z

PR_Github #36580 [ run ] completed with state SUCCESS. Commit: 26239bd
/LLM/main/L0_MergeRequest_PR pipeline #28310 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia · 2026-04-24T06:51:18Z

/bot run

tensorrt-cicd · 2026-04-24T07:07:35Z

PR_Github #45358 [ run ] triggered by Bot. Commit: e6a2bb9 Link to invocation

tensorrt-cicd · 2026-04-24T12:10:28Z

PR_Github #45358 [ run ] completed with state SUCCESS. Commit: e6a2bb9
/LLM/main/L0_MergeRequest_PR pipeline #35605 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-04-24T14:42:43Z

/bot run

tensorrt-cicd · 2026-04-24T14:50:51Z

PR_Github #45403 [ run ] triggered by Bot. Commit: e6a2bb9 Link to invocation

tensorrt-cicd · 2026-04-24T17:43:17Z

PR_Github #45403 [ run ] completed with state SUCCESS. Commit: e6a2bb9
/LLM/main/L0_MergeRequest_PR pipeline #35640 completed with status: 'SUCCESS'

CI Report

Link to invocation

…VIDIA#11640) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia requested review from a team as code owners February 23, 2026 06:45

yechank-nvidia requested review from Wanli-Jiang, amukkara and jdebache February 23, 2026 06:45

yechank-nvidia changed the title ~~[perf][fix] Remove unnecessary ToPIL() from find_mm_token_lengths~~ [None][perf] Remove unnecessary ToPIL() from find_mm_token_lengths Feb 23, 2026

yechank-nvidia self-assigned this Feb 23, 2026

yechank-nvidia added the Multimodal Label for issues & PRs regarding Multimodal related objects label Feb 23, 2026

coderabbitai Bot reviewed Feb 23, 2026

View reviewed changes

Comment thread tests/unittest/_torch/multimodal/test_find_num_image_tokens.py Outdated

yechank-nvidia added 2 commits April 23, 2026 20:26

remove ToPIL from find_mm_token_lengths

7ad20c1

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

fix test

e6a2bb9

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia force-pushed the mm_token_len branch from 26239bd to e6a2bb9 Compare April 24, 2026 06:50

yechank-nvidia requested a review from a team as a code owner April 24, 2026 06:50

yechank-nvidia requested a review from moraxu April 24, 2026 06:50

2ez4bz approved these changes Apr 24, 2026

View reviewed changes

moraxu approved these changes Apr 24, 2026

View reviewed changes

amukkara approved these changes Apr 24, 2026

View reviewed changes

2ez4bz enabled auto-merge (squash) April 24, 2026 17:21

2ez4bz merged commit a2a7d0b into NVIDIA:main Apr 24, 2026
5 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][perf] Remove unnecessary ToPIL() from find_mm_token_lengths (N…

4c09b45

…VIDIA#11640) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Conversation

yechank-nvidia commented Feb 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

yechank-nvidia commented Feb 23, 2026

Uh oh!

coderabbitai Bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tensorrt-cicd commented Feb 23, 2026

Uh oh!

tensorrt-cicd commented Feb 23, 2026

Uh oh!

yechank-nvidia commented Feb 24, 2026

Uh oh!

tensorrt-cicd commented Feb 24, 2026

Uh oh!

tensorrt-cicd commented Feb 24, 2026

Uh oh!

yechank-nvidia commented Apr 24, 2026

Uh oh!

tensorrt-cicd commented Apr 24, 2026

Uh oh!

tensorrt-cicd commented Apr 24, 2026

Uh oh!

yechank-nvidia commented Apr 24, 2026

Uh oh!

tensorrt-cicd commented Apr 24, 2026

Uh oh!

tensorrt-cicd commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yechank-nvidia commented Feb 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 23, 2026 •

edited

Loading