
[TRTLLM-13429][feat] Switch DeepSeek/NemotronH/Qwen3/Qwen3.5-MoE to sharding-IR canonical models #13478

Merged
greg-kwasniewski1 merged 11 commits into NVIDIA:main from nv-auto-deploy:gk/switch_to_ir_models
May 25, 2026

@greg-kwasniewski1 greg-kwasniewski1 commented Apr 26, 2026

Collaborator

Summary

Fixes #13429.

Replace the legacy non-IR modeling_*.py for four model architectures with their sharding-IR variants. The IR-aware implementation (using apply_sharding_hints for TP/EP/BMM via canonical-op kwargs like tp_mode, layer_type, output_sizes, tp_min_local_shape, tp_scaled_dim, shardable) is now the canonical implementation. The AD_USE_IR_MODELS env-var gate is removed.

File swap

git rm modeling_X.py && git mv modeling_X_ir.py modeling_X.py for:

  • modeling_deepseek.py — DeepSeek V3, MLA + MoE
  • modeling_nemotron_h.py — NemotronH, hybrid Mamba + MoE
  • modeling_qwen3.py — Qwen3 dense (was previously only available as _ir; now registered in __init__.py)
  • modeling_qwen3_5_moe.py — Qwen3.5 MoE, GatedDeltaNet + Gated MHA + MoE

Continuation of #12419 (sharding-IR introduction) and #13272 (post-#12419 cleanup), per the staged plan to migrate model implementations to the hint-driven sharder one cohort at a time.

Skill update for future regenerations

.claude/skills/ad-model-onboard/SKILL.md:

  • Reverse the misleading noaux_tc "Key Gotcha" bullet — was telling agents to substitute fused trtllm calls with vanilla PyTorch; corrected to KEEP torch.ops.trtllm.{noaux_tc_op, dsv3_router_gemm_op} verbatim (no AD transform recovers vanilla → fused; vanilla replacement is ~17x more kernel launches and loses the FP8-friendly router GEMM).
  • New "Step 0 — IR delta contract (READ FIRST)" with explicit A1-A5 allowlist + F1-F7 forbidden list of changes an IR port may introduce vs the non-IR base.
  • New "Step 12 — Pre-finalization self-audit (MANDATORY)" — agents must run diff -u and classify every hunk before reporting done.
  • Phase 5 position_ids rule now carries an "IR-port exception" pointing at the contract.
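
The Step 12 self-audit (run a unified diff, then classify every hunk) can be approximated with the standard library. This is a minimal sketch, assuming the base and IR sources are available as strings. The `classify` callback and its labels are hypothetical stand-ins for the A1-A5 / F1-F7 contract, which is not reproduced in this thread.

```python
import difflib
from typing import Callable, List, Tuple


def split_hunks(base_src: str, ir_src: str) -> List[List[str]]:
    """Return the unified-diff hunks between two sources, one list of lines per hunk."""
    diff = difflib.unified_diff(
        base_src.splitlines(),
        ir_src.splitlines(),
        fromfile="modeling_X.py",
        tofile="modeling_X_ir.py",
        lineterm="",
    )
    hunks: List[List[str]] = []
    for line in diff:
        if line.startswith("@@"):
            hunks.append([line])      # every @@ header opens a new hunk
        elif hunks:
            hunks[-1].append(line)    # the ---/+++ file headers precede the first hunk and are skipped
    return hunks


def audit(
    base_src: str, ir_src: str, classify: Callable[[List[str]], str]
) -> List[Tuple[str, str]]:
    """Pair each hunk header with the label returned by the (hypothetical) classifier."""
    return [(hunk[0], classify(hunk)) for hunk in split_hunks(base_src, ir_src)]


def toy_classify(hunk: List[str]) -> str:
    """Illustrative rule only: accept hunks whose added lines all call auto_deploy ops."""
    added = [line[1:] for line in hunk[1:] if line.startswith("+")]
    return "allowed" if all("torch.ops.auto_deploy" in line for line in added) else "needs review"
```

A real audit would replace `toy_classify` with a reviewer's judgment per hunk; the point of the sketch is only that no hunk goes unlisted.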

Audit summary

Each of the four IR files was regenerated from its non-IR base by parallel generalPurpose subagents using the updated skill, then audited at the parent level. Every hunk classifies as A1-A5 (allowed). Cross-checks: same class count, same nn.Parameter/register_buffer/register_load_state_dict_pre_hook count, same def forward( count, both torch.ops.trtllm.* call sites in DeepSeekV3MoEGate and NemotronHTopkRouter preserved verbatim (the noaux_tc paths are the original fused kernels), and the position_ids contract preserved per architecture.
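
The structural cross-checks listed above can be sketched as a marker count over the two source files. The marker names come from the audit summary; the regular expressions are an assumption about how one might count them, not the procedure actually used in the audit.

```python
import re
from typing import Dict, Tuple

# Markers named in the audit summary; the patterns themselves are illustrative assumptions.
MARKERS: Dict[str, str] = {
    "class": r"^class\s+\w+",
    "nn.Parameter": r"\bnn\.Parameter\(",
    "register_buffer": r"\.register_buffer\(",
    "register_load_state_dict_pre_hook": r"\.register_load_state_dict_pre_hook\(",
    "def forward(": r"\bdef forward\(",
    "torch.ops.trtllm": r"\btorch\.ops\.trtllm\.\w+",
}


def count_markers(src: str) -> Dict[str, int]:
    """Count occurrences of every marker in one source string."""
    return {
        name: len(re.findall(pattern, src, flags=re.MULTILINE))
        for name, pattern in MARKERS.items()
    }


def structural_mismatches(base_src: str, ir_src: str) -> Dict[str, Tuple[int, int]]:
    """Return {marker: (base_count, ir_count)} for every marker whose counts differ."""
    base, ir = count_markers(base_src), count_markers(ir_src)
    return {name: (base[name], ir[name]) for name in MARKERS if base[name] != ir[name]}
```

An empty result means the counts agree; it does not prove the two files are equivalent, so it complements the hunk-by-hunk review rather than replacing it.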

Test plan

  • Local smoke on small/fp8/deepseek/r1.yaml with the IR sharder + world_size=2 on 2x H100: exit=0, 51 sharding nodes, 10 prompts generated cleanly. (AD_USE_IR_MODELS=1 no longer needed since IR is canonical.)
  • Targeted CI run (Group 1 + Group 2 stages per ci-guidelines.md) — will trigger via /bot run --stage-list ... after PR opens.
  • Full pre-merge /bot run once the targeted run is green.

Refs: #12419 (sharding-IR introduction), #13271 (post-#12419 cleanup feature), #13272 (post-#12419 cleanup PR), #13429 (this feature's tracking issue).

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Enhanced tensor parallelism support for DeepSeek, Nemotron, and Qwen3.5 models with sharding-aware optimizations.
  • Refactor

    • Consolidated model registrations and unified internal variant handling for cleaner module structure.
  • Documentation

    • Updated model onboarding guidelines with IR porting requirements, operation allowlists, and mandatory audit workflows.

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Contributor
📝 Walkthrough

This change consolidates four model families from dual-file implementations to IR-fied single files. The non-IR implementations gain sharding-aware custom ops carrying explicit sharding metadata, the parallel _ir.py files are deleted, module registration is consolidated to remove conditional IR logic, and IR porting documentation guidelines are established.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| IR Porting Documentation | .claude/skills/ad-model-onboard/SKILL.md | Adds comprehensive IR delta allowlist/forbidden list, introduces mandatory pre-final audit workflow, and revises noaux_tc router instructions to preserve specific torch.ops.trtllm.* ops from base implementations. |
| Module Registration Consolidation | tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py | Removes conditional AD_USE_IR_MODELS logic and parallel *_ir module entry registration. Directly registers IR-aware sharding variants as canonical mappings. Adds explicit modeling_qwen3 entry to expose Qwen3ForCausalLM. |
| DeepSeek Consolidation | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py, modeling_deepseek_ir.py | Enriches non-IR file with explicit TP sharding intent via custom ops: replaces standard linear projections with torch_linear_simple (colwise/rowwise), adds all_reduce at MLP/MoE/MLA merge points, and passes enable_sharding and layer_type hints to torch_moe and torch_mla. Deletes outdated IR file (768 lines). |
| Nemotron-H Consolidation | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h.py, modeling_nemotron_h_ir.py | Replaces eager PyTorch ops (RMSNorm-gated, linear projections, tensor splits/views) with AutoDeploy custom ops carrying tp_mode, tp_scaled_dim, output_sizes, and layer_type metadata. Adds all_reduce semantics for Mamba/MLP/MoE final projections. Switches to torch.ops.auto_deploy.torch_rmsnorm_gated. Deletes outdated IR file (822 lines). |
| Qwen3 Updates | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py | Renames rotary embedding cos/sin caches to _ad_cos_cached/_ad_sin_cached for AutoDeploy/lift_to_meta compatibility. Adds inline documentation around cos/sin slicing, BSND layout, and prefill-only inference. Minor comment/whitespace adjustments. |
| Qwen3.5 MoE Consolidation | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py | Reworks forward passes in GatedDeltaNet, Attention, MLP, and SparseMoeBlock to use torch_linear_simple, split_with_sizes, view, and all_reduce custom ops. Tags operations with layer_type hints ("delta", "mha", "moe"). Adds AutoDeploy custom ops import. |
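
The colwise/rowwise split followed by an all_reduce, as described for the MLP paths above, can be checked numerically with a toy example. This is a pure-Python sketch, not the AutoDeploy ops themselves: the function names are hypothetical, weights use the `[out_features, in_features]` layout, and the "ranks" are simulated in a loop.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product for a weight in [out_features, in_features] layout."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def mlp(w_up: Matrix, w_down: Matrix, x: List[float]) -> List[float]:
    """Unsharded reference: down(relu(up(x)))."""
    return matvec(w_down, relu(matvec(w_up, x)))


def sharded_mlp(w_up: Matrix, w_down: Matrix, x: List[float], world_size: int) -> List[float]:
    """Simulate tensor parallelism: colwise up-projection, rowwise down-projection, then sum."""
    hidden = len(w_up)
    assert hidden % world_size == 0, "hidden size must divide evenly across ranks"
    step = hidden // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        up_local = w_up[lo:hi]                       # colwise: each rank owns a slice of output features
        down_local = [row[lo:hi] for row in w_down]  # rowwise: each rank owns the matching input features
        partials.append(matvec(down_local, relu(matvec(up_local, x))))
    # all_reduce(sum): every rank holds a partial output that must be added together
    return [sum(values) for values in zip(*partials)]
```

Because the activation is elementwise, each rank can apply it to its own slice of hidden units, and a single summation at the end recovers the unsharded result. That is why the reduction sits at the merge point rather than after every projection.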

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: switching four model families (DeepSeek, NemotronH, Qwen3, Qwen3.5-MoE) to use sharding-IR as the canonical implementation. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #13429 by promoting four model families to use sharding-IR as canonical [deepseek, nemotron_h, qwen3, qwen3_5_moe], removing AD_USE_IR_MODELS gates, and updating skill documentation as required. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the four target models plus skill documentation updates; no unrelated modifications to other models, infrastructure, or unrelated components are present. |
| Description check | ✅ Passed | The PR description comprehensively covers the changes, test plan, and references. It includes clear file-swap operations, skill updates, audit summary, and detailed test results. |
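
For reference, a docstring-coverage figure like the one in the failed check can be estimated locally with the `ast` module. How the bot computes its number is not stated in this thread, so the definition below (documented functions and classes divided by all functions and classes) is an assumption.

```python
import ast


def docstring_coverage(src: str) -> float:
    """Return the percentage of functions and classes in `src` that carry a docstring."""
    nodes = [
        node
        for node in ast.walk(ast.parse(src))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document; treating this as full coverage is a choice, not a standard
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)
```

Running it over a modeling file's source and comparing against the 80% threshold gives a rough idea of how many docstrings would be needed to clear the warning.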


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.claude/skills/ad-model-onboard/SKILL.md (1)

556-559: ⚠️ Potential issue | 🟠 Major

Resolve conflicting guidance on torch.ops.trtllm.* usage.

Line 556 requires preserving specific torch.ops.trtllm.* router ops, but Line 558 says to never use trtllm_*. This contradiction can lead to incorrect ports. Please carve out an explicit exception in the “Key Gotchas” rule for the noaux_tc/dsv3 router case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ad-model-onboard/SKILL.md around lines 556 - 559, The "Key
Gotchas" rule contradicts earlier guidance about preserving specific router ops;
add an explicit exception stating that for noaux_tc / DeepSeek-V3 style routers
you must keep torch.ops.trtllm.noaux_tc_op and
torch.ops.trtllm.dsv3_router_gemm_op exactly as in the non-IR base (router gate
is TP-REPLICATED, no sharding hints), while retaining the general prohibition on
trtllm_* in AutoDeploy (i.e., only allow these two trtllm ops when the source
model uses them verbatim and do not introduce vanilla replacements), and update
the SKILL.md section text to reflect this exception so both rules are
consistent.
tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py (1)

9-37: ⚠️ Potential issue | 🔴 Critical

Add modeling_llama3_ir to _MODEL_MODULES or restore the AD_USE_IR_MODELS gate.

The file modeling_llama3_ir.py exists but is not discoverable through the custom models __init__.py — it's neither listed in _MODEL_MODULES nor guarded by the AD_USE_IR_MODELS environment variable that tests still set. The module self-registers via AutoModelForCausalLMFactory, but removal of the environment-gated registration breaks the intended conditional loading mechanism referenced in the tests.

Either add "modeling_llama3_ir": ["Llama3ForCausalLM"] to _MODEL_MODULES (consistent with other converted IR variants like deepseek, nemotron_h, qwen3, and qwen3_5_moe), or restore the conditional gate if IR-only loading is required.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` around lines 9 -
37, The custom models registry is missing the modeling_llama3_ir entry so
modeling_llama3_ir (which defines Llama3ForCausalLM and self-registers via
AutoModelForCausalLMFactory) isn't discoverable; fix by either adding
"modeling_llama3_ir": ["Llama3ForCausalLM"] to the _MODEL_MODULES dict in
__init__.py (so the importlib loop loads it like
modeling_deepseek/modeling_qwen3 entries) or restore the AD_USE_IR_MODELS
environment-gate around the registration/import of modeling_llama3_ir
(reintroduce the AD_USE_IR_MODELS check used by tests) so the module is
conditionally registered as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/ad-model-onboard/SKILL.md:
- Around line 503-509: The fenced code block that begins with the table header
"| Hunk lines (in IR file) | Summary of change | Category | Verdict |" is
missing a language tag which triggers MD040; update that fenced block by adding
a language identifier (e.g., "markdown") right after the opening backticks so
the block becomes ```markdown, ensuring the table renders and linters stop
flagging it; locate the block by searching for the exact table header text
within SKILL.md.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py`:
- Around line 1-4: This file is missing the required NVIDIA copyright header at
the top; add the standard NVIDIA source-file header (with updated year if this
is a modification) as the very first lines of
tensorrt_llm._torch.auto_deploy.models.custom.__init__ before any imports, then
keep the existing imports and _logger definition unchanged so symbols like
importlib, logging and _logger remain intact.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file
modeling_deepseek.py: locate the top-of-file copyright comment and change the
year range to include 2026 so the header reflects the file modification in 2026.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c69a2cab-991f-4250-b5a0-f23027751349

📥 Commits

Reviewing files that changed from the base of the PR and between dd907c0 and a5ced6c.

📒 Files selected for processing (9)
  • .claude/skills/ad-model-onboard/SKILL.md
  • tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_ir.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h_ir.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe_ir.py
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_ir.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h_ir.py

Comment thread .claude/skills/ad-model-onboard/SKILL.md Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #45593 [ run ] triggered by Bot. Commit: a5ced6c Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd

Collaborator

PR_Github #45594 [ run ] triggered by Bot. Commit: 46f9725 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #45594 [ run ] completed with state FAILURE. Commit: 46f9725
/LLM/main/L0_MergeRequest_PR pipeline #35810 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd

Collaborator

PR_Github #45730 [ run ] triggered by Bot. Commit: 46f9725 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #45730 [ run ] completed with state FAILURE. Commit: 46f9725
/LLM/main/L0_MergeRequest_PR pipeline #35927 (Partly Tested) completed with status: 'FAILURE'


Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #45765 [ run ] triggered by Bot. Commit: 46f9725 Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #45771 [ run ] triggered by Bot. Commit: 72cae82 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #45771 [ run ] completed with state FAILURE. Commit: 72cae82
/LLM/main/L0_MergeRequest_PR pipeline #35962 (Partly Tested) completed with status: 'FAILURE'


Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #46148 [ run ] triggered by Bot. Commit: c9b392d Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #46148 [ run ] completed with state SUCCESS. Commit: c9b392d
/LLM/main/L0_MergeRequest_PR pipeline #36274 (Partly Tested) completed with status: 'FAILURE'


Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@greg-kwasniewski1 greg-kwasniewski1 force-pushed the gk/switch_to_ir_models branch from c9b392d to 6f4dce2 Compare May 2, 2026 19:19
@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #46567 [ run ] triggered by Bot. Commit: 6f4dce2 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #46567 [ run ] completed with state FAILURE. Commit: 6f4dce2
/LLM/main/L0_MergeRequest_PR pipeline #36619 (Partly Tested) completed with status: 'FAILURE'


CI Agent Failure Analysis

Link to invocation

@greg-kwasniewski1

Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #46636 [ run ] triggered by Bot. Commit: 4ea9552 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #46636 [ run ] completed with state SUCCESS. Commit: 4ea9552
/LLM/main/L0_MergeRequest_PR pipeline #36680 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

…g the 4 IR-aware canonical files

Append enable_sharder_ir.yaml to yaml_extra for the 20 entries in
models.yaml whose model architecture is handled by one of the four
modeling files promoted to IR-canonical in this PR. Identified via
huggingface_hub.HfApi.model_info().config.architectures matching:

- DeepseekV3ForCausalLM     -> modeling_deepseek.py        (4 entries)
- NemotronHForCausalLM      -> modeling_nemotron_h.py       (5 entries)
- Qwen3ForCausalLM          -> modeling_qwen3.py           (9 entries)
- Qwen3_5MoeForConditional  -> modeling_qwen3_5_moe.py     (2 entries)

After this change, running build_and_run_ad.py against any of these 20
HF model names with --use-registry will resolve to a yaml_extra list
ending in enable_sharder_ir.yaml, which disables legacy detect_sharding
and enables apply_sharding_hints. The IR-aware modeling code is now
exercised end-to-end without requiring users to opt in explicitly.
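
The registry update described above can be sketched over plain dictionaries. The `yaml_extra` key and the overlay file name come from the commit message; the `name` key, the `arch_of` mapping (standing in for the HfApi architecture lookup), and the function itself are hypothetical. The fourth architecture name is truncated in the commit message, so it is left out of the example set rather than guessed.

```python
from typing import Dict, List, Set

IR_OVERLAY = "enable_sharder_ir.yaml"

# Architectures spelled out in full in the commit message above.
IR_ARCHITECTURES: Set[str] = {
    "DeepseekV3ForCausalLM",
    "NemotronHForCausalLM",
    "Qwen3ForCausalLM",
}


def enable_ir(entries: List[dict], arch_of: Dict[str, str], ir_archs: Set[str]) -> int:
    """Append the IR overlay to every matching entry; return how many entries changed."""
    changed = 0
    for entry in entries:
        extras = entry.setdefault("yaml_extra", [])
        if arch_of.get(entry["name"]) in ir_archs and IR_OVERLAY not in extras:
            extras.append(IR_OVERLAY)  # appended last, so the list ends in the overlay
            changed += 1
    return changed
```

The membership check makes the update idempotent, so running it twice over the same registry leaves the second pass with nothing to change.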

The two pre-existing -IR virtual entries (DeepSeek-R1-IR,
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8-IR) are kept as smoke-test entries:
they pair the IR overlay with the trimmed num_hidden_layers_5.yaml or
nano_v3.yaml configs, useful for fast validation without going through
the production sharding plan.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
…ormers 5.x

Two latent bugs pre-existed in the file (originally modeling_qwen3_ir.py
on upstream; renamed to canonical modeling_qwen3.py by this PR's switch
commit) and only surfaced once IR sharding became the default for Qwen3
text models (this PR's models.yaml flip):

1. config.rope_theta -> config.rope_scaling['rope_theta']

   In transformers 4.x, Qwen3Config.rope_theta sits at the top of the
   config object. In transformers 5.x, rope settings moved under
   config.rope_scaling (alongside rope_type). Reading config.rope_theta
   directly hits AttributeError when loading any Qwen3 model with the
   installed transformers 5.x runtime. Replaced with a defensive lookup
   that handles both layouts.

2. _tied_weights_keys = list -> dict

   transformers 5.x's PreTrainedModel.get_expanded_tied_weights_keys
   calls .keys() on _tied_weights_keys, which crashes when the class
   attribute is a list. Replaced with the dict form mapping
   {'lm_head.weight': 'model.embed_tokens.weight'} that matches what
   the analogous upstream commit (NVIDIA#12829) did for the other three
   IR-aware modeling files (deepseek, nemotron_h, qwen3_5_moe).
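
A minimal sketch of the two fixes, assuming only what the commit message states. The helper names and the `10000.0` fallback are hypothetical; the code actually committed is not shown in this thread.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Union


def get_rope_theta(config: Any, default: float = 10000.0) -> float:
    """Read rope_theta from either config layout described above."""
    theta = getattr(config, "rope_theta", None)       # top-level attribute layout
    if theta is not None:
        return float(theta)
    nested = getattr(config, "rope_scaling", None) or {}
    return float(nested.get("rope_theta", default))   # nested dict layout; fallback is an assumption


def as_tied_weights_dict(
    keys: Union[Dict[str, str], Iterable[str]],
    source: str = "model.embed_tokens.weight",
) -> Dict[str, str]:
    """Normalize a list-style tied-weights declaration to the dict form."""
    if isinstance(keys, dict):
        return dict(keys)
    return {key: source for key in keys}


# Example configs built with SimpleNamespace purely for illustration.
old_style = SimpleNamespace(rope_theta=1_000_000.0)
new_style = SimpleNamespace(rope_scaling={"rope_type": "default", "rope_theta": 500_000.0})
```

Calling `get_rope_theta` on either example object avoids the AttributeError, and `as_tied_weights_dict(["lm_head.weight"])` yields a mapping that supports `.keys()`.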

Validated end-to-end with build_and_run_ad.py against Qwen/Qwen3-4B
(ws=2, IR sharding default): 10/10 coherent prompts, apply_sharding_hints
matches=468 per rank.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
…kill

Extract the "Sharding-aware IR model porting" section from
ad-model-onboard into a dedicated ad-sharding-ir-port skill.
The two workflows serve different purposes (onboarding a new
model vs mechanically adding sharding hints to an existing one)
and benefit from isolation.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Sharding IR is now the default path — hints are added directly
to the canonical modeling_*.py instead of creating a separate
_ir.py copy. The legacy dual-file pattern is deprecated.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
…onboard SKILL.md to match main

This bullet was modified during the PR NVIDIA#13478 work to add a detailed
explanation about keeping the trtllm router kernels verbatim. That
verbose content belongs in a sharding-IR-specific doc, not in the
general ad-model-onboard skill. Restore the short upstream/main form.

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
@greg-kwasniewski1 greg-kwasniewski1 force-pushed the gk/switch_to_ir_models branch from 0730798 to 84471f6 Compare May 24, 2026 15:41
@greg-kwasniewski1

Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #50105 [ run ] triggered by Bot. Commit: 84471f6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #50105 [ run ] completed with state SUCCESS. Commit: 84471f6
/LLM/main/L0_MergeRequest_PR pipeline #39658 completed with status: 'FAILURE'


CI Agent Failure Analysis

Link to invocation

taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 25, 2026
…ariant

The sharding-IR ``modeling_gpt_oss_ir.py`` already covered every feature of
the non-IR legacy file (same RMSNorm / RoPE / attention with sinks / MoE
router / experts), differing only in:

* attention Linears go through ``torch.ops.auto_deploy.torch_linear_simple``
  with sharding hint kwargs so TP > 1 attention sharding works without an
  external graph rewrite;
* view ops on q/k/v/attn_out use ``torch.ops.auto_deploy.view`` with
  ``tp_scaled_dim=2`` so the head-count dim scales with TP;
* the post-attention all-reduce is expressed as a
  ``torch.ops.auto_deploy.all_reduce`` placeholder.
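
To illustrate what "the head-count dim scales with TP" means for a view shape, here is a toy shape calculation. It is a sketch of the idea only: the function is hypothetical, and the real op may treat edge cases (such as inferred `-1` dimensions or uneven splits) differently.

```python
from typing import Sequence, Tuple


def local_view_shape(shape: Sequence[int], tp_scaled_dim: int, world_size: int) -> Tuple[int, ...]:
    """Divide the dimension at `tp_scaled_dim` by the TP world size; leave the rest unchanged."""
    size = shape[tp_scaled_dim]
    if size % world_size != 0:
        raise ValueError(f"dim {tp_scaled_dim} of size {size} is not divisible by {world_size}")
    local = list(shape)
    local[tp_scaled_dim] = size // world_size
    return tuple(local)
```

For a `(batch, seq, heads, head_dim)` view of `(1, 128, 64, 64)` with `tp_scaled_dim=2` and two ranks, each rank would view its slice as `(1, 128, 32, 64)`.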

Consolidate: rename ``modeling_gpt_oss_ir.py`` into the default
``modeling_gpt_oss.py`` (the legacy non-IR variant is removed) and drop
the ``AD_USE_IR_MODELS`` opt-in entry for gpt-oss in
``models/custom/__init__.py``. This matches the trajectory in upstream
PR NVIDIA#13478 (other models being migrated to sharding-IR as default).

GSM8K full 1319-sample validation, gpt-oss-120b @ TP=2 (post-rebase,
W4A8 mxfp8 activations):

  Pre-IR (legacy modeling):  88.55 % (±0.88), 992 s
  Post-IR (sharding-IR):     88.55 % (±0.88), 902 s
  Reference (PT):            90.30 %

Accuracy is identical to the post-rebase TP=2 baseline; total run-time
is ~9 % faster. The existing ``quantize_mxfp4_moe_trtllm_gen`` post-load
transform continues to handle MXFP4 weight prep and op retargeting on
top of the IR modeling.
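
The "~9 % faster" figure follows directly from the two run times reported above; a short check:

```python
def relative_speedup(before_s: float, after_s: float) -> float:
    """Percentage reduction in wall-clock time going from `before_s` to `after_s`."""
    return 100.0 * (before_s - after_s) / before_s


# 992 s (legacy modeling) versus 902 s (sharding-IR), as reported above
print(f"{relative_speedup(992, 902):.1f} %")  # prints 9.1 %
```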

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
@greg-kwasniewski1

Collaborator Author

/bot skip --comment "AutoDeploy stages passed in L0 #39658; only failure is an unrelated EADDRINUSE flake in test_wan_pipeline_parallel.py (visual_gen multi-GPU)."

@tensorrt-cicd

Collaborator

PR_Github #50205 [ skip ] triggered by Bot. Commit: 84471f6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #50205 [ skip ] completed with state SUCCESS. Commit: 84471f6
Skipping testing for commit 84471f6

Link to invocation

@greg-kwasniewski1 greg-kwasniewski1 merged commit 92c5030 into NVIDIA:main May 25, 2026
7 checks passed
KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 26, 2026
…harding-IR canonical models (NVIDIA#13478)

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
…harding-IR canonical models (NVIDIA#13478)

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
…ariant

The sharding-IR ``modeling_gpt_oss_ir.py`` already covered every feature of
the non-IR legacy file (same RMSNorm / RoPE / attention with sinks / MoE
router / experts), differing only in:

* attention Linears go through ``torch.ops.auto_deploy.torch_linear_simple``
  with sharding hint kwargs so TP > 1 attention sharding works without an
  external graph rewrite;
* view ops on q/k/v/attn_out use ``torch.ops.auto_deploy.view`` with
  ``tp_scaled_dim=2`` so the head-count dim scales with TP;
* the post-attention all-reduce is expressed as a
  ``torch.ops.auto_deploy.all_reduce`` placeholder.

Consolidate: rename ``modeling_gpt_oss_ir.py`` into the default
``modeling_gpt_oss.py`` (the legacy non-IR variant is removed) and drop
the ``AD_USE_IR_MODELS`` opt-in entry for gpt-oss in
``models/custom/__init__.py``. This matches the trajectory in upstream
PR NVIDIA#13478 (other models being migrated to sharding-IR as default).

GSM8K full 1319-sample validation, gpt-oss-120b @ TP=2 (post-rebase,
W4A8 mxfp8 activations):

  Pre-IR (legacy modeling):  88.55 % (±0.88), 992 s
  Post-IR (sharding-IR):     88.55 % (±0.88), 902 s
  Reference (PT):            90.30 %

Accuracy is identical to the post-rebase TP=2 baseline; total run-time
is ~9 % faster. The existing ``quantize_mxfp4_moe_trtllm_gen`` post-load
transform continues to handle MXFP4 weight prep and op retargeting on
top of the IR modeling.
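
The figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes the ± value is a binomial standard error over the 1319 samples (the message does not say how it was computed) and that "faster" means the relative reduction in total run-time.

```python
import math

n_samples = 1319
accuracy = 0.8855        # 88.55 %, identical for both variants
legacy_seconds = 992.0   # Pre-IR run-time
ir_seconds = 902.0       # Post-IR run-time

# Assumed formula: binomial standard error of a proportion.
stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# Relative reduction in total run-time.
reduction = (legacy_seconds - ir_seconds) / legacy_seconds

print(f"stderr    = {stderr * 100:.2f} %")    # 0.88 %
print(f"reduction = {reduction * 100:.1f} %")  # 9.1 %
```

Under these assumptions both reported numbers are reproduced: a standard error of about 0.88 points and a run-time reduction of roughly 9 %.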

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
greg-kwasniewski1 added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 30, 2026
…orted model families

Add 'enable_sharder_ir.yaml' to the yaml_extra of every model_registry/models.yaml entry whose config class maps to one of the 13 modeling files ported in this PR (llama3, mistral, granite, hunyuan_dense, seed_oss, gemma2, cohere, exaone, gemma, deepseek_v2, glm4_moe, qwen3_moe, llama4).

Without this, those entries continue to run through the legacy detect_sharding pipeline. detect_sharding does recognize torch_linear_simple via is_any_lin_op, so it still shards correctly, but the explicit sharding-hint kwargs on the IR-ported ops are dormant metadata under that path. This commit makes the registry actually exercise the apply_sharding_hints transform on those models, matching the opt-in pattern PR NVIDIA#13478 used for the first four families (qwen3, qwen3_5_moe, deepseek, nemotron_h).

70 entries updated. Excluded by design (these stay on the legacy path because they get no override):

- 'multimodal.yaml' entries: vision/audio variants take a separate path.

- 'simple_shard_only.yaml' entries (Llama-3_1-Nemotron-51B, -Ultra-253B (and -FP8), Qwen3-30B-A3B, Qwen3-235B-A22B): explicit opt-outs of the regular sharding pass.
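
The selection rule described above can be sketched as follows. This is a minimal illustration, not the script used for the commit: the registry is modeled as a list of plain dicts, the ``family`` field and the entry names are hypothetical, and only ``yaml_extra`` and the three YAML file names come from the message.

```python
from typing import Dict, List

IR_YAML = "enable_sharder_ir.yaml"
EXCLUDED_YAMLS = {"multimodal.yaml", "simple_shard_only.yaml"}
PORTED_FAMILIES = {
    "llama3", "mistral", "granite", "hunyuan_dense", "seed_oss", "gemma2",
    "cohere", "exaone", "gemma", "deepseek_v2", "glm4_moe", "qwen3_moe", "llama4",
}


def enable_sharder_ir(entries: List[Dict]) -> int:
    """Append IR_YAML to eligible entries in place; return how many changed."""
    updated = 0
    for entry in entries:
        # "family" is a hypothetical stand-in for the config-class mapping.
        if entry.get("family") not in PORTED_FAMILIES:
            continue
        extras = entry.setdefault("yaml_extra", [])
        # Skip explicit opt-outs and entries that already opted in.
        if EXCLUDED_YAMLS.intersection(extras) or IR_YAML in extras:
            continue
        extras.append(IR_YAML)
        updated += 1
    return updated


# Made-up registry entries for illustration.
registry = [
    {"name": "example/dense", "family": "llama3", "yaml_extra": []},
    {"name": "example/vision", "family": "llama4", "yaml_extra": ["multimodal.yaml"]},
    {"name": "example/opt-out", "family": "qwen3_moe", "yaml_extra": ["simple_shard_only.yaml"]},
    {"name": "example/unported", "family": "other", "yaml_extra": []},
]
num_updated = enable_sharder_ir(registry)
print(num_updated)  # 1
```

Running the function a second time changes nothing, since entries that already list the IR YAML are skipped.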

Refs NVIDIA#14642

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 31, 2026
…ariant

taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 2, 2026
…ariant

taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 4, 2026
…ariant

taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 8, 2026
…ariant

Development

Successfully merging this pull request may close these issues.

[AutoDeploy][Feature]: Switch nemotron-H, qwen3, qwen3.5-moe, deepseek to sharding IR (step 2)
