[TRTLLM-13429][feat] Switch DeepSeek/NemotronH/Qwen3/Qwen3.5-MoE to sharding-IR canonical models #13478
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
📝 Walkthrough
This change consolidates four model families from dual-file implementations to IR-fied single files. The non-IR implementations gain sharding-aware custom ops carrying explicit sharding metadata, the parallel
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.claude/skills/ad-model-onboard/SKILL.md (1)
556-559: ⚠️ Potential issue | 🟠 Major
Resolve conflicting guidance on `torch.ops.trtllm.*` usage.
Line 556 requires preserving specific `torch.ops.trtllm.*` router ops, but Line 558 says to never use `trtllm_*`. This contradiction can lead to incorrect ports. Please carve out an explicit exception in the “Key Gotchas” rule for the noaux_tc/dsv3 router case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/ad-model-onboard/SKILL.md around lines 556 - 559, The "Key Gotchas" rule contradicts earlier guidance about preserving specific router ops; add an explicit exception stating that for noaux_tc / DeepSeek-V3 style routers you must keep torch.ops.trtllm.noaux_tc_op and torch.ops.trtllm.dsv3_router_gemm_op exactly as in the non-IR base (router gate is TP-REPLICATED, no sharding hints), while retaining the general prohibition on trtllm_* in AutoDeploy (i.e., only allow these two trtllm ops when the source model uses them verbatim and do not introduce vanilla replacements), and update the SKILL.md section text to reflect this exception so both rules are consistent.
tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py (1)
9-37: ⚠️ Potential issue | 🔴 Critical
Add `modeling_llama3_ir` to `_MODEL_MODULES` or restore the `AD_USE_IR_MODELS` gate.
The file `modeling_llama3_ir.py` exists but is not discoverable through the custom models `__init__.py`: it's neither listed in `_MODEL_MODULES` nor guarded by the `AD_USE_IR_MODELS` environment variable that tests still set. The module self-registers via `AutoModelForCausalLMFactory`, but removal of the environment-gated registration breaks the intended conditional loading mechanism referenced in the tests.
Either add `"modeling_llama3_ir": ["Llama3ForCausalLM"]` to `_MODEL_MODULES` (consistent with other converted IR variants like deepseek, nemotron_h, qwen3, and qwen3_5_moe), or restore the conditional gate if IR-only loading is required.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` around lines 9 - 37, The custom models registry is missing the modeling_llama3_ir entry so modeling_llama3_ir (which defines Llama3ForCausalLM and self-registers via AutoModelForCausalLMFactory) isn't discoverable; fix by either adding "modeling_llama3_ir": ["Llama3ForCausalLM"] to the _MODEL_MODULES dict in __init__.py (so the importlib loop loads it like modeling_deepseek/modeling_qwen3 entries) or restore the AD_USE_IR_MODELS environment-gate around the registration/import of modeling_llama3_ir (reintroduce the AD_USE_IR_MODELS check used by tests) so the module is conditionally registered as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/ad-model-onboard/SKILL.md:
- Around line 503-509: The fenced code block that begins with the table header
"| Hunk lines (in IR file) | Summary of change | Category | Verdict |" is
missing a language tag which triggers MD040; update that fenced block by adding
a language identifier (e.g., "markdown") right after the opening backticks so
the block becomes ```markdown, ensuring the table renders and linters stop
flagging it; locate the block by searching for the exact table header text
within SKILL.md.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py`:
- Around line 1-4: This file is missing the required NVIDIA copyright header at
the top; add the standard NVIDIA source-file header (with updated year if this
is a modification) as the very first lines of
tensorrt_llm._torch.auto_deploy.models.custom.__init__ before any imports, then
keep the existing imports and _logger definition unchanged so symbols like
importlib, logging and _logger remain intact.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file
modeling_deepseek.py: locate the top-of-file copyright comment and change the
year range to include 2026 so the header reflects the file modification in 2026.
📒 Files selected for processing (9)
- .claude/skills/ad-model-onboard/SKILL.md
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_ir.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h_ir.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe_ir.py
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_ir.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_h_ir.py
|
PR_Github #45593 [ run ] triggered by Bot. Commit: |
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #45594 [ run ] triggered by Bot. Commit: |
|
PR_Github #45594 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #45730 [ run ] triggered by Bot. Commit: |
|
PR_Github #45730 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #45765 [ run ] triggered by Bot. Commit: |
46f9725 to
72cae82
Compare
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #45771 [ run ] triggered by Bot. Commit: |
|
PR_Github #45771 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #46148 [ run ] triggered by Bot. Commit: |
|
PR_Github #46148 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
c9b392d to
6f4dce2
Compare
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #46567 [ run ] triggered by Bot. Commit: |
|
PR_Github #46567 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #46636 [ run ] triggered by Bot. Commit: |
|
PR_Github #46636 [ run ] completed with state |
…g the 4 IR-aware canonical files
Append enable_sharder_ir.yaml to yaml_extra for the 20 entries in models.yaml
whose model architecture is handled by one of the four modeling files promoted
to IR-canonical in this PR. Identified via
huggingface_hub.HfApi.model_info().config.architectures matching:
- DeepseekV3ForCausalLM -> modeling_deepseek.py (4 entries)
- NemotronHForCausalLM -> modeling_nemotron_h.py (5 entries)
- Qwen3ForCausalLM -> modeling_qwen3.py (9 entries)
- Qwen3_5MoeForConditional -> modeling_qwen3_5_moe.py (2 entries)
After this change, running build_and_run_ad.py against any of these 20 HF model
names with --use-registry will resolve to a yaml_extra list ending in
enable_sharder_ir.yaml, which disables legacy detect_sharding and enables
apply_sharding_hints. The IR-aware modeling code is now exercised end-to-end
without requiring users to opt in explicitly.
The two pre-existing -IR virtual entries (DeepSeek-R1-IR,
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8-IR) are kept as smoke-test entries: they
pair the IR overlay with the trimmed num_hidden_layers_5.yaml or nano_v3.yaml
configs, useful for fast validation without going through the production
sharding plan.
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
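The registry update described in this commit can be pictured with a small sketch. It assumes each registry entry is a dict with `name` and `yaml_extra` keys and that the architecture lookup has already been done (the `arch_by_model` mapping stands in for the `huggingface_hub` query); the real `models.yaml` schema and the script used in the PR may differ.

```python
from typing import Dict, List

IR_OVERLAY = "enable_sharder_ir.yaml"

# Architectures named in the commit message. The fourth name appears
# truncated there, so it is left out of this sketch.
IR_ARCHITECTURES = {
    "DeepseekV3ForCausalLM",
    "NemotronHForCausalLM",
    "Qwen3ForCausalLM",
}


def add_ir_overlay(
    entries: List[dict], arch_by_model: Dict[str, List[str]]
) -> List[dict]:
    """Return copies of the entries with the IR overlay appended where it applies."""
    updated = []
    for entry in entries:
        archs = arch_by_model.get(entry["name"], [])
        extra = list(entry.get("yaml_extra", []))
        # Append last so the overlay wins over earlier YAML files,
        # and only once so the operation is idempotent.
        if IR_ARCHITECTURES.intersection(archs) and IR_OVERLAY not in extra:
            extra.append(IR_OVERLAY)
        updated.append({**entry, "yaml_extra": extra})
    return updated
```

Appending at the end matches the commit's wording that the resolved `yaml_extra` list ends in `enable_sharder_ir.yaml`.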
…ormers 5.x
Two latent bugs pre-existed in the file (originally modeling_qwen3_ir.py
on upstream; renamed to canonical modeling_qwen3.py by this PR's switch
commit) and only surfaced once IR sharding became the default for Qwen3
text models (this PR's models.yaml flip):
1. config.rope_theta -> config.rope_scaling['rope_theta']
In transformers 4.x, Qwen3Config.rope_theta sits at the top of the
config object. In transformers 5.x, rope settings moved under
config.rope_scaling (alongside rope_type). Reading config.rope_theta
directly hits AttributeError when loading any Qwen3 model with the
installed transformers 5.x runtime. Replaced with a defensive lookup
that handles both layouts.
2. _tied_weights_keys = list -> dict
transformers 5.x's PreTrainedModel.get_expanded_tied_weights_keys
calls .keys() on _tied_weights_keys, which crashes when the class
attribute is a list. Replaced with the dict form mapping
{'lm_head.weight': 'model.embed_tokens.weight'} that matches what
the analogous upstream commit (NVIDIA#12829) did for the other three
IR-aware modeling files (deepseek, nemotron_h, qwen3_5_moe).
Validated end-to-end with build_and_run_ad.py against Qwen/Qwen3-4B
(ws=2, IR sharding default): 10/10 coherent prompts, apply_sharding_hints
matches=468 per rank.
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
…kill Extract the "Sharding-aware IR model porting" section from ad-model-onboard into a dedicated ad-sharding-ir-port skill. The two workflows serve different purposes (onboarding a new model vs mechanically adding sharding hints to an existing one) and benefit from isolation. Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Sharding IR is now the default path — hints are added directly to the canonical modeling_*.py instead of creating a separate _ir.py copy. The legacy dual-file pattern is deprecated. Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
…onboard SKILL.md to match main This bullet was modified during the PR NVIDIA#13478 work to add a detailed explanation about keeping the trtllm router kernels verbatim. That verbose content belongs in a sharding-IR-specific doc, not in the general ad-model-onboard skill. Restore the short upstream/main form. Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
0730798 to
84471f6
Compare
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #50105 [ run ] triggered by Bot. Commit: |
|
PR_Github #50105 [ run ] completed with state
|
…ariant
The sharding-IR ``modeling_gpt_oss_ir.py`` already covered every feature of the
non-IR legacy file (same RMSNorm / RoPE / attention with sinks / MoE router /
experts), differing only in:
* attention Linears go through ``torch.ops.auto_deploy.torch_linear_simple``
  with sharding hint kwargs so TP > 1 attention sharding works without an
  external graph rewrite;
* view ops on q/k/v/attn_out use ``torch.ops.auto_deploy.view`` with
  ``tp_scaled_dim=2`` so the head-count dim scales with TP;
* the post-attention all-reduce is expressed as a
  ``torch.ops.auto_deploy.all_reduce`` placeholder.
Consolidate: rename ``modeling_gpt_oss_ir.py`` into the default
``modeling_gpt_oss.py`` (the legacy non-IR variant is removed) and drop the
``AD_USE_IR_MODELS`` opt-in entry for gpt-oss in ``models/custom/__init__.py``.
This matches the trajectory in upstream PR NVIDIA#13478 (other models being
migrated to sharding-IR as default).
GSM8K full 1319-sample validation, gpt-oss-120b @ TP=2 (post-rebase, W4A8 mxfp8
activations):
  Pre-IR (legacy modeling): 88.55 % (±0.88), 992 s
  Post-IR (sharding-IR): 88.55 % (±0.88), 902 s
  Reference (PT): 90.30 %
Accuracy is identical to the post-rebase TP=2 baseline; total run-time is ~9 %
faster. The existing ``quantize_mxfp4_moe_trtllm_gen`` post-load transform
continues to handle MXFP4 weight prep and op retargeting on top of the IR
modeling.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
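To make the ``tp_scaled_dim=2`` remark concrete, the sketch below shows only the shape arithmetic the commit message implies: one dimension of the view shape is divided by the tensor-parallel size. The function `scale_view_shape` is hypothetical and is not the implementation of ``torch.ops.auto_deploy.view``; the example shape is likewise an assumption.

```python
from typing import Sequence, Tuple


def scale_view_shape(
    shape: Sequence[int], tp_scaled_dim: int, tp_size: int
) -> Tuple[int, ...]:
    """Divide one dimension of a view shape by the tensor-parallel size."""
    dims = list(shape)
    # Inferred (-1) dims are not handled in this sketch; they fail the check.
    if dims[tp_scaled_dim] <= 0 or dims[tp_scaled_dim] % tp_size != 0:
        raise ValueError(
            f"dim {tp_scaled_dim} of size {dims[tp_scaled_dim]} "
            f"is not divisible by tp_size={tp_size}"
        )
    dims[tp_scaled_dim] //= tp_size
    return tuple(dims)


# Assumed layout (batch, seq, num_heads, head_dim): with TP=2 each rank
# keeps half of the heads, and every other dimension is unchanged.
local_shape = scale_view_shape((1, 128, 64, 64), tp_scaled_dim=2, tp_size=2)
assert local_shape == (1, 128, 32, 64)
```

Carrying the dimension index on the op means the sharder does not have to infer which axis holds the head count from the surrounding graph.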
|
/bot skip --comment "AutoDeploy stages passed in L0 #39658; only failure is an unrelated EADDRINUSE flake in test_wan_pipeline_parallel.py (visual_gen multi-GPU)." |
|
PR_Github #50205 [ skip ] triggered by Bot. Commit: |
|
PR_Github #50205 [ skip ] completed with state |
…harding-IR canonical models (NVIDIA#13478) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
…orted model families
Add 'enable_sharder_ir.yaml' to the yaml_extra of every
model_registry/models.yaml entry whose config class maps to one of the 13
modeling files ported in this PR (llama3, mistral, granite, hunyuan_dense,
seed_oss, gemma2, cohere, exaone, gemma, deepseek_v2, glm4_moe, qwen3_moe,
llama4).
Without this, those entries continue to run through the legacy detect_sharding
pipeline. detect_sharding does recognize torch_linear_simple via is_any_lin_op,
so it still shards correctly, but the explicit sharding-hint kwargs on the
IR-ported ops are dormant metadata under that path. This commit makes the
registry actually exercise the apply_sharding_hints transform on those models,
matching the opt-in pattern PR NVIDIA#13478 used for the first four families
(qwen3, qwen3_5_moe, deepseek, nemotron_h).
70 entries updated.
Excluded by design (still on legacy via no override):
- 'multimodal.yaml' entries: vision/audio variants take a separate path.
- 'simple_shard_only.yaml' entries: Llama-3_1-Nemotron-51B, -Ultra-253B (and
  -FP8), Qwen3-30B-A3B, Qwen3-235B-A22B - explicit opt-outs of the regular
  sharding pass.
Refs NVIDIA#14642
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Summary
Fixes #13429.
Replace the legacy non-IR `modeling_*.py` for four model architectures with their sharding-IR variants. The IR-aware implementation (using `apply_sharding_hints` for TP/EP/BMM via canonical-op kwargs like `tp_mode`, `layer_type`, `output_sizes`, `tp_min_local_shape`, `tp_scaled_dim`, `shardable`) is now the canonical implementation. The `AD_USE_IR_MODELS` env-var gate is removed.
File swap
`git rm modeling_X.py && git mv modeling_X_ir.py modeling_X.py` for:
- `modeling_deepseek.py`: DeepSeek V3, MLA + MoE
- `modeling_nemotron_h.py`: NemotronH, hybrid Mamba + MoE
- `modeling_qwen3.py`: Qwen3 dense (was previously only available as `_ir`; now registered in `__init__.py`)
- `modeling_qwen3_5_moe.py`: Qwen3.5 MoE, GatedDeltaNet + Gated MHA + MoE
Continuation of #12419 (sharding-IR introduction) and #13272 (post-#12419 cleanup), per the staged plan to migrate model implementations to the hint-driven sharder one cohort at a time.
Skill update for future regenerations
`.claude/skills/ad-model-onboard/SKILL.md`:
- `noaux_tc` "Key Gotcha" bullet: was telling agents to substitute fused trtllm calls with vanilla PyTorch; corrected to KEEP `torch.ops.trtllm.{noaux_tc_op, dsv3_router_gemm_op}` verbatim (no AD transform recovers vanilla → fused; vanilla replacement is ~17x more kernel launches and loses the FP8-friendly router GEMM).
- `diff -u` and classify every hunk before reporting done.
- `position_ids` rule now carries an "IR-port exception" pointing at the contract.
Audit summary
Each of the four IR files was regenerated from its non-IR base by parallel `generalPurpose` subagents using the updated skill, then audited at the parent level. Every hunk classifies as A1-A5 (allowed). Cross-checks: same `class` count, same `nn.Parameter` / `register_buffer` / `register_load_state_dict_pre_hook` count, same `def forward(` count, both `torch.ops.trtllm.*` call sites in `DeepSeekV3MoEGate` and `NemotronHTopkRouter` preserved verbatim (the `noaux_tc` paths are the original fused kernels), and the `position_ids` contract preserved per architecture.
Test plan
- `small/fp8/deepseek/r1.yaml` with the IR sharder + `world_size=2` on 2x H100: exit=0, 51 sharding nodes, 10 prompts generated cleanly. (`AD_USE_IR_MODELS=1` no longer needed since IR is canonical.)
- `ci-guidelines.md`), will trigger via `/bot run --stage-list ...` after PR opens.
- `/bot run` once the targeted run is green.
Refs: #12419 (sharding-IR introduction), #13271 (post-#12419 cleanup feature), #13272 (post-#12419 cleanup PR), #13429 (this feature's tracking issue).
Made with Cursor