Jj/grouped mm quantization math runtime #5687
Conversation
!test

Review updated until commit 090e395
Auto-merge Status: ✅ Internal CI is finished

Description

| Relevant files |
|---|
| Enhancement |
| Tests |
PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- 🔒 No security concerns identified
- ⚡ Recommended focus areas for review: Numerical Accuracy Concerns
Test failures

- (High, 44) NCCL NVLink SHARP (NVLS) multicast memory binding errors in multidevice distributed / nvFuser test suites on the dlcluster_viking_ci runner. Failing tests on H100 (dist.):
  - tests.python.multidevice.test_communication.test_allgather ❌
  - tests.python.multidevice.test_communication.test_allgather_expanded_broadcast ❌
  - tests.python.multidevice.test_communication.test_allreduce ❌
  - tests.python.multidevice.test_communication.test_reduce_scatter ❌
  - tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous ❌
  - tests.python.multidevice.test_dtensor.test_column_parallel_linear ❌
  - tests.python.multidevice.test_dtensor.test_plus_one ❌
  - tests.python.multidevice.test_dtensor.test_row_parallel_linear ❌
  - tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine ❌
  - tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm ❌
  - ... with 34 more test failures omitted. Check internal logs.
- (Medium, 1) NCCL invalid usage in multidevice overlap allgather test (tests/python/multidevice):
  - tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda] ❌
```cpp
if constexpr (USE_GLOBAL_SCALE) {
  scaled_max = global_scale[0] / scaled_max;
} else {
  scaled_max = 1.0 / scaled_max;
}
```
This should be the right math, but it looks like it's causing C++ test errors. I'll double check that.
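For intuition, here is a minimal NumPy sketch of what the branch above appears to compute: a per-block amax is mapped onto the FP4 dynamic range, then replaced with its reciprocal, optionally folding in a global scale. The block size (16), `FP4_MAX` (6.0, the e2m1 maximum), and the 448.0 global scale (the fp8-e4m3 maximum) are assumptions based on common NVFP4 conventions, not taken from this kernel.

```python
# A rough sketch of the scale math above, under assumed NVFP4 conventions
# (block size 16, FP4 max magnitude 6.0). Not the kernel's actual code.
import numpy as np

FP4_MAX = 6.0  # assumed: largest magnitude representable in e2m1

def block_scale_recip(block: np.ndarray, global_scale: float | None = None) -> float:
    # what scaled_max plausibly holds before the branch:
    # the per-block amax mapped onto the FP4 range
    scaled_max = np.abs(block).max() / FP4_MAX
    # the branch above replaces scaled_max with its (optionally
    # globally scaled) reciprocal, which then multiplies the elements
    if global_scale is not None:   # USE_GLOBAL_SCALE == true
        return global_scale / scaled_max
    else:                          # USE_GLOBAL_SCALE == false
        return 1.0 / scaled_max

# usage: size one 16-element block for FP4 encoding
block = np.random.randn(16).astype(np.float32)
recip = block_scale_recip(block, global_scale=448.0)  # 448.0: assumed fp8-e4m3 max
q = block * recip
```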
I should re-evaluate the test after #5696.
Greptile Summary

This PR replaces the decomposed block quantization implementation with the single `nv_block_quantize` op.

Key Changes:
Critical Issues:
Concerns:

Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant Test as Test Code
participant NVFuser as nvFuser Fusion
participant BlockQuant as nv_block_quantize Op
participant Layout as preprocess_grouped_matmul_input_sf
participant GroupedMM as cutlass_nvfp4_grouped_mm
participant Reference as Decomposed Reference
Note over Test: Prepare input matrices
Test->>Test: Create mat1 (m, k) float32
Test->>Test: Create mat2 (g, n, k) float32
Test->>Test: Quantize mat2 to FP4 with block scales
Note over NVFuser: nvFuser path (new)
Test->>NVFuser: Execute fusion with mat1
NVFuser->>BlockQuant: nv_block_quantize(mat1)
BlockQuant-->>NVFuser: (fp4_mat1, fp8_scale1)
NVFuser->>Layout: preprocess_grouped_matmul_input_sf(fp8_scale1)
Layout-->>NVFuser: layout_fp8_scale1
NVFuser->>GroupedMM: cutlass_nvfp4_grouped_mm(fp4_mat1, mat2, scales...)
GroupedMM-->>NVFuser: output (bfloat16)
NVFuser-->>Test: nvfuser_output
Note over Reference: Reference path (decomposed)
Test->>Reference: activation_scale_to_nvfp4(mat1)
Reference-->>Test: (mat1_fp4, scale1)
loop For each expert group
Test->>Reference: torch._scaled_mm(mat1_fp4[slice], mat2[i], scales...)
Reference-->>Test: output[slice]
Test->>Test: Multiply by mat2_gs[i]
end
Note over Test: Validation
Test->>Test: Compare abs_diff = |nvfuser_output - reference|
Test->>Test: Assert max_diff <= 10.0
Test->>Test: Assert large_diff_ratio < 10%
```
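For reference, here is a minimal Python sketch of the decomposed reference path in the diagram above: one `torch._scaled_mm` call per expert group, followed by the per-group global scale. The shapes, the `offsets` bookkeeping, and the scale arguments are assumptions inferred from the diagram's labels; the actual test helpers may differ.

```python
# A sketch of the diagram's reference loop, assuming inputs are already
# quantized (mat1_fp4: (m, k), mat2_fp4: (g, n, k)) and that `offsets`
# holds the row boundaries of each expert group. Scale handling is
# simplified; the real nvfp4 block-scale layout is more involved.
import torch

def reference_grouped_mm(mat1_fp4, scale1, mat2_fp4, scale2, mat2_gs, offsets):
    m, n = mat1_fp4.shape[0], mat2_fp4.shape[1]
    out = torch.empty(m, n, dtype=torch.bfloat16, device=mat1_fp4.device)
    for i in range(len(offsets) - 1):      # one scaled matmul per expert group
        rows = slice(offsets[i], offsets[i + 1])
        out[rows] = torch._scaled_mm(
            mat1_fp4[rows],                # (rows, k) quantized activations
            mat2_fp4[i].transpose(0, 1),   # (k, n) quantized weights
            scale1[rows],                  # activation block scales
            scale2[i],                     # weight block scales
            out_dtype=torch.bfloat16,
        )
        out[rows] *= mat2_gs[i]            # apply per-group global scale
    return out
```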
2 files reviewed, 2 comments
This looks good to me. Do you want to rebase on #5696 and test out the accuracy? Then we can land both these PRs.
Err, I actually still won't be able to improve the reference result from this, since the TE quantization doesn't allow me to skip the global scaling factor (for …). I might have to stick with the validation. 😢
!test
2 files reviewed, 3 comments
protonu left a comment:
LGTM!
Failing NCCL error seems to be unrelated. Merging as-is.
@protonu I could use an approving stamp to merge.
Note: I'm pretty nervous about the numerics of the block quantization op. I left the check here as relaxed as the existing Python ops, since I can't get it to perform close to what the decomposed version does.
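For concreteness, a sketch of the relaxed check implied by the sequence diagram (max absolute diff ≤ 10.0, and fewer than 10% of elements with a "large" diff). The cutoff defining a "large" diff isn't visible on this page, so `large_cutoff` below is a placeholder.

```python
# A sketch of the relaxed validation from the sequence diagram. The 10.0
# and 10% thresholds come from the diagram; `large_cutoff` is a placeholder
# since the actual cutoff isn't shown here.
import torch

def check_relaxed(nvfuser_output: torch.Tensor, reference: torch.Tensor,
                  large_cutoff: float = 1.0) -> None:
    abs_diff = (nvfuser_output.float() - reference.float()).abs()
    assert abs_diff.max().item() <= 10.0, "max abs diff exceeds 10.0"
    large_diff_ratio = (abs_diff > large_cutoff).float().mean().item()
    assert large_diff_ratio < 0.10, (
        f"{large_diff_ratio:.1%} of elements have a large diff")
```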