Skip to content

Add cvt_f16x2_f32 intrinsic for f32-to-f16x2 packing#66

Open
honeyspoon wants to merge 1 commit into
NVlabs:mainfrom
honeyspoon:upstream-cvt-f16x2
Open

Add cvt_f16x2_f32 intrinsic for f32-to-f16x2 packing#66
honeyspoon wants to merge 1 commit into
NVlabs:mainfrom
honeyspoon:upstream-cvt-f16x2

Conversation

@honeyspoon
Copy link
Copy Markdown

Summary

Add cvt_f16x2_f32 — packs two f32 values into a single u32 containing two packed f16 values in a single PTX instruction.

Maps to PTX: cvt.rn.f16x2.f32 d, hi, lo;

CUDA C equivalent: __float2half2_rn()

Motivation

WMMA GEMM epilogues accumulate in f32 but need to write half-precision output. The scalar approach requires two separate f32→f16 conversions plus bit manipulation ((lo_bits) | (hi_bits << 16)). This intrinsic compiles to a single PTX instruction that does both conversions and the pack.

Implementation

Follows the same 4-crate pattern as the existing CvtF32x2Bf16x2Op (bfloat16 variant in tcgen05.rs):

  • cuda-device/src/convert.rs: cvt_f16x2_f32(lo: f32, hi: f32) -> u32 stub
  • dialect-nvvm/src/ops/convert.rs: CvtF16x2F32Op — 2 f32 operands, 1 i32 result
  • mir-importer/.../intrinsics/convert.rs: Standard 2-arg emit with result storage via emit_store_result_and_goto
  • mir-lower/.../intrinsics/convert.rs: llvm.inline_asm with constraint string "=r,f,f"

The lowering is structurally identical to the existing convert_cvt_f32x2_bf16x2 in crates/mir-lower/src/convert/intrinsics/tcgen05.rs (lines 510–539), differing only in the PTX instruction name.

Testing

cargo fmt --check           # pass
cargo clippy -p cuda-device  # pass (workspace clippy blocked by hi_sparse_bitset MSRV on rustc 1.93)

Validated in a WMMA GEMM kernel on SM_89: f32 accumulator → packed f16x2 output with correct values (cosine similarity 1.000000 vs scalar reference).

Signed-off-by: abder <bobmatt911@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant