Add floating point variable types: Float16, Float8, BFloat16, TensorFloat32, Float6, Float4 (ESROGUE-730)#1172

Merged: ruck314 merged 40 commits into pre-release from ESROGUE-730 on Apr 13, 2026
@ruck314 ruck314 commented Apr 4, 2026

Summary

Extends Rogue's variable model system with six new floating point types for NVIDIA GPU interoperability.

New types

| Type | Format | Bits | Value Range | NVIDIA Support |
|------|--------|------|-------------|----------------|
| pr.Float16 / pr.Float16BE | IEEE 754 half-precision, 1s/5e/10m | 16 | ±65504.0 | All |
| pr.Float8 / pr.Float8BE | E4M3, NaN=0x7F, no Inf | 8 | ±448.0 | Hopper, Blackwell |
| pr.BFloat16 / pr.BFloat16BE | 1s/8e/7m, IEEE exponent range | 16 | ±3.39e38 | Ampere, Hopper, Blackwell |
| pr.TensorFloat32 / pr.TensorFloat32BE | 1s/8e/10m, 19 bits in 4 bytes | 19 (32 storage) | ±3.40e38 | Ampere, Hopper, Blackwell |
| pr.Float6 / pr.Float6BE | E3M2, no NaN/Inf | 6 | ±28.0 | Blackwell |
| pr.Float4 / pr.Float4BE | E2M1, no NaN/Inf | 4 | ±6.0 | Blackwell |
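
The Value Range column can be cross-checked from the bit layouts alone. The sketch below is illustrative and not part of the PR: `max_finite` and `FORMATS` are hypothetical names, and the biases for Float16 (15), Float8 E4M3 (7) and BFloat16/TensorFloat32 (127) are the standard ones for those formats rather than values quoted in this PR.

```python
def max_finite(exp_bits, man_bits, bias, reserved):
    """Largest finite magnitude of a sign/exponent/mantissa float format.

    reserved: "ieee"     -> top exponent code is Inf/NaN (Float16, BFloat16, TF32)
              "nan_only" -> only the all-ones code is NaN (Float8 E4M3)
              "none"     -> every code is a finite number (Float6, Float4)
    """
    top_exp = (1 << exp_bits) - 1
    top_man = (1 << man_bits) - 1
    if reserved == "ieee":
        top_exp -= 1   # top exponent code is not available for finite values
    elif reserved == "nan_only":
        top_man -= 1   # all-ones exponent and mantissa is the NaN code
    return (1 + top_man / (1 << man_bits)) * 2.0 ** (top_exp - bias)


# (exponent bits, mantissa bits, bias, reserved codes)
FORMATS = {
    "Float16": (5, 10, 15, "ieee"),
    "Float8": (4, 3, 7, "nan_only"),
    "BFloat16": (8, 7, 127, "ieee"),
    "TensorFloat32": (8, 10, 127, "ieee"),
    "Float6": (3, 2, 3, "none"),
    "Float4": (2, 1, 1, "none"),
}

for name, params in FORMATS.items():
    print(f"{name:14s} +/-{max_finite(*params):.6g}")
```

The three `reserved` modes correspond to the special-value rules listed under Implementation.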

Implementation

Each type follows the same full-stack pattern:

  • C++ model ID constant in Constants.h
  • Block get/set methods with inline bit-manipulation converters in Block.cpp
  • Variable dispatch in Variable.cpp
  • Python Model class with toBytes/fromBytes/fromString/minValue/maxValue
  • Sphinx API documentation (per-type page + consolidated summary reference)

Special-value handling per spec:

  • Float16: full IEEE 754 NaN and ±Inf
  • Float8: NaN = 0x7F, no infinity (clamps to max finite)
  • BFloat16/TF32: full IEEE 754 NaN and ±Inf
  • Float6/Float4: no NaN or Inf (clamps to max finite on encode)
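
As an illustration of the Float8 rules above, here is a minimal pure-Python E4M3 decoder. It is a sketch, not the code in Block.cpp or _Model.py: the function name is hypothetical, and treating 0xFF as NaN alongside 0x7F is an assumption taken from the common E4M3 definition (the PR only names 0x7F).

```python
import math


def float8_e4m3_to_float(byte):
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return math.nan                    # 0x7F (and, by assumption, 0xFF)
    if exp == 0:
        return sign * man * 2.0 ** -9      # subnormal: (man / 8) * 2**-6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)


print(float8_e4m3_to_float(0x7E))   # largest finite code
print(float8_e4m3_to_float(0x7F))   # the NaN code
```

No byte decodes to infinity, which is why the encoder has to clamp infinite inputs to the largest finite code.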

Documentation

Added docs/src/pyrogue_tree/core/float_types_summary.rst — a consolidated quick-reference page with format details table, special-value handling, NVIDIA architecture support matrix, and usage example.

Test plan

  • pytest tests/core/test_float16.py — all boundary values, round-trip, metadata, remote variable integration
  • pytest tests/core/test_float8.py — all boundary values, round-trip, metadata, remote variable integration
  • pytest tests/core/test_bfloat16.py — all boundary values, round-trip, metadata, remote variable integration
  • pytest tests/core/test_tensorfloat32.py — all boundary values, round-trip, metadata, remote variable integration
  • pytest tests/core/test_float6.py — all 64 bit patterns, boundary values, NaN/Inf clamping, remote variable integration
  • pytest tests/core/test_float4.py — all 16 bit patterns, boundary values, NaN/Inf clamping, remote variable integration
  • Full regression: 207 passed
  • flake8 + cpplint clean

Resolves ESROGUE-730.

Add Float16 model type so users can declare 16-bit half-precision
floating point variables with base=pr.Float16. Implements the full
C++/Python stack: model ID constant, Block get/set methods with inline
IEEE 754 half-float converters, Variable dispatch, Python Model classes
(Float16, Float16BE), Sphinx API documentation, and comprehensive tests
including boundary values, NaN/Inf, subnormals, and cache key behavior.

Resolves ESROGUE-730.
ruck314 changed the title from "Add IEEE 754 half-precision (Float16) variable support" to "Add IEEE 754 half-precision (Float16) variable support (ESROGUE-730)" on Apr 4, 2026
ruck314 added 9 commits April 4, 2026 17:15
…et/set methods

- Add Float8 = 0x0A constant to Constants.h between Float16 and Custom
- Declare setFloat8Py/getFloat8Py/setFloat8/getFloat8 in Block.h
- Implement floatToFloat8/float8ToFloat E4M3 converters in anonymous namespace
- Implement all four Block methods following Float16 pattern
- setFloat8Py/getFloat8Py use NPY_FLOAT (no native numpy Float8 dtype)
…nding

- Add setFloat8_/getFloat8_ function pointer members to Variable.h
- Add public setFloat8/getFloat8 accessor declarations to Variable.h
- Add case rim::Float8 to all four switch blocks in Variable.cpp
- Implement setFloat8/getFloat8 C++ accessor wrappers in Variable.cpp
- Expose Float8 constant to Python in module.cpp
- Add Float8(Model) with E4M3 encoding: modelId=rim.Float8, ndType=float32
- Manual bit manipulation toBytes/fromBytes (no struct.pack format for 8-bit float)
- NaN encodes as 0x7F, infinity clamps to max finite (sign|0x7E)
- minValue=-448.0, maxValue=448.0
- Float8BE subclass with endianness='big'
- Create tests/core/test_float8.py with 18 tests covering metadata,
  boundary values round-trip, NaN encoding (0x7F), no-infinity clamping,
  known bit patterns, and RemoteVariable integration
- Add Float8Var (offset=0x30, bitSize=8) to ModelVariableDevice in
  test_model_variables.py and exercise it in round-trip test
- All 22 tests pass
- Create docs/src/api/python/pyrogue/float8.rst with autoclass directives
- Add float8 to Models toctree in index.rst (after float16)
- Add Float8/Float8BE rows to Built-In Model Types table in model.rst
- Update floating-point family list to include Float8 and Float8BE
…thods

- Add BFloat16 = 0x0B constant to Constants.h after Float8
- Add setBFloat16Py/getBFloat16Py/setBFloat16/getBFloat16 declarations to Block.h
- Add floatToBFloat16/bfloat16ToFloat converter functions (upper 16 bits of float32)
- Implement all four BFloat16 Block methods using NPY_FLOAT (no native numpy dtype)
…binding

- Add setBFloat16_/getBFloat16_ function pointer members to Variable.h
- Add setBFloat16/getBFloat16 public accessor declarations to Variable.h
- Add four case rim::BFloat16 dispatch blocks in Variable.cpp constructor
- Add setBFloat16/getBFloat16 C++ accessor wrapper methods in Variable.cpp
- Initialize setBFloat16_/getBFloat16_ to NULL in constructor
- Expose rim.BFloat16 = 0x0B constant via module.cpp Python binding
- BFloat16(Model): 1s/8e/7m format, upper 16 bits of float32 bit pattern
- modelId = rim.BFloat16, ndType = np.float32, bitSize must be 16
- toBytes/fromBytes use integer bit manipulation (no struct format code for BF16)
- Supports infinity and NaN (unlike Float8 which clamps)
- BFloat16BE subclass with endianness = 'big'
- Create tests/core/test_bfloat16.py with 17 tests covering:
  metadata, wrong bitsize rejection, instance caching, boundary value
  round-trips, NaN, infinity, known bit patterns, BE endianness,
  and remote variable integration through C++ Block layer
- Update test_model_variables.py: add BFloat16Var at offset=0x32
  and include it in round-trip test parametrization
@bengineerd bengineerd requested a review from Copilot April 5, 2026 01:09
- Create docs/src/api/python/pyrogue/bfloat16.rst with autoclass directives
- Add bfloat16 to Models toctree in index.rst after float8
- Add BFloat16 and BFloat16BE rows to Built-In Model Types table in model.rst
- Update floating-point family list to include BFloat16 and BFloat16BE
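
The "upper 16 bits of float32" conversion described in the BFloat16 commits can be sketched with the standard `struct` module. The function names below are hypothetical and the sketch truncates only; it does not include the NaN-payload handling added later in this PR.

```python
import struct


def float_to_bfloat16_bits(value):
    """Keep the sign, the 8 exponent bits and the top 7 mantissa bits of float32."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16):
    """Widen a 16-bit BFloat16 pattern back to float32 by zero-filling the low half."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


print(hex(float_to_bfloat16_bits(1.0)))
print(bfloat16_bits_to_float(0x7F7F))   # largest finite BFloat16
```

Note that `struct.pack("<f", ...)` raises `OverflowError` for finite Python floats beyond the float32 range, so a caller would need to clamp or reject those first.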

Copilot AI left a comment

Pull request overview

Adds IEEE 754 half-precision (Float16) as a first-class model/variable type across the C++/Python memory stack, with docs and tests.

Changes:

  • Introduces Float16/Float16BE model types (Python) and a new model ID constant (Float16 = 0x09).
  • Adds C++ Block/Variable dispatch and Python bindings for Float16 get/set, including half-float conversion helpers.
  • Updates Sphinx/Doxygen docs and expands tests to cover Float16 metadata, caching, and RemoteVariable integration.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| tests/core/test_model.py | Adds Float16 model metadata, caching, and boundary/NaN round-trip tests (Python struct-based). |
| tests/core/test_model_variables.py | Adds a Float16 RemoteVariable and round-trip/boundary tests through the remote/block path. |
| src/rogue/interfaces/memory/Variable.cpp | Wires Float16 modelId dispatch to Block get/set and adds Variable::get/setFloat16 methods. |
| src/rogue/interfaces/memory/module.cpp | Exposes Float16 constant into the Python rogue.interfaces.memory module. |
| src/rogue/interfaces/memory/Block.cpp | Implements Float16 Python/C++ get/set plus half-float conversion helpers. |
| python/pyrogue/_Model.py | Adds pyrogue.Float16 and pyrogue.Float16BE Model implementations. |
| include/rogue/interfaces/memory/Variable.h | Declares Float16 function pointers and public get/setFloat16 APIs. |
| include/rogue/interfaces/memory/Constants.h | Adds the Float16 model ID constant. |
| include/rogue/interfaces/memory/Block.h | Declares Float16 Block APIs for C++ and Python. |
| docs/src/api/python/pyrogue/index.rst | Adds float16 to the Models docs toctree. |
| docs/src/api/python/pyrogue/float16.rst | New API doc page for pyrogue.Float16 / pyrogue.Float16BE. |
| docs/src/api/cpp/interfaces/memory/model.rst | Links Float16/Float16BE from the C++ interfaces memory model docs. |
| docs/src/api/cpp/interfaces/memory/constants.rst | Adds Float16 to the constants reference list. |

Comment thread src/rogue/interfaces/memory/Variable.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread tests/core/test_model_variables.py Outdated
ruck314 added 13 commits April 4, 2026 18:40
…nd Block methods

- Add TensorFloat32 = 0x0C constant to Constants.h
- Add floatToTensorFloat32/tensorFloat32ToFloat converter pair (4-byte, mask 0xFFFFE000)
- Add setTensorFloat32/getTensorFloat32 C++ and Python Block methods using uint32_t storage
…nding

- Add setTensorFloat32_/getTensorFloat32_ function pointers and public accessors to Variable.h
- Add NULL init, 4 switch-case dispatches, and C++ wrapper methods to Variable.cpp
- Expose TensorFloat32 constant to Python via module.cpp
- rim.TensorFloat32 == 12 verified from Python
- TensorFloat32 class with 4-byte storage, 0xFFFFE000 mask, rim.TensorFloat32 modelId
- TensorFloat32BE big-endian variant
- bitSize=32 assertion, ndType=np.float32, min/max ~3.40e38
…tion

- 17 tests: metadata, wrong bitsize, cache, boundary round-trip, NaN, infinity, fromString, bit patterns, BE endianness, remote variable round-trip, remote variable boundary values
- TF32Var added to model variables integration device at offset=0x34, bitSize=32
- All 35 existing BFloat16/Float8 tests pass with no regressions
- New tensorfloat32.rst with autoclass directives for TensorFloat32 and TensorFloat32BE
- Added tensorfloat32 to Models toctree in pyrogue index.rst
- Added TF32 rows to Built-In Model Types table and floating-point family list in model.rst
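
The TensorFloat32 storage scheme (19 significant bits kept in a 4-byte word via the 0xFFFFE000 mask) can be illustrated as follows. This is a hedged sketch with hypothetical names, not the PR's converter pair, and it shows truncation only.

```python
import struct

TF32_MASK = 0xFFFFE000  # sign + 8 exponent bits + top 10 mantissa bits


def float_to_tf32_bits(value):
    """Return the 32-bit storage word with the low 13 mantissa bits cleared."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 & TF32_MASK


def tf32_bits_to_float(bits32):
    """A TF32 storage word is already a valid float32 bit pattern."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits32 & TF32_MASK))
    return value


print(hex(float_to_tf32_bits(1.0 + 2.0 ** -10)))  # lowest kept mantissa bit survives
print(hex(float_to_tf32_bits(1.0 + 2.0 ** -11)))  # first dropped bit is masked off
```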
… get/set methods

- Add Float6 = 0x0D constant in Constants.h
- Add setFloat6/getFloat6/setFloat6Py/getFloat6Py declarations in Block.h
- Implement floatToFloat6/float6ToFloat converters with E3M2 semantics (1s/3e/2m, bias=3, no NaN/Inf)
- Implement all four Block methods with NPY_FLOAT numpy array support
…nding

- Add setFloat6_/getFloat6_ function pointer members in Variable.h
- Add public setFloat6/getFloat6 accessor declarations in Variable.h
- Add 4 case rim::Float6 dispatch blocks in Variable.cpp
- Add NULL initialization for Float6 function pointers
- Add C++ accessor wrapper implementations
- Expose Float6 constant to Python in module.cpp
- Float6 E3M2 model with toBytes/fromBytes bit manipulation
- NaN/Inf inputs clamp to max finite (+/-28.0)
- Float6BE big-endian variant
- bitSize=8, ndType=float32, modelId=rim.Float6
- Create test_float6.py with metadata, boundary, bit pattern, NaN/Inf, remote variable tests
- Add Float6Var to test_model_variables.py integration test
- Fix float6ToFloat C++ subnormal decode using direct formula instead of normalization loop
- Create float6.rst with autoclass directives for Float6/Float6BE
- Add float6 to API index toctree
- Add Float6/Float6BE rows to model type table and family list
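
The "direct formula" subnormal decode mentioned in the Float6 commits can be sketched in Python. The function name is hypothetical; the layout (1s/3e/2m, bias=3, no NaN/Inf) is the one stated above, and the code illustrates the arithmetic rather than reproducing float6ToFloat.

```python
def float6_e3m2_to_float(code):
    """Decode a 6-bit E3M2 value held in the low bits of a storage byte."""
    code &= 0x3F
    sign = -1.0 if code & 0x20 else 1.0
    exp = (code >> 2) & 0x07
    man = code & 0x03
    if exp == 0:
        # Subnormal, no loop needed: (man / 4) * 2**(1 - bias) == man * 2**-4
        return sign * man * 2.0 ** -4
    return sign * (1 + man / 4) * 2.0 ** (exp - 3)


values = [float6_e3m2_to_float(c) for c in range(64)]
print(min(values), max(values))
```

Because every one of the 64 codes is a finite number, an exhaustive table like `values` is a cheap oracle for a test that walks all bit patterns.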
… methods

- Add Float4 = 0x0E constant in Constants.h
- Add setFloat4/getFloat4/setFloat4Py/getFloat4Py declarations in Block.h
- Implement floatToFloat4/float4ToFloat converters with E2M1 semantics (1s/2e/1m, bias=1, no NaN/Inf)
- Implement Block set/get methods for Float4 with NPY_FLOAT numpy array support
…nding

- Add setFloat4_/getFloat4_ function pointer members in Variable.h
- Add public setFloat4/getFloat4 accessor declarations in Variable.h
- Add 4 case rim::Float4: blocks in Variable.cpp switch statements
- Add NULL initialization for Float4 function pointers
- Add C++ accessor wrapper implementations for Float4
- Expose Float4 constant to Python in module.cpp
- Float4 E2M1 model with manual bit manipulation encode/decode
- NaN/Inf inputs clamp to max finite (+/-6.0), no special encodings
- Float4BE big-endian variant
- ndType=float32, bitSize assertion enforces 8-bit storage
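
A compact way to picture the Float4 E2M1 encode/decode and its clamping is a lookup over the eight magnitudes. The sketch below is illustrative only: the names are hypothetical, nearest-value rounding with ties toward the smaller magnitude is an assumption (the PR does not state the rounding rule), and NaN maps to positive max as a later commit in this PR specifies.

```python
import math

# Magnitudes for codes 0..7 (2 exponent bits, 1 mantissa bit, bias 1).
FLOAT4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def float4_to_float(code):
    """Decode a 4-bit E2M1 code; bit 3 is the sign."""
    mag = FLOAT4_MAGNITUDES[code & 0x07]
    return -mag if code & 0x08 else mag


def float_to_float4(value):
    """Encode with clamping: NaN -> +6.0, +/-Inf and large values -> +/-6.0."""
    if math.isnan(value):
        return 0x07
    sign = 0x08 if math.copysign(1.0, value) < 0 else 0x00
    mag = min(abs(value), 6.0)
    # Assumption: pick the nearest magnitude; min() keeps the lower code on ties.
    code = min(range(8), key=lambda i: abs(FLOAT4_MAGNITUDES[i] - mag))
    return sign | code


print([float4_to_float(c) for c in range(16)])
```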
ruck314 changed the title from "Add IEEE 754 half-precision (Float16) variable support (ESROGUE-730)" to "Add floating point variable types: Float8, BFloat16, TensorFloat32, Float6, Float4 (ESROGUE-730)" on Apr 5, 2026
ruck314 added 2 commits April 4, 2026 22:38
- Initialize setFloat16_/getFloat16_ and setFloat8_/getFloat8_ to NULL
  alongside all other function pointers in Variable constructor; missing
  initialization left indeterminate values that made NULL checks unreliable

- Fix floatToHalf() subnormal encoding: remove redundant >> 13 after
  mantissa >>= shift which flushed all half-precision subnormals to zero

- Fix floatToHalf() NaN encoding: ensure half mantissa payload is non-zero
  when the float32 NaN payload fits in fewer than 13 bits, preventing
  mis-encoding as infinity (0x7C00)

- Fix halfToFloat() subnormal decoding: use int32_t for the working
  exponent in the normalization loop so it can go negative without
  wrapping to UINT32_MAX and producing invalid float results

- Add true half-precision subnormal (2**-24) to the Float16 integration
  test to cover the floatToHalf/halfToFloat denormal path
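
Python's `struct` module supports IEEE 754 half precision through the `e` format code, so it can serve as an independent reference when checking converters like floatToHalf/halfToFloat. The helper names below are hypothetical; this is a sketch of such a cross-check, not the PR's test code.

```python
import math
import struct


def half_bits(value):
    """Encode a Python float as half precision and return the 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_value(bits):
    """Decode a 16-bit half-precision pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


print(hex(half_bits(2.0 ** -24)))      # smallest positive subnormal
print(half_value(0x03FF))              # largest subnormal
print(hex(half_bits(65504.0)))         # largest finite value
print(math.isnan(half_value(half_bits(math.nan))))
```

A subnormal that encodes to 0x0000 or a NaN that encodes to 0x7C00 would reproduce the two floatToHalf() defects described above.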
…type

The normalization loop in float8ToFloat checked bit 2 (0x04) for the
implicit leading 1, but Float8's 3-bit stored mantissa has its leading
bit at position 3 (0x08). Stopping one shift early produced a 2x error
for every subnormal value (e.g. 0x01 decoded to 2^-8 instead of 2^-9).

Fix by checking bit 3 (0x08) and stripping with 0x07 (not 0x03), and
use int32_t for the working exponent so the counter can go negative
without wrapping. Add a RemoteVariable subnormal test (2^-9) to cover
the C++ Block conversion path.
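
The effect of checking the wrong bit can be reproduced with a small Python model of the normalization loop. This is a hypothetical reconstruction for illustration, not the C++ code: `leading_bit=0x04` models the buggy check and `leading_bit=0x08` the fixed one. Python integers do not wrap, so the separate int32_t exponent issue is not visible here.

```python
def decode_e4m3_subnormal(man, leading_bit):
    """Normalize a non-zero 3-bit subnormal mantissa and return its value."""
    exp = -6                        # exponent of the smallest normal (1 - bias)
    while not man & leading_bit:    # shift until the implicit-1 position is set
        man <<= 1
        exp -= 1
    frac_mask = leading_bit - 1     # strip the implicit leading 1
    return (1 + (man & frac_mask) / leading_bit) * 2.0 ** exp


for man in range(1, 8):
    buggy = decode_e4m3_subnormal(man, 0x04)
    fixed = decode_e4m3_subnormal(man, 0x08)
    print(man, buggy, fixed, buggy / fixed)
```

In this model every subnormal comes out at exactly twice the correct value with the 0x04 check, matching the 2x error described in the commit message.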
ruck314 commented Apr 5, 2026

Follow-up to the Copilot review fixes (commit 19691ae): audited all other new floating point types for the same 5 bug categories.

Bug 4 (uint32_t underflow in normalization loop) also affected float8ToFloat, and had a related secondary bug: the loop terminated when bit 2 was set (0x04) instead of bit 3 (0x08), stopping one shift early. The combination caused every Float8 subnormal to decode at 2× the correct value (e.g. 0x01 decoded to 2^-8 instead of 2^-9). Fixed in commit d42516f.

Float6 and Float4 are immune to both issues — their ToFloat decoders use direct arithmetic instead of a normalization loop.

BFloat16 and TensorFloat32 have an analogous NaN-payload-truncation edge case (Bug 3): a float32 NaN with payload bits only in the lower 16/13 bits will round-trip as Inf rather than NaN. This is an inherent property of the truncation-based encoding (both formats are defined as bit-truncations of float32), not a coding error.

Bug 1 (NULL init) for Float8 pointers was already caught and fixed in 19691ae alongside the Float16 fix.

cpplint requires controlled statements inside while clause brackets
to be on separate lines (whitespace/newline rule).

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 12 comments.


Comment thread python/pyrogue/_Model.py Outdated
Comment thread python/pyrogue/_Model.py
Comment thread python/pyrogue/_Model.py
Comment thread python/pyrogue/_Model.py
Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread docs/src/pyrogue_tree/core/model.rst Outdated
Comment thread docs/src/api/cpp/interfaces/memory/constants.rst
Comment thread docs/src/api/cpp/interfaces/memory/model.rst
Comment thread tests/core/test_bfloat16.py
Comment thread tests/core/test_tensorfloat32.py
ruck314 changed the title from "Add floating point variable types: Float8, BFloat16, TensorFloat32, Float6, Float4 (ESROGUE-730)" to "Add floating point variable types: Float16, Float8, BFloat16, TensorFloat32, Float6, Float4 (ESROGUE-730)" on Apr 5, 2026
C++:
- floatToBFloat16(): detect NaN whose payload is entirely in the lower
  16 truncated bits and force a NaN mantissa bit so it cannot become Inf
- floatToTensorFloat32(): same fix for the lower 13 masked mantissa bits

Python:
- BFloat16.toBytes/fromBytes: use self.endianness (big vs little) when
  packing/unpacking the uint16; BFloat16BE was silently producing
  little-endian bytes
- BFloat16.toBytes: same NaN-preservation logic as C++ fix above
- TensorFloat32.toBytes/fromBytes: same endianness and NaN fixes

Tests:
- test_bfloat16_be_endianness: add known bit-pattern assertions for 1.0,
  +Inf, -Inf encoded in big-endian byte order
- test_tensorfloat32_be_endianness: add known bit-pattern assertion for
  1.0 encoded in big-endian byte order

Docs:
- float_types_summary.rst: fix Float16 model-class column (was "Float")
- pyrogue_tree/core/model.rst: add Float16/Float16BE to floating-point
  model list
- api/cpp/constants.rst: add Float8, BFloat16, TF32, Float6, Float4
- api/cpp/model.rst: add anchor sections and Python doc links for all
  new model types (Float8, BFloat16, TF32, Float6, Float4)
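
The NaN-preservation and endianness fixes in this commit can be sketched on raw float32 bit patterns. The names are hypothetical, and forcing the quiet bit (0x00400000) is an assumption: the commit only says a NaN mantissa bit is forced, not which one. Working on integers avoids relying on how a given platform carries NaN payloads through Python floats.

```python
def truncate_float32_bits(bits32, drop_bits):
    """Clear the low drop_bits of a float32 pattern (16 for BFloat16, 13 for TF32),
    keeping a NaN a NaN when its whole payload sat in the dropped bits."""
    kept = bits32 & (0xFFFFFFFF << drop_bits) & 0xFFFFFFFF
    is_nan = (bits32 & 0x7F800000) == 0x7F800000 and (bits32 & 0x007FFFFF) != 0
    if is_nan and (kept & 0x007FFFFF) == 0:
        kept |= 0x00400000  # assumption: set the quiet bit so the result is not Inf
    return kept


def bfloat16_to_bytes(bits32, endianness="little"):
    """Pack the upper 16 bits in the requested byte order ("little" or "big")."""
    return (truncate_float32_bits(bits32, 16) >> 16).to_bytes(2, endianness)


print(hex(truncate_float32_bits(0x7F800001, 16)))   # NaN payload only in low bits
print(bfloat16_to_bytes(0x3F800000, "big").hex())   # 1.0 in big-endian order
```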

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.


Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread src/rogue/interfaces/memory/Block.cpp Outdated
Comment thread python/pyrogue/_Model.py
Comment thread docs/src/pyrogue_tree/core/model.rst Outdated
Comment thread tests/core/test_tensorfloat32.py
…ble rows

- floatToFloat6/floatToFloat4: NaN has no meaningful sign so always clamp
  to positive max; infinity preserves sign to match Python model behavior
- Float16.toBytes(): document that struct.pack uses round-to-nearest-even
  while the C++ floatToHalf uses truncation (difference is at most 1 ULP)
- model.rst: add missing Float16/Float16BE rows to Built-In Model Types table
- Add remote variable NaN clamping tests for Float6/Float4 through C++ path
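
The round-to-nearest-even versus truncation difference noted for Float16 can be demonstrated with one value. The truncating helper below is a hypothetical stand-in for the behaviour the commit attributes to floatToHalf, restricted to normal in-range values; it is not the C++ implementation.

```python
import struct


def half_bits_rounding(value):
    """Half-precision pattern as produced by struct.pack (round-to-nearest-even)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_truncating(value):
    """Truncating float32 -> half conversion for normal, in-range values only."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    sign = (bits32 >> 16) & 0x8000
    exp = ((bits32 >> 23) & 0xFF) - 127 + 15
    if not 0 < exp < 31:
        raise ValueError("sketch handles normal half-range values only")
    return sign | (exp << 10) | ((bits32 >> 13) & 0x3FF)


x = 1.0 + 2.0 ** -11 + 2.0 ** -12      # 0.75 of a half-precision ULP above 1.0
print(hex(half_bits_rounding(x)), hex(half_bits_truncating(x)))
```

The two results differ by one unit in the last place, consistent with the "at most 1 ULP" bound documented in the commit.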

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.


Comment thread include/rogue/interfaces/memory/Variable.h Outdated
Comment thread python/pyrogue/_Model.py
Comment thread docs/src/pyrogue_tree/core/float_types_summary.rst Outdated
- Fix "C++ filed point" → "C++ fixed point" typo in Variable.h section header
- Add missing Float16BE to big-endian variant list in float_types_summary.rst

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.


Comment thread src/rogue/interfaces/memory/Block.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp
Comment thread src/rogue/interfaces/memory/Block.cpp
ruck314 added 6 commits April 7, 2026 17:09
The pre-release merge brought the wide signed Fixed32/40/64 test
variables into test_model_variables.py alongside the Float16/8/BFloat16/
TF32/Float6/Float4 variables already on this branch. Both sets used
offsets 0x28/0x30/0x38, so Block::addVariables rejected the device with
a bit-overlap GeneralError and every test in the file failed at
construction time. Move the Fixed* entries into the free region after
UFixedListVar (0x68/0x70/0x78), which is well within the 0x4000 emulated
memory window.
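
The collision described here can be reproduced with a simple bit-range overlap check. This sketch is not Block::addVariables: `find_overlaps` and the `Fixed40Var` name are hypothetical, while the Float8Var/BFloat16Var/TF32Var offsets and sizes are the ones quoted in the earlier commits.

```python
def find_overlaps(variables):
    """variables: iterable of (name, byte_offset, bit_size).
    Returns the pairs of names whose bit ranges intersect."""
    spans = sorted((off * 8, off * 8 + bits, name) for name, off, bits in variables)
    overlaps = []
    for i, (_, end_a, name_a) in enumerate(spans):
        for start_b, _, name_b in spans[i + 1:]:
            if start_b >= end_a:
                break               # spans are sorted, so nothing later can overlap
            overlaps.append((name_a, name_b))
    return overlaps


floats = [("Float8Var", 0x30, 8), ("BFloat16Var", 0x32, 16), ("TF32Var", 0x34, 32)]
before = floats + [("Fixed40Var", 0x30, 40)]   # hypothetical name, old offset
after = floats + [("Fixed40Var", 0x70, 40)]    # moved into the free region

print(find_overlaps(before))
print(find_overlaps(after))
```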
…del.rst

Convert the new Float8/BFloat16/TensorFloat32/Float6/Float4 rows (and BE
variants) in the Built-In Model Types table to use :ref: cross-references
consistent with every other row. Correct the Bit Size column for Float6
and Float4 from 8-bits (the storage size) to 6-bits and 4-bits (the format
width). Widen columns so BFloat16/TF32 notes and the tensorfloat32be ref
fit within their grid cells.
@ruck314 ruck314 merged commit 6c470c0 into pre-release Apr 13, 2026
13 of 14 checks passed
@ruck314 ruck314 deleted the ESROGUE-730 branch April 13, 2026 16:50