NORU

NNUE On RUst — Zero-dependency NNUE training & inference library in pure Rust.

What is NNUE?

NNUE (Efficiently Updatable Neural Network) is a neural network architecture designed for fast evaluation in game engines. Originally developed for Shogi and adopted by Stockfish, NNUE enables real-time neural network inference through incremental accumulator updates.
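
As a rough illustration of the "efficiently updatable" idea (a conceptual sketch, not noru's internal code): when a move toggles only a handful of features, the first-layer sums can be patched instead of recomputed from scratch.

// Conceptual sketch only. `acc` holds the first-layer pre-activation sums for the
// current position; `w[f]` is the weight column of feature f. One move touches only
// a few features, so the patch costs O(changed features × accumulator size).
fn apply_move(acc: &mut [i32], w: &[Vec<i32>], added: usize, removed: usize) {
    for j in 0..acc.len() {
        acc[j] += w[added][j] - w[removed][j];
    }
}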

What is NORU?

NORU is a game-agnostic NNUE library that provides both training and inference in a single, dependency-free Rust crate. Configure your network dimensions at runtime via NnueConfig — no recompilation needed.

Why this library?

Most NNUE code in the wild lives inside a specific chess engine (Stockfish, Rapfi, …) and is hard-wired to that engine's feature layout in C++. Applying NNUE to a different game — Gomoku, Connect 4, a tactical hex-grid battler — traditionally means either forking one of those engines and rewriting its feature encoder, or re-implementing training from scratch in Python with PyTorch and then writing a separate C++ inference path for deployment.

NORU collapses that pipeline into one pure-Rust crate:

  • One crate for training and inference. FP32 backprop with Adam for training, i16 quantized forward pass with SIMD acceleration for deployment. You don't leave the Rust toolchain.
  • No baked-in assumptions about chess. NnueConfig decouples feature_size, accumulator_size, hidden_sizes, and the activation function from the binary layout, so the same crate serves a 4096-feature Gomoku encoder and a 138-feature hex-grid encoder without code changes.
  • No dependencies. Including the RNG (xorshift64). cargo add noru just works on any platform Rust supports, including WebAssembly and ARM embedded targets.
  • Deployment-ready. A cdylib build + the noru::ffi layer exposes the inference API to Unity, Godot, C#, and C++ so the same trained weights can ship into a game engine without a Python runtime.

The design target is game AI developers who want Stockfish-class evaluation quality for non-chess domains without paying the integration cost of the chess-engine ecosystem.

Key Features

  • Multi-hidden-layer — Arbitrary depth networks (e.g. &[256, 32, 32])
  • CReLU + SCReLU — Squared Clipped ReLU for stronger accumulator activation
  • SIMD-accelerated inference — AVX2 (x86_64), NEON (aarch64), with scalar fallback
  • Training + Inference — FP32 backpropagation with Adam optimizer, i16 quantized inference
  • Zero dependencies — Pure Rust, no PyTorch, no CUDA, no C bindings
  • Game-agnostic — Runtime-configurable network dimensions via NnueConfig
  • Incremental updates — Efficient accumulator add/remove for search trees
  • Quantization — Automatic FP32 → i16 conversion for deployment
  • Binary format v2 — Versioned model serialization with auto-detection
  • C ABI / FFI layer — cdylib build + noru::ffi for embedding in Unity, Godot, C#, C++

Quick Start

Add to your Cargo.toml:

[dependencies]
noru = "2.0"

Training

use noru::config::{NnueConfig, Activation};
use noru::trainer::{TrainableWeights, AdamState, Gradients, TrainingSample, SimpleRng};

// 1. Define your network dimensions
let config = NnueConfig::new_static(
    530,               // your game's feature count
    256,               // hidden accumulator neurons
    &[64],             // hidden layer sizes (multi-layer: &[256, 32, 32])
    Activation::CReLU, // or Activation::SCReLU
);

// 2. Initialize weights
let mut rng = SimpleRng::new(42);
let mut weights = TrainableWeights::init_random(config.clone(), &mut rng);
let mut adam = AdamState::new(config.clone());

// 3. Train on samples
let sample = TrainingSample {
    stm_features: vec![0, 42, 100],   // active feature indices (side-to-move)
    nstm_features: vec![10, 50, 200], // active feature indices (opponent)
    target: 0.8,                       // evaluation target
};

let fwd = weights.forward(&sample.stm_features, &sample.nstm_features);
let mut grad = Gradients::new(config);
weights.backward_bce(&sample, &fwd, &mut grad);  // BCE loss, target in [0, 1]
// or for raw eval regression:
// weights.backward_raw_mse(&sample, &fwd, &mut grad);
weights.adam_update(&grad, &mut adam, 0.001, 1.0);

// 4. Quantize for deployment
let inference_weights = weights.quantize(); // FP32 → i16
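
The per-sample calls in step 3 compose directly into a training loop before the quantization in step 4. A minimal sketch, assuming a dataset: Vec<TrainingSample> and a fresh gradient buffer per step (the FFI layer exposes a zero_grad helper; here we simply reallocate):

for _epoch in 0..100 {
    for sample in &dataset {
        let fwd = weights.forward(&sample.stm_features, &sample.nstm_features);
        let mut grad = Gradients::new(config.clone()); // fresh buffer each step
        weights.backward_bce(sample, &fwd, &mut grad);
        weights.adam_update(&grad, &mut adam, 0.001, 1.0);
    }
}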

Inference

use noru::config::{NnueConfig, Activation};
use noru::network::{NnueWeights, Accumulator, FeatureDelta, forward};

// Load quantized weights (v2 format auto-detected)
let weights = NnueWeights::load_from_bytes(&model_bytes, None)?;

// Or with legacy format (requires config)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);
let weights = NnueWeights::load_from_bytes(&model_bytes, Some(config))?;

// Evaluate a position
let mut acc = Accumulator::new(&weights.feature_bias);
acc.refresh(&weights, &stm_features, &nstm_features);
let eval: i32 = forward(&acc, &weights);

// Incremental update (for search trees)
let delta_stm = FeatureDelta::from_slices(&[new_feature], &[old_feature])?;
let delta_nstm = FeatureDelta::new();
acc.update_incremental(&weights, &delta_stm, &delta_nstm);
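
In a search tree, the same delta API can serve as a make/unmake pair. A sketch of one pattern, assuming the updates stay out of i16 saturation so that applying the inverse delta restores the previous state (the FFI layer also exposes clone / copy_from / update_undo helpers for exactly this use case):

// Make: apply the move's feature changes and evaluate the child node.
let make_stm = FeatureDelta::from_slices(&[new_feature], &[old_feature])?;
acc.update_incremental(&weights, &make_stm, &FeatureDelta::new());
let child_eval: i32 = forward(&acc, &weights);

// Unmake: apply the inverse delta (added and removed slices exchanged).
let unmake_stm = FeatureDelta::from_slices(&[old_feature], &[new_feature])?;
acc.update_incremental(&weights, &unmake_stm, &FeatureDelta::new());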

Save / Load Models

// Save
let bytes = weights.save_to_bytes(); // v2 format with NORU header
std::fs::write("model.bin", &bytes)?;

// Load (auto-detects v2 header)
let data = std::fs::read("model.bin")?;
let weights = NnueWeights::load_from_bytes(&data, None)?;

Examples

Runnable examples live in examples/. Each is a self-contained binary; clone the repository and run them without a separate game engine:

# Minimal training → quantization → inference round trip (4-feature toy problem)
cargo run --release --example xor

# Multi-hidden-layer network with SCReLU activation
cargo run --release --example multi_layer

# FP32 → i16 → save → load → inference, reports quantization audit metrics
cargo run --release --example quantize_roundtrip

# Mini board state → sparse feature extraction → training/inference loop
cargo run --release --example feature_loop

Applications

NORU has been validated across three domains with different branching factors and feature encodings, which is the primary evidence that the runtime-configurable design generalizes beyond chess:

  • Gomoku (15×15 Five-in-a-Row). figrid-board v0.4.x ships a pbrain/Piskvork-compatible Gomocup engine (pbrain-figrid-noru) built on NORU. Feature set: 4096 (PS + LP-Rich + Compound threats + Density). Configuration: accumulator 512 → hidden 64 → output. Gomocup 2026 submission target. Repo: https://github.com/nicotina04/figrid-board.
  • Hex-grid tactical battler. An auto-extraction RPG combat engine uses NORU for unit-placement evaluation. Feature set: 138 (position-independent, per-class + global). Configuration: accumulator 256 → hidden 64 → output. Demonstrates that non-board-game domains fit the same API.
  • Connect 4. A minimal second game used as an ablation target to confirm generality; reaches ~45% win rate against a depth-matched heuristic after a few hours of training.

These three share the identical noru crate — only NnueConfig and the feature extractor differ per domain.
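
For illustration, the Gomoku and hex-grid setups described above map onto configs along these lines (the activation choices here are assumptions, not taken from those projects):

// Same crate, different NnueConfig per domain.
let gomoku = NnueConfig::new_static(4096, 512, &[64], Activation::CReLU);
let hex_battler = NnueConfig::new_static(138, 256, &[64], Activation::CReLU);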

Evidence notes and current public/private artifact status are tracked in documents/adoption_evidence.md and documents/benchmark_inventory.md.

Architecture

Input (sparse features)
  ↓
Feature Transform: [feature_size][accumulator_size] (per perspective)
  ↓
CReLU or SCReLU
  ↓
Concat: [accumulator_size × 2] (STM + NSTM perspectives)
  ↓
Hidden Layer₁ → CReLU → Hidden Layer₂ → ... → Hidden Layerₙ → CReLU
  ↓
Output Layer → 1 (evaluation score)

All dimensions are configured at runtime:

// Simple (single hidden layer)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);

// Stockfish-style (multi-layer + SCReLU)
let config = NnueConfig::new_static(768, 1024, &[256, 32, 32], Activation::SCReLU);
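
One way to read a config against the diagram above is to count where the parameters go. A sketch, assuming the feature-transform weights are shared between the two perspectives (as in standard NNUE) and ignoring biases:

// Rough weight count implied by a config; illustrative only, not the crate's layout.
fn rough_weight_count(feature_size: usize, accumulator_size: usize, hidden: &[usize]) -> usize {
    let mut total = feature_size * accumulator_size; // feature transform
    let mut prev = accumulator_size * 2;             // concatenated STM + NSTM halves
    for &h in hidden {
        total += prev * h;                           // hidden layers
        prev = h;
    }
    total + prev                                     // output layer → 1
}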

SIMD Acceleration

Inference is automatically accelerated on supported platforms:

Platform   Instruction Set   Width                Auto-detected
x86_64     AVX2              256-bit (16 × i16)   Runtime
aarch64    NEON              128-bit (8 × i16)    Compile-time
Other      Scalar fallback   -                    -

No configuration needed — the fastest available path is selected automatically.

API Reference

noru::config

Type Description
NnueConfig Network dimensions and activation type (borrowed or owned hidden_sizes)
OwnedNnueConfig Runtime-constructible variant with Vec<usize> hidden sizes; convert via .into_config()
Activation Activation function enum (CReLU, SCReLU)

noru::ffi (C ABI, optional)

NORU is built as a cdylib in addition to rlib, producing libnoru.{so,dylib} / noru.dll. The noru::ffi module exposes a C ABI surface for embedding in game engines and other non-Rust hosts:

  • Trainer: noru_trainer_new / free / forward / backward_bce / backward_raw_mse / zero_grad / adam_step
  • Accumulator tree-search helpers: noru_accumulator_clone / copy_from / update_undo for alpha-beta without snapshot allocation per node
  • Checkpoint: noru_trainer_save_fp32 / load_fp32 (FP32 weight serialization)
  • Quantize: noru_trainer_quantize → NoruWeights for inference
  • Inference: noru_weights_load / save / free, noru_accumulator_new / refresh / update / swap / forward
  • Errors: noru_last_error() returns a thread-local C string for the most recent failure.

All FFI functions return an i32 status code (NORU_OK = 0, negative values for errors) and catch panics at the boundary. See src/ffi.rs for the full surface.

noru::audit (Quantization Drift)

Type / Function Description
AuditSample Borrowed feature lists for audit-only evaluation
FeatureSet Trait for reusable STM/NSTM sample adapters
QuantizationReport Aggregate sign/range/error metrics for FP32 vs i16
audit_quantized_model() Compare FP32 weights against a quantized model
TrainableWeights::audit_quantization() Quantize and audit in one call
NnueWeights::audit_against_fp32() Audit a saved/reloaded quantized model

noru::network (Inference, i16)

Type / Function Description
NnueWeights Quantized i16 weights for inference
NnueWeights::load_from_bytes() Load weights from binary (v2 auto-detect)
NnueWeights::save_to_bytes() Save weights to v2 binary format
Accumulator Maintains per-perspective activation sums
Accumulator::refresh() Full recomputation from feature list
Accumulator::update_incremental() Efficient add/remove update
Accumulator::swap() Swap STM/NSTM perspectives
FeatureDelta Tracks added/removed features for incremental updates
FeatureDelta::from_slices() Checked constructor that rejects overflow instead of truncating
forward() Full forward pass: Accumulator → Hidden layers → Output

noru::trainer (Training, FP32)

Type / Function Description
TrainableWeights FP32 weights with training methods
TrainableWeights::init_random() Kaiming initialization
TrainableWeights::forward() FP32 forward pass with intermediate results
TrainableWeights::backward_bce() Backpropagation (BCE loss)
TrainableWeights::backward_raw_mse() Backpropagation (raw-output MSE loss)
TrainableWeights::adam_update() Adam optimizer step
TrainableWeights::quantize() FP32 → i16 for deployment
AdamState Adam optimizer momentum/velocity state
Gradients Gradient accumulation buffer
TrainingSample Training data (features + target)
SimpleRng Built-in xorshift64 RNG (no external dependency)

Development

Local checks:

cargo fmt --check
cargo test
cargo doc --no-deps
cargo package --allow-dirty --list

For contribution and support pathways, see CONTRIBUTING.md, CODE_OF_CONDUCT.md, and CITATION.cff.

Publication

Draft software-paper materials live in paper.md, paper.bib, and documents/benchmark_inventory.md.

Reproducibility

For reviewer-facing usage examples beyond the toy demos, see documents/adoption_evidence.md and documents/benchmark_inventory.md.

noru::simd

Function Description
vec_add_i16() Saturating i16 vector addition
vec_sub_i16() Saturating i16 vector subtraction
vec_clipped_relu() ClippedReLU activation (clamp to 0..127)
dot_i16_i32() i16 dot product with i32 accumulation
dot_screlu_i64() SCReLU squared dot product with i64 accumulation

noru::quant

Constant / Function Description
WEIGHT_SCALE (64) FP32 → i16 quantization scale
ACTIVATION_SCALE (256) Accumulator → Hidden scale
OUTPUT_SCALE (16) Final output scale
clipped_relu() ClippedReLU activation
screlu_f32() Squared ClippedReLU (f32)
saturate_i16() Safe i32 → i16 conversion
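
As a rough illustration of what these scales mean (illustrative arithmetic only; quantize(), clipped_relu(), and saturate_i16() are the real implementations):

// A trained FP32 weight is scaled by WEIGHT_SCALE and saturated into i16.
let w_fp32: f32 = 0.37;
let scaled = (w_fp32 * 64.0).round() as i32;                       // WEIGHT_SCALE = 64
let w_i16 = scaled.clamp(i16::MIN as i32, i16::MAX as i32) as i16; // saturate_i16-style clamp

// Accumulator activations are clamped to the ClippedReLU range before the next layer.
let pre_activation: i32 = 412;
let activated = pre_activation.clamp(0, 127);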

Building

# Library
cargo build --release

# Run tests
cargo test

# Generate documentation
cargo doc --open

Design Decisions

  • No GPU — Designed for real-time game AI on CPU. NNUE's strength is being fast enough for depth-4+ search on consumer hardware.
  • No external dependencies — Even the RNG is built-in (xorshift64). This means cargo add noru just works, everywhere.
  • SCReLU on first layer only — Following the Stockfish pattern, SCReLU is applied to the accumulator output. Subsequent hidden layers always use CReLU to avoid numerical issues in narrow layers.
  • Output-major weight layout — Hidden layer weights are stored transposed (output-major) for contiguous SIMD memory access in dot products (see the sketch after this list).
  • Vec<T> over fixed arrays — All weights use heap-allocated vectors for runtime flexibility. Slight overhead vs compile-time arrays, but enables one binary for any game.
  • Sparse feature input — Features are passed as active index lists, not dense vectors. This matches NNUE's design for board games where most features are inactive.
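
A conceptual sketch of the output-major point above (illustrative indexing only, not noru's actual data structures): all weights feeding one output neuron are contiguous, so the inner dot-product loop walks memory linearly and vectorizes cleanly.

// Output-major: row `o` holds every input weight of output neuron `o`.
fn dense_forward(weights: &[i16], input: &[i16], out_len: usize) -> Vec<i32> {
    let in_len = input.len();
    (0..out_len)
        .map(|o| {
            let row = &weights[o * in_len..(o + 1) * in_len]; // contiguous, SIMD-friendly
            row.iter().zip(input).map(|(&w, &x)| w as i32 * x as i32).sum::<i32>()
        })
        .collect()
}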

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

Related Projects

  • Stockfish NNUE — The chess engine that popularized NNUE
  • bullet — GPU-accelerated NNUE training (Rust + CUDA)
  • Rapfi — Gomoku engine with advanced NNUE
