CAI NEURAL API

CAI NEURAL API is a Pascal-based deep learning neural network API optimized for the AVX, AVX2 and AVX512 instruction sets and for OpenCL-capable devices, including AMD, Intel and NVIDIA hardware. This API has been tested under Windows and Linux.

This project is a subproject of a bigger and older project called CAI and is a sister project to the Keras-based K-CAI NEURAL API. You can find trained neural network models in the pre-trained-neural-api-networks repository.

Intro Videos


Basics of Neural Networks in Pascal - Loading and Saving
Neural Networks for Absolute Beginners! Learning a Simple Function
Coding a Neural Network in Pascal that Learns to Calculate the Hypotenuse

Why Pascal?

The Pascal computer language is easy to learn. Pascal allows developers to write readable and understandable source code.
You'll be able to produce very fast native code while keeping that code readable.
This API can outperform some major APIs in some architectures.

Prerequisites

You'll need the Lazarus development environment. If you have an OpenCL-capable device, you'll also need its OpenCL drivers. Many examples use the CIFAR-10 dataset. You'll also find examples for the CIFAR-100, MNIST, Fashion MNIST and Places365-Standard Small images 256x256 datasets.

Will It Work with Delphi?

This project is Lazarus based. That said, as of release v2.0.0, a number of units do compile with Delphi and you can create and run neural networks with Delphi. You'll be able to compile these units with Delphi: neuralvolume, neuralnetwork, neuralab, neuralabfun, neuralbit, neuralbyteprediction, neuralcache, neuraldatasets, neuralgeneric, neuralplanbuilder, Neural OpenCL, Neural Threading and neuralfit.

Installation

Clone this project, add the neural folder to your Lazarus unit search path and you'll be ready to go!

A.I. Powered Support

You can get A.I. powered help from these tools:

Documentation

The documentation covers:

Easy examples
Simple image classification examples
Youtube videos
Advanced examples
Data structures (Volumes)
Neural network layers
Dataset support
Training (fitting) your neural network
Parallel computing
Full set of examples
Normalization Cheat Sheet
Layer Authoring Guide — checklist for adding a new layer plus mini-guides on reading numerical-gradient failures and picking a tolerance
Other scientific publications from the same author

Easy Examples First Please!


Assuming that you would like to train a neural network to learn a function that has 2 inputs and one output, you could start with something like this:

    NN.AddLayer([
      TNNetInput.Create(2),
      TNNetFullConnectReLU.Create(32),
      TNNetFullConnectReLU.Create(32),
      TNNetFullConnectLinear.Create(1)
    ]);

The example above has 2 inputs (TNNetInput), 2 dense layers (TNNetFullConnectReLU) with 32 neurons each and one output (TNNetFullConnectLinear).

You can learn more about how to build and train simple neural networks at the following source code examples:

Loading and Saving Neural Networks

Loading is very easy:

    NN := TNNet.Create;
    NN.LoadFromFile('MyTrainedNeuralNetwork.nn');

Saving is just as easy:

    NN.SaveToFile('MyTrainedNeuralNetwork.nn');
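
The two calls above can be combined into a small round trip. The following is a minimal sketch, not an official example: it assumes the `neuralnetwork` unit is on the unit search path, uses a hypothetical file name, and only relies on calls that appear elsewhere in this document (`AddLayer`, `SaveToFile`, `LoadFromFile`, `CountLayers`, `CountWeights`).

```pascal
program SaveLoadSketch;
{$mode objfpc}{$H+}

uses
  neuralnetwork; // TNNet and the layer classes

const
  // Hypothetical file name used only for this sketch.
  FileName = 'SaveLoadSketch.nn';

var
  Original, Restored: TNNet;
begin
  // Build a small, untrained network.
  Original := TNNet.Create;
  Original.AddLayer([
    TNNetInput.Create(2),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectLinear.Create(1)
  ]);
  Original.SaveToFile(FileName);

  // Load the file into a fresh, empty TNNet.
  Restored := TNNet.Create;
  Restored.LoadFromFile(FileName);

  // If the round trip worked, both networks report the same structure.
  WriteLn('Layers:  ', Original.CountLayers(), ' -> ', Restored.CountLayers());
  WriteLn('Weights: ', Original.CountWeights(), ' -> ', Restored.CountWeights());

  Restored.Free;
  Original.Free;
end.
```

Comparing the layer and weight counts is only a quick sanity check; it does not prove that every weight value was restored.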

Simple Image Classification Examples

CIFAR-10 Image Classification Example

The CIFAR-10 dataset is a well-known collection of images commonly used to train machine learning and computer vision algorithms. It was created by the Canadian Institute for Advanced Research (CIFAR). It contains 60K 32x32 color images. The images are classified into 10 different classes, with 6,000 images per class. The classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Despite its relatively low resolution and small size, CIFAR-10 can be challenging for models to achieve high accuracy, making it a good dataset for testing advancements in machine learning techniques.

A source code example for CIFAR-10 image classification follows:

NN := TNNet.Create();
NN.AddLayer([
  TNNetInput.Create(32, 32, 3), //32x32x3 Input Image
  TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
  TNNetMaxPool.Create({Size=}2),
  TNNetConvolutionReLU.Create({Features=}32, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
  TNNetMaxPool.Create({Size=}2),
  TNNetConvolutionReLU.Create({Features=}32, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
  TNNetFullConnectReLU.Create({Neurons=}32),
  TNNetFullConnectLinear.Create(NumClasses),
  TNNetSoftMax.Create()
]);

CreateCifar10Volumes(ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes);

WriteLn('Neural Network will minimize error with:');
WriteLn(' Layers: ', NN.CountLayers());
WriteLn(' Neurons:', NN.CountNeurons());
WriteLn(' Weights:', NN.CountWeights());

NeuralFit := TNeuralImageFit.Create;
NeuralFit.InitialLearningRate := fLearningRate;
NeuralFit.Inertia := fInertia;
NeuralFit.Fit(NN, ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes, NumClasses, {batchsize}128, {epochs}100);

These examples train a neural network to classify images in classes such as: image has a cat, image has a dog, image has an airplane...
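
To see why three 5x5 convolutions and two pooling layers fit a 32x32 input, it can help to track the spatial size through the network. The sketch below is a standalone program that does not use the API. It assumes the usual convolution size formula and assumes that a max pooling layer of size 2 halves the width and height; `ConvOutSize` and `PoolOutSize` are hypothetical helper names.

```pascal
program Cifar10ShapeSketch;
{$mode objfpc}{$H+}

// Output width/height of a convolution (no dilation).
function ConvOutSize(InSize, FeatureSize, Padding, Stride: integer): integer;
begin
  Result := (InSize + 2 * Padding - FeatureSize) div Stride + 1;
end;

// Output width/height of a pooling layer, assuming stride = pool size.
function PoolOutSize(InSize, PoolSize: integer): integer;
begin
  Result := InSize div PoolSize;
end;

var
  S: integer;
begin
  S := 32;                       WriteLn('input:      ', S); // 32
  S := ConvOutSize(S, 5, 0, 1);  WriteLn('conv 5x5:   ', S); // 28
  S := PoolOutSize(S, 2);        WriteLn('max pool 2: ', S); // 14
  S := ConvOutSize(S, 5, 0, 1);  WriteLn('conv 5x5:   ', S); // 10
  S := PoolOutSize(S, 2);        WriteLn('max pool 2: ', S); // 5
  S := ConvOutSize(S, 5, 0, 1);  WriteLn('conv 5x5:   ', S); // 1
end.
```

Under these assumptions the last convolution produces a 1x1 map with 32 features, so the first fully connected layer receives 32 values.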

You can save and load trained models (neural networks) with TNNet.SaveToFile and TNNet.LoadFromFile. The file format is portable, meaning that you can, for example, train on CPU and run on GPU, or train on AMD and run on ARM. The following code shows a simple example of image classification that loads a pre-trained model:

  procedure ClassifyOneImageSimple;
  var
    NN: TNNet;
    ImageFileName: string;
    NeuralFit: TNeuralImageFit;
  begin
    WriteLn('Loading Neural Network...');
    NN := TNNet.Create;
    NN.LoadFromFile('SimplePlantLeafDisease-20230720.nn');
    NeuralFit := TNeuralImageFit.Create;
    ImageFileName := 'plant/Apple___Black_rot/image (1).JPG';
    WriteLn('Processing image: ', ImageFileName);
    WriteLn(
      'The class of the image is: ',
      NeuralFit.ClassifyImageFromFile(NN, ImageFileName)
    );
    NeuralFit.Free;
    NN.Free;
  end;

Youtube Videos


Basics of Neural Networks in Pascal - Loading and Saving
Neural Networks for Absolute Beginners! Learning a Simple Function
Coding a Neural Network in Pascal that Learns to Calculate the Hypotenuse
Pre-trained Neural Networks & Transfer Learning with Pascal's CAI Neural API
Coding a Neural Network in Pascal that Learns the OR Boolean Operation
A Dive into Identity Shortcut Connection - The ResNet building block
Increasing Image Resolution with Neural Networks
Ultra Fast Single Precision Floating Point Computing
AVX and AVX2 Code Optimization

Some videos refer to the uvolume unit. The current neuralvolume unit used to be called uvolume, which is why it's mentioned.

Advanced Examples

Although these examples require a deeper understanding of neural networks, they are very interesting:

Identity Shortcut Connection - ResNet building block
ResNet-20 - includes a web server example
DenseNetBC L40
Separable Convolutions - MobileNet building block
Gradient Ascent - Visualizing patterns from inner neurons in image classification
Artificial Art - Let a neural network produce art via a generative adversarial network
Super Resolution - A neural network learns how to increase image resolution
CIFAR-10 Resized - A program that resizes CIFAR-10 and CIFAR-100 images to 64x64 and 128x128 pixels.
Autoencoder - Shows an autoencoder built with hyperbolic tangents and trained with Tiny ImageNet 200.

There is also a full set of examples that you can look at.

Volumes

Volumes behave like dynamically created arrays. They are the main array-like structure used by this API. The TNNetVolume class allows you to create volumes that can be accessed as 1D, 2D or 3D arrays and operated on with the Advanced Vector Extensions (AVX) Single Instruction Multiple Data (SIMD) instruction set. The usual way to create a volume is:

constructor Create(pSizeX, pSizeY, pDepth: integer; c: T = 0);

You can access the data as 1D or 3D with:

property Raw[x: integer]: T read GetRaw write SetRaw;
property Data[x, y, d: integer]: T read Get write Store; default;

Your code will look like this:

// Usage Examples
vInput := TNNetVolume.Create(32, 32, 3);
vInput[1, 1, 1] := 1;
vInput[2, 2, 2] := vInput[1, 1, 1] + 1;
vInput.Raw[10] := 5;

vInput.RandomizeGaussian();
WriteLn('Avg: ', vInput.GetAvg());
WriteLn('Variance: ', vInput.GetVariance());
WriteLn('Std Dev: ', vInput.GetStdDeviation());

WriteLn('Multiplying by 10');
vInput.Mul(10);
WriteLn('Avg: ', vInput.GetAvg());
WriteLn('Variance: ', vInput.GetVariance());
WriteLn('Std Dev: ', vInput.GetStdDeviation());

As examples, you can add, subtract, multiply and calculate dot products with:

procedure Add(Original: TNNetVolume); overload;
procedure Sub(Original: TNNetVolume); overload;
procedure Mul(Value: Single); overload;
function DotProduct(Original: TNNetVolume): TNeuralFloat; overload;
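
As a rough illustration of the four methods listed above, the sketch below fills two small volumes by hand and combines them. It assumes that `Raw[]` is zero-based and that `Add`, `Sub` and `Mul` work element-wise and in place; only methods and properties declared in this section are used.

```pascal
program VolumeMathSketch;
{$mode objfpc}{$H+}

uses
  neuralvolume; // TNNetVolume

var
  A, B: TNNetVolume;
  I: integer;
begin
  A := TNNetVolume.Create(3, 1, 1);
  B := TNNetVolume.Create(3, 1, 1);
  for I := 0 to 2 do
  begin
    A.Raw[I] := I + 1; // A = (1, 2, 3)
    B.Raw[I] := 10;    // B = (10, 10, 10)
  end;

  A.Add(B); // A = (11, 12, 13) if Add is element-wise
  A.Sub(B); // back to (1, 2, 3)
  A.Mul(2); // A = (2, 4, 6)

  // Under the assumptions above: 2*10 + 4*10 + 6*10 = 120.
  WriteLn('Dot product: ', A.DotProduct(B):0:1);

  B.Free;
  A.Free;
end.
```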

If you need the raw position of, or a raw pointer to, an element of the volume, you can get them with:

function GetRawPos(x, y, d: integer): integer; overload;
function GetRawPos(x, y: integer): integer; overload;
function GetRawPtr(x, y, d: integer): pointer; overload;
function GetRawPtr(x, y: integer): pointer; overload;
function GetRawPtr(x: integer): pointer; overload;
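
The relationship between the 3D and 1D views can be checked directly. In the sketch below, an element is written through the default 3D property and read back through `Raw[]` at the index returned by `GetRawPos`. It assumes zero-based coordinates and makes no claim about the memory layout itself; it simply prints whatever position the API reports.

```pascal
program RawPosSketch;
{$mode objfpc}{$H+}

uses
  neuralvolume; // TNNetVolume

var
  V: TNNetVolume;
  RawIdx: integer;
begin
  V := TNNetVolume.Create(4, 3, 2); // SizeX = 4, SizeY = 3, Depth = 2
  V[1, 2, 1] := 7;                  // write through the 3D view

  RawIdx := V.GetRawPos(1, 2, 1);   // 1D index of the same element
  WriteLn('Raw position of [1, 2, 1]: ', RawIdx);
  WriteLn('Value read through Raw[]:  ', V.Raw[RawIdx]:0:1);

  V.Free;
end.
```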

You can easily operate on volumes with OpenCL via TEasyOpenCLV:

  TEasyOpenCLV = class (TEasyOpenCL)
    public
      function CreateBuffer(flags: cl_mem_flags; V: TNNetVolume): cl_mem; overload;
      function CreateInputBuffer(V: TNNetVolume): cl_mem; overload;
      function CreateHostInputBuffer(V: TNNetVolume): cl_mem; overload;
      function CreateOutputBuffer(V: TNNetVolume): cl_mem; overload;
      function CreateBuffer(V: TNNetVolume): cl_mem;  overload;

      function WriteBuffer(buffer: cl_mem; V: TNNetVolume; blocking: cl_bool = CL_FALSE): integer;
      function ReadBuffer(buffer: cl_mem; V: TNNetVolume; blocking: cl_bool = CL_TRUE): integer;

      function CreateAndWriteBuffer(V: TNNetVolume; var buffer: cl_mem): integer; overload;
      function CreateAndWriteBuffer(V: TNNetVolume): cl_mem; overload;
      function CreateWriteSetArgument(V: TNNetVolume; kernel:cl_kernel; arg_index: cl_uint): cl_mem;
      function CreateOutputSetArgument(V: TNNetVolume; kernel:cl_kernel; arg_index: cl_uint): cl_mem;
  end;

Volume Pairs, Volume Lists and Volume Pair Lists

Volumes can be organized in pairs:

  /// Implements a pair of volumes
  TNNetVolumePair = class(TObject)
    protected
      FA: TNNetVolume;
      FB: TNNetVolume;
    public
      constructor Create(); overload;
      constructor Create(pA, pB: TNNetVolume); overload;
      constructor CreateCopying(pA, pB: TNNetVolume); overload;

      destructor Destroy(); override;

      property A:TNNetVolume read FA;
      property B:TNNetVolume read FB;
      property I:TNNetVolume read FA;
      property O:TNNetVolume read FB;
  end;

Depending on the problem that you are trying to solve, modelling the training with pairs or pair lists might be helpful. Typically, a pair will be (input, desired output). This is how volume lists and volume pair lists have been implemented:

TNNetVolumeList = class (specialize TFPGObjectList<TNNetVolume>)
TNNetVolumePairList = class (specialize TFPGObjectList<TNNetVolumePair>)
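
As a minimal sketch of the (input, desired output) idea, the program below builds a pair list for the OR boolean operation. It uses the `TNNetVolumePair.Create(pA, pB)` constructor and the `I`/`O` properties shown above, plus `Add`, `Count` and indexing inherited from `TFPGObjectList`. It assumes that `Raw[]` is zero-based and that the list and its pairs free the objects they hold.

```pascal
program PairListSketch;
{$mode objfpc}{$H+}

uses
  neuralvolume; // TNNetVolume, TNNetVolumePair, TNNetVolumePairList

var
  Pairs: TNNetVolumePairList;
  vIn, vOut: TNNetVolume;
  A, B: integer;
begin
  Pairs := TNNetVolumePairList.Create();

  // One pair per row of the OR truth table.
  for A := 0 to 1 do
    for B := 0 to 1 do
    begin
      vIn := TNNetVolume.Create(2, 1, 1);
      vIn.Raw[0] := A;
      vIn.Raw[1] := B;
      vOut := TNNetVolume.Create(1, 1, 1);
      vOut.Raw[0] := A or B;
      Pairs.Add(TNNetVolumePair.Create(vIn, vOut));
    end;

  WriteLn('Number of pairs: ', Pairs.Count);
  WriteLn('Last pair: (', Pairs[3].I.Raw[0]:0:0, ', ', Pairs[3].I.Raw[1]:0:0,
    ') -> ', Pairs[3].O.Raw[0]:0:0);

  // Assumption: freeing the list also frees the pairs and their volumes.
  Pairs.Free;
end.
```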

Neural Network Layers

The layered structure of artificial neural networks is inspired by the organization of the human brain and nervous system. In the human brain, information processing occurs in a hierarchical manner. Sensory inputs are first processed by lower-level neurons, which extract simple features. These features are then passed on to deeper neurons that combine them to recognize more complex patterns. This hierarchical processing is mirrored in artificial neural networks through the use of stacked layers.

Biological neurons are connected to each other through synapses, forming complex networks. Similarly, in artificial neural networks, neurons in one layer are connected to neurons in the next layer, mimicking this interconnected structure. Biological neurons fire (activate) based on a non-linear response to their inputs. This non-linearity is crucial for the brain's ability to learn complex patterns. In artificial neural networks, we use non-linear activation functions (such as ReLU) to introduce this non-linearity. Different regions of the brain specialize in processing different types of information. For instance, the visual cortex has layers specialized for detecting edges, shapes, and complex objects. This specialization is reflected in artificial neural networks, where different layers can learn to recognize different levels of abstraction.

In the context of artificial neural networks, we can see this biologically-inspired layered approach implemented. For example:

NN := TNNet.Create();
NN.AddLayer([
  TNNetInput.Create(32, 32, 3),
  TNNetConvolutionLinear.Create({neurons=}16, {featuresize}3, {padding}1, {stride}1),
  TNNetReLU6.Create()
]);

This code snippet demonstrates the creation of a neural network with an input layer and a convolutional layer followed by a ReLU6 activation. This structure is inspired by the visual cortex in the brain, where neurons respond to specific patterns in their receptive fields, similar to how convolutional layers operate. The CAI Neural API also supports the creation of more complex, biologically-inspired architectures. These architectures are designed with multiple layers of different types, mirroring the complex structure of the brain.

Artificial neural networks with multiple layers and specialized structures are inspired by the hierarchical and specialized nature of biological neural processing. It's important to note that while artificial neural networks are inspired by biological neural networks, they are highly simplified models. The human brain is far more complex, with various types of neurons, complex connectivity patterns, and mechanisms we don't yet fully understand. However, the layered structure in artificial neural networks has proven to be a powerful approach for solving complex problems in machine learning, inspired by the remarkable capabilities of biological neural networks.

Input Layer

The input layer serves as the gateway to the entire network. It's like the sensory organs of our brain, receiving information from the outside world. Without an input layer, the neural network would have no way to receive and interpret the initial data, making it impossible to perform any meaningful computations or learning tasks. The TNNetInput class implements the input layer.

Fully Connected (Dense) Layers

Fully connected layers, also known as dense layers, are a fundamental component of neural networks. In these layers, every neuron is connected to every neuron in the previous layer, allowing for comprehensive information processing across the entire network.

In the context of the CAI Neural API, fully connected layers are represented by various classes derived from TNNetFullConnect. These layers play a crucial role in transforming input data and learning complex patterns.

The computation process in a fully connected layer involves:

Multiplying input values by the layer's weights.
Adding bias terms (if not suppressed).
Applying an activation function (if present).
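
The three steps above can be sketched in plain Pascal, independent of the API. This is an illustration of the arithmetic only, not how the library implements it; `DenseForward`, `TVector` and `TMatrix` are hypothetical names.

```pascal
program DenseForwardSketch;
{$mode objfpc}{$H+}

type
  TVector = array of Single;
  TMatrix = array of TVector; // one row of weights per neuron

function ReLU(X: Single): Single;
begin
  if X > 0 then Result := X else Result := 0;
end;

// Computes Y[n] = act(sum_i W[n][i] * X[i] + B[n]) for every neuron n.
function DenseForward(const W: TMatrix; const B, X: TVector;
  UseReLU: boolean): TVector;
var
  N, I: integer;
  Sum: Single;
begin
  Result := nil;
  SetLength(Result, Length(W));
  for N := 0 to High(W) do
  begin
    Sum := 0;
    for I := 0 to High(X) do
      Sum := Sum + W[N][I] * X[I];     // step 1: inputs times weights
    Sum := Sum + B[N];                 // step 2: add the bias term
    if UseReLU then Sum := ReLU(Sum);  // step 3: activation (optional)
    Result[N] := Sum;
  end;
end;

var
  W: TMatrix;
  B, X, Y: TVector;
begin
  SetLength(W, 2, 2);
  W[0][0] := 1;  W[0][1] := 2;
  W[1][0] := -3; W[1][1] := 1;
  SetLength(B, 2); B[0] := 0.5; B[1] := 0;
  SetLength(X, 2); X[0] := 2;   X[1] := 1;

  Y := DenseForward(W, B, X, True);
  WriteLn(Y[0]:0:1, ' ', Y[1]:0:1); // prints "4.5 0.0"
end.
```

The first neuron computes 1·2 + 2·1 + 0.5 = 4.5; the second computes −3·2 + 1·1 = −5, which ReLU clips to 0.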

Key types of fully connected layers include:

TNNetFullConnectLinear: a basic fully connected layer without an activation function. It performs a linear transformation of the input data.
TNNetFullConnectReLU: incorporates the Rectified Linear Unit (ReLU) activation function. ReLU introduces non-linearity by outputting the input for positive values and zero for negative values, helping the network learn complex patterns.
TNNetFullConnectSigmoid: applies the sigmoid activation function to the layer's output. Sigmoid squashes the output between 0 and 1, useful for binary classification tasks.
TNNetComplexLinear: the 2-dimensional (complex) base rung of the Cayley–Dickson family that also contains the quaternion and octonion layers below. Each 2-channel group (Re, Im) is multiplied by a learned complex w = a + b·i via the 2×2 complex-multiply block [[a,-b],[b,a]] (Re' = a·Re − b·Im, Im' = a·Im + b·Re), using ~1/2 the weights of an equal-width real dense layer while still mixing the real and imaginary parts. The forward block equals the true complex product and is norm-multiplicative: |w·X| = |w|·|X|. Input/output Depth must be multiples of 2. See examples/ComplexLinear/.
TNNetQuaternionLinear: a hypercomplex dense layer that shares each 4×4 Hamilton block from a single learned quaternion, coupling 4-channel groups (rotation/scaling in quaternion space) with ~1/4 the weights of an equal-width real dense layer. Input/output Depth must be multiples of 4.
TNNetOctonionLinear: the 8-dimensional octonion (Cayley–Dickson) generalization of the above — each 8-channel group is multiplied by a learned octonion via the (non-associative) octonion product, assembled as an 8×8 signed block, using ~1/8 the weights of an equal-width real dense layer. The implementation hard-codes an auditable Cayley–Dickson sign table verified by a norm-multiplicativity test (|W·X| = |W|·|X|). Input/output Depth must be multiples of 8. See examples/OctonionLinear/.
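
The 2×2 block used by TNNetComplexLinear and its norm-multiplicative property can be checked with a few lines of standalone Pascal. This sketch only reproduces the formulas quoted above for a single (Re, Im) pair; `ComplexMul` and `Norm` are hypothetical helper names, not part of the API.

```pascal
program ComplexBlockSketch;
{$mode objfpc}{$H+}

// Applies w = a + b*i to one (Re, Im) pair,
// i.e. multiplies the pair by the block [[a, -b], [b, a]].
procedure ComplexMul(A, B, Re, Im: Double; out OutRe, OutIm: Double);
begin
  OutRe := A * Re - B * Im;
  OutIm := A * Im + B * Re;
end;

function Norm(X, Y: Double): Double;
begin
  Result := Sqrt(X * X + Y * Y);
end;

var
  OutRe, OutIm: Double;
begin
  // (3 + 4i) * (1 + 2i) = -5 + 10i
  ComplexMul(3, 4, 1, 2, OutRe, OutIm);
  WriteLn('Product: ', OutRe:0:1, ' ', OutIm:0:1); // prints "Product: -5.0 10.0"

  // Norm multiplicativity: |w*X| = |w| * |X| (both are 5 * sqrt(5)).
  WriteLn('|w*X|   = ', Norm(OutRe, OutIm):0:4);
  WriteLn('|w|*|X| = ', (Norm(3, 4) * Norm(1, 2)):0:4);
end.
```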

Fully connected layers are typically used in neural network architectures as:

Hidden layers for processing and transforming features.
Output layers for producing final predictions.

For example, a simple neural network with one hidden fully connected layer can be built as follows:

NN := TNNet.Create();
NN.AddLayer([
  TNNetInput.Create(2),       // Input layer with 2 inputs
  TNNetFullConnect.Create(2), // Hidden fully connected layer with 2 neurons
  TNNetFullConnect.Create(1)  // Output fully connected layer with 1 neuron  
]);

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetFullConnectLinear`	1D, 2D, or 3D	None	Fully connected layer without an activation function (linear).
`TNNetFullConnect`	1D, 2D, or 3D	tanh	Fully connected layer with `tanh` as the default activation function.
`TNNetFullConnectReLU`	1D, 2D, or 3D	ReLU	Fully connected layer with ReLU activation.
`TNNetFullConnectSigmoid`	1D, 2D, or 3D	Sigmoid	Fully connected layer with Sigmoid activation.
`TNNet.AddGroupedFullConnect`	1D, 2D, or 3D	Optional	Adds a grouped fully connected layer, inspired by `TNNet.AddGroupedConvolution`.
`TNNetBitLinear`	1D, 2D, or 3D	None	BitNet b1.58 ternary-weight linear layer: forward uses per-neuron absmean-quantized weights `Wq = scale*round(clip(W/scale,-1,+1))` with `scale = mean(
`TNNetCirculantLinear`	1D, 2D, or 3D (`n=Depth`)	None	Structured-matrix dense layer whose `n×n` weight matrix is CIRCULANT: every row is a cyclic shift of one learned length-`n` kernel `c`, so it stores `O(n)` weights instead of `O(n²)` and the map is `y = circular_convolution(c, x) (+ bias)`. Default forward/backward is the exact direct `O(n²)` circular sum; set `UseFFT := true` for an opt-in `O(n log n)` radix-2 FFT fast path (power-of-two `n`; default OFF, agrees with the direct path to within 1e-5). Distinct from LoRA (low-rank), `AddGroupedFullConnect` (block-diagonal) and `TNNetBitLinear` (quantized dense). See `examples/CirculantLinear/`.
`TNNetHouseholderLinear`	1D, 2D, or 3D (`n=Depth`)	None	EXACTLY-orthogonal dense layer whose `n×n` weight is parameterized as a product of `K` Householder reflections `Q = H_1·H_2·…·H_K` with `H_i = I − 2·(v_i·v_iᵀ)/(v_iᵀ·v_i)`; the trainable parameters are the `K` reflection vectors `v_i ∈ ℝ^n`. `Q` is orthogonal for any `v_i` (no constrained optimization or re-projection), so the map is exactly norm/volume preserving (`‖y‖ = ‖x‖` with bias off) and exactly invertible (`Qᵀ = H_K·…·H_1`). Forward applies the reflections one at a time (`O(K·n)` per sample, never materializing `Q`), `y = Q·x (+ bias)`; backward caches each intermediate `x_i = H_i·…·H_K·x` for a closed-form per-`v_i` gradient and propagates `dL/dx = Qᵀ·dL/dy`. An exactly-orthogonal Jacobian is an isometry, so deep plain stacks neither explode nor vanish (the building block for orthogonal/unitary RNNs and reversible/normalizing-flow nets). Distinct from `TNNetSpectralNorm` (bounds only `σ_1`, leaves the other singular values free), `TNNetCirculantLinear`/Toeplitz (constrain the matrix form, not orthogonality) and Muon (orthogonalizes the update, not the weight). The `v_iᵀ·v_i → 0` denominator is guarded (degenerate reflection ≈ identity). Composite helper `TNNet.AddHouseholderLinear(N, NumReflections, UseBias)` (default `K = n` for a full orthogonal group element; `K < n` gives a cheaper sub-group). See `examples/HouseholderOrthogonal/`.
`TNNetHyperbolicLinear`	1D, 2D, or 3D (`Depth`-vector point in the ball)	None	Poincaré-ball HYPERBOLIC dense layer (Ganea, Bécigneul & Hofmann 2018, Hyperbolic Neural Networks; Nickel & Kiela 2017): treats the `Depth`-vector input `x` as a point inside the open Poincaré ball of radius `1/√c` (fixed curvature `c>0`, default `1.0`) and computes the Möbius matrix–vector product followed by a hyperbolic (Möbius) bias translation, `y = exp_0(M·log_0(x)) ⊕_c b`, where `log_0(x)=(1/√c)·atanh(√c‖x‖)·x/‖x‖` and `exp_0(v)=(1/√c)·tanh(√c‖v‖)·v/‖v‖` are the log/exp maps at the origin and `a ⊕_c b` is Möbius addition. Reuses the `TNNetFullConnectLinear` weight layout (one neuron per output dim holds matrix row `M[j]`; the Möbius bias `b` is held coordinate-wise in the per-neuron bias). Genuinely non-Euclidean: unlike every other dense layer here (all flat `sum_j W·x` plus optional Euclidean bias), the matmul and bias act in hyperbolic space, the natural geometry for embedding trees / hierarchies with low distortion; the output always stays inside the ball (`‖y‖<1/√c`). Backward is the EXACT analytic chain rule through both radial `atanh`/`tanh` maps and the Möbius-addition Jacobian (input, matrix-row and Möbius-bias gradients all numerically gradient-checked); the `‖x‖→1/√c` boundary and `‖x‖→0` origin are guarded by an EPS/series fallback so the Jacobian stays finite. Curvature `c` round-trips via `FFloatSt[0]`; suppress-bias via `FStruct[3]`. Created with `TNNetHyperbolicLinear.Create(Dout)` (default `c=1`) or `Create(Dout, c)`. Optional trainable curvature: `Create(Dout, c, SuppressBias, LearnCurvature:=true)` makes `c` a single learnable scalar via the constrained-scalar pattern: one extra 1-weight neuron holds a raw value mapped through `c = 0.01 + 3.99*sigmoid(raw)` (kept strictly positive/bounded), the exact `dL/draw` is accumulated via a forward-mode tangent of `y` through the `log_0`/`exp_0`/Möbius-add chain, and the flag round-trips via `FStruct[4]`. The default (fixed-`c`) path is byte-for-byte unchanged.
`TNNetHyperbolicDistance`	1D, 2D, or 3D (`Depth`-vector point in the ball)	None	Poincaré-ball distance READOUT head (companion to `TNNetHyperbolicLinear`): maps the `Depth`-vector input `x` (a point inside the curvature-`c` Poincaré ball) to a `K`-vector of hyperbolic distances to `K` learnable prototype points `p_k`, `d_k = dist_c(x, p_k) = (2/√c)·atanh(√c·‖(-x) ⊕_c p_k‖)` (Möbius addition `⊕_c` as in `TNNetHyperbolicLinear`). Reuses the `TNNetFullConnectLinear` weight layout (one neuron per prototype holds `p_k`, bias suppressed). A usable hyperbolic classification/regression head — small distances mean "close in hyperbolic space". Backward is the EXACT analytic chain rule through the radial `atanh` and the Möbius-addition Jacobian (input and per-prototype gradients both numerically gradient-checked); boundary/origin guarded by EPS/series fallback. Curvature `c` round-trips via `FFloatSt[0]`, prototype count `K` via `FStruct[0]`. Created with `TNNetHyperbolicDistance.Create(K)` (default `c=1`) or `Create(K, c)`. See `examples/HyperbolicEmbedding/`.
`TNNetMonarchLinear`	1D, 2D, or 3D (`n=Depth`)	None	Sub-quadratic STRUCTURED dense layer whose `n×n` weight is a MONARCH matrix (Dao et al. 2022, Monarch: Expressive Structured Matrices for Efficient and Accurate Training): a product of two block-diagonal factors interleaved by a fixed reshape-transpose permutation, `y = Pᵀ·(L·(P·(R·x))) (+ bias)`, where `R` and `L` are block-diagonal with `b` blocks of size `m×m` (`n = b·m`, default `b = round(√n)` for perfect-square `n`, else the largest divisor ≤ √n; `b` round-trips via `FStruct[1]`) and `P` is the fixed `(b,m)→(m,b)` index-gather permutation (no weights). Stores/runs in `O(n^1.5)` instead of `O(n²)` yet provably contains the DFT, the Hadamard transform and ordinary convolutions — a genuinely different structured operator from `TNNetCirculantLinear` (single cyclic kernel), LoRA (low-rank) and `AddGroupedFullConnect` (one block-diagonal, no permutation mix). Forward applies `R`, permute, `L`, un-permute as four cheap passes (never materializing the dense `n×n`); backward is the exact transpose chain (`dL/dx`, `dL/dR`, `dL/dL` all block-local matmuls), both factor gradients numerically gradient-checked. Suppress-bias via `FStruct[3]`. The map is square, so `n` is inferred from the previous layer's size; created with `TNNetMonarchLinear.Create()` (or `Create(1)` to suppress bias).
`TNNetKroneckerLinear`	1D, 2D, or 3D (`n=Depth`)	None	Sub-quadratic STRUCTURED dense layer whose `n×n` weight is a single KRONECKER PRODUCT `W = A ⊗ B` of two small learned factors `A (p×p)` and `B (q×q)` with `n = p·q`. Stores only `O(p²+q²) ≈ O(n)` weights instead of `O(n²)`, and the dense `n×n` Kronecker matrix is NEVER materialized: `x` is reshaped to a `q×p` matrix `X` (`X[i,j] = x[i·p+j]`) and the matvec is two small GEMMs `Y = B·X·Aᵀ` (`O(n·(p+q)) = O(n^1.5)`), flattened back as `y[i·p+j] = Y[i,j] (+ bias)`. Under this row-major vec convention `y = vec(B·X·Aᵀ)` exactly equals `(A⊗B)·x`. Backward is the exact transpose chain `dL/dX = Bᵀ·dY·A`, `dL/dA = dYᵀ·(B·X)`, `dL/dB = dY·(X·Aᵀ)ᵀ` (all small GEMMs; input, `dA` and `dB` all numerically gradient-checked). The factor split `p` defaults to `round(√n)` (largest divisor ≤ √n when `n` is not a perfect square) and round-trips via `FStruct[1]`; `q = n div p`. A genuinely different structured operator from `TNNetCirculantLinear` (single cyclic kernel), `TNNetHouseholderLinear` (exactly orthogonal), `TNNetMonarchLinear` (two block-diagonal factors + permutation) and LoRA (low-rank): Kronecker is a single tensor-product factorisation. Suppress-bias via `FStruct[3]`. The map is square, so `n` is inferred from the previous layer's size; created with `TNNetKroneckerLinear.Create()` (or `Create(1)` to suppress bias, or `Create(0, p)` to pin the factor split). See `examples/KroneckerLinear/`.
`TNNetTropicalLinear`	1D, 2D, or 3D (`Din=Depth`)	None	A max-plus / min-plus morphological dense layer that computes in the TROPICAL (max-plus) semiring instead of the usual multiply-accumulate ring: `y_i = max_j (x_j + W[i,j])` (a morphological DILATION), with a paired ERODE mode `y_i = min_j (x_j + W[i,j])` selected by a constructor flag (round-trips via `FStruct[6]`). The weights are learnable additive thresholds and the combine op is max/min, so the layer learns piecewise-linear convex (dilation) / concave (erosion) functions and tropical polynomials — a genuinely different operator from `TNNetFullConnect` and the structured-linear family (`TNNetCirculantLinear`/`TNNetHouseholderLinear`/`TNNetBitLinear`, all `sum_j W·x`) and from parameter-free max/min pooling. Forward is `O(Din·Dout)`; backward is the same hard arg-max/arg-min subgradient as `TNNetMaxPool` — cache the winning `j` per output and route `dL/dx[j] += dy_i`, `dL/dW[i,j] += dy_i` (non-differentiable at ties). Both the input and weight gradients are numerically gradient-checked (away from the tie kink). Created with `TNNetTropicalLinear.Create(Dout)` (dilation) or `Create(Dout, 1)` (erode). See `examples/TropicalMorphology/`.
`TNNetTropicalConv`	2D or 3D (SizeX x SizeY x Depth)	None	The SPATIAL sibling of `TNNetTropicalLinear` — a grayscale morphological dilation/erosion convolution with a learnable additive structuring element (SE) over a `(SizeX,SizeY,Depth)` patch: `y[x,y,co] = max_{dx,dy,ci} (input[x+dx,y+dy,ci] + SE[dx,dy,ci,co])` (dilation), with a paired ERODE mode `min_{dx,dy,ci} (input − SE)` selected by a constructor flag (round-trips via `FStruct[6]`). Subclasses `TNNetConvolutionLinear` for the conv geometry (kernel/padding/stride) but replaces the multiply-accumulate with the max-plus/min-plus semiring and a single SE weight neuron — distinct from parameter-free `TNNetMaxPool` (learnable additive SE, not a fixed window) and from ordinary linear conv (`sum W·x`). Backward is the hard arg-max/arg-min one-hot subgradient (MaxPool convention): cache the winning `(dx,dy,ci)` tap per output cell and route `dy` to that single input cell and that single SE tap (non-differentiable at ties). Both input and SE gradients numerically gradient-checked (away from the tie kink). Created with `TNNetTropicalConv.Create(Features, FeatureSize, Padding, Stride)` (dilation) or `Create(Features, FeatureSize, Padding, Stride, 1)` (erode).
`TNNetCondConv`	2D or 3D (SizeX x SizeY x Depth)	None	Conditionally-parameterized ("dynamic") convolution (Yang et al. 2019, CondConv, NeurIPS): owns a bank of `K` expert kernels `W_1..W_K` (each a normal `Features × FeatureSize × FeatureSize × InChannels` kernel) plus a tiny per-sample routing head (global-average-pool → FullConnect → sigmoid) emitting `K` mixing coefficients `alpha_k` per input sample; the effective kernel is the per-sample blend `W_eff = sum_k alpha_k·W_k` applied as ONE ordinary convolution — so inference cost stays that of a single conv regardless of `K` while capacity grows with the bank. Backward routes `dL/dW_k = alpha_k·dL/dW_eff`, sends `dL/dalpha_k = <dL/dW_eff, W_k>` back through the sigmoid + FC + pool into the input, and propagates the standard conv input gradient through `W_eff` (input, expert-bank, and routing-head gradients all numerically gradient-checked). Distinct from `TNNetHyperConv` (GENERATES the whole kernel from a second tensor in one shot) and `AddMixtureOfExperts` (mixes K expert OUTPUTS post-hoc, K forward passes); CondConv mixes K kernels BEFORE the conv (one forward pass). `K`/`Features`/`FeatureSize`/`Padding`/`Stride` round-trip via `FStruct`. Created with `TNNetCondConv.Create(K, Features, FeatureSize, Padding, Stride)`. See `examples/CondConv/`.
`TNNetSpectralNorm`	1D, 2D, or 3D	None	Spectral-normalized dense layer (Miyato et al. 2018): forward divides the weight matrix by its largest singular value `sigma_1` (estimated by power iteration, `Iters` steps in `FStruct[5]`, default 10) so the effective operator `W/sigma_1` has spectral norm ~1; `sigma_1` is treated as constant in backward (input error propagated through the scaled weights).
`TNNetHighway`	1D, 2D, or 3D (shape-preserving over `Depth`)	Optional	Highway-network layer (Srivastava, Greff & Schmidhuber 2015): `y = T(x)⊙H(x) + (1−T(x))⊙x` with an input-dependent transform gate `T(x) = sigmoid(W_T·x + b_T)` and learned transform `H(x) = activation(W_H·x + b_H)`. The input-dependent learned-gate ancestor of the ResNet skip connection — unlike `TNNetReZero` (scalar) and `TNNetGatedResidual` (per-channel constant), the gate is computed from the activation each forward pass and carries the identity `(1−T)·x`. Gate bias inits negative (`b_T ≈ −1.5`, the paper's carry trick) so a fresh deep stack starts near identity. Per-channel pointwise, so it composes inside the residual builders. See `examples/HighwayDepth/`.
`TNNetHyperLinear`	main: 1D/2D/3D → `Dout`	None	HyperNetwork dense layer (Ha, Dai & Le 2016): owns NO trainable weights — its weight matrix is GENERATED by an upstream net and read from a SECOND input tensor (`WeightsSource`, a flat `Din*Dout (+Dout)` row-major matrix + optional bias) rather than from `Neurons[]`. Two-source wiring like `TNNetCrossAttention`. Forward `y = W_gen·x (+b)`; backward propagates into BOTH the main input (`W_gen^T·dy`) and the generated-weights tensor (`dy⊗x`, `dy`) so the generator trains end-to-end. Composite helper `TNNet.AddHyperLinear(Din, Dout, ContextLayer, UseBias=true)` wires the generator FullConnect (off `ContextLayer`) and the weightless `TNNetHyperLinear` (off the main path) in one call. See `examples/HyperNetwork/`.
`TNNetHyperConv`	main: 3D image → VALID conv	None	Spatial cousin of `TNNetHyperLinear`: a weightless convolution whose kernel is GENERATED by an upstream net and read from a SECOND input tensor instead of `Neurons[]`. VALID conv, stride 1; kernel laid out `W[o,ky,kx,i]` flat (`((o*K+ky)*K+kx)*InChannels + i`) plus optional per-output-channel bias. Backward propagates into BOTH the main image and the generated-kernel tensor so the generator trains end-to-end. Composite helper `TNNet.AddHyperConv(InChannels, OutChannels, FeatureSize, ContextLayer, UseBias=true)` wires the generator FullConnect and the weightless `TNNetHyperConv` in one call. Note: the generator emits the whole `OutC*K*K*InC` kernel in one shot, which caps kernel/channel size (memory/param trade-off).
`TNNetModernHopfield`	2D (SeqLen x 1 x `d`)	Pattern bank	Modern continuous Hopfield associative-memory layer (Ramsauer et al. 2020, Hopfield Networks is All You Need): an ENERGY-BASED memory that ITERATES a softmax retrieval to a fixed point against a learnable stored-pattern bank `X` of shape `(NumPatterns, d)`. Per query position, `xi := X^T·softmax(beta·X·xi)` is applied for `K` update steps — `K=1` reduces to ordinary attention, `K>1` sharpens toward the nearest stored memory (the whole point). Inverse-temperature `beta` controls the metastable-vs-sharp regime. Forward caches per-step softmax weights; backward differentiates through the unrolled `K` steps (the SDPA softmax-Jacobian path, summed over steps, scattered into both bank and input). Composite helper `TNNet.AddModernHopfieldRetrieval(NumPatterns, KSteps, Beta)` wires a `(SeqLen,1,d)` query against a fresh learnable bank. See `examples/HopfieldAssociativeMemory/`.
`TNNetProductKeyMemory`	Input: `SeqLen x 1 x QueryDim` (even `QueryDim`, divisible by `Heads` with `QueryDim/Heads` even). Output: `SeqLen x 1 x (Heads*ValueDim)`.	Half-keys + values (per head)	Large, sparsely-accessed product-key memory (Lample et al. 2019, Large Memory Layers with Product Keys): factorizes a `NumKeys`-row memory into the Cartesian product of two small half-key banks `K1`, `K2` (each `sqrt(NumKeys)` rows of `(QueryDim/Heads)/2`). A query is split in half, each half scored against its own bank, the top-`TopK` per half taken, the `TopK x TopK` combinations re-scored, and the global top-`TopK` product keys selected in `O(sqrt(NumKeys))` work; a softmax over those gates a sparse weighted sum over only the touched rows of the learned value table. Multi-head (`Heads>1`): the query is split into `Heads` contiguous sub-queries, each running an independent product-key lookup against its own `K1`/`K2`/`V` banks (`3*Heads` neurons; head `h` owns neurons `3h,3h+1,3h+2`), and the `Heads` `ValueDim`-wide outputs are concatenated along `Depth`. `Heads=1` is byte-for-byte the v1 single-head path. Forward caches the chosen indices + weights; backward scatters the value-gradient into only the touched value rows and pushes the exact softmax-Jacobian score-gradient back through both half-key dot products into the query and `K1`/`K2`. Distinct from `TNNetModernHopfield` (dense softmax over a small fully-retrieved bank), `TNNetEmbedding` (one-hot lookup), and MoE (expert MLPs). Composite helper `TNNet.AddProductKeyMemory(NumKeys, ValueDim, TopK, Heads)`. See `examples/ProductKeyMemory/`.
`TNNetNTMMemory`	Input: `T x 1 x InputDim`. Output: `T x 1 x SlotWidth` (per-step read vectors).	Key/beta/erase/add projections	The first writable differentiable external-memory layer in the library — a Neural Turing Machine (Graves et al. 2014, Neural Turing Machines, arXiv:1410.5401). Unlike the read-only associative memories (`TNNetModernHopfield` iterated recall, `TNNetProductKeyMemory` sparse lookup) it carries a persistent memory matrix `M` (`NumSlots × SlotWidth`) that the layer both reads and writes while sweeping the time axis. Per step the input is projected to a content key (cosine-addressed against every slot row, sharpened by a softplus key-strength `beta`, softmaxed over slots → weights `w`), a read `r_t = w^T·M` (the step's output), and a sigmoid erase `e` + add `a` that update `M[i] := M[i]·(1 − w[i]·e) + w[i]·a` (erase-then-add, same `w`). The four projection matrices (8 neurons `Wk,bk,Wb,bb,We,be,Wa,ba`) are the only trainables; backward is full BPTT through the recurrent `M` update — because `M_t` depends on both `M_{t-1}` and `w_t`, `dL/dM` and `dL/dw` both chain backward across steps (input, read-path and write-projection gradients all numerically gradient-checked). `NumSlots`/`SlotWidth` round-trip via `FStruct[0..1]`, `InitVal` via `FFloatSt[0]`; `M` is re-initialised to a small constant each sweep and is not persisted across loads (re-init on load, like `SetAdjacency`). v1 is content addressing only, single read+write head; location-based shift/interpolation addressing and the DNC temporal-link matrix are documented follow-ups. Created with `TNNetNTMMemory.Create(NumSlots, SlotWidth, InitVal=0.001)`. See `examples/NeuralTuringMachine/`.
`TNNetSoftDecisionTree`	Input: flat `Din`-vector. Output: `1 x 1 x OutputDepth`.	None (mixture of leaf vectors)	Differentiable soft (oblique) decision tree (Kontschieder et al. 2015, Deep Neural Decision Forests; Frosst & Hinton 2017, Distilling a Neural Network Into a Soft Decision Tree) — a structurally new hierarchical soft-routing paradigm for this library (not a matrix factorization, attention, recurrence or kernel method). A balanced binary tree of fixed depth `D` has `2^D−1` inner nodes and `2^D` leaves. Each inner node `i` is a learnable linear gate `p_i = sigmoid(beta·(w_i·x + b_i))`; a sample reaches a leaf with probability equal to the product of the left (`p_i`) / right (`1−p_i`) gate decisions along its root-to-leaf path, and the output is the path-probability-weighted mixture `y = sum_l P(leaf=l

Convolutional Layers

The terms neuron, filter, and kernel are closely related and are often used interchangeably in the context of neural networks, particularly convolutional neural networks (CNNs). Here's why:

Neurons: in artificial neural networks, neurons are the basic computational units. They receive input, process it, and produce an output. In the context of CNNs, the term "neuron" is sometimes used to refer to a single element in a feature map.
Filters: in CNNs, filters (also called convolution kernels) are small matrices of weights that slide over the input data to detect specific features. Each filter produces a feature map in the output layer.
Kernels: in image processing and CNNs, kernels are small matrices used for various operations like blurring, sharpening, or edge detection. In the context of CNNs, kernels and filters are essentially the same thing.

The reason these terms are often used synonymously is that they all contribute to the feature detection and transformation process in neural networks:
- A single filter/kernel can be thought of as a specialized neuron that detects a specific feature across the entire input.
- The weights in a filter/kernel are analogous to the weights in a traditional neuron.
- The output of applying a filter/kernel to an input region is similar to the activation of a neuron in response to its inputs.

In practice, when implementing CNNs, the terms "filter" and "kernel" are more commonly used than "neuron" when referring to the convolutional layers. However, the underlying concept of a computational unit that processes input and produces output remains the same across these terms.

Convolutional layers are fundamental building blocks in neural networks, particularly in the field of computer vision and image processing. They are designed to automatically and adaptively learn spatial hierarchies of features from input data, such as images.

In the context of the CAI Neural API, convolutional layers are implemented as classes derived from TNNetConvolutionAbstract. This abstract base class provides the core functionality for convolutional operations.

The structure of a convolutional layer typically includes:

Input: A multi-dimensional array (usually 3D for images: width, height, and channels).
Kernels (or filters): small matrices of weights that slide over the input.
Feature maps: the output produced by applying the kernels to the input.

Key parameters of convolutional layers include:

Number of features (or filters).
Feature size (kernel size).
Padding.
Stride.

The CAI Neural API offers several types of convolutional layers:

TNNetConvolution: the standard convolutional layer.
TNNetConvolutionLinear: a convolutional layer without an activation function.
TNNetConvolutionReLU: a convolutional layer with a ReLU activation function.
TNNetComplexConv: the 2-dimensional complex base rung of the Cayley–Dickson convolution family that continues with TNNetQuaternionConv and TNNetOctonionConv. Each 2-channel input patch (Re,Im) is complex-multiplied by a learned complex filter tap w = a + b·i (Re' = a·Re − b·Im, Im' = a·Im + b·Re) and accumulated, coupling the real and imaginary parts across space using only ~1/2 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 2. It reuses the gradient-checked 2×2 complex-multiply forward/backward kernel of its dense sibling TNNetComplexLinear. See examples/ComplexLinear/.
TNNetQuaternionConv: a hypercomplex convolution whose per-output-channel filter taps are quaternion weights — each 4-channel input patch is Hamilton-multiplied by a learned quaternion and accumulated, so the four channel components are coupled (rotation/scaling in quaternion space) using only ~1/4 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 4. It reuses the gradient-checked 4×4 Hamilton forward/backward kernel of its dense sibling TNNetQuaternionLinear (the library's first hypercomplex layer). See examples/QuaternionConv/ and examples/QuaternionLinear/.
TNNetOctonionConv: the 8-dimensional octonion (Cayley–Dickson) generalization of TNNetQuaternionConv — each 8-channel input patch is octonion-multiplied by a learned octonion filter tap and accumulated, coupling all eight channel components across space using only ~1/8 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 8. It reuses the same auditable, norm-multiplicativity-verified 8×8 Cayley–Dickson forward/backward kernel as its dense sibling TNNetOctonionLinear. See examples/OctonionConv/ and examples/OctonionLinear/.
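
The complex multiplication used by TNNetComplexConv can be illustrated for a single filter tap and a single (Re,Im) input pair. This is a minimal standalone sketch of the formula quoted above, not the library's kernel; `ComplexTapRe` and `ComplexTapIm` are hypothetical helper names. It also shows where the ~1/2 weight saving comes from: a full real 2-in/2-out mixing needs four weights, while the complex tap needs only `a` and `b`.

```pascal
program ComplexTapSketch;
{$mode objfpc}{$H+}

// Hypothetical helpers (not CAI API calls): apply one complex filter tap
// w = a + b*i to one input pair (InRe, InIm).
function ComplexTapRe(a, b, InRe, InIm: double): double;
begin
  Result := a * InRe - b * InIm;  // Re' = a*Re - b*Im
end;

function ComplexTapIm(a, b, InRe, InIm: double): double;
begin
  Result := a * InIm + b * InRe;  // Im' = a*Im + b*Re
end;

begin
  // (1 + 2i) * (3 + 4i) = -5 + 10i
  WriteLn('Re = ', ComplexTapRe(1.0, 2.0, 3.0, 4.0):0:1);
  WriteLn('Im = ', ComplexTapIm(1.0, 2.0, 3.0, 4.0):0:1);
end.
```

A convolution accumulates this product over every tap of the receptive field and every input pair; the sketch covers one term of that sum.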

Convolutional layers are crucial in neural networks because they:

Automatically learn hierarchical features from data.
Maintain spatial relationships in the input.
Reduce the number of parameters compared to fully connected layers.
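Make the network translation-equivariant: a shifted input produces a correspondingly shifted feature map, and combining convolutions with pooling gives approximate translation invariance.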

In practice, convolutional layers are often used in combination with other layer types, such as pooling layers (e.g., TNNetMaxPool) and normalization layers (e.g., TNNetMovingStdNormalization), to create powerful neural network architectures for tasks like image classification, object detection, and segmentation.

Here's a brief example of how to create a convolutional layer using the CAI Neural API:

NN := TNNet.Create();
NN.AddLayer([
  TNNetInput.Create(32, 32, 3),  // Input layer for 32x32 RGB images
  TNNetConvolutionLinear.Create(
    {Features=}64,     // Number of output features
    {FeatureSize=}5,   // 5x5 kernel size
    {Padding=}2,       // Padding of 2 pixels
    {Stride=}1,        // Stride of 1 pixel
    {SuppressBias=}1   // Suppress bias
  ),
  TNNetReLU6.Create()  // Activation function
]);

This example creates a convolutional layer with 64 features, a 5x5 kernel size, padding of 2, and a stride of 1, followed by a ReLU6 activation function.
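
To see how the parameters in this example interact, the sketch below computes the output size and the number of kernel weights. It assumes the standard convolution arithmetic (output = (input + 2·padding − feature size) div stride + 1); `ConvOutputSize` and `ConvWeightCount` are hypothetical helpers written for this illustration and are not part of the CAI API.

```pascal
program ConvShapeSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: output width (or height) of a convolution
// along one spatial axis, using the standard formula.
function ConvOutputSize(InputSize, FeatureSize, Padding, Stride: integer): integer;
begin
  Result := (InputSize + 2 * Padding - FeatureSize) div Stride + 1;
end;

// Hypothetical helper: number of kernel weights, bias excluded.
function ConvWeightCount(Features, FeatureSize, InputDepth: integer): integer;
begin
  Result := Features * FeatureSize * FeatureSize * InputDepth;
end;

begin
  // Values from the example above: 32x32x3 input, 64 features,
  // 5x5 kernel, padding 2, stride 1.
  WriteLn('Output size: ', ConvOutputSize(32, 5, 2, 1));
  WriteLn('Kernel weights: ', ConvWeightCount(64, 5, 3));
end.
```

Under that assumption, a padding of 2 with a 5x5 kernel and stride 1 keeps the 32x32 spatial size, and the layer holds 64·5·5·3 = 4800 kernel weights (no bias terms, since the bias is suppressed).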

These are the available convolutional layers in CAI:

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetConvolutionLinear`	1D, 2D, or 3D	None	Linear convolutional layer without activation. Useful for intermediate layers.
`TNNetSpectralNormConv`	1D, 2D, or 3D	None	Spectral-normalized convolutional layer: the convolution analogue of `TNNetSpectralNorm`. Forward divides the filters by the largest singular value `sigma_1` of the flattened kernel matrix (`out_channels x in_channels*kx*ky`, estimated by power iteration reusing `TNNet.EstimateSpectralNorm`) so the effective conv operator is norm-bounded; `sigma_1` is treated as constant in backward.
`TNNetConvolution`	1D, 2D, or 3D	tanh	Standard convolutional layer. Versatile for feature extraction in tasks like image recognition.
`TNNetConvolutionReLU`	1D, 2D, or 3D	ReLU	Convolutional layer with ReLU activation. Helps mitigate vanishing gradient problem.
`TNNetConvolutionSwish`	1D, 2D, or 3D	Swish	Convolutional layer with Swish activation. Performs better than ReLU in some cases.
`TNNetConvolutionHardSwish`	1D, 2D, or 3D	Hard Swish	Convolutional layer with Hard Swish activation. Similar to Swish but faster to compute.
`TNNetConvolutionSharedWeights`	1D, 2D, or 3D	same as linked layer	Convolutional layer that uses the weights from another layer.
`TNNetPointwiseConvLinear`	1D, 2D, or 3D	None	Linear 1x1 convolution. Useful for channel mixing without spatial operations.
`TNNetPointwiseConvReLU`	1D, 2D, or 3D	ReLU	1x1 convolution with ReLU. Efficient for channel-wise dimensionality reduction or expansion.
`TNNetPointwiseConv`	1D, 2D, or 3D	tanh	1x1 convolution. Useful for autoencoding architectures.
`TNNetDepthwiseConvLinear`	1D, 2D, or 3D	None	Linear depthwise convolution. Useful when additional non-linearity is not required.
`TNNetDepthwiseConv`	1D, 2D, or 3D	tanh	Depthwise convolution with tanh activation. Reduces computational cost by processing each channel separately.
`TNNetDepthwiseConvReLU`	1D, 2D, or 3D	ReLU	Depthwise convolution with ReLU activation. Combines depthwise efficiency with the benefits of ReLU.
`TNNet.AddSeparableConvLinear`	1D, 2D, or 3D	None	Adds a linear separable convolution. Useful for lightweight models with reduced parameter count.
`TNNet.AddSeparableConvReLU`	1D, 2D, or 3D	ReLU	Adds a separable convolution with ReLU. Combines depthwise and pointwise for efficient feature extraction.
`TNNet.AddConvOrSeparableConv`	1D, 2D, or 3D	Optional	Adds standard or separable convolution. Supports optional ReLU and normalization for versatile design.
`TNNet.AddGroupedConvolution`	1D, 2D, or 3D	Optional	Adds a grouped convolution. Allows efficient parallel processing of input channels.

Grouped pointwise convolutions are an efficient variant of standard convolutions in which the input channels are divided into groups and each group is processed separately. This is particularly useful for 1x1 (pointwise) convolutions, where the spatial dimensions are not affected. Grouping can significantly reduce the number of parameters in a neural network, as shown in the papers Grouped Pointwise Convolutions Reduce Parameters in Convolutional Neural Networks and An Enhanced Scheme for Reducing the Complexity of Pointwise Convolutions in CNNs for Image Classification Based on Interleaved Grouped Filters without Divisibility Constraints. Fewer parameters make models cheaper in both computation and memory, which suits mobile and edge computing applications where resources are constrained. Grouped pointwise convolutions can also be combined with other techniques such as normalization and intergroup connections, which allows more sophisticated network designs while maintaining model expressivity.
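
The parameter saving can be made concrete with a small calculation. The sketch below assumes the simple case where both the input and output channel counts are divisible by the number of groups (the second paper above lifts that constraint); `GroupedPointwiseWeights` is a hypothetical helper, not a CAI API call.

```pascal
program GroupedPointwiseSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: weights of a 1x1 convolution whose channels are
// split into Groups independent groups (bias excluded).
// Assumes InChannels and OutChannels are both divisible by Groups.
function GroupedPointwiseWeights(InChannels, OutChannels, Groups: integer): integer;
begin
  Result := Groups * (InChannels div Groups) * (OutChannels div Groups);
end;

var
  Groups: integer;
begin
  // 64 input channels mapped to 128 output channels.
  Groups := 1;
  while Groups <= 16 do
  begin
    WriteLn('Groups=', Groups, ' weights=',
      GroupedPointwiseWeights(64, 128, Groups));
    Groups := Groups * 2;
  end;
end.
```

With one group this is an ordinary pointwise convolution (64·128 = 8192 weights); each doubling of the group count halves the weight count, which is why grouped layers are usually paired with some form of intergroup mixing.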

The grouped pointwise convolutional layers are:

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetGroupedPointwiseConvLinear`	1D, 2D, or 3D	None	Linear 1x1 grouped convolution. Useful for channel mixing without spatial operations.
`TNNetGroupedPointwiseConvReLU`	1D, 2D, or 3D	ReLU	1x1 grouped convolution with ReLU. Efficient for channel-wise dimensionality reduction or expansion.
`TNNetGroupedPointwiseConvHardSwish`	1D, 2D, or 3D	Hard Swish	1x1 grouped convolution with the fast Hard Swish activation function.

Locally Connected Layers

A locally connected layer is a type of neural network layer that shares some similarities with convolutional layers but has some distinct characteristics:

Structure: Locally connected layers, like convolutional layers, operate on local regions of the input. However, unlike convolutional layers, they do not share weights across different positions in the input.
Weight independence: Each local region in the input has its own set of weights, which are not shared with other regions. This allows the layer to learn position-specific features.
Flexibility: Locally connected layers sit between the two alternatives. Like convolutional layers and unlike fully connected layers, they use local receptive fields; unlike convolutional layers, they keep position-specific weights.
Parameters: These layers typically have more parameters than convolutional layers due to the lack of weight sharing, which can lead to increased computational complexity and memory usage.
Use cases: Locally connected layers can be useful in scenarios where position-specific features are important, such as in face recognition tasks where different parts of the face have distinct characteristics based on their location.
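
The cost of dropping weight sharing can be estimated with a short calculation. This sketch assumes one independent filter bank per output position, as described above, and standard convolution output-size arithmetic; all three helpers are hypothetical and not part of the CAI API.

```pascal
program LocalConnectSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: output size along one spatial axis.
function OutputSize(InputSize, FeatureSize, Padding, Stride: integer): integer;
begin
  Result := (InputSize + 2 * Padding - FeatureSize) div Stride + 1;
end;

// Hypothetical helper: weights of a convolution (shared across positions).
function ConvWeights(Features, FeatureSize, InputDepth: integer): integer;
begin
  Result := Features * FeatureSize * FeatureSize * InputDepth;
end;

// Hypothetical helper: weights of a locally connected layer, assuming
// every output position owns its own copy of the filter bank.
function LocalConnectWeights(OutX, OutY, Features, FeatureSize,
  InputDepth: integer): integer;
begin
  Result := OutX * OutY * ConvWeights(Features, FeatureSize, InputDepth);
end;

var
  OutSide: integer;
begin
  // 32x32x3 input, 16 features, 3x3 window, no padding, stride 1.
  OutSide := OutputSize(32, 3, 0, 1);
  WriteLn('Output: ', OutSide, 'x', OutSide);
  WriteLn('Convolution weights: ', ConvWeights(16, 3, 3));
  WriteLn('Locally connected weights: ',
    LocalConnectWeights(OutSide, OutSide, 16, 3, 3));
end.
```

For this configuration the convolution needs 432 weights while the locally connected layer needs 30·30·432 = 388800, which is the memory and compute trade-off mentioned in the list above.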

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetLocalConnectLinear`	1D, 2D, or 3D	None	Locally connected layer without activation.
`TNNetLocalConnect`	1D, 2D, or 3D	tanh	Locally connected layer with hyperbolic tangent (tanh) as the default activation function.
`TNNetLocalConnectReLU`	1D, 2D, or 3D	ReLU	Locally connected layer with ReLU activation.

Min / Max / Avg Pools

Max, min, and average pooling are downsampling techniques used in neural networks, particularly in convolutional neural networks (CNNs). Let's explore each of these pooling types as implemented in the CAI Neural API:

Max Pooling (TNNetMaxPool): Max pooling selects the maximum value from a defined region of the input.

It reduces spatial dimensions while retaining the most prominent features.
Useful for detecting specific features regardless of their position in the input.

Min Pooling (TNNetMinPool): Min pooling selects the minimum value from a defined region of the input.

It can be useful for detecting dark features or gaps in the input.
Less common than max pooling but valuable in specific scenarios.

Average Pooling (TNNetAvgPool): Average pooling calculates the average value of a defined region of the input.

It smooths the input and can help in reducing noise.
Often used when we want to preserve more contextual information compared to max pooling.

Unique pooling variants in the API:

TNNetMinMaxPool: Performs both max and min pooling and concatenates the results.
TNNetAvgMaxPool: Combines average and max pooling.
TNNetLpPool: generalized Lp pooling y = ((1/N)·Σ|xᵢ|^p)^(1/p) over each window, with a configurable real exponent p (TNNetLpPool.Create(PoolSize, Stride, Padding, p), default p=2). p=1 is mean-of-absolute-values, p=2 is RMS pooling, and large p approaches max pooling — a single knob interpolating between average and max pooling. Its analytic backward pass ∂y/∂xᵢ = (y^(1-p)/N)·|xᵢ|^(p-1)·sign(xᵢ) is numerically gradient-checked.
TNNetSoftPool: exponentially-weighted ("softmax") pooling (Stergiou, Poppe & Kalliatakis, 2021). Over each window it computes wᵢ = exp(β·xᵢ)/Σⱼexp(β·xⱼ) and y = Σᵢ wᵢ·xᵢ (TNNetSoftPool.Create(PoolSize, Stride, Padding, β), default β=1; window softmax stabilised by subtracting the window max). The optional inverse-temperature β is a single knob spanning the average↔max family: β → ∞ recovers max pooling, β → 0 recovers average pooling, and β = 1 is the original SoftPool. Unlike max pooling every cell receives gradient: its analytic backward pass ∂y/∂xᵢ = wᵢ·(1 + β·(xᵢ − y)) is numerically gradient-checked across a β sweep.
TNNetStochasticPool: stochastic pooling (Zeiler & Fergus, 2013). Over each window it builds a probability distribution pᵢ = aᵢ/Σⱼaⱼ from the (assumed non-negative, e.g. post-ReLU) activations. While training (toggled on by TNNet.EnableDropouts(true), like dropout) it samples one cell with probability pᵢ and outputs it, routing the whole window gradient to that sampled cell (like max pooling routes to its argmax); at inference it is deterministic, outputting the probability-weighted expectation y = Σᵢ pᵢ·aᵢ. Sampling uses the library RNG so it is reproducible under a fixed RandSeed. If a window sum is ≤ 0 (degenerate / negative activations) it falls back to the plain window mean. Constructor params (size/stride/padding) match TNNetMaxPool; assumes square feature maps (SizeX = SizeY).
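
The way TNNetLpPool and TNNetSoftPool interpolate between average and max pooling can be checked numerically on a single window. The following is a minimal standalone sketch of the two forward formulas quoted above, not the library's implementation; `LpPool` and `SoftPool` are hypothetical function names, and the backward passes are omitted.

```pascal
program PoolWindowSketch;
{$mode objfpc}{$H+}
uses Math;

// Lp pooling of one window: y = ((1/N) * sum |x_i|^p)^(1/p).
function LpPool(const X: array of double; p: double): double;
var
  i: integer;
  Sum: double;
begin
  Sum := 0;
  for i := 0 to High(X) do
    Sum := Sum + Power(Abs(X[i]), p);
  Result := Power(Sum / Length(X), 1 / p);
end;

// SoftPool of one window: y = sum w_i * x_i with w = softmax(beta * x).
function SoftPool(const X: array of double; beta: double): double;
var
  i: integer;
  MaxX, W, SumW, SumWX: double;
begin
  MaxX := X[0];
  for i := 1 to High(X) do
    if X[i] > MaxX then MaxX := X[i];
  SumW := 0;
  SumWX := 0;
  for i := 0 to High(X) do
  begin
    W := Exp(beta * (X[i] - MaxX)); // subtract the window max for stability
    SumW := SumW + W;
    SumWX := SumWX + W * X[i];
  end;
  Result := SumWX / SumW;
end;

const
  Win: array[0..3] of double = (1.0, 2.0, 3.0, 4.0);
begin
  WriteLn('Lp   p=1     : ', LpPool(Win, 1.0):0:4);   // mean of |x|
  WriteLn('Lp   p=2     : ', LpPool(Win, 2.0):0:4);   // RMS
  WriteLn('Lp   p=50    : ', LpPool(Win, 50.0):0:4);  // moves toward the max
  WriteLn('Soft beta=0  : ', SoftPool(Win, 0.0):0:4); // plain average
  WriteLn('Soft beta=1  : ', SoftPool(Win, 1.0):0:4);
  WriteLn('Soft beta=50 : ', SoftPool(Win, 50.0):0:4); // moves toward the max
end.
```

For the window (1, 2, 3, 4) both layers return the average 2.5 at the low end of their knob (p=1 or β=0) and move toward the maximum 4 as the knob grows.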

Backpropagation in pooling layers: During backpropagation, pooling layers distribute the gradient differently:

Max Pooling: The gradient is passed only to the neuron that had the maximum value during the forward pass.
Min Pooling: Similar to max pooling, but for the minimum value.
Average Pooling: The gradient is divided equally among all neurons in the pooling region.

The CAI Neural API implements these backpropagation methods in the respective Backpropagate() functions of each pooling class.
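
The routing rules above can be sketched for one pooling window of four cells. This is an illustration of the general technique under simplifying assumptions (a single window, a scalar output gradient), not the code of the library's Backpropagate() methods; `MaxPoolBackward` and `AvgPoolBackward` are hypothetical names.

```pascal
program PoolBackwardSketch;
{$mode objfpc}{$H+}

const
  N = 4;

type
  TWin = array[0..N - 1] of double;

// Max pooling: the whole output gradient goes to the argmax cell.
// (Min pooling is the same with the comparison reversed.)
function MaxPoolBackward(const X: TWin; OutputGrad: double): TWin;
var
  i, ArgMax: integer;
begin
  ArgMax := 0;
  for i := 1 to N - 1 do
    if X[i] > X[ArgMax] then ArgMax := i;
  for i := 0 to N - 1 do
    Result[i] := 0;
  Result[ArgMax] := OutputGrad;
end;

// Average pooling: the output gradient is shared equally by all cells.
function AvgPoolBackward(OutputGrad: double): TWin;
var
  i: integer;
begin
  for i := 0 to N - 1 do
    Result[i] := OutputGrad / N;
end;

const
  Win: TWin = (0.2, 0.9, 0.4, 0.1);
var
  GMax, GAvg: TWin;
  i: integer;
begin
  GMax := MaxPoolBackward(Win, 1.0);
  GAvg := AvgPoolBackward(1.0);
  for i := 0 to N - 1 do
    WriteLn('cell ', i, ': max-pool grad=', GMax[i]:0:2,
      ' avg-pool grad=', GAvg[i]:0:2);
end.
```

Here only cell 1 (value 0.9) receives gradient under max pooling, while average pooling gives each of the four cells 0.25 of the output gradient.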

Deconvolution (Upsampling) counterparts: The API also provides deconvolution or upsampling layers, which can be seen as the inverse operations of pooling:

TNNetDeMaxPool: a deconvolution layer that can upsample the input.
TNNetUpsample: also known as depth_to_space, this layer can increase the spatial dimensions of the input.

These layers are crucial in architectures like autoencoders or in tasks requiring upsampling, such as image segmentation.

When to use each pooling type:

Max Pooling: it is useful for detecting features regardless of their exact location. It's commonly used in classification tasks.
Min Pooling: it is useful when the absence of features is important, or when working with inverted data.
Average Pooling: it is good for preserving more context and reducing noise. Often used in later layers of the network.
TNNetMinMaxPool: used when you want to capture both the presence and absence of features.
TNNetAvgMaxPool: used when you need to balance between preserving prominent features and maintaining context.

Layer Name	Input/Output Dimensions	Description
`TNNetAvgPool`	1D, 2D, or 3D	Average pooling layer for reducing spatial dimensions.
`TNNetAdaptiveAvgPool`	1D, 2D, or 3D	Adaptive average pooling (PyTorch `AdaptiveAvgPool2d` style): produces a fixed target output `(SizeX, SizeY)` regardless of the input spatial size, leaving depth unchanged. Each output cell averages the input cells in its adaptive window (`start=floor(o·In/Out)`, `end=ceil((o+1)·In/Out)`; windows may overlap when `In` is not a multiple of `Out`, and the backward pass accumulates each input cell's contribution). `Create(size)` makes a square output; `Create(sizeX, sizeY)` a rectangular one. Setting the target to `1×1` gives global average pooling; setting it equal to the input size is the identity.
`TNNetAdaptiveMaxPool`	1D, 2D, or 3D	Adaptive max pooling (PyTorch `AdaptiveMaxPool2d` style): produces a fixed target output `(SizeX, SizeY)` regardless of the input spatial size, leaving depth unchanged. Each output cell takes the maximum over the input cells in its adaptive window (same `start=floor(o·In/Out)`, `end=ceil((o+1)·In/Out)` mapping as `TNNetAdaptiveAvgPool`; windows may overlap when `In` is not a multiple of `Out`). The backward pass routes each output error to its window's argmax cell and accumulates (an input cell can be the argmax of several overlapping windows). `Create(size)` makes a square output; `Create(sizeX, sizeY)` a rectangular one. Setting the target to `1×1` gives global max pooling; setting it equal to the input size is the identity.
`TNNetMaxPool`	1D, 2D, or 3D	Max pooling layer for reducing spatial dimensions.
`TNNetMaxBlurPool`	2D (square)	Anti-aliased / shift-invariant max pooling (Zhang 2019, Making Convolutional Networks Shift-Invariant Again). Takes the max densely at stride 1, then applies a fixed (non-trainable) separable binomial `[1,2,1]×[1,2,1]/16` low-pass blur subsampled by the stride (borders clamped and re-normalized so the live taps sum to 1). This removes the aliasing that plain strided max pooling introduces, so the output shifts more gracefully as the input shifts. Constructor params (size/stride/padding) match `TNNetMaxPool`; assumes square feature maps (`SizeX = SizeY`). See examples/MaxBlurPool.
`TNNetBlurPool`	2D (square)	Anti-aliasing pooling primitive (Zhang 2019, Making Convolutional Networks Shift-Invariant Again). The pure low-pass sibling of `TNNetMaxBlurPool`: it applies the same fixed (non-trainable) separable binomial `[1,2,1]×[1,2,1]/16` blur subsampled by the stride (borders clamped and re-normalized so the live taps sum to 1) directly to its input, with no max stage — so it can sit after any layer (a strided conv, an average pool) to suppress aliasing, not just after a max. Constructor params (size/stride/padding) match `TNNetMaxPool`; assumes square feature maps (`SizeX = SizeY`).
`TNNetMinPool`	1D, 2D, or 3D	Min pooling layer for reducing spatial dimensions.
`TNNet.AddMinMaxPool`	1D, 2D, or 3D	Performs both min and max pooling, then concatenates the results.
`TNNet.AddAvgMaxPool`	1D, 2D, or 3D	Performs both average and max pooling, then concatenates the results.
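
The adaptive window mapping used by `TNNetAdaptiveAvgPool` and `TNNetAdaptiveMaxPool` (`start=floor(o·In/Out)`, `end=ceil((o+1)·In/Out)`) can be traced with integer arithmetic. The sketch below assumes `end` is exclusive, as in the PyTorch convention the table refers to; `WindowStart` and `WindowEnd` are hypothetical helpers, not CAI API calls.

```pascal
program AdaptiveWindowSketch;
{$mode objfpc}{$H+}

// floor(o * InSize / OutSize) for non-negative integers.
function WindowStart(o, InSize, OutSize: integer): integer;
begin
  Result := (o * InSize) div OutSize;
end;

// ceil((o + 1) * InSize / OutSize), treated here as an exclusive bound.
function WindowEnd(o, InSize, OutSize: integer): integer;
begin
  Result := ((o + 1) * InSize + OutSize - 1) div OutSize;
end;

var
  o: integer;
begin
  // Map 5 input cells onto 3 output cells along one axis.
  for o := 0 to 2 do
    WriteLn('output ', o, ' reads input cells [',
      WindowStart(o, 5, 3), ', ', WindowEnd(o, 5, 3), ')');
end.
```

With 5 input cells and 3 output cells the windows are [0,2), [1,4) and [3,5): input cells 1 and 3 each belong to two windows, which is the overlap case where the backward pass has to accumulate contributions.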

The CAI Neural API also provides specialized versions:

TNNetMaxChannel and TNNetMinChannel: reduce each channel to a single number, its maximum or minimum value.
TNNetAvgChannel: averages the entire channel into a single number per channel.

Layer Name	Input/Output Dimensions	Description
`TNNetAvgChannel`	2D or 3D (output: 1D)	Calculates the average value per channel.
`TNNetMaxChannel`	2D or 3D (output: 1D)	Calculates the maximum value per channel.
`TNNetMinChannel`	2D or 3D (output: 1D)	Calculates the minimum value per channel.
`TNNetGather`	2D or 3D (output: depth 1)	Selects a single depth channel: `Output[x,y,0] := Input[x,y,Channel]`.
`TNNet.AddMinMaxChannel`	1D, 2D, or 3D	Performs both min and max channel operations, then concatenates the results.
`TNNet.AddAvgMaxChannel`	1D, 2D, or 3D	Performs both average and max channel operations, then concatenates the results.

Trainable Normalization Layers Allowing Faster Learning/Convergence

Normalization layers may offer:

Improved training stability.
Better generalization.
Potential for faster convergence.

The available normalization techniques are:

Zero-centering (TNNetChannelZeroCenter).
Standard deviation normalization (TNNetMovingStdNormalization, TNNetChannelStdNormalization).
Per-sample layer normalization (TNNetLayerNorm, TNNetRMSNorm, TNNetGroupNorm).

See the Normalization cheat sheet for a side-by-side comparison of every normalization layer (axes reduced over, learnable parameters, formula, and when to use each).

Layer Name	Input/Output Dimensions	Description
`TNNetChannelZeroCenter`	1D, 2D, or 3D	Trainable zero-centering normalization.
`TNNetMovingStdNormalization`	1D, 2D, or 3D	Trainable standard deviation normalization.
`TNNetChannelStdNormalization`	1D, 2D, or 3D	Trainable per-channel standard deviation normalization.
`TNNetLayerNorm`	1D, 2D, or 3D	Per-sample layer normalization (zero mean, unit variance) with learnable per-element scale and bias.
`TNNetRMSNorm`	1D, 2D, or 3D	Per-sample root-mean-square normalization (no mean subtraction) with learnable per-element scale.
`TNNetRMSNormGated`	1D, 2D, or 3D	Per-sample root-mean-square normalization (no mean subtraction) followed by a learnable per-channel sigmoid gate: `y[x,y,d] = (x / sqrt(mean(x^2) + eps)) * sigmoid(g[d])`. The only learnable params are the `Depth` gate logits `g[d]` (init 0, so the gate is `0.5` at start; no per-element gamma).
`TNNetSwitchableNorm`	1D, 2D, or 3D	Learnable softmax-weighted convex combination of a LayerNorm-style and an RMSNorm-style per-sample normalization of the same input: `y = a_ln * L + a_rms * R`, where `(a_ln, a_rms) = softmax(w_ln, w_rms)`, `L = (x - mean)/sqrt(var + eps)` and `R = x / sqrt(mean(x^2) + eps)`. The only learnable params are the two scalar mixing logits (init 0, so a 50/50 blend at start; no per-element gamma/beta).
`TNNetGroupNorm`	1D, 2D, or 3D	Normalizes within `Groups` contiguous channel groups, with learnable per-element scale and bias.
`TNNetGRN`	2D or 3D	Global Response Normalization (ConvNeXt-V2, Woo et al. 2023). Channel-wise contrast normalization with learnable per-channel `gamma` and `beta` (both init 0, so identity at start): `Y = gamma * (X * Nx) + beta + X`, where `Nx[c] =
`TNNetZScore`	1D, 2D, or 3D	Per-sample z-score normalization: `y = (x - mean) / sqrt(var + eps)`. No learnable parameters; the unparameterised core of `TNNetLayerNorm`.
`TNNetDyT`	1D, 2D, or 3D	Dynamic Tanh (Liu et al. 2025): a normalization-free LayerNorm alternative `y[c] = gamma[c]·tanh(alpha·x) + beta[c]`, with a single layer-wide learnable `alpha` plus per-channel learnable `gamma` (init 1) and `beta` (init 0). No batch or per-sample statistics. Created with `TNNetDyT.Create()`.
`TNNetWeightStandardization`	1D	Weight-standardized dense layer (Qiao et al. 2019). A `TNNetFullConnectLinear` that standardizes each output neuron's weight vector to zero-mean, unit-variance (`ŵ = (w − μ)/sqrt(var + eps)`, biased variance) before the forward dot product. Smooths the loss landscape; pairs well with GroupNorm. The exact standardization Jacobian is propagated to the raw weights and is numerically gradient-checked. Created with `TNNetWeightStandardization.Create(Neurons[, eps])`.
`TNNetWeightNormLinear`	1D	Weight-normalized dense layer (the simple g=1 form of Weight Normalization, Salimans & Kingma 2016 / a differentiable unit-L2 weight constraint). A `TNNetFullConnectLinear` that L2-normalizes each output neuron's weight vector to unit norm (`ŵ = w/sqrt(Σwᵢ² + eps)`) before the forward dot product — a differentiable reparametrization, not a post-step hard projection. The exact unit-norm Jacobian is propagated to the raw weights and is numerically gradient-checked. Created with `TNNetWeightNormLinear.Create(Neurons[, eps])`.
`TNNetKANLayer`	1D	Kolmogorov-Arnold dense layer (Liu et al. 2024) — a drop-in `D_in → D_out` replacement for `TNNetFullConnectLinear` in which every input→output edge carries its own learned univariate function instead of a single scalar weight: `y_j = Σ_i φ_{ij}(x_i)`, with `φ_{ij}(x) = Σ_{k=0..K} c_{ijk}·T_k(tanh x_i)` over a fixed first-kind Chebyshev basis (only the `D_in·D_out·(K+1)` coefficients train). Distinct from `TNNetSplineActivation`, which is a depth-preserving per-channel activation. Analytic input + per-coefficient gradients (numerically gradient-checked); initialised near-linear so an untrained layer behaves like a plain linear layer. Created with `TNNetKANLayer.Create(D_out[, K])` (default `K=4`).
`TNNetKANConv`	2D	Kolmogorov-Arnold convolution — the convolutional sibling of `TNNetKANLayer`. Each receptive-field patch is mapped to an output value by a sum of learned univariate edge functions instead of a linear dot product: `y = Σ_{p,i} φ_{p,i}(x_{p,i})` with `φ(x) = Σ_{k=0..K} c_k·T_k(tanh x)` over the same fixed first-kind Chebyshev basis (only the `FeatureSize²·InputDepth·(K+1)` coefficients per filter train). Subclasses `TNNetConvolutionLinear` (no output bias), initialised near-linear, analytic input + per-coefficient gradients (numerically gradient-checked). An optional basis selector (7th constructor arg, `csKANBasisChebyshev` default / `csKANBasisBSpline`) swaps the Chebyshev polynomials for the KAN paper's original fixed-knot B-spline parameterisation — clamped open-uniform knots over the `tanh`-squashed `[-1,1]` support, `G+K` coefficients per edge evaluated by the Cox–de Boor recurrence (and its analytic derivative for the backward pass), Greville-abscissa near-linear init; B-splines have compact support (locally smooth, flat extrapolation) vs Chebyshev's global oscillatory polynomials. Created with `TNNetKANConv.Create(Features, FeatureSize, Padding, Stride, K[, SuppressBias[, Basis]])`. See `examples/KANConv/`.
`TNNet.AddMovingNorm`	1D, 2D, or 3D	Possible replacement for batch normalization.
`TNNet.AddChannelMovingNorm`	1D, 2D, or 3D	Possible replacement for batch normalization, applied per channel.

TNNetLayerNorm normalizes each input sample over all its elements (SizeX*SizeY*Depth) to zero mean and unit variance, then applies a learnable per-element scale (gamma) and bias (beta). Unlike batch normalization it does not depend on batch statistics, which makes it well suited to transformers and recurrent models. Add it with NN.AddLayer(TNNetLayerNorm.Create());.

TNNetRMSNorm is a cheaper, transformer-friendly variant that divides each sample by the root mean square of its elements (no mean subtraction) and applies a learnable per-element scale. Add it with NN.AddLayer(TNNetRMSNorm.Create());.

TNNetRMSNormGated keeps the same RMS normalization but replaces the per-element scale with a learnable per-channel sigmoid gate sigmoid(g[d]). The gate logits are initialised to 0, so an untrained layer halves each normalized activation (sigmoid(0) = 0.5) and the channels open or close independently during training. Add it with NN.AddLayer(TNNetRMSNormGated.Create());.

TNNetSwitchableNorm lets the network learn how much LayerNorm vs RMSNorm to apply to the same input. It computes both a LayerNorm-style normalization L = (x - mean)/sqrt(var + eps) and an RMSNorm-style normalization R = x / sqrt(mean(x^2) + eps) per sample, then blends them with a softmax over two learnable scalar logits: y = a_ln * L + a_rms * R with (a_ln, a_rms) = softmax(w_ln, w_rms). There is no per-element gamma/beta; the only parameters are the two mixing logits, both initialised to 0 so an untrained layer is an exact 50/50 blend. Add it with NN.AddLayer(TNNetSwitchableNorm.Create());.
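
The arithmetic described above can be sketched in plain Free Pascal. This is only an illustration of the formulas on a four-element sample, not the library's implementation: the `Eps` value and the sample data are assumptions chosen for the example, and the per-element scale and bias of TNNetLayerNorm are left out.

```pascal
program SwitchableNormSketch;
{$mode objfpc}{$H+}

const
  Eps = 1e-5; // assumed epsilon; the layer's actual default is not stated above
  X: array[0..3] of Double = (1.0, 2.0, 3.0, 6.0); // one hypothetical sample

var
  i: Integer;
  Mean, MeanSq, Variance: Double;
  WLn, WRms, ALn, ARms: Double;
  L, R: Double;
begin
  // Per-sample statistics over all elements of the sample.
  Mean := 0.0;
  MeanSq := 0.0;
  for i := Low(X) to High(X) do
  begin
    Mean := Mean + X[i];
    MeanSq := MeanSq + Sqr(X[i]);
  end;
  Mean := Mean / Length(X);
  MeanSq := MeanSq / Length(X);
  Variance := MeanSq - Sqr(Mean); // population variance

  // Two mixing logits, both 0 as in an untrained layer.
  WLn := 0.0;
  WRms := 0.0;
  ALn := Exp(WLn) / (Exp(WLn) + Exp(WRms)); // softmax over two logits
  ARms := 1.0 - ALn;

  for i := Low(X) to High(X) do
  begin
    L := (X[i] - Mean) / Sqrt(Variance + Eps); // LayerNorm-style term
    R := X[i] / Sqrt(MeanSq + Eps);            // RMSNorm-style term
    WriteLn('x=', X[i]:5:2, '  L=', L:8:4, '  R=', R:8:4,
            '  y=', (ALn * L + ARms * R):8:4);
  end;
end.
```

With both logits at 0 the two softmax weights are equal, so each printed `y` is the plain average of `L` and `R`. Changing `WLn` or `WRms` shifts the blend toward one normalization or the other.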

TNNetGroupNorm splits the input channels (Depth) into Groups contiguous groups and normalizes each group independently, then applies a learnable per-element scale and bias. Depth must be divisible by Groups; otherwise it falls back to a single group. Pass the group count to the constructor, e.g. NN.AddLayer(TNNetGroupNorm.Create(8));.

Non Trainable and per Sample Normalization Layers

Normalization layers (TNNetLayerMaxNormalization, TNNetLayerStdNormalization, TNNetLocalResponseNorm2D, TNNetLocalResponseNormDepth) help stabilize training and can improve model performance by managing the scale and distribution of activations. They are particularly useful in deep networks where the scale of values can change dramatically between layers.

TNNetLayerMaxNormalization normalizes based on the maximum value, while TNNetLayerStdNormalization uses standard deviation. These are particularly useful when you want to normalize the activations within a specific range or distribution without learning any parameters. They can be applied to various network architectures and are especially helpful when dealing with varying scales of input features.

TNNetLocalResponseNorm2D and TNNetLocalResponseNormDepth implement variants of Local Response Normalization (LRN). LRN is inspired by lateral inhibition in biological neurons and is mainly used in Convolutional Neural Networks (CNNs) for image processing tasks. Use it when you want neuron outputs in the same layer to compete with each other.

TNNetLocalResponseNorm2D is applied across nearby kernel maps at the same spatial position, while TNNetLocalResponseNormDepth normalizes across the depth dimension. These layers can improve generalization, reduce the chance of overfitting and help the model detect high-frequency features that produce a large response.

Random layers (TNNetRandomMulAdd, TNNetChannelRandomMulAdd) serve as powerful regularization techniques, helping to prevent overfitting and improve the model's ability to generalize. They can be especially beneficial when working with limited datasets or when you want your model to be robust to small variations in input.

Layer Name	Input/Output Dimensions	Description
`TNNetLayerMaxNormalization`	1D, 2D, or 3D	Non-trainable max normalization per layer.
`TNNetLayerStdNormalization`	1D, 2D, or 3D	Non-trainable standard deviation normalization per layer.
`TNNetLocalResponseNorm2D`	2D or 3D	Non-trainable local response normalization for 2D or 3D input.
`TNNetLocalResponseNormDepth`	2D or 3D	Non-trainable local response normalization with depth normalization.
`TNNetRandomMulAdd`	1D, 2D, or 3D	Adds random multiplication and random bias (shift).
`TNNetChannelRandomMulAdd`	1D, 2D, or 3D	Adds random multiplication and random bias (shift) per channel.
`TNNetGaussianNoise`	1D, 2D, or 3D	Additive `N(0, σ²)` noise at training, identity at inference. σ stored in `FFloatSt[0]`.
`TNNetGaussianDropout`	1D, 2D, or 3D	Multiplicative `N(1, σ²)` noise at training, identity at inference. σ stored in `FFloatSt[0]`.
`TNNetDropBlock`	2D or 3D	Structured spatial dropout (Ghiasi et al. 2018, "DropBlock"). At training it zeroes contiguous `block_size × block_size` square regions of the feature map — one spatial mask broadcast across all `Depth` channels — so spatially-correlated neighbours drop together (unlike `TNNetDropout` which scatters per-element, or `TNNetSpatialDropout2D` which drops whole channels). Seeds are sampled at rate `gamma = (1-keep) * feat_area / (block² * valid_area)` only where a full block fits, then dilated into blocks; survivors are rescaled by `count_all / count_kept` to preserve the expected activation. Identity at inference. Backward gates through the same stored mask. Created with `TNNetDropBlock.Create(block_size, drop_prob)`.
`TNNetMinMaxNorm`	1D, 2D, or 3D	Per-sample min-max normalization `y = (x - min(x)) / (max(x) - min(x) + eps)`, reduced over the whole sample volume so the output range is approximately `[0, 1]`. Non-trainable; `eps` defaults to 1e-7 (constructor-configurable, round-trips via Save/Load). Backward routes the bulk `1/denom` gradient plus the exact argmin/argmax coupling terms. A per-channel mode (`TNNetMinMaxNorm.Create(eps, {PerChannel:=}True)`) reduces min/max over the spatial positions ONLY, independently for each depth channel, so every channel is normalized to its own `[0, 1]` range; the flag round-trips via Save/Load and full-volume stays the default. Created with `TNNetMinMaxNorm.Create()`, `TNNetMinMaxNorm.Create(eps)`, or `TNNetMinMaxNorm.Create(eps, PerChannel)`.

These layers provide various tools for normalization, regularization, and introducing controlled variability in neural networks. The choice of which layers to use and where to place them in your network architecture depends on the specific problem you're trying to solve, the characteristics of your data, and the behavior you want to encourage in your model.
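
As a concrete instance of one of the non-trainable layers, here is a minimal Free Pascal sketch of the full-volume formula quoted for `TNNetMinMaxNorm`, `y = (x - min(x)) / (max(x) - min(x) + eps)`. It only illustrates the forward arithmetic on a hypothetical five-element sample; it does not call the library and does not cover the per-channel mode or the backward pass.

```pascal
program MinMaxNormSketch;
{$mode objfpc}{$H+}

const
  Eps = 1e-7; // default eps quoted in the table above
  X: array[0..4] of Double = (2.0, -1.0, 0.5, 3.0, 1.0); // hypothetical sample

var
  i: Integer;
  MinV, MaxV, Denom: Double;
begin
  // Reduce min and max over the whole sample.
  MinV := X[0];
  MaxV := X[0];
  for i := 1 to High(X) do
  begin
    if X[i] < MinV then MinV := X[i];
    if X[i] > MaxV then MaxV := X[i];
  end;

  // eps keeps the division defined when every element is equal.
  Denom := MaxV - MinV + Eps;

  for i := Low(X) to High(X) do
    WriteLn('x = ', X[i]:6:2, '   y = ', ((X[i] - MinV) / Denom):10:8);
end.
```

The smallest element maps to exactly 0, while the largest maps to slightly less than 1 because of `eps` in the denominator, which is why the output range is described as approximately `[0, 1]`.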

Concatenation, Summation and Reshaping Layers

These layers are essential for creating flexible and powerful neural network architectures. Let's break them down:

Concatenation Layers: There are two main types of concatenation layers in the CAI Neural API:

a. TNNetConcat:
- This layer concatenates outputs from multiple layers along the depth dimension.
- It's designed to work with layers that have the same spatial dimensions (X and Y sizes).
- Usage: It's particularly useful when you want to combine features from different processing paths in your network.

b. TNNetDeepConcat:
- This layer also concatenates outputs from multiple layers, but it's specifically optimized for the depth dimension.
- It maintains separate arrays to track the depths of each layer and channel, allowing for efficient deep concatenation.
- Usage: Ideal for creating architectures that process information in parallel and then combine the results.

Summation Layer (TNNetSum):
- This layer adds together the outputs of multiple layers element-wise.
- It's designed to work with layers of the same size.
- Usage: Commonly used in residual network (ResNet) style architectures, where it allows for skip connections that help mitigate the vanishing gradient problem and enable the training of very deep networks.

These layers provide several benefits in neural network design:

Flexibility: they allow for the creation of complex, non-linear network topologies that can process information in parallel and then combine it in various ways.
Feature Fusion: concatenation and summation layers enable the network to combine features from different processing streams, potentially capturing multi-scale or multi-aspect information.
Skip Connections: summation layers implement the additive skip connections used by ResNets, while concatenation layers provide the dense connectivity used by DenseNets.
Dimensionality Manipulation: the transposition layers allow for creative manipulations of data dimensions, which can be crucial for certain types of operations or for interfacing between different parts of a network.
Custom Architectures: these layers provide the building blocks for designing novel network architectures tailored to specific tasks or data types.

By using these layers creatively, developers can build highly customized and efficient neural network architectures that are optimized for their specific use cases.
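
The following sketch shows how a skip connection and a depth concatenation might be wired together. It is an assumption-laden illustration rather than a reference implementation: it assumes the `neuralnetwork` unit is on the unit search path, that the convolution constructors take `(Features, FeatureSize, Padding, Stride)` in the order quoted for `TNNetKANConv` in this document, and that `TNNet.AddLayerAfter` and `TNNet.DebugStructure` are available as in the upstream examples. Check the signatures against your version of the library.

```pascal
program SkipAndConcatSketch;
{$mode objfpc}{$H+}

uses
  neuralnetwork; // CAI unit; assumed to be on the unit search path

var
  NN: TNNet;
  BlockInput, PathA, PathB: TNNetLayer;
begin
  NN := TNNet.Create();
  try
    NN.AddLayer(TNNetInput.Create(32, 32, 3));
    BlockInput := NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));

    // Residual branch: two 3x3 convolutions with padding 1 and stride 1,
    // so the branch output has the same size as BlockInput.
    NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));
    PathA := NN.AddLayer(TNNetConvolutionLinear.Create(16, 3, 1, 1));

    // Skip connection: element-wise sum of the block input and the branch.
    NN.AddLayer(TNNetSum.Create([BlockInput, PathA]));
    PathA := NN.AddLayer(TNNetReLU.Create());

    // Second, cheaper path that starts again from BlockInput (1x1 convolution).
    PathB := NN.AddLayerAfter(TNNetConvolutionReLU.Create(8, 1, 0, 1), BlockInput);

    // Depth concatenation of both paths (16 + 8 channels expected).
    NN.AddLayer(TNNetDeepConcat.Create([PathA, PathB]));

    NN.DebugStructure(); // prints the layer list so the shapes can be checked
  finally
    NN.Free;
  end;
end.
```

`TNNetSum` needs both inputs to have the same size, which is why the residual branch keeps 16 channels. `TNNetDeepConcat` only needs matching spatial sizes, so the second path is free to use a different channel count.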

Layer Name	Input/Output Dimensions	Description
`TNNetConcat`	1D, 2D, or 3D	Concatenates previous layers into a single layer.
`TNNetDeepConcat`	1D, 2D, or 3D	Concatenates previous layers along the depth axis. This is useful with DenseNet-like architectures. Use `TNNetDeepConcat` instead of `TNNetConcat` if you need to add convolutions after concatenating layers.
`TNNetIdentity`	1D, 2D, or 3D	Identity layer that passes the input unchanged.
`TNNetIdentityWithoutBackprop`	1D, 2D, or 3D	Allows the forward pass but prevents backpropagation.
`TNNetLogCoshLoss`	1D, 2D, or 3D	Log-Cosh regression output head. Forward is identity passthrough; backward replaces the framework-seeded `(output - target)` gradient with `tanh(output - target)`, the bounded gradient of `L = sum log(cosh(output - target))`. Use as the last layer of a regression net. Created with `TNNetLogCoshLoss.Create()`.
`TNNetCharbonnierLoss`	1D, 2D, or 3D	Charbonnier ("smooth-L1") regression output head, popular for super-resolution. Forward is identity passthrough; backward replaces the seeded `(output - target)` gradient with `(output - target) / sqrt((output - target)^2 + eps^2)`, always bounded in [-1, 1]. `eps` defaults to 1e-3 (constructor-configurable, round-trips via Save/Load). Created with `TNNetCharbonnierLoss.Create()` or `TNNetCharbonnierLoss.Create(eps)`.
`TNNetQuantileLoss`	1D, 2D, or 3D	Quantile (pinball) regression output head for estimating conditional quantiles / prediction intervals. For target quantile `q` in (0,1) and residual `e = target - prediction`, the loss is `L_q(e) = max(q·e, (q-1)·e)` (`q=0.5` recovers the median / MAE). Forward is identity passthrough; backward replaces the framework-seeded `(prediction - target)` gradient with the subgradient `-q` when under-predicting (`e>0`), `(1-q)` when over-predicting (`e<0`), and `0` at the kink. `q` defaults to 0.5 (constructor-configurable, validated in (0,1), round-trips via Save/Load). Created with `TNNetQuantileLoss.Create()` or `TNNetQuantileLoss.Create(q)`. See `examples/QuantileRegression`.
`TNNetMultiQuantileLoss`	1D, 2D, or 3D	Single-model multi-quantile pinball head: emits an `N`-wide output (one channel per target quantile) so all `N` quantiles are predicted jointly in one forward pass instead of training `N` separate models. Each output channel `i` is trained with its own pinball loss (quantile `q_i`) against the same scalar target; backward writes the per-channel subgradient mirroring `TNNetQuantileLoss`. The quantile list is serialized (`N` capped at 8). A non-differentiable inference-time monotonicity guard, the class method `TNNetMultiQuantileLoss.SortAscending`, sorts each `N`-channel group so the `q=0.1` prediction never crosses `q=0.9` ("quantile crossing"). Created with `TNNetMultiQuantileLoss.Create()` (defaults `[0.1, 0.5, 0.9]`) or `TNNetMultiQuantileLoss.Create([...])`. See `examples/QuantileRegression`.
`TNNetNLLLoss`	1D, 2D, or 3D	Negative-log-likelihood classification output head, the companion to `TNNetLogSoftMax`. Consumes per-position log-probabilities over the depth axis. Forward is identity passthrough; backward writes the exact NLL gradient `-target` per position (so a `TNNetLogSoftMax -> TNNetNLLLoss` stack reproduces softmax cross-entropy, `softmax(logits) - target`, the numerically stable way). Created with `TNNetNLLLoss.Create()`.
`TNNetKLDivergence`	1D, 2D, or 3D	Kullback-Leibler divergence output head, `KL(target‖pred) = sum(target·log(target/pred))`. Place it after a `TNNetSoftMax` so the input is a probability distribution `q`. Forward is identity passthrough; backward writes the analytic gradient `dL/dq_i = -target_i / q_i`, with `q` clamped to `[1e-7, 1]` for stability and zero-target terms contributing no gradient (`0·log0 := 0`). Useful for soft-label / knowledge-distillation training. Created with `TNNetKLDivergence.Create()`.
`TNNetTverskyLoss`	1D, 2D, or 3D	Tversky segmentation output head (Salehi et al. 2017). Operates on probability-space inputs (after a sigmoid/softmax) with binary/one-hot targets, reduced over the whole volume. With `TP=sum(p·g)`, `FP=sum(p·(1-g))`, `FN=sum((1-p)·g)`, the Tversky index is `TI = (TP+s)/(TP+α·FP+β·FN+s)` and the loss is `L = 1 - TI`. `α`/`β` trade false positives vs false negatives (defaults 0.5/0.5), `s` is a smoothing constant (default 1.0); all round-trip via Save/Load. Forward is identity passthrough; backward writes the analytic `dL/dp_i`. Created with `TNNetTverskyLoss.Create()` or `TNNetTverskyLoss.Create(alpha, beta, smooth)`.
`TNNetDiceLoss`	1D, 2D, or 3D	Dice (Sørensen-Dice / F1) segmentation output head — the `α=β=0.5` special case of `TNNetTverskyLoss` (`L = 1 - 2·TP/(2·TP+FP+FN)`), so it reuses the Tversky forward/backward. Standard choice for class-imbalanced segmentation. Created with `TNNetDiceLoss.Create()`.
`TNNetWingLoss`	1D, 2D, or 3D	Wing regression output head (Feng et al. 2018), designed for facial-landmark localization. Per-element loss with a logarithmic core `w·ln(1+
`TNNetLabelSmoothingLoss`	1D, 2D, or 3D	Label-smoothing classification output head (Szegedy et al. 2016). Place it after a `TNNetSoftMax`. It replaces the one-hot target `t` with the smoothed `t' = (1-eps)·t + eps/NumClasses` (NumClasses = depth) and propagates the softmax cross-entropy gradient `p - t'`, discouraging over-confident logits. `eps` defaults to 0.1 (round-trips via Save/Load). Forward is identity passthrough. Created with `TNNetLabelSmoothingLoss.Create()` or `TNNetLabelSmoothingLoss.Create(eps)`.
`TNNetTripletLoss`	1D, 2D, or 3D	Triplet-margin metric-learning output head. Splits the input depth into 3 equal anchor/positive/negative chunks (`d = Depth div 3`; requires `Depth mod 3 = 0`) and per spatial cell computes the hinge `L = max(0, ‖a-p‖² - ‖a-n‖² + margin)`. There is no external target — supervision is implicit in the a\|p\|n layout. Forward is identity passthrough; when the hinge is active the backward writes `dL/da=2(n-p)`, `dL/dp=-2(a-p)`, `dL/dn=2(a-n)` into the three depth slices (zero otherwise). `margin` defaults to 1.0 (round-trips via Save/Load). Created with `TNNetTripletLoss.Create()` or `TNNetTripletLoss.Create(margin)`.
`TNNetReshape`	1D, 2D, or 3D	Reshapes the input into a different dimension.
`TNNetExpandDims`	1D, 2D, or 3D	numpy-style single-axis shape helper. Lays the whole input out as a length-`N = SizeX·SizeY·Depth` vector along a chosen axis, forcing the other two axes to size 1: axis 0 → `(N,1,1)`, axis 1 → `(1,N,1)`, axis 2 → `(1,1,N)` (default). Element-count-preserving pure reshape (identity data/gradient flow); the exact inverse of `TNNetSqueeze`. Created with `TNNetExpandDims.Create(Axis)` (default axis 2).
`TNNetSqueeze`	1D, 2D, or 3D	numpy-style shape helper that collapses any `(SizeX, SizeY, Depth)` volume to the canonical compact depth vector `(1, 1, N)`, removing unit spatial axes. Element-count-preserving pure reshape (identity data/gradient flow); inverts `TNNetExpandDims`. Less error-prone than open-coding `TNNetReshape`. `TNNetSqueeze.Create()` collapses all axes; `TNNetSqueeze.Create(Axis)` drops only the one specified unit axis (asserting the other two are size 1), the exact single-axis inverse of `TNNetExpandDims(Axis)`.
`TNNetSum`	1D, 2D, or 3D	Sums the outputs from previous layers, useful for ResNet-style networks.
`TNNetFiLM`	1D, 2D, or 3D	Feature-wise Linear Modulation (Perez et al. 2018). A parameter-free two-input layer that conditions one branch on another: `Out[x,y,c] = gamma[c]·feature[x,y,c] + beta[c]`, where the per-channel `gamma`/`beta` come from a separate conditioning branch (not the layer's own weights), so the modulation is input-dependent rather than a fixed affine. Input 0 is the feature map `(SizeX, SizeY, Depth)`; input 1 is the conditioning vector `(1, 1, 2·Depth)` packed as `gamma\|beta` (broadcast over space). Wire it with `TNNetFiLM.Create([featureLayer, condLayer])`. Backward routes error to both inputs (`dgamma=Σ feature·dOut`, `dbeta=Σ dOut`), so the conditioning sub-network trains end-to-end. With `gamma=1, beta=0` it reproduces the feature map exactly. See the worked FiLM conditioning example.
`TNNetUpsample`	3D	Upsamples channels (depth) into spatial data, converting depth into spatial resolution. For example, a 128x128x256 activation map will be converted to 256x256x64. The number of channels is always divided by 4 while the resolution increases.
`TNNetPixelShuffle`	3D	Sub-pixel convolution (Shi et al. 2016). Parameter-free depth-to-space rearrangement with a configurable upscale factor `r`: input `(W, H, C)` with `C mod (r*r) = 0` becomes `(W*r, H*r, C / (r*r))`. Created with `TNNetPixelShuffle.Create(r)` (default `r=2`). The backward pass is the exact inverse gather, so the layer round-trips cleanly.
`TNNetMaskedMean`	3D (output: `(1, SizeY, D-1)`)	Mean over the SizeX (sequence) axis with the last input channel acting as a `{0,1}` validity mask. Positions where mask ≤ 0.5 are excluded from the average; rows whose mask is entirely zero produce a zero output and zero gradient. Parameter-free.
`TNNetMaskedMax`	3D (output: `(1, SizeY, D-1)`)	Max over the SizeX (sequence) axis with the last input channel acting as a `{0,1}` validity mask. Masked-out positions are treated as `-infinity`; rows whose mask is entirely zero produce a zero output and zero gradient. Parameter-free.
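
To make one of the loss heads in the table concrete, here is a small Free Pascal sketch of the pinball loss and the subgradient described for `TNNetQuantileLoss`. The function names `PinballLoss` and `PinballGrad` are hypothetical helpers written for this illustration; they are not part of the library.

```pascal
program PinballLossSketch;
{$mode objfpc}{$H+}

// Pinball loss L_q(e) = max(q*e, (q-1)*e) with e = target - prediction.
function PinballLoss(Q, Target, Prediction: Double): Double;
var
  E: Double;
begin
  E := Target - Prediction;
  if Q * E > (Q - 1.0) * E then
    Result := Q * E
  else
    Result := (Q - 1.0) * E;
end;

// Subgradient with respect to the prediction:
// -q when under-predicting, (1-q) when over-predicting, 0 at the kink.
function PinballGrad(Q, Target, Prediction: Double): Double;
var
  E: Double;
begin
  E := Target - Prediction;
  if E > 0.0 then
    Result := -Q
  else if E < 0.0 then
    Result := 1.0 - Q
  else
    Result := 0.0;
end;

begin
  // q = 0.9: under-prediction is penalised nine times more than over-prediction.
  WriteLn('under: loss=', PinballLoss(0.9, 10.0, 8.0):6:3,
          '  grad=', PinballGrad(0.9, 10.0, 8.0):6:3);
  WriteLn('over:  loss=', PinballLoss(0.9, 10.0, 12.0):6:3,
          '  grad=', PinballGrad(0.9, 10.0, 12.0):6:3);
end.
```

For `q = 0.9` and a target of 10, predicting 8 costs 1.8 while predicting 12 costs only 0.2, so gradient descent pushes the prediction up toward the 90th percentile.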

Embedding heads

For contrastive / metric-learning models the goal is not to classify but to embed: map each input to a vector so that semantically similar inputs land close together and dissimilar ones far apart. The layers below can be composed into such a head; the table also lists related distribution and uncertainty output heads:

Layer Name	Input/Output Dimensions	Description
`TNNetL2Normalize`	1D, 2D, or 3D	Divides the input by its L2 norm so each embedding lives on the unit sphere (cosine geometry). The reduction axis is configurable: `Create()`/`Create(axis=0)` normalizes per-`(x,y)` over depth, `Create(1)` over the whole flattened sample (Keras "UnitNorm"), `Create(2)` per-channel; an optional `eps` (default 1e-8) stabilises the denominator. The exact backward Jacobian is applied. No trainable parameters.
`TNNetCosineSimilarity`	2D or 3D (output: `(SizeX, SizeY, 1)`)	Splits the input depth into two equal halves `a` and `b` and produces the per-`(x,y)` scalar `cos(a, b) = (a·b)/(‖a‖·‖b‖ + eps)`. Requires an even input depth `>= 2`. Useful as a Siamese/twin-tower similarity head. The exact cosine Jacobian is back-propagated. No trainable parameters.
`TNNetTripletLoss`	1D, 2D, or 3D	Triplet-margin metric-learning output head — splits the input depth into 3 equal `anchor\|positive\|negative` chunks and per spatial cell computes the hinge `L = max(0, ‖a-p‖² - ‖a-n‖² + margin)`. There is no external target; supervision is implicit in the `a\|p\|n` layout. See the loss-head table above for the full gradient details.
`TNNetCosineEmbeddingLoss`	1D, 2D, or 3D	Pairwise cosine-embedding metric-learning output head (PyTorch `CosineEmbeddingLoss` family) — splits the input depth as `a\|b\|y` with `Depth = 2*d + 1` (odd, `>= 3`). Per spatial cell, with `cos = (a·b)/(‖a‖·‖b‖ + eps)` and per-position label `y` (1 = similar, 0 = dissimilar), `L = y·(1 - cos) + (1 - y)·max(0, cos - margin)²`. There is no external target; supervision is implicit in the `a\|b\|y` layout. Forward is identity passthrough; backward writes the analytic cosine gradient into the `a`/`b` channels and 0 into the `y` channel. `margin` defaults to 0.0 (must be in `[-1, 1]`, round-trips via Save/Load). Created with `TNNetCosineEmbeddingLoss.Create()` or `TNNetCosineEmbeddingLoss.Create(margin)`.
`TNNetInfoNCELoss`	1D, 2D, or 3D	InfoNCE / contrastive output head (SimCLR/CPC family) — splits the input depth into `K+1` equal slabs of `d` channels each: a query `q` followed by `K` keys `k_0..k_{K-1}` where `k_0` is the POSITIVE key and the rest are negatives, so `Depth = d*(K+1)`. Per spatial cell, with dot-product similarity `s_j = (q·k_j)/tau` and `p = softmax(s)`, `L = -s_0 + logsumexp_j(s_j)`. There is no external target; supervision is implicit in the `q\|k_0\|..\|k_{K-1}` layout. Forward is identity passthrough; backward writes the analytic gradients `dL/dq = (1/tau)(Σ_j p_j·k_j - k_0)`, `dL/dk_0 = (1/tau)(p_0-1)·q`, `dL/dk_j = (1/tau)·p_j·q` (`j>0`). Embedding dim `d` is stored in `FStruct[0]` (`>= 1`) and temperature `tau` in `FFloatSt[0]` (default 0.07, must be `> 0`); both round-trip via Save/Load. SetPrevLayer validates `Depth mod d = 0` and `(Depth div d) >= 3`. Created with `TNNetInfoNCELoss.Create()` or `TNNetInfoNCELoss.Create(EmbeddingDim, Temperature)`.
`TNNetCenterLoss`	1D, 2D, or 3D	Center-loss output head (Wen et al. 2016) — a PENALTY head that pulls each feature toward a trainable per-class center, meant to be ADDED ALONGSIDE a separate classification head (it contributes only the center-pull gradient, NOT any softmax/cross-entropy term). Splits the input depth as `x\|y` with `Depth = d + 1` (`>= 2`): `x` are the `d` feature channels and the last channel holds the integer class label `y`. Per spatial cell, with active class `c = round(y)`, `L = (λ/2)·‖x - c_c‖²`. There is no external target; supervision is implicit in the `x\|y` layout. The `K` class centers (each of dim `d`) are stored as `K` trainable neurons (one weight vector each) and serialize automatically. Forward is identity passthrough; backward writes the feature gradient `dL/dx = λ·(x - c_c)` into the feature channels (0 into the label channel) and accumulates the center-pull gradient `(c_c - x)` into the active center's neuron delta. NOTE: the per-sample gradient path cannot see other minibatch samples, so the paper's cross-batch EMA center update is out of scope; centers are learned by the optimizer like any weight. `K` is stored in `FStruct[0]` (default 2, `>= 1`) and `λ` in `FFloatSt[0]` (default 1.0, `> 0`); both round-trip via Save/Load. Created with `TNNetCenterLoss.Create()` or `TNNetCenterLoss.Create(NumClasses, Lambda)`.
`TNNetVectorQuantizer`	1D, 2D, or 3D	VQ-VAE codebook bottleneck (van den Oord et al. 2017, "Neural Discrete Representation Learning"). Replaces each input feature VECTOR (the `Depth`-vector `z_e` at every spatial position) with its nearest entry `z_q` from a learnable codebook of `K` vectors (each of dim `Input.Depth`); output shape equals input shape. The `K` codebook vectors are stored as `K` trainable neurons (one `Depth`-length weight vector each) and serialize automatically. Forward picks the codebook index minimizing the squared-L2 distance to `z_e` and writes that code to the output. Backward uses the straight-through estimator (the output gradient flows to `z_e` unchanged), adds the commitment gradient `2·β·(z_e - z_q)` to the input gradient, and accumulates the codebook-pull gradient `2·(z_q - z_e)` into the chosen code's neuron delta (`FBatchUpdate` respected). `K` is stored in `FStruct[0]` (default 8, `>= 1`) and the commitment cost `β` in `FFloatSt[0]` (default 0.25, `> 0`); both round-trip via Save/Load. Created with `TNNetVectorQuantizer.Create()` or `TNNetVectorQuantizer.Create(NumCodes, Commitment)`.
`TNNetArcFace`	1D, 2D, or 3D	ArcFace additive angular-margin softmax output head (Deng et al. 2019) — a SELF-CONTAINED softmax-cross-entropy head with a trainable per-class weight matrix. Splits the input depth as `x\|y` with `Depth = d + 1` (`>= 2`): `x` are the `d` embedding channels and the last channel holds the integer class label `y`. Both the embedding and each class weight `W_k` are L2-normalized, so `cos(θ_k) = <x̂, Ŵ_k>`. For the true class `c = round(y)` the additive angular margin `m` is applied: `cos(θ'_c) = cos(θ_c)·cos(m) - sin(θ_c)·sin(m)`. With logits `z_k = s·cos(θ_k)` (`z_c = s·cos(θ'_c)`), `L = -log(softmax(z)_c)`. There is no external target; supervision is implicit in the `x\|y` layout. The `K` class weight vectors (each of dim `d`) are stored as `K` trainable neurons (one weight vector each) and serialize automatically. Forward is identity passthrough; backward writes `dL/dx` into the embedding channels (0 into the label channel) and accumulates `dL/dW_k` into each weight neuron's delta. NOTE: the per-sample gradient path cannot see other minibatch samples (standard for this framework's loss heads). `K` is stored in `FStruct[0]` (default 2, `>= 1`), margin `m` in `FFloatSt[0]` (default 0.5 rad, `>= 0`) and scale `s` in `FFloatSt[1]` (default 30.0, `> 0`); all round-trip via Save/Load. Created with `TNNetArcFace.Create()` or `TNNetArcFace.Create(NumClasses, Margin, Scale)`. See `examples/ArcFaceEmbedding` for a margin sweep showing the angular margin tighten intra-class cosine clusters.
`TNNetEvidentialRegression`	1D, 2D, or 3D (`Depth`-vector of packed params)	Deep Evidential Regression uncertainty head (Amini et al., NeurIPS 2020, arXiv:1910.02600) — a single-forward-pass regression head that reports BOTH aleatoric (data noise) and epistemic (model) uncertainty with no sampling and no ensemble. Per scalar target it emits the 4 parameters of a Normal-Inverse-Gamma (NIG) higher-order distribution: `gamma` (mean, linear), `nu`, `alpha`, `beta`. The previous layer must emit `4D` raw channels packed over `Depth` as `[gamma \| raw_nu \| raw_alpha \| raw_beta]` per target; forward is an identity-style passthrough that applies the positivity links in place — `gamma` stays linear, softplus* → `nu>0` and `beta>0`, `1+softplus` → `alpha>1` — so `Compute` returns usable parameters. Like `TNNetMixtureDensity` it OWNS its loss: `Backpropagate` overwrites `FOutputError` with the EXACT `dL/d{gamma,nu,alpha,beta}` chained through the softplus links, where `L` is the NIG negative log-likelihood (a Student-t marginal, closed form via a self-contained `LnGammaF`/`DigammaF`) PLUS the paper's evidence regularizer `lambda*
`TNNetEvidentialClassification`	1D, 2D, or 3D (`K`-vector of class evidence)	Evidential Deep Learning classification head (Sensoy et al., NeurIPS 2018, arXiv:1806.01768) — the classification sibling of `TNNetEvidentialRegression`. Treats the previous layer's `K` raw outputs as evidence for a Dirichlet over the `K`-class simplex: `alpha_k = 1 + softplus(raw_k)`, strength `S = sum_k alpha_k`. Forward is an identity-style passthrough that applies the softplus link in place so `Compute` returns the concentration vector `alpha`. From one deterministic pass it reads off the mean class probabilities `p_k = alpha_k/S` and a single scalar uncertainty mass `u = K/S` in [0,1] (`u→1` when all evidence vanishes = the network abstains) — no sampling, no ensemble. Like its regression sibling it OWNS its loss: `Backpropagate` overwrites `FOutputError` with the EXACT `dL/d(raw_k)` chained through softplus, where `L` is the EDL Bayes-risk expected-MSE (`sum_k (y_k-p_k)^2 + p_k(1-p_k)/(S+1)`, Eq. 5) PLUS a `lambda·KL`-to-uniform regularizer on the misleading-evidence Dirichlet `alpha~ = y + (1-y)*alpha` (self-contained `LnGammaF`/`DigammaF`/`TrigammaF`). Inference helpers: `Alpha(k)`, `Prediction(k) = alpha_k/S`, `Uncertainty = K/S`. Distinct from a plain `TNNetSoftMax` head (a softmax gives a point probability with no abstention signal) — EDL's `u` rises on out-of-distribution / ambiguous inputs. `K` round-trips via `FStruct[0]`, `lambda` via `FFloatSt[0]`. Created with `TNNetEvidentialClassification.Create()` or `TNNetEvidentialClassification.Create(NumClasses, Lambda)`. See `examples/EvidentialClassification/` where `u` rises ~5–6× out of distribution.
`TNNetMixtureDensity`	1D, 2D, or 3D (`Depth`-vector of packed params)	Mixture Density Network regression output head (Bishop 1994): the first head that predicts a full conditional distribution (multi-modal, heteroscedastic) rather than a point estimate, so it can model one-to-many / inverse problems where an MSE head collapses to the conditional mean. The previous layer must emit `K·(1 + 2·D)` raw channels, packed over `Depth` as `[K mixing logits \| K·D means \| K·D raw scales]`, parameterizing a `K`-component diagonal-Gaussian mixture over a `D`-dim target. Forward is an identity-style passthrough that transforms the params in place: softmax over the `K` mixing logits → `pi`, raw means untouched, softplus (`ln(1+exp(s))`, chosen over `exp` for robustness) on the scales → `sigma`, so `Compute` returns usable inference parameters. There is no external target beyond the `D` regression values: the framework seeds the target, this head reads `y` from the first `D` target channels and OWNS the negative-log-likelihood loss, with `Backpropagate` overwriting the whole `FOutputError` with the EXACT `dNLL/d(raw param)` using the numerically-stable log-sum-exp form over components (responsibilities `gamma_k`; `dNLL/da_k = pi_k - gamma_k`, `dNLL/dmu = gamma_k·(mu-y)/sigma^2`, `dNLL/ds = gamma_k·(1/sigma - (y-mu)^2/sigma^3)·sigmoid(s)`). Inference helper `SampleMixture` draws a `D`-dim sample (pick a component by its `pi` weights, then sample that diagonal Gaussian); `MixtureNLL` scores a target. `K` round-trips via `FStruct[0]` (default 2), `D` via `FStruct[1]` (default 1). Created with `TNNetMixtureDensity.Create()` or `TNNetMixtureDensity.Create(NumComponents, TargetDim)`. NOTE: the head packs over the `Depth` axis, so the trunk must emit `TNNetFullConnectLinear(1, 1, K*(1+2*D))` (not the SizeX-packing `Create(N)` form). See `examples/MixtureDensity/` for the classic one-to-many inverse-map demo where the mixture recovers the multiple branches a plain MSE head averages into the gap.
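
The packed-depth convention used by several heads in this table can be illustrated with the InfoNCE formulas quoted for `TNNetInfoNCELoss`. The Free Pascal sketch below evaluates `L = -s_0 + logsumexp_j(s_j)` and `p = softmax(s)` for a single spatial cell. The embedding values, `D = 2` and `K = 3` are assumptions picked for the example, and the code is a standalone illustration of the arithmetic, not the library's code.

```pascal
program InfoNCESketch;
{$mode objfpc}{$H+}

const
  D = 2;      // embedding dimension (hypothetical)
  K = 3;      // number of keys: k_0 is the positive, k_1 and k_2 are negatives
  Tau = 0.07; // default temperature quoted in the table above
  // One spatial cell packed over depth as q | k_0 | k_1 | k_2, Depth = D*(K+1).
  Cell: array[0..D * (K + 1) - 1] of Double =
    (1.0, 0.0,   0.9, 0.1,   0.0, 1.0,   -1.0, 0.0);

var
  j, c: Integer;
  S, P: array[0..K - 1] of Double;
  MaxS, SumExp, Loss: Double;
begin
  // Scaled dot-product similarity between the query and each key.
  for j := 0 to K - 1 do
  begin
    S[j] := 0.0;
    for c := 0 to D - 1 do
      S[j] := S[j] + Cell[c] * Cell[D * (j + 1) + c];
    S[j] := S[j] / Tau;
  end;

  // Numerically stable log-sum-exp: subtract the maximum before Exp.
  MaxS := S[0];
  for j := 1 to K - 1 do
    if S[j] > MaxS then MaxS := S[j];
  SumExp := 0.0;
  for j := 0 to K - 1 do
    SumExp := SumExp + Exp(S[j] - MaxS);

  Loss := -S[0] + MaxS + Ln(SumExp);
  WriteLn('loss = ', Loss:10:6);

  // Softmax probabilities that appear in the analytic gradients.
  for j := 0 to K - 1 do
  begin
    P[j] := Exp(S[j] - MaxS) / SumExp;
    WriteLn('p[', j, '] = ', P[j]:10:6);
  end;
end.
```

Because the positive key is almost aligned with the query here, `p[0]` is close to 1 and the loss is close to 0. Swapping `k_0` with one of the negatives in `Cell` makes the loss large.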

How to build a contrastive / metric-learning head. Build a small embedding sub-net (an MLP or conv trunk) that maps the input to an embed_dim vector, end it with TNNetL2Normalize so embeddings live on the unit sphere, then train it with TNNetTripletLoss. Because the triplet head takes no external target — supervision is implicit in its anchor|positive|negative depth layout — you feed it three embeddings at once. The cleanest fully-native way is a weight-shared siamese net: feed the triplet as three spatial positions, embed each with pointwise (featuresize=1) layers so the same weights apply to all three, TNNetL2Normalize, then TNNetReshape(1, 1, 3*embed_dim) to lay the three embeddings out as the a|p|n depth chunks the loss head consumes. At inference, drop the loss head and read the embedding directly; use TNNetCosineSimilarity (or a plain dot product on unit-norm vectors) to score pairs. See the worked Triplet embedding example.
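
The hinge and the gradients that the triplet head writes into the `a|p|n` depth chunks can be checked by hand with a short Free Pascal sketch. It works on a single cell with a hypothetical `embed_dim` of 2 and the default margin of 1.0; it mirrors the formulas given for `TNNetTripletLoss` and does not use the library.

```pascal
program TripletLossSketch;
{$mode objfpc}{$H+}

const
  D = 2;        // embedding dimension (hypothetical)
  Margin = 1.0; // default margin quoted for TNNetTripletLoss
  // One cell packed over depth as anchor | positive | negative.
  Cell: array[0..3 * D - 1] of Double = (0.0, 0.0,   0.6, 0.0,   0.0, 0.8);

var
  c: Integer;
  DistAP, DistAN, Hinge: Double;
  Grad: array[0..3 * D - 1] of Double;
begin
  // Squared distances anchor-positive and anchor-negative.
  DistAP := 0.0;
  DistAN := 0.0;
  for c := 0 to D - 1 do
  begin
    DistAP := DistAP + Sqr(Cell[c] - Cell[D + c]);
    DistAN := DistAN + Sqr(Cell[c] - Cell[2 * D + c]);
  end;

  Hinge := DistAP - DistAN + Margin;

  for c := 0 to 3 * D - 1 do
    Grad[c] := 0.0;

  if Hinge > 0.0 then
  begin
    for c := 0 to D - 1 do
    begin
      Grad[c]         := 2.0 * (Cell[2 * D + c] - Cell[D + c]); // dL/da = 2(n - p)
      Grad[D + c]     := -2.0 * (Cell[c] - Cell[D + c]);        // dL/dp = -2(a - p)
      Grad[2 * D + c] := 2.0 * (Cell[c] - Cell[2 * D + c]);     // dL/dn = 2(a - n)
    end;
  end
  else
    Hinge := 0.0; // inactive hinge: zero loss and zero gradient

  WriteLn('loss = ', Hinge:8:4);
  for c := 0 to 3 * D - 1 do
    WriteLn('grad[', c, '] = ', Grad[c]:8:4);
end.
```

In this example the negative is only slightly farther from the anchor than the positive, so the hinge is active: gradient descent moves the positive toward the anchor and the negative away from it.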

Split Channels

TNNetSplitChannels and TNNetSplitChannelEvery are specialized layer types in the CAI Neural API that allow for selective channel manipulation within neural networks.

TNNetSplitChannels: This layer is designed to pick or split selected channels from the previous layer. It provides fine-grained control over which specific channels are passed on to subsequent layers in the network.

Key features:
- It can be created with a specific range of channels (ChannelStart and ChannelLen) or with an array of specific channel indices.

Potential uses:
- Feature selection: Allowing the network to focus on specific features represented by certain channels.
- Creating multiple parallel paths in the network that process different subsets of the input channels.
- Implementing attention-like mechanisms by selectively passing certain channels forward.

TNNetSplitChannelEvery: This layer is a specialized version of TNNetSplitChannels. It splits channels at regular intervals.

Potential uses:
- Creating regular patterns of channel selection throughout the network.
- Implementing a form of grouped convolutions or channel-wise operations.
- Reducing the computational load by consistently selecting a subset of channels at regular intervals.

Both layers give fine control over which features (represented by channels) are processed in different parts of the network, which makes it possible to build more complex and more efficient architectures.

These layers can be particularly useful when:

- You want to reduce the computational complexity of your model by focusing on the most important channels.
- You're designing a network with multiple parallel paths, each operating on different subsets of the input features.
- You're implementing custom attention mechanisms or feature selection techniques within your network.

Picking the right channel-select layer. Three closely related layers select depth channels; pick by the shape of the selection:

- TNNetGather(Channel) selects a single channel (Output[x,y,0] := Input[x,y,Channel], output depth 1). This is the degenerate one-index case.
- TNNetSplitChannels selects a contiguous range (Create(ChannelStart, ChannelLen)) or an explicit list, and is the right tool for plain slicing / parallel-path splits.
- TNNetGatherChannels([i0, i1, ...]) selects an arbitrary, ordered, possibly repeated index list (Output[x,y,k] := Input[x,y,Channels[k]], output depth = list length), so it doubles as a parameter-free channel reorder / prune / duplicate. Repeats are allowed; backward accumulates the duplicated output errors onto the shared source channel. Add it via the convenience builder TNNet.AddGatherChannels([...]). See the runnable examples/GatherChannelsRouting/ demo.
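
The gather rule and its backward accumulation can be sketched for a single spatial cell in standalone Free Pascal. This is only an illustration of the indexing described above, not the layer's implementation; `GatherForward` and `GatherBackward` are hypothetical helper names and no CAI units are used.

```pascal
program GatherChannelsSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;
  TIdx = array of Integer;

// Forward: Dst[k] := Src[Channels[k]]; output depth = list length.
function GatherForward(const Src: TVec; const Channels: TIdx): TVec;
var
  k: Integer;
begin
  Result := nil;
  SetLength(Result, Length(Channels));
  for k := 0 to High(Channels) do
    Result[k] := Src[Channels[k]];
end;

// Backward: errors of repeated indices accumulate on the shared source channel.
function GatherBackward(const DstError: TVec; const Channels: TIdx;
  SrcDepth: Integer): TVec;
var
  k: Integer;
begin
  Result := nil;
  SetLength(Result, SrcDepth); // new elements start at zero
  for k := 0 to High(Channels) do
    Result[Channels[k]] := Result[Channels[k]] + DstError[k];
end;

var
  Src, Dst, DstError, SrcError: TVec;
  Channels: TIdx;
  i: Integer;
begin
  SetLength(Src, 4);
  for i := 0 to 3 do Src[i] := 10 * (i + 1);     // 10 20 30 40
  SetLength(Channels, 3);
  Channels[0] := 2; Channels[1] := 0; Channels[2] := 2;

  Dst := GatherForward(Src, Channels);            // 30 10 30
  SetLength(DstError, 3);
  DstError[0] := 1; DstError[1] := 2; DstError[2] := 3;
  SrcError := GatherBackward(DstError, Channels, Length(Src)); // 2 0 4 0

  for i := 0 to High(Dst) do Write(Dst[i]:0:0, ' ');
  WriteLn;
  for i := 0 to High(SrcError) do Write(SrcError[i]:0:0, ' ');
  WriteLn;
end.
```

Channel 2 is selected twice, so its source error is the sum of the first and third output errors (1 + 3 = 4), while the unselected channels 1 and 3 receive zero.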

Layer Name	Input/Output Dimensions	Description
`TNNetSplitChannels`	2D or 3D	Splits or copies channels from the input. This layer allows getting a subset of the input channels.
`TNNetSplitChannelEvery`	2D or 3D	Splits channels from the input every few channels. For example, this layer allows getting half (GetChannelEvery=2) or a third (GetChannelEvery=3) of the input channels.
`TNNetInterleaveChannels`	2D or 3D	Interleaves the input channels. It is particularly useful after grouped convolutions, where it mixes information between groups so that different feature groups can interact.
`TNNetCumSum`	2D or 3D	Parameter-free cumulative sum along a configurable axis. `TNNetCumSum.Create` defaults to the depth axis (`Output[x, y, c] = sum_{k=0..c} Input[x, y, k]`); `TNNetCumSum.Create(Axis)` selects `0 = X`, `1 = Y`, or `2 = Depth`. Output shape equals input shape. Useful as a learned linear position feature on a constant input.
`TNNetRoll`	2D or 3D	Circular shift by `Shift` (integer, can be negative) along a selectable axis: `TNNetRoll.Create(Shift)` rolls the depth axis (default), `TNNetRoll.Create(Shift, Axis)` selects the axis (`Axis` 0 = X, 1 = Y, 2 = Depth). E.g. depth: `Output[x, y, c] = Input[x, y, (c - Shift) mod Depth]`. Parameter-free deterministic permutation; `Create(K, a)` followed by `Create(-K, a)` round-trips to the identity. Legacy depth-roll serializations load unchanged.
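
The depth-roll formula `Output[c] = Input[(c - Shift) mod Depth]` and its round-trip property can be sketched for one spatial cell. This standalone Free Pascal sketch uses hypothetical helpers `WrapIndex` and `RollDepth` and no CAI units. One practical point it illustrates: Pascal's `mod` keeps the sign of the dividend, so a negative `c - Shift` has to be wrapped explicitly.

```pascal
program RollSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

// Maps any integer into 0..N-1 (Pascal's mod can return negative values).
function WrapIndex(i, N: Integer): Integer;
begin
  Result := ((i mod N) + N) mod N;
end;

// Circular shift along depth: Dst[c] := Src[(c - Shift) mod Depth].
function RollDepth(const Src: TVec; Shift: Integer): TVec;
var
  c, Depth: Integer;
begin
  Depth := Length(Src);
  Result := nil;
  SetLength(Result, Depth);
  for c := 0 to Depth - 1 do
    Result[c] := Src[WrapIndex(c - Shift, Depth)];
end;

var
  Src, Rolled, Back: TVec;
  i: Integer;
begin
  SetLength(Src, 5);
  for i := 0 to 4 do Src[i] := i;          // 0 1 2 3 4
  Rolled := RollDepth(Src, 2);             // 3 4 0 1 2
  Back := RollDepth(Rolled, -2);           // 0 1 2 3 4 (identity round trip)
  for i := 0 to 4 do Write(Rolled[i]:0:0, ' ');
  WriteLn;
  for i := 0 to 4 do Write(Back[i]:0:0, ' ');
  WriteLn;
end.
```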

Transposing Layers

The layers TNNetTransposeXD and TNNetTransposeYD are specialized layer types in the CAI Neural API that perform specific transposition operations on the input data. These transposition operations can be particularly useful in various neural network architectures and data processing pipelines:

- Reshaping Data: they allow for flexible reshaping of data between different network layers, which can be crucial for certain model designs.
- Feature Manipulation: by swapping spatial and depth dimensions, these layers can help in reorganizing feature representations, which might be beneficial for subsequent processing steps.
- Dimension Reduction or Expansion: depending on the input shape, these transpositions can effectively reduce or expand certain dimensions, potentially helping in compressing or expanding feature representations.
- Adapting to Different Input Formats: these layers can be useful when dealing with data that comes in different formats or when interfacing between different parts of a neural network that expect data in specific shapes.
- Custom Architecture Designs: they provide flexibility in designing custom neural network architectures that may require unconventional data flows between layers.

These layers are implemented with both forward (Compute) and backward (Backpropagate) methods, indicating that they are fully integrated into the network's training process and can be used in the middle of a network, not just as preprocessing steps. This can be particularly valuable for researchers and practitioners working on novel network designs or dealing with unconventional data structures.

Layer Name	Input/Output Dimensions	Description
`TNNetTransposeXD`	2D or 3D	It transposes the X and Depth axes of the input data. It swaps the spatial dimension along the width (X-axis) with the channel or feature dimension (Depth axis).
`TNNetTransposeYD`	2D or 3D	It transposes the Y and Depth axes of the input data. It swaps the spatial dimension along the height (Y-axis) with the channel or feature dimension (Depth axis).
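
The effect of swapping the X and Depth axes on a tensor's shape can be sketched with nested dynamic arrays. This is a standalone illustration under the assumption that the swap means `Output[d, y, x] = Input[x, y, d]`; the `TVolume` type and `TransposeXD` function here are hypothetical stand-ins, not the library's classes.

```pascal
program TransposeXDSketch;
{$mode objfpc}{$H+}

type
  TVolume = array of array of array of Double; // indexed [x][y][depth]

// Swaps the X and Depth axes: (SizeX, SizeY, Depth) -> (Depth, SizeY, SizeX).
function TransposeXD(const Src: TVolume): TVolume;
var
  x, y, d, SizeX, SizeY, Depth: Integer;
begin
  SizeX := Length(Src);
  SizeY := Length(Src[0]);
  Depth := Length(Src[0][0]);
  Result := nil;
  SetLength(Result, Depth, SizeY, SizeX);
  for x := 0 to SizeX - 1 do
    for y := 0 to SizeY - 1 do
      for d := 0 to Depth - 1 do
        Result[d][y][x] := Src[x][y][d];
end;

var
  V, T: TVolume;
  x, d: Integer;
begin
  SetLength(V, 2, 1, 3);                 // SizeX = 2, SizeY = 1, Depth = 3
  for x := 0 to 1 do
    for d := 0 to 2 do
      V[x][0][d] := 10 * x + d;
  T := TransposeXD(V);
  WriteLn('output shape: ', Length(T), ' x ', Length(T[0]), ' x ', Length(T[0][0]));
  WriteLn('T[2][0][1] = ', T[2][0][1]:0:0, ' (was V[1][0][2])');
end.
```

A 2 x 1 x 3 input becomes a 3 x 1 x 2 output, which is the sense in which these layers can expand one dimension while shrinking another.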

Layers with Activation Functions and no Trainable Parameter

Activation functions are a fundamental component of neural networks. They play several crucial roles:

- Introducing non-linearity: this allows the network to model complex, non-linear relationships in data.
- Normalizing outputs: many activation functions map inputs to a fixed range, helping to prevent issues like exploding gradients.
- Representing features: different activation functions can help in capturing various types of patterns or features in the data.

The choice of activation function can significantly impact the performance and learning capabilities of a neural network, and different problems may benefit from different activation functions.

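The CAI Neural API supports the activation functions listed in the table below. Most are parameter-free; the entries described as learnable (for example `TNNetPReLUChannel`, `TNNetSwishLearnable` and `TNNetAconC`) carry trainable parameters.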

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetReLU`	1D, 2D, or 3D	ReLU	Applies the ReLU activation function.
`TNNetReLU6`	1D, 2D, or 3D	ReLU6	ReLU activation clipped at 6.
`TNNetReLUL`	1D, 2D, or 3D	ReLUL	Leaky version of ReLU.
`TNNetLeakyReLU`	1D, 2D, or 3D	Leaky ReLU	Applies a leaky ReLU activation function.
`TNNetPReLUChannel`	1D, 2D, or 3D	PReLU/channel	Per-channel Parametric ReLU (He et al. 2015): `y = x` if `x >= 0` else `alpha[c] * x`, with one learnable `alpha` per depth channel (initialized to 0.25). Created with `TNNetPReLUChannel.Create()`.
`TNNetVeryLeakyReLU`	1D, 2D, or 3D	Very Leaky ReLU	Applies a very leaky ReLU activation function.
`TNNetRReLU`	1D, 2D, or 3D	Randomized Leaky ReLU	Randomized Leaky ReLU (Xu et al. 2015): `y = x` if `x >= 0` else `a * x`. During training (`Enabled = True`, the default) the negative slope `a` is sampled uniformly from `[lower, upper]` once per forward pass; at inference (`Enabled = False`) the fixed average slope `(lower + upper)/2` is used. Created with `TNNetRReLU.Create()` (defaults `lower = 1/8`, `upper = 1/3`) or `TNNetRReLU.Create(lower, upper)`.
`TNNetReLUSqrt`	1D, 2D, or 3D	ReLU Sqrt	ReLU activation function with square root scaling.
`TNNetSquaredReLU`	1D, 2D, or 3D	Squared ReLU	Squared ReLU activation: `relu(x)^2`. From the Primer paper (https://arxiv.org/abs/2109.08668). Created with `TNNetSquaredReLU.Create()`.
`TNNetShiftedReLU`	1D, 2D, or 3D	Shifted ReLU	Parameter-free ReLU variant `y = max(-1, x)` allowing a small negative range without saturating. Created with `TNNetShiftedReLU.Create()`.
`TNNetThreshold`	1D, 2D, or 3D	Threshold	Threshold activation: `y = x if x > theta else value`. Generalizes ReLU; useful as a sparsifier when `theta > 0`. Created with `TNNetThreshold.Create(theta, value)` (both default to 0).
`TNNetTopK`	1D, 2D, or 3D	TopK	Per spatial cell, keep the `K` largest activations along the depth axis and zero the rest. Gradient flows only through kept positions. Created with `TNNetTopK.Create(K)`.
`TNNetHardConcrete`	1D, 2D, or 3D	HardConcrete	Learnable L0-sparsity gate (Louizos et al. 2018): a per-depth-channel multiplicative gate `z in [0,1]` whose `log_alpha` is trained, so a fraction of channels are pruned to exactly 0. Stochastic hard-concrete reparameterization during training (gate enabled), deterministic gate `clip(sigmoid(log_alpha)*(zeta-gamma)+gamma,0,1)` at inference. Created with `TNNetHardConcrete.Create(beta, gamma, zeta)` (paper defaults `2/3, -0.1, 1.1`). See `examples/HardConcreteSparsity/`.
`TNNetLogSigmoid`	1D, 2D, or 3D	LogSigmoid	Stable log-sigmoid activation: `y = log(sigmoid(x)) = -softplus(-x)`. Pairs with binary cross-entropy with logits. Created with `TNNetLogSigmoid.Create()`.
`TNNetSoftPlus`	1D, 2D, or 3D	SoftPlus	SoftPlus activation, a smooth approximation of ReLU: `ln(1 + exp(x))`. Created with `TNNetSoftPlus.Create()`.
`TNNetSoftPlusBeta`	1D, 2D, or 3D	SoftPlusBeta	Generalized SoftPlus with a fixed sharpness `beta`: `y = (1/beta)·ln(1 + exp(beta·x))`, derivative `sigmoid(beta·x)`. Numerically stable for large `beta·x`. `beta = 1` recovers `TNNetSoftPlus`. Created with `TNNetSoftPlusBeta.Create(beta)` (default `1.0`).
`TNNetSoftExponential`	1D, 2D, or 3D	SoftExponential	Godfrey & Gashler parametric activation with a fixed `alpha`: `-ln(1 - alpha·(x + alpha))/alpha` for `alpha < 0`, identity for `alpha = 0`, `(exp(alpha·x) - 1)/alpha + alpha` for `alpha > 0`. Created with `TNNetSoftExponential.Create(alpha)` (default `0.0` = identity).
`TNNetSerf`	1D, 2D, or 3D	Serf	Search-of-erf activation: `y = x * erf(softplus(x))`. Smooth Mish-like drop-in (https://arxiv.org/abs/2108.09598). Created with `TNNetSerf.Create()`.
`TNNetErf`	1D, 2D, or 3D	Erf	Gauss error function activation: `y = erf(x)`. Closed-form GELU partner with derivative `(2/sqrt(pi)) * exp(-x^2)`. Reuses the Abramowitz–Stegun 7.1.26 polynomial helper that powers `TNNetSerf` (FPC's math unit does not export `erf`). Created with `TNNetErf.Create()`.
`TNNetSwishLearnable`	1D, 2D, or 3D	SwishL	Swish with a single learnable scalar `beta`, initialised to 1.0 (starts identical to `TNNetSwish`). Forward `y = x * sigmoid(beta*x)`; backward updates both input gradient and `beta` (Ramachandran et al. 2017, https://arxiv.org/abs/1710.05941). Created with `TNNetSwishLearnable.Create()`.
`TNNetMishLearnable`	1D, 2D, or 3D	MishL	Mish with a single learnable inner-scale `alpha`, initialised to 1.0 (starts identical to `TNNetMish`). Forward `y = x * tanh(softplus(alpha*x))`; backward updates both the input gradient and `alpha`. Sibling of `TNNetSwishLearnable`. Created with `TNNetMishLearnable.Create()` or `TNNetMishLearnable.Create(alpha)`.
`TNNetAconC`	1D, 2D, or 3D	ACON-C	"Activate Or Not" (Ma et al. 2021), a learnable generalization of Swish: `y = (p1-p2)·x·sigmoid(beta·(p1-p2)·x) + p2·x` with one learnable triple `(p1, p2, beta)` per depth channel, initialised to `(1, 0, 1)` so an untrained layer is exactly `TNNetSwish`. Backward updates the input gradient and all three per-channel parameters. Created with `TNNetAconC.Create()`.
`TNNetSReLU`	1D, 2D, or 3D	S-shaped ReLU	S-shaped ReLU (Jin et al. 2016): a continuous piecewise-linear activation with four learnable parameters per depth channel — right knee `(t_r, a_r)` and left knee `(t_l, a_l)`. `y = t_r + a_r·(x - t_r)` for `x >= t_r`, `y = t_l + a_l·(x - t_l)` for `x <= t_l`, else `y = x`. Initialised to `(t_r, a_r, t_l, a_l) = (0, 1, 0, 0)` so an untrained layer is exactly `TNNetReLU` (set `a_l = 0.01` for a leaky start). Backward updates the input gradient and all four per-channel parameters. Created with `TNNetSReLU.Create()` or `TNNetSReLU.Create(t_r, a_r, t_l, a_l)`.
`TNNetAPL`	1D, 2D, or 3D	APL	Adaptive Piecewise Linear unit (Agostinelli et al. 2015, https://arxiv.org/abs/1412.6830): `h(x) = max(0, x) + Σ_{s=1..S} a[s,c]·max(0, -x + b[s,c])` with `S` learnable hinges (default 2) per depth channel, each having a slope `a[s,c]` and a knee `b[s,c]` (2·S·Depth learnable scalars total). Initialised with slopes `a = 0.25` and knees spread over `[0,1]`. Backward updates the input gradient and all per-channel slopes and knees. Created with `TNNetAPL.Create()` or `TNNetAPL.Create(NumHinges)`.
`TNNetSplineActivation`	1D, 2D, or 3D	Spline	KAN-flavored (Kolmogorov-Arnold) per-channel learnable piecewise-linear activation: `K+1` learnable control-point values `y[0..K,c]` at `K+1` FIXED, evenly-spaced knots over `[-Range, +Range]`, linearly interpolated (and linearly extrapolated beyond the end knots). `(K+1)·Depth` learnable scalars total; only the values are trained, the knots are fixed. Initialised to the identity (`y[i,c] = t[i]`) so an untrained layer is exactly `y = x` everywhere. Backward updates the input gradient (the local segment slope) and the two bracketing control points. Created with `TNNetSplineActivation.Create()` (K=4 intervals, Range=2.0) or `TNNetSplineActivation.Create(NumIntervals, Range)`.
`TNNetMetaAconC`	1D, 2D, or 3D	Meta-ACON	Data-dependent-`beta` sibling of `TNNetAconC` (Ma et al. 2021): the ACON-C switch `beta[c]` is generated from a spatial squeeze of the input, `beta[c] = sigmoid(gamma[c]·mean_spatial(x_c) + delta[c])`, so the activation adapts per sample (vs `TNNetAconC`'s static learned `beta`). Four learnable per-channel parameters `(p1, p2, gamma, delta)`; backward carries the extra gradient path through the squeeze mean. Uses a per-channel affine-over-squeeze as a tractable in-pattern simplification of the paper's cross-channel bottleneck. Created with `TNNetMetaAconC.Create()`.
`TNNetSoftPlusBetaLearnable`	1D, 2D, or 3D	SoftPlusBetaL	Learnable-`beta` variant of `TNNetSoftPlusBeta`: `y = (1/beta)·ln(1 + exp(beta·x))` with a single learnable `beta` (default 1.0), derivative `sigmoid(beta·x)`. Backward updates both the input gradient and `beta`. Created with `TNNetSoftPlusBetaLearnable.Create()` or `TNNetSoftPlusBetaLearnable.Create(beta)`.
`TNNetPhish`	1D, 2D, or 3D	Phish	Phish activation: `y = x * tanh(gelu(x))`, with GELU computed via the tanh approximation (Naveen, 2022, https://arxiv.org/abs/2208.04458). Smooth Mish/Serf sibling. Created with `TNNetPhish.Create()`.
`TNNetISRU`	1D, 2D, or 3D	ISRU	Inverse Square Root Unit: `y = x / sqrt(1 + alpha * x^2)`. Everywhere smooth, derivative `1 / (1 + alpha*x^2)^(3/2)` (Carlile et al., 2017, https://arxiv.org/abs/1710.09967). Created with `TNNetISRU.Create()` or `TNNetISRU.Create(alpha)` (default `alpha = 1.0`, must be `> 0`).
`TNNetISRLU`	1D, 2D, or 3D	ISRLU	Inverse Square Root Linear Unit: `y = x` for `x >= 0`, `y = x / sqrt(1 + alpha * x^2)` for `x < 0` (Carlile et al., 2017). Identity-on-the-right sibling of ISRU. Created with `TNNetISRLU.Create()` or `TNNetISRLU.Create(alpha)`.
`TNNetTanhExp`	1D, 2D, or 3D	TanhExp	TanhExp activation: `y = x * tanh(exp(x))`. Smooth, high-convergence ReLU alternative (https://arxiv.org/abs/2003.09855). Created with `TNNetTanhExp.Create()`.
`TNNetBentIdentity`	1D, 2D, or 3D	BentIdentity	Bent Identity activation: `y = (sqrt(x^2 + 1) - 1)/2 + x`. Smooth, with always-positive slope. Created with `TNNetBentIdentity.Create()`.
`TNNetLisht`	1D, 2D, or 3D	LiSHT	Linearly Scaled Hyperbolic Tangent: `y = x * tanh(x)`. Non-monotonic smooth ReLU alternative. Created with `TNNetLisht.Create()`.
`TNNetGaussianActivation`	1D, 2D, or 3D	Gaussian	Gaussian activation: `exp(-x^2)`. Created with `TNNetGaussianActivation.Create()`.
`TNNetSign`	1D, 2D, or 3D	Sign	Sign activation: `y = sign(x)`. Saturated straight-through-estimator backward (gradient passes through only on `
`TNNetSqrt`	1D, 2D, or 3D	Sqrt	Eps-clamped square root: `y = sqrt(max(x, 1e-6))`. Created with `TNNetSqrt.Create()`.
`TNNetExp`	1D, 2D, or 3D	Exp	Overflow-clamped exponential: `y = exp(min(x, 30))`. Created with `TNNetExp.Create()`.
`TNNetLog`	1D, 2D, or 3D	Log	Eps-clamped natural log: `y = ln(max(x, 1e-8))`. Created with `TNNetLog.Create()`.
`TNNetReciprocal`	1D, 2D, or 3D	Reciprocal	Eps-clamped reciprocal: `y = 1/(sign(x) * max(
`TNNetSELU`	1D, 2D, or 3D	SELU	Self-normalizing activation function.
`TNNetSigmoid`	1D, 2D, or 3D	Sigmoid	Sigmoid activation function.
`TNNetSoftMax`	1D, 2D, or 3D	SoftMax	SoftMax activation function.
`TNNetCenteredSoftmax`	1D, 2D, or 3D	C SoftMax	SoftMax preceded by per-sample mean subtraction. Mathematically equivalent to `TNNetSoftMax` (softmax is shift-invariant) so the input gradient is identical; differs only in the numerical-stability profile of the forward `exp`. Drop-in replacement when extreme input magnitudes risk overflow. Created with `TNNetCenteredSoftmax.Create()`.
`TNNetEntropyRegularizer`	1D, 2D, or 3D	Passthrough	Identity forward; backward injects an extra `lambda * (ln(p + 1e-7) + 1)` gradient that corresponds to adding `-lambda * H(p)` to the loss. Place right after a softmax: `lambda > 0` encourages confident (low-entropy) outputs; `lambda < 0` encourages uniform ones. Created with `TNNetEntropyRegularizer.Create(lambda)` (default `lambda = 0.01`).
`TNNetGradientReversal`	1D, 2D, or 3D	Passthrough	Identity forward; backward multiplies the upstream gradient by `-lambda` (Ganin et al. 2015, https://arxiv.org/abs/1505.07818). Used as the hinge between a shared feature trunk and an adversarial domain-classifier head in Domain-Adversarial Neural Networks (DANN), so the trunk is steered toward features the adversary cannot exploit. Created with `TNNetGradientReversal.Create(lambda)` (default `lambda = 1.0`).
`TNNetCoordConv`	2D or 3D	+ 2 channels	Parameter-free CoordConv (Liu et al. 2018, https://arxiv.org/abs/1807.03247). Concatenates two normalized X/Y coordinate channels (`(2x/(SizeX-1)) - 1` and `(2y/(SizeY-1)) - 1`, both in `[-1, 1]`; 0 when the corresponding axis has size 1) to the input on the depth axis. Output shape is `(SizeX, SizeY, Depth + 2)`. The coordinate channels carry no gradient — backward forwards only the first `Depth` error channels to the previous layer. Placing CoordConv immediately before a convolution gives that convolution direct access to absolute `(x, y)` position. Created with `TNNetCoordConv.Create()`.
`TNNetSoftMaxOne`	1D, 2D, or 3D	SoftMaxOne	"Off by one" softmax: `y_i = exp(x_i) / (1 + sum_j exp(x_j))` (Miller, 2023). Outputs do NOT sum to 1; the leftover mass lets attention attend to nothing without an explicit sink token. Numerically-stable max-shift forward; full softmax-Jacobian backward. Created with `TNNetSoftMaxOne.Create()`.
`TNNetGumbelSoftmax`	1D, 2D, or 3D	Gumbel SoftMax	Differentiable categorical sampling (Jang et al. 2016 / Maddison et al. 2016): `y = softmax((logits + g) / tau)` with `g = -ln(-ln(U))`, `U ~ Uniform(0,1)`. The Gumbel noise is added only while training (the layer descends `TNNetAddNoiseBase`, so `EnableDropouts(true)` turns it on); inference is the deterministic `softmax(logits / tau)`. Lower `tau` sharpens toward a one-hot draw. In hard mode the forward output is the one-hot argmax while the backward uses the soft sample's exact softmax-Jacobian (times `1/tau`) as a straight-through estimator. Created with `TNNetGumbelSoftmax.Create()` (`tau = 1.0`, soft) or `TNNetGumbelSoftmax.Create(tau, hard)`.
`TNNetSparsemax`	1D, 2D, or 3D	Sparsemax	Euclidean projection onto the probability simplex (Martins & Astudillo, 2016, https://arxiv.org/abs/1602.02068), applied per spatial `(x, y)` over the depth axis (same scope as `TNNetPointwiseSoftMax`). Forward sorts the depth vector descending to find the support size `k`, then writes `p[i] = max(0, z[i] - tau)` where `tau = (sum(z_sorted[0..k-1]) - 1) / k`. Outputs sum to 1 and contain TRUE zeros outside the support — a natural drop-in for sparse attention. Backward is the JVP through the support set: `grad_z[i] = grad_p[i] - mean_{j in S}(grad_p[j])` for `i in S`, `0` otherwise. Created with `TNNetSparsemax.Create()`.
`TNNetPointwiseSoftMax`	2D or 3D	1x1 SoftMax	Pointwise (1x1) SoftMax activation function.
`TNNetSinkhorn`	2D (N x 1 x N)	None	Differentiable optimal-transport / doubly-stochastic normalization (Mena et al. 2018, Learning Latent Permutations with Gumbel-Sinkhorn Networks, arXiv:1802.08665) — where `TNNetSoftMax`/`TNNetSparsemax`/entmax normalize ONE axis, this is the first layer that normalizes a square `(N,1,N)` score matrix to be doubly stochastic (every row AND every column sums to 1). Forward iterates Sinkhorn–Knopp in log-space for stability: starting from `score/tau` it runs `KIter` alternating row then column subtract-logsumexp normalizations, then `exp`. Because doubly-stochastic matrices are the convex hull of permutation matrices, as temperature `tau → 0` the output sharpens toward a hard permutation, making a permutation a smooth differentiable function of a score matrix — the building block for differentiable sorting / learnable permutations / soft bipartite matching (the soft, trainable relaxation of the hard non-differentiable Hungarian/auction assignment). No trainable parameters. Backward unrolls all `2·KIter` steps (each caches its input log-matrix) and applies the exact softmax-style adjoint of each subtract-logsumexp, then the trailing `exp` and `1/tau` factors; the input gradient is numerically gradient-checked (max-abs-err ≈1.7e-5). `KIter`/`tau` round-trip via `FStruct[0]`/`FFloatSt[0]`; `SetTau` allows annealing across training. Created with `TNNetSinkhorn.Create(KIter=20, tau=1.0)`. See `examples/SinkhornSort/`.
`TNNetPointwiseNorm`	2D or 3D	1x1 Norm	Pointwise (1x1) normalization.
`TNNet.AddGroupedPointwiseSoftMax`	2D or 3D	Gr 1x1 SoftMax	Grouped pointwise (1x1) SoftMax.
`TNNetSwish`	1D, 2D, or 3D	Swish	Swish activation function.
`TNNetSwish6`	1D, 2D, or 3D	Swish 6	Swish activation clipped at 6.
`TNNetHardSwish`	1D, 2D, or 3D	Hard Swish	Hard version of Swish activation.
`TNNetESwish`	1D, 2D, or 3D	ESwish	Beta-generalized Swish: `y = beta * x * sigmoid(beta * x)`. Created with `TNNetESwish.Create(beta)` (default `beta = 1.25`).
`TNNetHyperbolicTangent`	1D, 2D, or 3D	tanh	Hyperbolic tangent activation function.
`TNNetLeCunTanh`	1D, 2D, or 3D	LeCunTanh	LeCun scaled tanh: `y = 1.7159 * tanh((2/3) * x)`, tuned so `f(+/-1) ~= +/-1` (LeCun et al., "Efficient Backprop", 1998). Created with `TNNetLeCunTanh.Create()`.
`TNNetSinhAct`	1D, 2D, or 3D	Sinh	Hyperbolic sine activation: `y = sinh(x)`, derivative `cosh(x)`. Unbounded; use only with bounded inputs. Created with `TNNetSinhAct.Create()`.
`TNNetArcSinh`	1D, 2D, or 3D	ArcSinh	Inverse hyperbolic sine activation: `y = arcsinh(x) = ln(x + sqrt(x^2 + 1))`, derivative `1/sqrt(x^2 + 1)`. Monotonic, smooth, never saturates. Created with `TNNetArcSinh.Create()`.
`TNNetLogCoshActivation`	1D, 2D, or 3D	LogCosh	Log-Cosh activation: `y = log(cosh(x))`, derivative `tanh(x)`. Smooth-L1 style; behaves like `x^2/2` near zero and like `
`TNNetSin`	1D, 2D, or 3D	Sin	Periodic activation: `y = sin(x)`. Useful as a SIREN-style coordinate activation. Created with `TNNetSin.Create()`.
`TNNetCos`	1D, 2D, or 3D	Cos	Periodic activation: `y = cos(x)`. Phase-shifted partner to `TNNetSin`. Created with `TNNetCos.Create()`.
`TNNetSinc`	1D, 2D, or 3D	Sinc	Unnormalized sinc activation: `y = sin(x)/x`, with analytic limit `y = 1` at `x = 0`. Created with `TNNetSinc.Create()`.
`TNNetPower`	1D, 2D, or 3D	Power	Applies a power activation function.
`TNNetMulByConstant`	1D, 2D, or 3D	* C	Multiplies the output by a constant.
`TNNetNegate`	1D, 2D, or 3D	* -1	Multiplies the previous output by -1.
`TNNetSignedSquareRoot`	1D, 2D, or 3D	SSR	Square root of the input absolute value preserving the original sign. `y = Sign(x) * Sqrt(Abs(x))`
`TNNetSignedSquareRoot1`	1D, 2D, or 3D	SSR1	If `Abs(x) < 1` then `y = x`, otherwise, `y = Sign(x) * Sqrt(Abs(x))`.
`TNNetSignedSquareRootN`	1D, 2D, or 3D	SSRN	If `Abs(x) < N` then `y = x`, otherwise, `y = Sign(x) * Sqrt(Abs(x)-N+1)+N-1`.
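
As one worked instance from the table, the sparsemax forward pass (sort descending, find the support size `k`, compute `tau`, clip) can be sketched for a single depth vector. This is a standalone Free Pascal illustration of the formula quoted in the `TNNetSparsemax` row, not the layer's code; the `Sparsemax` function name is a hypothetical helper.

```pascal
program SparsemaxSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

function Sparsemax(const Z: TVec): TVec;
var
  Sorted: TVec;
  i, j, SupportSize: Integer;
  Tmp, CumSum, SupportSum, Tau: Double;
begin
  // Sort a copy in descending order (insertion sort, fine for small depth).
  Sorted := Copy(Z);
  for i := 1 to High(Sorted) do
  begin
    Tmp := Sorted[i];
    j := i - 1;
    while (j >= 0) and (Sorted[j] < Tmp) do
    begin
      Sorted[j + 1] := Sorted[j];
      Dec(j);
    end;
    Sorted[j + 1] := Tmp;
  end;
  // Support size k: the largest k with 1 + k * z_sorted[k-1] > sum of the top k.
  CumSum := 0;
  SupportSize := 0;
  SupportSum := 0;
  for i := 0 to High(Sorted) do
  begin
    CumSum := CumSum + Sorted[i];
    if 1 + (i + 1) * Sorted[i] > CumSum then
    begin
      SupportSize := i + 1;
      SupportSum := CumSum;
    end;
  end;
  // tau = (sum(z_sorted[0..k-1]) - 1) / k, then p[i] = max(0, z[i] - tau).
  Tau := (SupportSum - 1) / SupportSize;
  Result := nil;
  SetLength(Result, Length(Z));
  for i := 0 to High(Z) do
    if Z[i] > Tau then
      Result[i] := Z[i] - Tau
    else
      Result[i] := 0;
end;

var
  Z, P: TVec;
  i: Integer;
begin
  SetLength(Z, 3);
  Z[0] := 1.0; Z[1] := 0.8; Z[2] := 0.1;
  P := Sparsemax(Z);
  // Support size is 2 and tau = 0.4, so this prints 0.600 0.400 0.000.
  for i := 0 to High(P) do Write(P[i]:0:3, ' ');
  WriteLn;
end.
```

The third entry is an exact zero rather than a small positive value, which is the property that distinguishes sparsemax from softmax.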

Gated Linear Units

Gated Linear Units split the input along the channel (depth) axis into two equal halves A and B, and output A multiplied by a gating activation applied to B. The output depth is therefore half of the input depth, and the input depth must be even. These layers have no trainable parameters. They are commonly used inside transformer feed-forward blocks (https://arxiv.org/abs/2002.05202).

Layer Name	Input/Output Dimensions	Description
`TNNetGLU`	1D, 2D, or 3D (even depth)	Gated Linear Unit: outputs `A * sigmoid(B)` (https://arxiv.org/abs/1612.08083). Created with `TNNetGLU.Create()`.
`TNNetGEGLU`	1D, 2D, or 3D (even depth)	GELU-gated linear unit: outputs `A * GELU(B)`. Created with `TNNetGEGLU.Create()`.
`TNNetSwiGLU`	1D, 2D, or 3D (even depth)	Swish-gated linear unit: outputs `A * Swish(B)`, where `Swish(x) = x * sigmoid(x)`. Created with `TNNetSwiGLU.Create()`.
`TNNetTanhGLU`	1D, 2D, or 3D (even depth)	Tanh-gated linear unit: outputs `A * tanh(B)`. Parameter-free; mirrors `TNNetGLU` with the sigmoid gate swapped for tanh. Created with `TNNetTanhGLU.Create()`.
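
The gating rule can be sketched for one depth vector in standalone Free Pascal. This sketch assumes that A is the first half of the depth axis and B the second half, which the text above does not state explicitly; `GLU` and `Sigmoid` are hypothetical helper names and no CAI units are used.

```pascal
program GLUSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

function Sigmoid(x: Double): Double;
begin
  Result := 1 / (1 + Exp(-x));
end;

// Splits Src into halves A (first) and B (second) and returns A * sigmoid(B).
// The output depth is half of the input depth.
function GLU(const Src: TVec): TVec;
var
  i, Half: Integer;
begin
  if Odd(Length(Src)) then
  begin
    WriteLn('GLU: input depth must be even');
    Halt(1);
  end;
  Half := Length(Src) div 2;
  Result := nil;
  SetLength(Result, Half);
  for i := 0 to Half - 1 do
    Result[i] := Src[i] * Sigmoid(Src[Half + i]);
end;

var
  Src, Dst: TVec;
  i: Integer;
begin
  SetLength(Src, 4);
  Src[0] := 1; Src[1] := 2;   // A
  Src[2] := 0; Src[3] := 0;   // B, sigmoid(0) = 0.5
  Dst := GLU(Src);
  // Prints 0.50 1.00
  for i := 0 to High(Dst) do Write(Dst[i]:0:2, ' ');
  WriteLn;
end.
```

Swapping `Sigmoid` for GELU, Swish or tanh in the inner loop gives the GEGLU, SwiGLU and TanhGLU variants listed in the table.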

Attention

Layer Name	Input/Output Dimensions	Description
`TNNetScaledDotProductAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Single-head scaled dot-product attention: `scores[i,j] = dot(Q[i], K[j]) / sqrt(d_k)`, row-softmax, then `out[i] = sum_j attn[i,j]*V[j]`. Optional causal (upper-triangle) mask. Parameter-free. Created with `TNNetScaledDotProductAttention.Create(d_k, CausalMask=false)`.
`TNNetCosineSimilarityAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Drop-in variant of scaled dot-product attention whose raw `Q.K^T` score is replaced by a cosine-similarity score `score[i,j] = scale * (Q[i]/\|\|Q[i]\|\|) . (K[j]/\|\|K[j]\|\|)` — each query and key row is L2-normalized over the `d_k` feature axis (with an epsilon guard) before the dot product. Everything after the scores (row-softmax, `V`-weighting) is identical to SDPA. Because cosine scores are bounded in `[-scale, +scale]` this removes the unbounded-logit problem of dot-product attention (more stable softmax / no score blow-up at large `d_k`). Optional causal mask; fixed `scale` (default `1.0`) round-trips via serialization. Parameter-free. The exact L2-normalization Jacobian is back-propagated. Created with `TNNetCosineSimilarityAttention.Create(d_k, CausalMask=false, Scale=1.0)`.
`TNNetSinkAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Drop-in variant of scaled dot-product attention with `K` learnable attention-sink slots (StreamingLLM, Xiao et al. 2023). The `K` learnable `(key,value)` sink pairs are prepended to the real keys/values and every query attends to them regardless of the causal mask (sinks are never masked). Softmax runs over the concatenation `[K sinks ++ SeqLen real keys]`, giving it an always-available place to dump probability mass; this stabilises long-context / causal attention (otherwise the first real token tends to act as an implicit sink). Scoring reuses SDPA's `1/sqrt(d_k)` scaling and causal convention for the real keys. The `K(2d_k)` sink params are stored as `2*K` neurons (keys then values) so they train and serialize automatically; `K` round-trips via `Struct[2]`. Sink keys init small-random, sink values init zero. Created with `TNNetSinkAttention.Create(d_k, CausalMask=false, NumSinks=1)`.
`TNNetDifferentialAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Differential Transformer attention head (Ye et al. 2024). Splits the shared `Q`/`K` depth slabs in half into two `(Q1,K1)` and `(Q2,K2)` half-width sub-heads, computes two independent softmax maps scaled by `1/sqrt(d_k/2)`, and outputs their scaled difference applied to the full-width shared `V`: `(softmax(Q1·K1^T/√(d_k/2)) − λ·softmax(Q2·K2^T/√(d_k/2)))·V`. The second map estimates and cancels common-mode attention noise, sharpening long-range retrieval. `λ` is a single learnable scalar (init `≈0.8`, stepped like `TNNetReZero`'s weight and mirrored into the structure string so it round-trips). Requires even `d_k`; causal mask honoured on both maps. Created with `TNNetDifferentialAttention.Create(d_k, CausalMask=false, LambdaInit=0.8)`.
`TNNetLinearAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Softmax-free linear attention (Katharopoulos et al. 2020, Transformers are RNNs) — the first sub-quadratic attention in this repo. Replaces the `softmax(QK^T)V` core with a positive feature map `φ(x)=elu(x)+1` on `Q`/`K`, then exploits associativity: `out_t = φ(Q_t)·S / (φ(Q_t)·Z)` where `S = Σ_s φ(K_s)⊗V_s` (`d_k×d_k`) and `Z = Σ_s φ(K_s)` are accumulated once over the sequence. Cost is `O(SeqLen·d_k²)` — linear in sequence length, with no `SeqLen×SeqLen` score matrix ever formed. Non-causal (full-prefix) variant; at `SeqLen=1` the normaliser cancels and the output reduces to `V_1` exactly. Parameter-free. Created with `TNNetLinearAttention.Create(d_k)`.
`TNNetLinformerAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Linformer (Wang et al. 2020, Linformer: Self-Attention with Linear Complexity). Keeps the softmax but projects the Key and Value sequences DOWN along the sequence axis from `SeqLen` to a small fixed rank `k ≪ SeqLen` with two learnable matrices `E, F` (each `k×SeqLen`): `K' = E·K`, `V' = F·V`, then `Attn = softmax(Q·K'ᵀ / √d_k)` (a `SeqLen×k` score matrix) and `Out = Attn·V'`, making attention O(SeqLen·k). Because `E,F` carry a fixed `SeqLen` dimension the layer requires a FIXED SeqLen (asserted in `SetPrevLayer`). `E,F` are two trainable neurons with exact finite-difference-checked input AND weight gradients; `d_k`/`k`/`SeqLen` round-trip via `FStruct`. Distinct from the kernel-feature linear-attention family (which drops softmax) — Linformer low-rank-projects the sequence instead. Created with `TNNetLinformerAttention.Create(d_k, k, SeqLen)`. See `examples/Linformer/`.
`TNNetForgetGateBias`	Input: `SeqLen x 1 x Depth` (per-position features). Output: `SeqLen x SeqLen x 1` (additive score bias).	The Forgetting Transformer (FoX) decay-bias generator (Lin et al. 2025, Forgetting Transformer: Softmax Attention with a Forget Gate) — the only layer here that puts a data-dependent forget gate inside SOFTMAX attention (every other forget gate in tree acts on an `O(d²)` recurrent state: `TNNetGatedLinearAttention`, `TNNetWKV`, `TNNetDeltaNet`; here it acts on the `O(L²)` score matrix). One weight neuron computes a per-position forget value `f_t = sigmoid(w·x_t + b) ∈ (0,1)`, accumulates `F_t = Σ_{k≤t} ln f_k`, and emits the strictly-lower-triangular additive decay bias `D[j,i] = F_i − F_j` for `j≤i` (`−∞` above, so the causal mask folds in for free). Added to the raw `Q·Kᵀ/√d` scores before softmax, this multiplies each attention weight by `prod_{k=j+1..i} f_k` — an input-conditioned exponential discount of older tokens (the softmax analogue of GLA's recurrence, but with full pairwise attention retained). Backward is the exact prefix-sum adjoint (`dL/dF_i = Σ_{j≤i}dD[i,j] − Σ_{k≥i}dD[k,i]`, then `df_t = (Σ_{s≥t}dF_s)/f_t`, sigmoid chain into `w`/`b`/input). The composite builder `TNNet.AddForgettingAttention(Heads, d_model)` wires per-head forget gates into a full (inherently causal) softmax-attention block from existing primitives (no new class, save/load free). Created with `TNNetForgetGateBias.Create()` (Depth inferred).
`TNNetPerformerAttention`	Input: `SeqLen x 1 x 3*d_k` (`Q\|K\|V` concatenated along depth). Output: `SeqLen x 1 x d_k`.	Performer / FAVOR+ (Choromanski et al. 2020, Rethinking Attention with Performers). Uses positive random features to give an unbiased estimate of the softmax kernel `exp(q·k)` at linear cost: for an `m×d_k` frozen projection `W`, `φ(x)=exp(W·x−‖x‖²/2)/√m` so `E[φ(q)·φ(k)]=exp(q·k)`. Attention reassociates like the kernel family — `S=Σ_s φ(K_s)⊗V_s` (`m×d_v`), `Z=Σ_s φ(K_s)`, `Out_t=(φ(Q_t)·S)/(φ(Q_t)·Z)` — at O(SeqLen·m·d_v) with no `SeqLen×SeqLen` matrix. `W`'s rows are i.i.d. `N(0,1)`, orthogonalized block-wise when `m≥d_k` (the lower-variance "+" in FAVOR+). `W` is frozen (no weight gradient) but `dL/dQ`, `dL/dK` backprop through `φ`; `d_k`/`m`/RNG seed round-trip via `FStruct` so `W` reloads bit-identically. Unlike `TNNetLinearAttention` (deterministic `elu+1`, a different kernel) Performer approximates the true softmax. Created with `TNNetPerformerAttention.Create(d_k, m, Seed)`. See `examples/Performer/`.
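
The decay bias described in the `TNNetForgetGateBias` row can be checked by hand with a small standalone program. This is only a sketch of the arithmetic, not the layer's code: the names `BuildDecayBias`, `GateProduct`, `Gate` and `MaskValue` are hypothetical, the gate values are made up, and the index order `Bias[i, j]` (query `i`, key `j`) is an assumption. It builds the prefix sums of `ln f` and shows that `exp(Bias[i, j])` equals the product of the gates from `j+1` to `i`.

```pascal
program ForgetGateBiasSketch;
{$mode objfpc}

const
  SeqLen = 4;
  MaskValue = -1.0e30; // finite stand-in for the -infinity above the diagonal

type
  TVec = array[0..SeqLen - 1] of Double;
  TMat = array[0..SeqLen - 1, 0..SeqLen - 1] of Double;

// Bias[i, j] = CumLog[i] - CumLog[j] for j <= i (query i, key j),
// where CumLog[t] is the running sum of ln(Gate[k]) for k <= t.
procedure BuildDecayBias(const Gate: TVec; out Bias: TMat);
var
  i, j: Integer;
  CumLog: TVec;
  Acc: Double;
begin
  Acc := 0.0;
  for i := 0 to SeqLen - 1 do
  begin
    Acc := Acc + Ln(Gate[i]);
    CumLog[i] := Acc;
  end;
  for i := 0 to SeqLen - 1 do
    for j := 0 to SeqLen - 1 do
      if j <= i then
        Bias[i, j] := CumLog[i] - CumLog[j]
      else
        Bias[i, j] := MaskValue;
end;

// Direct product of the gates after key j, up to and including query i.
function GateProduct(const Gate: TVec; i, j: Integer): Double;
var
  k: Integer;
begin
  Result := 1.0;
  for k := j + 1 to i do
    Result := Result * Gate[k];
end;

var
  Gate: TVec = (0.9, 0.5, 0.8, 0.7); // made-up forget values in (0,1)
  Bias: TMat;
  i, j: Integer;
begin
  BuildDecayBias(Gate, Bias);
  for i := 0 to SeqLen - 1 do
    for j := 0 to i do
      WriteLn('i=', i, ' j=', j,
              ' exp(bias)=', Exp(Bias[i, j]):0:6,
              ' product=', GateProduct(Gate, i, j):0:6);
end.
```

The two printed columns should agree on every line, and the diagonal (`j = i`) carries a bias of zero, so a token never discounts itself.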

Multi-head self-attention builder. TNNet.AddMultiHeadSelfAttention(d_model, Heads, CausalMask=false) wires the single-head TNNetScaledDotProductAttention above into a full multi-head block in one call: it splits the [Q_all|K_all|V_all] (depth 3*d_model) input slab into Heads per-head [Q_h|K_h|V_h] slices (d_k = d_model/Heads) via TNNetSplitChannels, runs one SDPA head per slice, concatenates the head outputs back to depth d_model with TNNetDeepConcat, and applies a TNNetPointwiseConvLinear(d_model) per-token out-projection. The two intermediate steps are also exposed as TNNet.AddSplitQKVHeads and TNNet.AddMultiHeadSDPAConcat. (The out-projection is pointwise rather than TNNetFullConnectLinear because over a SeqLen x 1 x d_model tensor a fully-connected layer would flatten and mix the whole sequence into one vector.) An optional Variant argument (avSDPA default, avDifferential, avSink) swaps the plain per-head SDPA for TNNetDifferentialAttention or TNNetSinkAttention (with NumSinks sink slots) — the default keeps every existing call bit-for-bit unchanged.

Multi-head cross-attention builder. TNNet.AddMultiHeadCrossAttention(d_model, Heads, QuerySource, KeyValueSource, CausalMask=false) wires encoder-decoder cross-attention in one call: the Query is projected (token-wise TNNetPointwiseConvLinear(d_model)) from QuerySource (the decoder stream, a QSeqLen x 1 x d_model token tensor) while the Keys and Values are projected from a separate KeyValueSource (the encoder output, a KVSeqLen x 1 x d_model token tensor). The query and key/value sequence lengths may differ — the result lives on the query grid (QSeqLen x 1 x d_model). Per head it slices d_k = d_model/Heads channels out of each projection, packs them as [Q_h|K_h|V_h] with TNNetDeepConcat, runs one TNNetScaledDotProductAttention head, concatenates the heads, and applies a token-wise TNNetPointwiseConvLinear(d_model) out-projection. (As with self-attention, every projection is pointwise rather than TNNetFullConnect*, which would flatten the sequence axis.)

Grouped-Query / Multi-Query attention builder. TNNet.AddMultiHeadGroupedQueryAttention(d_model, QueryHeads, KVHeads, CausalMask=false) builds the GQA attention shape used by modern LLMs (Llama-2/3, Mistral): the Query is projected to the full d_model (QueryHeads heads of d_k = d_model/QueryHeads) but the Keys and Values are projected to only KVHeads*d_k channels, so several query heads share one key/value head (QueryHeads/KVHeads heads per group). KVHeads=1 degenerates to Multi-Query Attention; KVHeads=QueryHeads to plain multi-head attention. Each query head slices its own d_k Q channels plus the d_k channels of its shared KV group, packs [Q_h|K_group|V_group], runs one TNNetScaledDotProductAttention, and the heads are concatenated and out-projected with a token-wise TNNetPointwiseConvLinear(d_model). The win is inference memory: the K and V projections (and therefore the K/V channels a decoder would cache per token) shrink by a factor of QueryHeads/KVHeads versus full MHA. Requires d_model mod QueryHeads = 0 and QueryHeads mod KVHeads = 0.
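
To make the head arithmetic concrete, here is a minimal standalone sketch that does not call the library. The helper `KVGroupOf` and the example sizes are hypothetical, and the mapping of consecutive query heads to one group is an assumption, since the text does not say how heads are assigned to groups.

```pascal
program GQAGroupingSketch;
{$mode objfpc}

// Index of the shared key/value head used by a query head,
// assuming consecutive query heads form one group.
function KVGroupOf(QueryHead, QueryHeads, KVHeads: Integer): Integer;
begin
  Result := QueryHead div (QueryHeads div KVHeads);
end;

var
  DModel: Integer = 512;       // example sizes, chosen for illustration only
  NumQueryHeads: Integer = 8;
  NumKVHeads: Integer = 2;
  DK, KVChannels, h: Integer;
begin
  // The two divisibility requirements stated in the text.
  if (DModel mod NumQueryHeads <> 0) or (NumQueryHeads mod NumKVHeads <> 0) then
  begin
    WriteLn('Invalid head configuration');
    Halt(1);
  end;
  DK := DModel div NumQueryHeads;   // per-head width d_k
  KVChannels := NumKVHeads * DK;    // width of the K (or V) projection
  WriteLn('d_k = ', DK);
  WriteLn('K/V projection width = ', KVChannels, ' (full MHA: ', DModel, ')');
  WriteLn('reduction factor = ', DModel div KVChannels);
  for h := 0 to NumQueryHeads - 1 do
    WriteLn('query head ', h, ' -> KV group ',
            KVGroupOf(h, NumQueryHeads, NumKVHeads));
end.
```

With these sizes d_k is 64, each of K and V is 128 channels wide instead of 512, and query heads 0 to 3 read group 0 while heads 4 to 7 read group 1.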

Multi-head Latent Attention (MLA) builder. TNNet.AddMultiHeadLatentAttention(d_model, Heads, LatentDim, CausalMask=false) builds the DeepSeek-V2 (Liu et al. 2024) attention shape, a compression axis orthogonal to GQA: instead of sharing full-width K/V across query-head groups, MLA low-rank-factors the K/V projection. Each token is first down-projected to a tiny shared latent c_KV of width LatentDim << d_model (the only state a decoder would cache), then K and V are reconstructed per head by up-projections from c_KV. Query is projected to the full d_model per head; each head packs [Q_h|K_h|V_h], runs one TNNetScaledDotProductAttention, and the heads are concatenated and out-projected with a token-wise TNNetPointwiseConvLinear(d_model) (all projections are pointwise so the token axis is preserved). The win is cacheable-state size: LatentDim/(2*d_model) of plain MHA's K/V cache. The default RopeDim=0 is NoPE; RopeDim>0 (even) adds the paper's decoupled-RoPE slice: RoPE cannot be applied to the compressed latent (the up-projection would smear positions), so a small extra rope dimension carries position — rope-Q is a per-head projection of x rotated by TNNetRotaryEmbedding, while rope-K is one projection of x shared across all heads and rotated once, so the decode state grows by only RopeDim. Each head attends with concat(Q_h,ropeQ_h)·concat(K_h,ropeK). For KV-cache incremental decode the per-head SDPA layers support BeginIncrementalDecode, and TNNetRotaryEmbedding.PositionOffset lets a streamed length-1 token be rotated with its absolute position; examples/LatentAttention/ demonstrates both this path and a true latent-only cache (d_c floats/token, < 1e-5 faithful) with a printed cache-memory comparison. See examples/LatentAttention/.
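
The cache-size claim above is simple arithmetic, sketched below as a standalone program. The function names and the example sizes are hypothetical; the counts follow the text (plain MHA caches full-width K and V, MLA caches the shared latent plus the shared rope-K slice).

```pascal
program MLACacheSketch;
{$mode objfpc}

// Floats cached per token by plain multi-head attention: full-width K and V.
function MHACacheFloats(DModel: Integer): Integer;
begin
  Result := 2 * DModel;
end;

// Floats cached per token by MLA: the shared latent plus the shared rope-K slice.
function MLACacheFloats(LatentDim, RopeDim: Integer): Integer;
begin
  Result := LatentDim + RopeDim;
end;

var
  ModelWidth: Integer = 512;  // example sizes, chosen for illustration only
  LatentWidth: Integer = 64;
  RopeWidth: Integer = 16;
begin
  WriteLn('MHA        : ', MHACacheFloats(ModelWidth), ' floats/token');
  WriteLn('MLA (NoPE) : ', MLACacheFloats(LatentWidth, 0), ' floats/token');
  WriteLn('MLA + RoPE : ', MLACacheFloats(LatentWidth, RopeWidth), ' floats/token');
  WriteLn('NoPE ratio : ',
          MLACacheFloats(LatentWidth, 0) / MHACacheFloats(ModelWidth):0:4);
end.
```

For these sizes the NoPE ratio is 64/1024 = 0.0625, which matches LatentDim/(2*d_model).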

Set Transformer (ISAB + PMA) builders. Two permutation-invariant set primitives (Lee et al. 2019, Set Transformer), each owning a learnable bank of vectors. TNNet.AddInducedSetAttention(InducingPoints, Heads) wires TNNetInducedSetAttention (ISAB): instead of O(N^2) self-attention over the N input tokens it keeps a small bank of M = InducingPoints learnable inducing points and applies two stacked cross-attention blocks — H = MAB(I, X) (the M points attend over the N inputs → (M,d)), then Y = MAB(X, H) (the N inputs attend back over the M summaries → (N,d)) — giving an O(N*M) shape-preserving set-to-set map. TNNet.AddAttentionPooling(NumSeeds, Heads) wires TNNetAttentionPooling (PMA): pools a variable-length set (N,1,d) to a fixed (k,1,d) (k = NumSeeds) by letting k learnable seed vectors cross-attend over the inputs — a trainable, content-addressed, permutation-invariant readout (k=1 is a learned-query weighted-sum pool, categorically unlike the parameter-free TNNetAvgChannel/TNNetMaxChannel). Heads=1 emits the bare single-head layer with identity Q/K/V projections (only the inducing/seed bank is learnable), keeping the two-stage softmax-Jacobian backward exact and gradient-checkable. Heads>1 builds a genuine multi-head MAB by the repo's concat-of-H idiom (no head-axis tensor): d_model splits into Heads subspaces of width d_model/Heads, each head gets its own learnable per-token input projection (a 1×1 TNNetPointwiseConvLinear that preserves the set axis) feeding a single-head ISAB/PMA with its own bank, the heads are concatenated back to d_model and run through a learnable per-token out-projection — so each head keeps the exact softmax-Jacobian backward while the model now learns Q/K/V projections and mixes head subspaces (d_model must be divisible by Heads). See examples/SetTransformer/. 
TNNet.AddSAB(InducingPoints, Heads, DFF) completes the family with the paper's full Set Attention Block: it wraps the multi-head ISAB MAB in two post-norm residual sub-blocks — H = LayerNorm(X + MAB(X,X)) then out = LayerNorm(H + FFN(H)), where FFN is a token-wise TNNetPointwiseConvReLU(DFF) → TNNetPointwiseConvLinear(d_model) (1×1 convs, so the (N,1,d_model) set axis and permutation-equivariance are preserved). See examples/SetAttentionBlock/.

Sinkhorn (doubly-stochastic) attention builder. TNNet.AddSinkhornAttention(out Attended, W; KIter=20, Tau=1.0) builds a single-head attention block that is a drop-in contrast to softmax attention: it projects the input to Q/K/V with token-wise TNNetPointwiseConvLinear, forms the scaled QKᵀ score matrix with TNNetDotProducts, and then — instead of the one-sided row softmax — normalizes the scores with the existing TNNetSinkhorn layer so the attention map is doubly stochastic (rows AND columns sum to 1), before weighting the values and out-projecting. Where standard attention lets every query distribute its own probability mass independently (columns can starve or saturate), the doubly-stochastic map balances how much each key is attended to overall — the optimal-transport view of attention. The TNNetSinkhorn normalizer is returned in W for inspection/annealing (SetTau), and Attended is the block output. KIter/Tau are the Sinkhorn iteration count and temperature. The default softmax path (AddSingleHeadSelfAttention / AddMultiHeadSelfAttention) is unchanged. See examples/SinkhornMatching/.
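
The doubly-stochastic normalization can be illustrated with a generic Sinkhorn iteration. This is a standalone sketch, not the TNNetSinkhorn implementation: the procedure name `SinkhornNormalize`, the row-then-column order, the plain (non-log) domain and the example scores are all assumptions made for illustration.

```pascal
program SinkhornSketch;
{$mode objfpc}

const
  N = 3;

type
  TMat = array[0..N - 1, 0..N - 1] of Double;

// Alternately normalizes the rows and the columns of exp(Scores / Tau).
// After enough iterations every row and every column sums to (almost) 1.
procedure SinkhornNormalize(const Scores: TMat; KIter: Integer; Tau: Double;
  out P: TMat);
var
  r, c, Step: Integer;
  Total: Double;
begin
  for r := 0 to N - 1 do
    for c := 0 to N - 1 do
      P[r, c] := Exp(Scores[r, c] / Tau);
  for Step := 1 to KIter do
  begin
    for r := 0 to N - 1 do            // row normalization
    begin
      Total := 0.0;
      for c := 0 to N - 1 do Total := Total + P[r, c];
      for c := 0 to N - 1 do P[r, c] := P[r, c] / Total;
    end;
    for c := 0 to N - 1 do            // column normalization
    begin
      Total := 0.0;
      for r := 0 to N - 1 do Total := Total + P[r, c];
      for r := 0 to N - 1 do P[r, c] := P[r, c] / Total;
    end;
  end;
end;

function RowSum(const M: TMat; r: Integer): Double;
var
  c: Integer;
begin
  Result := 0.0;
  for c := 0 to N - 1 do Result := Result + M[r, c];
end;

function ColSum(const M: TMat; c: Integer): Double;
var
  r: Integer;
begin
  Result := 0.0;
  for r := 0 to N - 1 do Result := Result + M[r, c];
end;

var
  Scores: TMat = ((0.0, 1.0, 2.0), (1.0, 0.0, 1.0), (2.0, 1.0, 0.5));
  Attn: TMat;
  i: Integer;
begin
  SinkhornNormalize(Scores, 20, 1.0, Attn);
  for i := 0 to N - 1 do
    WriteLn('row ', i, ' sum = ', RowSum(Attn, i):0:6,
            '   col ', i, ' sum = ', ColSum(Attn, i):0:6);
end.
```

A plain row softmax would correspond to stopping after the first row normalization, which leaves the column sums unconstrained.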

Spiking block builder. TNNet.AddSpikingBlock(pHidden, tau=2.0, V_th=1.0, alpha=2.0, LearnDynamics=false) wires the canonical spiking-network linear → LIF → rate-readout pipeline over a (T, 1, D) tensor on the time axis in one call: a per-timestep TNNetPointwiseConvLinear(pHidden) synaptic-current projection (pointwise so each time step is projected independently — TNNetFullConnect* would flatten/mix the time axis), then a TNNetLIFNeuron(tau, V_th, alpha, LearnDynamics) emitting a binary spike train, then a TNNetAvgChannel rate readout averaging spikes over time to a (1, 1, pHidden) firing-rate vector (ready for a dense head). Returns the rate-readout layer. LearnDynamics forwards to the LIF layer's opt-in trainable per-channel threshold/leak. Composes existing layers (no new class). See examples/SpikingMNIST/.

Talking-Heads attention builder. TNNet.AddTalkingHeadsAttention(Heads, d_model, CausalMask=false, PreSoftmaxMix=true, PostSoftmaxMix=true) builds full Talking-Heads multi-head attention (Shazeer et al. 2020) — a learnable Heads x Heads linear mix across heads applied to the attention maps. Because this repo has no single head-axis tensor (multi-head is H separate concatenated TNNetScaledDotProductAttention layers, whose softmax is fused and never exposes a per-head logit slab), this is necessarily a builder, not a drop-in layer: it composes attention from finer primitives so the per-head scores become a real graph tensor. Per head it forms score_h = TNNetDotProducts(Q_h, K_h) scaled by 1/sqrt(d_k), transposes so heads stack on the Depth axis, and TNNetDeepConcats the H slabs into a (key, query, H) tensor; the cross-head mix is then a TNNetPointwiseConvLinear(H) (a 1x1 conv over the head axis at every (key,query) position — exactly Shazeer's H x H multiply). A PreSoftmaxMix mix is applied to the logits and a PostSoftmaxMix mix to the post-softmax weights (each individually toggleable for ablation), with the softmax taken over the key axis between them; then weights·V per head, concat, and a token-wise TNNetPointwiseConvLinear(d_model) out-projection. Composes existing serializable layers (no new class), so save/load works for free. Requires d_model mod Heads = 0.

Conformer block builder. TNNet.AddConformerBlock(Heads, d_ff, ConvKernelSize) builds the convolution-augmented transformer block of Conformer (Gulati et al. 2020) over a (SeqLen, 1, d_model) tensor. It is a "macaron" block that sandwiches a multi-head self-attention module (global mixing) and a convolution module (local mixing) between two half-step feed-forward modules, each sub-module a pre-norm residual, with a final LayerNorm: x += 0.5·FFN(x); x += MHSA(x); x += Conv(x); x += 0.5·FFN(x); x := LayerNorm(x). Composed entirely from existing serializable primitives (TNNetLayerNorm, TNNetPointwiseConvLinear per-token projections, AddMultiHeadSelfAttention, TNNetGLU conv gating, TNNetCausalConv1D 1-D conv over the time axis, TNNetSwish, TNNetSum residuals, TNNetMulByConstant(0.5) macaron scaling), so it needs no new leaf class and round-trips through SaveToString/LoadFromString; shape-preserving so blocks stack. (The paper's per-channel depthwise conv has no 1-D-over-sequence primitive in tree yet, so the channel-mixing TNNetCausalConv1D is the documented stand-in.) See examples/Conformer/.

Transformer encoder block builder. TNNet.AddTransformerEncoderBlock(d_model, Heads, d_ff, PreNorm=true, CausalMask=false) assembles a complete transformer encoder block over a SeqLen x 1 x d_model tensor in one call: an attention sub-block (LayerNorm → token-wise Q|K|V slab projection TNNetPointwiseConvLinear(3*d_model) → AddMultiHeadSelfAttention → residual sum) followed by a SwiGLU feed-forward sub-block (LayerNorm → TNNetPointwiseConvLinear(2*d_ff) → TNNetSwiGLU → TNNetPointwiseConvLinear(d_model) → residual sum). With PreNorm=true (default) each LayerNorm precedes its sub-block (x + Sublayer(LayerNorm(x))); with PreNorm=false it follows the residual sum (LayerNorm(x + Sublayer(x)), post-norm). Every projection — including both FFN projections — is a pointwise (1×1) convolution so the token axis is preserved (TNNetFullConnect* would flatten the whole sequence). The output shape stays SeqLen x 1 x d_model, so blocks can be stacked.

Transformer decoder block builder. TNNet.AddTransformerDecoderBlock(d_model, Heads, d_ff, EncoderOutput, PreNorm=true) assembles a complete encoder-decoder transformer decoder block over a SeqLen x 1 x d_model decoder stream in one call by composing three residual sub-blocks: (1) a causal multi-head self-attention sub-block (same wiring as the encoder block with CausalMask=true); (2) a cross-attention sub-block whose Query comes from the decoder stream and whose Key/Value come from the explicit EncoderOutput layer (a KVSeqLen x 1 x d_model encoder-memory tensor, via AddMultiHeadCrossAttention); and (3) a token-wise SwiGLU feed-forward sub-block. PreNorm places each LayerNorm before its sub-block (default) or after the residual sum (post-norm), matching AddTransformerEncoderBlock. The query and encoder-memory sequence lengths may differ — the output stays on the decoder grid (SeqLen x 1 x d_model), so decoder blocks can be stacked. See examples/TransformerDecoderBlock/.

Perceiver encoder builder. TNNet.AddPerceiverEncoder(NumLatents, d_latent, Heads, Depth, d_ff=0, PreNorm=true) assembles a Perceiver / Perceiver-IO latent-bottleneck encoder (Jaegle et al. 2021, arXiv:2103.03206) over a (InputSeqLen, 1, d_model) input in one call, decoupling the bulk of the compute from the input length. It (1) optionally projects the input width to d_latent (a 1×1 TNNetPointwiseConvLinear, inserted only when d_model <> d_latent), then (2) reads the whole input into a small fixed-size learnable latent array Z of shape (NumLatents, 1, d_latent) (NumLatents << InputSeqLen) via AddAttentionPooling(NumLatents, Heads) — the PMA seed bank is the Perceiver latent array, and this input→latent cross-attention is the only place the input enters, at cost linear in InputSeqLen; then (3) refines Z with a Depth-deep tower of AddTransformerEncoderBlock(Heads, d_ff) latent self-attention blocks whose quadratic cost lives only in the cheap NumLatents² tower. Output length is NumLatents regardless of input length — distinct from AddInducedSetAttention (projects back to the n input rows) and AddAttentionPooling alone (single pool, no self-attention refinement). Composes existing layers (no new class). See examples/Perceiver/.

Attention Masking

Layer Name	Input/Output Dimensions	Description
`TNNetMaskedFill`	2D or 3D	Causal (upper-triangle) mask for self-attention score maps. Adds a large negative constant to positions where the column index (X) is greater than the row index (Y), so they contribute (almost) zero attention after a softmax. No trainable parameter; the backward pass is a straight passthrough. Created with `TNNetMaskedFill.Create()` or `TNNetMaskedFill.Create(MaskValue)`. The overload `TNNetMaskedFill.Create(MaskValue, Offset, LowerTriangle)` selects a configurable pattern: causal masks `X > Y + Offset`, anti-causal (`LowerTriangle=True`) masks `X < Y - Offset`; the default `Offset=0, LowerTriangle=False` reproduces the strict upper-triangle behaviour exactly.
`TNNetSlidingWindowMaskedFill`	2D or 3D	Banded local causal mask (Mistral / Longformer style). Each query position Y attends only to keys in the window `[Y-W+1 .. Y]`; positions in the strict future (X > Y) or too far in the past (X < Y-W+1) get a large negative constant added. With `W >= SeqLen` it reduces to the full causal `TNNetMaskedFill`. No trainable parameter; backward is a straight passthrough. Created with `TNNetSlidingWindowMaskedFill.Create(Window)` or `TNNetSlidingWindowMaskedFill.Create(Window, MaskValue)`.
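
The mask patterns in the table reduce to two index predicates. The standalone sketch below prints them for a short sequence; it does not use the library, the function names `IsMasked` and `IsWindowMasked` are hypothetical, and 0-based indices are an assumption. X is the key column and Y the query row, as in the table.

```pascal
program AttentionMaskSketch;
{$mode objfpc}

// True when score position (X = key column, Y = query row) is masked by the
// configurable causal (X > Y + Offset) or anti-causal (X < Y - Offset) pattern.
function IsMasked(X, Y, Offset: Integer; LowerTriangle: Boolean): Boolean;
begin
  if LowerTriangle then
    Result := X < Y - Offset
  else
    Result := X > Y + Offset;
end;

// True when position (X, Y) falls outside the causal window [Y-Window+1 .. Y].
function IsWindowMasked(X, Y, Window: Integer): Boolean;
begin
  Result := (X > Y) or (X < Y - Window + 1);
end;

const
  SeqLen = 5;
var
  Col, Row: Integer;
begin
  WriteLn('causal (x = masked):');
  for Row := 0 to SeqLen - 1 do
  begin
    for Col := 0 to SeqLen - 1 do
      if IsMasked(Col, Row, 0, False) then Write('x ') else Write('. ');
    WriteLn;
  end;
  WriteLn('sliding window, Window = 2:');
  for Row := 0 to SeqLen - 1 do
  begin
    for Col := 0 to SeqLen - 1 do
      if IsWindowMasked(Col, Row, 2) then Write('x ') else Write('. ');
    WriteLn;
  end;
end.
```

Calling `IsWindowMasked` with a window of at least `SeqLen` gives the same pattern as the default causal predicate, which is the reduction the table mentions.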

Recurrent / State-Space Sequence Mixing

TNNetDiagonalSSM is the first recurrent layer in the library: a diagonal-state linear-recurrence ("SSM-lite") sequence mixer that provides an O(n) causal alternative to the O(n^2) scaled-dot-product-attention head. The input is a (SeqLen, 1, Depth) sequence laid out along the X axis (the same convention the attention layers use); the recurrence runs left-to-right along X with the depth channels fully parallel. See examples/DiagonalSSM/ for a single-layer demo that prints the learned per-channel decay spectrum.
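
For intuition, the recurrence listed for `TNNetDiagonalSSM` in the table below can be run by hand on a single channel. This standalone sketch is not the layer's code: the names `DiagonalSSMForward`, `Impulse` and `Response` are hypothetical, and starting from a zero state is an assumption.

```pascal
program DiagonalSSMSketch;
{$mode objfpc}

const
  SeqLen = 5;

type
  TSeq = array[0..SeqLen - 1] of Double;

function Sigmoid(V: Double): Double;
begin
  Result := 1.0 / (1.0 + Exp(-V));
end;

// One channel of the recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t + e*x_t,
// with the decay a = sigmoid(ARaw) so that it always lies in (0, 1).
procedure DiagonalSSMForward(const Xs: TSeq; ARaw, B, C, E: Double;
  out Ys: TSeq);
var
  t: Integer;
  A, H: Double;
begin
  A := Sigmoid(ARaw);
  H := 0.0;                       // assumed initial state
  for t := 0 to SeqLen - 1 do     // single left-to-right sweep
  begin
    H := A * H + B * Xs[t];
    Ys[t] := C * H + E * Xs[t];
  end;
end;

var
  Impulse: TSeq = (1.0, 0.0, 0.0, 0.0, 0.0);
  Response: TSeq;
  t: Integer;
begin
  // ARaw = 0 gives a = 0.5; b = c = 1 and e = 0 expose the bare decay.
  DiagonalSSMForward(Impulse, 0.0, 1.0, 1.0, 0.0, Response);
  for t := 0 to SeqLen - 1 do
    WriteLn('y[', t, '] = ', Response[t]:0:4);
end.
```

The impulse response is 1, 0.5, 0.25, 0.125, 0.0625: each step multiplies the state by the decay, which is the per-channel spectrum the DiagonalSSM example prints after training.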

Layer Name	Input/Output Dimensions	Description
`TNNetDiagonalSSM`	2D (SeqLen x 1 x Depth)	Per-channel diagonal state-space recurrence: state `h_t = a[d]*h_{t-1} + b[d]*x_t`, output `y_t = c[d]*h_t + e[d]*x_t` (the `e*x` skip is the S4D/S5 feedthrough). Four learnable per-channel vectors `(a, b, c, e)`; the decay is stored as `a = sigmoid(a_raw)` so it stays in `(0,1)` and the recurrence is unconditionally stable. Forward is a single left-to-right sweep; backward is backprop-through-time. Created with `TNNetDiagonalSSM.Create()`.
`TNNetClosedFormContinuous`	2D (SeqLen x 1 x Depth)	A CfC "liquid" recurrent cell (Hasani et al. 2022, Closed-form continuous-time neural networks, Nature MI): a sequence mixer that updates a hidden state with the analytic closed-form solution of a liquid time-constant ODE rather than a numerical integrator. Per step `h_t = sigmoid(-(Wt·x_t + b_t)·t) ⊙ tanh(Wg·x_t + b_g) + (1 - sigmoid(...)) ⊙ h_{t-1}` — an input-dependent, per-channel continuous-time constant that gates between a fast `tanh` input pathway and the previous state, with elapsed time `t = (step+1)/SeqLen`. This is distinct from its neighbours in the sequence-mixer family: `AddNeuralODEBlock` numerically integrates a residual field; `TNNetRetention` / `TNNetDiagonalSSM` use fixed / input-independent decay kernels; `TNNetSelectiveSSM` is the input-dependent state-space cousin; and Highway / GatedResidual gates are depthwise but not recurrent over time. Storage is four learnable tensors `Wt`, `Wg` (Depth×Depth) and `b_t`, `b_g` (Depth-long). Forward is the explicit per-timestep recurrence; backward is backprop-through-time over the unrolled steps. Created with `TNNetClosedFormContinuous.Create()` (Depth inferred from the previous layer); composite helper `TNNet.AddClosedFormContinuous()` wraps the cell in a pre-norm RMSNorm residual (`y = x + CfC(RMSNorm(x))`) so it drops into a transformer-style block in one call (mirrors `AddRetention` / `AddNeuralODEBlock`). For non-causal sequence tasks, `TNNet.AddBidirectionalClosedFormContinuous()` runs a forward CfC plus a reverse branch (`FlipX → CfC → FlipX`) and concatenates the two along `Depth` (output `Depth` doubles, so wrap with a pointwise projection if a residual is wanted). See `examples/LiquidCfC/` for a remember-then-recall toy contrasting it against an SDPA head at matched parameter count, and `examples/LiquidCfCvsSSM/` for a last-write-wins task where the input-dependent time constant beats a fixed-decay `TNNetDiagonalSSM` at matched parameters.
`TNNetSLSTMCell`	2D (SeqLen x 1 x Depth)	The scalar xLSTM cell (Beck et al. 2024, xLSTM: Extended Long Short-Term Memory): the FIRST classic-LSTM-style multiplicative-gate recurrence in the library — every other sequence mixer (`TNNetDiagonalSSM`, `TNNetClosedFormContinuous`, `TNNetRetention`) is linear-state or fixed/learned decay with no input/forget/output gates. Its distinguishing machinery is exponential input/forget gates `i_t = exp(W_i·x_t + r_i·h_{t-1} + b_i)`, `f_t = exp(...)` (sharper storage revision than sigmoid gates), made trainable by a running-max stabilizer state `m_t = max(log f_t + m_{t-1}, log i_t)` that renormalizes the unbounded exp gates (`i'_t = exp(log i_t - m_t)`, `f'_t = exp(log f_t + m_{t-1} - m_t)`) so they never overflow — the paper's key trick. State updates `c_t = f'_t·c_{t-1} + i'_t·tanh(W_z·x_t + r_z·h_{t-1} + b_z)` and normalizer `n_t = f'_t·n_{t-1} + i'_t`, output `h_t = sigmoid(W_o·x_t + r_o·h_{t-1} + b_o) ⊙ (c_t / n_t)`. Twelve learnable tensors (`W_{z,i,f,o}`, recurrent `r_{z,i,f,o}` Depth×Depth, biases `b_{z,i,f,o}` with `b_f` init +1). Forward is the explicit per-timestep recurrence caching per-step gates/`m_t`; backward is backprop-through-time with `m_t` treated as a stop-gradient running max. Created with `TNNetSLSTMCell.Create()` (Depth inferred); composite helper `TNNet.AddSLSTM()` wraps it in a pre-norm RMSNorm residual (`y = x + sLSTM(RMSNorm(x))`). See `examples/SLSTMvsCfC/`.
`TNNetMLSTMCell`	2D (SeqLen x 1 x Depth)	The matrix-memory mLSTM cell of xLSTM (Beck et al. 2024) — the parallelisable, attention-like sibling of the scalar `TNNetSLSTMCell`. Instead of a scalar cell state it carries a `Depth×Depth` outer-product covariance memory `C_t = f'_t·C_{t-1} + i'_t·(v_t·k_tᵀ)` plus a normalizer vector `n_t = f'_t·n_{t-1} + i'_t·k_t`, both renormalized by the same running-max stabilizer `m_t` and exponential input/forget gates as sLSTM. Per step it projects the input into query/key/value (`q_t=W_q·x_t`, `k_t=W_k·x_t`, `v_t=W_v·x_t`) and reads out `h_t = o_t ⊙ (C_t·q_t / max(
`TNNetKalmanFilterCell`	2D (T x 1 x StateDim)	A differentiable diagonal Kalman filter (Kalman 1960) — the first layer in the library to propagate uncertainty rather than only a deterministic state: it carries a per-channel state mean `x` AND a covariance `P` over the time axis and forms an adaptive Kalman gain that trades model prediction against the new observation. Genuinely distinct from every other recurrent/state-space layer here (`TNNetDiagonalSSM`, `TNNetSelectiveSSM`, `TNNetClosedFormContinuous`, `TNNetSLSTMCell`/`TNNetMLSTMCell`, `TNNetNTMMemory`) — none maintain a covariance or build a gain from it. Treating each step's input as the observation `z_t`, it sweeps the two-phase recurrence — predict `x⁻_t = a·x_{t-1}`, `P⁻_t = a²·P_{t-1} + Q`; update `g_t = P⁻_t/(P⁻_t+R)`, `x_t = x⁻_t + g_t·(z_t − x⁻_t)`, `P_t = (1−g_t)·P⁻_t` — and emits the filtered means `x_t`. Per-channel learnable scalars (3·`StateDim` params): transition `a = tanh(a_raw) ∈ (−1,1)` (bounded for stability) and process/measurement noise `Q = softplus(Q_raw)`, `R = softplus(R_raw)` (positive, so the gain stays in `(0,1)` by construction). `x_0=0`, `P_0=1` re-init each sweep (not persisted, like NTMMemory's `M`). Backward is full BPTT with two coupled adjoint scans — the covariance adjoint runs right-to-left alongside the mean adjoint because `g_t` depends on `P⁻_t` which depends on `P_{t-1}` (input and per-channel weight gradients both numerically gradient-checked with bounded params). `StateDim = input Depth`, `FStruct[0]=StateDim`; created with `TNNetKalmanFilterCell.Create()`. See `examples/KalmanFilter/`.
`TNNetHamiltonianCell`	2D (T x 1 x 2·D)	A structure-preserving (symplectic) dynamics cell — a Hamiltonian Neural Network (Greydanus et al. 2019, arXiv:1906.01563). Unlike every other continuous-dynamics layer here, which regresses the time-derivative field directly (`AddNeuralODEBlock` integrates an unconstrained residual field; `TNNetClosedFormContinuous` is a liquid closed-form gate; `TNNetDiagonalSSM`/`TNNetSelectiveSSM` are linear state spaces; `TNNetKalmanFilterCell` propagates uncertainty — none conserve energy), this cell parameterizes a scalar learned Hamiltonian `H(q,p)` with a small inner `tanh` MLP and produces the symplectic gradient field `dq/dt = +∂H/∂p`, `dp/dt = −∂H/∂q`, then integrates `Steps` sequential symplectic-Euler sub-steps (step size `dt`) along the time axis. The conserved quantity falls out of the construction, so a pendulum / mass-spring trajectory stays on its energy level set instead of spiralling. The depth axis packs the `(q
`TNNetFourierMix`	2D (SeqLen x 1 x Depth)	The FNet parameter-free Fourier token mixer (Lee-Thorp et al. 2021, FNet: Mixing Tokens with Fourier Transforms): replaces self-attention with an unparameterised 2D discrete Fourier transform across the sequence (X) and hidden (Depth) axes, keeping only the real part — `y[a,b] = sum_{s,h} x[s,h]·cos(2π(a·s/L + b·h/D)) = Re(DFT_seq(DFT_hidden(x)))`. Holds no trainable weights at all: mixing is a fixed linear operator. Because `Re(DFT)` is a fixed self-adjoint real operator `M` (`M[(a,b),(s,h)] = cos(2π(a·s/L + b·h/D))`, symmetric under swapping the index pairs), the exact input gradient is the same DFT applied to `dL/dy` — derived, not assumed, and verified against finite differences. Distinct from the other sequence mixers: `TNNetTokenShift` is an RWKV t-1 shift, `TNNetCirculantLinear` is a learned circular convolution, attention is a learned mix — this one is a fixed, weightless spectral mix. Default forward/backward is the exact direct `O(n²)` DFT sum (arbitrary `SeqLen`/`Depth`); set `UseFFT := true` for an opt-in separable radix-2 FFT fast path (power-of-two `SeqLen` and `Depth`; default OFF, agrees with the direct path to <1e-5). Asserts `SizeY=1`. Created with `TNNetFourierMix.Create()`. See `examples/FourierMix/`.
`TNNetSpectralConv1D`	2D (SeqLen x 1 x Depth)	The Fourier Neural Operator spectral convolution (Li et al. 2021, Fourier Neural Operator for Parametric PDEs, arXiv:2010.08895) — the first layer with learnable complex spectral weights (distinct from the parameter-free `TNNetFourierMix` and the fixed-random `TNNetFourierFeatures`). Per channel it takes a real radix-2 FFT along `SeqLen` (reusing the proven `FourierMixFFT` helper), truncates to the lowest `Modes` frequencies (a spectral low-pass that makes the operator resolution-invariant — high modes are zeroed, not learned), applies a learnable per-`(in-channel, out-channel)` complex weight `R[m]` per kept mode (an `InDepth×OutDepth` complex matmul mixing real/imag via the 2×2 complex-multiply block, the same weight-packing idiom as `TNNetQuaternionLinear`/`TNNetOctonionLinear`), then inverse-FFTs back to the `SeqLen` domain. A single weight neuron holds `2·Modes·InDepth·OutDepth` reals (real+imag). Backward is the exact real adjoint of the FFT → complex-matmul → IFFT pipeline; both the input and the complex-weight gradients are numerically gradient-checked. Asserts a power-of-two `SizeX` and `SizeY=1`. `Modes` round-trips via `FStruct[5]`. Created with `TNNetSpectralConv1D.Create(OutDepth, Modes)`. See `examples/FourierNeuralOperator/`.
`TNNetSpectralConv2D`	3D (SizeX x SizeY x Depth)	The 2-D Fourier Neural Operator spectral convolution (Li et al. 2021, arXiv:2010.08895) — the 2-D sibling of `TNNetSpectralConv1D` over an image. It takes a 2-D FFT (real radix-2 FFT along X for every row, then along Y for every column — both reuse the proven `FourierMixFFT` helper), truncates to the lowest `ModesX × ModesY` 2-D modes (a 2-D spectral low-pass), applies a learnable per-`(in-channel, out-channel)` complex weight `R[mx,my]` per kept 2-D mode (an `InDepth×OutDepth` complex matmul packed via the same 2×2 complex-multiply idiom as `TNNetQuaternionLinear`/`TNNetOctonionLinear`), then inverse-2D-FFTs back and takes the real part. Because the learned weights live in 2-D mode space, not grid space, the same weights describe the same continuous operator at any resolution (resolution-invariance). A single weight neuron holds `2·ModesX·ModesY·InDepth·OutDepth` reals. Backward is the exact real adjoint of the 2-D FFT → complex-matmul → 2-D IFFT pipeline; both the input and the complex-weight gradients are numerically gradient-checked. Asserts power-of-two `SizeX`/`SizeY`. `OutDepth`/`ModesX`/`ModesY` round-trip via `FStruct[0..2]`. Created with `TNNetSpectralConv2D.Create(OutDepth, ModesX, ModesY)`. See `examples/SpectralConv2D/`.
`TNNetDWT1D`	2D (SeqLen x 1 x Depth)	Lifting-scheme single-level 1-D Discrete Wavelet Transform — a localized multi-resolution time-frequency primitive, the complement to the global-frequency FFT family (`TNNetFourierMix`, `TNNetSpectralConv1D/2D`) which cannot represent a transient localized in both time and scale. Along `SeqLen` it runs the second-generation lifting scheme per channel (split even/odd samples → predict the odd half: `detail = odd − P(even)` → update the even half: `approx = even + U(detail)`), so it is exactly invertible by construction (the inverse runs the same steps in reverse with flipped signs — a free correctness oracle, `IDWT(DWT(x)) == x` to ~1e-7) and `O(SeqLen)` (no FFT). Output `(L,1,D) → (L div 2, 1, 2·D)` = `[approx \| detail]` concatenated along `Depth`; odd `L` is handled by whole-sample symmetric edge extension. Fixed-filter mode (default) selects `csDWT1DHaar` / `csDWT1DCDF53` (CDF/LeGall 5/3) / `csDWT1DDaub4` (Daubechies-4) coefficients giving an orthonormal/biorthogonal transform out of the box; opt-in learnable mode makes the predict/update taps trainable while keeping the lifting structure fixed, so perfect reconstruction is preserved for any tap values (the "learnable wavelet" / scattering-net idea — structurally distinct from `TNNetSpectralConv`'s learnable global complex modes). Backward is the exact linear adjoint lifting; the input gradient and the learnable-tap gradient are both numerically gradient-checked. `Filter`/`Learnable`/`TapCount` round-trip via `FStruct[0..2]`; public `InverseChannel()` exposes the IDWT. Created with `TNNetDWT1D.Create(Filter, Learnable=false)`. Stack `K` levels into a dyadic packet tree with `TNNet.AddWaveletPacketTransform(Levels, Filter, Learnable)`. See `examples/WaveletDenoise/`.
`TNNetCausalConv1D`	2D (SeqLen x 1 x Depth)	Learnable 1D convolution along the X (time) axis with left-only zero padding of `FeatureSize-1`, so the sequence length is preserved and output position `t` depends only on input positions `<= t` (no future leakage). One neuron per output channel holds a `(K, 1, InputDepth)` weight window plus a bias. An attention-free `O(nK)` causal sequence mixer that pairs with `TNNetTokenShift`. An optional `Dilation` parameter (default 1) spaces the taps `Dilation` apart in time and grows the left pad to `Dilation*(FeatureSize-1)`, so stacked layers with increasing dilation build a WaveNet-style exponentially-growing receptive field (`Dilation=1` is identical to the dense conv). Created with `TNNetCausalConv1D.Create(NumFeatures, FeatureSize)` (optional `SuppressBias`, `Dilation`).
`TNNetWKV`	2D (SeqLen x 1 x 2·Depth)	The RWKV weighted key-value (WKV) time-mixing operator (Peng et al. 2023, RWKV: Reinventing RNNs for the Transformer Era): a softmax-free, attention-free sequence mixer that is the defining recurrence of RWKV-4. Over a `k
`TNNetCrossWKV`	2D: receptance source (SeqLen x 1 x Depth) + key\|value source (SeqLen x 1 x 2·Depth)	A two-source variant of `TNNetWKV` that generalises the RWKV-4 WKV recurrence to a separate key/value stream, exactly as `TNNetCrossAttention` generalises self-attention's packed `Q\|K\|V` to two sources. `TNNetWKV` splits its OWN input into the `k\|v` pair driving the state, so the memory it accumulates and the stream reading it are ONE sequence. `TNNetCrossWKV` instead reads `key\|value` from a SEPARATE `KeyValueSource` (depth `2·Depth`) rather than from the receptance/query stream (`PrevLayer`, depth `Depth`), making cross-WKV / decode-time external memory expressible. Per channel/timestep it runs the EXACT log-space-stable RWKV-v4 kernel (`wkv_t = (a_{t-1}+e^{u+k_t}v_t)/(b_{t-1}+e^{u+k_t})`, `a_t=e^{-w}a_{t-1}+e^{k_t}v_t`, per-channel learnable `w=softplus(w_raw)` + bonus `u`, running-max stabiliser) with `k,v` from the key\|value source and a `sigmoid(r_t)` receptance gate from the query source: `y_t = sigmoid(r_t)·wkv_t`. The key\|value source index serializes like `TNNetConcat`/`TNNetCrossAttention` (round-trips through `SaveToString`/`LoadFromString`); backward is exact coupled BPTT folding `dL/dk,dL/dv` into the key\|value source, `dL/dr` into the receptance source, and `dL/dw,dL/du` per channel (input grads into BOTH sources + weight grads numerically checked). The default equal-seqlen contract keeps both sources the same length (read-out at `t` uses the state through the kv source up to `t`). An opt-in asymmetric / full-context mode (`TNNetCrossWKV.Create(KeyValueSource, {Asymmetric:=}true)`, round-trips via `FStruct[1]`) summarises the WHOLE key\|value stream once into a final per-channel state and lets the receptance/query stream be a different length `QSeqLen <> KVSeqLen`: true permuted associative recall rather than a position-aligned copy (every query reads the same full-context summary `A/B`; the bonus `u` is unused on this path). Created with `TNNetCrossWKV.Create(KeyValueSource)` (equal-seqlen) or `TNNetCrossWKV.Create(KeyValueSource, true)` (asymmetric). See `examples/CrossWKV/`, a cross-copy task where the two-source arm hits 100% exact recall while a memory-blind single-source WKV stays at the chance floor.
`TNNetGatedLinearAttention`	2D (SeqLen x 1 x Depth)	Gated Linear Attention (GLA) (Yang et al. 2023, Gated Linear Attention Transformers with Hardware-Efficient Training, arXiv:2312.06635): a matrix-state linear-attention recurrence whose defining novelty is a data-dependent PER-CHANNEL (vector) diagonal forget gate `alpha_t = sigmoid(W_a·x_t)` that row-scales a 2-D outer-product memory `S` (`d×d`): `S_t[d,e] = alpha_t[d]·S_{t-1}[d,e] + k_t[d]·v_t[e]`, read-out `y_t = q_t^T S_t`, causal left-to-right scan. This is the only member of the linear-attention family with an input-dependent vector forget gate on a full outer-product matrix state — the mechanism modern Gated-DeltaNet / Mamba-2 / RWKV-6 build on. Distinct from its neighbours: `TNNetWKV` (fixed-learned per-channel decay, not input-dependent), `TNNetRetention` (single scalar `gamma`), `TNNetMLSTMCell` (scalar exp gates + running-max), `TNNetDeltaNet` (scalar `beta` write gate, no multiplicative forget), `TNNetSelectiveSSM` (gates a diagonal vector state, not a matrix). Storage is a five-neuron bank `W_q/W_k/W_v/W_a` (Depth×Depth) + gate bias `b_a` (Depth-long); keys are L2-normalized (DeltaNet idiom) and queries `1/sqrt(d)`-scaled, with `FStruct[0..1]=d_k/d_v` (square in v1). Forward is the exact per-token scan; backward is exact BPTT carrying the full `dL/dS` (the gate row-scales the carry) — input + all four weight sets numerically gradient-checked. Asserts `SizeY=1`; output shape == input. Created with `TNNetGatedLinearAttention.Create()` (Depth inferred); the composite `TNNet.AddGatedLinearAttention` builder wraps it as a drop-in time-mixing block with `TNNetTokenShift` + a per-token input projection + a sigmoid receptance gate + output projection (mirroring `AddRWKVTimeMix`). See `examples/GatedLinearAttention/` for an overwrite key→value recall demo contrasting GLA vs `TNNetDeltaNet` vs `TNNetRetention`.
`TNNetTestTimeTraining`	2D (SeqLen x 1 x Depth)	The Test-Time Training (TTT) sequence mixer (Sun et al. 2024, Learning to (Learn at Test Time): RNNs with Expressive Hidden States, arXiv:2407.04620). The recurrent hidden state is itself a small model `W` whose weights are updated by one explicit gradient-descent step on a self-supervised reconstruction loss at every timestep, so the scan literally trains an inner net as it reads. Per token: `k_t,v_t,q_t = theta_K/V/Q · x_t`; inner loss `ell_t = ½‖W(k_t)−v_t‖²`; TTT update `W_t = W_{t−1} − eta·∇_W ell_t` (learnable `eta = softplus(eta_raw)`); read-out `y_t = W_t(q_t)`. Two inner-model variants behind one `FStruct[1]` flag: TTT-Linear (`W` a matrix; rank-1 MSE-gradient step — closely related to `TNNetDeltaNet` but with a learnable per-layer inner LR `eta` and a raw, un-normalized key instead of DeltaNet's sigmoid gate + L2-key), and TTT-MLP (`W = W2·GeLU(W1·)`, a genuine non-linear fast-weight update a single matrix cannot express — the headline novelty). Because the forward already runs an inner backward to form `∇_W ell_t`, the outer backward is exact second-order BPTT through the inner update (a Hessian-vector product for the MLP arm, the same 2nd-inner-tape pattern as `TNNetHamiltonianCell`); input / view-projection / `eta` gradients are numerically gradient-checked for both arms. Asserts `SizeY=1`; output shape == input. Created with `TNNetTestTimeTraining.Create(Variant, Hidden)` (`Variant` 0=Linear, 1=MLP). See `examples/TestTimeTraining/`.
`TNNetTitansMemory`	2D (SeqLen x 1 x Depth)	The Titans test-time neural long-term memory sequence mixer (Behrouz et al. 2024, Titans: Learning to Memorize at Test Time, arXiv:2501.00663), the Memory-as-Context (MAC) leaf-layer variant. Like `TNNetTestTimeTraining` the hidden state is a small inner MLP `M(z)=W2·GeLU(W1·z)` gradient-descended at inference on the per-token associative loss `½‖M(k_t)−v_t‖²`, but Titans adds the two headline mechanisms TTT lacks: (a) a momentum / "surprise" state `S_t = η⊙S_{t−1} − θ⊙∇_t` so a surprising token keeps writing for several steps, and (b) a data-dependent forget gate `M_t = (1−α_t)⊙M_{t−1} + S_t` (`α_t = sigmoid(α_raw + W_α x_t)` per channel) that adaptively erases stale memory. Gates are per-output-channel (W2 row `o` ↦ gate`[o]`, W1 row `j` ↦ gate`[j mod Depth]`); `η = sigmoid(η_raw)`, `θ = sigmoid(θ_raw)` are learnable per-channel scalars; read-out `y_t = M_t(q_t)`. Distinct from every sibling: `TNNetTestTimeTraining` (plain SGD step, no momentum, no forgetting), `TNNetDeltaNet`/`TNNetGatedLinearAttention` (matrix-state delta/gated-linear recurrences, not a gradient step on a non-linear MLP), `TNNetNTMMemory` (content-addressed bank). Storage: `theta_K/V/Q` (Depth×Depth), `eta_raw/theta_raw/alpha_raw` (Depth), `W_alpha` (Depth×Depth), initial inner fast-weights `W1_0` (H×Depth) / `W2_0` (Depth×H). Forward is the explicit scan (each step runs an inner backward to form `∇_t`); outer backward is exact second-order BPTT (a GeLU Hessian-vector product) carrying `dL/dW1_t`, `dL/dW2_t` and `dL/dS_t` right-to-left through the coupled momentum/forget adjoint scans (input + all weight sets numerically gradient-checked, max-abs err ≈5e-4). Asserts `SizeY=1`; output shape == input. Created with `TNNetTitansMemory.Create(Hidden)` (Depth inferred). See `examples/TitansMemory/`.
`TNNetImplicitLongConv`	2D (SeqLen x 1 x Depth)	The Hyena Hierarchy implicit long convolution (Poli et al. 2023): a causal depthwise convolution whose per-channel filter spans the WHOLE sequence (length `SeqLen`), yet is generated IMPLICITLY by a tiny shared MLP over positional features and multiplied by a learnable exponential-decay window, so the parameter count does NOT grow with `SeqLen`. This is distinct from `TNNetCausalConv1D` (a SHORT fixed-length kernel learned directly) and `TNNetDiagonalSSM` (a per-channel linear recurrence) — neither parametrizes a full-length filter from positions. Forward is the direct `O(L^2)` causal time-domain sum (FFT `O(L log L)` is a documented stretch goal); backward is analytic into both the input and the implicit-MLP/decay weights. Initialised near-identity (small filter) so the block starts close to a no-op. Created with `TNNetImplicitLongConv.Create()`; the order-2 builder `TNNet.AddHyenaOperator(d_model, Hidden)` assembles the data-controlled gated Hyena recurrence around it. See `examples/HyenaOperator`.
`TNNetSpatialGatingUnit`	2D (SeqLen x 1 x Depth)	The gMLP Spatial Gating Unit (Liu et al. 2021, Pay Attention to MLPs): an attention-free token mixer with no queries/keys/values and no per-pair dot product. It splits the `Depth` channels in half into `u` and `v`, applies one learned, content-independent `SeqLen x SeqLen` weight matrix `W` (plus per-position bias) across the sequence axis of `v` (`v'[n] = bias[n] + Σ_m W[n,m]·v[m]`, the same static spatial projection for every channel), and gates multiplicatively `out[n] = u[n]·v'[n]`, halving the output `Depth`. `W` is fixed after training, so the mix is data-independent — a distinct primitive, not a re-skin of attention. The `SeqLen x SeqLen` matrix makes it fixed-length: `SeqLen` is pinned at construction and `SetPrevLayer` rejects a mismatched `SizeX`, `SizeY<>1`, or odd `Depth`. `W` is initialised near-identity / small so the block starts close to a no-op. Created with `TNNetSpatialGatingUnit.Create(SeqLen)`; builders `TNNet.AddSpatialGatingUnit(SeqLen)` and the full block `TNNet.AddgMLPBlock(SeqLen, d_model, d_ffn)` (channel-MLP up → split+SGU → channel-MLP down, residual; the gMLP-paper LayerNorms bound the gate). See `examples/SpatialGatingUnit`.
`TNNetLIFNeuron`	2D (T x 1 x Depth)	None
`TNNetALIFNeuron`	2D (T x 1 x Depth)	Opt-in (per-channel V_th/leak)
`TNNetGroupConvP4`	3D (SizeX x SizeY x Depth)	The p4 group-equivariant lifting convolution (Cohen & Welling 2016, Group Equivariant Convolutional Networks, arXiv:1602.07576) — the first layer here that is rotation-equivariant by construction, not merely measured after the fact by `TNNet.EquivarianceReport`. One learned `K×K` kernel bank per feature is shared across the 4 rotations of the C4 group (rot-{0,90,180,270}): the layer convolves the input with the 4 rotated views of each filter and stacks the responses along a new 4-fold orientation sub-axis of `Depth` (output `Depth = 4·FeaturesCount`, channel `= co·4 + r`), so a 90° rotation of the input cyclically permutes the orientation channels instead of scrambling them. Distinct from the parameter-free flip involutions (`TNNetFlipX/FlipY`, data-augmentation primitives) and from the hypercomplex/CondConv layers (which share weights across a different algebra, not a spatial symmetry group). Subclasses `TNNetConvolutionLinear` and reuses the im2col/AddArea path, materialising the 3 rotated kernel views from the one trained bank; backward folds the 4 orientation gradients back onto the single shared kernel (the rotation-tied weight-gradient sum — exactly gradient-checked, alongside an exact-equivariance forward test to machine precision). v1 is the lifting (plane → C4 field) conv; the full p4m group (+reflections) and steerable/SO(2) harmonics are noted follow-ups. Created with `TNNetGroupConvP4.Create(FeaturesCount, FeatureSize, ...)`. See `examples/GroupEquivariantMNIST/`.
`TNNetGroupPoolP4`	3D (SizeX x SizeY x 4·F → SizeX x SizeY x F)	The companion group-pool head for `TNNetGroupConvP4`: max- (default) or mean-reduces over the 4 orientation channels of each feature, collapsing a C4 feature field back to a rotation-invariant `(SizeX, SizeY, FeaturesCount)` map for a classifier tail. Closes the lift → equivariant-process → invariant-readout loop. Created with `TNNetGroupPoolP4.Create()`.
`TNNetMinGRU`	2D (SeqLen x 1 x Depth)	The minimal GRU of Feng et al. 2024 (Were RNNs all we needed?, arXiv:2410.01201). A stripped-down GRU whose update gate and candidate depend on the current input only — `z_t = sigmoid(W_z·x_t + b_z)`, `h̃_t = W_h·x_t + b_h`, `h_t = (1 − z_t)⊙h_{t-1} + z_t⊙h̃_t` — so the gates carry no `h_{t-1}` feedback. That is the whole point: removing the recurrent dependence in the gates turns the cell into a linear scan that is fully parallelizable over the time axis (the paper's headline), distinguishing it from the xLSTM family (`TNNetSLSTMCell`/`TNNetMLSTMCell`), whose exp-gates do read `h_{t-1}`. Four weight tensors (`W_z`,`W_h` Depth×Depth, `b_z`,`b_h` Depth). Forward is a left-to-right scan; backward is exact BPTT (input + all weight grads numerically gradient-checked). Created with `TNNetMinGRU.Create()` (Depth inferred). See `examples/MinimalRNN/`.
`TNNetMinLSTM`	2D (SeqLen x 1 x Depth)	The minimal LSTM sibling from the same paper (Feng et al. 2024). Like `TNNetMinGRU`, the forget/input gates depend on `x_t` only — `f_t = sigmoid(W_f·x_t + b_f)`, `i_t = sigmoid(W_i·x_t + b_i)`, `h̃_t = W_h·x_t + b_h` — then are normalized `f'_t = f_t/(f_t+i_t)`, `i'_t = i_t/(f_t+i_t)` and combined `h_t = f'_t⊙h_{t-1} + i'_t⊙h̃_t`, so the cell is parallelizable for the same reason. Six weight tensors (`W_f`,`W_i`,`W_h` Depth×Depth, `b_f`,`b_i`,`b_h` Depth). Backward differentiates the coupled `f/(f+i)` gate normalization exactly (input + all weight grads numerically gradient-checked). Created with `TNNetMinLSTM.Create()` (Depth inferred). See `examples/MinimalRNN/`.
`TNNetLRU`	2D (SeqLen x 1 x Depth)	The Linear Recurrent Unit (Orvieto et al. 2023, Resurrecting Recurrent Neural Networks for Long Sequences, arXiv:2303.06349): a stable, complex-diagonal linear recurrence. Each output channel pairs with its own complex eigenvalue parameterized for guaranteed stability — `
`TNNetLegendreMemoryUnit`	2D (SeqLen x 1 x Depth)	The Legendre Memory Unit (Voelker et al. 2019, Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks). Carries an order-`N` memory vector `m_t` per channel that holds the coefficients of a shifted-Legendre-polynomial projection of a sliding window of the input — an orthogonal-polynomial compression that is genuinely different from the exponential-decay kernels of the diagonal SSMs. Driven by the dense, structured, NON-diagonal HiPPO-LegS transition matrix `A_ij = (2i+1)·(−1 if i<j else (−1)^(i−j+1))` and input vector `B_i = (2i+1)·(−1)^i`, discretized once at build time (Euler step `Abar = I + (1/theta)·A`, `Bbar = (1/theta)·B`) so the HiPPO matrices need no gradient. Forward sweeps `m_t = Abar·m_{t−1} + Bbar·u_t` along the time axis; a learned per-channel read-out (one weight neuron, `Depth×Order`) collapses the `N` coefficients to one value per step (output shape == input). Memory cost is `N` numbers regardless of window length. Distinct from `TNNetDiagonalSSM` (real diagonal), `TNNetLRU` (complex diagonal), and the matrix-memory linear-attention cells (`TNNetGatedLinearAttention`/`TNNetDeltaNet`/`TNNetWKV`) — the non-diagonal Legendre basis fills a real structural gap. Backward is a clean right-to-left adjoint scan plus the read-out weight gradient (input + weight grads numerically gradient-checked; `Abar` checked against a brute-force reference). `theta` is a fixed build-time constant in v1 (learnable per-channel `theta` is a logged follow-up). Created with `TNNetLegendreMemoryUnit.Create(Order, theta)`. See `examples/LegendreMemoryUnit/`.
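
As an illustration of the input-only gating described for `TNNetMinGRU`, here is a minimal sketch of the recurrence `h_t = (1 − z_t)⊙h_{t-1} + z_t⊙h̃_t`. It is written in plain Python only to stay independent of the library; the function names, the zero initial state and the list-of-rows weight layout are assumptions for illustration, not the layer's actual implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W*x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def min_gru(xs, Wz, bz, Wh, bh):
    """Left-to-right minGRU scan over a list of input vectors xs."""
    h = [0.0] * len(bz)  # assumption: zero initial hidden state
    outputs = []
    for x in xs:
        z = [sigmoid(v) for v in affine(Wz, x, bz)]  # gate reads x_t only
        cand = affine(Wh, x, bh)                     # candidate reads x_t only
        h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        outputs.append(h)
    return outputs


# Depth 1 with zero gate weights, so z_t = sigmoid(0) = 0.5 at every step.
hs = min_gru([[1.0], [1.0]], Wz=[[0.0]], bz=[0.0], Wh=[[1.0]], bh=[0.0])
print(hs)
```

Because `z_t` and `h̃_t` never read `h_{t-1}`, both can be computed for all timesteps up front; only the final blend is sequential, which is what makes the cell a linear scan.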

MLP-Mixer block builder. TNNet.AddMLPMixerBlock(TokensHidden, ChannelsHidden, ActFn=TNNetReLU) assembles a complete all-MLP Mixer block (Tolstikhin et al. 2021, MLP-Mixer: An all-MLP Architecture for Vision) over a (Tokens, 1, Channels) sequence in one call — an attention-free alternative to AddTransformerEncoderBlock that mixes information with two MLPs along different axes, each wrapped in a pre-LayerNorm residual: (1) a token-mixing MLP that TNNetTransposeXD-swaps the token/channel axes, runs a pointwise Linear(TokensHidden) → ActFn → Linear(Tokens) across the (now-Depth) token axis — the same MLP shared over every channel — then transposes back; followed by (2) a channel-mixing MLP Linear(ChannelsHidden) → ActFn → Linear(Channels) applied independently per token. Every projection is a pointwise (1×1) convolution so the token axis is preserved and both sub-blocks are shape-preserving (Tokens and Channels are read from the current last layer), so the output stays (Tokens, 1, Channels) and blocks stack. See examples/MLPMixer/.

Gated Linear Attention block builder. TNNet.AddGatedLinearAttentionBlock(d_ff, PreNorm=true, NormClass=nil) assembles a complete transformer-style block over a (SeqLen, 1, d_model) sequence in one call — an attention-free linear-recurrence alternative to AddTransformerEncoderBlock that swaps the multi-head self-attention arm for the GLA time-mixing builder. It composes two pre-LayerNorm (or post-LayerNorm with PreNorm=false) residual sub-blocks with inline TNNetSum residuals: (1) a GLA time-mixing sub-block (LayerNorm → AddGatedLinearAttention → residual sum), wrapping the TNNetGatedLinearAttention matrix-state per-channel-gated linear-attention recurrence; followed by (2) a SwiGLU feed-forward sub-block (LayerNorm → TNNetPointwiseConvLinear(2*d_ff) → TNNetSwiGLU → TNNetPointwiseConvLinear(d_model) → residual sum). d_model is inferred from the input depth and every projection is a pointwise (1×1) convolution, so the output stays (SeqLen, 1, d_model) and blocks stack. Composes existing layers (no new class). See examples/GatedLinearAttentionBlock/, where a 3-block tower reaches 100% exact recall vs 81% for the bare mixer on an overwrite key→value recall task.

Fourier Neural Operator block builders. TNNet.AddFourierNeuralOperator1D(OutDepth, Modes, ActFn=nil) and TNNet.AddFourierNeuralOperator2D(OutDepth, ModesX, ModesY, ActFn=nil) wrap the TNNetSpectralConv1D/TNNetSpectralConv2D leaf layers in the canonical FNO Fourier layer (Li et al. 2021) y = ActFn(SpectralConv(x) + W₁ₓ₁(x)): a global spectral branch in mode space summed with a local pointwise (1×1 conv, TNNetPointwiseConvLinear) residual branch in grid space, both mapping Depth → OutDepth from the same input, then an optional activation (ActFn=nil = none). Stack these blocks to build a resolution-invariant PDE-surrogate operator. Note: TNNetSpectralConv2D's radix-2 FFT requires power-of-two SizeX/SizeY. See examples/FourierNeuralOperator/ and examples/SpectralConv2D/.

Trainable Bias (Shift) and Multiplication (Scaling) per Cell or Channel Allowing Faster Learning and Convergence

When TNNetCellBias is added after convolutional layers, it introduces a trainable bias to each output cell of the convolutional layer. This can have several effects on the neural network:

Fine-tuning: TNNetCellBias allows for fine-tuning of the network's output by adding a learnable bias to each cell. This can help the network adjust its predictions more precisely.
Increased flexibility: by adding a bias to each cell individually, the network gains additional parameters to optimize, potentially allowing it to learn more complex representations.
Improved learning speed: placing this layer before and after convolutions can speed up learning. This is because it gives the network an additional way to adjust its output, potentially making it easier to find optimal solutions.
Parameter increase: adding TNNetCellBias increases the number of trainable parameters in the network. While this can be beneficial for learning, it also increases the model's complexity and the risk of overfitting.

The effect of adding TNNetCellBias after convolutional layers depends on the architecture and the problem at hand. It can speed up learning and make the network more flexible, but you should validate its impact on your own use case. TNNetChannelBias adds a trainable bias to each channel in the output; it works like TNNetCellBias but on entire channels instead of individual cells.

Layer Name	Input/Output Dimensions	Description
`TNNetCellBias`	1D, 2D, or 3D	Trainable bias (shift) for each cell.
`TNNetCellMul`	1D, 2D, or 3D	Trainable multiplication (scaling) for each cell.
`TNNetChannelBias`	1D, 2D, or 3D	Trainable bias (shift) for each channel.
`TNNetChannelMul`	1D, 2D, or 3D	Trainable multiplication (scaling) for each channel.
`TNNetGatedResidual`	1D, 2D, or 3D	Per-channel learnable residual gate: `y[x,y,d] = alpha[d] * x[x,y,d]`, with one learnable scalar `alpha` per depth channel (init 0.0 by default). The per-channel generalisation of `TNNetReZero` (single shared scalar). Wrap a residual branch with it (`Sum([Sublayer, PrevLayer])`) so each channel starts as the identity and opens its gate independently during training. Created with `TNNetGatedResidual.Create()` or `TNNetGatedResidual.Create(initialAlpha)`. The `TNNet.AddGatedResidual(pSublayers)` builder wires this pattern in one call: `y = x + GatedResidual(Sublayer(x))` (no normalization — the per-channel gate is the only added parameter and starts the branch at zero contribution).
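
The gated residual pattern `y = x + GatedResidual(Sublayer(x))` can be sketched for a single cell's depth vector as follows. This is plain Python with hypothetical names, used only to show why a zero-initialised `alpha` starts the block as the identity; it does not call the library.

```python
def gated_residual(x, branch, alpha):
    """y[d] = x[d] + alpha[d] * branch[d] for one cell's depth vector."""
    return [xd + a * bd for xd, bd, a in zip(x, branch, alpha)]


x = [1.0, -2.0, 3.0]
branch = [10.0, 20.0, 30.0]  # stand-in for Sublayer(x)

closed = gated_residual(x, branch, [0.0, 0.0, 0.0])  # all gates at the default init
partly = gated_residual(x, branch, [0.0, 0.1, 0.0])  # only channel 1 has opened
print(closed, partly)
```

With every `alpha[d]` at zero the output equals the input, and each channel can then open its own gate independently as training proceeds.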

Composite helper: TNNet.AddSEBlock(InputLayer, ReductionRatio) wires the standard Squeeze-and-Excitation pattern (TNNetAvgChannel -> TNNetFullConnectReLU(C/r) -> TNNetFullConnectSigmoid(C) -> TNNetChannelMulByLayer) onto an existing branch. See examples/SEBlockCifar/.
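
The Squeeze-and-Excitation data flow above can be sketched numerically as below. This is a library-independent Python illustration: the function name, the omission of biases and the tiny hand-picked weights are assumptions made for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def se_block(cells, w_reduce, w_expand):
    """cells: list of per-position channel vectors. Biases are omitted."""
    channels = len(cells[0])
    # Squeeze: global average per channel.
    squeeze = [sum(c[d] for c in cells) / len(cells) for d in range(channels)]
    # Excitation: reduce -> ReLU -> expand -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeeze))) for row in w_reduce]
    scale = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Rescale every position, channel by channel.
    out = [[c[d] * scale[d] for d in range(channels)] for c in cells]
    return out, scale


cells = [[1.0, 2.0], [3.0, 4.0]]  # 2 positions, C = 2
w_reduce = [[1.0, 1.0]]           # C -> C/r with r = 2
w_expand = [[0.0], [1.0]]         # C/r -> C
out, scale = se_block(cells, w_reduce, w_expand)
print(scale, out)
```

Each channel receives one scale in (0, 1) that is shared by all spatial positions, which is the per-channel multiplication step at the end of the chain.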

Composite helper: TNNet.AddCBAM(InputLayer, ReductionRatio=16, SpatialKernelSize=7) wires the Convolutional Block Attention Module (Woo et al. 2018) over a conv feature map in one call — channel attention then spatial attention, shape-preserving. Channel attention pools the map with both global average (TNNetAvgChannel) and global max (TNNetMaxChannel), runs each through a reduce->ReLU->expand MLP, sums them, sigmoids, and rescales per channel (TNNetChannelMulByLayer) — the dual-pool extension of AddSEBlock. Spatial attention then builds a 2-channel descriptor (pointwise C->2 conv), applies a padded SpatialKernelSize conv -> sigmoid gate, and rescales per spatial position. v1 simplifications (vs the paper): two separate channel MLPs rather than one shared MLP, and a learned C->2 spatial descriptor rather than fixed avg/max-over-depth. Keep inputs square (TNNetMaxChannel assumes SizeX == SizeY). See examples/CBAMAttention/.

Composite helper: TNNet.AddMixtureOfExperts(InputLayer, NumExperts, ExpertHiddenDim) wires a soft/dense Mixture-of-Experts feed-forward block (Shazeer et al. 2017) in one call, shape-preserving so it is a drop-in FFN replacement (d_model = InputLayer.Output.Depth). A token-wise gating network (TNNetPointwiseConvLinear(NumExperts) -> TNNetSoftMax) produces per-expert weights g; NumExperts parallel shape-preserving expert MLPs (TNNetPointwiseConvReLU(ExpertHiddenDim) -> TNNetPointwiseConvLinear(d_model)) each compute E_e(x); the block returns Sum_e g[e] * E_e(x). Each scalar gate weight is sliced with TNNetSplitChannels(e,1), broadcast across d_model with TNNetDeepConcat.Replicate, and cell-multiplied into the expert output with TNNetCellMulByCell (the same broadcast-multiply mechanism AddCBAM uses), then summed with TNNetSum. v1 is a soft, dense gate: every expert runs on every token and the outputs are blended — it trains end-to-end with existing layers (no new gradient code). See examples/MixtureOfExperts/.

Composite helper: TNNet.AddTopKMixtureOfExperts(InputLayer, NumExperts, ExpertHiddenDim, TopCnt, out AuxLossHead, AuxCoeff=0.01) is the sparse hard top-k sibling of AddMixtureOfExperts. The gate softmax is routed through the new TNNetTopKGate(TopCnt) — which keeps the TopCnt largest gate weights per token, zeroes the rest, and renormalizes the survivors to sum to 1 (so each token is a convex blend of only its top-TopCnt experts), with an exact fused mask+renorm Jacobian backward dL/dg_j = (1/s)(dL/dy_j − Σ_i dL/dy_i·y_i) over the surviving set. A load-balancing auxiliary loss head TNNetLoadBalanceLoss(TopCnt, Coeff) is attached as a second output leaf and returned via out AuxLossHead: it implements the Switch-Transformer loss (Fedus et al. 2021) L_aux = Coeff·E·Σ_i f_i·P_i where E = NumExperts, f_i is the (stop-gradient) fraction of tokens whose top-TopCnt set touches expert i and P_i is expert i's mean gate probability, so the gradient flows through P_i only (dL_aux/dg_t[i] = Coeff·E·f_i/T) and pushes the gate to spread load instead of collapsing onto one expert. v1 still evaluates all experts then masks (correct, but not yet compute-sparse); Gumbel-noise gating and true sparse dispatch are logged follow-ups. See examples/TopKMoE/.
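
A minimal sketch of the two pieces described above, the top-k mask with renormalization and the load-balancing loss `L_aux = Coeff·E·Σ_i f_i·P_i`, is given below. It is plain Python with hypothetical helper names; tie-breaking by lower index is an assumption, and no gradient code is shown.

```python
def top_k_indices(gate, k):
    """Indices of the k largest gate weights (ties go to the lower index)."""
    order = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)
    return set(order[:k])


def top_k_gate(gate, k):
    """Keep the k largest weights, zero the rest, renormalize to sum to 1."""
    keep = top_k_indices(gate, k)
    s = sum(gate[i] for i in keep)
    return [gate[i] / s if i in keep else 0.0 for i in range(len(gate))]


def load_balance_loss(gates, k, coeff):
    """coeff * E * sum_i f_i * P_i over a batch of per-token gate vectors."""
    tokens, experts = len(gates), len(gates[0])
    f = [0.0] * experts  # fraction of tokens whose top-k set includes expert i
    p = [0.0] * experts  # mean gate probability of expert i
    for gate in gates:
        keep = top_k_indices(gate, k)
        for i in range(experts):
            if i in keep:
                f[i] += 1.0 / tokens
            p[i] += gate[i] / tokens
    return coeff * experts * sum(fi * pi for fi, pi in zip(f, p))


routed = top_k_gate([0.5, 0.3, 0.2], k=2)
aux = load_balance_loss([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]], k=1, coeff=0.01)
print(routed, aux)
```

In the second call no token routes to expert 1, so the loss is slightly above the value `coeff` that a perfectly uniform top-1 routing would give.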

Composite helper: TNNet.AddMixtureOfDepths(InputLayer, BlockBuilder, Capacity) wires a Mixture-of-Depths conditional-compute block (Raposo et al. 2024) in one call, shape-preserving so it is a drop-in trunk wrapper. A per-token router (TNNetPointwiseConvLinear(1) over Depth -> TNNetSigmoid, pointwise so the token axis is preserved) scores each of the SeqLen positions; the top-Capacity positions are selected along the sequence axis (TransposeXD -> TNNetTopK(Capacity) -> TransposeXD, since TNNetTopK masks over Depth), their router weight is broadcast across d_model and cell-multiplied into the wrapped block's output (keeping the router on the gradient path through the hard top-k), and the result is added residually so non-selected positions pass through unchanged. With Capacity = SeqLen it is bit-for-bit equal to the wrapped block alone (the degenerate correctness anchor). A load-balancing auxiliary loss and a Gumbel/learned-threshold router are logged follow-ups. See examples/MixtureOfDepths/.

Composite helper: TNNet.AddReversibleBlock(InputLayer, HiddenDim) wires a RevNet-style reversible additive-coupling block in one call: it splits the input depth into halves x1|x2 and produces y1 = x1 + F(x2), y2 = x2 + G(y1), output = Concat(y1, y2) (shape-identical to the input), where F/G are small pointwise residual functions. The defining property is exact analytic invertibility — x2 = y2 - G(y1), x1 = y1 - F(x2) recovers the input without inverting F/G. See examples/ReversibleBlock/.
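
The invertibility claim can be checked with a short sketch. The `F` and `G` below are arbitrary stand-ins (assumptions chosen for illustration, not the builder's pointwise sub-blocks); note that `G` is not even injective, yet the input is still recovered.

```python
import math


def F(v):
    return [math.tanh(a) for a in v]


def G(v):
    return [0.5 * a * a for a in v]  # not invertible on its own


def forward(x1, x2):
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2


def inverse(y1, y2):
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2


x1, x2 = [0.3, -1.2], [0.7, 2.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1, r2)
```

The inverse only ever evaluates `F` and `G` in the forward direction, which is why the coupling needs no inverse of either function.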

TNNetAffineCoupling is the library's first exact-likelihood normalizing-flow primitive (RealNVP, Dinh et al. 2016, arXiv:1605.08803; Glow, Kingma & Dhariwal 2018, arXiv:1807.03039) — a bijection whose Jacobian log-determinant is available in closed form, so a stack of these (interleaved with fixed channel permutations) trains by exact maximum likelihood under a unit-Gaussian base. Distinct from the memory-saving AddReversibleBlock (a RevNet recompute trick with NO tractable Jacobian) and from TNNetMixtureDensity (a density head that is not invertible). It splits the Depth axis into two contiguous halves a|b; one half passes through unchanged and conditions an affine map of the other: forward y_a = x_a, y_b = x_b·exp(s) + t; inverse (sampling) x_b = (y_b − t)·exp(−s), where [s_pre; t] = W·x_a + b is a single per-position linear conditioner over the unchanged half and s = clamp·tanh(s_pre/clamp) is the Glow log-scale clamp for stability. The transformed half is chosen by the pTransformSecond constructor flag so stacked couplings update every channel. log|det J| = Σ s is exposed as the public read-only LogDetJacobian, so a flow's NLL is loss = 0.5·||z||² − Σ_couplings(LogDetJacobian); backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0). FStruct[0]=transform-second flag, FStruct[1]=inverse flag, FFloatSt[0]=s clamp; all round-trip via Save/Load. Created with TNNetAffineCoupling.Create() or TNNetAffineCoupling.Create(pTransformSecond, pInverse, pClamp). See examples/NormalizingFlow/ (fits a 2-D two-moons density by maximum likelihood and samples new points back through the inverse flow).
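
The forward map, its inverse and the log-determinant described above can be sketched for a single position as follows. This is plain Python with hypothetical names; the clamp value of 2.0, the weight layout (first half of the conditioner output is `s_pre`, second half is `t`) and the sample weights are assumptions for illustration.

```python
import math


def conditioner(x_a, W, b, clamp):
    """[s_pre; t] = W*x_a + b, then s = clamp * tanh(s_pre / clamp)."""
    pre = [sum(w * v for w, v in zip(row, x_a)) + bi for row, bi in zip(W, b)]
    half = len(pre) // 2
    s = [clamp * math.tanh(v / clamp) for v in pre[:half]]
    t = pre[half:]
    return s, t


def forward(x_a, x_b, W, b, clamp=2.0):
    s, t = conditioner(x_a, W, b, clamp)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    return x_a, y_b, sum(s)  # log|det J| is the sum of the log-scales


def inverse(y_a, y_b, W, b, clamp=2.0):
    s, t = conditioner(y_a, W, b, clamp)  # y_a == x_a, so s and t are identical
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a, x_b


W = [[1.0], [-1.0]]  # row 0 produces s_pre, row 1 produces t
b = [0.0, 0.25]
y_a, y_b, log_det = forward([0.5], [2.0], W, b)
x_a_rec, x_b_rec = inverse(y_a, y_b, W, b)
print(y_b, log_det, x_b_rec)
```

Because the conditioning half passes through unchanged, the inverse can recompute exactly the same `s` and `t`, so the conditioner itself never has to be inverted.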

TNNetInvertible1x1Conv is Glow's learnable invertible 1×1 convolution (Kingma & Dhariwal 2018, arXiv:1807.03039 sec. 3.2) — the channel-mixing companion to TNNetAffineCoupling. It applies a learnable C×C matrix W per spatial position across the Depth axis, y[x,y,:] = W·x[x,y,:], a trainable generalization of the fixed channel permutation that couplings rely on for mixing. Distinct from TNNetHouseholderLinear (exactly orthogonal → log-det identically 0, zero volume change) and from TNNetPointwiseConvLinear (generic 1×1 conv with no tractable inverse or log-det). W is parametrized by its LU decomposition W = P·L·(U + diag(s)) with P a fixed permutation chosen at init (regenerated from a stored seed), L unit-lower-triangular, U strictly-upper-triangular, and s the log-scale vector; this makes the per-position log-det the cheap Σ log|s| (O(C)) instead of an O(C³) determinant. LogDetJacobian returns SizeX·SizeY·Σ log|s| for the current forward, composing additively with the couplings' log-dets; backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0). The map is exactly invertible by triangular solves (no matrix inverse) — pass pInverse=true for the sampling direction, reusing the same L/U/s. FStruct[0]=C, FStruct[1]=permutation seed, FStruct[2]=inverse flag; L/U/s round-trip via the per-neuron save. Created with TNNetInvertible1x1Conv.Create() or TNNetInvertible1x1Conv.Create(pInverse, pPermSeed). See examples/NormalizingFlow/, which interleaves it with TNNetAffineCoupling (the real Glow step) and shows the learnable mixing reaching a higher mean log-likelihood than the fixed-permute baseline.
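
The cheap log-determinant can be checked on a 2×2 case. The sketch below is plain Python and assumes, consistent with the `Σ log|s|` formula, that `s` holds the diagonal entries themselves; the matrices are hand-picked illustrative values, not the layer's initialisation.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


P = [[0.0, 1.0], [1.0, 0.0]]              # fixed permutation
L = [[1.0, 0.0], [0.7, 1.0]]              # unit lower-triangular
s = [2.0, -0.25]                          # diagonal entries
U_plus_diag = [[s[0], 0.3], [0.0, s[1]]]  # strictly-upper U plus diag(s)

W = matmul(P, matmul(L, U_plus_diag))
det_W = W[0][0] * W[1][1] - W[0][1] * W[1][0]

log_det_dense = math.log(abs(det_W))              # direct determinant
log_det_cheap = sum(math.log(abs(v)) for v in s)  # O(C) shortcut
print(log_det_dense, log_det_cheap)
```

The permutation contributes a determinant of ±1 and the unit-triangular `L` contributes 1, so only the diagonal affects `log|det W|`.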

TNNetActNorm is Glow's data-dependent activation normalization (Kingma & Dhariwal 2018, arXiv:1807.03039 sec. 3.1) — the third and final Glow flow-step primitive, completing the canonical trio TNNetActNorm → TNNetInvertible1x1Conv → TNNetAffineCoupling. It is a per-channel invertible affine transform y[.,.,c] = s[c]·x[.,.,c] + b[c] with learnable scale s (stored as a log-scale, s = exp(logs), to stay strictly non-zero) and bias b. On the first forward minibatch it lazily data-dependent-initialises logs = −ln(std[c]) and b = −mean[c]/std[c] so that batch comes out per-channel ~0 mean / ~1 variance; an "initialised" flag in FStruct guards the init so it fires exactly once and never re-fires on reload — after which logs,b are ordinary trainable weights (identical behaviour at train and sample time, exactly the property a flow needs). Genuinely distinct from every existing normalizer (TNNetChannelStdNormalization, TNNetGroupNorm, TNNetInstanceNorm, …): those recompute statistics each forward and expose no inverse and no log-det, so they cannot sit inside a flow's exact-likelihood NLL. ActNorm's log-det is the cheap LogDetJacobian = SizeX·SizeY·Σ_c logs[c] (one term per channel), composing additively with the other two flow layers; backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0) and propagates to logs, b, and the input. Exactly invertible by x = (z − b)/s (pInverse=true, no solve needed). FStruct[0]=C, FStruct[1]=initialised flag, FStruct[2]=inverse flag; neuron 0 = logs, neuron 1 = b, both round-tripping via Save/Load. Created with TNNetActNorm.Create() or TNNetActNorm.Create(pInverse).
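
The data-dependent initialisation `logs = −ln(std[c])`, `b = −mean[c]/std[c]` can be sketched for one channel as follows. This is plain Python with hypothetical names; using the population standard deviation without an epsilon is an assumption of the sketch.

```python
import math


def actnorm_init(batch):
    """Data-dependent init for one channel from its first-batch values."""
    n = len(batch)
    mean = sum(batch) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in batch) / n)
    return -math.log(std), -mean / std  # (logs, b)


def actnorm_forward(x, logs, b):
    return math.exp(logs) * x + b


def actnorm_inverse(y, logs, b):
    return (y - b) / math.exp(logs)


batch = [1.0, 2.0, 3.0, 4.0]
logs, b = actnorm_init(batch)
ys = [actnorm_forward(x, logs, b) for x in batch]

out_mean = sum(ys) / len(ys)
out_var = sum((y - out_mean) ** 2 for y in ys) / len(ys)
recovered = [actnorm_inverse(y, logs, b) for y in ys]
print(out_mean, out_var, logs)
```

After this one-time initialisation `logs` and `b` are ordinary parameters, and each position contributes `logs` to the log-determinant for this channel.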

Composite helper: TNNet.AddNeuralODEBlock(InputLayer, HiddenDim, Steps) wires a continuous-depth (Neural ODE) residual block (Chen et al. 2018) in one call, shape-preserving so it is a drop-in replacement for a residual trunk (d_model = InputLayer.Output.Depth). A residual step x_{n+1} = x_n + f(x_n) is one explicit Euler step of dx/dt = f(x,t); this block replaces a stack of distinct residual blocks with one shared f integrated over Steps Euler sub-steps with fixed step h = 1/Steps. f is a shape-preserving pointwise sub-block over Depth (TNNetPointwiseConvReLU(HiddenDim) -> TNNetPointwiseConvLinear(d_model)); step 1 owns the only real weights and every later step reuses them via TNNetConvolutionSharedWeights, so the parameter count is independent of Steps (the "depth for free" property). Each step scales f's output by h (TNNetMulByConstant) and adds it residually (TNNetSum). An optional 4-arg overload AddNeuralODEBlock(InputLayer, HiddenDim, Steps, Method) selects the integrator via TNNetODEMethod = (odeEuler, odeMidpoint) — the midpoint (RK2) method evaluates the same shared f twice per step (k1 = f(y); k2 = f(y + (h/2)*k1); y := y + h*k2), improving accuracy with no extra parameters (the 3-arg form stays Euler, bit-for-bit unchanged). Training uses ordinary stored-activation backprop through the unrolled steps; the adjoint-sensitivity O(1)-memory backward is a logged follow-up. See examples/NeuralODE/ (which also renders an ASCII trajectory visualisation of the learned flow untangling two interleaving half-moons).
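
The difference between the Euler and midpoint integrators can be seen on a scalar toy problem. The sketch below is plain Python; the `integrate` function and the choice `f(y) = −y` (whose exact solution at `t = 1` is `e^{-1}`) are illustrative assumptions, not the builder's shared sub-block.

```python
import math


def integrate(f, y, steps, method="euler"):
    """Integrate dy/dt = f(y) from t = 0 to t = 1 with fixed step h = 1/steps."""
    h = 1.0 / steps
    for _ in range(steps):
        k1 = f(y)
        if method == "midpoint":
            k2 = f(y + 0.5 * h * k1)  # second evaluation of the same shared f
            y = y + h * k2
        else:
            y = y + h * k1
    return y


f = lambda y: -y
exact = math.exp(-1.0)
y_euler = integrate(f, 1.0, steps=4)
y_mid = integrate(f, 1.0, steps=4, method="midpoint")
print(abs(y_euler - exact), abs(y_mid - exact))
```

Both variants reuse one `f`, so the midpoint method costs an extra evaluation per step but no extra parameters.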

Composite helper: TNNet.AddDeepEquilibriumBlock(InputLayer, HiddenDim, MaxIters) wires a Deep Equilibrium block (Bai/Kolter/Koltun 2019) — the implicit cousin of AddNeuralODEBlock. Where Neural-ODE unrolls a fixed number of explicit Euler steps, a DEQ defines its output as the fixed point z* = f(z*; x) of a shape-preserving weight-tied map f, found by iterating z := f(z+x) from z_0 = 0 until the residual ||z_{k+1}-z_k|| falls below tolerance or a MaxIters cap (a data-dependent "adaptive depth", parameter count independent of the iteration count). The forward runs a damped, output-bounded Picard iteration; the backward is the tractable jacobian-free phantom gradient (Geng et al. 2021) — all iterates except the last are detached, so gradients flow through only the final f application (the exact implicit-function-theorem gradient is a logged follow-up). f reuses one TNNetDeepEquilibriumSharedConv (a weight-tied conv that rebuilds its cache each forward so every application is byte-identical, as a true fixed point requires). See examples/DeepEquilibrium/.
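
The forward fixed-point search can be sketched as a damped Picard iteration on a scalar. This is plain Python; the damping factor, tolerance, iteration cap and the contraction `f(u) = tanh(0.5·u)` are assumptions chosen so the iteration provably converges, and the phantom-gradient backward is not shown.

```python
import math


def deq_solve(f, x, damping=0.5, tol=1e-10, max_iters=200):
    """Iterate z := (1 - d)*z + d*f(z + x) from z = 0 until the update is tiny."""
    z = 0.0
    for it in range(1, max_iters + 1):
        z_next = (1.0 - damping) * z + damping * f(z + x)
        if abs(z_next - z) < tol:
            return z_next, it
        z = z_next
    return z, max_iters


f = lambda u: math.tanh(0.5 * u)  # bounded and contractive
z_star, iters = deq_solve(f, x=1.0)
residual = abs(z_star - f(z_star + 1.0))
print(z_star, iters, residual)
```

The number of iterations depends on the input and the tolerance, not on the parameter count, which is the "adaptive depth" property mentioned above.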

Composite helper: TNNet.AddPonderNetBlock(InputLayer, HiddenDim, MaxSteps, PriorLambda) wires a PonderNet block (Banino, Balaguer & Blundell 2021) — a learned probabilistic-halting adaptive-computation paradigm, distinct from the implicit fixed-point AddDeepEquilibriumBlock and from any fixed-depth stack. A weight-tied step f is applied up to MaxSteps times (h_n = h_{n-1} + f(h_{n-1}), parameter count independent of MaxSteps via TNNetDeepEquilibriumSharedConv weight sharing); a shared tiny halting head TNNetPonderHalting emits lambda_n = sigmoid(...) in (0,1) per step, giving the geometric halting distribution p_n = lambda_n * prod_{k<n}(1-lambda_k) (the last step forces lambda=1 so the p_n sum to 1). The block output is the smooth p_n-weighted sum of the per-step states — no hard argmax, so the block is differentiable end-to-end. The companion TNNetPonderCostLoss head adds KL(p || truncated-geometric(prior_lambda)), a "pay to ponder" regularizer that pulls the expected step count toward the prior's mean so the model only spends extra steps where the task forces it. Inference always unrolls MaxSteps (static shapes). See examples/PonderNet/.
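
The halting distribution `p_n = lambda_n * prod_{k<n}(1-lambda_k)` and the weighted output can be sketched as below. This is plain Python with hypothetical function names and made-up `lambda` values and per-step states.

```python
def halting_distribution(lambdas):
    """Geometric halting probabilities; the last step is forced to halt."""
    probs, still_running = [], 1.0
    last = len(lambdas) - 1
    for n, lam in enumerate(lambdas):
        if n == last:
            lam = 1.0  # forces the probabilities to sum to 1
        probs.append(lam * still_running)
        still_running *= 1.0 - lam
    return probs


def ponder_output(states, probs):
    """Smooth p_n-weighted sum of the per-step states."""
    depth = len(states[0])
    return [sum(p * h[d] for p, h in zip(probs, states)) for d in range(depth)]


p = halting_distribution([0.2, 0.5, 0.9])
y = ponder_output([[1.0], [2.0], [3.0]], p)
expected_steps = sum((n + 1) * pn for n, pn in enumerate(p))
print(p, y, expected_steps)
```

The expected step count computed at the end is the quantity the ponder-cost regularizer pulls toward the prior's mean.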

Composite helper: TNNet.AddFiLMConditioned(featLayer, condLayer) wires Feature-wise Linear Modulation in one call: condLayer -> TNNetFullConnectLinear(2*D) -> reshape(1,1,2*D) -> TNNetFiLM([featLayer, cond]), inferring D = featLayer.Output.Depth. It removes the manual Depth -> 2*Depth bookkeeping every FiLM call site repeats and mirrors the AddPreNormResidual/AddGatedResidual builder family. See examples/FiLMConditioning/.

Composite helper: TNNet.AddAffineBlock wires a learnable per-channel affine transform y[d] = gamma[d]*x[d] + beta[d] in one call (TNNetChannelMul -> TNNetChannelBias), separable from FullConnect. It starts as the exact identity (gamma=1, beta=0), so it can be inserted into a frozen network and fine-tuned cheaply (BitFit-style adaptation). See examples/AffineFineTune/.

Composite helper: TNNet.AddLoRAAdapter(FrozenLayer, Rank, Alpha=1.0) wires a LoRA low-rank adapter (Hu et al. 2021) in one call: a rank-r bypass down: TNNetPointwiseConvLinear(Rank) -> up: TNNetPointwiseConvLinear(d_out) is built from the input feeding FrozenLayer, scaled by Alpha/Rank, and added residually to FrozenLayer's output. The up projection is zero-initialised, so the adapter is the exact identity perturbation at step 0 — the frozen base's output is bit-for-bit unchanged on the first forward. Freeze the base (per-layer LearningRate := 0, BitFit-style) and train only the adapter for parameter-efficient fine-tuning. NOTE: the builder zeros the up weights at construction, so do not call NN.InitWeights() afterwards (it would re-randomise them). See examples/LoRAFineTune/.
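
In equation form, the adapter described above can be written as follows. The symbols W_0 (frozen base weights), A (down projection), B (up projection), r (Rank) and alpha (Alpha) are notation introduced here for illustration; they follow the usual LoRA convention rather than identifiers from this API.

```latex
y = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},
\quad B \in \mathbb{R}^{d_{\text{out}} \times r}

% Zero-initialised up projection: the adapter contributes nothing at step 0.
B = 0 \;\Longrightarrow\; y = W_0 x

% Trainable parameters of the adapter versus a full weight update:
r\,(d_{\text{in}} + d_{\text{out}}) \quad \text{versus} \quad d_{\text{in}}\, d_{\text{out}}
```

This also shows why re-randomising the up weights breaks the identity property: with B no longer zero, the second term is non-zero on the first forward pass.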

Embedding Layers

TNNetEmbedding converts input tokens (usually represented as integers) into dense vector representations (embedding vectors). TNNetTokenAndPositionalEmbedding extends TNNetEmbedding by adding positional information to the token embeddings, which transformer models need because they have no inherent notion of sequence order. Together, these layers let the network work with text data by converting tokens into vector representations that capture both semantic meaning and position.

To illustrate how these layers might be used in practice, let's consider a simple example. Suppose you're building a language model for text generation. You could use these layers:

    FNN.AddLayer([
      TNNetInput.Create(csContextLen, 1, 1),
      TNNetTokenAndPositionalEmbedding.Create(csModelVocabSize, csEmbedDim, {EncodeZero=}1, {ScaleEmbedding=}0.02, {ScalePositional=}0.01)
    ]);

    for I := 1 to 2 do FNN.AddTransformerBlockCAI({Heads=}8, {IntermediateDim=}2048, {NoForward=}true, {HasNorm=}false);

    FNN.AddLayer([
      TNNetPointwiseConvLinear.Create(csModelVocabSize),
      TNNetPointwiseSoftMax.Create({SkipBackpropDerivative=}1)
    ]);

The above example resembles a simplified version of models like GPT (Generative Pre-trained Transformer). It is designed for sequential tasks such as text generation. Token and positional embeddings followed by transformer blocks are a standard approach in modern NLP models. The final pointwise convolution and softmax layers produce a probability distribution over the vocabulary, which is common in language models. With only 2 transformer blocks, this is a lightweight model. The embedding dimension, number of heads and intermediate dimension should be chosen according to the task and the available compute.

Layer Name	Input/Output Dimensions	Description
TNNetEmbedding	Input: 1D integer tokens. Output: 2D (sequence_length x embedding_size).	Converts input tokens into dense vector representations. Parameters include vocabulary size, embedding size, scaling factor, and whether to encode zero. Allows for training of embedding weights through backpropagation.
TNNetTokenAndPositionalEmbedding	Input: 1D integer tokens. Output: 2D (sequence_length x embedding_size)	Extends TNNetEmbedding by adding positional information to token embeddings. This layer is crucial for transformer architectures.
TNNetSinusoidalTimeEmbedding	Input: 1x1x1 scalar timestep `t`. Output: 1x1xEmbeddingSize.	DDPM-style scalar-timestep encoder (Ho et al. 2020, https://arxiv.org/abs/2006.11239): `emb[i]=sin(t*freq[i])`, `emb[half+i]=cos(t*freq[i])` with `freq[i]=exp(-ln(MaxPeriod)*i/half)`. Distinct from `TNNetSinusoidalPositionalEmbedding`, which is the additive Vaswani encoding on the sequence (X) axis. No learnable parameters; backward is a no-op in v1. Created with `TNNetSinusoidalTimeEmbedding.Create(EmbeddingSize, MaxPeriod=10000)` (EmbeddingSize must be even).
TNNetFourierFeatures	Input: 1D/2D/3D coordinate vector (Depth = D_in). Output: 1x1x(2*M).	Fixed (non-trainable) random Fourier-feature coordinate embedding (Rahimi & Recht 2007; Tancik et al. 2020, https://arxiv.org/abs/2006.10739): maps `x` through a frozen Gaussian frequency matrix `B ~ N(0, sigma^2)` of shape `D_in x M` and outputs `[cos(2*pi*B^T x), sin(2*pi*B^T x)]` along depth. Lets a plain coordinate-MLP fit high-frequency detail (overcomes spectral bias); `sigma` sets the frequency bandwidth. `B` is sampled once from a seeded RNG and serialized, so save/load reproduces the exact mapping. No parameter gradient (only input gradient flows). Created with `TNNetFourierFeatures.Create(M, sigma, Seed=0)`.
TNNetRandomFourierFeatures	Input: 1D/2D/3D vector (Depth = D_in). Output: 1x1x(2*D).	RBF/Gaussian-kernel random features (Rahimi & Recht 2007, Random Features for Large-Scale Kernel Machines): maps `x` through a frozen Gaussian projection `W ~ N(0, 1/sigma^2)` of shape `D x D_in` and outputs `phi(x) = sqrt(1/D) * [cos(W x), sin(W x)]` along depth, so that `<phi(x),phi(y)> ≈ exp(-‖x-y‖²/(2·sigma²))` — i.e. a downstream linear layer over `phi(x)` approximates an RBF-kernel machine (the kernel-method family otherwise absent from this repo). Mathematically distinct from the learnable-FFT layers (`TNNetFourierMix`, `TNNetSpectralConv1D/2D`, `TNNetCirculantLinear` FFT path) — a fixed random Gaussian projection approximating a shift-invariant kernel, not a transform along the signal axis — and from `TNNetFourierFeatures` (Tancik-style coordinate embedding: `2π` factor, no `sqrt(1/D)` kernel scale, no trainable mode). `W` is sampled from a seeded RNG and serialized so save/load reproduces the exact map. Frozen by default (classic RFF — only the input gradient flows); pass `pTrainable<>0` for the deep-kernel-learning variant that also accumulates `dL/dW` (`sigma` stays fixed). Created with `TNNetRandomFourierFeatures.Create(D, sigma=1.0, Trainable=0, Seed=0)`. See `examples/RandomFourierFeatures/`.
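
To make the sinusoidal timestep encoding in the table concrete, here is a standalone Free Pascal sketch that evaluates the sine and cosine halves for one scalar timestep. It does not call TNNetSinusoidalTimeEmbedding; the embedding size, the timestep value and the variable names are assumptions for illustration.

```pascal
program TimeEmbeddingSketch;
{$mode objfpc}

const
  EmbeddingSize = 8;       // must be even
  MaxPeriod = 10000.0;

var
  Emb: array[0..EmbeddingSize - 1] of double;
  Half, I: integer;
  T, Freq: double;
begin
  T := 10.0;                               // example scalar timestep
  Half := EmbeddingSize div 2;
  for I := 0 to Half - 1 do
  begin
    // freq[i] = exp(-ln(MaxPeriod) * i / half)
    Freq := Exp(-Ln(MaxPeriod) * I / Half);
    Emb[I] := Sin(T * Freq);               // first half: sines
    Emb[Half + I] := Cos(T * Freq);        // second half: cosines
  end;
  for I := 0 to EmbeddingSize - 1 do
    WriteLn('emb[', I, '] = ', Emb[I]:0:6);
end.
```

The frequencies decay geometrically from 1 down towards 1/MaxPeriod, so low indices vary quickly with the timestep and high indices vary slowly.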

Opposing Operations

Layer Name	Input/Output Dimensions	Activation	Description
`TNNetDeLocalConnect`	1D, 2D, or 3D	tanh	Opposing operation to `TNNetLocalConnect`.
`TNNetDeLocalConnectReLU`	1D, 2D, or 3D	ReLU	Opposing operation to `TNNetLocalConnectReLU`.
`TNNetDeconvolution`	1D, 2D, or 3D	tanh	Opposing operation to `TNNetConvolution`, also known as transposed convolution.
`TNNetDeconvolutionReLU`	1D, 2D, or 3D	ReLU	Opposing operation to convolution with ReLU activation (`TNNetConvolutionReLU`).
`TNNetDeMaxPool`	1D, 2D, or 3D	None	Opposing operation to max pooling layer `TNNetMaxPool`.

Weight Initializers

This API implements popular weight initialization methods including He (Kaiming) and Glorot/Bengio (Xavier):

InitUniform(Value: TNeuralFloat = 1).
InitLeCunUniform(Value: TNeuralFloat = 1).
InitHeUniform(Value: TNeuralFloat = 1).
InitHeUniformDepthwise(Value: TNeuralFloat = 1).
InitHeGaussian(Value: TNeuralFloat = 0.5).
InitHeGaussianDepthwise(Value: TNeuralFloat = 0.5).
InitGlorotBengioUniform(Value: TNeuralFloat = 1).
InitSELU(Value: TNeuralFloat = 1).

Data Augmentation Methods Implemented at TVolume

procedure FlipX();
procedure FlipY();
procedure CopyCropping(Original: TVolume; StartX, StartY, pSizeX, pSizeY: integer);
procedure CopyResizing(Original: TVolume; NewSizeX, NewSizeY: integer);
procedure AddGaussianNoise(pMul: TNeuralFloat);
procedure AddSaltAndPepper(pNum: integer; pSalt: integer = 2; pPepper: integer = -2);
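
A minimal sketch of how these methods might be chained, assuming that TNNetVolume (which descends from TVolume) lives in the neuralvolume unit, that `TNNetVolume.Create(SizeX, SizeY, Depth)` is available, and that the copy methods resize the destination volume. The sizes and noise levels are arbitrary example values.

```pascal
program AugmentationSketch;
{$mode objfpc}{$H+}
uses
  neuralvolume;

var
  Original, Cropped, Augmented: TNNetVolume;
begin
  // In practice, Original would hold a loaded image; here it is left blank.
  Original := TNNetVolume.Create(32, 32, 3);
  Cropped := TNNetVolume.Create(32, 32, 3);
  Augmented := TNNetVolume.Create(32, 32, 3);
  try
    // Take a 24x24 crop starting at (4, 4) ...
    Cropped.CopyCropping(Original, {StartX=}4, {StartY=}4, {pSizeX=}24, {pSizeY=}24);
    // ... and scale it back to the network's 32x32 input size.
    Augmented.CopyResizing(Cropped, {NewSizeX=}32, {NewSizeY=}32);
    // Mirror along the X axis and add two kinds of noise.
    Augmented.FlipX();
    Augmented.AddGaussianNoise({pMul=}0.05);
    Augmented.AddSaltAndPepper({pNum=}10);
  finally
    Augmented.Free;
    Cropped.Free;
    Original.Free;
  end;
end.
```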

Closest Layer Types to Other APIs (work in progress)

NEURAL	Keras	PyTorch
`TNNetFullConnect`	`layers.Dense(activation='tanh')`	`nn.Linear nn.Tanh()`
`TNNetFullConnectReLU`	`layers.Dense(activation='relu')`	`nn.Linear nn.ReLU()`
`TNNetFullConnectLinear`	`layers.Dense(activation=None)`	`nn.Linear`
`TNNetFullConnectSigmoid`	`layers.Dense(activation='sigmoid')`	`nn.Linear nn.Sigmoid()`
`TNNetReLU`	`activations.relu`	`nn.ReLU()`
`TNNetLeakyReLU`	`activations.relu(alpha=0.01)`	`nn.LeakyReLU(0.01)`
`TNNetVeryLeakyReLU`	`activations.relu(alpha=1/3)`	`nn.LeakyReLU(1/3)`
`TNNetReLUSqrt`
`TNNetSELU`	`activations.selu`	`nn.SELU`
`TNNetSigmoid`	`activations.sigmoid`	`nn.Sigmoid`
`TNNetSoftMax`	`activations.softmax`	`nn.Softmax`
`TNNetHyperbolicTangent`	`activations.tanh`	`nn.Tanh`
`TNNetPower`
`TNNetAvgPool`	`layers.AveragePooling2D`	`nn.AvgPool2d`
`TNNetMaxPool`	`layers.MaxPool2D`	`nn.MaxPool2d`
`TNNetMaxPoolPortable`	`layers.MaxPool2D`	`nn.MaxPool2d`
`TNNetMinPool`
`TNNet.AddMinMaxPool`
`TNNet.AddAvgMaxPool`
`TNNetAvgChannel`	`layers.GlobalAveragePooling2D`	`nn.AvgPool2d`
`TNNetMaxChannel`	`layers.GlobalMaxPool2D`	`nn.MaxPool2d`
`TNNetGlobalSumPool`
`TNNetMinChannel`
`TNNet.AddMinMaxChannel`
`TNNet.AddAvgMaxChannel`	`cai.layers.GlobalAverageMaxPooling2D`
`TNNetConcat`	`layers.Concatenate(axis=1)`	`torch.cat`
`TNNetDeepConcat`	`layers.Concatenate(axis=3)`	`torch.cat`
`TNNetIdentity`		`nn.Identity`
`TNNetIdentityWithoutBackprop`
`TNNetReshape`	`layers.Reshape`	`torch.reshape`
`TNNetSplitChannels`	`cai.layers.CopyChannels`
`TNNetSplitChannelEvery`
`TNNetSum`	`layers.Add`	`torch.add`
`TNNetCellMulByCell`	`layers.Multiply`
`TNNetChannelMulByLayer`	`layers.Multiply`
`TNNetUpsample`	`tf.nn.depth_to_space`

Adding Layers

You can add layers one by one or you can add an array of layers in one go. The following example adds layers one by one:

NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));

The next example shows how to add an array of layers that is equivalent to the above example:

NN.AddLayer([
  TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1),
  TNNetMaxPool.Create(2)
]);

Multi-path Architectures Support

Since 2017, this API has supported multi-path architectures. You can create branches with the AddLayerAfter method. To concatenate (merge) paths, use either TNNetConcat or TNNetDeepConcat. The following example builds two branches and merges them:

// Creates The Neural Network
NN := TNNet.Create();
 
// This network splits into 2 paths and then is later concatenated
InputLayer := NN.AddLayer(TNNetInput.Create(32, 32, 3));
 
// First branch starting from InputLayer (5x5 features)
NN.AddLayerAfter(TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}5, {Padding=}2, {Stride=}1), InputLayer);
NN.AddLayer(TNNetMaxPool.Create(2));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));
EndOfFirstPath := NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
 
// Another branch starting from InputLayer (3x3 features)
NN.AddLayerAfter(TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}3, {Padding=}1, {Stride=}1), InputLayer);
NN.AddLayer(TNNetMaxPool.Create(2));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));
EndOfSecondPath := NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
 
// Concats both branches into one branch.
NN.AddLayer(TNNetDeepConcat.Create([EndOfFirstPath, EndOfSecondPath]));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
NN.AddLayer(TNNetLayerFullConnectReLU.Create(64));
NN.AddLayer(TNNetLayerFullConnectReLU.Create(NumClasses));

These source code examples show AddLayerAfter:

DenseNetBC L40
Identity Shortcut Connection - ResNet building block

You can find more about multi-path architectures at:

Dataset Support

These datasets can be easily loaded:

CIFAR-10

procedure CreateCifar10Volumes(out ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList);

Source code example: Simple CIFAR-10 Image Classifier

CIFAR-100

procedure CreateCifar100Volumes(out ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList);

Source code example: CAI Optimized DenseNet CIFAR-100 Image Classifier

MNIST and Fashion MNIST

procedure CreateMNISTVolumes(out ImgTrainingVolumes, ImgValidationVolumes,
  ImgTestVolumes: TNNetVolumeList;
  TrainFileName, TestFileName: string;
  Verbose:boolean = true;
  IsFashion:boolean = false);

Source code examples:

One Class per Folder with Image Classification

In the case that your dataset has one class per folder, you can call CreateVolumesFromImagesFromFolder for loading your data into RAM:

// change ProportionToLoad to a smaller number if you don't have enough RAM.
ProportionToLoad := 1;
WriteLn('Loading ', Round(ProportionToLoad*100), '% of the plant leaf disease dataset into memory.');
CreateVolumesFromImagesFromFolder
(
  ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes,
  {FolderName=}'plant', {pImageSubFolder=}'',
  {color_encoding=}csRGB{RGB},
  {TrainingProp=}0.9*ProportionToLoad,
  {ValidationProp=}0.05*ProportionToLoad,
  {TestProp=}0.05*ProportionToLoad,
  {NewSizeX=}128, {NewSizeY=}128
);

The example above loads the dataset with 90% used for training and 5% each for validation and testing. Images are resized to 128x128.

Source code examples:

Is your Dataset too Big for RAM? You should use TNeuralImageLoadingFit.

In the case that your image classification dataset is too big to be stored in RAM, you can follow this example:

    FTrainingFileNames, FValidationFileNames, FTestFileNames: TFileNameList;
...
    ProportionToLoad := 1;
    CreateFileNameListsFromImagesFromFolder(
      FTrainingFileNames, FValidationFileNames, FTestFileNames,
      {FolderName=}'places_folder/train', {pImageSubFolder=}'',
      {TrainingProp=}0.9*ProportionToLoad,
      {ValidationProp=}0.05*ProportionToLoad,
      {TestProp=}0.05*ProportionToLoad
    );

Then, you can call a fitting method made specific for this:

NeuralFit := TNeuralImageLoadingFit.Create;
...
NeuralFit.FitLoading({NeuralNetworkModel}NN, {ImageSizeX}256, {ImageSizeY}256, FTrainingFileNames, FValidationFileNames, FTestFileNames, {BatchSize}256, {Epochs}100);

TNeuralImageLoadingFit.FitLoading has been tested with Places365-Standard Small images 256x256 with easy directory structure. You can follow this example:

Simple Plant Leaf Disease Image Classifier with Few RAM

Loading and Saving Images with Volumes

When loading an image from a file, the easiest and fastest method is calling LoadImageFromFileIntoVolume(ImageFileName:string; V:TNNetVolume). When loading from a TFPMemoryImage, you can load with LoadImageIntoVolume(M: TFPMemoryImage; Vol:TNNetVolume). For saving an image, the fastest method is SaveImageFromVolumeIntoFile(V: TNNetVolume; ImageFileName: string).

Fitting your Neural Network

The easiest way to train your neural network is utilizing unit neuralfit.pas. Inside this unit, you’ll find the class TNeuralImageFit that is used by many examples.

Image Classification

TNeuralImageFit has been designed for image classification tasks and can be called as follows:

procedure Fit(pNN: TNNet;
  pImgVolumes, pImgValidationVolumes, pImgTestVolumes: TNNetVolumeList;
  pNumClasses, pBatchSize, Epochs: integer);

Each volume's Tag property should contain the corresponding class. TNeuralImageFit internally implements data augmentation techniques: flipping, making gray, cropping and resizing. These techniques can be controlled with:

property HasImgCrop: boolean read FHasImgCrop write FHasImgCrop;
property HasMakeGray: boolean read FHasMakeGray write FHasMakeGray;
property HasFlipX: boolean read FHasFlipX write FHasFlipX;
property HasFlipY: boolean read FHasFlipY write FHasFlipY;
property MaxCropSize: integer read FMaxCropSize write FMaxCropSize;

Once you have a trained neural network, you can use an advanced classification procedure that will average the classification probability of the input image with its flipped and cropped versions. This process frequently gives a higher classification accuracy at the expense of internally running the very same neural network a number of times. This is how you can classify images:

procedure ClassifyImage(pNN: TNNet; pImgInput, pOutput: TNNetVolume);

In the case that you would like to look into TNeuralImageFit in more detail, the Simple CIFAR-10 Image Classifier example is a good starting point.
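
Putting the pieces of this section together, a training run could look like the sketch below. It assumes the CIFAR-10 data files are available where CreateCifar10Volumes expects them; the tiny architecture and the hyperparameter values are placeholders, not recommended settings.

```pascal
program ImageFitSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  neuralnetwork, neuralvolume, neuraldatasets, neuralfit;

var
  NN: TNNet;
  NeuralFit: TNeuralImageFit;
  ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList;
begin
  // Loads CIFAR-10; each volume carries its class in the Tag property.
  CreateCifar10Volumes(ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes);

  // Placeholder architecture: one convolution, pooling and a softmax head.
  NN := TNNet.Create();
  NN.AddLayer([
    TNNetInput.Create(32, 32, 3),
    TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1),
    TNNetMaxPool.Create(2),
    TNNetFullConnectLinear.Create(10),
    TNNetSoftMax.Create()
  ]);

  NeuralFit := TNeuralImageFit.Create;
  NeuralFit.FileNameBase := 'ImageFitSketch';
  NeuralFit.InitialLearningRate := 0.001;
  NeuralFit.LearningRateDecay := 0.01;
  NeuralFit.Inertia := 0.9;
  NeuralFit.L2Decay := 0.00001;
  // Data augmentation switches described above.
  NeuralFit.HasFlipX := true;
  NeuralFit.HasFlipY := false;
  NeuralFit.HasMakeGray := false;
  NeuralFit.MaxCropSize := 4;
  NeuralFit.Fit(NN, ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes,
    {pNumClasses=}10, {pBatchSize=}64, {Epochs=}10);

  NeuralFit.Free;
  NN.Free;
  ImgTestVolumes.Free;
  ImgValidationVolumes.Free;
  ImgTrainingVolumes.Free;
end.
```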

Training with Volume Pair Lists - TNeuralFit

In the case that your training, validation and testing data can be defined as volume pairs from input volume to output volume, the easiest way to train your neural network will be calling TNeuralFit. This class has the following fitting method:

procedure Fit(pNN: TNNet;
  pTrainingVolumes, pValidationVolumes, pTestVolumes: TNNetVolumePairList;
  pBatchSize, Epochs: integer);

Both the "AND, OR and XOR with neuralfit unit" example and the hypotenuse function example load volume pair lists for training.
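
As a sketch of the volume pair workflow, the program below trains a small network on the hypotenuse function. The helper CreateHypotenusePairList, the layer sizes and the hyperparameters are illustrative assumptions. The accuracy printed during training depends on how hits are counted, which this sketch leaves at its default.

```pascal
program VolumePairFitSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  neuralnetwork, neuralvolume, neuralfit;

// Builds MaxCnt (input, target) pairs: (x, y) -> sqrt(x*x + y*y).
function CreateHypotenusePairList(MaxCnt: integer): TNNetVolumePairList;
var
  Cnt: integer;
  LocalX, LocalY, Hypotenuse: TNeuralFloat;
begin
  Result := TNNetVolumePairList.Create();
  for Cnt := 1 to MaxCnt do
  begin
    LocalX := Random(100);
    LocalY := Random(100);
    Hypotenuse := Sqrt(LocalX * LocalX + LocalY * LocalY);
    Result.Add(TNNetVolumePair.Create(
      TNNetVolume.Create([LocalX, LocalY]),
      TNNetVolume.Create([Hypotenuse])));
  end;
end;

var
  NN: TNNet;
  NFit: TNeuralFit;
  TrainingPairs, ValidationPairs, TestPairs: TNNetVolumePairList;
begin
  NN := TNNet.Create();
  NN.AddLayer([
    TNNetInput.Create(2),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectLinear.Create(1)
  ]);

  TrainingPairs := CreateHypotenusePairList(10000);
  ValidationPairs := CreateHypotenusePairList(1000);
  TestPairs := CreateHypotenusePairList(1000);

  NFit := TNeuralFit.Create();
  NFit.InitialLearningRate := 0.00001;
  NFit.LearningRateDecay := 0;
  NFit.L2Decay := 0;
  NFit.Fit(NN, TrainingPairs, ValidationPairs, TestPairs, {pBatchSize=}32, {Epochs=}50);

  NFit.Free;
  TestPairs.Free;
  ValidationPairs.Free;
  TrainingPairs.Free;
  NN.Free;
end.
```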

Training with Volume Pairs - TNeuralDataLoadingFit

The TNeuralFit implementation has a limitation: your dataset needs to be placed into RAM. In the case that your dataset is too large for RAM, you can call TNeuralDataLoadingFit:

TNNetGetPairFn = function(Idx: integer; ThreadId: integer): TNNetVolumePair of object;
TNNetGet2VolumesProc = procedure(Idx: integer; ThreadId: integer; pInput, pOutput: TNNetVolume) of object;
  
TNeuralDataLoadingFit = class(TNeuralFitBase)
...
    procedure FitLoading(pNN: TNNet;
      TrainingCnt, ValidationCnt, TestCnt, pBatchSize, Epochs: integer;
      pGetTrainingPair, pGetValidationPair, pGetTestPair: TNNetGetPairFn); overload;
    procedure FitLoading(pNN: TNNet;
      TrainingCnt, ValidationCnt, TestCnt, pBatchSize, Epochs: integer;
      pGetTrainingProc, pGetValidationProc, pGetTestProc: TNNetGet2VolumesProc); overload;

The Hypotenuse with FitLoading example uses TNeuralDataLoadingFit so it creates training pairs on the fly.
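
A minimal sketch of the second FitLoading overload follows. Because TNNetGet2VolumesProc is declared "of object", the callback must be a method; the class THypotenuseSource, the use of ReSize and FData to fill the volumes, and the hyperparameters are assumptions for illustration.

```pascal
program FitLoadingSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  neuralnetwork, neuralvolume, neuralfit;

type
  // Hypothetical data source that creates one training pair per call.
  THypotenuseSource = class
  public
    procedure GetPair(Idx: integer; ThreadId: integer; pInput, pOutput: TNNetVolume);
  end;

procedure THypotenuseSource.GetPair(Idx: integer; ThreadId: integer;
  pInput, pOutput: TNNetVolume);
var
  LocalX, LocalY: TNeuralFloat;
begin
  LocalX := Random(100);
  LocalY := Random(100);
  // Input: the two sides. Output: the hypotenuse.
  pInput.ReSize(2, 1, 1);
  pInput.FData[0] := LocalX;
  pInput.FData[1] := LocalY;
  pOutput.ReSize(1, 1, 1);
  pOutput.FData[0] := Sqrt(LocalX * LocalX + LocalY * LocalY);
end;

var
  NN: TNNet;
  NFit: TNeuralDataLoadingFit;
  Source: THypotenuseSource;
begin
  NN := TNNet.Create();
  NN.AddLayer([
    TNNetInput.Create(2),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectLinear.Create(1)
  ]);

  Source := THypotenuseSource.Create;
  NFit := TNeuralDataLoadingFit.Create();
  NFit.InitialLearningRate := 0.00001;
  NFit.LearningRateDecay := 0;
  NFit.L2Decay := 0;
  // The same generator serves training, validation and test pairs here.
  NFit.FitLoading(NN, {TrainingCnt=}10000, {ValidationCnt=}1000, {TestCnt=}1000,
    {pBatchSize=}32, {Epochs=}50,
    @Source.GetPair, @Source.GetPair, @Source.GetPair);

  NFit.Free;
  Source.Free;
  NN.Free;
end.
```

Since pairs are generated on demand, memory use stays flat no matter how large TrainingCnt is.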

TNeuralFitBase

TNeuralImageFit and TNeuralDataLoadingFit both descend from TNeuralFitBase. From TNeuralFitBase, you can define training properties:

property Inertia: single read FInertia write FInertia;
property InitialEpoch: integer read FInitialEpoch write FInitialEpoch;
property InitialLearningRate: single read FInitialLearningRate write FInitialLearningRate;
property LearningRateDecay: single read FLearningRateDecay write FLearningRateDecay;
property CyclicalLearningRateLen: integer read FCyclicalLearningRateLen write FCyclicalLearningRateLen;
property Momentum: single read FInertia write FInertia;
property L2Decay: single read FL2Decay write FL2Decay;
property FileNameBase: string read FFileNameBase write FFileNameBase;

You can also collect current statistics:

property CurrentEpoch: integer read FCurrentEpoch;
property CurrentStep: integer read FCurrentStep;
property CurrentLearningRate: single read FCurrentLearningRate;
property TestAccuracy: TNeuralFloat read FTestAccuracy;
property TrainingAccuracy: TNeuralFloat read FTrainingAccuracy;
property Running: boolean read FRunning;

Some events are available:

property OnStart: TNotifyEvent read FOnStart write FOnStart;
property OnAfterStep: TNotifyEvent read FOnAfterStep write FOnAfterStep;
property OnAfterEpoch: TNotifyEvent read FOnAfterEpoch write FOnAfterEpoch;

You can define your own learning rate schedule:

property CustomLearningRateScheduleFn: TCustomLearningRateScheduleFn read FCustomLearningRateScheduleFn write FCustomLearningRateScheduleFn;
property CustomLearningRateScheduleObjFn: TCustomLearningRateScheduleObjFn read FCustomLearningRateScheduleObjFn write FCustomLearningRateScheduleObjFn;
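
A step schedule could be plugged in as sketched below. This assumes that TCustomLearningRateScheduleFn is a plain function that receives the epoch number and returns the learning rate; check the declaration in neuralfit.pas before relying on it. The function name StepSchedule and the thresholds are hypothetical.

```pascal
program ScheduleSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  neuralfit;

// Assumed callback shape: takes the epoch, returns the learning rate.
function StepSchedule(Epoch: integer): single;
begin
  if Epoch < 20 then Result := 0.001
  else if Epoch < 40 then Result := 0.0001
  else Result := 0.00001;
end;

var
  NeuralFit: TNeuralImageFit;
begin
  NeuralFit := TNeuralImageFit.Create;
  try
    NeuralFit.CustomLearningRateScheduleFn := @StepSchedule;
    // ... call NeuralFit.Fit(...) here as usual ...
  finally
    NeuralFit.Free;
  end;
end.
```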

Got Too Many Console Messages?

TNeuralFitBase descends from TMObject, which allows you to provide your own message handling:

property MessageProc: TGetStrProc read FMessageProc write FMessageProc;
property ErrorProc: TGetStrProc read FErrorProc write FErrorProc;

In your own code, you could do something like this:

MyFit.MessageProc := {$IFDEF FPC}@{$ENDIF}Self.MessageProc;
MyFit.ErrorProc := {$IFDEF FPC}@{$ENDIF}Self.ErrorProc;
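
The handlers themselves are ordinary methods matching TGetStrProc from the Classes unit (a procedure of object taking a const string). The sketch below uses a hypothetical class TMyTrainer that timestamps regular messages and sends errors to standard error.

```pascal
program MessageProcSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  Classes, SysUtils, neuralfit;

type
  // Hypothetical owner of the message handlers.
  TMyTrainer = class
  public
    procedure MessageProc(const S: string);
    procedure ErrorProc(const S: string);
  end;

procedure TMyTrainer.MessageProc(const S: string);
begin
  // Prefix regular messages with the current time.
  WriteLn(FormatDateTime('hh:nn:ss', Now), ' ', S);
end;

procedure TMyTrainer.ErrorProc(const S: string);
begin
  // Route errors to standard error.
  WriteLn(StdErr, 'ERROR: ', S);
end;

var
  Trainer: TMyTrainer;
  MyFit: TNeuralImageFit;
begin
  Trainer := TMyTrainer.Create;
  MyFit := TNeuralImageFit.Create;
  try
    MyFit.MessageProc := @Trainer.MessageProc;
    MyFit.ErrorProc := @Trainer.ErrorProc;
    // ... call MyFit.Fit(...) here as usual ...
  finally
    MyFit.Free;
    Trainer.Free;
  end;
end.
```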

If you don’t need any message at all, you can hide messages by calling:

procedure HideMessages();

You can also disable fitting verbosity with:

property Verbose: boolean read FVerbose write FVerbose;

Your code will look like this:

NeuralFit := TNeuralImageFit.Create;
...
NeuralFit.Verbose := false;
NeuralFit.HideMessages();

Parallel Computing - The neuralthread.pas

This API provides easy-to-use, lightweight and platform-independent parallel processing methods.

As an example, assuming that you need to run a procedure 10 times in parallel, you can create 10 thread workers as follows:

FProcs := TNeuralThreadList.Create( 10 );

This is the procedure that we intend to run in parallel:

procedure MyClassName.RunNNThread(index, threadnum: integer);
begin
  WriteLn('This is thread ',index,' out of ',threadnum,' threads.');
end;

Then, to run the procedure RunNNThread passed as parameter 10 times in parallel, do this:

FProcs.StartProc({$IFDEF FPC}@RunNNThread{$ELSE}RunNNThread{$ENDIF});

You can control the blocking mode (waiting for threads to finish before the program continues) as per the declaration:

procedure StartProc(pProc: TNeuralProc; pBlock: boolean = true);

Or, if you prefer, you can specifically say when to wait for threads to finish as per this example:

FProcs.StartProc({$IFDEF FPC}@RunNNThread{$ELSE}RunNNThread{$ENDIF}, false);
// insert your code here
FProcs.WaitForProc(); // waits until all threads are finished.

When you are done, you should call:

FProcs.Free;
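
The fragments above can be assembled into one small program. This is a sketch that assumes TNeuralThreadList is exported by the neuralthread unit and that the worker procedure must be a method; the class name TWorker is hypothetical.

```pascal
program ThreadSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  neuralthread;

type
  // Hypothetical class that owns the procedure run by every thread.
  TWorker = class
  public
    procedure RunNNThread(index, threadnum: integer);
  end;

procedure TWorker.RunNNThread(index, threadnum: integer);
begin
  WriteLn('This is thread ', index, ' out of ', threadnum, ' threads.');
end;

var
  Worker: TWorker;
  FProcs: TNeuralThreadList;
begin
  Worker := TWorker.Create;
  FProcs := TNeuralThreadList.Create(10);   // 10 thread workers
  try
    // Non-blocking start: the main thread continues immediately.
    FProcs.StartProc(@Worker.RunNNThread, false);
    // ... other work on the main thread ...
    FProcs.WaitForProc();                   // waits until all threads are finished
  finally
    FProcs.Free;
    Worker.Free;
  end;
end.
```

The output lines may appear in any order, since the workers run concurrently.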

Introspection & diagnostics

Beyond the runnable examples above, TNNet exposes a family of in-process introspection and diagnostic methods (most are demonstrated by the linked examples). They are grouped here by what they inspect; the linked example carries the full description, sample output and caveats.

Architecture & cost

TNNet.PrintSummary / SummaryString — Keras-style table of per-layer index, class, output shape (X,Y,D), param and neuron counts, ending with totals (SummaryString returns it as a string instead of writing to stdout). Used throughout the examples (e.g. ConfusionMatrixReport, GradientNormReport, PerplexityEval).
TNNet.ToGraphvizDot — emits a Graphviz .dot of the layer DAG (one node per layer, edges following the real graph incl. multi-input TNNetSum / TNNetDeepConcat); render with dot -Tpng net.dot -o net.png. → example
TNNet.DiffArchitecture(OtherNet) / DiffArchitectureFromString(s) — unified-diff-style report of architectural differences between two networks (LCS-aligned so single inserts/removes don't cascade). → example
TNNet.ReceptiveFieldReport(NN) — analytically propagates the receptive-field recurrence through the spatial layers (size, jump, input coverage, global-mixing cut point); no data needed. → example
TNNet.CountFLOPsPerLayer(NN) — per-layer forward-pass FLOP estimate and each layer's share, flagging layer classes the estimator doesn't model. → example
TNNet.LayerTimingReport(NN, Sample, Iterations) — per-layer forward-pass wall-clock cost: mean microseconds/forward and percent of total (ASCII #-bar) measured over Iterations forward passes.

Weights-only (no forward pass)

TNNet.WeightHistogramReport(NN) — per-trainable-layer weight statistics and ASCII bar histograms. → example
TNNet.WeightSpectrumReport(NN) — top singular value per layer (power iteration via the reusable TNNet.EstimateSpectralNorm helper); flags rank-1 collapse / high spectral norm. → example
TNNet.EstimateSpectralRadius(Layer, Iters) — power-iteration estimate of a layer weight matrix's spectral radius ρ = |λ|_max (the largest-magnitude eigenvalue, via the v := W·v / ‖W·v‖ recurrence — no Wᵀ step), the eigenvalue sibling of the singular-value EstimateSpectralNorm. For a non-symmetric matrix ρ ≤ σ_1, so radius is the tighter, correct stability target for recurrent / echo-state reservoirs (scale to ρ_target < 1 directly) while σ_1 (the norm) is the right Lipschitz/clipping bound. Requires a square matrix (returns 0 otherwise). → used in example
TNNet.WeightSpectralTailReport(NN) — label-free HT-SR quality metric: power-law tail exponent alpha per layer plus a network-average weighted-alpha (the WeightWatcher metric). → example
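
To illustrate the power-iteration recurrence mentioned for EstimateSpectralRadius, here is a standalone Free Pascal sketch on a small non-symmetric matrix. It does not call the library; the matrix, the iteration count and the identifiers are assumptions for illustration, and this simple form only converges when the dominant eigenvalue is real and unique in magnitude.

```pascal
program SpectralRadiusSketch;
{$mode objfpc}

const
  N = 2;
  Iters = 50;
  // Hypothetical non-symmetric matrix with eigenvalues 2 and 1.
  W: array[0..N - 1, 0..N - 1] of double = ((2.0, 1.0), (0.0, 1.0));

var
  V, WV: array[0..N - 1] of double;
  I, J, K: integer;
  Norm: double;
begin
  V[0] := 1.0;
  V[1] := 1.0;
  Norm := 0.0;
  for K := 1 to Iters do
  begin
    // WV := W * V
    for I := 0 to N - 1 do
    begin
      WV[I] := 0.0;
      for J := 0 to N - 1 do
        WV[I] := WV[I] + W[I, J] * V[J];
    end;
    // Norm := ||W * V||
    Norm := 0.0;
    for I := 0 to N - 1 do
      Norm := Norm + WV[I] * WV[I];
    Norm := Sqrt(Norm);
    // V := W*V / ||W*V||  (no transpose step)
    for I := 0 to N - 1 do
      V[I] := WV[I] / Norm;
  end;
  // Once V has converged, ||W * V|| approximates the spectral radius.
  WriteLn('estimated spectral radius: ', Norm:0:6);
end.
```

For this matrix the estimate converges to 2, while the largest singular value is sqrt(3 + sqrt(5)), roughly 2.29, which is a concrete instance of the radius being no larger than the spectral norm.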

Activations & representation (forward probe batch)

TNNet.ActivationStatsReport(NN, Samples) — per-layer activation distribution statistics; flags near-collapsed / saturating layers. → example
TNNet.DeadNeuronReport(NN, Samples) — dead-unit counts across ReLU-family layers over a probe batch. → example
TNNet.NeuronCorrelationReport(NN, Samples) — intra-layer neuron redundancy: a |rho| histogram, top correlated pairs, and an effective-neuron count. → example
TNNet.IntrinsicDimensionReport(NN, Probes) — per-layer effective dimensionality: PCA participation ratio + the nonlinear TwoNN manifold estimate, side by side. → example
TNNet.RepresentationSimilarityReport(NN, Probes [, OtherNet]) — linear-CKA similarity between every pair of layer activations (rotation/scale-invariant; optional cross-net). → example
TNNet.RoutingEntropyReport(NN, Probe) — batch-level routing diagnostic for a TNNetSoftDecisionTree layer: per-leaf occupancy (is the tree using all 2^D leaves or collapsing onto a few?), average per-gate binary entropy (crisp vs mushy splits), and average per-sample effective-leaf-count exp(−Σ P_l·ln P_l) (decisive vs diffuse routing). The statistical companion to the example's single-point decision path. → example

Classifier evaluation & calibration (forward, labels)

TNNet.ConfusionMatrixReport(NN, Samples, NumClasses) — confusion matrix + precision/recall/F1 + most-confused pairs + per-class hard-example indices. → example
TNNet.TopLogitMarginReport(NN, Samples, NumClasses) — per-sample top1 − top2 logit margin, per-class stats, and a lowest-margin "hard examples" pool. → example
TNNet.PerplexityReport(NN, Tokens, ContextLen) — cross-entropy, perplexity, bits-per-character, top-k accuracy and worst-K positions for a sequence head. → example
TNNet.TTAReport(NN, Probes, Labels) — test-time-augmentation accuracy over a fixed transform menu vs the clean baseline, with a helps/neutral/hurts verdict. → example
TNNet.DecisionBoundaryReport(NN, Probes) — ASCII argmax map, confidence overlay and boundary-length estimate for a 2-input classifier head. → example
neuralcalibration unit (separate from the *Report family) — CalibrationReport / ComputeCalibration (ECE/MCE, Brier, reliability diagram) and FitTemperature (temperature scaling, never mutating the backbone). → example

Gradient & curvature geometry (forward + backward, frozen net)

TNNet.GradientNormReport(NN, Input, Target) — per-layer ‖dL/dx‖ and ‖dL/dW‖ with vanishing/exploding flags. → example
TNNet.LossLandscapeProbe(NN, Samples, K, R) — loss along a filter-normalised random direction; sharpness scalar + loss-doubling radius. → example
TNNet.GradientConflictReport(NN, Samples [, UseTrueLabel, LayerIdx]) — pairwise per-sample gradient cosines: conflict fraction + per-class-pair mean-cosine matrix. → example
TNNet.GradientNoiseScaleReport(NN, Samples [, UseTrueLabel, LayerIdx]) — gradient signal-to-noise ratio and the simple noise scale B_simple (the critical batch size). → example
TNNet.NeuralTangentKernelReport(NN, Samples [, TargetClass]) — empirical NTK Gram, its eigenspectrum, condition number and kernel-target alignment. → example
TNNet.HessianCurvatureReport(NN, Samples) — loss-surface sharpness: Hessian trace + top eigenvalue via finite-difference Hessian-vector products. → example
TNNet.LocalLearningCoefficientReport(NN, Samples [, ChainLen, Eps, Gamma]) — Local Learning Coefficient (LLC, an empirical RLCT from Singular Learning Theory): effective-dimensionality / degeneracy of the minimum, estimated from a tempered anchored-SGLD chain (LLC_hat = n·β·(mean_chain[L] − L(w*))). Counts flat, degenerate directions the 2nd-order Hessian top eigenvalue is blind to; non-destructive (weights snapshotted and restored). → example
TNNet.EnableInputGradient — helper that resizes the input layer's error tensors so a backward pass can deposit d(output)/d(input) on Layers[0] (off by default; needed by the saliency, adversarial and effective-RF reports).

Robustness & uncertainty

TNNet.AdversarialRobustnessReport(NN, Samples, Labels, EpsList) — FGSM accuracy-vs-eps degradation curve, per-sample critical-eps histogram, robust/fragile verdict. → example
TNNet.MCDropoutUncertaintyReport(NN, Probes [, Labels]) — Monte-Carlo-Dropout total / aleatoric / epistemic (BALD) uncertainty, keeping dropout active at inference. → example
TNNet.EquivarianceReport(NN, Probes) — output invariance error under a fixed flip / reverse / roll transform menu, with an invariant/sensitive verdict. → example
TNNet.EffectiveReceptiveFieldReport(NN, Probes) — empirical (gradient-measured) receptive field vs the analytical one — what a unit actually weights. → example

Interpretability & attribution

TNNet.SaliencyReport(NN, Probe) — input-gradient / SmoothGrad / Integrated-Gradients heatmaps for the predicted class (with the IG completeness check). → example
TNNet.GradCAMReport(NN, Probe [, ConvLayerIdx, ForcedClass]) — Grad-CAM (Selvaraju et al. 2017) coarse, class-discriminative conv-feature heatmap for the predicted class, nearest-upsampled to the input plane (complements the fine input-pixel SaliencyReport). → example
TNNet.LRPReport(NN, Probe [, ForcedClass, TopK, Eps]) — Layer-wise Relevance Propagation (Bach et al. 2015): a conservation method (not a gradient one) that back-distributes the explained logit's relevance via the epsilon-rule, printing the per-layer conservation residual, the top-k most-relevant input positions and an input-relevance heatmap (skips attention/norm layers honestly). → example
TNNet.AttentionEntropyReport(NN, Probes) — per-row attention entropy with dead/spike head flags for every TNNetScaledDotProductAttention layer. → example
TNNet.ActivationPatchingReport(NN, CleanInput, CorruptInput [, TargetIdx]) — causal trace: which layer's activations carry the information that decides the prediction. → example
TNNet.LogitLensReport(NN, pInput [, HeadStartIdx]) — re-applies the net's own trained head at each depth (zero new params) to see when the prediction crystallises. → example
TNNet.TunedLensReport(NN, pInput [, HeadStartIdx, TrainIters, LearningRate]) — the learned sibling of the logit lens (Belrose et al. 2023): fits one per-layer affine translator (frozen trunk + head) on the unlabelled probe by KL-to-self, then prints the tuned lens' KL-to-final and entropy side by side with the raw logit-lens columns — the tuned curve commits earlier and tracks the final answer more faithfully (lower KL-to-final). → example
TNNet.PredictionDepthReport(NN, Support, SupportLabels, Queries [, QueryLabels]) — per-example difficulty via the depth where a k-NN vote locks onto the final answer. → example
TNNet.LayerSensitivityReport(NN, Samples [, Targets]) — output/loss delta from small multiplicative per-layer weight perturbations, with a fragility verdict. → example
TNNet.MagnitudePruningReport(NN, Samples [, Labels, Tolerance, PerLayer]) — no-retrain accuracy-vs-sparsity curve and the prunability knee (global or per-layer). → example

Two-net comparison

TNNet.ModeConnectivityReport(NN, SnapshotB, Samples) — loss barrier along the linear interpolation between two trained nets ("same basin or separated?"). → example
TNNet.PermutationAlignReport(NN, SnapshotB, Samples [, ScoreMode, K]) — "Git Re-Basin": loss barrier before vs after quotienting out neuron-permutation symmetry. → example
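
The idea of a loss barrier along a linear interpolation can be sketched on a toy problem. This is a self-contained Free Pascal illustration, not the ModeConnectivityReport implementation: ToyLoss stands in for the loss of a real network on Samples, all identifiers are hypothetical, and the barrier uses one common definition (the largest excess of the path loss over the interpolated endpoint losses).

```pascal
program LossBarrierSketch;
{$mode objfpc}{$H+}

// Illustrative only: a toy two-parameter loss stands in for a trained
// network's loss on Samples. All names here are hypothetical.

type
  TWeights = array[0..1] of Double;

// Toy loss with two minima at (-1, 0) and (1, 0) and a bump between them.
function ToyLoss(const W: TWeights): Double;
begin
  Result := Sqr(Sqr(W[0]) - 1.0) + Sqr(W[1]);
end;

// Largest excess of the loss on the straight path from A to B over the
// linear interpolation of the two endpoint losses.
function LossBarrier(const A, B: TWeights; Steps: Integer): Double;
var
  s, i: Integer;
  Alpha, Excess: Double;
  W: TWeights;
begin
  Result := 0.0;
  for s := 0 to Steps do
  begin
    Alpha := s / Steps;
    for i := 0 to 1 do
      W[i] := (1.0 - Alpha) * A[i] + Alpha * B[i];
    Excess := ToyLoss(W) - ((1.0 - Alpha) * ToyLoss(A) + Alpha * ToyLoss(B));
    if Excess > Result then Result := Excess;
  end;
end;

const
  NetA: TWeights = (-1.0, 0.0);
  NetB: TWeights = (1.0, 0.0);
begin
  WriteLn('barrier: ', LossBarrier(NetA, NetB, 10):0:4);
end.
```

In this toy case both endpoints have zero loss and the midpoint has a loss of 1, so the two solutions sit in separate basins. A barrier close to zero would instead suggest that the two solutions share a basin.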

NLP

This NLP source code example shows a small "hello world" neural network trained on the Tiny Stories dataset. A more complex NLP example implementing the GPT-3 Small architecture is also available.

In short, this API supports:

Samplers: TNNetSamplerGreedy, TNNetSamplerTopK and TNNetSamplerTopP.
A logit processor for repetition control: TNNetTokenHistoryPenalty — a stateful pre-sampler that reshapes the next-token logits in place using generation history, with three standard knobs (repetition penalty in the sign-correct CTRL form, frequency penalty, and presence penalty). Use it as Penalty.Apply(Logits); tok := Sampler.GetToken(Logits); Penalty.RegisterToken(tok);.
A tokenizer: TNeuralTokenizer.
A transformer decoder: AddTransformerBlockCAI.
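
The penalize, sample, register loop described above can be sketched in plain Free Pascal without the library. Everything below is an assumption made for illustration: ApplyHistoryPenalty, ArgMax and the parameter names are hypothetical and are not the TNNetTokenHistoryPenalty or sampler API. The sketch only shows the arithmetic usually meant by these three knobs, with a greedy pick standing in for the sampler.

```pascal
program HistoryPenaltySketch;
{$mode objfpc}{$H+}

// Illustrative only: these names are hypothetical and are not the
// TNNetTokenHistoryPenalty API. Parameter names are assumptions.

type
  TLogits = array of Single;
  TCounts = array of Integer;

// Reshapes Logits in place using how often each token was already generated.
procedure ApplyHistoryPenalty(var Logits: TLogits; const Counts: TCounts;
  RepetitionPenalty, FrequencyPenalty, PresencePenalty: Single);
var
  i: Integer;
begin
  for i := 0 to High(Logits) do
    if Counts[i] > 0 then
    begin
      // Sign-correct CTRL form: always push a seen token's logit down.
      if Logits[i] > 0 then
        Logits[i] := Logits[i] / RepetitionPenalty
      else
        Logits[i] := Logits[i] * RepetitionPenalty;
      // Frequency penalty grows with the count; presence penalty is flat.
      Logits[i] := Logits[i] - FrequencyPenalty * Counts[i] - PresencePenalty;
    end;
end;

// Greedy pick: index of the largest logit.
function ArgMax(const Logits: TLogits): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to High(Logits) do
    if Logits[i] > Logits[Result] then Result := i;
end;

var
  Logits: TLogits;
  Counts: TCounts;
  Tok: Integer;
begin
  SetLength(Logits, 4);
  SetLength(Counts, 4); // new dynamic array elements start at zero
  Logits[0] := 2.0; Logits[1] := 1.5; Logits[2] := -1.0; Logits[3] := 0.5;
  Counts[0] := 2;       // token 0 was already generated twice
  Counts[2] := 1;       // token 2 was generated once
  ApplyHistoryPenalty(Logits, Counts, 2.0, 0.1, 0.2);
  Tok := ArgMax(Logits);
  Inc(Counts[Tok]);     // register the chosen token in the history
  WriteLn('next token: ', Tok);
end.
```

Without the penalty, token 0 would win with a logit of 2.0. After the penalty its logit falls below that of token 1, so the greedy pick moves to a token that has not been generated yet. Dividing positive logits and multiplying negative ones keeps the penalty pointing the same way for both signs.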

Publications from the Author

To learn more about what CAI's author is working on, see the publications below.

Optimizing the first layers of a convolutional neural network:

Optimizing deep layers of a convolutional neural network:

Optimizing LLMs:

Saving 77% of the Parameters in Large Language Models Technical Report

Publications in Portuguese:

Contributing

Pull requests are welcome. Getting a pull request accepted might be hard.

Paid Support

If you need help with your own A.I. project (Pascal, Python, or PHP), please feel free to contact the author of this API.

Citing this API

You can cite this API in BibTeX format with:

@software{cai_neural_api_2021_5810077,
  author       = {Joao Paulo Schwarz Schuler},
  title        = {CAI NEURAL API},
  month        = dec,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v1.0.6},
  doi          = {10.5281/zenodo.5810077},
  url          = {https://doi.org/10.5281/zenodo.5810077}
}
