This project is a subproject of a bigger and older project called CAI and is a sister project to the Keras-based K-CAI NEURAL API. You can find trained neural network models in the pre-trained-neural-api-networks repository.
| Basics of Neural Networks in Pascal - Loading and Saving | Neural Networks for Absolute Beginners! Learning a Simple Function | Coding a Neural Network in Pascal that Learns to Calculate the Hypotenuse |
- The Pascal language is easy to learn and lets developers write readable, understandable source code.
- You can produce very fast native code while keeping the source readable.
- This API can outperform some major APIs in some architectures.
You'll need the Lazarus development environment. If you have an OpenCL capable device, you'll need its OpenCL drivers. Many examples use the CIFAR-10 dataset. You'll also find examples for the CIFAR-100, MNIST, Fashion MNIST and the Places365-Standard Small images 256x256 dataset.
This project is Lazarus based. That said, as of release v2.0.0, a number of units do compile with Delphi and you can create and run neural networks with Delphi. You'll be able to compile these units with Delphi: neuralvolume, neuralnetwork, neuralab, neuralabfun, neuralbit, neuralbyteprediction, neuralcache, neuraldatasets, neuralgeneric, neuralplanbuilder, Neural OpenCL, Neural Threading and neuralfit.
Clone this project, add the neural folder to your Lazarus unit search path and you'll be ready to go!
You can get A.I. powered help from these tools:
- CAI Neural API support at Poe (free)
- CAI Neural API support at Poe
- CAI Neural API support at ChatGPT4
The documentation covers:
- Easy examples
- Simple image classification examples
- YouTube videos
- Advanced examples
- Data structures (Volumes)
- Neural network layers
- Dataset support
- Training (fitting) your neural network
- Parallel computing
- Full set of examples
- Normalization Cheat Sheet
- Layer Authoring Guide — checklist for adding a new layer plus mini-guides on reading numerical-gradient failures and picking a tolerance
- Other scientific publications from the same author
You can click on the image above to watch the video.
Assuming that you would like to train a neural network to learn a function that has 2 inputs and one output, you could start with something like this:
NN.AddLayer([
TNNetInput.Create(2),
TNNetFullConnectReLU.Create(32),
TNNetFullConnectReLU.Create(32),
TNNetFullConnectLinear.Create(1)
]);
The example above has 2 inputs (TNNetInput), 2 dense layers (TNNetFullConnectReLU) with 32 neurons each and one output (TNNetFullConnectLinear).
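To get a feel for the size of this network, its trainable parameters can be counted by hand: each dense layer has (inputs × neurons) weights plus one bias per neuron. The sketch below is plain Free Pascal and does not use the API. The helper name DenseParams is hypothetical, and it assumes every layer has a bias, so a count reported by the API may differ if biases are stored or counted differently.

```pascal
program CountParams;
{$mode objfpc}{$H+}

// Hypothetical helper: weights plus biases of one fully connected layer.
function DenseParams(Inputs, Neurons: integer): integer;
begin
  Result := Inputs * Neurons + Neurons;
end;

var
  Total: integer;
begin
  Total := DenseParams(2, 32)    // first hidden layer:   2*32 + 32 =   96
         + DenseParams(32, 32)   // second hidden layer: 32*32 + 32 = 1056
         + DenseParams(32, 1);   // output layer:        32*1  + 1  =   33
  WriteLn('Trainable parameters: ', Total); // prints 1185
end.
```

Almost all of the parameters sit in the 32x32 connection between the two hidden layers, which is typical for small dense networks.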
You can learn more about how to build and train simple neural networks in the following source code examples:
- Only one neuron
- Training a neural network to learn the hypotenuse function
- Training a neural network to learn the hypotenuse function with FitLoading
- Training a neural network to learn boolean functions AND, OR and XOR with neuralfit unit
- Training a neural network to learn boolean functions AND, OR and XOR without neuralfit unit
- Reptile first-order meta-learning: learning an initialization that adapts to a new sine-regression task in a few SGD steps
- Lottery Ticket Hypothesis: magnitude-prune a small dense net, then retrain the sparse mask from the original init vs from fresh random weights — the original-init "winning ticket" matches the dense net and beats random reinit at moderate-to-high sparsity (pure CPU)
Loading is very easy:
NN := TNNet.Create;
NN.LoadFromFile('MyTrainedNeuralNetwork.nn');
Saving is just as easy:
NN.SaveToFile('MyTrainedNeuralNetwork.nn');
The CIFAR-10 dataset is a well-known collection of images commonly used to train machine learning and computer vision algorithms. It was created by the Canadian Institute for Advanced Research (CIFAR). It contains 60K 32x32 color images. The images are classified into 10 different classes, with 6,000 images per class. The classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Despite its relatively low resolution and small size, CIFAR-10 can be challenging for models to achieve high accuracy, making it a good dataset for testing advancements in machine learning techniques.
Here is a source code example for CIFAR-10 image classification:
NN := TNNet.Create();
NN.AddLayer([
TNNetInput.Create(32, 32, 3), //32x32x3 Input Image
TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
TNNetMaxPool.Create({Size=}2),
TNNetConvolutionReLU.Create({Features=}32, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
TNNetMaxPool.Create({Size=}2),
TNNetConvolutionReLU.Create({Features=}32, {FeatureSize=}5, {Padding=}0, {Stride=}1, {SuppressBias=}0),
TNNetFullConnectReLU.Create({Neurons=}32),
TNNetFullConnectLinear.Create(NumClasses),
TNNetSoftMax.Create()
]);
CreateCifar10Volumes(ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes);
WriteLn('Neural Network will minimize error with:');
WriteLn(' Layers: ', NN.CountLayers());
WriteLn(' Neurons:', NN.CountNeurons());
WriteLn(' Weights:', NN.CountWeights());
NeuralFit := TNeuralImageFit.Create;
NeuralFit.InitialLearningRate := fLearningRate;
NeuralFit.Inertia := fInertia;
NeuralFit.Fit(NN, ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes, NumClasses, {batchsize}128, {epochs}100);
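It can help to trace how the spatial size shrinks through the convolution and pooling layers of the network above. The sketch below is plain Free Pascal and does not call the API. It assumes the textbook output-size formula for convolutions and a max pooling whose stride equals its window size; both helper names are hypothetical and the API's exact rounding has not been checked here.

```pascal
program TraceShapes;
{$mode objfpc}{$H+}

// Textbook convolution output size (an assumption, not taken from the API source).
function ConvOut(InSize, FeatureSize, Padding, Stride: integer): integer;
begin
  Result := (InSize + 2 * Padding - FeatureSize) div Stride + 1;
end;

// Max pooling with window = stride = Size (an assumption).
function PoolOut(InSize, Size: integer): integer;
begin
  Result := InSize div Size;
end;

var
  S: integer;
begin
  S := 32;                                                          // 32x32x3 input
  S := ConvOut(S, 5, 0, 1); WriteLn('conv 1: ', S, 'x', S, 'x16'); // 28x28x16
  S := PoolOut(S, 2);       WriteLn('pool 1: ', S, 'x', S, 'x16'); // 14x14x16
  S := ConvOut(S, 5, 0, 1); WriteLn('conv 2: ', S, 'x', S, 'x32'); // 10x10x32
  S := PoolOut(S, 2);       WriteLn('pool 2: ', S, 'x', S, 'x32'); // 5x5x32
  S := ConvOut(S, 5, 0, 1); WriteLn('conv 3: ', S, 'x', S, 'x32'); // 1x1x32
end.
```

Under these assumptions the last convolution produces a 1x1x32 volume, so the fully connected layers that follow see 32 values per image.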
These examples train a neural network to classify images into classes such as cat, dog or airplane:
- Simple CIFAR-10 Image Classifier
- Simple CIFAR-10 Image Classifier with OpenCL
- Many neural network architectures for CIFAR-10 image classification
- MNIST, Fashion MNIST and CIFAR-100
You can save and load trained models (neural networks) with TNNet.SaveToFile and TNNet.LoadFromFile. The file format is portable: for example, you can train on a CPU and run on a GPU, or train on an AMD machine and run on ARM. The following code shows a simple image classification example that loads a pre-trained model:
procedure ClassifyOneImageSimple;
var
NN: TNNet;
ImageFileName: string;
NeuralFit: TNeuralImageFit;
begin
WriteLn('Loading Neural Network...');
NN := TNNet.Create;
NN.LoadFromFile('SimplePlantLeafDisease-20230720.nn');
NeuralFit := TNeuralImageFit.Create;
ImageFileName := 'plant/Apple___Black_rot/image (1).JPG';
WriteLn('Processing image: ', ImageFileName);
WriteLn(
'The class of the image is: ',
NeuralFit.ClassifyImageFromFile(NN, ImageFileName)
);
NeuralFit.Free;
NN.Free;
end;
Some videos make reference to the uvolume unit. The current neuralvolume unit used to be called uvolume, which is why it is mentioned.
Although these examples require a deeper understanding of neural networks, they are very interesting:
- Identity Shortcut Connection - ResNet building block
- ResNet-20 - includes a web server example
- DenseNetBC L40
- Separable Convolutions - MobileNet building block
- Gradient Ascent - Visualizing patterns from inner neurons in image classification
- Artificial Art - Let a neural network produce art via a generative adversarial network
- Super Resolution - A neural network learns how to increase image resolution
- CIFAR-10 Resized - A program that resizes CIFAR-10 and CIFAR-100 images to 64x64 and 128x128 pixels.
- Autoencoder - Shows an autoencoder built with hyperbolic tangents and trained with Tiny ImageNet 200.
There is also a full set of examples that you can look at.
Volumes behave like dynamically created arrays. They are the main array-like structure used by this API. The TNNetVolume class lets you create volumes that can be accessed as 1D, 2D or 3D arrays and operated on with the Advanced Vector Extensions (AVX) Single Instruction Multiple Data (SIMD) instruction set. The usual way to create a volume is:
constructor Create(pSizeX, pSizeY, pDepth: integer; c: T = 0);
You can access the data as 1D or 3D with:
property Raw[x: integer]: T read GetRaw write SetRaw;
property Data[x, y, d: integer]: T read Get write Store; default;
Your code will look like this:
// Usage Examples
vInput := TNNetVolume.Create(32, 32, 3);
vInput[1, 1, 1] := 1;
vInput[2, 2, 2] := vInput[1, 1, 1] + 1;
vInput.Raw[10] := 5;
vInput.RandomizeGaussian();
WriteLn('Avg: ', vInput.GetAvg());
WriteLn('Variance: ', vInput.GetVariance());
WriteLn('Std Dev: ', vInput.GetStdDeviation());
WriteLn('Multiplying by 10');
vInput.Mul(10);
WriteLn('Avg: ', vInput.GetAvg());
WriteLn('Variance: ', vInput.GetVariance());
WriteLn('Std Dev: ', vInput.GetStdDeviation());
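The second batch of statistics in the example above follows a simple rule: multiplying every element by 10 multiplies the average and the standard deviation by 10 and the variance by 100. The sketch below checks this on a fixed four-element array in plain Free Pascal, without the API. The Mean and Variance helpers are hypothetical, and dividing by N (population variance) is an assumption; the API's own definition may differ.

```pascal
program ScaleStats;
{$mode objfpc}{$H+}

type
  TValues = array[0..3] of single;

function Mean(const V: TValues): single;
var
  i: integer;
begin
  Result := 0;
  for i := Low(V) to High(V) do Result := Result + V[i];
  Result := Result / Length(V);
end;

// Population variance (divides by N); an assumption for this sketch.
function Variance(const V: TValues): single;
var
  i: integer;
  M: single;
begin
  M := Mean(V);
  Result := 0;
  for i := Low(V) to High(V) do Result := Result + Sqr(V[i] - M);
  Result := Result / Length(V);
end;

var
  V: TValues = (1, 2, 3, 4);
  i: integer;
begin
  WriteLn('Avg: ', Mean(V):0:2, '  Variance: ', Variance(V):0:2); // 2.50 and 1.25
  for i := Low(V) to High(V) do V[i] := V[i] * 10;                // same effect as Mul(10)
  WriteLn('Avg: ', Mean(V):0:2, '  Variance: ', Variance(V):0:2); // 25.00 and 125.00
end.
```

With randomized data, as in the example above, the absolute numbers change from run to run but the ratios stay the same.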
As examples, you can add, subtract, multiply and calculate dot products with:
procedure Add(Original: TNNetVolume); overload;
procedure Sub(Original: TNNetVolume); overload;
procedure Mul(Value: Single); overload;
function DotProduct(Original: TNNetVolume): TNeuralFloat; overload;
If you need the raw position of an element of the volume, or a raw pointer to it, you can get it with:
function GetRawPos(x, y, d: integer): integer; overload;
function GetRawPos(x, y: integer): integer; overload;
function GetRawPtr(x, y, d: integer): pointer; overload;
function GetRawPtr(x, y: integer): pointer; overload;
function GetRawPtr(x: integer): pointer; overload;
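A raw position is simply the index of an element in the flat 1D storage behind the 3D view. The sketch below shows one common mapping, with depth varying fastest, then x, then y. Whether TNNetVolume uses exactly this order is an assumption that should be verified against the neuralvolume unit; RawPos is a hypothetical stand-in and not the API's GetRawPos.

```pascal
program RawPosSketch;
{$mode objfpc}{$H+}

const
  SizeX = 32; // width of the volume
  Depth = 3;  // number of channels

// Hypothetical stand-in for GetRawPos; the memory layout is an assumption.
function RawPos(x, y, d: integer): integer;
begin
  Result := ((SizeX * y) + x) * Depth + d;
end;

begin
  WriteLn(RawPos(0, 0, 0)); // 0
  WriteLn(RawPos(1, 1, 1)); // 100
  WriteLn(RawPos(2, 2, 2)); // 200
end.
```

With a depth-fastest layout, all channels of one pixel are adjacent in memory, which suits per-pixel operations such as pointwise convolutions.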
You can easily operate on volumes with OpenCL via TEasyOpenCLV:
TEasyOpenCLV = class (TEasyOpenCL)
public
function CreateBuffer(flags: cl_mem_flags; V: TNNetVolume): cl_mem; overload;
function CreateInputBuffer(V: TNNetVolume): cl_mem; overload;
function CreateHostInputBuffer(V: TNNetVolume): cl_mem; overload;
function CreateOutputBuffer(V: TNNetVolume): cl_mem; overload;
function CreateBuffer(V: TNNetVolume): cl_mem; overload;
function WriteBuffer(buffer: cl_mem; V: TNNetVolume; blocking: cl_bool = CL_FALSE): integer;
function ReadBuffer(buffer: cl_mem; V: TNNetVolume; blocking: cl_bool = CL_TRUE): integer;
function CreateAndWriteBuffer(V: TNNetVolume; var buffer: cl_mem): integer; overload;
function CreateAndWriteBuffer(V: TNNetVolume): cl_mem; overload;
function CreateWriteSetArgument(V: TNNetVolume; kernel:cl_kernel; arg_index: cl_uint): cl_mem;
function CreateOutputSetArgument(V: TNNetVolume; kernel:cl_kernel; arg_index: cl_uint): cl_mem;
end;
Volumes can be organized in pairs:
/// Implements a pair of volumes
TNNetVolumePair = class(TObject)
protected
FA: TNNetVolume;
FB: TNNetVolume;
public
constructor Create(); overload;
constructor Create(pA, pB: TNNetVolume); overload;
constructor CreateCopying(pA, pB: TNNetVolume); overload;
destructor Destroy(); override;
property A:TNNetVolume read FA;
property B:TNNetVolume read FB;
property I:TNNetVolume read FA;
property O:TNNetVolume read FB;
end;
Depending on the problem that you are trying to solve, modelling the training with pairs or pair lists might be helpful. Typically, a pair will be (input, desired output). This is how volume lists and volume pair lists have been implemented:
TNNetVolumeList = class (specialize TFPGObjectList<TNNetVolume>)
TNNetVolumePairList = class (specialize TFPGObjectList<TNNetVolumePair>)
The layered structure of artificial neural networks is inspired by the organization of the human brain and nervous system. In the human brain, information processing occurs in a hierarchical manner. Sensory inputs are first processed by lower-level neurons, which extract simple features. These features are then passed on to deeper neurons that combine them to recognize more complex patterns. This hierarchical processing is mirrored in artificial neural networks through the use of stacked layers.
Biological neurons are connected to each other through synapses, forming complex networks. Similarly, in artificial neural networks, neurons in one layer are connected to neurons in the next layer, mimicking this interconnected structure. Biological neurons fire (activate) based on a non-linear response to their inputs. This non-linearity is crucial for the brain's ability to learn complex patterns. In artificial neural networks, we use non-linear activation functions (such as ReLU) to introduce this non-linearity. Different regions of the brain specialize in processing different types of information. For instance, the visual cortex has layers specialized for detecting edges, shapes, and complex objects. This specialization is reflected in artificial neural networks, where different layers can learn to recognize different levels of abstraction.
In the context of artificial neural networks, we can see this biologically-inspired layered approach implemented. For example:
NN := TNNet.Create();
NN.AddLayer([
TNNetInput.Create(32, 32, 3),
TNNetConvolutionLinear.Create({neurons=}16, {featuresize}3, {padding}1, {stride}1),
TNNetReLU6.Create()
]);
This code snippet demonstrates the creation of a neural network with an input layer and a convolutional layer followed by a ReLU6 activation. This structure is inspired by the visual cortex in the brain, where neurons respond to specific patterns in their receptive fields, similar to how convolutional layers operate. The CAI Neural API also supports the creation of more complex, biologically-inspired architectures. These architectures are designed with multiple layers of different types, mirroring the complex structure of the brain.
Artificial neural networks with multiple layers and specialized structures are inspired by the hierarchical and specialized nature of biological neural processing. It's important to note that while artificial neural networks are inspired by biological neural networks, they are highly simplified models. The human brain is far more complex, with various types of neurons, complex connectivity patterns, and mechanisms we don't yet fully understand. However, the layered structure in artificial neural networks has proven to be a powerful approach for solving complex problems in machine learning, inspired by the remarkable capabilities of biological neural networks.
The input layer serves as the gateway to the entire network. It's like the sensory organs of our brain, receiving information from the outside world. Without an input layer, the neural network would have no way to receive and interpret the initial data, making it impossible to perform any meaningful computations or learning tasks. The TNNetInput class implements the input layer.
Fully connected layers, also known as dense layers, are a fundamental component of neural networks. In these layers, every neuron is connected to every neuron in the previous layer, allowing for comprehensive information processing across the entire network.
In the context of the CAI Neural API, fully connected layers are represented by various classes derived from TNNetFullConnect. These layers play a crucial role in transforming input data and learning complex patterns.
The computation process in a fully connected layer involves:
- Multiplying input values by the layer's weights.
- Adding bias terms (if not suppressed).
- Applying an activation function (if present).
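The three steps above can be sketched for a tiny layer with 2 inputs and 3 neurons. This is plain Free Pascal that does not use the API: the type names, the Forward procedure and the weight values are all made up for illustration, and ReLU is chosen as the activation.

```pascal
program DenseForward;
{$mode objfpc}{$H+}

const
  NumInputs = 2;
  NumNeurons = 3;

type
  TInput = array[0..NumInputs - 1] of single;
  TOutput = array[0..NumNeurons - 1] of single;
  TWeights = array[0..NumNeurons - 1, 0..NumInputs - 1] of single;

function ReLU(X: single): single;
begin
  if X > 0 then Result := X else Result := 0;
end;

// Forward pass of one fully connected layer: Y = ReLU(W*X + B).
procedure Forward(const X: TInput; const W: TWeights; const B: TOutput; out Y: TOutput);
var
  n, i: integer;
  Sum: single;
begin
  for n := 0 to NumNeurons - 1 do
  begin
    Sum := 0;
    for i := 0 to NumInputs - 1 do
      Sum := Sum + W[n, i] * X[i];  // step 1: multiply inputs by weights
    Sum := Sum + B[n];              // step 2: add the bias term
    Y[n] := ReLU(Sum);              // step 3: apply the activation function
  end;
end;

const
  X: TInput = (1.0, 2.0);
  W: TWeights = ((0.5, -1.0), (1.0, 1.0), (-2.0, 0.5));
  B: TOutput = (0.0, 0.5, 0.25);
var
  Y: TOutput;
  n: integer;
begin
  Forward(X, W, B, Y);
  for n := 0 to NumNeurons - 1 do
    WriteLn('y[', n, '] = ', Y[n]:0:2); // 0.00, 3.50, 0.00
end.
```

Neurons 0 and 2 have negative pre-activations (-1.5 and -0.75), so ReLU clamps them to zero. A linear layer would pass those values through unchanged.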
Key types of fully connected layers include:
- TNNetFullConnectLinear: a basic fully connected layer without an activation function. It performs a linear transformation of the input data.
- TNNetFullConnectReLU: incorporates the Rectified Linear Unit (ReLU) activation function. ReLU introduces non-linearity by outputting the input for positive values and zero for negative values, helping the network learn complex patterns.
- TNNetFullConnectSigmoid: applies the sigmoid activation function to the layer's output. Sigmoid squashes the output between 0 and 1, useful for binary classification tasks.
- TNNetComplexLinear: the 2-dimensional complex base rung of the Cayley–Dickson family that also contains the quaternion and octonion layers below. Each 2-channel group (Re, Im) is multiplied by a learned complex w = a + b·i via the 2×2 complex-multiply block [[a,-b],[b,a]] (Re' = a·Re − b·Im, Im' = a·Im + b·Re), using ~1/2 the weights of an equal-width real dense layer while still mixing the real and imaginary parts. The forward block equals the true complex product and is norm-multiplicative: |w·X| = |w|·|X|. Input/output Depth must be multiples of 2. See examples/ComplexLinear/.
- TNNetQuaternionLinear: a hypercomplex dense layer that shares each 4×4 Hamilton block from a single learned quaternion, coupling 4-channel groups (rotation/scaling in quaternion space) with ~1/4 the weights of an equal-width real dense layer. Input/output Depth must be multiples of 4.
- TNNetOctonionLinear: the 8-dimensional octonion (Cayley–Dickson) generalization of the above. Each 8-channel group is multiplied by a learned octonion via the (non-associative) octonion product, assembled as an 8×8 signed block, using ~1/8 the weights of an equal-width real dense layer. The implementation hard-codes an auditable Cayley–Dickson sign table verified by a norm-multiplicativity test (|W·X| = |W|·|X|). Input/output Depth must be multiples of 8. See examples/OctonionLinear/.
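The 2×2 block described for TNNetComplexLinear can be checked by hand. The sketch below applies [[a,-b],[b,a]] to one (Re, Im) pair and compares the norms on both sides of |w·X| = |w|·|X|. It is plain Free Pascal with hypothetical helper names and made-up values; it does not use or reproduce the layer's implementation.

```pascal
program ComplexBlock;
{$mode objfpc}{$H+}

// Applies the block [[a, -b], [b, a]] to the pair (Re, Im).
procedure ComplexMul(a, b, Re, Im: double; out OutRe, OutIm: double);
begin
  OutRe := a * Re - b * Im;
  OutIm := a * Im + b * Re;
end;

function Modulus(Re, Im: double): double;
begin
  Result := Sqrt(Re * Re + Im * Im);
end;

var
  PRe, PIm: double;
begin
  // w = 3 + 4i (so |w| = 5) and X = 1 + 2i
  ComplexMul(3, 4, 1, 2, PRe, PIm);
  WriteLn('w*X = ', PRe:0:1, ' + ', PIm:0:1, 'i');              // -5.0 + 10.0i
  WriteLn('|w*X|   = ', Modulus(PRe, PIm):0:6);                  // 11.180340
  WriteLn('|w|*|X| = ', (Modulus(3, 4) * Modulus(1, 2)):0:6);    // 11.180340
end.
```

The block needs only the two numbers a and b, whereas an unconstrained real 2×2 block needs four, which is where the roughly 1/2 weight count comes from.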
Fully connected layers are typically used in neural network architectures as:
- Hidden layers for processing and transforming features.
- Output layers for producing final predictions.
For example, here is a simple neural network structure:
NN := TNNet.Create();
NN.AddLayer([
TNNetInput.Create(2), // Input layer with 2 inputs
TNNetFullConnect.Create(2), // Hidden fully connected layer with 2 neurons
TNNetFullConnect.Create(1) // Output fully connected layer with 1 neuron
]);

| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetFullConnectLinear | 1D, 2D, or 3D | None | Fully connected layer without an activation function (linear). |
| TNNetFullConnect | 1D, 2D, or 3D | tanh | Fully connected layer with tanh as the default activation function. |
| TNNetFullConnectReLU | 1D, 2D, or 3D | ReLU | Fully connected layer with ReLU activation. |
| TNNetFullConnectSigmoid | 1D, 2D, or 3D | Sigmoid | Fully connected layer with Sigmoid activation. |
| TNNet.AddGroupedFullConnect | 1D, 2D, or 3D | Optional | Adds a grouped fully connected layer, inspired by TNNet.AddGroupedConvolution. |
| TNNetBitLinear | 1D, 2D, or 3D | None | BitNet b1.58 ternary-weight linear layer: forward uses per-neuron absmean-quantized weights Wq = scale*round(clip(W/scale,-1,+1)) with `scale = mean( |
| TNNetCirculantLinear | 1D, 2D, or 3D (n=Depth) | None | Structured-matrix dense layer whose n×n weight matrix is CIRCULANT: every row is a cyclic shift of one learned length-n kernel c, so it stores O(n) weights instead of O(n²) and the map is y = circular_convolution(c, x) (+ bias). Default forward/backward is the exact direct O(n²) circular sum; set UseFFT := true for an opt-in O(n log n) radix-2 FFT fast path (power-of-two n; default OFF, equivalent to the direct path to <1e-5). Distinct from LoRA (low-rank), AddGroupedFullConnect (block-diagonal) and TNNetBitLinear (quantized dense). See examples/CirculantLinear/. |
| TNNetHouseholderLinear | 1D, 2D, or 3D (n=Depth) | None | EXACTLY-orthogonal dense layer whose n×n weight is parameterized as a product of K Householder reflections Q = H_1·H_2·…·H_K with H_i = I − 2·(v_i·v_iᵀ)/(v_iᵀ·v_i); the trainable parameters are the K reflection vectors v_i ∈ ℝ^n. Q is orthogonal for any v_i (no constrained optimization or re-projection), so the map is exactly norm/volume preserving (‖y‖ = ‖x‖ with bias off) and exactly invertible (Qᵀ = H_K·…·H_1). Forward applies the reflections one at a time (O(K·n) per sample, never materializing Q), y = Q·x (+ bias); backward caches each intermediate x_i = H_i·…·H_K·x for a closed-form per-v_i gradient and propagates dL/dx = Qᵀ·dL/dy. An exactly-orthogonal Jacobian is an isometry, so deep plain stacks neither explode nor vanish (the building block for orthogonal/unitary RNNs and reversible/normalizing-flow nets). Distinct from TNNetSpectralNorm (bounds only σ_1, leaves the other singular values free), TNNetCirculantLinear/Toeplitz (constrain the matrix form, not orthogonality) and Muon (orthogonalizes the update, not the weight). The v_iᵀ·v_i → 0 denominator is guarded (degenerate reflection ≈ identity). Composite helper TNNet.AddHouseholderLinear(N, NumReflections, UseBias) (default K = n for a full orthogonal group element; K < n gives a cheaper sub-group). See examples/HouseholderOrthogonal/. |
| TNNetHyperbolicLinear | 1D, 2D, or 3D (Depth-vector point in the ball) | None | Poincaré-ball HYPERBOLIC dense layer (Ganea, Bécigneul & Hofmann 2018, Hyperbolic Neural Networks; Nickel & Kiela 2017): treats the Depth-vector input x as a point inside the open Poincaré ball of radius 1/√c (fixed curvature c>0, default 1.0) and computes the Möbius matrix–vector product followed by a hyperbolic (Möbius) bias translation, y = exp_0(M·log_0(x)) ⊕_c b, where log_0(x)=(1/√c)·atanh(√c‖x‖)·x/‖x‖ and exp_0(v)=(1/√c)·tanh(√c‖v‖)·v/‖v‖ are the log/exp maps at the origin and a ⊕_c b is Möbius addition. Reuses the TNNetFullConnectLinear weight layout (one neuron per output dim holds matrix row M[j]; the Möbius bias b is held coordinate-wise in the per-neuron bias). Genuinely non-Euclidean: unlike every other dense layer here (all flat sum_j W·x plus optional Euclidean bias), the matmul and bias act in hyperbolic space, the natural geometry for embedding trees / hierarchies with low distortion; the output always stays inside the ball (‖y‖<1/√c). Backward is the EXACT analytic chain rule through both radial atanh/tanh maps and the Möbius-addition Jacobian (input, matrix-row and Möbius-bias gradients all numerically gradient-checked); the ‖x‖→1/√c boundary and ‖x‖→0 origin are guarded by an EPS/series fallback so the Jacobian stays finite. Curvature c round-trips via FFloatSt[0]; suppress-bias via FStruct[3]. Created with TNNetHyperbolicLinear.Create(Dout) (default c=1) or Create(Dout, c). Optional trainable curvature: Create(Dout, c, SuppressBias, LearnCurvature:=true) makes c a single learnable scalar via the constrained-scalar pattern: one extra 1-weight neuron holds a raw value mapped through c = 0.01 + 3.99*sigmoid(raw) (kept strictly positive/bounded), the exact dL/draw is accumulated via a forward-mode tangent of y through the log_0/exp_0/Möbius-add chain, and the flag round-trips via FStruct[4]. The default (fixed-c) path is byte-for-byte unchanged. |
| TNNetHyperbolicDistance | 1D, 2D, or 3D (Depth-vector point in the ball) | None | Poincaré-ball distance READOUT head (companion to TNNetHyperbolicLinear): maps the Depth-vector input x (a point inside the curvature-c Poincaré ball) to a K-vector of hyperbolic distances to K learnable prototype points p_k, d_k = dist_c(x, p_k) = (2/√c)·atanh(√c·‖(-x) ⊕_c p_k‖) (Möbius addition ⊕_c as in TNNetHyperbolicLinear). Reuses the TNNetFullConnectLinear weight layout (one neuron per prototype holds p_k, bias suppressed). A usable hyperbolic classification/regression head: small distances mean "close in hyperbolic space". Backward is the EXACT analytic chain rule through the radial atanh and the Möbius-addition Jacobian (input and per-prototype gradients both numerically gradient-checked); boundary/origin guarded by EPS/series fallback. Curvature c round-trips via FFloatSt[0], prototype count K via FStruct[0]. Created with TNNetHyperbolicDistance.Create(K) (default c=1) or Create(K, c). See examples/HyperbolicEmbedding/. |
| TNNetMonarchLinear | 1D, 2D, or 3D (n=Depth) | None | Sub-quadratic STRUCTURED dense layer whose n×n weight is a MONARCH matrix (Dao et al. 2022, Monarch: Expressive Structured Matrices for Efficient and Accurate Training): a product of two block-diagonal factors interleaved by a fixed reshape-transpose permutation, y = Pᵀ·(L·(P·(R·x))) (+ bias), where R and L are block-diagonal with b blocks of size m×m (n = b·m, default b = round(√n) for perfect-square n, else the largest divisor ≤ √n; b round-trips via FStruct[1]) and P is the fixed (b,m)→(m,b) index-gather permutation (no weights). Stores/runs in O(n^1.5) instead of O(n²) yet provably contains the DFT, the Hadamard transform and ordinary convolutions. It is a genuinely different structured operator from TNNetCirculantLinear (single cyclic kernel), LoRA (low-rank) and AddGroupedFullConnect (one block-diagonal, no permutation mix). Forward applies R, permute, L, un-permute as four cheap passes (never materializing the dense n×n); backward is the exact transpose chain (dL/dx, dL/dR, dL/dL all block-local matmuls), both factor gradients numerically gradient-checked. Suppress-bias via FStruct[3]. The map is square, so n is inferred from the previous layer's size; created with TNNetMonarchLinear.Create() (or Create(1) to suppress bias). |
| TNNetKroneckerLinear | 1D, 2D, or 3D (n=Depth) | None | Sub-quadratic STRUCTURED dense layer whose n×n weight is a single KRONECKER PRODUCT W = A ⊗ B of two small learned factors A (p×p) and B (q×q) with n = p·q. Stores only O(p²+q²) ≈ O(n) weights instead of O(n²), and the dense n×n Kronecker matrix is NEVER materialized: x is reshaped to a q×p matrix X (X[i,j] = x[i·p+j]) and the matvec is two small GEMMs Y = B·X·Aᵀ (O(n·(p+q)) = O(n^1.5)), flattened back as y[i·p+j] = Y[i,j] (+ bias). Under this row-major vec convention y = vec(B·X·Aᵀ) exactly equals (A⊗B)·x. Backward is the exact transpose chain dL/dX = Bᵀ·dY·A, dL/dA = dYᵀ·(B·X), dL/dB = dY·(X·Aᵀ)ᵀ (all small GEMMs; input, dA and dB all numerically gradient-checked). The factor split p defaults to round(√n) (largest divisor ≤ √n when n is not a perfect square) and round-trips via FStruct[1]; q = n div p. A genuinely different structured operator from TNNetCirculantLinear (single cyclic kernel), TNNetHouseholderLinear (exactly orthogonal), TNNetMonarchLinear (two block-diagonal factors + permutation) and LoRA (low-rank): Kronecker is a single tensor-product factorisation. Suppress-bias via FStruct[3]. The map is square, so n is inferred from the previous layer's size; created with TNNetKroneckerLinear.Create() (or Create(1) to suppress bias, or Create(0, p) to pin the factor split). See examples/KroneckerLinear/. |
| TNNetTropicalLinear | 1D, 2D, or 3D (Din=Depth) | None | A max-plus / min-plus morphological dense layer that computes in the TROPICAL (max-plus) semiring instead of the usual multiply-accumulate ring: y_i = max_j (x_j + W[i,j]) (a morphological DILATION), with a paired ERODE mode y_i = min_j (x_j + W[i,j]) selected by a constructor flag (round-trips via FStruct[6]). The weights are learnable additive thresholds and the combine op is max/min, so the layer learns piecewise-linear convex (dilation) / concave (erosion) functions and tropical polynomials. It is a genuinely different operator from TNNetFullConnect* and the structured-linear family (TNNetCirculantLinear/TNNetHouseholderLinear/TNNetBitLinear, all sum_j W·x) and from parameter-free max/min pooling. Forward is O(Din·Dout); backward is the same hard arg-max/arg-min subgradient as TNNetMaxPool: cache the winning j* per output and route dL/dx[j*] += dy_i, dL/dW[i,j*] += dy_i (non-differentiable at ties). Both the input and weight gradients are numerically gradient-checked (away from the tie kink). Created with TNNetTropicalLinear.Create(Dout) (dilation) or Create(Dout, 1) (erode). See examples/TropicalMorphology/. |
| TNNetTropicalConv | 2D or 3D (SizeX x SizeY x Depth) | None | The SPATIAL sibling of TNNetTropicalLinear: a grayscale morphological dilation/erosion convolution with a learnable additive structuring element (SE) over a (SizeX,SizeY,Depth) patch: y[x,y,co] = max_{dx,dy,ci} (input[x+dx,y+dy,ci] + SE[dx,dy,ci,co]) (dilation), with a paired ERODE mode min_{dx,dy,ci} (input − SE) selected by a constructor flag (round-trips via FStruct[6]). Subclasses TNNetConvolutionLinear for the conv geometry (kernel/padding/stride) but replaces the multiply-accumulate with the max-plus/min-plus semiring and a single SE weight neuron. Distinct from parameter-free TNNetMaxPool (learnable additive SE, not a fixed window) and from ordinary linear conv (sum W·x). Backward is the hard arg-max/arg-min one-hot subgradient (MaxPool convention): cache the winning (dx,dy,ci) tap per output cell and route dy to that single input cell and that single SE tap (non-differentiable at ties). Both input and SE gradients numerically gradient-checked (away from the tie kink). Created with TNNetTropicalConv.Create(Features, FeatureSize, Padding, Stride) (dilation) or Create(Features, FeatureSize, Padding, Stride, 1) (erode). |
| TNNetCondConv | 2D or 3D (SizeX x SizeY x Depth) | None | Conditionally-parameterized ("dynamic") convolution (Yang et al. 2019, CondConv, NeurIPS): owns a bank of K expert kernels W_1..W_K (each a normal Features × FeatureSize × FeatureSize × InChannels kernel) plus a tiny per-sample routing head (global-average-pool → FullConnect → sigmoid) emitting K mixing coefficients alpha_k per input sample; the effective kernel is the per-sample blend W_eff = sum_k alpha_k·W_k applied as ONE ordinary convolution, so inference cost stays that of a single conv regardless of K while capacity grows with the bank. Backward routes dL/dW_k = alpha_k·dL/dW_eff, sends dL/dalpha_k = <dL/dW_eff, W_k> back through the sigmoid + FC + pool into the input, and propagates the standard conv input gradient through W_eff (input, expert-bank, and routing-head gradients all numerically gradient-checked). Distinct from TNNetHyperConv (GENERATES the whole kernel from a second tensor in one shot) and AddMixtureOfExperts (mixes K expert OUTPUTS post-hoc, K forward passes); CondConv mixes K kernels BEFORE the conv (one forward pass). K/Features/FeatureSize/Padding/Stride round-trip via FStruct. Created with TNNetCondConv.Create(K, Features, FeatureSize, Padding, Stride). See examples/CondConv/. |
| TNNetSpectralNorm | 1D, 2D, or 3D | None | Spectral-normalized dense layer (Miyato et al. 2018): forward divides the weight matrix by its largest singular value sigma_1 (estimated by power iteration, Iters steps in FStruct[5], default 10) so the effective operator W/sigma_1 has spectral norm ~1; sigma_1 is treated as constant in backward (input error propagated through the scaled weights). |
| TNNetHighway | 1D, 2D, or 3D (shape-preserving over Depth) | Optional | Highway-network layer (Srivastava, Greff & Schmidhuber 2015): y = T(x)⊙H(x) + (1−T(x))⊙x with an input-dependent transform gate T(x) = sigmoid(W_T·x + b_T) and learned transform H(x) = activation(W_H·x + b_H). The input-dependent learned-gate ancestor of the ResNet skip connection: unlike TNNetReZero (scalar) and TNNetGatedResidual (per-channel constant), the gate is computed from the activation each forward pass and carries the identity (1−T)·x. Gate bias inits negative (b_T ≈ −1.5, the paper's carry trick) so a fresh deep stack starts near identity. Per-channel pointwise, so it composes inside the residual builders. See examples/HighwayDepth/. |
| TNNetHyperLinear | main: 1D/2D/3D → Dout | None | HyperNetwork dense layer (Ha, Dai & Le 2016): owns NO trainable weights. Its weight matrix is GENERATED by an upstream net and read from a SECOND input tensor (WeightsSource, a flat Din*Dout (+Dout) row-major matrix + optional bias) rather than from Neurons[]. Two-source wiring like TNNetCrossAttention. Forward y = W_gen·x (+b); backward propagates into BOTH the main input (W_gen^T·dy) and the generated-weights tensor (dy⊗x, dy) so the generator trains end-to-end. Composite helper TNNet.AddHyperLinear(Din, Dout, ContextLayer, UseBias=true) wires the generator FullConnect (off ContextLayer) and the weightless TNNetHyperLinear (off the main path) in one call. See examples/HyperNetwork/. |
| TNNetHyperConv | main: 3D image → VALID conv | None | Spatial cousin of TNNetHyperLinear: a weightless convolution whose kernel is GENERATED by an upstream net and read from a SECOND input tensor instead of Neurons[]. VALID conv, stride 1; kernel laid out W[o,ky,kx,i] flat (((o*K+ky)*K+kx)*InChannels + i) plus optional per-output-channel bias. Backward propagates into BOTH the main image and the generated-kernel tensor so the generator trains end-to-end. Composite helper TNNet.AddHyperConv(InChannels, OutChannels, FeatureSize, ContextLayer, UseBias=true) wires the generator FullConnect and the weightless TNNetHyperConv in one call. Note: the generator emits the whole OutC*K*K*InC kernel in one shot, which caps kernel/channel size (memory/param trade-off). |
TNNetModernHopfield |
2D (SeqLen x 1 x d) |
Pattern bank | Modern continuous Hopfield associative-memory layer (Ramsauer et al. 2020, Hopfield Networks is All You Need): an ENERGY-BASED memory that ITERATES a softmax retrieval to a fixed point against a learnable stored-pattern bank X of shape (NumPatterns, d). Per query position, xi := X^T·softmax(beta·X·xi) is applied for K update steps — K=1 reduces to ordinary attention, K>1 sharpens toward the nearest stored memory (the whole point). Inverse-temperature beta controls the metastable-vs-sharp regime. Forward caches per-step softmax weights; backward differentiates through the unrolled K steps (the SDPA softmax-Jacobian path, summed over steps, scattered into both bank and input). Composite helper TNNet.AddModernHopfieldRetrieval(NumPatterns, KSteps, Beta) wires a (SeqLen,1,d) query against a fresh learnable bank. See examples/HopfieldAssociativeMemory/. |
TNNetProductKeyMemory |
Input: SeqLen x 1 x QueryDim (even QueryDim, divisible by Heads with QueryDim/Heads even). Output: SeqLen x 1 x (Heads*ValueDim). |
Half-keys + values (per head) | Large, sparsely-accessed product-key memory (Lample et al. 2019, Large Memory Layers with Product Keys): factorizes a NumKeys-row memory into the Cartesian product of two small half-key banks K1, K2 (each sqrt(NumKeys) rows of (QueryDim/Heads)/2). A query is split in half, each half scored against its own bank, the top-TopK per half taken, the TopK x TopK combinations re-scored, and the global top-TopK product keys selected in O(sqrt(NumKeys)) work; a softmax over those gates a sparse weighted sum over only the touched rows of the learned value table. Multi-head (Heads>1): the query is split into Heads contiguous sub-queries, each running an independent product-key lookup against its own K1/K2/V banks (3*Heads neurons; head h owns neurons 3h,3h+1,3h+2), and the Heads ValueDim-wide outputs are concatenated along Depth. Heads=1 is byte-for-byte the v1 single-head path. Forward caches the chosen indices + weights; backward scatters the value-gradient into only the touched value rows and pushes the exact softmax-Jacobian score-gradient back through both half-key dot products into the query and K1/K2. Distinct from TNNetModernHopfield (dense softmax over a small fully-retrieved bank), TNNetEmbedding (one-hot lookup), and MoE (expert MLPs). Composite helper TNNet.AddProductKeyMemory(NumKeys, ValueDim, TopK, Heads). See examples/ProductKeyMemory/. |
TNNetNTMMemory |
Input: T x 1 x InputDim. Output: T x 1 x SlotWidth (per-step read vectors). |
Key/beta/erase/add projections | The first writable differentiable external-memory layer in the library — a Neural Turing Machine (Graves et al. 2014, Neural Turing Machines, arXiv:1410.5401). Unlike the read-only associative memories (TNNetModernHopfield iterated recall, TNNetProductKeyMemory sparse lookup) it carries a persistent memory matrix M (NumSlots × SlotWidth) that the layer both reads and writes while sweeping the time axis. Per step the input is projected to a content key (cosine-addressed against every slot row, sharpened by a softplus key-strength beta, softmaxed over slots → weights w), a read r_t = w^T·M (the step's output), and a sigmoid erase e + add a that update M[i] := M[i]·(1 − w[i]·e) + w[i]·a (erase-then-add, same w). The four projection matrices (8 neurons Wk,bk,Wb,bb,We,be,Wa,ba) are the only trainables; backward is full BPTT through the recurrent M update — because M_t depends on both M_{t-1} and w_t, dL/dM and dL/dw both chain backward across steps (input, read-path and write-projection gradients all numerically gradient-checked). NumSlots/SlotWidth round-trip via FStruct[0..1], InitVal via FFloatSt[0]; M is re-initialised to a small constant each sweep and is not persisted across loads (re-init on load, like SetAdjacency). v1 is content addressing only, single read+write head; location-based shift/interpolation addressing and the DNC temporal-link matrix are documented follow-ups. Created with TNNetNTMMemory.Create(NumSlots, SlotWidth, InitVal=0.001). See examples/NeuralTuringMachine/. |
TNNetSoftDecisionTree |
Input: flat Din-vector. Output: 1 x 1 x OutputDepth. |
None (mixture of leaf vectors) | Differentiable soft (oblique) decision tree (Kontschieder et al. 2015, Deep Neural Decision Forests; Frosst & Hinton 2017, Distilling a Neural Network Into a Soft Decision Tree) — a structurally new hierarchical soft-routing paradigm for this library (not a matrix factorization, attention, recurrence or kernel method). A balanced binary tree of fixed depth D has 2^D−1 inner nodes and 2^D leaves. Each inner node i is a learnable linear gate p_i = sigmoid(beta·(w_i·x + b_i)); a sample reaches a leaf with probability equal to the product of the left (p_i) / right (1−p_i) gate decisions along its root-to-leaf path, and the output is the path-probability-weighted mixture `y = sum_l P(leaf=l |
Neurons, filters, and kernels are closely related concepts that are often used interchangeably in the context of neural networks, particularly in convolutional neural networks (CNNs). Here's why:
- Neurons: in artificial neural networks, neurons are the basic computational units. They receive input, process it, and produce an output. In the context of CNNs, the term "neuron" is sometimes used to refer to a single element in a feature map.
- Filters: in CNNs, filters (also called convolution kernels) are small matrices of weights that slide over the input data to detect specific features. Each filter produces a feature map in the output layer.
- Kernels: in image processing and CNNs, kernels are small matrices used for various operations like blurring, sharpening, or edge detection. In the context of CNNs, kernels and filters are essentially the same thing.
The reason these terms are often used synonymously is that they all contribute to the feature detection and transformation process in neural networks:
- A single filter/kernel can be thought of as a specialized neuron that detects a specific feature across the entire input.
- The weights in a filter/kernel are analogous to the weights in a traditional neuron.
- The output of applying a filter/kernel to an input region is similar to the activation of a neuron in response to its inputs.
In practice, when implementing CNNs, the terms "filter" and "kernel" are more commonly used than "neuron" when referring to the convolutional layers. However, the underlying concept of a computational unit that processes input and produces output remains the same across these terms.
Convolutional layers are fundamental building blocks in neural networks, particularly in the field of computer vision and image processing. They are designed to automatically and adaptively learn spatial hierarchies of features from input data, such as images.
In the context of the CAI Neural API, convolutional layers are implemented as classes derived from TNNetConvolutionAbstract. This abstract base class provides the core functionality for convolutional operations.
The structure of a convolutional layer typically includes:
- Input: A multi-dimensional array (usually 3D for images: width, height, and channels).
- Kernels (or filters): small matrices of weights that slide over the input.
- Feature maps: the output produced by applying the kernels to the input.
Key parameters of convolutional layers include:
- Number of features (or filters).
- Feature size (kernel size).
- Padding.
- Stride.
The CAI Neural API offers several types of convolutional layers:
- TNNetConvolution: the standard convolutional layer.
- TNNetConvolutionLinear: a convolutional layer without an activation function.
- TNNetConvolutionReLU: a convolutional layer with a ReLU activation function.
- TNNetComplexConv: the 2-dimensional complex base rung of the same Cayley–Dickson convolution family. Each 2-channel input patch (Re, Im) is complex-multiplied by a learned complex filter tap w = a + b·i (Re' = a·Re − b·Im, Im' = a·Im + b·Re) and accumulated, coupling the real and imaginary parts across space using only ~1/2 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 2. It reuses the gradient-checked 2×2 complex-multiply forward/backward kernel of its dense sibling TNNetComplexLinear. See examples/ComplexLinear/.
- TNNetQuaternionConv: a hypercomplex convolution whose per-output-channel filter taps are quaternion weights. Each 4-channel input patch is Hamilton-multiplied by a learned quaternion and accumulated, so the four channel components are coupled (rotation/scaling in quaternion space) using only ~1/4 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 4. It reuses the gradient-checked 4×4 Hamilton forward/backward kernel of its dense sibling TNNetQuaternionLinear (the library's first hypercomplex layer). See examples/QuaternionConv/ and examples/QuaternionLinear/.
- TNNetOctonionConv: the 8-dimensional octonion (Cayley–Dickson) generalization of TNNetQuaternionConv. Each 8-channel input patch is octonion-multiplied by a learned octonion filter tap and accumulated, coupling all eight channel components across space using only ~1/8 the weights of an equal-width real convolution. Input Depth and the feature count must both be multiples of 8. It reuses the same auditable, norm-multiplicativity-verified 8×8 Cayley–Dickson forward/backward kernel as its dense sibling TNNetOctonionLinear. See examples/OctonionConv/ and examples/OctonionLinear/.
Convolutional layers are crucial in neural networks because they:
- Automatically learn hierarchical features from data.
- Maintain spatial relationships in the input.
- Reduce the number of parameters compared to fully connected layers.
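- Respond to a feature in the same way wherever it appears in the input (translation equivariance), which pooling can turn into approximate translation invariance.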
In practice, convolutional layers are often used in combination with other layer types, such as pooling layers (e.g., TNNetMaxPool) and normalization layers (e.g., TNNetMovingStdNormalization), to create powerful neural network architectures for tasks like image classification, object detection, and segmentation.
Here's a brief example of how to create a convolutional layer using the CAI Neural API:
NN := TNNet.Create();
NN.AddLayer([
TNNetInput.Create(32, 32, 3), // Input layer for 32x32 RGB images
TNNetConvolutionLinear.Create(
{Features=}64, // Number of output features
{FeatureSize=}5, // 5x5 kernel size
{Padding=}2, // Padding of 2 pixels
{Stride=}1, // Stride of 1 pixel
{SuppressBias=}1 // Suppress bias
),
TNNetReLU6.Create() // Activation function
]);

This example creates a convolutional layer with 64 features, a 5x5 kernel size, padding of 2, and a stride of 1, followed by a ReLU6 activation function.
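The shape and size of that layer can be checked with standard convolution arithmetic. The following is a minimal, library-independent sketch: the helper names `ConvOutputSize` and `ConvWeightCount` are hypothetical (they are not part of the CAI Neural API), and the formulas are the textbook ones, assumed here to match what the API computes.

```pascal
program ConvShapeSketch;
{$mode objfpc}

// Output size along one spatial axis: (In + 2*Padding - K) div Stride + 1.
function ConvOutputSize(InputSize, FeatureSize, Padding, Stride: integer): integer;
begin
  Result := (InputSize + 2 * Padding - FeatureSize) div Stride + 1;
end;

// Weights of a standard convolution, biases excluded.
function ConvWeightCount(Features, FeatureSize, InputDepth: integer): integer;
begin
  Result := Features * FeatureSize * FeatureSize * InputDepth;
end;

begin
  // Values from the example: 32x32x3 input, 64 features, 5x5 kernel,
  // padding 2, stride 1.
  WriteLn('Output width/height: ', ConvOutputSize(32, 5, 2, 1));
  WriteLn('Output depth:        ', 64);
  WriteLn('Weights (no bias):   ', ConvWeightCount(64, 5, 3));
end.
```

With padding 2 and a 5x5 kernel at stride 1, the 32x32 spatial size is preserved, so the layer maps 32x32x3 to 32x32x64 using 4800 weights.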
These are the available convolutional layers in CAI:
| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetConvolutionLinear | 1D, 2D, or 3D | None | Linear convolutional layer without activation. Useful for intermediate layers. |
| TNNetSpectralNormConv | 1D, 2D, or 3D | None | Spectral-normalized convolutional layer: the convolution analogue of TNNetSpectralNorm. Forward divides the filters by the largest singular value sigma_1 of the flattened kernel matrix (out_channels x in_channels*kx*ky, estimated by power iteration reusing TNNet.EstimateSpectralNorm) so the effective conv operator is norm-bounded; sigma_1 is treated as constant in backward. |
| TNNetConvolution | 1D, 2D, or 3D | tanh | Standard convolutional layer. Versatile for feature extraction in tasks like image recognition. |
| TNNetConvolutionReLU | 1D, 2D, or 3D | ReLU | Convolutional layer with ReLU activation. Helps mitigate vanishing gradient problem. |
| TNNetConvolutionSwish | 1D, 2D, or 3D | Swish | Convolutional layer with Swish activation. Performs better than ReLU in some cases. |
| TNNetConvolutionHardSwish | 1D, 2D, or 3D | Hard Swish | Convolutional layer with Hard Swish activation. It is similar to swish but it's faster. |
| TNNetConvolutionSharedWeights | 1D, 2D, or 3D | same as linked layer | Convolutional layer that uses the weights from another layer |
| TNNetPointwiseConvLinear | 1D, 2D, or 3D | None | Linear 1x1 convolution. Useful for channel mixing without spatial operations. |
| TNNetPointwiseConvReLU | 1D, 2D, or 3D | ReLU | 1x1 convolution with ReLU. Efficient for channel-wise dimensionality reduction or expansion. |
| TNNetPointwiseConv | 1D, 2D, or 3D | tanh | 1x1 convolution. Useful for autoencoding architectures. |
| TNNetDepthwiseConvLinear | 1D, 2D, or 3D | None | Linear depthwise convolution. Useful when additional non-linearity is not required. |
| TNNetDepthwiseConv | 1D, 2D, or 3D | tanh | Depthwise convolution with tanh activation. Reduces computational cost by processing each channel separately. |
| TNNetDepthwiseConvReLU | 1D, 2D, or 3D | ReLU | Depthwise convolution with ReLU activation. Combines depthwise efficiency with the benefits of ReLU. |
| TNNet.AddSeparableConvLinear | 1D, 2D, or 3D | None | Adds a linear separable convolution. Useful for lightweight models with reduced parameter count. |
| TNNet.AddSeparableConvReLU | 1D, 2D, or 3D | ReLU | Adds a separable convolution with ReLU. Combines depthwise and pointwise for efficient feature extraction. |
| TNNet.AddConvOrSeparableConv | 1D, 2D, or 3D | Optional | Adds standard or separable convolution. Supports optional ReLU and normalization for versatile design. |
| TNNet.AddGroupedConvolution | 1D, 2D, or 3D | Optional | Adds a grouped convolution. Allows efficient parallel processing of input channels. |
Grouped pointwise convolutions are an efficient variant of standard 1x1 (pointwise) convolutions: the input channels are divided into groups and each group is processed separately, leaving the spatial dimensions unchanged. Grouping can significantly reduce the number of parameters in a neural network, as shown in the papers Grouped Pointwise Convolutions Reduce Parameters in Convolutional Neural Networks and An Enhanced Scheme for Reducing the Complexity of Pointwise Convolutions in CNNs for Image Classification Based on Interleaved Grouped Filters without Divisibility Constraints. Fewer parameters make models cheaper in both computation and memory. These convolutions can be combined with other techniques such as normalization and intergroup connections, which allows more sophisticated network designs. They are particularly useful in resource-constrained settings such as mobile or edge computing, where they help maintain model expressivity while reducing computational requirements.
The grouped pointwise convolutional layers are:
| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetGroupedPointwiseConvLinear | 1D, 2D, or 3D | None | Linear 1x1 grouped convolution. Useful for channel mixing without spatial operations. |
| TNNetGroupedPointwiseConvReLU | 1D, 2D, or 3D | ReLU | 1x1 grouped convolution with ReLU. Efficient for channel-wise dimensionality reduction or expansion. |
| TNNetGroupedPointwiseConvHardSwish | 1D, 2D, or 3D | Hard Swish | 1x1 grouped convolution with the fast hard swish activation function. |
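The parameter saving from grouping can be estimated with simple arithmetic. The sketch below is library-independent: `GroupedPointwiseWeights` is a hypothetical helper, and it assumes that both the input depth and the number of features are divisible by the number of groups (the second paper cited above lifts that constraint, which this sketch does not model).

```pascal
program GroupedPointwiseSketch;
{$mode objfpc}

// Weights (biases excluded) of a 1x1 convolution split into Groups groups.
// Each group maps InputDepth/Groups channels to Features/Groups outputs.
// Assumes InputDepth and Features are both divisible by Groups.
function GroupedPointwiseWeights(InputDepth, Features, Groups: integer): integer;
begin
  Result := Groups * (InputDepth div Groups) * (Features div Groups);
end;

begin
  WriteLn('Groups = 1:  ', GroupedPointwiseWeights(256, 256, 1));
  WriteLn('Groups = 16: ', GroupedPointwiseWeights(256, 256, 16));
end.
```

For a 256-to-256 pointwise layer, 16 groups cut the weight count from 65536 to 4096, a 16-fold reduction that equals the number of groups.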
A locally connected layer is a type of neural network layer that shares some similarities with convolutional layers but has some distinct characteristics:
- Structure: Locally connected layers, like convolutional layers, operate on local regions of the input. However, unlike convolutional layers, they do not share weights across different positions in the input.
- Weight independence: Each local region in the input has its own set of weights, which are not shared with other regions. This allows the layer to learn position-specific features.
- Flexibility: Locally connected layers offer more flexibility in learning spatial hierarchies compared to fully connected layers, while still maintaining position-specific information unlike convolutional layers.
- Parameters: These layers typically have more parameters than convolutional layers due to the lack of weight sharing, which can lead to increased computational complexity and memory usage.
- Use cases: Locally connected layers can be useful in scenarios where position-specific features are important, such as in face recognition tasks where different parts of the face have distinct characteristics based on their location.
| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetLocalConnectLinear | 1D, 2D, or 3D | None | Locally connected layer without activation. |
| TNNetLocalConnect | 1D, 2D, or 3D | tanh | Locally connected layer with tanh as the default activation function. |
| TNNetLocalConnectReLU | 1D, 2D, or 3D | ReLU | Locally connected layer with ReLU activation. |
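To make the parameter cost of dropping weight sharing concrete, the sketch below compares weight counts for a convolution and a locally connected layer of the same geometry. It is library-independent, the helper names are hypothetical, and it assumes no padding and stride 1 so that the output side is InputSize - FeatureSize + 1.

```pascal
program LocalConnectWeightsSketch;
{$mode objfpc}

// Output side for a VALID (no padding), stride-1 layer.
function OutputSize(InputSize, FeatureSize: integer): integer;
begin
  Result := InputSize - FeatureSize + 1;
end;

// A convolution shares one set of filters across all positions.
function ConvWeights(Features, FeatureSize, InputDepth: integer): integer;
begin
  Result := Features * FeatureSize * FeatureSize * InputDepth;
end;

// A locally connected layer keeps a separate set per output position.
function LocalConnectWeights(Features, FeatureSize, InputDepth,
  InputSize: integer): integer;
var
  OutSide: integer;
begin
  OutSide := OutputSize(InputSize, FeatureSize);
  Result := OutSide * OutSide * ConvWeights(Features, FeatureSize, InputDepth);
end;

begin
  // 32x32x3 input, 16 features, 3x3 window.
  WriteLn('Convolution:       ', ConvWeights(16, 3, 3));
  WriteLn('Locally connected: ', LocalConnectWeights(16, 3, 3, 32));
end.
```

Under these assumptions the convolution needs 432 weights while the locally connected layer needs 388800, one copy of the filters for each of the 30x30 output positions.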
Max, min, and average pooling are downsampling techniques used in neural networks, particularly in convolutional neural networks (CNNs). Let's explore each of these pooling types as implemented in the CAI Neural API:
- Max Pooling (TNNetMaxPool): Max pooling selects the maximum value from a defined region of the input.
  - It reduces spatial dimensions while retaining the most prominent features.
  - Useful for detecting specific features regardless of their position in the input.
- Min Pooling (TNNetMinPool): Min pooling selects the minimum value from a defined region of the input.
  - It can be useful for detecting dark features or gaps in the input.
  - Less common than max pooling but valuable in specific scenarios.
- Average Pooling (TNNetAvgPool): Average pooling calculates the average value of a defined region of the input.
  - It smooths the input and can help in reducing noise.
  - Often used when we want to preserve more contextual information compared to max pooling.
Unique pooling variants in the API:
- TNNetMinMaxPool: Performs both max and min pooling and concatenates the results.
- TNNetAvgMaxPool: Combines average and max pooling.
- TNNetLpPool: generalized Lp pooling y = ((1/N)·Σ|xᵢ|^p)^(1/p) over each window, with a configurable real exponent p (TNNetLpPool.Create(PoolSize, Stride, Padding, p), default p=2). p=1 is mean-of-absolute-values, p=2 is RMS pooling, and large p approaches max pooling: a single knob interpolating between average and max pooling. Its analytic backward pass ∂y/∂xᵢ = (y^(1-p)/N)·|xᵢ|^(p-1)·sign(xᵢ) is numerically gradient-checked.
- TNNetSoftPool: exponentially-weighted ("softmax") pooling (Stergiou, Poppe & Kalliatakis, 2021). Over each window it computes wᵢ = exp(β·xᵢ)/Σⱼexp(β·xⱼ) and y = Σᵢ wᵢ·xᵢ (TNNetSoftPool.Create(PoolSize, Stride, Padding, β), default β=1; window softmax stabilised by subtracting the window max). The optional inverse-temperature β is a single knob spanning the average↔max family: β → ∞ recovers max pooling, β → 0 recovers average pooling, and β = 1 is the original SoftPool. Unlike max pooling every cell receives gradient: its analytic backward pass ∂y/∂xᵢ = wᵢ·(1 + β·(xᵢ − y)) is numerically gradient-checked across a β sweep.
- TNNetStochasticPool: stochastic pooling (Zeiler & Fergus, 2013). Over each window it builds a probability distribution pᵢ = aᵢ/Σⱼaⱼ from the (assumed non-negative, e.g. post-ReLU) activations. While training (toggled on by TNNet.EnableDropouts(true), like dropout) it samples one cell with probability pᵢ and outputs it, routing the whole window gradient to that sampled cell (like max pooling routes to its argmax); at inference it is deterministic, outputting the probability-weighted expectation y = Σᵢ pᵢ·aᵢ. Sampling uses the library RNG so it is reproducible under a fixed RandSeed. If a window sum is ≤ 0 (degenerate / negative activations) it falls back to the plain window mean. Constructor params (size/stride/padding) match TNNetMaxPool; assumes square feature maps (SizeX = SizeY).
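The Lp and SoftPool formulas above can be tried on a single window with a short, library-independent sketch. `LpPool` and `SoftPool` below are hypothetical stand-alone functions written from the formulas in this list, not the library's implementation, and they operate on one flattened window only.

```pascal
program PoolingFamilySketch;
{$mode objfpc}
uses
  Math;

type
  TWindow = array of double;

// Generalized Lp pooling: y = ((1/N) * sum(|x_i|^p))^(1/p).
function LpPool(const X: TWindow; P: double): double;
var
  i: integer;
  Acc: double;
begin
  Acc := 0;
  for i := 0 to High(X) do
    Acc := Acc + Power(Abs(X[i]), P);
  Result := Power(Acc / Length(X), 1 / P);
end;

// SoftPool: y = sum(w_i * x_i) with w_i = exp(beta*x_i) / sum_j exp(beta*x_j).
// The window max is subtracted before exponentiating for numerical stability.
function SoftPool(const X: TWindow; Beta: double): double;
var
  i: integer;
  MaxX, W, SumW, SumWX: double;
begin
  MaxX := X[0];
  for i := 1 to High(X) do
    if X[i] > MaxX then
      MaxX := X[i];
  SumW := 0;
  SumWX := 0;
  for i := 0 to High(X) do
  begin
    W := Exp(Beta * (X[i] - MaxX));
    SumW := SumW + W;
    SumWX := SumWX + W * X[i];
  end;
  Result := SumWX / SumW;
end;

var
  Win: TWindow;
begin
  SetLength(Win, 4);
  Win[0] := 1; Win[1] := 2; Win[2] := 3; Win[3] := 4;
  WriteLn('Lp pool,  p=1:     ', LpPool(Win, 1):0:4);   // mean of |x|
  WriteLn('Lp pool,  p=2:     ', LpPool(Win, 2):0:4);   // RMS
  WriteLn('Lp pool,  p=50:    ', LpPool(Win, 50):0:4);  // moves toward the max
  WriteLn('SoftPool, beta=0:  ', SoftPool(Win, 0):0:4); // plain average
  WriteLn('SoftPool, beta=1:  ', SoftPool(Win, 1):0:4);
  WriteLn('SoftPool, beta=50: ', SoftPool(Win, 50):0:4);// moves toward the max
end.
```

For the window (1, 2, 3, 4), both p=1 and β=0 give the plain average 2.5, and raising either knob moves the result toward the window maximum 4.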
Backpropagation in pooling layers: During backpropagation, pooling layers distribute the gradient differently:
- Max Pooling: The gradient is passed only to the neuron that had the maximum value during the forward pass.
- Min Pooling: Similar to max pooling, but for the minimum value.
- Average Pooling: The gradient is divided equally among all neurons in the pooling region.
The CAI Neural API implements these backpropagation methods in the respective Backpropagate() functions of each pooling class.
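The gradient routing rules listed above can be illustrated on one flattened 2x2 window. This is a minimal sketch with hypothetical procedure names, not the code inside the library's Backpropagate() methods; min pooling would follow the max version with the comparison reversed.

```pascal
program PoolBackpropSketch;
{$mode objfpc}

type
  TWindow = array[0..3] of double;  // a flattened 2x2 pooling window

// Max pooling: the whole output gradient goes to the argmax cell.
procedure MaxPoolBackward(const X: TWindow; OutGrad: double; out InGrad: TWindow);
var
  i, ArgMax: integer;
begin
  ArgMax := 0;
  for i := 1 to 3 do
    if X[i] > X[ArgMax] then
      ArgMax := i;
  for i := 0 to 3 do
    InGrad[i] := 0;
  InGrad[ArgMax] := OutGrad;
end;

// Average pooling: the output gradient is split equally among all cells.
procedure AvgPoolBackward(OutGrad: double; out InGrad: TWindow);
var
  i: integer;
begin
  for i := 0 to 3 do
    InGrad[i] := OutGrad / 4;
end;

const
  X: TWindow = (1.0, 5.0, 2.0, 3.0);
var
  G: TWindow;
  i: integer;
begin
  MaxPoolBackward(X, 1.0, G);
  Write('Max pool input gradients:');
  for i := 0 to 3 do
    Write(' ', G[i]:0:2);
  WriteLn;

  AvgPoolBackward(1.0, G);
  Write('Avg pool input gradients:');
  for i := 0 to 3 do
    Write(' ', G[i]:0:2);
  WriteLn;
end.
```

With the window (1, 5, 2, 3) and an output gradient of 1, max pooling sends the full gradient to the second cell, while average pooling gives each cell 0.25.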
Deconvolution (Upsampling) counterparts: The API also provides deconvolution or upsampling layers, which can be seen as the inverse operations of pooling:
- TNNetDeMaxPool: a deconvolution layer that can upsample the input.
- TNNetUpsample: also known as depth_to_space, this layer can increase the spatial dimensions of the input.
These layers are crucial in architectures like autoencoders or in tasks requiring upsampling, such as image segmentation.
When to use each pooling type:
- Max Pooling: it is useful for detecting features regardless of their exact location. It's commonly used in classification tasks.
- Min Pooling: it is useful when the absence of features is important, or when working with inverted data.
- Average Pooling: it is good for preserving more context and reducing noise. Often used in later layers of the network.
- TNNetMinMaxPool: used when you want to capture both the presence and absence of features.
- TNNetAvgMaxPool: used when you need to balance between preserving prominent features and maintaining context.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetAvgPool | 1D, 2D, or 3D | Average pooling layer for reducing spatial dimensions. |
| TNNetAdaptiveAvgPool | 1D, 2D, or 3D | Adaptive average pooling (PyTorch AdaptiveAvgPool2d style): produces a fixed target output (SizeX, SizeY) regardless of the input spatial size, leaving depth unchanged. Each output cell averages the input cells in its adaptive window (start=floor(o·In/Out), end=ceil((o+1)·In/Out); windows may overlap when In is not a multiple of Out, and the backward pass accumulates each input cell's contribution). Create(size) makes a square output; Create(sizeX, sizeY) a rectangular one. Setting the target to 1×1 gives global average pooling; setting it equal to the input size is the identity. |
| TNNetAdaptiveMaxPool | 1D, 2D, or 3D | Adaptive max pooling (PyTorch AdaptiveMaxPool2d style): produces a fixed target output (SizeX, SizeY) regardless of the input spatial size, leaving depth unchanged. Each output cell takes the maximum over the input cells in its adaptive window (same start=floor(o·In/Out), end=ceil((o+1)·In/Out) mapping as TNNetAdaptiveAvgPool; windows may overlap when In is not a multiple of Out). The backward pass routes each output error to its window's argmax cell and accumulates (an input cell can be the argmax of several overlapping windows). Create(size) makes a square output; Create(sizeX, sizeY) a rectangular one. Setting the target to 1×1 gives global max pooling; setting it equal to the input size is the identity. |
| TNNetMaxPool | 1D, 2D, or 3D | Max pooling layer for reducing spatial dimensions. |
| TNNetMaxBlurPool | 2D (square) | Anti-aliased / shift-invariant max pooling (Zhang 2019, Making Convolutional Networks Shift-Invariant Again). Takes the max densely at stride 1, then applies a fixed (non-trainable) separable binomial [1,2,1]×[1,2,1]/16 low-pass blur subsampled by the stride (borders clamped and re-normalized so the live taps sum to 1). This removes the aliasing that plain strided max pooling introduces, so the output shifts more gracefully as the input shifts. Constructor params (size/stride/padding) match TNNetMaxPool; assumes square feature maps (SizeX = SizeY). See examples/MaxBlurPool. |
| TNNetBlurPool | 2D (square) | Anti-aliasing pooling primitive (Zhang 2019, Making Convolutional Networks Shift-Invariant Again). The pure low-pass sibling of TNNetMaxBlurPool: it applies the same fixed (non-trainable) separable binomial [1,2,1]×[1,2,1]/16 blur subsampled by the stride (borders clamped and re-normalized so the live taps sum to 1) directly to its input, with no max stage, so it can sit after any layer (a strided conv, an average pool) to suppress aliasing, not just after a max. Constructor params (size/stride/padding) match TNNetMaxPool; assumes square feature maps (SizeX = SizeY). |
| TNNetMinPool | 1D, 2D, or 3D | Min pooling layer for reducing spatial dimensions. |
| TNNet.AddMinMaxPool | 1D, 2D, or 3D | Performs both min and max pooling, then concatenates the results. |
| TNNet.AddAvgMaxPool | 1D, 2D, or 3D | Performs both average and max pooling, then concatenates the results. |
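The adaptive window mapping quoted in the table (start=floor(o·In/Out), end=ceil((o+1)·In/Out)) can be explored with integer arithmetic. The sketch below is library-independent, uses hypothetical function names, and assumes the end index is exclusive, as in the PyTorch layers the table refers to.

```pascal
program AdaptiveWindowSketch;
{$mode objfpc}

// First input cell of output cell O: floor(O * In / Out).
function WindowStart(O, InSize, OutSize: integer): integer;
begin
  Result := (O * InSize) div OutSize;
end;

// One past the last input cell (assumed exclusive): ceil((O + 1) * In / Out).
function WindowEnd(O, InSize, OutSize: integer): integer;
begin
  Result := ((O + 1) * InSize + OutSize - 1) div OutSize;
end;

const
  InputCells = 5;
  OutputCells = 3;
var
  Cell: integer;
begin
  for Cell := 0 to OutputCells - 1 do
    WriteLn('output ', Cell, ' covers input cells [',
      WindowStart(Cell, InputCells, OutputCells), ', ',
      WindowEnd(Cell, InputCells, OutputCells), ')');
end.
```

Mapping 5 input cells onto 3 outputs gives the windows [0, 2), [1, 4) and [3, 5). Input cells 1 and 3 each fall into two windows, which is the overlap case where the backward pass has to accumulate contributions.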
The CAI Neural API also provides specialized versions:
- TNNetMaxChannel and TNNetMinChannel: perform max and min operations across the entire channel into a single number per channel.
- TNNetAvgChannel: averages the entire channel into a single number per channel.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetAvgChannel | 2D or 3D (output: 1D) | Calculates the average value per channel. |
| TNNetMaxChannel | 2D or 3D (output: 1D) | Calculates the maximum value per channel. |
| TNNetMinChannel | 2D or 3D (output: 1D) | Calculates the minimum value per channel. |
| TNNetGather | 2D or 3D (output: depth 1) | Selects a single depth channel: Output[x,y,0] := Input[x,y,Channel]. |
| TNNet.AddMinMaxChannel | 1D, 2D, or 3D | Performs both min and max channel operations, then concatenates the results. |
| TNNet.AddAvgMaxChannel | 1D, 2D, or 3D | Performs both average and max channel operations, then concatenates the results. |
Normalization layers may offer:
- Improved training stability.
- Better generalization.
- Potential for faster convergence.
The available normalization techniques are:
- Zero-centering (TNNetChannelZeroCenter).
- Standard deviation normalization (TNNetMovingStdNormalization, TNNetChannelStdNormalization).
- Per-sample layer normalization (TNNetLayerNorm, TNNetRMSNorm, TNNetGroupNorm).
See the Normalization cheat sheet for a side-by-side comparison of every normalization layer (axes reduced over, learnable parameters, formula, and when to use each).
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetChannelZeroCenter | 1D, 2D, or 3D | Trainable zero-centering normalization. |
| TNNetMovingStdNormalization | 1D, 2D, or 3D | Trainable standard deviation normalization. |
| TNNetChannelStdNormalization | 1D, 2D, or 3D | Trainable per-channel standard deviation normalization. |
| TNNetLayerNorm | 1D, 2D, or 3D | Per-sample layer normalization (zero mean, unit variance) with learnable per-element scale and bias. |
| TNNetRMSNorm | 1D, 2D, or 3D | Per-sample root-mean-square normalization (no mean subtraction) with learnable per-element scale. |
| TNNetRMSNormGated | 1D, 2D, or 3D | Per-sample root-mean-square normalization (no mean subtraction) followed by a learnable per-channel sigmoid gate: y[x,y,d] = (x / sqrt(mean(x^2) + eps)) * sigmoid(g[d]). The only learnable params are the Depth gate logits g[d] (init 0, so the gate is 0.5 at start; no per-element gamma). |
| TNNetSwitchableNorm | 1D, 2D, or 3D | Learnable softmax-weighted convex combination of a LayerNorm-style and an RMSNorm-style per-sample normalization of the same input: y = a_ln * L + a_rms * R, where (a_ln, a_rms) = softmax(w_ln, w_rms), L = (x - mean)/sqrt(var + eps) and R = x / sqrt(mean(x^2) + eps). The only learnable params are the two scalar mixing logits (init 0, so a 50/50 blend at start; no per-element gamma/beta). |
| TNNetGroupNorm | 1D, 2D, or 3D | Normalizes within Groups contiguous channel groups, with learnable per-element scale and bias. |
| TNNetGRN | 2D or 3D | Global Response Normalization (ConvNeXt-V2, Woo et al. 2023). Channel-wise contrast normalization with learnable per-channel gamma and beta (both init 0, so identity at start): Y = gamma * (X * Nx) + beta + X, where `Nx[c] = |
| TNNetZScore | 1D, 2D, or 3D | Per-sample z-score normalization: y = (x - mean) / sqrt(var + eps). No learnable parameters; the unparameterised core of TNNetLayerNorm. |
| TNNetDyT | 1D, 2D, or 3D | Dynamic Tanh (Liu et al. 2025): a normalization-free LayerNorm alternative y[c] = gamma[c]·tanh(alpha·x) + beta[c], with a single layer-wide learnable alpha plus per-channel learnable gamma (init 1) and beta (init 0). No batch or per-sample statistics. Created with TNNetDyT.Create(). |
| TNNetWeightStandardization | 1D | Weight-standardized dense layer (Qiao et al. 2019). A TNNetFullConnectLinear that standardizes each output neuron's weight vector to zero-mean, unit-variance (ŵ = (w − μ)/sqrt(var + eps), biased variance) before the forward dot product. Smooths the loss landscape; pairs well with GroupNorm. The exact standardization Jacobian is propagated to the raw weights and is numerically gradient-checked. Created with TNNetWeightStandardization.Create(Neurons[, eps]). |
| TNNetWeightNormLinear | 1D | Weight-normalized dense layer (the simple g=1 form of Weight Normalization, Salimans & Kingma 2016 / a differentiable unit-L2 weight constraint). A TNNetFullConnectLinear that L2-normalizes each output neuron's weight vector to unit norm (ŵ = w/sqrt(Σwᵢ² + eps)) before the forward dot product: a differentiable reparametrization, not a post-step hard projection. The exact unit-norm Jacobian is propagated to the raw weights and is numerically gradient-checked. Created with TNNetWeightNormLinear.Create(Neurons[, eps]). |
| TNNetKANLayer | 1D | Kolmogorov-Arnold dense layer (Liu et al. 2024): a drop-in D_in → D_out replacement for TNNetFullConnectLinear in which every input→output edge carries its own learned univariate function instead of a single scalar weight: y_j = Σ_i φ_{ij}(x_i), with φ_{ij}(x) = Σ_{k=0..K} c_{ijk}·T_k(tanh x_i) over a fixed first-kind Chebyshev basis (only the D_in·D_out·(K+1) coefficients train). Distinct from TNNetSplineActivation, which is a depth-preserving per-channel activation. Analytic input + per-coefficient gradients (numerically gradient-checked); initialised near-linear so an untrained layer behaves like a plain linear layer. Created with TNNetKANLayer.Create(D_out[, K]) (default K=4). |
| TNNetKANConv | 2D | Kolmogorov-Arnold convolution: the convolutional sibling of TNNetKANLayer. Each receptive-field patch is mapped to an output value by a sum of learned univariate edge functions instead of a linear dot product: y = Σ_{p,i} φ_{p,i}(x_{p,i}) with φ(x) = Σ_{k=0..K} c_k·T_k(tanh x) over the same fixed first-kind Chebyshev basis (only the FeatureSize²·InputDepth·(K+1) coefficients per filter train). Subclasses TNNetConvolutionLinear (no output bias), initialised near-linear, analytic input + per-coefficient gradients (numerically gradient-checked). An optional basis selector (7th constructor arg, csKANBasisChebyshev default / csKANBasisBSpline) swaps the Chebyshev polynomials for the KAN paper's original fixed-knot B-spline parameterisation: clamped open-uniform knots over the tanh-squashed [-1,1] support, G+K coefficients per edge evaluated by the Cox–de Boor recurrence (and its analytic derivative for the backward pass), Greville-abscissa near-linear init; B-splines have compact support (locally smooth, flat extrapolation) vs Chebyshev's global oscillatory polynomials. Created with TNNetKANConv.Create(Features, FeatureSize, Padding, Stride, K[, SuppressBias[, Basis]]). See examples/KANConv/. |
| TNNet.AddMovingNorm | 1D, 2D, or 3D | Possible replacement for batch normalization. |
| TNNet.AddChannelMovingNorm | 1D, 2D, or 3D | Possible replacement for batch normalization, applied per channel. |
TNNetLayerNorm normalizes each input sample over all its elements (SizeX*SizeY*Depth) to zero mean and unit variance, then applies a learnable per-element scale (gamma) and bias (beta). Unlike batch normalization it does not depend on batch statistics, which makes it well suited to transformers and recurrent models. Add it with NN.AddLayer(TNNetLayerNorm.Create());.
TNNetRMSNorm is a cheaper, transformer-friendly variant that divides each sample by the root mean square of its elements (no mean subtraction) and applies a learnable per-element scale. Add it with NN.AddLayer(TNNetRMSNorm.Create());.
TNNetRMSNormGated keeps the same RMS normalization but replaces the per-element scale with a learnable per-channel sigmoid gate sigmoid(g[d]). The gate logits are initialised to 0, so an untrained layer halves each normalized activation (sigmoid(0) = 0.5) and the channels open or close independently during training. Add it with NN.AddLayer(TNNetRMSNormGated.Create());.
TNNetSwitchableNorm lets the network learn how much LayerNorm vs RMSNorm to apply to the same input. It computes both a LayerNorm-style normalization L = (x - mean)/sqrt(var + eps) and an RMSNorm-style normalization R = x / sqrt(mean(x^2) + eps) per sample, then blends them with a softmax over two learnable scalar logits: y = a_ln * L + a_rms * R with (a_ln, a_rms) = softmax(w_ln, w_rms). There is no per-element gamma/beta; the only parameters are the two mixing logits, both initialised to 0 so an untrained layer is an exact 50/50 blend. Add it with NN.AddLayer(TNNetSwitchableNorm.Create());.
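The formulas above can be checked by hand on a tiny sample. The following is a minimal sketch in plain Free Pascal that does not use the library: it computes the LayerNorm-style value L, the RMSNorm-style value R, and the untrained 50/50 blend described for TNNetSwitchableNorm. The helper names and the value of Eps are assumptions made for this illustration, not the library's implementation.

```pascal
program SwitchableNormSketch;
{$mode objfpc}{$H+}

const
  Eps = 1e-5; // assumed epsilon for this sketch; the library's value may differ

// Mean of the elements of X.
function MeanOf(const X: array of Double): Double;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(X) do
    Result := Result + X[I];
  Result := Result / Length(X);
end;

// Mean of the squared elements of X.
function MeanSquareOf(const X: array of Double): Double;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(X) do
    Result := Result + X[I] * X[I];
  Result := Result / Length(X);
end;

// LayerNorm-style value: (x - mean) / sqrt(var + eps).
function LayerNormAt(const X: array of Double; I: Integer): Double;
var
  M: Double;
begin
  M := MeanOf(X);
  Result := (X[I] - M) / Sqrt(MeanSquareOf(X) - M * M + Eps);
end;

// RMSNorm-style value: x / sqrt(mean(x^2) + eps).
function RMSNormAt(const X: array of Double; I: Integer): Double;
begin
  Result := X[I] / Sqrt(MeanSquareOf(X) + Eps);
end;

// Softmax weight of the first of two logits.
function SoftmaxFirst(W1, W2: Double): Double;
begin
  Result := Exp(W1) / (Exp(W1) + Exp(W2));
end;

const
  Sample: array[0..3] of Double = (1.0, 2.0, 3.0, 6.0);
var
  I: Integer;
  ALn, ARms, L, R: Double;
begin
  // Both mixing logits start at 0, so the blend is 50/50.
  ALn := SoftmaxFirst(0.0, 0.0);
  ARms := 1.0 - ALn;
  for I := 0 to High(Sample) do
  begin
    L := LayerNormAt(Sample, I);
    R := RMSNormAt(Sample, I);
    WriteLn('x=', Sample[I]:0:1, '  L=', L:0:4, '  R=', R:0:4,
            '  y=', (ALn * L + ARms * R):0:4);
  end;
end.
```

The LayerNorm column sums to zero because the mean is subtracted, while the RMSNorm column keeps the sign and relative size of every input.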
TNNetGroupNorm splits the input channels (Depth) into Groups contiguous groups and normalizes each group independently, then applies a learnable per-element scale and bias. Depth must be divisible by Groups; otherwise it falls back to a single group. Pass the group count to the constructor, e.g. NN.AddLayer(TNNetGroupNorm.Create(8));.
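As a rough illustration of where these layers might sit in a model, the sketch below places TNNetGroupNorm and TNNetLayerNorm after convolutions. It uses the constructor calls quoted above; the input size, the number of features, and the surrounding layers (TNNetInput, TNNetConvolutionReLU, TNNetFullConnectLinear, TNNetSoftMax) are assumptions chosen for the example, so check them against the unit you compile with.

```pascal
program NormLayersSketch;
{$mode objfpc}{$H+}

uses
  neuralnetwork;

var
  NN: TNNet;
begin
  NN := TNNet.Create();
  try
    // Hypothetical 32x32 RGB input.
    NN.AddLayer(TNNetInput.Create(32, 32, 3));
    // 16 features, 3x3 kernel, padding 1, stride 1 (assumed argument order).
    NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));
    // 16 channels split into 8 groups of 2 channels each.
    NN.AddLayer(TNNetGroupNorm.Create(8));
    NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));
    // Per-sample normalization over all SizeX*SizeY*Depth elements.
    NN.AddLayer(TNNetLayerNorm.Create());
    NN.AddLayer(TNNetFullConnectLinear.Create(10));
    NN.AddLayer(TNNetSoftMax.Create());
    NN.DebugStructure();
  finally
    NN.Free;
  end;
end.
```

The group count of 8 is picked so that it divides the 16 channels produced by the convolution, which avoids the single-group fallback described above.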
Normalization layers (TNNetLayerMaxNormalization, TNNetLayerStdNormalization, TNNetLocalResponseNorm2D, TNNetLocalResponseNormDepth) help stabilize training and can improve model performance by managing the scale and distribution of activations. They are particularly useful in deep networks where the scale of values can change dramatically between layers.
TNNetLayerMaxNormalization normalizes based on the maximum value, while TNNetLayerStdNormalization uses standard deviation. These are particularly useful when you want to normalize the activations within a specific range or distribution without learning any parameters. They can be applied to various network architectures and are especially helpful when dealing with varying scales of input features.
TNNetLocalResponseNorm2D and TNNetLocalResponseNormDepth implement two forms of Local Response Normalization (LRN). LRN is inspired by lateral inhibition in biological neurons and is mostly used in Convolutional Neural Networks (CNNs) for image processing tasks. Use it when you want neuron outputs in the same layer to compete with each other.
TNNetLocalResponseNorm2D is applied across nearby kernel maps at the same spatial position, while TNNetLocalResponseNormDepth normalizes across the depth dimension. These layers can improve generalization, reduce the chance of overfitting, and strengthen the model's ability to detect high-frequency features with a large response.
Random layers (TNNetRandomMulAdd, TNNetChannelRandomMulAdd) act as regularizers: they apply a random multiplication and a random bias (shift) to the activations, which helps prevent overfitting and improves the model's ability to generalize. They are especially useful when working with limited datasets or when you want your model to be robust to small variations in input.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetLayerMaxNormalization | 1D, 2D, or 3D | Non-trainable max normalization per layer. |
| TNNetLayerStdNormalization | 1D, 2D, or 3D | Non-trainable standard deviation normalization per layer. |
| TNNetLocalResponseNorm2D | 2D or 3D | Non-trainable local response normalization for 2D or 3D input. |
| TNNetLocalResponseNormDepth | 2D or 3D | Non-trainable local response normalization with depth normalization. |
| TNNetRandomMulAdd | 1D, 2D, or 3D | Adds random multiplication and random bias (shift). |
| TNNetChannelRandomMulAdd | 1D, 2D, or 3D | Adds random multiplication and random bias (shift) per channel. |
| TNNetGaussianNoise | 1D, 2D, or 3D | Additive N(0, σ²) noise at training, identity at inference. σ stored in FFloatSt[0]. |
| TNNetGaussianDropout | 1D, 2D, or 3D | Multiplicative N(1, σ²) noise at training, identity at inference. σ stored in FFloatSt[0]. |
| TNNetDropBlock | 2D or 3D | Structured spatial dropout (Ghiasi et al. 2018, "DropBlock"). At training it zeroes contiguous block_size × block_size square regions of the feature map (one spatial mask broadcast across all Depth channels), so spatially-correlated neighbours drop together (unlike TNNetDropout which scatters per-element, or TNNetSpatialDropout2D which drops whole channels). Seeds are sampled at rate gamma = (1-keep) * feat_area / (block² * valid_area) only where a full block fits, then dilated into blocks; survivors are rescaled by count_all / count_kept to preserve the expected activation. Identity at inference. Backward gates through the same stored mask. Created with TNNetDropBlock.Create(block_size, drop_prob). |
| TNNetMinMaxNorm | 1D, 2D, or 3D | Per-sample min-max normalization y = (x - min(x)) / (max(x) - min(x) + eps), reduced over the whole sample volume so the output range is approximately [0, 1]. Non-trainable; eps defaults to 1e-7 (constructor-configurable, round-trips via Save/Load). Backward routes the bulk 1/denom gradient plus the exact argmin/argmax coupling terms. A per-channel mode (TNNetMinMaxNorm.Create(eps, {PerChannel:=}True)) reduces min/max over the spatial positions ONLY, independently for each depth channel, so every channel is normalized to its own [0, 1] range; the flag round-trips via Save/Load and full-volume stays the default. Created with TNNetMinMaxNorm.Create(), TNNetMinMaxNorm.Create(eps), or TNNetMinMaxNorm.Create(eps, PerChannel). |
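To make the TNNetMinMaxNorm formula in the table concrete, here is a small stand-alone sketch of the full-volume mode in plain Free Pascal. It only mirrors the forward formula y = (x - min(x)) / (max(x) - min(x) + eps); the procedure name and the flat array standing in for a sample volume are assumptions for this illustration, and the backward pass is not shown.

```pascal
program MinMaxNormSketch;
{$mode objfpc}{$H+}

const
  Eps = 1e-7; // default eps quoted in the table above

// Normalizes X in place using the minimum and maximum of the whole array.
procedure MinMaxNormalize(var X: array of Double);
var
  I: Integer;
  Lo, Hi, Denom: Double;
begin
  Lo := X[0];
  Hi := X[0];
  for I := 1 to High(X) do
  begin
    if X[I] < Lo then Lo := X[I];
    if X[I] > Hi then Hi := X[I];
  end;
  // eps keeps the denominator non-zero when all elements are equal.
  Denom := Hi - Lo + Eps;
  for I := 0 to High(X) do
    X[I] := (X[I] - Lo) / Denom;
end;

var
  V: array[0..3] of Double = (2.0, 4.0, 6.0, 10.0);
  I: Integer;
begin
  MinMaxNormalize(V);
  for I := 0 to High(V) do
    WriteLn(V[I]:0:6);
end.
```

Because of eps, the largest element maps to a value slightly below 1, which is why the table describes the output range as approximately [0, 1].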
These layers provide various tools for normalization, regularization, and introducing controlled variability in neural networks. The choice of which layers to use and where to place them in your network architecture depends on the specific problem you're trying to solve, the characteristics of your data, and the behavior you want to encourage in your model.
Concatenation and summation layers are essential for creating flexible and powerful neural network architectures. Let's break them down:
- Concatenation Layers: There are two main types of concatenation layers in the CAI Neural API:
  - TNNetConcat:
    - This layer concatenates outputs from multiple layers along the depth dimension.
    - It's designed to work with layers that have the same spatial dimensions (X and Y sizes).
    - Usage: It's particularly useful when you want to combine features from different processing paths in your network.
  - TNNetDeepConcat:
    - This layer also concatenates outputs from multiple layers, but it's specifically optimized for the depth dimension.
    - It maintains separate arrays to track the depths of each layer and channel, allowing for efficient deep concatenation.
    - Usage: Ideal for creating architectures that process information in parallel and then combine the results.
- Summation Layer (TNNetSum):
  - This layer adds together the outputs of multiple layers element-wise.
  - It's designed to work with layers of the same size.
  - Usage: Commonly used in residual network (ResNet) style architectures, where it allows for skip connections that help mitigate the vanishing gradient problem and enable the training of very deep networks.
These layers provide several benefits in neural network design:
- Flexibility: they allow for the creation of complex, non-linear network topologies that can process information in parallel and then combine it in various ways.
- Feature Fusion: concatenation and summation layers enable the network to combine features from different processing streams, potentially capturing multi-scale or multi-aspect information.
- Skip Connections: summation layers implement the additive skip connections that are fundamental to ResNets, while concatenation layers implement the dense connections used in DenseNet-like architectures.
- Dimensionality Manipulation: the transposition layers allow for creative manipulations of data dimensions, which can be crucial for certain types of operations or for interfacing between different parts of a network.
- Custom Architectures: these layers provide the building blocks for designing novel network architectures tailored to specific tasks or data types.
By using these layers creatively, developers can build highly customized and efficient neural network architectures that are optimized for their specific use cases.
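The two patterns discussed above, an additive skip connection and a pair of parallel branches merged along depth, might be wired roughly as follows. This is a sketch under assumptions: it relies on NN.GetLastLayer, NN.AddLayerAfter, and array-of-layers constructors for TNNetSum and TNNetDeepConcat, and the layer sizes are arbitrary. Verify the signatures against your copy of the neuralnetwork unit before relying on it.

```pascal
program SkipConnectionSketch;
{$mode objfpc}{$H+}

uses
  neuralnetwork;

var
  NN: TNNet;
  Shortcut, BranchA, BranchB: TNNetLayer;
begin
  NN := TNNet.Create();
  try
    NN.AddLayer(TNNetInput.Create(32, 32, 3));
    NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));

    // Residual block: remember the block input, transform it, then add it back.
    Shortcut := NN.GetLastLayer();
    NN.AddLayer(TNNetConvolutionReLU.Create(16, 3, 1, 1));
    NN.AddLayer(TNNetConvolutionLinear.Create(16, 3, 1, 1));
    // Both summed layers are 32x32x16, as TNNetSum expects equal sizes.
    NN.AddLayer(TNNetSum.Create([NN.GetLastLayer(), Shortcut]));

    // Two parallel branches that read the same layer, merged along depth.
    Shortcut := NN.GetLastLayer();
    BranchA := NN.AddLayerAfter(TNNetConvolutionReLU.Create(8, 1, 0, 1), Shortcut);
    BranchB := NN.AddLayerAfter(TNNetConvolutionReLU.Create(8, 3, 1, 1), Shortcut);
    // Both branches are 32x32x8, so the concatenation is 32x32x16.
    NN.AddLayer(TNNetDeepConcat.Create([BranchA, BranchB]));

    NN.DebugStructure();
  finally
    NN.Free;
  end;
end.
```

Padding is chosen so every branch keeps the 32x32 spatial size, since both summation and depth concatenation need matching X and Y sizes.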
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
TNNetConcat |
1D, 2D, or 3D | Concatenates previous layers into a single layer. |
TNNetDeepConcat |
1D, 2D, or 3D | Concatenates previous layers along the depth axis. This is useful with DenseNet-like architectures. Use TNNetDeepConcat instead of TNNetConcat if you need to add convolutions after concatenating layers. |
TNNetIdentity |
1D, 2D, or 3D | Identity layer that passes the input unchanged. |
TNNetIdentityWithoutBackprop |
1D, 2D, or 3D | Allows the forward pass but prevents backpropagation. |
TNNetLogCoshLoss |
1D, 2D, or 3D | Log-Cosh regression output head. Forward is identity passthrough; backward replaces the framework-seeded (output - target) gradient with tanh(output - target), the bounded gradient of L = sum log(cosh(output - target)). Use as the last layer of a regression net. Created with TNNetLogCoshLoss.Create(). |
TNNetCharbonnierLoss |
1D, 2D, or 3D | Charbonnier ("smooth-L1") regression output head, popular for super-resolution. Forward is identity passthrough; backward replaces the seeded (output - target) gradient with (output - target) / sqrt((output - target)^2 + eps^2), always bounded in [-1, 1]. eps defaults to 1e-3 (constructor-configurable, round-trips via Save/Load). Created with TNNetCharbonnierLoss.Create() or TNNetCharbonnierLoss.Create(eps). |
TNNetQuantileLoss |
1D, 2D, or 3D | Quantile (pinball) regression output head for estimating conditional quantiles / prediction intervals. For target quantile q in (0,1) and residual e = target - prediction, the loss is L_q(e) = max(q·e, (q-1)·e) (q=0.5 recovers the median / MAE). Forward is identity passthrough; backward replaces the framework-seeded (prediction - target) gradient with the subgradient -q when under-predicting (e>0), (1-q) when over-predicting (e<0), and 0 at the kink. q defaults to 0.5 (constructor-configurable, validated in (0,1), round-trips via Save/Load). Created with TNNetQuantileLoss.Create() or TNNetQuantileLoss.Create(q). See examples/QuantileRegression. |
TNNetMultiQuantileLoss |
1D, 2D, or 3D | Single-model multi-quantile pinball head: emits an N-wide output (one channel per target quantile) so all N quantiles are predicted jointly in one forward pass instead of training N separate models. Each output channel i is trained with its own pinball loss (quantile q_i) against the same scalar target; backward writes the per-channel subgradient mirroring TNNetQuantileLoss. The quantile list is serialized (N capped at 8). A non-differentiable inference-time monotonicity guard, the class method TNNetMultiQuantileLoss.SortAscending, sorts each N-channel group so the q=0.1 prediction never crosses q=0.9 ("quantile crossing"). Created with TNNetMultiQuantileLoss.Create() (defaults [0.1, 0.5, 0.9]) or TNNetMultiQuantileLoss.Create([...]). See examples/QuantileRegression. |
TNNetNLLLoss |
1D, 2D, or 3D | Negative-log-likelihood classification output head, the companion to TNNetLogSoftMax. Consumes per-position log-probabilities over the depth axis. Forward is identity passthrough; backward writes the exact NLL gradient -target per position (so a TNNetLogSoftMax -> TNNetNLLLoss stack reproduces softmax cross-entropy, softmax(logits) - target, the numerically stable way). Created with TNNetNLLLoss.Create(). |
TNNetKLDivergence |
1D, 2D, or 3D | Kullback-Leibler divergence output head, KL(target‖pred) = sum(target·log(target/pred)). Place it after a TNNetSoftMax so the input is a probability distribution q. Forward is identity passthrough; backward writes the analytic gradient dL/dq_i = -target_i / q_i, with q clamped to [1e-7, 1] for stability and zero-target terms contributing no gradient (0·log0 := 0). Useful for soft-label / knowledge-distillation training. Created with TNNetKLDivergence.Create(). |
TNNetTverskyLoss |
1D, 2D, or 3D | Tversky segmentation output head (Salehi et al. 2017). Operates on probability-space inputs (after a sigmoid/softmax) with binary/one-hot targets, reduced over the whole volume. With TP=sum(p·g), FP=sum(p·(1-g)), FN=sum((1-p)·g), the Tversky index is TI = (TP+s)/(TP+α·FP+β·FN+s) and the loss is L = 1 - TI. α/β trade false positives vs false negatives (defaults 0.5/0.5), s is a smoothing constant (default 1.0); all round-trip via Save/Load. Forward is identity passthrough; backward writes the analytic dL/dp_i. Created with TNNetTverskyLoss.Create() or TNNetTverskyLoss.Create(alpha, beta, smooth). |
TNNetDiceLoss |
1D, 2D, or 3D | Dice (Sørensen-Dice / F1) segmentation output head — the α=β=0.5 special case of TNNetTverskyLoss (L = 1 - 2·TP/(2·TP+FP+FN)), so it reuses the Tversky forward/backward. Standard choice for class-imbalanced segmentation. Created with TNNetDiceLoss.Create(). |
TNNetWingLoss |
1D, 2D, or 3D | Wing regression output head (Feng et al. 2018), designed for facial-landmark localization. Per-element loss with a logarithmic core `w·ln(1+ |
TNNetLabelSmoothingLoss |
1D, 2D, or 3D | Label-smoothing classification output head (Szegedy et al. 2016). Place it after a TNNetSoftMax. It replaces the one-hot target t with the smoothed t' = (1-eps)·t + eps/NumClasses (NumClasses = depth) and propagates the softmax cross-entropy gradient p - t', discouraging over-confident logits. eps defaults to 0.1 (round-trips via Save/Load). Forward is identity passthrough. Created with TNNetLabelSmoothingLoss.Create() or TNNetLabelSmoothingLoss.Create(eps). |
TNNetTripletLoss |
1D, 2D, or 3D | Triplet-margin metric-learning output head. Splits the input depth into 3 equal anchor/positive/negative chunks (d = Depth div 3; requires Depth mod 3 = 0) and per spatial cell computes the hinge L = max(0, ‖a-p‖² - ‖a-n‖² + margin). There is no external target — supervision is implicit in the a|p|n layout. Forward is identity passthrough; when the hinge is active the backward writes dL/da=2(n-p), dL/dp=-2(a-p), dL/dn=2(a-n) into the three depth slices (zero otherwise). margin defaults to 1.0 (round-trips via Save/Load). Created with TNNetTripletLoss.Create() or TNNetTripletLoss.Create(margin). |
TNNetReshape |
1D, 2D, or 3D | Reshapes the input into a different dimension. |
TNNetExpandDims |
1D, 2D, or 3D | numpy-style single-axis shape helper. Lays the whole input out as a length-N = SizeX·SizeY·Depth vector along a chosen axis, forcing the other two axes to size 1: axis 0 → (N,1,1), axis 1 → (1,N,1), axis 2 → (1,1,N) (default). Element-count-preserving pure reshape (identity data/gradient flow); the exact inverse of TNNetSqueeze. Created with TNNetExpandDims.Create(Axis) (default axis 2). |
TNNetSqueeze |
1D, 2D, or 3D | numpy-style shape helper that collapses any (SizeX, SizeY, Depth) volume to the canonical compact depth vector (1, 1, N), removing unit spatial axes. Element-count-preserving pure reshape (identity data/gradient flow); inverts TNNetExpandDims. Less error-prone than open-coding TNNetReshape. TNNetSqueeze.Create() collapses all axes; TNNetSqueeze.Create(Axis) drops only the one specified unit axis (asserting the other two are size 1), the exact single-axis inverse of TNNetExpandDims(Axis). |
TNNetSum |
1D, 2D, or 3D | Sums the outputs from previous layers, useful for ResNet-style networks. |
TNNetFiLM |
1D, 2D, or 3D | Feature-wise Linear Modulation (Perez et al. 2018). A parameter-free two-input layer that conditions one branch on another: Out[x,y,c] = gamma[c]·feature[x,y,c] + beta[c], where the per-channel gamma/beta come from a separate conditioning branch (not the layer's own weights), so the modulation is input-dependent rather than a fixed affine. Input 0 is the feature map (SizeX, SizeY, Depth); input 1 is the conditioning vector (1, 1, 2·Depth) packed as gamma|beta (broadcast over space). Wire it with TNNetFiLM.Create([featureLayer, condLayer]). Backward routes error to both inputs (dgamma=Σ feature·dOut, dbeta=Σ dOut), so the conditioning sub-network trains end-to-end. With gamma=1, beta=0 it reproduces the feature map exactly. See the worked FiLM conditioning example. |
TNNetUpsample |
3D | Upsamples channels (depth) into spatial data, converting depth into spatial resolution. For example, a 128x128x256 activation map will be converted to 256x256x64. The number of channels is always divided by 4 while the resolution increases. |
TNNetPixelShuffle |
3D | Sub-pixel convolution (Shi et al. 2016). Parameter-free depth-to-space rearrangement with a configurable upscale factor r: input (W, H, C) with C mod (r*r) = 0 becomes (W*r, H*r, C / (r*r)). Created with TNNetPixelShuffle.Create(r) (default r=2). The backward pass is the exact inverse gather, so the layer round-trips cleanly. |
TNNetMaskedMean |
3D (output: (1, SizeY, D-1)) |
Mean over the SizeX (sequence) axis with the last input channel acting as a {0,1} validity mask. Positions where mask ≤ 0.5 are excluded from the average; rows whose mask is entirely zero produce a zero output and zero gradient. Parameter-free. |
TNNetMaskedMax |
3D (output: (1, SizeY, D-1)) |
Max over the SizeX (sequence) axis with the last input channel acting as a {0,1} validity mask. Masked-out positions are treated as -infinity; rows whose mask is entirely zero produce a zero output and zero gradient. Parameter-free. |
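Several rows in the table above describe loss heads by their gradient. As one worked case, the sketch below implements the pinball loss and the subgradient given in the TNNetQuantileLoss row, in plain Free Pascal without the library. The function names are made up for this illustration; only the formulas come from the table.

```pascal
program PinballLossSketch;
{$mode objfpc}{$H+}

// Pinball loss L_q(e) = max(q*e, (q-1)*e) with e = target - prediction.
function PinballLoss(Q, Target, Prediction: Double): Double;
var
  E: Double;
begin
  E := Target - Prediction;
  if Q * E > (Q - 1.0) * E then
    Result := Q * E
  else
    Result := (Q - 1.0) * E;
end;

// Subgradient of the loss with respect to the prediction:
// -q when under-predicting, (1-q) when over-predicting, 0 at the kink.
function PinballGrad(Q, Target, Prediction: Double): Double;
var
  E: Double;
begin
  E := Target - Prediction;
  if E > 0 then
    Result := -Q
  else if E < 0 then
    Result := 1.0 - Q
  else
    Result := 0.0;
end;

begin
  // q = 0.9 penalizes under-prediction nine times more than over-prediction.
  WriteLn('under: loss=', PinballLoss(0.9, 10.0, 8.0):0:2,
          ' grad=', PinballGrad(0.9, 10.0, 8.0):0:2);
  WriteLn('over:  loss=', PinballLoss(0.9, 10.0, 12.0):0:2,
          ' grad=', PinballGrad(0.9, 10.0, 12.0):0:2);
end.
```

With q = 0.9, missing low by 2 costs 1.8 while missing high by 2 costs 0.2, which is what pushes the prediction toward the 90th percentile.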
For contrastive / metric-learning models the goal is not to classify but to embed: map each input to a vector so that semantically similar inputs land close together and dissimilar ones far apart. The layers below compose into such a head:
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
TNNetL2Normalize |
1D, 2D, or 3D | Divides the input by its L2 norm so each embedding lives on the unit sphere (cosine geometry). The reduction axis is configurable: Create()/Create(axis=0) normalizes per-(x,y) over depth, Create(1) over the whole flattened sample (Keras "UnitNorm"), Create(2) per-channel; an optional eps (default 1e-8) stabilises the denominator. The exact backward Jacobian is applied. No trainable parameters. |
TNNetCosineSimilarity |
2D or 3D (output: (SizeX, SizeY, 1)) |
Splits the input depth into two equal halves a and b and produces the per-(x,y) scalar cos(a, b) = (a·b)/(‖a‖·‖b‖ + eps). Requires an even input depth >= 2. Useful as a Siamese/twin-tower similarity head. The exact cosine Jacobian is back-propagated. No trainable parameters. |
TNNetTripletLoss |
1D, 2D, or 3D | Triplet-margin metric-learning output head — splits the input depth into 3 equal anchor|positive|negative chunks and per spatial cell computes the hinge L = max(0, ‖a-p‖² - ‖a-n‖² + margin). There is no external target; supervision is implicit in the a|p|n layout. See the loss-head table above for the full gradient details. |
TNNetCosineEmbeddingLoss |
1D, 2D, or 3D | Pairwise cosine-embedding metric-learning output head (PyTorch CosineEmbeddingLoss family) — splits the input depth as a|b|y with Depth = 2*d + 1 (odd, >= 3). Per spatial cell, with cos = (a·b)/(‖a‖·‖b‖ + eps) and per-position label y (1 = similar, 0 = dissimilar), L = y·(1 - cos) + (1 - y)·max(0, cos - margin)². There is no external target; supervision is implicit in the a|b|y layout. Forward is identity passthrough; backward writes the analytic cosine gradient into the a/b channels and 0 into the y channel. margin defaults to 0.0 (must be in [-1, 1], round-trips via Save/Load). Created with TNNetCosineEmbeddingLoss.Create() or TNNetCosineEmbeddingLoss.Create(margin). |
TNNetInfoNCELoss |
1D, 2D, or 3D | InfoNCE / contrastive output head (SimCLR/CPC family) — splits the input depth into K+1 equal slabs of d channels each: a query q followed by K keys k_0..k_{K-1} where k_0 is the POSITIVE key and the rest are negatives, so Depth = d*(K+1). Per spatial cell, with dot-product similarity s_j = (q·k_j)/tau and p = softmax(s), L = -s_0 + logsumexp_j(s_j). There is no external target; supervision is implicit in the q|k_0|..|k_{K-1} layout. Forward is identity passthrough; backward writes the analytic gradients dL/dq = (1/tau)(Σ_j p_j·k_j - k_0), dL/dk_0 = (1/tau)(p_0-1)·q, dL/dk_j = (1/tau)·p_j·q (j>0). Embedding dim d is stored in FStruct[0] (>= 1) and temperature tau in FFloatSt[0] (default 0.07, must be > 0); both round-trip via Save/Load. SetPrevLayer validates Depth mod d = 0 and (Depth div d) >= 3. Created with TNNetInfoNCELoss.Create() or TNNetInfoNCELoss.Create(EmbeddingDim, Temperature). |
TNNetCenterLoss |
1D, 2D, or 3D | Center-loss output head (Wen et al. 2016) — a PENALTY head that pulls each feature toward a trainable per-class center, meant to be ADDED ALONGSIDE a separate classification head (it contributes only the center-pull gradient, NOT any softmax/cross-entropy term). Splits the input depth as x|y with Depth = d + 1 (>= 2): x are the d feature channels and the last channel holds the integer class label y. Per spatial cell, with active class c = round(y), L = (λ/2)·‖x - c_c‖². There is no external target; supervision is implicit in the x|y layout. The K class centers (each of dim d) are stored as K trainable neurons (one weight vector each) and serialize automatically. Forward is identity passthrough; backward writes the feature gradient dL/dx = λ·(x - c_c) into the feature channels (0 into the label channel) and accumulates the center-pull gradient (c_c - x) into the active center's neuron delta. NOTE: the per-sample gradient path cannot see other minibatch samples, so the paper's cross-batch EMA center update is out of scope; centers are learned by the optimizer like any weight. K is stored in FStruct[0] (default 2, >= 1) and λ in FFloatSt[0] (default 1.0, > 0); both round-trip via Save/Load. Created with TNNetCenterLoss.Create() or TNNetCenterLoss.Create(NumClasses, Lambda). |
TNNetVectorQuantizer |
1D, 2D, or 3D | VQ-VAE codebook bottleneck (van den Oord et al. 2017, "Neural Discrete Representation Learning"). Replaces each input feature VECTOR (the Depth-vector z_e at every spatial position) with its nearest entry z_q from a learnable codebook of K vectors (each of dim Input.Depth); output shape equals input shape. The K codebook vectors are stored as K trainable neurons (one Depth-length weight vector each) and serialize automatically. Forward picks the codebook index minimizing the squared-L2 distance to z_e and writes that code to the output. Backward uses the straight-through estimator (the output gradient flows to z_e unchanged), adds the commitment gradient 2·β·(z_e - z_q) to the input gradient, and accumulates the codebook-pull gradient 2·(z_q - z_e) into the chosen code's neuron delta (FBatchUpdate respected). K is stored in FStruct[0] (default 8, >= 1) and the commitment cost β in FFloatSt[0] (default 0.25, > 0); both round-trip via Save/Load. Created with TNNetVectorQuantizer.Create() or TNNetVectorQuantizer.Create(NumCodes, Commitment). |
TNNetArcFace |
1D, 2D, or 3D | ArcFace additive angular-margin softmax output head (Deng et al. 2019) — a SELF-CONTAINED softmax-cross-entropy head with a trainable per-class weight matrix. Splits the input depth as x|y with Depth = d + 1 (>= 2): x are the d embedding channels and the last channel holds the integer class label y. Both the embedding and each class weight W_k are L2-normalized, so cos(θ_k) = <x̂, Ŵ_k>. For the true class c = round(y) the additive angular margin m is applied: cos(θ'_c) = cos(θ_c)·cos(m) - sin(θ_c)·sin(m). With logits z_k = s·cos(θ_k) (z_c = s·cos(θ'_c)), L = -log(softmax(z)_c). There is no external target; supervision is implicit in the x|y layout. The K class weight vectors (each of dim d) are stored as K trainable neurons (one weight vector each) and serialize automatically. Forward is identity passthrough; backward writes dL/dx into the embedding channels (0 into the label channel) and accumulates dL/dW_k into each weight neuron's delta. NOTE: the per-sample gradient path cannot see other minibatch samples (standard for this framework's loss heads). K is stored in FStruct[0] (default 2, >= 1), margin m in FFloatSt[0] (default 0.5 rad, >= 0) and scale s in FFloatSt[1] (default 30.0, > 0); all round-trip via Save/Load. Created with TNNetArcFace.Create() or TNNetArcFace.Create(NumClasses, Margin, Scale). See examples/ArcFaceEmbedding for a margin sweep showing the angular margin tighten intra-class cosine clusters. |
TNNetEvidentialRegression |
1D, 2D, or 3D (Depth-vector of packed params) |
Deep Evidential Regression uncertainty head (Amini et al., NeurIPS 2020, arXiv:1910.02600) — a single-forward-pass regression head that reports BOTH aleatoric (data noise) and epistemic (model) uncertainty with no sampling and no ensemble. Per scalar target it emits the 4 parameters of a Normal-Inverse-Gamma (NIG) higher-order distribution: gamma (mean, linear), nu, alpha, beta. The previous layer must emit 4*D raw channels packed over Depth as [gamma | raw_nu | raw_alpha | raw_beta] per target; forward is an identity-style passthrough that applies the positivity links in place — gamma stays linear, softplus → nu>0 and beta>0, 1+softplus → alpha>1 — so Compute returns usable parameters. Like TNNetMixtureDensity it OWNS its loss: Backpropagate overwrites FOutputError with the EXACT dL/d{gamma,nu,alpha,beta} chained through the softplus links, where L is the NIG negative log-likelihood (a Student-t marginal, closed form via a self-contained LnGammaF/DigammaF) PLUS the paper's evidence regularizer `lambda* |
TNNetEvidentialClassification |
1D, 2D, or 3D (K-vector of class evidence) |
Evidential Deep Learning classification head (Sensoy et al., NeurIPS 2018, arXiv:1806.01768) — the classification sibling of TNNetEvidentialRegression. Treats the previous layer's K raw outputs as evidence for a Dirichlet over the K-class simplex: alpha_k = 1 + softplus(raw_k), strength S = sum_k alpha_k. Forward is an identity-style passthrough that applies the softplus link in place so Compute returns the concentration vector alpha. From one deterministic pass it reads off the mean class probabilities p_k = alpha_k/S and a single scalar uncertainty mass u = K/S in [0,1] (u→1 when all evidence vanishes = the network abstains) — no sampling, no ensemble. Like its regression sibling it OWNS its loss: Backpropagate overwrites FOutputError with the EXACT dL/d(raw_k) chained through softplus, where L is the EDL Bayes-risk expected-MSE (sum_k (y_k-p_k)^2 + p_k(1-p_k)/(S+1), Eq. 5) PLUS a lambda·KL-to-uniform regularizer on the misleading-evidence Dirichlet alpha~ = y + (1-y)*alpha (self-contained LnGammaF/DigammaF/TrigammaF). Inference helpers: Alpha(k), Prediction(k) = alpha_k/S, Uncertainty = K/S. Distinct from a plain TNNetSoftMax head (a softmax gives a point probability with no abstention signal) — EDL's u rises on out-of-distribution / ambiguous inputs. K round-trips via FStruct[0], lambda via FFloatSt[0]. Created with TNNetEvidentialClassification.Create() or TNNetEvidentialClassification.Create(NumClasses, Lambda). See examples/EvidentialClassification/ where u rises ~5–6× out of distribution. |
TNNetMixtureDensity |
1D, 2D, or 3D (Depth-vector of packed params) |
Mixture Density Network regression output head (Bishop 1994) — the first head that predicts a full conditional distribution (multi-modal, heteroscedastic) rather than a point estimate, so it can model one-to-many / inverse problems where an MSE head collapses to the conditional mean. The previous layer must emit K*(1 + 2*D) raw channels, packed over Depth as [K mixing logits | K*D means | K*D raw scales], parameterizing a K-component diagonal-Gaussian mixture over a D-dim target. Forward is an identity-style passthrough that transforms the params in place: softmax over the K mixing logits → pi, raw means untouched, softplus (ln(1+exp(s)), chosen over exp for robustness) on the scales → sigma, so Compute returns usable inference parameters. There is no external target beyond the D regression values: the framework seeds the target, this head reads y from the first D target channels and OWNS the negative-log-likelihood loss, with Backpropagate overwriting the whole FOutputError with the EXACT dNLL/d(raw param) using the numerically-stable log-sum-exp form over components (responsibilities gamma_k; dNLL/da_k = pi_k - gamma_k, dNLL/dmu = gamma_k*(mu-y)/sigma^2, dNLL/ds = gamma_k*(1/sigma - (y-mu)^2/sigma^3)*sigmoid(s)). Inference helper SampleMixture draws a D-dim sample (pick a component by its pi weights, then sample that diagonal Gaussian); MixtureNLL scores a target. K round-trips via FStruct[0] (default 2), D via FStruct[1] (default 1). Created with TNNetMixtureDensity.Create() or TNNetMixtureDensity.Create(NumComponents, TargetDim). NOTE: the head packs over the Depth axis, so the trunk must emit TNNetFullConnectLinear(1, 1, K*(1+2*D)) (not the SizeX-packing Create(N) form). See examples/MixtureDensity/ for the classic one-to-many inverse-map demo where the mixture recovers the multiple branches a plain MSE head averages into the gap. |
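To illustrate the triplet hinge and the gradients listed for TNNetTripletLoss, here is a stand-alone Free Pascal sketch for a single anchor/positive/negative triple. It is not the library's code: the TVec type, the function names, and the 2-dimensional example vectors are assumptions for this illustration.

```pascal
program TripletLossSketch;
{$mode objfpc}{$H+}

const
  D = 2; // embedding dimension used in this example

type
  TVec = array[0..D-1] of Double;

// Squared Euclidean distance between U and V.
function SqDist(const U, V: TVec): Double;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to D - 1 do
    Result := Result + (U[I] - V[I]) * (U[I] - V[I]);
end;

// Returns max(0, |a-p|^2 - |a-n|^2 + margin) and fills the three gradients.
// When the hinge is inactive all gradients are zero.
function TripletLoss(const A, P, N: TVec; Margin: Double;
  out GA, GP, GN: TVec): Double;
var
  I: Integer;
begin
  Result := SqDist(A, P) - SqDist(A, N) + Margin;
  for I := 0 to D - 1 do
    if Result > 0 then
    begin
      GA[I] := 2 * (N[I] - P[I]);
      GP[I] := -2 * (A[I] - P[I]);
      GN[I] := 2 * (A[I] - N[I]);
    end
    else
    begin
      GA[I] := 0;
      GP[I] := 0;
      GN[I] := 0;
    end;
  if Result < 0 then
    Result := 0;
end;

const
  A: TVec = (0.0, 0.0);
  P: TVec = (1.0, 0.0);
  N: TVec = (0.0, 1.0);
var
  GA, GP, GN: TVec;
  Loss: Double;
begin
  Loss := TripletLoss(A, P, N, 1.0, GA, GP, GN);
  WriteLn('loss  = ', Loss:0:2);
  WriteLn('dL/da = (', GA[0]:0:1, ', ', GA[1]:0:1, ')');
  WriteLn('dL/dp = (', GP[0]:0:1, ', ', GP[1]:0:1, ')');
  WriteLn('dL/dn = (', GN[0]:0:1, ', ', GN[1]:0:1, ')');
end.
```

In this example the positive and the negative are equally far from the anchor, so the loss equals the margin, and the gradients move the positive toward the anchor and the negative away from it under gradient descent.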
How to build a contrastive / metric-learning head. Build a small embedding sub-net (an MLP or conv trunk) that maps the input to an embed_dim vector, end it with TNNetL2Normalize so embeddings live on the unit sphere, then train it with TNNetTripletLoss. Because the triplet head takes no external target — supervision is implicit in its anchor|positive|negative depth layout — you feed it three embeddings at once. The cleanest fully-native way is a weight-shared siamese net: feed the triplet as three spatial positions, embed each with pointwise (featuresize=1) layers so the same weights apply to all three, TNNetL2Normalize, then TNNetReshape(1, 1, 3*embed_dim) to lay the three embeddings out as the a|p|n depth chunks the loss head consumes. At inference, drop the loss head and read the embedding directly; use TNNetCosineSimilarity (or a plain dot product on unit-norm vectors) to score pairs. See the worked Triplet embedding example.
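The weight-shared recipe in the paragraph above could be assembled roughly like this. Treat it as a sketch under assumptions: InputDim, EmbedDim, and the hidden width are hypothetical values, the pointwise layers are written as ordinary convolutions with FeatureSize = 1, and the constructor argument orders should be checked against the neuralnetwork unit.

```pascal
program TripletHeadSketch;
{$mode objfpc}{$H+}

uses
  neuralnetwork;

const
  InputDim = 16; // hypothetical number of raw features per item
  EmbedDim = 8;  // hypothetical embedding size

var
  NN: TNNet;
begin
  NN := TNNet.Create();
  try
    // Anchor, positive and negative sit at spatial positions X = 0, 1, 2.
    NN.AddLayer(TNNetInput.Create(3, 1, InputDim));
    // FeatureSize = 1 convolutions apply the same weights at every position,
    // which gives the weight sharing of a siamese network.
    NN.AddLayer(TNNetConvolutionReLU.Create(32, 1, 0, 1));
    NN.AddLayer(TNNetConvolutionLinear.Create(EmbedDim, 1, 0, 1));
    // Default axis: normalize each position over its depth vector.
    NN.AddLayer(TNNetL2Normalize.Create());
    // Lay the three embeddings out as a|p|n depth chunks.
    NN.AddLayer(TNNetReshape.Create(1, 1, 3 * EmbedDim));
    NN.AddLayer(TNNetTripletLoss.Create(1.0));
    NN.DebugStructure();
  finally
    NN.Free;
  end;
end.
```

For inference you would build the same trunk without the last two layers and read the normalized embedding directly, as the paragraph describes.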
TNNetSplitChannels and TNNetSplitChannelEvery are specialized layer types in the CAI Neural API that allow for selective channel manipulation within neural networks.
- TNNetSplitChannels: This layer is designed to pick or split selected channels from the previous layer. It provides fine-grained control over which specific channels are passed on to subsequent layers in the network.
  Key features:
  - It can be created with a specific range of channels (ChannelStart and ChannelLen) or with an array of specific channel indices.

  Potential uses:
  - Feature selection: Allowing the network to focus on specific features represented by certain channels.
  - Creating multiple parallel paths in the network that process different subsets of the input channels.
  - Implementing attention-like mechanisms by selectively passing certain channels forward.
- TNNetSplitChannelEvery: This layer is a specialized version of TNNetSplitChannels. It splits channels at regular intervals.
  Potential uses:
  - Creating regular patterns of channel selection throughout the network.
  - Implementing a form of grouped convolutions or channel-wise operations.
  - Reducing the computational load by consistently selecting a subset of channels at regular intervals.
Both these layers offer powerful tools for manipulating the flow of information through the network's channels. They allow for the creation of more complex and efficient network architectures by providing fine control over which features (represented by channels) are processed in different parts of the network.
These layers could be particularly useful in scenarios where:
- You want to reduce the computational complexity of your model by focusing on the most important channels.
- You're designing a network with multiple parallel paths, each operating on different subsets of the input features.
- You're implementing custom attention mechanisms or feature selection techniques within your network.
Picking the right channel-select layer. Three closely related layers select depth channels — pick by the shape of the selection:
- TNNetGather(Channel) selects a single channel (Output[x,y,0] := Input[x,y,Channel], output depth 1). This is the degenerate one-index case.
- TNNetSplitChannels selects a contiguous range (Create(ChannelStart, ChannelLen)) or an explicit list, and is the right tool for plain slicing / parallel-path splits.
- TNNetGatherChannels([i0, i1, ...]) selects an arbitrary, ordered, possibly-repeated index list (Output[x,y,k] := Input[x,y,Channels[k]], output depth = list length), so it doubles as a parameter-free channel reorder / prune / duplicate. Add it via the convenience builder TNNet.AddGatherChannels([...]). Repeats are allowed; backward accumulates the duplicated output errors onto the shared source channel. See the runnable examples/GatherChannelsRouting/ demo.
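
The gather rule and its backward accumulation can be checked on a single spatial cell. This is a plain-Python sketch of the formulas quoted above, not the library code; the function names are hypothetical and a "cell" here is simply the depth vector at one (x, y) position.

```python
def gather_channels_forward(cell, channels):
    # Output[k] := Input[Channels[k]]; output depth equals len(channels).
    return [cell[c] for c in channels]

def gather_channels_backward(output_error, channels, input_depth):
    # Each output error flows back to its source channel; repeated
    # indices accumulate onto the same source channel.
    input_error = [0.0] * input_depth
    for k, c in enumerate(channels):
        input_error[c] += output_error[k]
    return input_error

cell = [10.0, 20.0, 30.0, 40.0]
channels = [3, 0, 0]  # reorder, prune channels 1 and 2, duplicate channel 0

gathered = gather_channels_forward(cell, channels)
input_error = gather_channels_backward([1.0, 2.0, 5.0], channels, len(cell))
```

Channels that are never selected (1 and 2 in this sketch) receive a zero error, which is what makes the layer usable as a pruning step.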
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetSplitChannels | 2D or 3D | Splits or copies channels from the input. This layer allows getting a subset of the input channels. |
| TNNetSplitChannelEvery | 2D or 3D | Splits channels from the input every few channels. For example, this layer allows getting half (GetChannelEvery=2) or a third (GetChannelEvery=3) of the input channels. |
| TNNetInterleaveChannels | 2D or 3D | If you're using grouped convolutions in your network, TNNetInterleaveChannels could be particularly useful. It can help mix information between groups, allowing for more interaction between different feature groups. |
| TNNetCumSum | 2D or 3D | Parameter-free cumulative sum along a configurable axis. TNNetCumSum.Create defaults to the depth axis (Output[x, y, c] = sum_{k=0..c} Input[x, y, k]); TNNetCumSum.Create(Axis) selects 0 = X, 1 = Y, or 2 = Depth. Output shape equals input shape. Useful as a learned linear position feature on a constant input. |
| TNNetRoll | 2D or 3D | Circular shift by Shift (integer, can be negative) along a selectable axis: TNNetRoll.Create(Shift) rolls the depth axis (default), TNNetRoll.Create(Shift, Axis) selects the axis (Axis 0 = X, 1 = Y, 2 = Depth). E.g. depth: Output[x, y, c] = Input[x, y, (c - Shift) mod Depth]. Parameter-free deterministic permutation; Create(K, a) followed by Create(-K, a) round-trips to the identity. Legacy depth-roll serializations load unchanged. |
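
The TNNetCumSum and TNNetRoll formulas in the table can be sanity-checked with a small plain-Python sketch along the depth axis. The function names are hypothetical and the code only mirrors the indexing rules quoted in the table; it is not the library implementation.

```python
from itertools import accumulate

def cumsum_depth(cell):
    # Output[c] = sum_{k=0..c} Input[k]
    return list(accumulate(cell))

def roll_depth(cell, shift):
    # Output[c] = Input[(c - shift) mod Depth]; Python's % is non-negative
    # for a positive modulus, so negative shifts work unchanged.
    depth = len(cell)
    return [cell[(c - shift) % depth] for c in range(depth)]

cell = [1.0, 2.0, 3.0, 4.0]

summed = cumsum_depth(cell)
rolled_right = roll_depth(cell, 1)
rolled_left = roll_depth(cell, -1)
round_trip = roll_depth(roll_depth(cell, 3), -3)
```

The round trip illustrates the table's claim that a roll by K followed by a roll by -K on the same axis is the identity.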
The layers TNNetTransposeXD and TNNetTransposeYD are specialized layer types in the CAI Neural API that perform specific transposition operations on the input data. These transposition operations can be particularly useful in various neural network architectures and data processing pipelines:
- Reshaping Data: they allow for flexible reshaping of data between different network layers, which can be crucial for certain model designs.
- Feature Manipulation: by swapping spatial and depth dimensions, these layers can help in reorganizing feature representations, which might be beneficial for subsequent processing steps.
- Dimension Reduction or Expansion: depending on the input shape, these transpositions can effectively reduce or expand certain dimensions, potentially helping in compressing or expanding feature representations.
- Adapting to Different Input Formats: these layers can be useful when dealing with data that comes in different formats or when interfacing between different parts of a neural network that expect data in specific shapes.
- Custom Architecture Designs: they provide flexibility in designing custom neural network architectures that may require unconventional data flows between layers.
These layers are implemented with both forward (Compute) and backward (Backpropagate) methods, indicating that they are fully integrated into the network's training process and can be used in the middle of a network, not just as preprocessing steps. This can be particularly valuable for researchers and practitioners working on novel network designs or dealing with unconventional data structures.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetTransposeXD | 2D or 3D | It transposes the X and Depth axes of the input data. It swaps the spatial dimension along the width (X-axis) with the channel or feature dimension (Depth axis). |
| TNNetTransposeYD | 2D or 3D | It transposes the Y and Depth axes of the input data. It swaps the spatial dimension along the height (Y-axis) with the channel or feature dimension (Depth axis). |
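
As an illustration of the X/Depth swap, here is a plain-Python sketch on a nested-list volume indexed as volume[x][y][d]. It assumes that "transpose" means Output[d, y, x] = Input[x, y, d], so a (SizeX, SizeY, Depth) input becomes a (Depth, SizeY, SizeX) output; the function name and the nested-list layout are hypothetical and say nothing about how the library stores volumes.

```python
def transpose_xd(volume):
    # Assumed semantics: out[d][y][x] = volume[x][y][d]
    size_x = len(volume)
    size_y = len(volume[0])
    depth = len(volume[0][0])
    return [[[volume[x][y][d] for x in range(size_x)]
             for y in range(size_y)]
            for d in range(depth)]

# A 2 x 1 x 3 volume: two positions along X, three channels each.
volume = [[[1, 2, 3]],
          [[4, 5, 6]]]

swapped = transpose_xd(volume)      # 3 x 1 x 2
restored = transpose_xd(swapped)    # back to 2 x 1 x 3
```

Under this assumption the operation is its own inverse, which is also why the backward pass of such a layer can reuse the same index swap on the error volume.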
Activation functions are a fundamental component of neural networks. These functions play several crucial roles in neural networks:
- Introducing non-linearity: this allows the network to model complex, non-linear relationships in data.
- Bounding outputs: many activation functions (sigmoid and tanh, for example) map inputs to a fixed range, which keeps activation magnitudes bounded as they pass through the network.
- Representing features: different activation functions can help capture different types of patterns or features in the data.

The choice of activation function can significantly impact the performance and learning capabilities of a neural network, and different problems may benefit from different activation functions.
The CAI Neural API supports the activation functions listed in the table below:
| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetReLU | 1D, 2D, or 3D | ReLU | Applies the ReLU activation function. |
| TNNetReLU6 | 1D, 2D, or 3D | ReLU6 | ReLU activation clipped at 6. |
| TNNetReLUL | 1D, 2D, or 3D | ReLUL | Leaky version of ReLU. |
| TNNetLeakyReLU | 1D, 2D, or 3D | Leaky ReLU | Applies a leaky ReLU activation function. |
| TNNetPReLUChannel | 1D, 2D, or 3D | PReLU/channel | Per-channel Parametric ReLU (He et al. 2015): y = x if x >= 0 else alpha[c] * x, with one learnable alpha per depth channel (initialized to 0.25). Created with TNNetPReLUChannel.Create(). |
| TNNetVeryLeakyReLU | 1D, 2D, or 3D | Very Leaky ReLU | Applies a very leaky ReLU activation function. |
| TNNetRReLU | 1D, 2D, or 3D | Randomized Leaky ReLU | Randomized Leaky ReLU (Xu et al. 2015): y = x if x >= 0 else a * x. During training (Enabled = True, the default) the negative slope a is sampled uniformly from [lower, upper] once per forward pass; at inference (Enabled = False) the fixed average slope (lower + upper)/2 is used. Created with TNNetRReLU.Create() (defaults lower = 1/8, upper = 1/3) or TNNetRReLU.Create(lower, upper). |
| TNNetReLUSqrt | 1D, 2D, or 3D | ReLU Sqrt | ReLU activation function with square root scaling. |
| TNNetSquaredReLU | 1D, 2D, or 3D | Squared ReLU | Squared ReLU activation: relu(x)^2. From the Primer paper (https://arxiv.org/abs/2109.08668). Created with TNNetSquaredReLU.Create(). |
| TNNetShiftedReLU | 1D, 2D, or 3D | Shifted ReLU | Parameter-free ReLU variant y = max(-1, x) allowing a small negative range without saturating. Created with TNNetShiftedReLU.Create(). |
| TNNetThreshold | 1D, 2D, or 3D | Threshold | Threshold activation: y = x if x > theta else value. Generalizes ReLU; useful as a sparsifier when theta > 0. Created with TNNetThreshold.Create(theta, value) (both default to 0). |
| TNNetTopK | 1D, 2D, or 3D | TopK | Per spatial cell, keep the K largest activations along the depth axis and zero the rest. Gradient flows only through kept positions. Created with TNNetTopK.Create(K). |
| TNNetHardConcrete | 1D, 2D, or 3D | HardConcrete | Learnable L0-sparsity gate (Louizos et al. 2018): a per-depth-channel multiplicative gate z in [0,1] whose log_alpha is trained, so a fraction of channels are pruned to exactly 0. Stochastic hard-concrete reparameterization during training (gate enabled), deterministic gate clip(sigmoid(log_alpha)*(zeta-gamma)+gamma,0,1) at inference. Created with TNNetHardConcrete.Create(beta, gamma, zeta) (paper defaults 2/3, -0.1, 1.1). See examples/HardConcreteSparsity/. |
| TNNetLogSigmoid | 1D, 2D, or 3D | LogSigmoid | Stable log-sigmoid activation: y = log(sigmoid(x)) = -softplus(-x). Pairs with binary cross-entropy with logits. Created with TNNetLogSigmoid.Create(). |
| TNNetSoftPlus | 1D, 2D, or 3D | SoftPlus | SoftPlus activation, a smooth approximation of ReLU: ln(1 + exp(x)). Created with TNNetSoftPlus.Create(). |
| TNNetSoftPlusBeta | 1D, 2D, or 3D | SoftPlusBeta | Generalized SoftPlus with a fixed sharpness beta: y = (1/beta)·ln(1 + exp(beta·x)), derivative sigmoid(beta·x). Numerically stable for large beta·x. beta = 1 recovers TNNetSoftPlus. Created with TNNetSoftPlusBeta.Create(beta) (default 1.0). |
| TNNetSoftExponential | 1D, 2D, or 3D | SoftExponential | Godfrey & Gashler parametric activation with a fixed alpha: -ln(1 - alpha·(x + alpha))/alpha for alpha < 0, identity for alpha = 0, (exp(alpha·x) - 1)/alpha + alpha for alpha > 0. Created with TNNetSoftExponential.Create(alpha) (default 0.0 = identity). |
| TNNetSerf | 1D, 2D, or 3D | Serf | Search-of-erf activation: y = x * erf(softplus(x)). Smooth Mish-like drop-in (https://arxiv.org/abs/2108.09598). Created with TNNetSerf.Create(). |
| TNNetErf | 1D, 2D, or 3D | Erf | Gauss error function activation: y = erf(x). Closed-form GELU partner with derivative (2/sqrt(pi)) * exp(-x^2). Reuses the Abramowitz–Stegun 7.1.26 polynomial helper that powers TNNetSerf (FPC's math unit does not export erf). Created with TNNetErf.Create(). |
| TNNetSwishLearnable | 1D, 2D, or 3D | SwishL | Swish with a single learnable scalar beta, initialised to 1.0 (starts identical to TNNetSwish). Forward y = x * sigmoid(beta*x); backward updates both input gradient and beta (Ramachandran et al. 2017, https://arxiv.org/abs/1710.05941). Created with TNNetSwishLearnable.Create(). |
| TNNetMishLearnable | 1D, 2D, or 3D | MishL | Mish with a single learnable inner-scale alpha, initialised to 1.0 (starts identical to TNNetMish). Forward y = x * tanh(softplus(alpha*x)); backward updates both the input gradient and alpha. Sibling of TNNetSwishLearnable. Created with TNNetMishLearnable.Create() or TNNetMishLearnable.Create(alpha). |
| TNNetAconC | 1D, 2D, or 3D | ACON-C | "Activate Or Not" (Ma et al. 2021), a learnable generalization of Swish: y = (p1-p2)·x·sigmoid(beta·(p1-p2)·x) + p2·x with one learnable triple (p1, p2, beta) per depth channel, initialised to (1, 0, 1) so an untrained layer is exactly TNNetSwish. Backward updates the input gradient and all three per-channel parameters. Created with TNNetAconC.Create(). |
| TNNetSReLU | 1D, 2D, or 3D | S-shaped ReLU | S-shaped ReLU (Jin et al. 2016): a continuous piecewise-linear activation with four learnable parameters per depth channel: right knee (t_r, a_r) and left knee (t_l, a_l). y = t_r + a_r·(x - t_r) for x >= t_r, y = t_l + a_l·(x - t_l) for x <= t_l, else y = x. Initialised to (t_r, a_r, t_l, a_l) = (0, 1, 0, 0) so an untrained layer is exactly TNNetReLU (set a_l = 0.01 for a leaky start). Backward updates the input gradient and all four per-channel parameters. Created with TNNetSReLU.Create() or TNNetSReLU.Create(t_r, a_r, t_l, a_l). |
| TNNetAPL | 1D, 2D, or 3D | APL | Adaptive Piecewise Linear unit (Agostinelli et al. 2015, https://arxiv.org/abs/1412.6830): h(x) = max(0, x) + Σ_{s=1..S} a[s,c]·max(0, -x + b[s,c]) with S learnable hinges (default 2) per depth channel, each having a slope a[s,c] and a knee b[s,c] (2·S·Depth learnable scalars total). Initialised with slopes a = 0.25 and knees spread over [0,1]. Backward updates the input gradient and all per-channel slopes and knees. Created with TNNetAPL.Create() or TNNetAPL.Create(NumHinges). |
| TNNetSplineActivation | 1D, 2D, or 3D | Spline | KAN-flavored (Kolmogorov-Arnold) per-channel learnable piecewise-linear activation: K+1 learnable control-point values y[0..K,c] at K+1 FIXED, evenly-spaced knots over [-Range, +Range], linearly interpolated (and linearly extrapolated beyond the end knots). (K+1)·Depth learnable scalars total; only the values are trained, the knots are fixed. Initialised to the identity (y[i,c] = t[i]) so an untrained layer is exactly y = x everywhere. Backward updates the input gradient (the local segment slope) and the two bracketing control points. Created with TNNetSplineActivation.Create() (K=4 intervals, Range=2.0) or TNNetSplineActivation.Create(NumIntervals, Range). |
| TNNetMetaAconC | 1D, 2D, or 3D | Meta-ACON | Data-dependent-beta sibling of TNNetAconC (Ma et al. 2021): the ACON-C switch beta[c] is generated from a spatial squeeze of the input, beta[c] = sigmoid(gamma[c]·mean_spatial(x_c) + delta[c]), so the activation adapts per sample (vs TNNetAconC's static learned beta). Four learnable per-channel parameters (p1, p2, gamma, delta); backward carries the extra gradient path through the squeeze mean. Uses a per-channel affine-over-squeeze as a tractable in-pattern simplification of the paper's cross-channel bottleneck. Created with TNNetMetaAconC.Create(). |
| TNNetSoftPlusBetaLearnable | 1D, 2D, or 3D | SoftPlusBetaL | Learnable-beta variant of TNNetSoftPlusBeta: y = (1/beta)·ln(1 + exp(beta·x)) with a single learnable beta (default 1.0), derivative sigmoid(beta·x). Backward updates both the input gradient and beta. Created with TNNetSoftPlusBetaLearnable.Create() or TNNetSoftPlusBetaLearnable.Create(beta). |
| TNNetPhish | 1D, 2D, or 3D | Phish | Phish activation: y = x * tanh(gelu(x)), with GELU computed via the tanh approximation (Naveen, 2022, https://arxiv.org/abs/2208.04458). Smooth Mish/Serf sibling. Created with TNNetPhish.Create(). |
| TNNetISRU | 1D, 2D, or 3D | ISRU | Inverse Square Root Unit: y = x / sqrt(1 + alpha * x^2). Everywhere smooth, derivative 1 / (1 + alpha*x^2)^(3/2) (Carlile et al., 2017, https://arxiv.org/abs/1710.09967). Created with TNNetISRU.Create() or TNNetISRU.Create(alpha) (default alpha = 1.0, must be > 0). |
| TNNetISRLU | 1D, 2D, or 3D | ISRLU | Inverse Square Root Linear Unit: y = x for x >= 0, y = x / sqrt(1 + alpha * x^2) for x < 0 (Carlile et al., 2017). Identity-on-the-right sibling of ISRU. Created with TNNetISRLU.Create() or TNNetISRLU.Create(alpha). |
| TNNetTanhExp | 1D, 2D, or 3D | TanhExp | TanhExp activation: y = x * tanh(exp(x)). Smooth, high-convergence ReLU alternative (https://arxiv.org/abs/2003.09855). Created with TNNetTanhExp.Create(). |
| TNNetBentIdentity | 1D, 2D, or 3D | BentIdentity | Bent Identity activation: y = (sqrt(x^2 + 1) - 1)/2 + x. Smooth, with always-positive slope. Created with TNNetBentIdentity.Create(). |
| TNNetLisht | 1D, 2D, or 3D | LiSHT | Linearly Scaled Hyperbolic Tangent: y = x * tanh(x). Non-monotonic smooth ReLU alternative. Created with TNNetLisht.Create(). |
| TNNetGaussianActivation | 1D, 2D, or 3D | Gaussian | Gaussian activation: exp(-x^2). Created with TNNetGaussianActivation.Create(). |
| TNNetSign | 1D, 2D, or 3D | Sign | Sign activation: y = sign(x). Saturated straight-through-estimator backward (gradient passes through only on ` |
| TNNetSqrt | 1D, 2D, or 3D | Sqrt | Eps-clamped square root: y = sqrt(max(x, 1e-6)). Created with TNNetSqrt.Create(). |
| TNNetExp | 1D, 2D, or 3D | Exp | Overflow-clamped exponential: y = exp(min(x, 30)). Created with TNNetExp.Create(). |
| TNNetLog | 1D, 2D, or 3D | Log | Eps-clamped natural log: y = ln(max(x, 1e-8)). Created with TNNetLog.Create(). |
| TNNetReciprocal | 1D, 2D, or 3D | Reciprocal | Eps-clamped reciprocal: `y = 1/(sign(x) * max( |
| TNNetSELU | 1D, 2D, or 3D | SELU | Self-normalizing activation function. |
| TNNetSigmoid | 1D, 2D, or 3D | Sigmoid | Sigmoid activation function. |
| TNNetSoftMax | 1D, 2D, or 3D | SoftMax | SoftMax activation function. |
| TNNetCenteredSoftmax | 1D, 2D, or 3D | C SoftMax | SoftMax preceded by per-sample mean subtraction. Mathematically equivalent to TNNetSoftMax (softmax is shift-invariant) so the input gradient is identical; differs only in the numerical-stability profile of the forward exp. Drop-in replacement when extreme input magnitudes risk overflow. Created with TNNetCenteredSoftmax.Create(). |
| TNNetEntropyRegularizer | 1D, 2D, or 3D | Passthrough | Identity forward; backward injects an extra lambda * (ln(p + 1e-7) + 1) gradient that corresponds to adding -lambda * H(p) to the loss. Place right after a softmax: lambda > 0 encourages confident (low-entropy) outputs; lambda < 0 encourages uniform ones. Created with TNNetEntropyRegularizer.Create(lambda) (default lambda = 0.01). |
| TNNetGradientReversal | 1D, 2D, or 3D | Passthrough | Identity forward; backward multiplies the upstream gradient by -lambda (Ganin et al. 2015, https://arxiv.org/abs/1505.07818). Used as the hinge between a shared feature trunk and an adversarial domain-classifier head in Domain-Adversarial Neural Networks (DANN), so the trunk is steered toward features the adversary cannot exploit. Created with TNNetGradientReversal.Create(lambda) (default lambda = 1.0). |
| TNNetCoordConv | 2D or 3D | + 2 channels | Parameter-free CoordConv (Liu et al. 2018, https://arxiv.org/abs/1807.03247). Concatenates two normalized X/Y coordinate channels ((2*x/(SizeX-1)) - 1 and (2*y/(SizeY-1)) - 1, both in [-1, 1]; 0 when the corresponding axis has size 1) to the input on the depth axis. Output shape is (SizeX, SizeY, Depth + 2). The coordinate channels carry no gradient: backward forwards only the first Depth error channels to the previous layer. Placing CoordConv immediately before a convolution gives that convolution direct access to absolute (x, y) position. Created with TNNetCoordConv.Create(). |
| TNNetSoftMaxOne | 1D, 2D, or 3D | SoftMaxOne | "Off by one" softmax: y_i = exp(x_i) / (1 + sum_j exp(x_j)) (Miller, 2023). Outputs do NOT sum to 1; the leftover mass lets attention attend to nothing without an explicit sink token. Numerically-stable max-shift forward; full softmax-Jacobian backward. Created with TNNetSoftMaxOne.Create(). |
| TNNetGumbelSoftmax | 1D, 2D, or 3D | Gumbel SoftMax | Differentiable categorical sampling (Jang et al. 2016 / Maddison et al. 2016): y = softmax((logits + g) / tau) with g = -ln(-ln(U)), U ~ Uniform(0,1). The Gumbel noise is added only while training (the layer descends TNNetAddNoiseBase, so EnableDropouts(true) turns it on); inference is the deterministic softmax(logits / tau). Lower tau sharpens toward a one-hot draw. In hard mode the forward output is the one-hot argmax while the backward uses the soft sample's exact softmax-Jacobian (times 1/tau) as a straight-through estimator. Created with TNNetGumbelSoftmax.Create() (tau = 1.0, soft) or TNNetGumbelSoftmax.Create(tau, hard). |
| TNNetSparsemax | 1D, 2D, or 3D | Sparsemax | Euclidean projection onto the probability simplex (Martins & Astudillo, 2016, https://arxiv.org/abs/1602.02068), applied per spatial (x, y) over the depth axis (same scope as TNNetPointwiseSoftMax). Forward sorts the depth vector descending to find the support size k, then writes p[i] = max(0, z[i] - tau) where tau = (sum(z_sorted[0..k-1]) - 1) / k. Outputs sum to 1 and contain TRUE zeros outside the support, a natural drop-in for sparse attention. Backward is the JVP through the support set: grad_z[i] = grad_p[i] - mean_{j in S}(grad_p[j]) for i in S, 0 otherwise. Created with TNNetSparsemax.Create(). |
| TNNetPointwiseSoftMax | 2D or 3D | 1x1 SoftMax | Pointwise (1x1) SoftMax activation function. |
| TNNetSinkhorn | 2D (N x 1 x N) | None | Differentiable optimal-transport / doubly-stochastic normalization (Mena et al. 2018, Learning Latent Permutations with Gumbel-Sinkhorn Networks, arXiv:1802.08665). Where TNNetSoftMax/TNNetSparsemax/entmax normalize ONE axis, this is the first layer that normalizes a square (N,1,N) score matrix to be doubly stochastic (every row AND every column sums to 1). Forward iterates Sinkhorn–Knopp in log-space for stability: starting from score/tau it runs KIter alternating row then column subtract-logsumexp normalizations, then exp. Because doubly-stochastic matrices are the convex hull of permutation matrices, as temperature tau → 0 the output sharpens toward a hard permutation, making a permutation a smooth differentiable function of a score matrix: the building block for differentiable sorting / learnable permutations / soft bipartite matching (the soft, trainable relaxation of the hard non-differentiable Hungarian/auction assignment). No trainable parameters. Backward unrolls all 2·KIter steps (each caches its input log-matrix) and applies the exact softmax-style adjoint of each subtract-logsumexp, then the trailing exp and 1/tau factors; the input gradient is numerically gradient-checked (max-abs-err ≈1.7e-5). KIter/tau round-trip via FStruct[0]/FFloatSt[0]; SetTau allows annealing across training. Created with TNNetSinkhorn.Create(KIter=20, tau=1.0). See examples/SinkhornSort/. |
| TNNetPointwiseNorm | 2D or 3D | 1x1 Norm | Pointwise (1x1) normalization. |
| TNNet.AddGroupedPointwiseSoftMax | 2D or 3D | Gr 1x1 Norm | Grouped pointwise (1x1) SoftMax. |
| TNNetSwish | 1D, 2D, or 3D | Swish | Swish activation function. |
| TNNetSwish6 | 1D, 2D, or 3D | Swish 6 | Swish activation clipped at 6. |
| TNNetHardSwish | 1D, 2D, or 3D | Hard Swish | Hard version of Swish activation. |
| TNNetESwish | 1D, 2D, or 3D | ESwish | Beta-generalized Swish: y = beta * x * sigmoid(beta * x). Created with TNNetESwish.Create(beta) (default beta = 1.25). |
| TNNetHyperbolicTangent | 1D, 2D, or 3D | tanh | Hyperbolic tangent activation function. |
| TNNetLeCunTanh | 1D, 2D, or 3D | LeCunTanh | LeCun scaled tanh: y = 1.7159 * tanh((2/3) * x), tuned so f(+/-1) ~= +/-1 (LeCun et al., "Efficient Backprop", 1998). Created with TNNetLeCunTanh.Create(). |
| TNNetSinhAct | 1D, 2D, or 3D | Sinh | Hyperbolic sine activation: y = sinh(x), derivative cosh(x). Unbounded; use only with bounded inputs. Created with TNNetSinhAct.Create(). |
| TNNetArcSinh | 1D, 2D, or 3D | ArcSinh | Inverse hyperbolic sine activation: y = arcsinh(x) = ln(x + sqrt(x^2 + 1)), derivative 1/sqrt(x^2 + 1). Monotonic, smooth, never saturates. Created with TNNetArcSinh.Create(). |
| TNNetLogCoshActivation | 1D, 2D, or 3D | LogCosh | Log-Cosh activation: y = log(cosh(x)), derivative tanh(x). Smooth-L1 style; behaves like x^2/2 near zero and like ` |
| TNNetSin | 1D, 2D, or 3D | Sin | Periodic activation: y = sin(x). Useful as a SIREN-style coordinate activation. Created with TNNetSin.Create(). |
| TNNetCos | 1D, 2D, or 3D | Cos | Periodic activation: y = cos(x). Phase-shifted partner to TNNetSin. Created with TNNetCos.Create(). |
| TNNetSinc | 1D, 2D, or 3D | Sinc | Unnormalized sinc activation: y = sin(x)/x, with analytic limit y = 1 at x = 0. Created with TNNetSinc.Create(). |
| TNNetPower | 1D, 2D, or 3D | Power | Applies a power activation function. |
| TNNetMulByConstant | 1D, 2D, or 3D | * C | Multiplies the output by a constant. |
| TNNetNegate | 1D, 2D, or 3D | * -1 | Multiplies the previous output by -1. |
| TNNetSignedSquareRoot | 1D, 2D, or 3D | SSR | Square root of the input absolute value preserving the original sign: y = Sign(x) * Sqrt(Abs(x)). |
| TNNetSignedSquareRoot1 | 1D, 2D, or 3D | SSR1 | If Abs(x) < 1 then y = x, otherwise, y = Sign(x) * Sqrt(Abs(x)). |
| TNNetSignedSquareRootN | 1D, 2D, or 3D | SSRN | If Abs(x) < N then y = x, otherwise, y = Sign(x) * Sqrt(Abs(x)-N+1)+N-1. |
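
Of the rows above, TNNetSparsemax describes the most algorithmic forward and backward pass, so a worked reference may help when reading its description. The following plain-Python sketch follows the rules quoted in the table (sort descending, find the support size k, threshold by tau; backward subtracts the support mean) for a single depth vector. The function names are hypothetical and the support test `1 + i * z_sorted[i-1] > cumulative sum` is the standard criterion from Martins & Astudillo (2016); it is not taken from the Pascal source.

```python
def sparsemax(z):
    # Sort descending and find the largest k with 1 + k*z_(k) > sum_{j<=k} z_(j).
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for i, value in enumerate(z_sorted, start=1):
        cumulative += value
        if 1.0 + i * value > cumulative:
            tau = (cumulative - 1.0) / i  # threshold for support size i
    return [max(0.0, v - tau) for v in z]

def sparsemax_backward(p, grad_p):
    # grad_z[i] = grad_p[i] - mean_{j in S}(grad_p[j]) for i in the support S, else 0.
    support = [i for i, v in enumerate(p) if v > 0.0]
    mean = sum(grad_p[i] for i in support) / len(support)
    return [grad_p[i] - mean if p[i] > 0.0 else 0.0 for i in range(len(p))]

p_peaked = sparsemax([2.0, 1.0, 0.1])   # one dominant logit
p_mixed = sparsemax([1.0, 0.8, 0.1])    # two logits share the mass
grad_z = sparsemax_backward(p_mixed, [1.0, 3.0, 10.0])
```

Unlike softmax, entries outside the support are exactly zero, and they also receive exactly zero gradient.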
Gated Linear Units split the input along the channel (depth) axis into two equal halves A and B, and output A multiplied by a gating activation applied to B. The output depth is therefore half of the input depth, and the input depth must be even. These layers have no trainable parameters. They are commonly used inside transformer feed-forward blocks (https://arxiv.org/abs/2002.05202).
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetGLU | 1D, 2D, or 3D (even depth) | Gated Linear Unit: outputs A * sigmoid(B) (https://arxiv.org/abs/1612.08083). Created with TNNetGLU.Create(). |
| TNNetGEGLU | 1D, 2D, or 3D (even depth) | GELU-gated linear unit: outputs A * GELU(B). Created with TNNetGEGLU.Create(). |
| TNNetSwiGLU | 1D, 2D, or 3D (even depth) | Swish-gated linear unit: outputs A * Swish(B), where Swish(x) = x * sigmoid(x). Created with TNNetSwiGLU.Create(). |
| TNNetTanhGLU | 1D, 2D, or 3D (even depth) | Tanh-gated linear unit: outputs A * tanh(B). Parameter-free; mirrors TNNetGLU with the sigmoid gate swapped for tanh. Created with TNNetTanhGLU.Create(). |
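
The gating rule shared by these layers fits in a few lines. This plain-Python sketch works on one depth vector and assumes that A is the first half of the channels and B the second half; the text above only says the input is split into two equal halves, so that ordering is an assumption, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def gated_linear_unit(cell, gate):
    # Split the depth vector into halves A and B and return A * gate(B).
    if len(cell) % 2 != 0:
        raise ValueError("input depth must be even")
    half = len(cell) // 2
    a, b = cell[:half], cell[half:]  # assumed order: A first, B second
    return [ai * gate(bi) for ai, bi in zip(a, b)]

cell = [1.0, 2.0, 0.0, 0.0]  # depth 4 -> output depth 2

glu = gated_linear_unit(cell, sigmoid)        # A * sigmoid(B)
tanh_glu = gated_linear_unit(cell, math.tanh) # A * tanh(B)
swiglu = gated_linear_unit(cell, swish)       # A * Swish(B)
```

With a zero gate input, the sigmoid gate passes half of A while the tanh and Swish gates block it entirely, which is the main practical difference between the variants at initialization.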
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetScaledDotProductAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Single-head scaled dot-product attention: scores[i,j] = dot(Q[i], K[j]) / sqrt(d_k), row-softmax, then out[i] = sum_j attn[i,j]*V[j]. Optional causal (upper-triangle) mask. Parameter-free. Created with TNNetScaledDotProductAttention.Create(d_k, CausalMask=false). |
| TNNetCosineSimilarityAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Drop-in variant of scaled dot-product attention whose raw Q.K^T score is replaced by a cosine-similarity score score[i,j] = scale * (Q[i]/\|\|Q[i]\|\|) . (K[j]/\|\|K[j]\|\|): each query and key row is L2-normalized over the d_k feature axis (with an epsilon guard) before the dot product. Everything after the scores (row-softmax, V-weighting) is identical to SDPA. Because cosine scores are bounded in [-scale, +scale] this removes the unbounded-logit problem of dot-product attention (more stable softmax / no score blow-up at large d_k). Optional causal mask; fixed scale (default 1.0) round-trips via serialization. Parameter-free. The exact L2-normalization Jacobian is back-propagated. Created with TNNetCosineSimilarityAttention.Create(d_k, CausalMask=false, Scale=1.0). |
| TNNetSinkAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Drop-in variant of scaled dot-product attention with K learnable attention-sink slots (StreamingLLM, Xiao et al. 2023). The K learnable (key,value) sink pairs are prepended to the real keys/values and every query attends to them regardless of the causal mask (sinks are never masked). Softmax runs over the concatenation [K sinks ++ SeqLen real keys], giving it an always-available place to dump probability mass; this stabilises long-context / causal attention (otherwise the first real token tends to act as an implicit sink). Scoring reuses SDPA's 1/sqrt(d_k) scaling and causal convention for the real keys. The K*(2*d_k) sink params are stored as 2*K neurons (keys then values) so they train and serialize automatically; K round-trips via Struct[2]. Sink keys init small-random, sink values init zero. Created with TNNetSinkAttention.Create(d_k, CausalMask=false, NumSinks=1). |
| TNNetDifferentialAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Differential Transformer attention head (Ye et al. 2024). Splits the shared Q/K depth slabs in half into two (Q1,K1) and (Q2,K2) half-width sub-heads, computes two independent softmax maps scaled by 1/sqrt(d_k/2), and outputs their scaled difference applied to the full-width shared V: (softmax(Q1·K1^T/√(d_k/2)) − λ·softmax(Q2·K2^T/√(d_k/2)))·V. The second map estimates and cancels common-mode attention noise, sharpening long-range retrieval. λ is a single learnable scalar (init ≈0.8, stepped like TNNetReZero's weight and mirrored into the structure string so it round-trips). Requires even d_k; causal mask honoured on both maps. Created with TNNetDifferentialAttention.Create(d_k, CausalMask=false, LambdaInit=0.8). |
| TNNetLinearAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Softmax-free linear attention (Katharopoulos et al. 2020, Transformers are RNNs), the first sub-quadratic attention in this repo. Replaces the softmax(QK^T)V core with a positive feature map φ(x)=elu(x)+1 on Q/K, then exploits associativity: out_t = φ(Q_t)·S / (φ(Q_t)·Z) where S = Σ_s φ(K_s)⊗V_s (d_k×d_k) and Z = Σ_s φ(K_s) are accumulated once over the sequence. Cost is O(SeqLen·d_k²), linear in sequence length, with no SeqLen×SeqLen score matrix ever formed. Non-causal (full-prefix) variant; at SeqLen=1 the normaliser cancels and the output reduces to V_1 exactly. Parameter-free. Created with TNNetLinearAttention.Create(d_k). |
| TNNetLinformerAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Linformer (Wang et al. 2020, Linformer: Self-Attention with Linear Complexity). Keeps the softmax but projects the Key and Value sequences DOWN along the sequence axis from SeqLen to a small fixed rank k ≪ SeqLen with two learnable matrices E, F (each k×SeqLen): K' = E·K, V' = F·V, then Attn = softmax(Q·K'ᵀ / √d_k) (a SeqLen×k score matrix) and Out = Attn·V', making attention O(SeqLen·k). Because E,F carry a fixed SeqLen dimension the layer requires a FIXED SeqLen (asserted in SetPrevLayer). E,F are two trainable neurons with exact finite-difference-checked input AND weight gradients; d_k/k/SeqLen round-trip via FStruct. Distinct from the kernel-feature linear-attention family (which drops softmax): Linformer low-rank-projects the sequence instead. Created with TNNetLinformerAttention.Create(d_k, k, SeqLen). See examples/Linformer/. |
| TNNetForgetGateBias | Input: SeqLen x 1 x Depth (per-position features). Output: SeqLen x SeqLen x 1 (additive score bias). | The Forgetting Transformer (FoX) decay-bias generator (Lin et al. 2025, Forgetting Transformer: Softmax Attention with a Forget Gate): the only layer here that puts a data-dependent forget gate inside SOFTMAX attention (every other forget gate in tree acts on an O(d²) recurrent state: TNNetGatedLinearAttention, TNNetWKV, TNNetDeltaNet; here it acts on the O(L²) score matrix). One weight neuron computes a per-position forget value f_t = sigmoid(w·x_t + b) ∈ (0,1), accumulates F_t = Σ_{k≤t} ln f_k, and emits the strictly-lower-triangular additive decay bias D[j,i] = F_i − F_j for j≤i (−∞ above, so the causal mask folds in for free). Added to the raw Q·Kᵀ/√d scores before softmax, this multiplies each attention weight by prod_{k=j+1..i} f_k, an input-conditioned exponential discount of older tokens (the softmax analogue of GLA's recurrence, but with full pairwise attention retained). Backward is the exact prefix-sum adjoint (dL/dF_i = Σ_{j≤i}dD[i,j] − Σ_{k≥i}dD[k,i], then df_t = (Σ_{s≥t}dF_s)/f_t, sigmoid chain into w/b/input). The composite builder TNNet.AddForgettingAttention(Heads, d_model) wires per-head forget gates into a full (inherently causal) softmax-attention block from existing primitives (no new class, save/load free). Created with TNNetForgetGateBias.Create() (Depth inferred). |
| TNNetPerformerAttention | Input: SeqLen x 1 x 3*d_k (Q\|K\|V concatenated along depth). Output: SeqLen x 1 x d_k. | Performer / FAVOR+ (Choromanski et al. 2020, Rethinking Attention with Performers). Uses positive random features to give an unbiased estimate of the softmax kernel exp(q·k) at linear cost: for an m×d_k frozen projection W, φ(x)=exp(W·x−‖x‖²/2)/√m so E[φ(q)·φ(k)]=exp(q·k). Attention reassociates like the kernel family: S=Σ_s φ(K_s)⊗V_s (m×d_v), Z=Σ_s φ(K_s), Out_t=(φ(Q_t)·S)/(φ(Q_t)·Z), at O(SeqLen·m·d_v) with no SeqLen×SeqLen matrix. W's rows are i.i.d. N(0,1), orthogonalized block-wise when m≥d_k (the lower-variance "+" in FAVOR+). W is frozen (no weight gradient) but dL/dQ, dL/dK backprop through φ; d_k/m/RNG seed round-trip via FStruct so W reloads bit-identically. Unlike TNNetLinearAttention (deterministic elu+1, a different kernel) Performer approximates the true softmax. Created with TNNetPerformerAttention.Create(d_k, m, Seed). See examples/Performer/. |
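
Most rows in this table share the same packed input layout and build on the plain scaled dot-product rule, so a small reference of that rule is useful for checking shapes by hand. The plain-Python sketch below follows the formula given for TNNetScaledDotProductAttention (scores scaled by 1/sqrt(d_k), row-softmax, weighted sum of V, optional causal mask). The function names are hypothetical, and each token is a flat depth vector packed as Q|K|V.

```python
import math

def softmax(row):
    m = max(row)  # max-shift for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(tokens, d_k, causal_mask=False):
    # tokens: SeqLen depth vectors, each of length 3*d_k packed as Q|K|V.
    q = [t[:d_k] for t in tokens]
    k = [t[d_k:2 * d_k] for t in tokens]
    v = [t[2 * d_k:3 * d_k] for t in tokens]
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i in range(len(tokens)):
        # With the causal mask, query i only sees keys j <= i.
        visible = i + 1 if causal_mask else len(tokens)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(visible)]
        attn = softmax(scores)
        out.append([sum(attn[j] * v[j][c] for j in range(visible))
                    for c in range(d_k)])
    return out

# Two tokens, d_k = 2. Zero queries give uniform attention weights.
tokens = [
    [0.0, 0.0,  1.0, 0.0,  2.0, 4.0],
    [0.0, 0.0,  0.0, 1.0,  6.0, 8.0],
]
full = scaled_dot_product_attention(tokens, d_k=2)
causal = scaled_dot_product_attention(tokens, d_k=2, causal_mask=True)
```

The output has SeqLen rows of width d_k, matching the "SeqLen x 1 x d_k" output shape listed in the table.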
Multi-head self-attention builder. TNNet.AddMultiHeadSelfAttention(d_model, Heads, CausalMask=false) wires the single-head TNNetScaledDotProductAttention above into a full multi-head block in one call: it splits the [Q_all|K_all|V_all] (depth 3*d_model) input slab into Heads per-head [Q_h|K_h|V_h] slices (d_k = d_model/Heads) via TNNetSplitChannels, runs one SDPA head per slice, concatenates the head outputs back to depth d_model with TNNetDeepConcat, and applies a TNNetPointwiseConvLinear(d_model) per-token out-projection. The two intermediate steps are also exposed as TNNet.AddSplitQKVHeads and TNNet.AddMultiHeadSDPAConcat. (The out-projection is pointwise rather than TNNetFullConnectLinear because over a SeqLen x 1 x d_model tensor a fully-connected layer would flatten and mix the whole sequence into one vector.) An optional Variant argument (avSDPA default, avDifferential, avSink) swaps the plain per-head SDPA for TNNetDifferentialAttention or TNNetSinkAttention (with NumSinks sink slots) — the default keeps every existing call bit-for-bit unchanged.
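The per-head slicing described above is plain index arithmetic. The sketch below computes, for each head, the (ChannelStart, ChannelLen) pairs that would select its Q, K and V slices out of the [Q_all|K_all|V_all] slab. It assumes head h owns the h-th contiguous block of d_k channels inside each of the three d_model-wide slabs; the function name is hypothetical and the builder's actual slicing order is not confirmed here.

```python
def split_qkv_heads(d_model, heads):
    # Returns one dict per head with (channel_start, channel_len) pairs.
    if d_model % heads != 0:
        raise ValueError("d_model must be divisible by heads")
    d_k = d_model // heads
    slices = []
    for h in range(heads):
        slices.append({
            "q": (h * d_k, d_k),                # inside Q_all
            "k": (d_model + h * d_k, d_k),      # inside K_all
            "v": (2 * d_model + h * d_k, d_k),  # inside V_all
        })
    return slices

head_slices = split_qkv_heads(d_model=8, heads=2)
```

Each head then sees a depth-3*d_k tensor, which is the input shape the single-head attention layer expects, and concatenating the Heads outputs of width d_k restores depth d_model.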
Multi-head cross-attention builder. TNNet.AddMultiHeadCrossAttention(d_model, Heads, QuerySource, KeyValueSource, CausalMask=false) wires encoder-decoder cross-attention in one call: the Query is projected (token-wise TNNetPointwiseConvLinear(d_model)) from QuerySource (the decoder stream, a QSeqLen x 1 x d_model token tensor) while the Keys and Values are projected from a separate KeyValueSource (the encoder output, a KVSeqLen x 1 x d_model token tensor). The query and key/value sequence lengths may differ — the result lives on the query grid (QSeqLen x 1 x d_model). Per head it slices d_k = d_model/Heads channels out of each projection, packs them as [Q_h|K_h|V_h] with TNNetDeepConcat, runs one TNNetScaledDotProductAttention head, concatenates the heads, and applies a token-wise TNNetPointwiseConvLinear(d_model) out-projection. (As with self-attention, every projection is pointwise rather than TNNetFullConnect*, which would flatten the sequence axis.)
Grouped-Query / Multi-Query attention builder. TNNet.AddMultiHeadGroupedQueryAttention(d_model, QueryHeads, KVHeads, CausalMask=false) builds the GQA attention shape used by modern LLMs (Llama-2/3, Mistral): the Query is projected to the full d_model (QueryHeads heads of d_k = d_model/QueryHeads) but the Keys and Values are projected to only KVHeads*d_k channels, so several query heads share one key/value head (QueryHeads/KVHeads heads per group). KVHeads=1 degenerates to Multi-Query Attention; KVHeads=QueryHeads to plain multi-head attention. Each query head slices its own d_k Q channels plus the d_k channels of its shared KV group, packs [Q_h|K_group|V_group], runs one TNNetScaledDotProductAttention, and the heads are concatenated and out-projected with a token-wise TNNetPointwiseConvLinear(d_model). The win is inference-memory: the K/V projection params shrink by a factor QueryHeads/KVHeads versus full MHA. Requires d_model mod QueryHeads = 0 and QueryHeads mod KVHeads = 0.
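The head-to-group bookkeeping and the K/V projection saving can be sketched as below. This is plain Python for illustration, not the builder's code: the function name is hypothetical, biases are ignored in the weight count, and the mapping `head // group_size` is the usual convention, assumed here rather than taken from the source.

```python
def gqa_layout(d_model, query_heads, kv_heads):
    """Channel widths, head-to-group map and K/V weight counts for GQA."""
    if d_model % query_heads != 0:
        raise ValueError("d_model mod QueryHeads must be 0")
    if query_heads % kv_heads != 0:
        raise ValueError("QueryHeads mod KVHeads must be 0")
    d_k = d_model // query_heads
    group_size = query_heads // kv_heads          # query heads per shared K/V head
    kv_channels = kv_heads * d_k                  # width of each of K and V
    return {
        "d_k": d_k,
        "kv_channels": kv_channels,
        "group_of_head": [h // group_size for h in range(query_heads)],
        "mha_kv_weights": 2 * d_model * d_model,  # K and V at full width
        "gqa_kv_weights": 2 * d_model * kv_channels,
    }


layout = gqa_layout(d_model=512, query_heads=8, kv_heads=2)
```

For this example the K/V projection weights shrink by QueryHeads/KVHeads = 4. Setting `kv_heads=1` maps every query head to group 0 (Multi-Query Attention), and `kv_heads=query_heads` gives each head its own group (plain multi-head attention).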
Multi-head Latent Attention (MLA) builder. TNNet.AddMultiHeadLatentAttention(d_model, Heads, LatentDim, CausalMask=false) builds the DeepSeek-V2 (Liu et al. 2024) attention shape, a compression axis orthogonal to GQA: instead of sharing full-width K/V across query-head groups, MLA low-rank-factors the K/V projection. Each token is first down-projected to a tiny shared latent c_KV of width LatentDim << d_model (the only state a decoder would cache), then K and V are reconstructed per head by up-projections from c_KV. Query is projected to the full d_model per head; each head packs [Q_h|K_h|V_h], runs one TNNetScaledDotProductAttention, and the heads are concatenated and out-projected with a token-wise TNNetPointwiseConvLinear(d_model) (all projections are pointwise so the token axis is preserved). The win is cacheable-state size: LatentDim/(2*d_model) of plain MHA's K/V cache. The default RopeDim=0 is NoPE; RopeDim>0 (even) adds the paper's decoupled-RoPE slice: RoPE cannot be applied to the compressed latent (the up-projection would smear positions), so a small extra rope dimension carries position — rope-Q is a per-head projection of x rotated by TNNetRotaryEmbedding, while rope-K is one projection of x shared across all heads and rotated once, so the decode state grows by only RopeDim. Each head attends with concat(Q_h,ropeQ_h)·concat(K_h,ropeK). For KV-cache incremental decode the per-head SDPA layers support BeginIncrementalDecode, and TNNetRotaryEmbedding.PositionOffset lets a streamed length-1 token be rotated with its absolute position; examples/LatentAttention/ demonstrates both this path and a true latent-only cache (d_c floats/token, < 1e-5 faithful) with a printed cache-memory comparison. See examples/LatentAttention/.
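The cache-size claim is simple arithmetic, sketched here in Python. The function name is hypothetical; the counts follow the description above (2*d_model floats per token for plain MHA, LatentDim for the shared latent, plus RopeDim when the decoupled-RoPE slice is enabled).

```python
def cached_floats_per_token(d_model, latent_dim=None, rope_dim=0):
    """Floats a decoder would cache per token.

    latent_dim=None models plain MHA (full K and V); otherwise MLA caches the
    shared latent c_KV plus the shared rope-K slice."""
    if latent_dim is None:
        return 2 * d_model
    return latent_dim + rope_dim


mha = cached_floats_per_token(512)                               # K and V at full width
mla_nope = cached_floats_per_token(512, latent_dim=64)           # latent only
mla_rope = cached_floats_per_token(512, latent_dim=64, rope_dim=16)
ratio = mla_nope / mha                                           # LatentDim / (2*d_model)
```

For d_model = 512 and LatentDim = 64 the NoPE ratio is 64/1024 = 0.0625, and enabling RopeDim = 16 adds only 16 floats per token.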
Set Transformer (ISAB + PMA) builders. Two permutation-invariant set primitives (Lee et al. 2019, Set Transformer), each owning a learnable bank of vectors. TNNet.AddInducedSetAttention(InducingPoints, Heads) wires TNNetInducedSetAttention (ISAB): instead of O(N^2) self-attention over the N input tokens it keeps a small bank of M = InducingPoints learnable inducing points and applies two stacked cross-attention blocks — H = MAB(I, X) (the M points attend over the N inputs → (M,d)), then Y = MAB(X, H) (the N inputs attend back over the M summaries → (N,d)) — giving an O(N*M) shape-preserving set-to-set map. TNNet.AddAttentionPooling(NumSeeds, Heads) wires TNNetAttentionPooling (PMA): pools a variable-length set (N,1,d) to a fixed (k,1,d) (k = NumSeeds) by letting k learnable seed vectors cross-attend over the inputs — a trainable, content-addressed, permutation-invariant readout (k=1 is a learned-query weighted-sum pool, categorically unlike the parameter-free TNNetAvgChannel/TNNetMaxChannel). Heads=1 emits the bare single-head layer with identity Q/K/V projections (only the inducing/seed bank is learnable), keeping the two-stage softmax-Jacobian backward exact and gradient-checkable. Heads>1 builds a genuine multi-head MAB by the repo's concat-of-H idiom (no head-axis tensor): d_model splits into Heads subspaces of width d_model/Heads, each head gets its own learnable per-token input projection (a 1×1 TNNetPointwiseConvLinear that preserves the set axis) feeding a single-head ISAB/PMA with its own bank, the heads are concatenated back to d_model and run through a learnable per-token out-projection — so each head keeps the exact softmax-Jacobian backward while the model now learns Q/K/V projections and mixes head subspaces (d_model must be divisible by Heads). See examples/SetTransformer/. 
TNNet.AddSAB(InducingPoints, Heads, DFF) completes the family with the paper's full Set Attention Block: it wraps the multi-head ISAB MAB in two post-norm residual sub-blocks — H = LayerNorm(X + MAB(X,X)) then out = LayerNorm(H + FFN(H)), where FFN is a token-wise TNNetPointwiseConvReLU(DFF) → TNNetPointwiseConvLinear(d_model) (1×1 convs, so the (N,1,d_model) set axis and permutation-equivariance are preserved). See examples/SetAttentionBlock/.
Sinkhorn (doubly-stochastic) attention builder. TNNet.AddSinkhornAttention(out Attended, W; KIter=20, Tau=1.0) builds a single-head attention block that is a drop-in contrast to softmax attention: it projects the input to Q/K/V with token-wise TNNetPointwiseConvLinear, forms the scaled QKᵀ score matrix with TNNetDotProducts, and then — instead of the one-sided row softmax — normalizes the scores with the existing TNNetSinkhorn layer so the attention map is doubly stochastic (rows AND columns sum to 1), before weighting the values and out-projecting. Where standard attention lets every query distribute its own probability mass independently (columns can starve or saturate), the doubly-stochastic map balances how much each key is attended to overall — the optimal-transport view of attention. The TNNetSinkhorn normalizer is returned in W for inspection/annealing (SetTau), and Attended is the block output. KIter/Tau are the Sinkhorn iteration count and temperature. The default softmax path (AddSingleHeadSelfAttention / AddMultiHeadSelfAttention) is unchanged. See examples/SinkhornMatching/.
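The contrast between a row softmax and a doubly-stochastic map can be sketched with the textbook Sinkhorn iteration. This Python sketch is an assumption about the general technique, not a transcription of TNNetSinkhorn: the function names are hypothetical, and it alternates row and column normalization of exp(scores / tau) on a square score matrix.

```python
import math


def row_softmax(scores):
    """One-sided normalization: every row sums to 1, columns are unconstrained."""
    out = []
    for row in scores:
        exps = [math.exp(s - max(row)) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sinkhorn(scores, k_iter=20, tau=1.0):
    """Alternately normalize rows and columns of exp(scores / tau)."""
    n = len(scores)
    p = [[math.exp(s / tau) for s in row] for row in scores]
    for _ in range(k_iter):
        for row in p:                                  # make each row sum to 1
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):                             # make each column sum to 1
            total = sum(p[i][j] for i in range(n))
            for i in range(n):
                p[i][j] /= total
    return p


def column_sums(p):
    return [sum(row[j] for row in p) for j in range(len(p[0]))]


# Every query prefers key 0, so softmax lets column 0 saturate.
scores = [[2.0, 0.0, -1.0], [1.5, 0.5, 0.0], [2.0, -0.5, 0.3]]
soft = row_softmax(scores)
balanced = sinkhorn(scores, k_iter=50)
```

In `soft` the first column collects far more than one unit of attention mass, while in `balanced` both the row sums and the column sums are close to 1.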
Spiking block builder. TNNet.AddSpikingBlock(pHidden, tau=2.0, V_th=1.0, alpha=2.0, LearnDynamics=false) wires the canonical spiking-network linear → LIF → rate-readout pipeline over a (T, 1, D) tensor on the time axis in one call: a per-timestep TNNetPointwiseConvLinear(pHidden) synaptic-current projection (pointwise so each time step is projected independently — TNNetFullConnect* would flatten/mix the time axis), then a TNNetLIFNeuron(tau, V_th, alpha, LearnDynamics) emitting a binary spike train, then a TNNetAvgChannel rate readout averaging spikes over time to a (1, 1, pHidden) firing-rate vector (ready for a dense head). Returns the rate-readout layer. LearnDynamics forwards to the LIF layer's opt-in trainable per-channel threshold/leak. Composes existing layers (no new class). See examples/SpikingMNIST/.
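The LIF and rate-readout stages can be illustrated with a generic leaky integrate-and-fire loop. The source does not give the neuron's update equation, so everything below is an assumption: a standard discrete LIF step V += (I - V)/tau, a hard reset to 0 after a spike, and no modelling of alpha or LearnDynamics. The function name is hypothetical.

```python
def lif_rate_readout(currents, tau=2.0, v_th=1.0):
    """currents: list over T time steps of per-channel input currents.

    Returns the per-channel firing rate (spikes averaged over time)."""
    channels = len(currents[0])
    v = [0.0] * channels
    spike_counts = [0] * channels
    for step in currents:
        for c in range(channels):
            v[c] += (step[c] - v[c]) / tau    # leaky integration toward the input
            if v[c] >= v_th:                  # threshold crossing emits a spike
                spike_counts[c] += 1
                v[c] = 0.0                    # hard reset (assumed)
    return [n / len(currents) for n in spike_counts]


# Channel 0 receives a constant current of 1.5, channel 1 receives nothing.
rates = lif_rate_readout([[1.5, 0.0]] * 8)
```

Under these assumptions channel 0 charges to 0.75, crosses the threshold on the next step and resets, so it fires on every second step; channel 1 never fires.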
Talking-Heads attention builder. TNNet.AddTalkingHeadsAttention(Heads, d_model, CausalMask=false, PreSoftmaxMix=true, PostSoftmaxMix=true) builds full Talking-Heads multi-head attention (Shazeer et al. 2020) — a learnable Heads x Heads linear mix across heads applied to the attention maps. Because this repo has no single head-axis tensor (multi-head is H separate concatenated TNNetScaledDotProductAttention layers, whose softmax is fused and never exposes a per-head logit slab), this is necessarily a builder, not a drop-in layer: it composes attention from finer primitives so the per-head scores become a real graph tensor. Per head it forms score_h = TNNetDotProducts(Q_h, K_h) scaled by 1/sqrt(d_k), transposes so heads stack on the Depth axis, and TNNetDeepConcats the H slabs into a (key, query, H) tensor; the cross-head mix is then a TNNetPointwiseConvLinear(H) (a 1x1 conv over the head axis at every (key,query) position — exactly Shazeer's H x H multiply). A PreSoftmaxMix mix is applied to the logits and a PostSoftmaxMix mix to the post-softmax weights (each individually toggleable for ablation), with the softmax taken over the key axis between them; then weights·V per head, concat, and a token-wise TNNetPointwiseConvLinear(d_model) out-projection. Composes existing serializable layers (no new class), so save/load works for free. Requires d_model mod Heads = 0.
Conformer block builder. TNNet.AddConformerBlock(Heads, d_ff, ConvKernelSize) builds the convolution-augmented transformer block of Conformer (Gulati et al. 2020) over a (SeqLen, 1, d_model) tensor. It is a "macaron" block that sandwiches a multi-head self-attention module (global mixing) and a convolution module (local mixing) between two half-step feed-forward modules, each sub-module a pre-norm residual, with a final LayerNorm: x += 0.5·FFN(x); x += MHSA(x); x += Conv(x); x += 0.5·FFN(x); x := LayerNorm(x). Composed entirely from existing serializable primitives (TNNetLayerNorm, TNNetPointwiseConvLinear per-token projections, AddMultiHeadSelfAttention, TNNetGLU conv gating, TNNetCausalConv1D 1-D conv over the time axis, TNNetSwish, TNNetSum residuals, TNNetMulByConstant(0.5) macaron scaling), so it needs no new leaf class and round-trips through SaveToString/LoadFromString; shape-preserving so blocks stack. (The paper's per-channel depthwise conv has no 1-D-over-sequence primitive in tree yet, so the channel-mixing TNNetCausalConv1D is the documented stand-in.) See examples/Conformer/.
Transformer encoder block builder. TNNet.AddTransformerEncoderBlock(d_model, Heads, d_ff, PreNorm=true, CausalMask=false) assembles a complete transformer encoder block over a SeqLen x 1 x d_model tensor in one call: an attention sub-block (LayerNorm → token-wise Q|K|V slab projection TNNetPointwiseConvLinear(3*d_model) → AddMultiHeadSelfAttention → residual sum) followed by a SwiGLU feed-forward sub-block (LayerNorm → TNNetPointwiseConvLinear(2*d_ff) → TNNetSwiGLU → TNNetPointwiseConvLinear(d_model) → residual sum). With PreNorm=true (default) each LayerNorm precedes its sub-block (x + Sublayer(LayerNorm(x))); with PreNorm=false it follows the residual sum (LayerNorm(x + Sublayer(x)), post-norm). Every projection — including both FFN projections — is a pointwise (1×1) convolution so the token axis is preserved (TNNetFullConnect* would flatten the whole sequence). The output shape stays SeqLen x 1 x d_model, so blocks can be stacked.
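The difference between the two PreNorm settings is only where the normalization sits relative to the residual sum. A minimal Python sketch on a single token vector follows; the names are hypothetical and the LayerNorm here has no learnable gain or bias, which is a simplification.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sub_block(x, sublayer, pre_norm=True):
    """pre_norm=True:  x + Sublayer(LayerNorm(x))
    pre_norm=False: LayerNorm(x + Sublayer(x))"""
    if pre_norm:
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def double(v):
    """Stand-in sublayer for the attention or feed-forward path."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
pre = residual_sub_block(x, double, pre_norm=True)
post = residual_sub_block(x, double, pre_norm=False)
```

With pre-norm the residual stream keeps its own scale (the mean of `pre` stays at the mean of `x`), while post-norm renormalizes the stream after every sub-block (the mean of `post` is zero).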
Transformer decoder block builder. TNNet.AddTransformerDecoderBlock(d_model, Heads, d_ff, EncoderOutput, PreNorm=true) assembles a complete encoder-decoder transformer decoder block over a SeqLen x 1 x d_model decoder stream in one call by composing three residual sub-blocks: (1) a causal multi-head self-attention sub-block (same wiring as the encoder block with CausalMask=true); (2) a cross-attention sub-block whose Query comes from the decoder stream and whose Key/Value come from the explicit EncoderOutput layer (a KVSeqLen x 1 x d_model encoder-memory tensor, via AddMultiHeadCrossAttention); and (3) a token-wise SwiGLU feed-forward sub-block. PreNorm places each LayerNorm before its sub-block (default) or after the residual sum (post-norm), matching AddTransformerEncoderBlock. The query and encoder-memory sequence lengths may differ — the output stays on the decoder grid (SeqLen x 1 x d_model), so decoder blocks can be stacked. See examples/TransformerDecoderBlock/.
Perceiver encoder builder. TNNet.AddPerceiverEncoder(NumLatents, d_latent, Heads, Depth, d_ff=0, PreNorm=true) assembles a Perceiver / Perceiver-IO latent-bottleneck encoder (Jaegle et al. 2021, arXiv:2103.03206) over a (InputSeqLen, 1, d_model) input in one call, decoupling the bulk of the compute from the input length. It (1) optionally projects the input width to d_latent (a 1×1 TNNetPointwiseConvLinear, inserted only when d_model <> d_latent), then (2) reads the whole input into a small fixed-size learnable latent array Z of shape (NumLatents, 1, d_latent) (NumLatents << InputSeqLen) via AddAttentionPooling(NumLatents, Heads) — the PMA seed bank is the Perceiver latent array, and this input→latent cross-attention is the only place the input enters, at cost linear in InputSeqLen; then (3) refines Z with a Depth-deep tower of AddTransformerEncoderBlock(Heads, d_ff) latent self-attention blocks whose quadratic cost lives only in the cheap NumLatents² tower. Output length is NumLatents regardless of input length — distinct from AddInducedSetAttention (projects back to the n input rows) and AddAttentionPooling alone (single pool, no self-attention refinement). Composes existing layers (no new class). See examples/Perceiver/.
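The cost argument can be made concrete by counting attention-score entries per head. This is a back-of-the-envelope Python sketch with a hypothetical function name; it compares a Depth-deep stack of full self-attention over the input with one input-to-latent cross-attention followed by a Depth-deep latent tower.

```python
def attention_score_entries(input_len, num_latents, depth):
    """Score-matrix entries per head: (full self-attention stack, Perceiver)."""
    full_self_attention = depth * input_len * input_len
    perceiver = input_len * num_latents + depth * num_latents * num_latents
    return full_self_attention, perceiver


full, perceiver = attention_score_entries(input_len=4096, num_latents=64, depth=4)
```

For InputSeqLen = 4096, NumLatents = 64 and Depth = 4 this gives 67,108,864 entries for the full stack against 278,528 for the Perceiver layout, and only the first term of the latter grows with the input length.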
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetMaskedFill | 2D or 3D | Causal (upper-triangle) mask for self-attention score maps. Adds a large negative constant to positions where the column index (X) is greater than the row index (Y), so they contribute (almost) zero attention after a softmax. No trainable parameter; the backward pass is a straight passthrough. Created with TNNetMaskedFill.Create() or TNNetMaskedFill.Create(MaskValue). The overload TNNetMaskedFill.Create(MaskValue, Offset, LowerTriangle) selects a configurable pattern: causal masks X > Y + Offset, anti-causal (LowerTriangle=True) masks X < Y - Offset; the default Offset=0, LowerTriangle=False reproduces the strict upper-triangle behaviour exactly. |
| TNNetSlidingWindowMaskedFill | 2D or 3D | Banded local causal mask (Mistral / Longformer style). Each query position Y attends only to keys in the window [Y-W+1 .. Y]; positions in the strict future (X > Y) or too far in the past (X < Y-W+1) get a large negative constant added. With W >= SeqLen it reduces to the full causal TNNetMaskedFill. No trainable parameter; backward is a straight passthrough. Created with TNNetSlidingWindowMaskedFill.Create(Window) or TNNetSlidingWindowMaskedFill.Create(Window, MaskValue). |
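The masking patterns described in the table above can be visualised with a few lines of Python. This sketch only reproduces the index conditions stated in the table (X > Y + Offset, X < Y - Offset, and the window [Y-W+1 .. Y]); the function names are hypothetical and the large negative constant itself is not modelled.

```python
def causal_mask(size, offset=0, lower_triangle=False):
    """mask[y][x] is True where the large negative constant would be added."""
    if lower_triangle:
        return [[x < y - offset for x in range(size)] for y in range(size)]
    return [[x > y + offset for x in range(size)] for y in range(size)]


def sliding_window_mask(size, window):
    """Mask the strict future (x > y) and anything older than the window."""
    return [[x > y or x < y - window + 1 for x in range(size)]
            for y in range(size)]


def render(mask):
    """'x' marks a masked score, '.' a score that is kept."""
    return ["".join("x" if m else "." for m in row) for row in mask]


causal_rows = render(causal_mask(3))
window_rows = render(sliding_window_mask(4, 2))
```

Here `causal_rows` is `[".xx", "..x", "..."]` and `window_rows` is `[".xxx", "..xx", "x..x", "xx.."]`. A window at least as long as the sequence gives the same pattern as the plain causal mask.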
TNNetDiagonalSSM is the first recurrent layer in the library: a diagonal-state linear-recurrence ("SSM-lite") sequence mixer that provides an O(n) causal alternative to the O(n^2) scaled-dot-product-attention head. The input is a (SeqLen, 1, Depth) sequence laid out along the X axis (the same convention the attention layers use); the recurrence runs left-to-right along X with the depth channels fully parallel. See examples/DiagonalSSM/ for a single-layer demo that prints the learned per-channel decay spectrum.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetDiagonalSSM | 2D (SeqLen x 1 x Depth) | Per-channel diagonal state-space recurrence: state h_t = a[d]*h_{t-1} + b[d]*x_t, output y_t = c[d]*h_t + e[d]*x_t (the e*x skip is the S4D/S5 feedthrough). Four learnable per-channel vectors (a, b, c, e); the decay is stored as a = sigmoid(a_raw) so it stays in (0,1) and the recurrence is unconditionally stable. Forward is a single left-to-right sweep; backward is backprop-through-time. Created with TNNetDiagonalSSM.Create(). |
| TNNetClosedFormContinuous | 2D (SeqLen x 1 x Depth) | A CfC "liquid" recurrent cell (Hasani et al. 2022, Closed-form continuous-time neural networks, Nature MI): a sequence mixer that updates a hidden state with the analytic closed-form solution of a liquid time-constant ODE rather than a numerical integrator. Per step h_t = sigmoid(-(Wt·x_t + b_t)·t) ⊙ tanh(Wg·x_t + b_g) + (1 - sigmoid(...)) ⊙ h_{t-1}, an input-dependent, per-channel continuous-time constant that gates between a fast tanh input pathway and the previous state, with elapsed time t = (step+1)/SeqLen. This is distinct from its neighbours in the sequence-mixer family: AddNeuralODEBlock numerically integrates a residual field; TNNetRetention / TNNetDiagonalSSM use fixed / input-independent decay kernels; TNNetSelectiveSSM is the input-dependent state-space cousin; and Highway / GatedResidual gates are depthwise but not recurrent over time. Storage is four learnable tensors Wt, Wg (Depth×Depth) and b_t, b_g (Depth-long). Forward is the explicit per-timestep recurrence; backward is backprop-through-time over the unrolled steps. Created with TNNetClosedFormContinuous.Create() (Depth inferred from the previous layer); composite helper TNNet.AddClosedFormContinuous() wraps the cell in a pre-norm RMSNorm residual (y = x + CfC(RMSNorm(x))) so it drops into a transformer-style block in one call (mirrors AddRetention / AddNeuralODEBlock). For non-causal sequence tasks, TNNet.AddBidirectionalClosedFormContinuous() runs a forward CfC plus a reverse branch (FlipX → CfC → FlipX) and concatenates the two along Depth (output Depth doubles, so wrap with a pointwise projection if a residual is wanted). See examples/LiquidCfC/ for a remember-then-recall toy contrasting it against an SDPA head at matched parameter count, and examples/LiquidCfCvsSSM/ for a last-write-wins task where the input-dependent time constant beats a fixed-decay TNNetDiagonalSSM at matched parameters. |
| TNNetSLSTMCell | 2D (SeqLen x 1 x Depth) | The scalar xLSTM cell (Beck et al. 2024, xLSTM: Extended Long Short-Term Memory): the FIRST classic-LSTM-style multiplicative-gate recurrence in the library; every other sequence mixer (TNNetDiagonalSSM, TNNetClosedFormContinuous, TNNetRetention) is linear-state or fixed/learned decay with no input/forget/output gates. Its distinguishing machinery is exponential input/forget gates i_t = exp(W_i·x_t + r_i·h_{t-1} + b_i), f_t = exp(...) (sharper storage revision than sigmoid gates), made trainable by a running-max stabilizer state m_t = max(log f_t + m_{t-1}, log i_t) that renormalizes the unbounded exp gates (i'_t = exp(log i_t - m_t), f'_t = exp(log f_t + m_{t-1} - m_t)) so they never overflow (the paper's key trick). State updates c_t = f'_t·c_{t-1} + i'_t·tanh(W_z·x_t + r_z·h_{t-1} + b_z) and normalizer n_t = f'_t·n_{t-1} + i'_t, output h_t = sigmoid(W_o·x_t + r_o·h_{t-1} + b_o) ⊙ (c_t / n_t). Twelve learnable tensors (W_{z,i,f,o}, recurrent r_{z,i,f,o} Depth×Depth, biases b_{z,i,f,o} with b_f init +1). Forward is the explicit per-timestep recurrence caching per-step gates/m_t; backward is backprop-through-time with m_t treated as a stop-gradient running max. Created with TNNetSLSTMCell.Create() (Depth inferred); composite helper TNNet.AddSLSTM() wraps it in a pre-norm RMSNorm residual (y = x + sLSTM(RMSNorm(x))). See examples/SLSTMvsCfC/. |
TNNetMLSTMCell |
2D (SeqLen x 1 x Depth) | The matrix-memory mLSTM cell of xLSTM (Beck et al. 2024) — the parallelisable, attention-like sibling of the scalar TNNetSLSTMCell. Instead of a scalar cell state it carries a Depth×Depth outer-product covariance memory C_t = f'_t·C_{t-1} + i'_t·(v_t·k_tᵀ) plus a normalizer vector n_t = f'_t·n_{t-1} + i'_t·k_t, both renormalized by the same running-max stabilizer m_t and exponential input/forget gates as sLSTM. Per step it projects the input into query/key/value (q_t=W_q·x_t, k_t=W_k·x_t, v_t=W_v·x_t) and reads out `h_t = o_t ⊙ (C_t·q_t / max( |
| TNNetKalmanFilterCell | 2D (T x 1 x StateDim) | A differentiable diagonal Kalman filter (Kalman 1960), the first layer in the library to propagate uncertainty rather than only a deterministic state: it carries a per-channel state mean x AND a covariance P over the time axis and forms an adaptive Kalman gain that trades model prediction against the new observation. Genuinely distinct from every other recurrent/state-space layer here (TNNetDiagonalSSM, TNNetSelectiveSSM, TNNetClosedFormContinuous, TNNetSLSTMCell/TNNetMLSTMCell, TNNetNTMMemory): none maintain a covariance or build a gain from it. Treating each step's input as the observation z_t, it sweeps the two-phase recurrence (predict x⁻_t = a·x_{t-1}, P⁻_t = a²·P_{t-1} + Q; update g_t = P⁻_t/(P⁻_t+R), x_t = x⁻_t + g_t·(z_t − x⁻_t), P_t = (1−g_t)·P⁻_t) and emits the filtered means x_t. Per-channel learnable scalars (3·StateDim params): transition a = tanh(a_raw) ∈ (−1,1) (bounded for stability) and process/measurement noise Q = softplus(Q_raw), R = softplus(R_raw) (positive, so the gain stays in (0,1) by construction). x_0=0, P_0=1 re-init each sweep (not persisted, like NTMMemory's M). Backward is full BPTT with two coupled adjoint scans: the covariance adjoint runs right-to-left alongside the mean adjoint because g_t depends on P⁻_t which depends on P_{t-1} (input and per-channel weight gradients both numerically gradient-checked with bounded params). StateDim = input Depth, FStruct[0]=StateDim; created with TNNetKalmanFilterCell.Create(). See examples/KalmanFilter/. |
TNNetHamiltonianCell |
2D (T x 1 x 2·D) | A structure-preserving (symplectic) dynamics cell — a Hamiltonian Neural Network (Greydanus et al. 2019, arXiv:1906.01563). Unlike every other continuous-dynamics layer here, which regresses the time-derivative field directly (AddNeuralODEBlock integrates an unconstrained residual field; TNNetClosedFormContinuous is a liquid closed-form gate; TNNetDiagonalSSM/TNNetSelectiveSSM are linear state spaces; TNNetKalmanFilterCell propagates uncertainty — none conserve energy), this cell parameterizes a scalar learned Hamiltonian H(q,p) with a small inner tanh MLP and produces the symplectic gradient field dq/dt = +∂H/∂p, dp/dt = −∂H/∂q, then integrates Steps sequential symplectic-Euler sub-steps (step size dt) along the time axis. The conserved quantity falls out of the construction, so a pendulum / mass-spring trajectory stays on its energy level set instead of spiralling. The depth axis packs the `(q |
| TNNetFourierMix | 2D (SeqLen x 1 x Depth) | The FNet parameter-free Fourier token mixer (Lee-Thorp et al. 2021, FNet: Mixing Tokens with Fourier Transforms): replaces self-attention with an unparameterised 2D discrete Fourier transform across the sequence (X) and hidden (Depth) axes, keeping only the real part: y[a,b] = sum_{s,h} x[s,h]·cos(2π(a·s/L + b·h/D)) = Re(DFT_seq(DFT_hidden(x))). Holds no trainable weights at all: mixing is a fixed linear operator. Because Re(DFT) is a fixed self-adjoint real operator M (M[(a,b),(s,h)] = cos(2π(a·s/L + b·h/D)), symmetric under swapping the index pairs), the exact input gradient is the same DFT applied to dL/dy (derived, not assumed, and verified against finite differences). Distinct from the other sequence mixers: TNNetTokenShift is an RWKV t-1 shift, TNNetCirculantLinear is a learned circular convolution, attention is a learned mix; this one is a fixed, weightless spectral mix. Default forward/backward is the exact direct O(n²) DFT sum (arbitrary SeqLen/Depth); set UseFFT := true for an opt-in separable radix-2 FFT fast path (power-of-two SeqLen and Depth; default OFF, agrees with the direct path to <1e-5). Asserts SizeY=1. Created with TNNetFourierMix.Create(). See examples/FourierMix/. |
| TNNetSpectralConv1D | 2D (SeqLen x 1 x Depth) | The Fourier Neural Operator spectral convolution (Li et al. 2021, Fourier Neural Operator for Parametric PDEs, arXiv:2010.08895), the first layer with learnable complex spectral weights (distinct from the parameter-free TNNetFourierMix and the fixed-random TNNetFourierFeatures). Per channel it takes a real radix-2 FFT along SeqLen (reusing the proven FourierMixFFT helper), truncates to the lowest Modes frequencies (a spectral low-pass that makes the operator resolution-invariant: high modes are zeroed, not learned), applies a learnable per-(in-channel, out-channel) complex weight R[m] per kept mode (an InDepth×OutDepth complex matmul mixing real/imag via the 2×2 complex-multiply block, the same weight-packing idiom as TNNetQuaternionLinear/TNNetOctonionLinear), then inverse-FFTs back to the SeqLen domain. A single weight neuron holds 2·Modes·InDepth·OutDepth reals (real+imag). Backward is the exact real adjoint of the FFT → complex-matmul → IFFT pipeline; both the input and the complex-weight gradients are numerically gradient-checked. Asserts a power-of-two SizeX and SizeY=1. Modes round-trips via FStruct[5]. Created with TNNetSpectralConv1D.Create(OutDepth, Modes). See examples/FourierNeuralOperator/. |
| TNNetSpectralConv2D | 3D (SizeX x SizeY x Depth) | The 2-D Fourier Neural Operator spectral convolution (Li et al. 2021, arXiv:2010.08895), the 2-D sibling of TNNetSpectralConv1D over an image. It takes a 2-D FFT (real radix-2 FFT along X for every row, then along Y for every column; both reuse the proven FourierMixFFT helper), truncates to the lowest ModesX × ModesY 2-D modes (a 2-D spectral low-pass), applies a learnable per-(in-channel, out-channel) complex weight R[mx,my] per kept 2-D mode (an InDepth×OutDepth complex matmul packed via the same 2×2 complex-multiply idiom as TNNetQuaternionLinear/TNNetOctonionLinear), then inverse-2D-FFTs back and takes the real part. Because the learned weights live in 2-D mode space, not grid space, the same weights describe the same continuous operator at any resolution (resolution-invariance). A single weight neuron holds 2·ModesX·ModesY·InDepth·OutDepth reals. Backward is the exact real adjoint of the 2-D FFT → complex-matmul → 2-D IFFT pipeline; both the input and the complex-weight gradients are numerically gradient-checked. Asserts power-of-two SizeX/SizeY. OutDepth/ModesX/ModesY round-trip via FStruct[0..2]. Created with TNNetSpectralConv2D.Create(OutDepth, ModesX, ModesY). See examples/SpectralConv2D/. |
| TNNetDWT1D | 2D (SeqLen x 1 x Depth) | Lifting-scheme single-level 1-D Discrete Wavelet Transform: a localized multi-resolution time-frequency primitive, the complement to the global-frequency FFT family (TNNetFourierMix, TNNetSpectralConv1D/2D) which cannot represent a transient localized in both time and scale. Along SeqLen it runs the second-generation lifting scheme per channel (split even/odd samples → predict the odd half: detail = odd − P(even) → update the even half: approx = even + U(detail)), so it is exactly invertible by construction (the inverse runs the same steps in reverse with flipped signs, which gives a free correctness oracle: IDWT(DWT(x)) == x to ~1e-7) and O(SeqLen) (no FFT). Shape: input (L,1,D) → output (L div 2, 1, 2·D) = [approx \| detail] concatenated along Depth; odd L is handled by whole-sample symmetric edge extension. Fixed-filter mode (default) selects csDWT1DHaar / csDWT1DCDF53 (CDF/LeGall 5/3) / csDWT1DDaub4 (Daubechies-4) coefficients giving an orthonormal/biorthogonal transform out of the box; opt-in learnable mode makes the predict/update taps trainable while keeping the lifting structure fixed, so perfect reconstruction is preserved for any tap values (the "learnable wavelet" / scattering-net idea, structurally distinct from TNNetSpectralConv's learnable global complex modes). Backward is the exact linear adjoint lifting; the input gradient and the learnable-tap gradient are both numerically gradient-checked. Filter/Learnable/TapCount round-trip via FStruct[0..2]; public InverseChannel() exposes the IDWT. Created with TNNetDWT1D.Create(Filter, Learnable=false). Stack K levels into a dyadic packet tree with TNNet.AddWaveletPacketTransform(Levels, Filter, Learnable). See examples/WaveletDenoise/. |
| TNNetCausalConv1D | 2D (SeqLen x 1 x Depth) | Learnable 1D convolution along the X (time) axis with left-only zero padding of FeatureSize-1, so the sequence length is preserved and output position t depends only on input positions <= t (no future leakage). One neuron per output channel holds a (K, 1, InputDepth) weight window plus a bias. An attention-free O(n*K) causal sequence mixer that pairs with TNNetTokenShift. An optional Dilation parameter (default 1) gives a WaveNet-style exponentially-growing receptive field: taps are spaced Dilation apart in time and the left pad grows to Dilation*(FeatureSize-1) (Dilation=1 is identical to the dense conv). Created with TNNetCausalConv1D.Create(NumFeatures, FeatureSize) (optional SuppressBias, Dilation). |
TNNetWKV |
2D (SeqLen x 1 x 2·Depth) | The RWKV weighted key-value (WKV) time-mixing operator (Peng et al. 2023, RWKV: Reinventing RNNs for the Transformer Era): a softmax-free, attention-free sequence mixer that is the defining recurrence of RWKV-4. Over a `k |
| TNNetCrossWKV | 2D: receptance source (SeqLen x 1 x Depth) + key\|value source (SeqLen x 1 x 2·Depth) | A two-source variant of TNNetWKV that generalises the RWKV-4 WKV recurrence to a separate key/value stream, exactly as TNNetCrossAttention generalises self-attention's packed Q\|K\|V to two sources. Where TNNetWKV splits its OWN input into the k\|v pair driving the state (so the memory it accumulates and the stream reading it are ONE sequence), TNNetCrossWKV reads key\|value from a KeyValueSource (depth 2·Depth) that is SEPARATE from the receptance/query stream (PrevLayer, depth Depth), making cross-WKV / decode-time external memory expressible. Per channel/timestep it runs the EXACT log-space-stable RWKV-v4 kernel (wkv_t = (a_{t-1}+e^{u+k_t}v_t)/(b_{t-1}+e^{u+k_t}), a_t=e^{-w}a_{t-1}+e^{k_t}v_t, per-channel learnable w=softplus(w_raw) + bonus u, running-max stabiliser) with k,v from the key\|value source and a sigmoid(r_t) receptance gate from the query source: y_t = sigmoid(r_t)·wkv_t. The key\|value source index serializes like TNNetConcat/TNNetCrossAttention (round-trips through SaveToString/LoadFromString); backward is exact coupled BPTT folding dL/dk,dL/dv into the key\|value source, dL/dr into the receptance source, and dL/dw,dL/du per channel (input grads into BOTH sources + weight grads numerically checked). The default equal-seqlen contract keeps both sources the same length (read-out at t uses the state through the kv source up to t). An opt-in asymmetric / full-context mode (TNNetCrossWKV.Create(KeyValueSource, {Asymmetric:=}true), round-trips via FStruct[1]) summarises the WHOLE key\|value stream once into a final per-channel state and lets the receptance/query stream be a different length QSeqLen <> KVSeqLen: true permuted associative recall rather than a position-aligned copy (every query reads the same full-context summary A/B; the bonus u is unused on this path). Created with TNNetCrossWKV.Create(KeyValueSource) (equal-seqlen) or TNNetCrossWKV.Create(KeyValueSource, true) (asymmetric). See examples/CrossWKV/, a cross-copy task where the two-source arm hits 100% exact recall while a memory-blind single-source WKV stays at the chance floor. |
| TNNetGatedLinearAttention | 2D (SeqLen x 1 x Depth) | Gated Linear Attention (GLA) (Yang et al. 2023, Gated Linear Attention Transformers with Hardware-Efficient Training, arXiv:2312.06635): a matrix-state linear-attention recurrence whose defining novelty is a data-dependent PER-CHANNEL (vector) diagonal forget gate alpha_t = sigmoid(W_a·x_t) that row-scales a 2-D outer-product memory S (d×d): S_t[d,e] = alpha_t[d]·S_{t-1}[d,e] + k_t[d]·v_t[e], read-out y_t = q_t^T S_t, causal left-to-right scan. This is the only member of the linear-attention family with an input-dependent vector forget gate on a full outer-product matrix state, the mechanism modern Gated-DeltaNet / Mamba-2 / RWKV-6 build on. Distinct from its neighbours: TNNetWKV (fixed-learned per-channel decay, not input-dependent), TNNetRetention (single scalar gamma), TNNetMLSTMCell (scalar exp gates + running-max), TNNetDeltaNet (scalar beta write gate, no multiplicative forget), TNNetSelectiveSSM (gates a diagonal vector state, not a matrix). Storage is a five-neuron bank W_q/W_k/W_v/W_a (Depth×Depth) + gate bias b_a (Depth-long); keys are L2-normalized (DeltaNet idiom) and queries 1/sqrt(d)-scaled, with FStruct[0..1]=d_k/d_v (square in v1). Forward is the exact per-token scan; backward is exact BPTT carrying the full dL/dS (the gate row-scales the carry); input + all four weight sets numerically gradient-checked. Asserts SizeY=1; output shape == input. Created with TNNetGatedLinearAttention.Create() (Depth inferred); the composite TNNet.AddGatedLinearAttention builder wraps it as a drop-in time-mixing block with TNNetTokenShift + a per-token input projection + a sigmoid receptance gate + output projection (mirroring AddRWKVTimeMix). See examples/GatedLinearAttention/ for an overwrite key→value recall demo contrasting GLA vs TNNetDeltaNet vs TNNetRetention. |
TNNetTestTimeTraining |
2D (SeqLen x 1 x Depth) | The Test-Time Training (TTT) sequence mixer (Sun et al. 2024, Learning to (Learn at Test Time): RNNs with Expressive Hidden States, arXiv:2407.04620). The recurrent hidden state is itself a small model W whose weights are updated by one explicit gradient-descent step on a self-supervised reconstruction loss at every timestep, so the scan literally trains an inner net as it reads. Per token: k_t,v_t,q_t = theta_K/V/Q · x_t; inner loss ell_t = ½‖W(k_t)−v_t‖²; TTT update W_t = W_{t−1} − eta·∇_W ell_t (learnable eta = softplus(eta_raw)); read-out y_t = W_t(q_t). Two inner-model variants behind one FStruct[1] flag: TTT-Linear (W a matrix; rank-1 MSE-gradient step — closely related to TNNetDeltaNet but with a learnable per-layer inner LR eta and a raw, un-normalized key instead of DeltaNet's sigmoid gate + L2-key), and TTT-MLP (W = W2·GeLU(W1·), a genuine non-linear fast-weight update a single matrix cannot express — the headline novelty). Because the forward already runs an inner backward to form ∇_W ell_t, the outer backward is exact second-order BPTT through the inner update (a Hessian-vector product for the MLP arm, the same 2nd-inner-tape pattern as TNNetHamiltonianCell); input / view-projection / eta gradients are numerically gradient-checked for both arms. Asserts SizeY=1; output shape == input. Created with TNNetTestTimeTraining.Create(Variant, Hidden) (Variant 0=Linear, 1=MLP). See examples/TestTimeTraining/. |
TNNetTitansMemory |
2D (SeqLen x 1 x Depth) | The Titans test-time neural long-term memory sequence mixer (Behrouz et al. 2024, Titans: Learning to Memorize at Test Time, arXiv:2501.00663), the Memory-as-Context (MAC) leaf-layer variant. Like TNNetTestTimeTraining the hidden state is a small inner MLP M(z)=W2·GeLU(W1·z) gradient-descended at inference on the per-token associative loss ½‖M(k_t)−v_t‖², but Titans adds the two headline mechanisms TTT lacks: (a) a momentum / "surprise" state S_t = η⊙S_{t−1} − θ⊙∇_t so a surprising token keeps writing for several steps, and (b) a data-dependent forget gate M_t = (1−α_t)⊙M_{t−1} + S_t (α_t = sigmoid(α_raw + W_α x_t) per channel) that adaptively erases stale memory. Gates are per-output-channel (W2 row o ↦ gate[o], W1 row j ↦ gate[j mod Depth]); η = sigmoid(η_raw), θ = sigmoid(θ_raw) are learnable per-channel scalars; read-out y_t = M_t(q_t). Distinct from every sibling: TNNetTestTimeTraining (plain SGD step, no momentum, no forgetting), TNNetDeltaNet/TNNetGatedLinearAttention (matrix-state delta/gated-linear recurrences, not a gradient step on a non-linear MLP), TNNetNTMMemory (content-addressed bank). Storage: theta_K/V/Q (Depth×Depth), eta_raw/theta_raw/alpha_raw (Depth), W_alpha (Depth×Depth), initial inner fast-weights W1_0 (H×Depth) / W2_0 (Depth×H). Forward is the explicit scan (each step runs an inner backward to form ∇_t); outer backward is exact second-order BPTT (a GeLU Hessian-vector product) carrying dL/dW1_t, dL/dW2_t and dL/dS_t right-to-left through the coupled momentum/forget adjoint scans (input + all weight sets numerically gradient-checked, max-abs err ≈5e-4). Asserts SizeY=1; output shape == input. Created with TNNetTitansMemory.Create(Hidden) (Depth inferred). See examples/TitansMemory/. |
TNNetImplicitLongConv |
2D (SeqLen x 1 x Depth) | The Hyena Hierarchy implicit long convolution (Poli et al. 2023): a causal depthwise convolution whose per-channel filter spans the WHOLE sequence (length SeqLen), yet is generated IMPLICITLY by a tiny shared MLP over positional features and multiplied by a learnable exponential-decay window, so the parameter count does NOT grow with SeqLen. This is distinct from TNNetCausalConv1D (a SHORT fixed-length kernel learned directly) and TNNetDiagonalSSM (a per-channel linear recurrence) — neither parametrizes a full-length filter from positions. Forward is the direct O(L^2) causal time-domain sum (FFT O(L log L) is a documented stretch goal); backward is analytic into both the input and the implicit-MLP/decay weights. Initialised near-identity (small filter) so the block starts close to a no-op. Created with TNNetImplicitLongConv.Create(); the order-2 builder TNNet.AddHyenaOperator(d_model, Hidden) assembles the data-controlled gated Hyena recurrence around it. See examples/HyenaOperator. |
TNNetSpatialGatingUnit |
2D (SeqLen x 1 x Depth) | The gMLP Spatial Gating Unit (Liu et al. 2021, Pay Attention to MLPs): an attention-free token mixer with no queries/keys/values and no per-pair dot product. It splits the Depth channels in half into u and v, applies one learned, content-independent SeqLen x SeqLen weight matrix W (plus per-position bias) across the sequence axis of v (v'[n] = bias[n] + Σ_m W[n,m]·v[m], the same static spatial projection for every channel), and gates multiplicatively out[n] = u[n]·v'[n], halving the output Depth. W is fixed after training, so the mix is data-independent — a distinct primitive, not a re-skin of attention. The SeqLen x SeqLen matrix makes it fixed-length: SeqLen is pinned at construction and SetPrevLayer rejects a mismatched SizeX, SizeY<>1, or odd Depth. W is initialised near-identity / small so the block starts close to a no-op. Created with TNNetSpatialGatingUnit.Create(SeqLen); builders TNNet.AddSpatialGatingUnit(SeqLen) and the full block TNNet.AddgMLPBlock(SeqLen, d_model, d_ffn) (channel-MLP up → split+SGU → channel-MLP down, residual; the gMLP-paper LayerNorms bound the gate). See examples/SpatialGatingUnit. |
TNNetLIFNeuron |
2D (T x 1 x Depth) | None |
TNNetALIFNeuron |
2D (T x 1 x Depth) | Opt-in (per-channel V_th/leak) |
TNNetGroupConvP4 |
3D (SizeX x SizeY x Depth) | The p4 group-equivariant lifting convolution (Cohen & Welling 2016, Group Equivariant Convolutional Networks, arXiv:1602.07576) — the first layer here that is rotation-equivariant by construction, not merely measured after the fact by TNNet.EquivarianceReport. One learned K×K kernel bank per feature is shared across the 4 rotations of the C4 group (rot-{0,90,180,270}): the layer convolves the input with the 4 rotated views of each filter and stacks the responses along a new 4-fold orientation sub-axis of Depth (output Depth = 4·FeaturesCount, channel = co·4 + r), so a 90° rotation of the input cyclically permutes the orientation channels instead of scrambling them. Distinct from the parameter-free flip involutions (TNNetFlipX/FlipY, data-augmentation primitives) and from the hypercomplex/CondConv layers (which share weights across a different algebra, not a spatial symmetry group). Subclasses TNNetConvolutionLinear and reuses the im2col/AddArea path, materialising the 3 rotated kernel views from the one trained bank; backward folds the 4 orientation gradients back onto the single shared kernel (the rotation-tied weight-gradient sum — exactly gradient-checked, alongside an exact-equivariance forward test to machine precision). v1 is the lifting (plane → C4 field) conv; the full p4m group (+reflections) and steerable/SO(2) harmonics are noted follow-ups. Created with TNNetGroupConvP4.Create(FeaturesCount, FeatureSize, ...). See examples/GroupEquivariantMNIST/. |
TNNetGroupPoolP4 |
3D (SizeX x SizeY x 4·F → SizeX x SizeY x F) | The companion group-pool head for TNNetGroupConvP4: max- (default) or mean-reduces over the 4 orientation channels of each feature, collapsing a C4 feature field back to a rotation-invariant (SizeX, SizeY, FeaturesCount) map for a classifier tail. Closes the lift → equivariant-process → invariant-readout loop. Created with TNNetGroupPoolP4.Create(). |
TNNetMinGRU |
2D (SeqLen x 1 x Depth) | The minimal GRU of Feng et al. 2024 (Were RNNs all we needed?, arXiv:2410.01201). A stripped-down GRU whose update gate and candidate depend on the current input only — z_t = sigmoid(W_z·x_t + b_z), h̃_t = W_h·x_t + b_h, h_t = (1 − z_t)⊙h_{t-1} + z_t⊙h̃_t — so the gates carry no h_{t-1} feedback. That is the whole point: removing the recurrent dependence in the gates turns the cell into a linear scan that is fully parallelizable over the time axis (the paper's headline), distinguishing it from the xLSTM family (TNNetSLSTMCell/TNNetMLSTMCell), whose exp-gates do read h_{t-1}. Four weight tensors (W_z,W_h Depth×Depth, b_z,b_h Depth). Forward is a left-to-right scan; backward is exact BPTT (input + all weight grads numerically gradient-checked). Created with TNNetMinGRU.Create() (Depth inferred). See examples/MinimalRNN/. |
TNNetMinLSTM |
2D (SeqLen x 1 x Depth) | The minimal LSTM sibling from the same paper (Feng et al. 2024). Like TNNetMinGRU, the forget/input gates depend on x_t only — f_t = sigmoid(W_f·x_t + b_f), i_t = sigmoid(W_i·x_t + b_i), h̃_t = W_h·x_t + b_h — then are normalized f'_t = f_t/(f_t+i_t), i'_t = i_t/(f_t+i_t) and combined h_t = f'_t⊙h_{t-1} + i'_t⊙h̃_t, so the cell is parallelizable for the same reason. Six weight tensors (W_f,W_i,W_h Depth×Depth, b_f,b_i,b_h Depth). Backward differentiates the coupled f/(f+i) gate normalization exactly (input + all weight grads numerically gradient-checked). Created with TNNetMinLSTM.Create() (Depth inferred). See examples/MinimalRNN/. |
TNNetLRU |
2D (SeqLen x 1 x Depth) | The Linear Recurrent Unit (Orvieto et al. 2023, Resurrecting Recurrent Neural Networks for Long Sequences, arXiv:2303.06349): a stable, complex-diagonal linear recurrence. Each output channel pairs with its own complex eigenvalue parameterized for guaranteed stability — ` |
TNNetLegendreMemoryUnit |
2D (SeqLen x 1 x Depth) | The Legendre Memory Unit (Voelker et al. 2019, Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks). Carries an order-N memory vector m_t per channel that holds the coefficients of a shifted-Legendre-polynomial projection of a sliding window of the input, an orthogonal-polynomial compression that is genuinely different from the exponential-decay kernels of the diagonal SSMs. Driven by the dense, structured, NON-diagonal HiPPO-LegT (translated Legendre, sliding-window) transition matrix A_ij = (2i+1)·(−1 if i<j else (−1)^(i−j+1)) and input vector B_i = (2i+1)·(−1)^i, discretized once at build time (Euler step Abar = I + (1/theta)·A, Bbar = (1/theta)·B) so the HiPPO matrices need no gradient. Forward sweeps m_t = Abar·m_{t−1} + Bbar·u_t along the time axis; a learned per-channel read-out (one weight neuron, Depth×Order) collapses the N coefficients to one value per step (output shape == input). Memory cost is N numbers regardless of window length. Distinct from TNNetDiagonalSSM (real diagonal), TNNetLRU (complex diagonal), and the matrix-memory linear-attention cells (TNNetGatedLinearAttention/TNNetDeltaNet/TNNetWKV): the non-diagonal Legendre basis fills a real structural gap. Backward is a clean right-to-left adjoint scan plus the read-out weight gradient (input + weight grads numerically gradient-checked; Abar checked against a brute-force reference). theta is a fixed build-time constant in v1 (learnable per-channel theta is a logged follow-up). Created with TNNetLegendreMemoryUnit.Create(Order, theta). See examples/LegendreMemoryUnit/. |
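The TNNetMinGRU row in the table above states its recurrence in closed form. As a rough illustration, the scan can be sketched in plain Free Pascal for a single channel (Depth = 1), so that W_z and W_h reduce to scalars. This is a standalone sketch, not the library's implementation: `MinGRUScan`, `Sigmoid` and every parameter name are hypothetical and chosen only for this example.

```pascal
program MinGRUSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

function Sigmoid(x: Double): Double;
begin
  Result := 1.0 / (1.0 + Exp(-x));
end;

// Scalar (Depth = 1) minGRU scan, hypothetical helper:
//   z_t  = sigmoid(wz*x_t + bz)
//   hc_t = wh*x_t + bh               (candidate, no activation)
//   h_t  = (1 - z_t)*h_{t-1} + z_t*hc_t
// The gate and the candidate read x_t only, never h_{t-1}.
function MinGRUScan(const x: array of Double;
  wz, bz, wh, bh, h0: Double): TVec;
var
  t: Integer;
  z, hc, h: Double;
begin
  SetLength(Result, Length(x));
  h := h0;
  for t := 0 to High(x) do
  begin
    z := Sigmoid(wz * x[t] + bz);
    hc := wh * x[t] + bh;
    h := (1.0 - z) * h + z * hc;
    Result[t] := h;
  end;
end;

var
  hs: TVec;
  t: Integer;
begin
  hs := MinGRUScan([1.0, -0.5, 2.0, 0.0], 0.8, 0.0, 1.5, 0.1, 0.0);
  for t := 0 to High(hs) do
    WriteLn('h[', t, '] = ', hs[t]:0:6);
end.
```

Because z_t and the candidate depend only on x_t, they could all be computed up front for the whole sequence. The sketch keeps the sequential loop for readability.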
MLP-Mixer block builder. TNNet.AddMLPMixerBlock(TokensHidden, ChannelsHidden, ActFn=TNNetReLU) assembles a complete all-MLP Mixer block (Tolstikhin et al. 2021, MLP-Mixer: An all-MLP Architecture for Vision) over a (Tokens, 1, Channels) sequence in one call — an attention-free alternative to AddTransformerEncoderBlock that mixes information with two MLPs along different axes, each wrapped in a pre-LayerNorm residual: (1) a token-mixing MLP that TNNetTransposeXD-swaps the token/channel axes, runs a pointwise Linear(TokensHidden) → ActFn → Linear(Tokens) across the (now-Depth) token axis — the same MLP shared over every channel — then transposes back; followed by (2) a channel-mixing MLP Linear(ChannelsHidden) → ActFn → Linear(Channels) applied independently per token. Every projection is a pointwise (1×1) convolution so the token axis is preserved and both sub-blocks are shape-preserving (Tokens and Channels are read from the current last layer), so the output stays (Tokens, 1, Channels) and blocks stack. See examples/MLPMixer/.
Gated Linear Attention block builder. TNNet.AddGatedLinearAttentionBlock(d_ff, PreNorm=true, NormClass=nil) assembles a complete transformer-style block over a (SeqLen, 1, d_model) sequence in one call — an attention-free linear-recurrence alternative to AddTransformerEncoderBlock that swaps the multi-head self-attention arm for the GLA time-mixing builder. It composes two pre-LayerNorm (or post-LayerNorm with PreNorm=false) residual sub-blocks with inline TNNetSum residuals: (1) a GLA time-mixing sub-block (LayerNorm → AddGatedLinearAttention → residual sum), wrapping the TNNetGatedLinearAttention matrix-state per-channel-gated linear-attention recurrence; followed by (2) a SwiGLU feed-forward sub-block (LayerNorm → TNNetPointwiseConvLinear(2*d_ff) → TNNetSwiGLU → TNNetPointwiseConvLinear(d_model) → residual sum). d_model is inferred from the input depth and every projection is a pointwise (1×1) convolution, so the output stays (SeqLen, 1, d_model) and blocks stack. Composes existing layers (no new class). See examples/GatedLinearAttentionBlock/, where a 3-block tower reaches 100% exact recall vs 81% for the bare mixer on an overwrite key→value recall task.
Fourier Neural Operator block builders. TNNet.AddFourierNeuralOperator1D(OutDepth, Modes, ActFn=nil) and TNNet.AddFourierNeuralOperator2D(OutDepth, ModesX, ModesY, ActFn=nil) wrap the TNNetSpectralConv1D/TNNetSpectralConv2D leaf layers in the canonical FNO Fourier layer (Li et al. 2021) y = ActFn(SpectralConv(x) + W₁ₓ₁(x)): a global spectral branch in mode space summed with a local pointwise (1×1 conv, TNNetPointwiseConvLinear) residual branch in grid space, both mapping Depth → OutDepth from the same input, then an optional activation (ActFn=nil = none). Stack these blocks to build a resolution-invariant PDE-surrogate operator. Note: TNNetSpectralConv2D's radix-2 FFT requires power-of-two SizeX/SizeY. See examples/FourierNeuralOperator/ and examples/SpectralConv2D/.
Trainable Bias (Shift) and Multiplication (Scaling) per Cell or Channel Allowing Faster Learning and Convergence
When TNNetCellBias is added after convolutional layers, it introduces a trainable bias to each output cell of the convolutional layer. This can have several effects on the neural network:
- Fine-tuning: TNNetCellBias allows for fine-tuning of the network's output by adding a learnable bias to each cell. This can help the network adjust its predictions more precisely.
- Increased flexibility: by adding a bias to each cell individually, the network gains additional parameters to optimize, potentially allowing it to learn more complex representations.
- Improved learning speed: placing this layer before and after convolutions can speed up learning. This is because it gives the network an additional way to adjust its output, potentially making it easier to find optimal solutions.
- Parameter increase: adding TNNetCellBias increases the number of trainable parameters in the network. While this can be beneficial for learning, it also increases the model's complexity and the risk of overfitting.
The effect of adding TNNetCellBias after convolutional layers depends on the architecture and on the problem at hand. It can speed up learning and improve the network's flexibility, but you should experiment and validate its impact on your own use case.

TNNetChannelBias adds a trainable bias to each channel in the output. It works like TNNetCellBias, but operates on entire channels instead of individual cells.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
TNNetCellBias |
1D, 2D, or 3D | Trainable bias (shift) for each cell. |
TNNetCellMul |
1D, 2D, or 3D | Trainable multiplication (scaling) for each cell. |
TNNetChannelBias |
1D, 2D, or 3D | Trainable bias (shift) for each channel. |
TNNetChannelMul |
1D, 2D, or 3D | Trainable multiplication (scaling) for each channel. |
TNNetGatedResidual |
1D, 2D, or 3D | Per-channel learnable residual gate: y[x,y,d] = alpha[d] * x[x,y,d], with one learnable scalar alpha per depth channel (init 0.0 by default). The per-channel generalisation of TNNetReZero (single shared scalar). Wrap a residual branch with it (Sum([Sublayer, PrevLayer])) so each channel starts as the identity and opens its gate independently during training. Created with TNNetGatedResidual.Create() or TNNetGatedResidual.Create(initialAlpha). The TNNet.AddGatedResidual(pSublayers) builder wires this pattern in one call: y = x + GatedResidual(Sublayer(x)) (no normalization — the per-channel gate is the only added parameter and starts the branch at zero contribution). |
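The gated-residual pattern from the last table row, y = x + GatedResidual(Sublayer(x)), can be illustrated on a single position with a few depth channels. The following is a minimal standalone sketch in Free Pascal, not library code. `Sublayer` is a hypothetical stand-in for any residual branch, and `GatedResidual` is an illustrative name.

```pascal
program GatedResidualSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

// Hypothetical stand-in for an arbitrary residual branch.
function Sublayer(const x: array of Double): TVec;
var
  d: Integer;
begin
  SetLength(Result, Length(x));
  for d := 0 to High(x) do
    Result[d] := 2.0 * x[d] + 1.0;
end;

// y[d] = x[d] + alpha[d] * Sublayer(x)[d], one gate per depth channel.
function GatedResidual(const x, alpha: array of Double): TVec;
var
  d: Integer;
  branch: TVec;
begin
  branch := Sublayer(x);
  SetLength(Result, Length(x));
  for d := 0 to High(x) do
    Result[d] := x[d] + alpha[d] * branch[d];
end;

var
  y: TVec;
  d: Integer;
begin
  // alpha = 0 on every channel: the block is the identity.
  y := GatedResidual([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]);
  for d := 0 to High(y) do
    WriteLn('closed gate  y[', d, '] = ', y[d]:0:3);
  // Channels open independently once their alpha moves away from 0.
  y := GatedResidual([1.0, 2.0, 3.0], [0.5, 0.0, 1.0]);
  for d := 0 to High(y) do
    WriteLn('mixed gates  y[', d, '] = ', y[d]:0:3);
end.
```

With the default initial alpha of 0.0, the first call returns the input unchanged, which is the property that lets the branch start at zero contribution.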
Composite helper: TNNet.AddSEBlock(InputLayer, ReductionRatio) wires the standard Squeeze-and-Excitation pattern (TNNetAvgChannel -> TNNetFullConnectReLU(C/r) -> TNNetFullConnectSigmoid(C) -> TNNetChannelMulByLayer) onto an existing branch. See examples/SEBlockCifar/.
Composite helper: TNNet.AddCBAM(InputLayer, ReductionRatio=16, SpatialKernelSize=7) wires the Convolutional Block Attention Module (Woo et al. 2018) over a conv feature map in one call — channel attention then spatial attention, shape-preserving. Channel attention pools the map with both global average (TNNetAvgChannel) and global max (TNNetMaxChannel), runs each through a reduce->ReLU->expand MLP, sums them, sigmoids, and rescales per channel (TNNetChannelMulByLayer) — the dual-pool extension of AddSEBlock. Spatial attention then builds a 2-channel descriptor (pointwise C->2 conv), applies a padded SpatialKernelSize conv -> sigmoid gate, and rescales per spatial position. v1 simplifications (vs the paper): two separate channel MLPs rather than one shared MLP, and a learned C->2 spatial descriptor rather than fixed avg/max-over-depth. Keep inputs square (TNNetMaxChannel assumes SizeX == SizeY). See examples/CBAMAttention/.
Composite helper: TNNet.AddMixtureOfExperts(InputLayer, NumExperts, ExpertHiddenDim) wires a soft/dense Mixture-of-Experts feed-forward block (Shazeer et al. 2017) in one call, shape-preserving so it is a drop-in FFN replacement (d_model = InputLayer.Output.Depth). A token-wise gating network (TNNetPointwiseConvLinear(NumExperts) -> TNNetSoftMax) produces per-expert weights g; NumExperts parallel shape-preserving expert MLPs (TNNetPointwiseConvReLU(ExpertHiddenDim) -> TNNetPointwiseConvLinear(d_model)) each compute E_e(x); the block returns Sum_e g[e] * E_e(x). Each scalar gate weight is sliced with TNNetSplitChannels(e,1), broadcast across d_model with TNNetDeepConcat.Replicate, and cell-multiplied into the expert output with TNNetCellMulByCell (the same broadcast-multiply mechanism AddCBAM uses), then summed with TNNetSum. v1 is a soft, dense gate: every expert runs on every token and the outputs are blended — it trains end-to-end with existing layers (no new gradient code). See examples/MixtureOfExperts/.
Composite helper: TNNet.AddTopKMixtureOfExperts(InputLayer, NumExperts, ExpertHiddenDim, TopCnt, out AuxLossHead, AuxCoeff=0.01) is the sparse hard top-k sibling of AddMixtureOfExperts. The gate softmax is routed through the new TNNetTopKGate(TopCnt) — which keeps the TopCnt largest gate weights per token, zeroes the rest, and renormalizes the survivors to sum to 1 (so each token is a convex blend of only its top-TopCnt experts), with an exact fused mask+renorm Jacobian backward dL/dg_j = (1/s)(dL/dy_j − Σ_i dL/dy_i·y_i) over the surviving set. A load-balancing auxiliary loss head TNNetLoadBalanceLoss(TopCnt, Coeff) is attached as a second output leaf and returned via out AuxLossHead: it implements the Switch-Transformer loss (Fedus et al. 2021) L_aux = Coeff·E·Σ_i f_i·P_i where E = NumExperts, f_i is the (stop-gradient) fraction of tokens whose top-TopCnt set touches expert i and P_i is expert i's mean gate probability, so the gradient flows through P_i only (dL_aux/dg_t[i] = Coeff·E·f_i/T) and pushes the gate to spread load instead of collapsing onto one expert. v1 still evaluates all experts then masks (correct, but not yet compute-sparse); Gumbel-noise gating and true sparse dispatch are logged follow-ups. See examples/TopKMoE/.
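The mask-and-renormalize step described above can be sketched for one token as follows. This is a standalone Free Pascal illustration under stated assumptions: `TopKGate` is a hypothetical function rather than the TNNetTopKGate class, and the tie-breaking rule (lower index wins) is chosen here only for determinism.

```pascal
program TopKGateSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

// Keeps the K largest entries of a softmax output g, zeroes the rest,
// and rescales the survivors so that they sum to 1.
function TopKGate(const g: array of Double; K: Integer): TVec;
var
  i, j, rank: Integer;
  s: Double;
begin
  SetLength(Result, Length(g));
  s := 0.0;
  for i := 0 to High(g) do
  begin
    // rank = how many entries beat g[i]; ties go to the lower index
    // (an assumption of this sketch).
    rank := 0;
    for j := 0 to High(g) do
      if (g[j] > g[i]) or ((g[j] = g[i]) and (j < i)) then
        Inc(rank);
    if rank < K then
    begin
      Result[i] := g[i];
      s := s + g[i];
    end
    else
      Result[i] := 0.0;
  end;
  if s > 0.0 then
    for i := 0 to High(Result) do
      Result[i] := Result[i] / s;
end;

var
  y: TVec;
  i: Integer;
begin
  y := TopKGate([0.1, 0.6, 0.3], 2);
  for i := 0 to High(y) do
    WriteLn('y[', i, '] = ', y[i]:0:4);
end.
```

For the gate weights 0.1, 0.6 and 0.3 with K = 2, the smallest weight is dropped and the remaining two are divided by their sum 0.9, so each token becomes a convex blend of its two strongest experts.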
Composite helper: TNNet.AddMixtureOfDepths(InputLayer, BlockBuilder, Capacity) wires a Mixture-of-Depths conditional-compute block (Raposo et al. 2024) in one call, shape-preserving so it is a drop-in trunk wrapper. A per-token router (TNNetPointwiseConvLinear(1) over Depth -> TNNetSigmoid, pointwise so the token axis is preserved) scores each of the SeqLen positions; the top-Capacity positions are selected along the sequence axis (TransposeXD -> TNNetTopK(Capacity) -> TransposeXD, since TNNetTopK masks over Depth), their router weight is broadcast across d_model and cell-multiplied into the wrapped block's output (keeping the router on the gradient path through the hard top-k), and the result is added residually so non-selected positions pass through unchanged. With Capacity = SeqLen it is bit-for-bit equal to the wrapped block alone (the degenerate correctness anchor). A load-balancing auxiliary loss and a Gumbel/learned-threshold router are logged follow-ups. See examples/MixtureOfDepths/.
Composite helper: TNNet.AddReversibleBlock(InputLayer, HiddenDim) wires a RevNet-style reversible additive-coupling block in one call: it splits the input depth into halves x1|x2 and produces y1 = x1 + F(x2), y2 = x2 + G(y1), output = Concat(y1, y2) (shape-identical to the input), where F/G are small pointwise residual functions. The defining property is exact analytic invertibility — x2 = y2 - G(y1), x1 = y1 - F(x2) recovers the input without inverting F/G. See examples/ReversibleBlock/.
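The additive coupling and its inverse can be checked numerically on a pair of scalars. The sketch below is standalone Free Pascal and does not use the library. `F` and `G` are hypothetical residual functions picked for the demo (G is deliberately not invertible), and `CoupleForward` / `CoupleInverse` are illustrative names.

```pascal
program ReversibleBlockSketch;
{$mode objfpc}{$H+}

uses
  Math;

type
  TPair = record
    a, b: Double;  // the two halves x1|x2 (or y1|y2)
  end;

function MakePair(a, b: Double): TPair;
begin
  Result.a := a;
  Result.b := b;
end;

// Hypothetical residual functions; neither needs to be invertible.
function F(u: Double): Double;
begin
  Result := Tanh(u);
end;

function G(u: Double): Double;
begin
  Result := 0.5 * u * u;
end;

// y1 = x1 + F(x2), y2 = x2 + G(y1)
function CoupleForward(const x: TPair): TPair;
begin
  Result.a := x.a + F(x.b);
  Result.b := x.b + G(Result.a);
end;

// x2 = y2 - G(y1), x1 = y1 - F(x2)
function CoupleInverse(const y: TPair): TPair;
begin
  Result.b := y.b - G(y.a);
  Result.a := y.a - F(Result.b);
end;

var
  x, y, r: TPair;
begin
  x := MakePair(0.3, -1.2);
  y := CoupleForward(x);
  r := CoupleInverse(y);
  WriteLn('input     : ', x.a:0:6, ' ', x.b:0:6);
  WriteLn('output    : ', y.a:0:6, ' ', y.b:0:6);
  WriteLn('recovered : ', r.a:0:6, ' ', r.b:0:6);
end.
```

The inverse only ever evaluates F and G in the forward direction, which is why the input is recovered (up to floating-point rounding) even though G itself has no inverse.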
TNNetAffineCoupling is the library's first exact-likelihood normalizing-flow primitive (RealNVP, Dinh et al. 2016, arXiv:1605.08803; Glow, Kingma & Dhariwal 2018, arXiv:1807.03039) — a bijection whose Jacobian log-determinant is available in closed form, so a stack of these (interleaved with fixed channel permutations) trains by exact maximum likelihood under a unit-Gaussian base. Distinct from the memory-saving AddReversibleBlock (a RevNet recompute trick with NO tractable Jacobian) and from TNNetMixtureDensity (a density head that is not invertible). It splits the Depth axis into two contiguous halves a|b; one half passes through unchanged and conditions an affine map of the other: forward y_a = x_a, y_b = x_b·exp(s) + t; inverse (sampling) x_b = (y_b − t)·exp(−s), where [s_pre; t] = W·x_a + b is a single per-position linear conditioner over the unchanged half and s = clamp·tanh(s_pre/clamp) is the Glow log-scale clamp for stability. The transformed half is chosen by the pTransformSecond constructor flag so stacked couplings update every channel. log|det J| = Σ s is exposed as the public read-only LogDetJacobian, so a flow's NLL is loss = 0.5·||z||² − Σ_couplings(LogDetJacobian); backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0). FStruct[0]=transform-second flag, FStruct[1]=inverse flag, FFloatSt[0]=s clamp; all round-trip via Save/Load. Created with TNNetAffineCoupling.Create() or TNNetAffineCoupling.Create(pTransformSecond, pInverse, pClamp). See examples/NormalizingFlow/ (fits a 2-D two-moons density by maximum likelihood and samples new points back through the inverse flow).
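To make the forward map, the inverse and the log-determinant concrete, here is a minimal scalar sketch in standalone Free Pascal. It assumes one channel per half, so the linear conditioner W·x_a + b reduces to two scalar affine maps. All type and function names (`TCoupling`, `CouplingForward`, and so on) are hypothetical and are not the library's API.

```pascal
program AffineCouplingSketch;
{$mode objfpc}{$H+}

uses
  Math;

type
  TCoupling = record
    ws, bs: Double;  // conditioner for the pre-clamp log-scale s_pre
    wt, bt: Double;  // conditioner for the shift t
    clamp: Double;   // bound on |s|
  end;
  TCoupled = record
    ya, yb: Double;
    logdet: Double;  // log|det J| of this coupling (= s here)
  end;

function MakeCoupling(ws, bs, wt, bt, clamp: Double): TCoupling;
begin
  Result.ws := ws; Result.bs := bs;
  Result.wt := wt; Result.bt := bt;
  Result.clamp := clamp;
end;

// s = clamp * tanh(s_pre / clamp), so |s| < clamp.
function LogScale(const c: TCoupling; xa: Double): Double;
begin
  Result := c.clamp * Tanh((c.ws * xa + c.bs) / c.clamp);
end;

function ShiftTerm(const c: TCoupling; xa: Double): Double;
begin
  Result := c.wt * xa + c.bt;
end;

// y_a = x_a, y_b = x_b * exp(s) + t
function CouplingForward(const c: TCoupling; xa, xb: Double): TCoupled;
var
  s: Double;
begin
  s := LogScale(c, xa);
  Result.ya := xa;
  Result.yb := xb * Exp(s) + ShiftTerm(c, xa);
  Result.logdet := s;
end;

// x_b = (y_b - t) * exp(-s); y_a = x_a, so s and t can be recomputed.
function CouplingInverseB(const c: TCoupling; ya, yb: Double): Double;
begin
  Result := (yb - ShiftTerm(c, ya)) * Exp(-LogScale(c, ya));
end;

// Single-coupling NLL under a unit-Gaussian base (constant term omitted).
function CouplingNLL(const z: TCoupled): Double;
begin
  Result := 0.5 * (Sqr(z.ya) + Sqr(z.yb)) - z.logdet;
end;

var
  c: TCoupling;
  z: TCoupled;
begin
  c := MakeCoupling(0.7, 0.1, -0.4, 0.2, 2.0);
  z := CouplingForward(c, 0.5, -1.3);
  WriteLn('y_b       = ', z.yb:0:6);
  WriteLn('log|det J| = ', z.logdet:0:6);
  WriteLn('x_b back  = ', CouplingInverseB(c, z.ya, z.yb):0:6);
  WriteLn('NLL       = ', CouplingNLL(z):0:6);
end.
```

Because the unchanged half is available on both sides, the inverse recomputes s and t from y_a and never needs to invert the conditioner.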
TNNetInvertible1x1Conv is Glow's learnable invertible 1×1 convolution (Kingma & Dhariwal 2018, arXiv:1807.03039 sec. 3.2) — the channel-mixing companion to TNNetAffineCoupling. It applies a learnable C×C matrix W per spatial position across the Depth axis, y[x,y,:] = W·x[x,y,:], a trainable generalization of the fixed channel permutation that couplings rely on for mixing. Distinct from TNNetHouseholderLinear (exactly orthogonal → log-det identically 0, zero volume change) and from TNNetPointwiseConvLinear (generic 1×1 conv with no tractable inverse or log-det). W is parametrized by its LU decomposition W = P·L·(U + diag(s)) with P a fixed permutation chosen at init (regenerated from a stored seed), L unit-lower-triangular, U strictly-upper-triangular, and s the log-scale vector; this makes the per-position log-det the cheap Σ log|s| (O(C)) instead of an O(C³) determinant. LogDetJacobian returns SizeX·SizeY·Σ log|s| for the current forward, composing additively with the couplings' log-dets; backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0). The map is exactly invertible by triangular solves (no matrix inverse) — pass pInverse=true for the sampling direction, reusing the same L/U/s. FStruct[0]=C, FStruct[1]=permutation seed, FStruct[2]=inverse flag; L/U/s round-trip via the per-neuron save. Created with TNNetInvertible1x1Conv.Create() or TNNetInvertible1x1Conv.Create(pInverse, pPermSeed). See examples/NormalizingFlow/, which interleaves it with TNNetAffineCoupling (the real Glow step) and shows the learnable mixing reaching a higher mean log-likelihood than the fixed-permute baseline.
TNNetActNorm is Glow's data-dependent activation normalization (Kingma & Dhariwal 2018, arXiv:1807.03039 sec. 3.1) — the third and final Glow flow-step primitive, completing the canonical trio TNNetActNorm → TNNetInvertible1x1Conv → TNNetAffineCoupling. It is a per-channel invertible affine transform y[.,.,c] = s[c]·x[.,.,c] + b[c] with learnable scale s (stored as a log-scale, s = exp(logs), to stay strictly non-zero) and bias b. On the first forward minibatch it lazily data-dependent-initialises logs = −ln(std[c]) and b = −mean[c]/std[c] so that batch comes out per-channel ~0 mean / ~1 variance; an "initialised" flag in FStruct guards the init so it fires exactly once and never re-fires on reload — after which logs,b are ordinary trainable weights (identical behaviour at train and sample time, exactly the property a flow needs). Genuinely distinct from every existing normalizer (TNNetChannelStdNormalization, TNNetGroupNorm, TNNetInstanceNorm, …): those recompute statistics each forward and expose no inverse and no log-det, so they cannot sit inside a flow's exact-likelihood NLL. ActNorm's log-det is the cheap LogDetJacobian = SizeX·SizeY·Σ_c logs[c] (one term per channel), composing additively with the other two flow layers; backward folds the negative-log-det gradient in via LogDetLossWeight (default 1.0) and propagates to logs, b, and the input. Exactly invertible by x = (z − b)/s (pInverse=true, no solve needed). FStruct[0]=C, FStruct[1]=initialised flag, FStruct[2]=inverse flag; neuron 0 = logs, neuron 1 = b, both round-tripping via Save/Load. Created with TNNetActNorm.Create() or TNNetActNorm.Create(pInverse).
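The data-dependent initialisation can be illustrated for one channel. The sketch below is standalone Free Pascal, not the TNNetActNorm implementation. It assumes the population standard deviation (dividing by the sample count) and a batch whose standard deviation is non-zero. `InitActNorm`, `ApplyActNorm` and `InvertActNorm` are hypothetical names.

```pascal
program ActNormInitSketch;
{$mode objfpc}{$H+}

type
  TActNorm = record
    logs: Double;  // log-scale, s = exp(logs)
    b: Double;     // bias
  end;

// logs = -ln(std), b = -mean/std, computed from the first batch.
// Assumes the batch has non-zero standard deviation.
function InitActNorm(const x: array of Double): TActNorm;
var
  i: Integer;
  mean, variance, sd: Double;
begin
  mean := 0.0;
  for i := 0 to High(x) do
    mean := mean + x[i];
  mean := mean / Length(x);
  variance := 0.0;
  for i := 0 to High(x) do
    variance := variance + Sqr(x[i] - mean);
  variance := variance / Length(x);  // population variance (assumption)
  sd := Sqrt(variance);
  Result.logs := -Ln(sd);
  Result.b := -mean / sd;
end;

// y = s*x + b with s = exp(logs)
function ApplyActNorm(const p: TActNorm; x: Double): Double;
begin
  Result := Exp(p.logs) * x + p.b;
end;

// x = (y - b) / s
function InvertActNorm(const p: TActNorm; y: Double): Double;
begin
  Result := (y - p.b) / Exp(p.logs);
end;

var
  p: TActNorm;
  y: Double;
begin
  p := InitActNorm([1.0, 2.0, 3.0, 4.0]);
  WriteLn('logs = ', p.logs:0:6, '  b = ', p.b:0:6);
  y := ApplyActNorm(p, 4.0);
  WriteLn('y(4.0) = ', y:0:6, '  inverted = ', InvertActNorm(p, y):0:6);
end.
```

After this one-off initialisation, logs and b would be treated as ordinary trainable values. The per-channel contribution to the log-determinant is simply logs, which the sketch stores directly.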
Composite helper: TNNet.AddNeuralODEBlock(InputLayer, HiddenDim, Steps) wires a continuous-depth (Neural ODE) residual block (Chen et al. 2018) in one call, shape-preserving so it is a drop-in replacement for a residual trunk (d_model = InputLayer.Output.Depth). A residual step x_{n+1} = x_n + f(x_n) is one explicit Euler step of dx/dt = f(x,t); this block replaces a stack of distinct residual blocks with one shared f integrated over Steps Euler sub-steps with fixed step h = 1/Steps. f is a shape-preserving pointwise sub-block over Depth (TNNetPointwiseConvReLU(HiddenDim) -> TNNetPointwiseConvLinear(d_model)); step 1 owns the only real weights and every later step reuses them via TNNetConvolutionSharedWeights, so the parameter count is independent of Steps (the "depth for free" property). Each step scales f's output by h (TNNetMulByConstant) and adds it residually (TNNetSum). An optional 4-arg overload AddNeuralODEBlock(InputLayer, HiddenDim, Steps, Method) selects the integrator via TNNetODEMethod = (odeEuler, odeMidpoint) — the midpoint (RK2) method evaluates the same shared f twice per step (k1 = f(y); k2 = f(y + (h/2)*k1); y := y + h*k2), improving accuracy with no extra parameters (the 3-arg form stays Euler, bit-for-bit unchanged). Training uses ordinary stored-activation backprop through the unrolled steps; the adjoint-sensitivity O(1)-memory backward is a logged follow-up. See examples/NeuralODE/ (which also renders an ASCII trajectory visualisation of the learned flow untangling two interleaving half-moons).
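The difference between the Euler and midpoint integrators can be seen on a toy problem. The standalone Free Pascal sketch below integrates dy/dt = f(y) over Steps sub-steps with h = 1/Steps, using the update rules quoted above. The vector field `Field(y) = -y` is a hypothetical stand-in for the learned f (its exact solution at t = 1 is y0·exp(-1)), and `TMethodSketch` only mirrors the idea of TNNetODEMethod; it is not the library's type.

```pascal
program NeuralODEStepSketch;
{$mode objfpc}{$H+}

type
  TMethodSketch = (smEuler, smMidpoint);  // illustrative, not TNNetODEMethod

// Hypothetical stand-in for the shared learned function f.
function Field(y: Double): Double;
begin
  Result := -y;
end;

// Integrates dy/dt = Field(y) from t = 0 to t = 1 in Steps sub-steps.
function Integrate(y0: Double; Steps: Integer; Method: TMethodSketch): Double;
var
  n: Integer;
  h, y, k1, k2: Double;
begin
  h := 1.0 / Steps;
  y := y0;
  for n := 1 to Steps do
    if Method = smEuler then
      y := y + h * Field(y)
    else
    begin
      k1 := Field(y);
      k2 := Field(y + 0.5 * h * k1);
      y := y + h * k2;
    end;
  Result := y;
end;

begin
  WriteLn('Euler    : ', Integrate(1.0, 4, smEuler):0:6);
  WriteLn('Midpoint : ', Integrate(1.0, 4, smMidpoint):0:6);
  WriteLn('Exact    : ', Exp(-1.0):0:6);
end.
```

Both methods call the same `Field`, so switching integrators changes the number of function evaluations per step but not the parameter count.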
Composite helper: TNNet.AddDeepEquilibriumBlock(InputLayer, HiddenDim, MaxIters) wires a Deep Equilibrium block (Bai/Kolter/Koltun 2019) — the implicit cousin of AddNeuralODEBlock. Where Neural-ODE unrolls a fixed number of explicit Euler steps, a DEQ defines its output as the fixed point z* = f(z*; x) of a shape-preserving weight-tied map f, found by iterating z := f(z+x) from z_0 = 0 until the residual ||z_{k+1}-z_k|| falls below tolerance or a MaxIters cap (a data-dependent "adaptive depth", parameter count independent of the iteration count). The forward runs a damped, output-bounded Picard iteration; the backward is the tractable jacobian-free phantom gradient (Geng et al. 2021) — all iterates except the last are detached, so gradients flow through only the final f application (the exact implicit-function-theorem gradient is a logged follow-up). f reuses one TNNetDeepEquilibriumSharedConv (a weight-tied conv that rebuilds its cache each forward so every application is byte-identical, as a true fixed point requires). See examples/DeepEquilibrium/.
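The forward fixed-point search can be sketched on a scalar. The following standalone Free Pascal example runs a damped Picard iteration z := (1 - Damping)·z + Damping·f(z + x) from z_0 = 0 and stops when the step falls below a tolerance or the iteration cap is reached. It is an illustration only: `Cell` is a hypothetical bounded, contractive stand-in for the weight-tied map, the damping scheme is one common choice and may differ from the library's, and the phantom-gradient backward is not shown.

```pascal
program DeepEquilibriumSketch;
{$mode objfpc}{$H+}

uses
  Math;

type
  TSolve = record
    z: Double;       // fixed-point estimate
    iters: Integer;  // iterations actually used
  end;

// Hypothetical weight-tied map: bounded output, Lipschitz constant 0.5.
function Cell(u: Double): Double;
begin
  Result := 0.5 * Tanh(u);
end;

function SolveFixedPoint(x: Double; MaxIters: Integer;
  Tol, Damping: Double): TSolve;
var
  zNext: Double;
  converged: Boolean;
begin
  Result.z := 0.0;
  Result.iters := 0;
  converged := False;
  while (Result.iters < MaxIters) and not converged do
  begin
    zNext := (1.0 - Damping) * Result.z + Damping * Cell(Result.z + x);
    converged := Abs(zNext - Result.z) < Tol;
    Result.z := zNext;
    Inc(Result.iters);
  end;
end;

var
  r: TSolve;
begin
  r := SolveFixedPoint(1.0, 200, 1e-10, 0.5);
  WriteLn('z*       = ', r.z:0:8);
  WriteLn('f(z*+x)  = ', Cell(r.z + 1.0):0:8);
  WriteLn('iters    = ', r.iters);
end.
```

The number of iterations depends on the input, which is the "adaptive depth" behaviour, while `Cell` has the same parameters no matter how many times it is applied.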
Composite helper: TNNet.AddPonderNetBlock(InputLayer, HiddenDim, MaxSteps, PriorLambda) wires a PonderNet block (Banino, Balaguer & Blundell 2021) — a learned probabilistic-halting adaptive-computation paradigm, distinct from the implicit fixed-point AddDeepEquilibriumBlock and from any fixed-depth stack. A weight-tied step f is applied up to MaxSteps times (h_n = h_{n-1} + f(h_{n-1}), parameter count independent of MaxSteps via TNNetDeepEquilibriumSharedConv weight sharing); a shared tiny halting head TNNetPonderHalting emits lambda_n = sigmoid(...) in (0,1) per step, giving the geometric halting distribution p_n = lambda_n * prod_{k<n}(1-lambda_k) (the last step forces lambda=1 so the p_n sum to 1). The block output is the smooth p_n-weighted sum of the per-step states — no hard argmax, so the block is differentiable end-to-end. The companion TNNetPonderCostLoss head adds KL(p || truncated-geometric(prior_lambda)), a "pay to ponder" regularizer that pulls the expected step count toward the prior's mean so the model only spends extra steps where the task forces it. Inference always unrolls MaxSteps (static shapes). See examples/PonderNet/.
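The halting distribution can be computed directly from the per-step lambdas. The standalone Free Pascal sketch below implements p_n = lambda_n · prod_{k<n}(1 - lambda_k) with the last step forced to lambda = 1, and also reports the expected number of steps. `HaltingProbs` and `ExpectedSteps` are hypothetical helper names, not library functions.

```pascal
program PonderHaltingSketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

// p_n = lambda_n * prod_{k<n} (1 - lambda_k); the last step uses lambda = 1
// so that the probabilities sum to 1.
function HaltingProbs(const lambda: array of Double): TVec;
var
  n: Integer;
  stay, lam: Double;
begin
  SetLength(Result, Length(lambda));
  stay := 1.0;  // probability of not having halted before step n
  for n := 0 to High(lambda) do
  begin
    if n = High(lambda) then
      lam := 1.0
    else
      lam := lambda[n];
    Result[n] := lam * stay;
    stay := stay * (1.0 - lam);
  end;
end;

// Expected halting step, with steps numbered from 1.
function ExpectedSteps(const p: array of Double): Double;
var
  n: Integer;
begin
  Result := 0.0;
  for n := 0 to High(p) do
    Result := Result + (n + 1) * p[n];
end;

var
  p: TVec;
  n: Integer;
begin
  p := HaltingProbs([0.5, 0.5, 0.5]);
  for n := 0 to High(p) do
    WriteLn('p[', n + 1, '] = ', p[n]:0:4);
  WriteLn('expected steps = ', ExpectedSteps(p):0:4);
end.
```

With three steps and lambda = 0.5 everywhere, the probabilities are 0.5, 0.25 and 0.25: the final step absorbs all remaining mass. The block output would then be the sum of the per-step states weighted by these p_n.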
Composite helper: TNNet.AddFiLMConditioned(featLayer, condLayer) wires Feature-wise Linear Modulation in one call: condLayer -> TNNetFullConnectLinear(2*D) -> reshape(1,1,2*D) -> TNNetFiLM([featLayer, cond]), inferring D = featLayer.Output.Depth. It removes the manual Depth -> 2*Depth bookkeeping every FiLM call site repeats and mirrors the AddPreNormResidual/AddGatedResidual builder family. See examples/FiLMConditioning/.
Composite helper: TNNet.AddAffineBlock wires a learnable per-channel affine transform y[d] = gamma[d]*x[d] + beta[d] in one call (TNNetChannelMul -> TNNetChannelBias), separable from FullConnect. It starts as the exact identity (gamma=1, beta=0), so it can be inserted into a frozen network and fine-tuned cheaply (BitFit-style adaptation). See examples/AffineFineTune/.
Composite helper: TNNet.AddLoRAAdapter(FrozenLayer, Rank, Alpha=1.0) wires a LoRA low-rank adapter (Hu et al. 2021) in one call: a rank-r bypass down: TNNetPointwiseConvLinear(Rank) -> up: TNNetPointwiseConvLinear(d_out) is built from the input feeding FrozenLayer, scaled by Alpha/Rank, and added residually to FrozenLayer's output. The up projection is zero-initialised, so the adapter is the exact identity perturbation at step 0 — the frozen base's output is bit-for-bit unchanged on the first forward. Freeze the base (per-layer LearningRate := 0, BitFit-style) and train only the adapter for parameter-efficient fine-tuning. NOTE: the builder zeros the up weights at construction, so do not call NN.InitWeights() afterwards (it would re-randomise them). See examples/LoRAFineTune/.
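The "exact identity perturbation at step 0" property follows from the zero-initialised up projection, and can be checked numerically. The sketch below is standalone Free Pascal using flat row-major arrays; it is not the AddLoRAAdapter implementation. `LoRAForward` and its argument layout (W is d_out×d_in, A is Rank×d_in, B is d_out×Rank) are assumptions made for this example.

```pascal
program LoRASketch;
{$mode objfpc}{$H+}

type
  TVec = array of Double;

// y = W*x + (Alpha/Rank) * B*(A*x), all matrices flat and row-major.
// Hypothetical layout: W is dOut x dIn, A is Rank x dIn, B is dOut x Rank.
function LoRAForward(const W, A, B, x: array of Double; Alpha: Double): TVec;
var
  dIn, dOut, rank, i, o, r: Integer;
  u: TVec;
  acc: Double;
begin
  dIn := Length(x);
  dOut := Length(W) div dIn;
  rank := Length(A) div dIn;
  // Down projection: u = A*x
  SetLength(u, rank);
  for r := 0 to rank - 1 do
  begin
    u[r] := 0.0;
    for i := 0 to dIn - 1 do
      u[r] := u[r] + A[r * dIn + i] * x[i];
  end;
  // Frozen base plus scaled up projection
  SetLength(Result, dOut);
  for o := 0 to dOut - 1 do
  begin
    acc := 0.0;
    for i := 0 to dIn - 1 do
      acc := acc + W[o * dIn + i] * x[i];
    for r := 0 to rank - 1 do
      acc := acc + (Alpha / rank) * B[o * rank + r] * u[r];
    Result[o] := acc;
  end;
end;

var
  y: TVec;
begin
  // Zero-initialised up projection B: output equals the frozen base W*x.
  y := LoRAForward([1.0, 0.0, 0.0, 1.0], [0.3, 0.7], [0.0, 0.0], [2.0, 5.0], 8.0);
  WriteLn('step 0 : ', y[0]:0:4, ' ', y[1]:0:4);
  // After training moves B away from zero, the bypass contributes.
  y := LoRAForward([1.0, 0.0, 0.0, 1.0], [0.3, 0.7], [1.0, 0.0], [2.0, 5.0], 8.0);
  WriteLn('later  : ', y[0]:0:4, ' ', y[1]:0:4);
end.
```

With B all zeros, the bypass term vanishes regardless of A and Alpha, which is why re-randomising the up weights after construction would break the identity start.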
TNNetEmbedding converts input tokens (usually represented as integers) into dense vector representations (embedding vectors). TNNetTokenAndPositionalEmbedding extends TNNetEmbedding by adding positional information to the token embeddings, which transformer models need because they have no inherent notion of sequence order. Together, these layers let the network work with text data by turning tokens into vectors that capture both semantic meaning and position. TNNetTokenAndPositionalEmbedding provides both the embedding and the positional encoding in a single layer.
To illustrate how these layers might be used in practice, let's consider a simple example. Suppose you're building a language model for text generation. You could use these layers:
FNN.AddLayer([
TNNetInput.Create(csContextLen, 1, 1),
TNNetTokenAndPositionalEmbedding.Create(csModelVocabSize, csEmbedDim, {EncodeZero=}1, {ScaleEmbedding=}0.02, {ScalePositional=}0.01)
]);
for I := 1 to 2 do FNN.AddTransformerBlockCAI({Heads=}8, {IntermediateDim=}2048, {NoForward=}true, {HasNorm=}false);
FNN.AddLayer([
TNNetPointwiseConvLinear.Create(csModelVocabSize),
TNNetPointwiseSoftMax.Create({SkipBackpropDerivative=}1)
]);
The above example resembles a simplified version of models like GPT (Generative Pre-trained Transformer). It is designed for sequence tasks such as text generation. Token and positional embeddings followed by transformer blocks are a standard approach in modern NLP models. The final pointwise convolution and softmax layers produce a probability distribution over the vocabulary, as is common in language models. With only 2 transformer blocks, this is a lightweight model. The choice of embedding dimension, number of heads and intermediate dimension depends on the requirements of the task and on computational constraints.
| Layer Name | Input/Output Dimensions | Description |
|---|---|---|
| TNNetEmbedding | Input: 1D integer tokens. Output: 2D (sequence_length x embedding_size). | Converts input tokens into dense vector representations. Parameters include vocabulary size, embedding size, scaling factor, and whether to encode zero. Allows for training of embedding weights through backpropagation. |
| TNNetTokenAndPositionalEmbedding | Input: 1D integer tokens. Output: 2D (sequence_length x embedding_size) | Extends TNNetEmbedding by adding positional information to token embeddings. This layer is crucial for transformer architectures. |
| TNNetSinusoidalTimeEmbedding | Input: 1x1x1 scalar timestep t. Output: 1x1xEmbeddingSize. | DDPM-style scalar-timestep encoder (Ho et al. 2020, https://arxiv.org/abs/2006.11239): emb[i]=sin(t*freq[i]), emb[half+i]=cos(t*freq[i]) with freq[i]=exp(-ln(MaxPeriod)*i/half). Distinct from TNNetSinusoidalPositionalEmbedding, which is the additive Vaswani encoding on the sequence (X) axis. No learnable parameters; backward is a no-op in v1. Created with TNNetSinusoidalTimeEmbedding.Create(EmbeddingSize, MaxPeriod=10000) (EmbeddingSize must be even). |
| TNNetFourierFeatures | Input: 1D/2D/3D coordinate vector (Depth = D_in). Output: 1x1x(2*M). | Fixed (non-trainable) random Fourier-feature coordinate embedding (Rahimi & Recht 2007; Tancik et al. 2020, https://arxiv.org/abs/2006.10739): maps x through a frozen Gaussian frequency matrix B ~ N(0, sigma^2) of shape D_in x M and outputs [cos(2*pi*B^T x), sin(2*pi*B^T x)] along depth. Lets a plain coordinate-MLP fit high-frequency detail (overcomes spectral bias); sigma sets the frequency bandwidth. B is sampled once from a seeded RNG and serialized, so save/load reproduces the exact mapping. No parameter gradient (only input gradient flows). Created with TNNetFourierFeatures.Create(M, sigma, Seed=0). |
| TNNetRandomFourierFeatures | Input: 1D/2D/3D vector (Depth = D_in). Output: 1x1x(2*D). | RBF/Gaussian-kernel random features (Rahimi & Recht 2007, Random Features for Large-Scale Kernel Machines): maps x through a frozen Gaussian projection W ~ N(0, 1/sigma^2) of shape D x D_in and outputs phi(x) = sqrt(1/D) * [cos(W x), sin(W x)] along depth, so that <phi(x),phi(y)> ≈ exp(-‖x-y‖²/(2·sigma²)) — i.e. a downstream linear layer over phi(x) approximates an RBF-kernel machine (the kernel-method family otherwise absent from this repo). Mathematically distinct from the learnable-FFT layers (TNNetFourierMix, TNNetSpectralConv1D/2D, TNNetCirculantLinear FFT path) — a fixed random Gaussian projection approximating a shift-invariant kernel, not a transform along the signal axis — and from TNNetFourierFeatures (Tancik-style coordinate embedding: 2π factor, no sqrt(1/D) kernel scale, no trainable mode). W is sampled from a seeded RNG and serialized so save/load reproduces the exact map. Frozen by default (classic RFF — only the input gradient flows); pass pTrainable<>0 for the deep-kernel-learning variant that also accumulates dL/dW (sigma stays fixed). Created with TNNetRandomFourierFeatures.Create(D, sigma=1.0, Trainable=0, Seed=0). See examples/RandomFourierFeatures/. |
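The TNNetSinusoidalTimeEmbedding formula in the table can be reproduced in a few lines. This is a library-free sketch of emb[i]=sin(t*freq[i]), emb[half+i]=cos(t*freq[i]) with freq[i]=exp(-ln(MaxPeriod)*i/half). The function name SinusoidalTimeEmbedding and the sample timestep are hypothetical; the layer itself may differ in details such as numeric type.

```pascal
program TimeEmbeddingSketch;
{$mode objfpc}{$H+}

type
  TFloatArray = array of Double;

// Encodes a scalar timestep T as EmbeddingSize sin/cos features.
// EmbeddingSize must be even: the first half holds sines, the second cosines.
function SinusoidalTimeEmbedding(T: Double; EmbeddingSize: Integer;
  MaxPeriod: Double): TFloatArray;
var
  I, Half: Integer;
  Freq: Double;
begin
  Half := EmbeddingSize div 2;
  SetLength(Result, EmbeddingSize);
  for I := 0 to Half - 1 do
  begin
    // Frequencies decay geometrically from 1 down towards 1/MaxPeriod.
    Freq := Exp(-Ln(MaxPeriod) * I / Half);
    Result[I] := Sin(T * Freq);
    Result[Half + I] := Cos(T * Freq);
  end;
end;

var
  Emb: TFloatArray;
  I: Integer;
begin
  Emb := SinusoidalTimeEmbedding(25.0, 8, 10000.0);
  for I := 0 to High(Emb) do
    WriteLn('emb[', I, '] = ', Emb[I]:0:6);
end.
```

At t = 0 every sine feature is 0 and every cosine feature is 1, which is a quick sanity check for any implementation of this encoding.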
| Layer Name | Input/Output Dimensions | Activation | Description |
|---|---|---|---|
| TNNetDeLocalConnect | 1D, 2D, or 3D | tanh | Opposing operation to TNNetLocalConnect. |
| TNNetDeLocalConnectReLU | 1D, 2D, or 3D | ReLU | Opposing operation to TNNetLocalConnectReLU. |
| TNNetDeconvolution | 1D, 2D, or 3D | tanh | Opposing operation to TNNetConvolution, also known as transposed convolution. |
| TNNetDeconvolutionReLU | 1D, 2D, or 3D | ReLU | Opposing operation to convolution with ReLU activation (TNNetConvolutionReLU). |
| TNNetDeMaxPool | 1D, 2D, or 3D | None | Opposing operation to max pooling layer TNNetMaxPool. |
This API implements popular weight initialization methods including He (Kaiming) and Glorot/Bengio (Xavier):
- InitUniform(Value: TNeuralFloat = 1).
- InitLeCunUniform(Value: TNeuralFloat = 1).
- InitHeUniform(Value: TNeuralFloat = 1).
- InitHeUniformDepthwise(Value: TNeuralFloat = 1).
- InitHeGaussian(Value: TNeuralFloat = 0.5).
- InitHeGaussianDepthwise(Value: TNeuralFloat = 0.5).
- InitGlorotBengioUniform(Value: TNeuralFloat = 1).
- InitSELU(Value: TNeuralFloat = 1).
procedure FlipX();
procedure FlipY();
procedure CopyCropping(Original: TVolume; StartX, StartY, pSizeX, pSizeY: integer);
procedure CopyResizing(Original: TVolume; NewSizeX, NewSizeY: integer);
procedure AddGaussianNoise(pMul: TNeuralFloat);
procedure AddSaltAndPepper(pNum: integer; pSalt: integer = 2; pPepper: integer = -2);
| NEURAL | Keras | PyTorch |
|---|---|---|
| TNNetFullConnect | layers.Dense(activation='tanh') | nn.Linear nn.Tanh() |
| TNNetFullConnectReLU | layers.Dense(activation='relu') | nn.Linear nn.ReLU() |
| TNNetFullConnectLinear | layers.Dense(activation=None) | nn.Linear |
| TNNetFullConnectSigmoid | layers.Dense(activation='sigmoid') | nn.Linear nn.Sigmoid() |
| TNNetReLU | activations.relu | nn.ReLU() |
| TNNetLeakyReLU | activations.relu(alpha=0.01) | nn.LeakyReLU(0.01) |
| TNNetVeryLeakyReLU | activations.relu(alpha=1/3) | nn.LeakyReLU(1/3) |
| TNNetReLUSqrt | | |
| TNNetSELU | activations.selu | nn.SELU |
| TNNetSigmoid | activations.sigmoid | nn.Sigmoid |
| TNNetSoftMax | activations.softmax | nn.Softmax |
| TNNetHyperbolicTangent | activations.tanh | nn.Tanh |
| TNNetPower | | |
| TNNetAvgPool | layers.AveragePooling2D | nn.AvgPool2d |
| TNNetMaxPool | layers.MaxPool2D | nn.MaxPool2d |
| TNNetMaxPoolPortable | layers.MaxPool2D | nn.MaxPool2d |
| TNNetMinPool | | |
| TNNet.AddMinMaxPool | | |
| TNNet.AddAvgMaxPool | | |
| TNNetAvgChannel | layers.GlobalAveragePooling2D | nn.AvgPool2d |
| TNNetMaxChannel | layers.GlobalMaxPool2D | nn.MaxPool2d |
| TNNetGlobalSumPool | | |
| TNNetMinChannel | | |
| TNNet.AddMinMaxChannel | | |
| TNNet.AddAvgMaxChannel | cai.layers.GlobalAverageMaxPooling2D | |
| TNNetConcat | layers.Concatenate(axis=1) | torch.cat |
| TNNetDeepConcat | layers.Concatenate(axis=3) | torch.cat |
| TNNetIdentity | | nn.Identity |
| TNNetIdentityWithoutBackprop | | |
| TNNetReshape | layers.Reshape | torch.reshape |
| TNNetSplitChannels | cai.layers.CopyChannels | |
| TNNetSplitChannelEvery | | |
| TNNetSum | layers.Add | torch.add |
| TNNetCellMulByCell | layers.Multiply | |
| TNNetChannelMulByLayer | layers.Multiply | |
| TNNetUpsample | tf.nn.depth_to_space | |
You can add layers one by one, or you can add an array of layers in one go. The following example adds layers one by one:
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));
The next example shows how to add an array of layers that is equivalent to the above example:
NN.AddLayer([
TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1),
TNNetMaxPool.Create(2)
]);
Since 2017, this API has supported multi-path architectures. You can create multiple paths with the AddLayerAfter method. To concatenate (merge) paths, you can use either TNNetConcat or TNNetDeepConcat. For example:
// Creates The Neural Network
NN := TNNet.Create();
// This network splits into 2 paths and then is later concatenated
InputLayer := NN.AddLayer(TNNetInput.Create(32, 32, 3));
// First branch starting from InputLayer (5x5 features)
NN.AddLayerAfter(TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}5, {Padding=}2, {Stride=}1), InputLayer);
NN.AddLayer(TNNetMaxPool.Create(2));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));
EndOfFirstPath := NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1));
// Another branch starting from InputLayer (3x3 features)
NN.AddLayerAfter(TNNetConvolutionReLU.Create({Features=}16, {FeatureSize=}3, {Padding=}1, {Stride=}1), InputLayer);
NN.AddLayer(TNNetMaxPool.Create(2));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
NN.AddLayer(TNNetMaxPool.Create(2));
EndOfSecondPath := NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
// Concats both branches into one branch.
NN.AddLayer(TNNetDeepConcat.Create([EndOfFirstPath, EndOfSecondPath]));
NN.AddLayer(TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1));
NN.AddLayer(TNNetLayerFullConnectReLU.Create(64));
NN.AddLayer(TNNetLayerFullConnectReLU.Create(NumClasses));
These source code examples show AddLayerAfter:
- DenseNetBC L40
- Identity Shortcut Connection - ResNet building block
You can find more about multi-path architectures at:
These datasets can be easily loaded:
procedure CreateCifar10Volumes(out ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList);
Source code example: Simple CIFAR-10 Image Classifier
procedure CreateCifar100Volumes(out ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList);
Source code example: CAI Optimized DenseNet CIFAR-100 Image Classifier
procedure CreateMNISTVolumes(out ImgTrainingVolumes, ImgValidationVolumes,
ImgTestVolumes: TNNetVolumeList;
TrainFileName, TestFileName: string;
Verbose:boolean = true;
IsFashion:boolean = false);
Source code examples:
In the case that your dataset has one class per folder, you can call CreateVolumesFromImagesFromFolder for loading your data into RAM:
// change ProportionToLoad to a smaller number if you don't have enough RAM.
ProportionToLoad := 1;
WriteLn('Loading ', Round(ProportionToLoad*100), '% of the Plant leave disease dataset into memory.');
CreateVolumesFromImagesFromFolder
(
ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes,
{FolderName=}'plant', {pImageSubFolder=}'',
{color_encoding=}csRGB{RGB},
{TrainingProp=}0.9*ProportionToLoad,
{ValidationProp=}0.05*ProportionToLoad,
{TestProp=}0.05*ProportionToLoad,
{NewSizeX=}128, {NewSizeY=}128
);
The example above loads the dataset with 90% of the images used for training and 5% each for validation and testing. Images are resized to 128x128.
Source code examples:
- Simple Plant Leaf Disease Image Classifier for the PlantVillage Dataset
- Colorectal Cancer Dataset Image Classifier
- Malaria Dataset Image Classifier
- Tiny ImageNet 200
In the case that your image classification dataset is too big to be stored in RAM, you can follow this example:
FTrainingFileNames, FValidationFileNames, FTestFileNames: TFileNameList;
...
ProportionToLoad := 1;
CreateFileNameListsFromImagesFromFolder(
FTrainingFileNames, FValidationFileNames, FTestFileNames,
{FolderName=}'places_folder/train', {pImageSubFolder=}'',
{TrainingProp=}0.9*ProportionToLoad,
{ValidationProp=}0.05*ProportionToLoad,
{TestProp=}0.05*ProportionToLoad
);
Then, you can call a fitting method designed specifically for this case:
NeuralFit := TNeuralImageLoadingFit.Create;
...
NeuralFit.FitLoading({NeuralNetworkModel}NN, {ImageSizeX}256, {ImageSizeY}256, FTrainingFileNames, FValidationFileNames, FTestFileNames, {BatchSize}256, {Epochs}100);
TNeuralImageLoadingFit.FitLoading has been tested with Places365-Standard Small images 256x256 with easy directory structure.
You can follow this example:
When loading an image from a file, the easiest and fastest method is calling LoadImageFromFileIntoVolume(ImageFileName:string; V:TNNetVolume). When loading from a TFPMemoryImage, you can call LoadImageIntoVolume(M: TFPMemoryImage; Vol:TNNetVolume). For saving an image, the fastest method is SaveImageFromVolumeIntoFile(V: TNNetVolume; ImageFileName: string).
The easiest way to train your neural network is utilizing unit neuralfit.pas. Inside this unit, you’ll find the class TNeuralImageFit that is used by many examples.
TNeuralImageFit has been designed for image classification tasks and can be called as follows:
procedure Fit(pNN: TNNet;
pImgVolumes, pImgValidationVolumes, pImgTestVolumes: TNNetVolumeList;
pNumClasses, pBatchSize, Epochs: integer);
Each volume's Tag property should contain the corresponding class. TNeuralImageFit internally implements data augmentation techniques: flipping, converting to grayscale, cropping and resizing. These techniques can be controlled with:
property HasImgCrop: boolean read FHasImgCrop write FHasImgCrop;
property HasMakeGray: boolean read FHasMakeGray write FHasMakeGray;
property HasFlipX: boolean read FHasFlipX write FHasFlipX;
property HasFlipY: boolean read FHasFlipY write FHasFlipY;
property MaxCropSize: integer read FMaxCropSize write FMaxCropSize;
Once you have a trained neural network, you can use an advanced classification procedure that will average the classification probability of the input image with its flipped and cropped versions. This process frequently gives a higher classification accuracy at the expense of internally running the very same neural network a number of times. This is how you can classify images:
procedure ClassifyImage(pNN: TNNet; pImgInput, pOutput: TNNetVolume);
In the case that you would like to look into TNeuralImageFit in more detail, the Simple CIFAR-10 Image Classifier example is a good starting point.
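As a rough illustration of how the pieces above fit together, the sketch below loads CIFAR-10, switches the augmentation properties and calls Fit with the signature shown earlier. The network layout and hyperparameter values are arbitrary choices for illustration, not a recommended configuration, and the program assumes the CIFAR-10 files are available in the working directory and that the neural folder is on the unit search path.

```pascal
program ImageFitSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  Classes, SysUtils, neuralnetwork, neuralvolume, neuraldatasets, neuralfit;

var
  NN: TNNet;
  NeuralFit: TNeuralImageFit;
  ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes: TNNetVolumeList;
begin
  // Loads the three CIFAR-10 splits into RAM.
  CreateCifar10Volumes(ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes);

  // A deliberately small network; the layer sizes are illustrative only.
  NN := TNNet.Create();
  NN.AddLayer([
    TNNetInput.Create(32, 32, 3),
    TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}5, {Padding=}2, {Stride=}1),
    TNNetMaxPool.Create(2),
    TNNetConvolutionReLU.Create({Features=}64, {FeatureSize=}3, {Padding=}1, {Stride=}1),
    TNNetMaxPool.Create(2),
    TNNetFullConnectLinear.Create(10),
    TNNetSoftMax.Create()
  ]);

  NeuralFit := TNeuralImageFit.Create;
  // Data augmentation switches listed above.
  NeuralFit.HasFlipX := true;
  NeuralFit.HasFlipY := false;
  NeuralFit.HasImgCrop := false;
  NeuralFit.HasMakeGray := false;
  // Training properties inherited from TNeuralFitBase.
  NeuralFit.InitialLearningRate := 0.001;
  NeuralFit.LearningRateDecay := 0.01;
  NeuralFit.Inertia := 0.9;
  NeuralFit.L2Decay := 0;

  NeuralFit.Fit(NN, ImgTrainingVolumes, ImgValidationVolumes, ImgTestVolumes,
    {NumClasses=}10, {BatchSize=}64, {Epochs=}50);
  WriteLn('Test accuracy: ', NeuralFit.TestAccuracy:0:4);

  NeuralFit.Free;
  NN.Free;
  ImgTestVolumes.Free;
  ImgValidationVolumes.Free;
  ImgTrainingVolumes.Free;
end.
```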
In the case that your training, validation and testing data can be defined as volume pairs from input volume to output volume, the easiest way to train your neural network will be calling TNeuralFit. This class has the following fitting method:
procedure Fit(pNN: TNNet;
pTrainingVolumes, pValidationVolumes, pTestVolumes: TNNetVolumePairList;
pBatchSize, Epochs: integer);
Both the "AND, OR and XOR with neuralfit unit" example and the hypotenuse function example load volume pair lists for training.
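A hedged sketch of the volume-pair workflow follows, modelled loosely on the hypotenuse idea. It assumes that TNNetVolume can be constructed from an array of values, that TNNetVolumePair.Create takes an input and an output volume, and that TNeuralFit exposes the TNeuralFitBase properties; none of these are shown in this section, so please check them against the units before relying on this. The helper CreateHypotenusePairList and all sizes are illustrative.

```pascal
program VolumePairFitSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  Classes, SysUtils, neuralnetwork, neuralvolume, neuralfit;

// Builds MaxCnt (input, target) pairs: input = (X, Y), target = hypotenuse.
function CreateHypotenusePairList(MaxCnt: integer): TNNetVolumePairList;
var
  Cnt: integer;
  LocalX, LocalY, Hypotenuse: TNeuralFloat;
begin
  Result := TNNetVolumePairList.Create();
  for Cnt := 1 to MaxCnt do
  begin
    LocalX := Random(100);
    LocalY := Random(100);
    Hypotenuse := Sqrt(LocalX * LocalX + LocalY * LocalY);
    // Assumed constructors: volume from array, pair from two volumes.
    Result.Add(
      TNNetVolumePair.Create(
        TNNetVolume.Create([LocalX, LocalY]),
        TNNetVolume.Create([Hypotenuse])
      )
    );
  end;
end;

var
  NN: TNNet;
  NFit: TNeuralFit;
  TrainingPairs, ValidationPairs, TestPairs: TNNetVolumePairList;
begin
  TrainingPairs := CreateHypotenusePairList(10000);
  ValidationPairs := CreateHypotenusePairList(1000);
  TestPairs := CreateHypotenusePairList(1000);

  NN := TNNet.Create();
  NN.AddLayer([
    TNNetInput.Create(2),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectReLU.Create(32),
    TNNetFullConnectLinear.Create(1)
  ]);

  NFit := TNeuralFit.Create();
  NFit.InitialLearningRate := 0.00001;
  NFit.LearningRateDecay := 0;
  NFit.L2Decay := 0;
  NFit.Fit(NN, TrainingPairs, ValidationPairs, TestPairs,
    {BatchSize=}32, {Epochs=}50);

  NFit.Free;
  NN.Free;
  TestPairs.Free;
  ValidationPairs.Free;
  TrainingPairs.Free;
end.
```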
The TNeuralFit implementation has a limitation: your dataset needs to be placed into RAM. In the case that your dataset is too large for RAM, you can call TNeuralDataLoadingFit:
TNNetGetPairFn = function(Idx: integer; ThreadId: integer): TNNetVolumePair of object;
TNNetGet2VolumesProc = procedure(Idx: integer; ThreadId: integer; pInput, pOutput: TNNetVolume) of object;
TNeuralDataLoadingFit = class(TNeuralFitBase)
...
procedure FitLoading(pNN: TNNet;
TrainingCnt, ValidationCnt, TestCnt, pBatchSize, Epochs: integer;
pGetTrainingPair, pGetValidationPair, pGetTestPair: TNNetGetPairFn); overload;
procedure FitLoading(pNN: TNNet;
TrainingCnt, ValidationCnt, TestCnt, pBatchSize, Epochs: integer;
pGetTrainingProc, pGetValidationProc, pGetTestProc: TNNetGet2VolumesProc); overload;
The Hypotenuse with FitLoading example uses TNeuralDataLoadingFit so it creates training pairs on the fly.
TNeuralImageFit and TNeuralDataLoadingFit both descend from TNeuralFitBase. From TNeuralFitBase, you can define training properties:
property Inertia: single read FInertia write FInertia;
property InitialEpoch: integer read FInitialEpoch write FInitialEpoch;
property InitialLearningRate: single read FInitialLearningRate write FInitialLearningRate;
property LearningRateDecay: single read FLearningRateDecay write FLearningRateDecay;
property CyclicalLearningRateLen: integer read FCyclicalLearningRateLen write FCyclicalLearningRateLen;
property Momentum: single read FInertia write FInertia;
property L2Decay: single read FL2Decay write FL2Decay;
property FileNameBase: string read FFileNameBase write FFileNameBase;
You can also collect current statistics:
property CurrentEpoch: integer read FCurrentEpoch;
property CurrentStep: integer read FCurrentStep;
property CurrentLearningRate: single read FCurrentLearningRate;
property TestAccuracy: TNeuralFloat read FTestAccuracy;
property TrainingAccuracy: TNeuralFloat read FTrainingAccuracy;
property Running: boolean read FRunning;
Some events are available:
property OnStart: TNotifyEvent read FOnStart write FOnStart;
property OnAfterStep: TNotifyEvent read FOnAfterStep write FOnAfterStep;
property OnAfterEpoch: TNotifyEvent read FOnAfterEpoch write FOnAfterEpoch;
You can define your own learning rate schedule:
property CustomLearningRateScheduleFn: TCustomLearningRateScheduleFn read FCustomLearningRateScheduleFn write FCustomLearningRateScheduleFn;
property CustomLearningRateScheduleObjFn: TCustomLearningRateScheduleObjFn read FCustomLearningRateScheduleObjFn write FCustomLearningRateScheduleObjFn;
TNeuralFitBase descends from TMObject, which lets you handle messages in your own code:
property MessageProc: TGetStrProc read FMessageProc write FMessageProc;
property ErrorProc: TGetStrProc read FErrorProc write FErrorProc;
In your own code, you could do something like this:
MyFit.MessageProc := {$IFDEF FPC}@{$ENDIF}Self.MessageProc;
MyFit.ErrorProc := {$IFDEF FPC}@{$ENDIF}Self.ErrorProc;
If you don’t need any message at all, you can hide messages by calling:
procedure HideMessages();
You can also disable fitting verbosity with:
property Verbose: boolean read FVerbose write FVerbose;
Your code will look like this:
NeuralFit := TNeuralImageFit.Create;
...
NeuralFit.Verbose := false;
NeuralFit.HideMessages();
This API provides easy-to-use, lightweight and platform-independent parallel processing methods.
As an example, assuming that you need to run a procedure 10 times in parallel, you can create 10 thread workers as follows:
FProcs := TNeuralThreadList.Create( 10 );
As an example, this is the procedure that we intend to run in parallel:
procedure MyClassName.RunNNThread(index, threadnum: integer);
begin
WriteLn('This is thread ',index,' out of ',threadnum,' threads.');
end;
Then, to run the procedure RunNNThread passed as parameter 10 times in parallel, do this:
FProcs.StartProc({$IFDEF FPC}@RunNNThread{$ELSE}RunNNThread{$ENDIF});
You can control the blocking mode (waiting for threads to finish before the program continues) as per this declaration:
procedure StartProc(pProc: TNeuralProc; pBlock: boolean = true);
Or, if you prefer, you can specifically say when to wait for threads to finish as per this example:
FProcs.StartProc({$IFDEF FPC}@RunNNThread{$ELSE}RunNNThread{$ENDIF}, false);
// insert your code here
FProcs.WaitForProc(); // waits until all threads are finished.
When you are done, you should call:
FProcs.Free;
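Putting the fragments above together, a complete program might look like the sketch below. It assumes that the threading unit is named neuralthread and that index identifies the worker while threadnum is the worker count; the class TSliceSummer and the summing task are hypothetical. Each worker writes only to its own slot, so no locking is needed.

```pascal
program ThreadListSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  Classes, SysUtils, neuralthread; // unit name is an assumption

const
  WorkerCount = 4;

type
  TSliceSummer = class(TObject)
  public
    Partial: array[0..WorkerCount - 1] of Int64;
    procedure SumSlice(index, threadnum: integer);
  end;

// Each worker sums every threadnum-th integer in 1..1000.
procedure TSliceSummer.SumSlice(index, threadnum: integer);
var
  Slot, I: Integer;
  Acc: Int64;
begin
  // "mod" keeps the slot in range whether index starts at 0 or at 1.
  Slot := index mod threadnum;
  Acc := 0;
  I := Slot + 1;
  while I <= 1000 do
  begin
    Acc := Acc + I;
    Inc(I, threadnum);
  end;
  Partial[Slot] := Acc;
end;

var
  Summer: TSliceSummer;
  Procs: TNeuralThreadList;
  I: Integer;
  Total: Int64;
begin
  Summer := TSliceSummer.Create;
  Procs := TNeuralThreadList.Create(WorkerCount);
  try
    // Non-blocking start, then an explicit wait.
    Procs.StartProc(@Summer.SumSlice, false);
    Procs.WaitForProc();
    Total := 0;
    for I := 0 to WorkerCount - 1 do
      Total := Total + Summer.Partial[I];
    // If every slice ran once, this is the sum of 1..1000.
    WriteLn('Sum of 1..1000 = ', Total);
  finally
    Procs.Free;
    Summer.Free;
  end;
end.
```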
Beyond the runnable examples above, TNNet exposes a family of in-process introspection and diagnostic methods (most are demonstrated by the linked examples). They are grouped here by what they inspect; the linked example carries the full description, sample output and caveats.
- TNNet.PrintSummary / SummaryString: Keras-style table of per-layer index, class, output shape (X,Y,D), param and neuron counts, ending with totals (SummaryString returns it as a string instead of writing to stdout). Used throughout the examples (e.g. ConfusionMatrixReport, GradientNormReport, PerplexityEval).
- TNNet.ToGraphvizDot: emits a Graphviz .dot of the layer DAG (one node per layer, edges following the real graph incl. multi-input TNNetSum / TNNetDeepConcat); render with dot -Tpng net.dot -o net.png. → example
- TNNet.DiffArchitecture(OtherNet) / DiffArchitectureFromString(s): unified-diff-style report of architectural differences between two networks (LCS-aligned so single inserts/removes don't cascade). → example
- TNNet.ReceptiveFieldReport(NN): analytically propagates the receptive-field recurrence through the spatial layers (size, jump, input coverage, global-mixing cut point); no data needed. → example
- TNNet.CountFLOPsPerLayer(NN): per-layer forward-pass FLOP estimate and each layer's share, flagging layer classes the estimator doesn't model. → example
- TNNet.LayerTimingReport(NN, Sample, Iterations): per-layer forward-pass wall-clock cost: mean microseconds/forward and percent of total (ASCII #-bar) measured over Iterations forward passes.

- TNNet.WeightHistogramReport(NN): per-trainable-layer weight statistics and ASCII bar histograms. → example
- TNNet.WeightSpectrumReport(NN): top singular value per layer (power iteration via the reusable TNNet.EstimateSpectralNorm helper); flags rank-1 collapse / high spectral norm. → example
- TNNet.EstimateSpectralRadius(Layer, Iters): power-iteration estimate of a layer weight matrix's spectral radius ρ = |λ|_max (the largest-magnitude eigenvalue, via the v := W·v / ‖W·v‖ recurrence, with no Wᵀ step), the eigenvalue sibling of the singular-value EstimateSpectralNorm. For a non-symmetric matrix ρ ≤ σ_1, so radius is the tighter, correct stability target for recurrent / echo-state reservoirs (scale to ρ_target < 1 directly) while σ_1 (the norm) is the right Lipschitz/clipping bound. Requires a square matrix (returns 0 otherwise). → used in example
- TNNet.WeightSpectralTailReport(NN): label-free HT-SR quality metric: power-law tail exponent alpha per layer plus a network-average weighted-alpha (the WeightWatcher metric). → example

- TNNet.ActivationStatsReport(NN, Samples): per-layer activation distribution statistics; flags near-collapsed / saturating layers. → example
- TNNet.DeadNeuronReport(NN, Samples): dead-unit counts across ReLU-family layers over a probe batch. → example
- TNNet.NeuronCorrelationReport(NN, Samples): intra-layer neuron redundancy: a |rho| histogram, top correlated pairs, and an effective-neuron count. → example
- TNNet.IntrinsicDimensionReport(NN, Probes): per-layer effective dimensionality: PCA participation ratio + the nonlinear TwoNN manifold estimate, side by side. → example
- TNNet.RepresentationSimilarityReport(NN, Probes [, OtherNet]): linear-CKA similarity between every pair of layer activations (rotation/scale-invariant; optional cross-net). → example
- TNNet.RoutingEntropyReport(NN, Probe): batch-level routing diagnostic for a TNNetSoftDecisionTree layer: per-leaf occupancy (is the tree using all 2^D leaves or collapsing onto a few?), average per-gate binary entropy (crisp vs mushy splits), and average per-sample effective-leaf-count exp(−Σ P_l·ln P_l) (decisive vs diffuse routing). The statistical companion to the example's single-point decision path. → example

- TNNet.ConfusionMatrixReport(NN, Samples, NumClasses): confusion matrix + precision/recall/F1 + most-confused pairs + per-class hard-example indices. → example
- TNNet.TopLogitMarginReport(NN, Samples, NumClasses): per-sample top1 − top2 logit margin, per-class stats, and a lowest-margin "hard examples" pool. → example
- TNNet.PerplexityReport(NN, Tokens, ContextLen): cross-entropy, perplexity, bits-per-character, top-k accuracy and worst-K positions for a sequence head. → example
- TNNet.TTAReport(NN, Probes, Labels): test-time-augmentation accuracy over a fixed transform menu vs the clean baseline, with a helps/neutral/hurts verdict. → example
- TNNet.DecisionBoundaryReport(NN, Probes): ASCII argmax map, confidence overlay and boundary-length estimate for a 2-input classifier head. → example
- neuralcalibration unit (separate from the *Report family): CalibrationReport / ComputeCalibration (ECE/MCE, Brier, reliability diagram) and FitTemperature (temperature scaling, never mutating the backbone). → example

- TNNet.GradientNormReport(NN, Input, Target): per-layer ‖dL/dx‖ and ‖dL/dW‖ with vanishing/exploding flags. → example
- TNNet.LossLandscapeProbe(NN, Samples, K, R): loss along a filter-normalised random direction; sharpness scalar + loss-doubling radius. → example
- TNNet.GradientConflictReport(NN, Samples [, UseTrueLabel, LayerIdx]): pairwise per-sample gradient cosines: conflict fraction + per-class-pair mean-cosine matrix. → example
- TNNet.GradientNoiseScaleReport(NN, Samples [, UseTrueLabel, LayerIdx]): gradient signal-to-noise ratio and the simple noise scale B_simple (the critical batch size). → example
- TNNet.NeuralTangentKernelReport(NN, Samples [, TargetClass]): empirical NTK Gram, its eigenspectrum, condition number and kernel-target alignment. → example
- TNNet.HessianCurvatureReport(NN, Samples): loss-surface sharpness: Hessian trace + top eigenvalue via finite-difference Hessian-vector products. → example
- TNNet.LocalLearningCoefficientReport(NN, Samples [, ChainLen, Eps, Gamma]): Local Learning Coefficient (LLC, an empirical RLCT from Singular Learning Theory): effective-dimensionality / degeneracy of the minimum, estimated from a tempered anchored-SGLD chain (LLC_hat = n·β·(mean_chain[L] − L(w*))). Counts flat, degenerate directions the 2nd-order Hessian top eigenvalue is blind to; non-destructive (weights snapshotted and restored). → example
- TNNet.EnableInputGradient: helper that resizes the input layer's error tensors so a backward pass can deposit d(output)/d(input) on Layers[0] (off by default; needed by the saliency, adversarial and effective-RF reports).

- TNNet.AdversarialRobustnessReport(NN, Samples, Labels, EpsList): FGSM accuracy-vs-eps degradation curve, per-sample critical-eps histogram, robust/fragile verdict. → example
- TNNet.MCDropoutUncertaintyReport(NN, Probes [, Labels]): Monte-Carlo-Dropout total / aleatoric / epistemic (BALD) uncertainty, keeping dropout active at inference. → example
- TNNet.EquivarianceReport(NN, Probes): output invariance error under a fixed flip / reverse / roll transform menu, with an invariant/sensitive verdict. → example
- TNNet.EffectiveReceptiveFieldReport(NN, Probes): empirical (gradient-measured) receptive field vs the analytical one, i.e. what a unit actually weights. → example

- TNNet.SaliencyReport(NN, Probe): input-gradient / SmoothGrad / Integrated-Gradients heatmaps for the predicted class (with the IG completeness check). → example
- TNNet.GradCAMReport(NN, Probe [, ConvLayerIdx, ForcedClass]): Grad-CAM (Selvaraju et al. 2017) coarse, class-discriminative conv-feature heatmap for the predicted class, nearest-upsampled to the input plane (complements the fine input-pixel SaliencyReport). → example
- TNNet.LRPReport(NN, Probe [, ForcedClass, TopK, Eps]): Layer-wise Relevance Propagation (Bach et al. 2015): a conservation method (not a gradient one) that back-distributes the explained logit's relevance via the epsilon-rule, printing the per-layer conservation residual, the top-k most-relevant input positions and an input-relevance heatmap (skips attention/norm layers honestly). → example
- TNNet.AttentionEntropyReport(NN, Probes): per-row attention entropy with dead/spike head flags for every TNNetScaledDotProductAttention layer. → example
- TNNet.ActivationPatchingReport(NN, CleanInput, CorruptInput [, TargetIdx]): causal trace: which layer's activations carry the information that decides the prediction. → example
- TNNet.LogitLensReport(NN, pInput [, HeadStartIdx]): re-applies the net's own trained head at each depth (zero new params) to see when the prediction crystallises. → example
- TNNet.TunedLensReport(NN, pInput [, HeadStartIdx, TrainIters, LearningRate]): the learned sibling of the logit lens (Belrose et al. 2023): fits one per-layer affine translator (frozen trunk + head) on the unlabelled probe by KL-to-self, then prints the tuned lens' KL-to-final and entropy side by side with the raw logit-lens columns; the tuned curve commits earlier and tracks the final answer more faithfully (lower KL-to-final). → example
- TNNet.PredictionDepthReport(NN, Support, SupportLabels, Queries [, QueryLabels]): per-example difficulty via the depth where a k-NN vote locks onto the final answer. → example

- TNNet.LayerSensitivityReport(NN, Samples [, Targets]): output/loss delta from small multiplicative per-layer weight perturbations, with a fragility verdict. → example
- TNNet.MagnitudePruningReport(NN, Samples [, Labels, Tolerance, PerLayer]): no-retrain accuracy-vs-sparsity curve and the prunability knee (global or per-layer). → example

- TNNet.ModeConnectivityReport(NN, SnapshotB, Samples): loss barrier along the linear interpolation between two trained nets ("same basin or separated?"). → example
- TNNet.PermutationAlignReport(NN, SnapshotB, Samples [, ScoreMode, K]): "Git Re-Basin": loss barrier before vs after quotienting out neuron-permutation symmetry. → example
This NLP source code example shows a (hello world) small neural network trained on the Tiny Stories dataset. A more complex NLP example showing the implementation of the GPT-3 Small architecture is also available.
In short, this API supports:
- Samplers: TNNetSamplerGreedy, TNNetSamplerTopK and TNNetSamplerTopP.
- A logit processor for repetition control: TNNetTokenHistoryPenalty, a stateful pre-sampler that reshapes the next-token logits in place using generation history, with three standard knobs (repetition penalty in the sign-correct CTRL form, frequency penalty, and presence penalty). Use it as Penalty.Apply(Logits); tok := Sampler.GetToken(Logits); Penalty.RegisterToken(tok);.
- A tokenizer: TNeuralTokenizer.
- A transformer decoder: AddTransformerBlockCAI.
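The three penalty knobs mentioned for the logit processor are commonly formulated as in the library-free sketch below. This is an assumption about the general technique, not the actual TNNetTokenHistoryPenalty code: the procedure ApplyHistoryPenalty, the fixed vocabulary size and all numeric values are hypothetical. "Sign-correct" here means that a previously seen token is always pushed down: positive logits are divided by the penalty, negative logits are multiplied by it.

```pascal
program TokenPenaltySketch;
{$mode objfpc}{$H+}

const
  VocabSize = 4;

type
  TLogits = array[0..VocabSize - 1] of Double;
  TCounts = array[0..VocabSize - 1] of Integer;

// Reshapes next-token logits in place from how often each token
// has already been generated (Counts).
procedure ApplyHistoryPenalty(var Logits: TLogits; const Counts: TCounts;
  RepetitionPenalty, FrequencyPenalty, PresencePenalty: Double);
var
  Tok: Integer;
begin
  for Tok := 0 to VocabSize - 1 do
    if Counts[Tok] > 0 then
    begin
      // Repetition penalty, sign-correct form.
      if Logits[Tok] > 0 then
        Logits[Tok] := Logits[Tok] / RepetitionPenalty
      else
        Logits[Tok] := Logits[Tok] * RepetitionPenalty;
      // Frequency penalty grows with the count; presence penalty is flat.
      Logits[Tok] := Logits[Tok] - FrequencyPenalty * Counts[Tok] - PresencePenalty;
    end;
end;

var
  Logits: TLogits = (2.0, -1.0, 0.5, 3.0);
  Counts: TCounts = (2, 1, 0, 0); // tokens 0 and 1 were already generated
  Tok: Integer;
begin
  ApplyHistoryPenalty(Logits, Counts, {Repetition=}2.0, {Frequency=}0.25, {Presence=}0.5);
  for Tok := 0 to VocabSize - 1 do
    WriteLn('logit[', Tok, '] = ', Logits[Tok]:0:2);
end.
```

Unseen tokens (2 and 3 here) keep their logits unchanged, so only the history shifts the distribution the sampler sees.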
If you would like to know more about what CAI's author is working on, see the publications below.
Optimizing the first layers of a convolutional neural network:
- Color-aware two-branch DCNN for efficient plant disease classification.
- Reliable Deep Learning Plant Leaf Disease Classification Based on Light-Chroma Separated Branches.
Optimizing deep layers of a convolutional neural network:
- Grouped Pointwise Convolutions Reduce Parameters in Convolutional Neural Networks.
- An Enhanced Scheme for Reducing the Complexity of Pointwise Convolutions in CNNs for Image Classification Based on Interleaved Grouped Filters without Divisibility Constraints.
Optimizing LLMs:
Publications in Portuguese:
- A Evolução dos Algoritmos Mentais.
- Da Física à Inteligência Extrassomática.
- Inteligência Artificial Popperiana.
- Operações Lógicas Quânticas e Colorabilidade de Grafos.
Pull requests are welcome. Having requests accepted might be hard.
In the case that you need help with your own A.I. project (Pascal, Python, or PHP), please feel free to contact the author of this API.
You can cite this API in BibTeX format with:
@software{cai_neural_api_2021_5810077,
author = {Joao Paulo Schwarz Schuler},
title = {CAI NEURAL API},
month = dec,
year = 2021,
publisher = {Zenodo},
version = {v1.0.6},
doi = {10.5281/zenodo.5810077},
url = {https://doi.org/10.5281/zenodo.5810077}
}