Cydral commented on Nov 28, 2025

Official transformer architecture support for Dlib

This pull request consolidates and stabilizes the transformer-related layers and components developed throughout 2024 and 2025. This substantial contribution introduces official support for modern language modeling in Dlib, positioning the library as a reference implementation for building neural networks for natural language processing.

The extensions have been iteratively refined over the past year, with each component tested and validated across multiple architectures and use cases. This work establishes the foundation for upcoming multimodal capabilities, with active development underway for vision transformers and combined text-image processing.

Future releases will introduce examples demonstrating transformer architectures for image processing, followed by multimodal fusion combining textual and visual information. This PR represents an important milestone that could justify a new version of Dlib to mark the official introduction of these features.


Overview

This pull request introduces complete transformer architecture support to Dlib, enabling modern language modeling capabilities while maintaining Dlib's philosophy of simple APIs and production-ready implementations. All components are written in standard C++14 for cross-platform compatibility.

Major additions

Core architectural components

Attention mechanisms:

  • multi-head self-attention with scaled dot-product computation
  • canonical and fused transformer variants for research and production use
  • causal masking for autoregressive generation
  • rotary positional embeddings (RoPE) with YaRN extension for dynamic sequence length support
  • absolute positional encodings for traditional transformer architectures
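
To make the attention mechanics concrete, the sketch below shows scaled dot-product attention with a causal mask, written against plain dlib::matrix rather than the new layers. It is a minimal conceptual sketch, not the PR's implementation; the function name and shapes are illustrative only.

```cpp
// Minimal sketch of single-head scaled dot-product attention with a causal
// mask, expressed with dlib::matrix for illustration only.
#include <dlib/matrix.h>
#include <algorithm>
#include <cmath>
#include <limits>

dlib::matrix<float> causal_attention(
    const dlib::matrix<float>& Q,   // (seq_len, head_dim)
    const dlib::matrix<float>& K,   // (seq_len, head_dim)
    const dlib::matrix<float>& V)   // (seq_len, head_dim)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(K.nc()));
    dlib::matrix<float> scores = Q * dlib::trans(K) * scale;   // (seq_len, seq_len)

    // Causal mask: position i may only attend to positions j <= i.
    for (long i = 0; i < scores.nr(); ++i)
        for (long j = i + 1; j < scores.nc(); ++j)
            scores(i, j) = -std::numeric_limits<float>::infinity();

    // Row-wise softmax, with the row maximum subtracted for stability.
    for (long i = 0; i < scores.nr(); ++i)
    {
        float m = scores(i, 0);
        for (long j = 1; j < scores.nc(); ++j) m = std::max(m, scores(i, j));
        float denom = 0;
        for (long j = 0; j < scores.nc(); ++j)
        {
            scores(i, j) = std::exp(scores(i, j) - m);
            denom += scores(i, j);
        }
        for (long j = 0; j < scores.nc(); ++j) scores(i, j) /= denom;
    }
    return scores * V;   // (seq_len, head_dim)
}
```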

Specialized layers:

  • linear layer with plane-wise matrix multiplication for sequence processing
  • rms_norm layer implementing efficient RMS normalization
  • reshape_to layer for dimension manipulation without data copying
  • token_embeddings layer combining embedding lookup with positional encoding
  • tril layer for triangular mask generation
  • transpose and multm_prev layers for attention computation
  • dropout_rate layer with configurable per-layer dropout schedules
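
As background for the rms_norm layer, here is a minimal sketch of RMS normalization as defined in the RMSNorm paper (Zhang & Sennrich, 2019): each vector is scaled by the reciprocal of its root mean square and multiplied by a learned per-element gain. It illustrates the computation only; the layer's actual interface lives in layers_transformer.h.

```cpp
// Conceptual sketch of RMS normalization for a single embedding vector.
// gamma is the learned per-element gain; eps guards against division by zero.
#include <vector>
#include <cmath>

std::vector<float> rms_normalize(
    const std::vector<float>& x,
    const std::vector<float>& gamma,
    float eps = 1e-5f)
{
    float mean_sq = 0;
    for (float v : x) mean_sq += v * v;
    mean_sq /= x.size();

    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * gamma[i];
    return y;
}
```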

Advanced architectures:

  • mixture-of-experts (MoE) with dynamic expert routing and load balancing
  • hierarchical reasoning model (HRM) with dual recurrent modules
  • adaptive computation time (ACT) for dynamic computation allocation
  • SwiGLU gated activation for improved feed-forward networks
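
For the SwiGLU activation listed above, the following is a minimal sketch of the gated feed-forward computation from the "GLU Variants Improve Transformer" formulation: the hidden activation is swish(xW1) multiplied element-wise by a second projection xW2, followed by an output projection. Matrix names and shapes are illustrative only, not the PR's code.

```cpp
// Conceptual SwiGLU feed-forward block: y = (swish(x*W1) .* (x*W2)) * W3,
// where swish(z) = z * sigmoid(z). Shapes: x is (seq_len, d_model),
// W1 and W2 are (d_model, d_ff), W3 is (d_ff, d_model).
#include <dlib/matrix.h>
#include <cmath>

dlib::matrix<float> swiglu_ffn(
    const dlib::matrix<float>& x,
    const dlib::matrix<float>& W1,
    const dlib::matrix<float>& W2,
    const dlib::matrix<float>& W3)
{
    dlib::matrix<float> gate = x * W1;   // (seq_len, d_ff)
    dlib::matrix<float> up   = x * W2;   // (seq_len, d_ff)

    // Apply swish to the gate and multiply element-wise with the up projection.
    for (long r = 0; r < gate.nr(); ++r)
        for (long c = 0; c < gate.nc(); ++c)
        {
            const float z = gate(r, c);
            const float swish = z / (1.0f + std::exp(-z));
            gate(r, c) = swish * up(r, c);
        }
    return gate * W3;                    // (seq_len, d_model)
}
```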

Optimization infrastructure

AdamW optimizer (dlib/dnn/solvers.h):

  • decoupled weight decay regularization for improved generalization
  • based on "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, ICLR 2019)
  • standard optimizer for large language models and transformer architectures
  • per-layer learning rate and weight decay multipliers
  • production-ready implementation with proper bias correction
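
The decoupled update rule from the cited paper is summarized below. This is a sketch of the published AdamW algorithm applied to one flat parameter vector, not the solver code added in dlib/dnn/solvers.h; names and default hyperparameters are illustrative.

```cpp
// One AdamW step, following Loshchilov & Hutter: standard Adam moment updates,
// but the weight decay term lambda*w is applied directly to the weights
// instead of being folded into the gradient.
#include <vector>
#include <cmath>

void adamw_step(
    std::vector<float>& w,            // parameters
    const std::vector<float>& g,      // gradients
    std::vector<float>& m,            // first moment (same size as w)
    std::vector<float>& v,            // second moment (same size as w)
    long t,                           // 1-based step count
    float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
    float eps = 1e-8f, float lambda = 0.01f)
{
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < w.size(); ++i)
    {
        m[i] = beta1 * m[i] + (1 - beta1) * g[i];
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
        const float m_hat = m[i] / bc1;   // bias-corrected first moment
        const float v_hat = v[i] / bc2;   // bias-corrected second moment
        // Decoupled weight decay: lambda multiplies w[i], not the gradient.
        w[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + lambda * w[i]);
    }
}
```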

Language modeling utilities

Dataset preparation (language_model_data.h):

  • build_single_token_prediction_dataset() for autoregressive training
  • build_multi_token_prediction_dataset() for sequence-to-sequence tasks
  • shuffle_training_dataset() for data randomization
  • augment_training_dataset() for noise injection and robustness improvement
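
To make the dataset helpers concrete, the sketch below shows the sliding-window construction that single-token prediction implies: each training sample is a fixed-length context paired with the token that follows it. The actual helpers in language_model_data.h may take different types and arguments; this is a conceptual illustration only.

```cpp
// Conceptual sketch of building (context, next-token) pairs from a token
// stream for autoregressive training.
#include <vector>
#include <utility>

std::vector<std::pair<std::vector<int>, int>> make_next_token_samples(
    const std::vector<int>& tokens,
    std::size_t context_len)
{
    std::vector<std::pair<std::vector<int>, int>> samples;
    if (tokens.size() <= context_len)
        return samples;

    // Slide a window of context_len tokens over the stream; the token right
    // after the window is the prediction target.
    for (std::size_t i = 0; i + context_len < tokens.size(); ++i)
    {
        std::vector<int> context(tokens.begin() + i,
                                 tokens.begin() + i + context_len);
        samples.emplace_back(std::move(context), tokens[i + context_len]);
    }
    return samples;
}
```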

Inference management:

  • inference_context class for autoregressive generation with sliding window
  • stochastic text generation with temperature, top-k, nucleus sampling, and repetition penalty
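
As an illustration of the generation controls mentioned above, the following sketch applies temperature scaling and top-k filtering to a logit vector and samples a token; nucleus sampling and repetition penalty follow the same pattern. It shows the underlying idea only, not the inference_context API.

```cpp
// Conceptual temperature + top-k sampling from a vector of logits.
#include <vector>
#include <algorithm>
#include <cmath>
#include <random>

int sample_top_k(const std::vector<float>& logits, float temperature,
                 std::size_t k, std::mt19937& rng)
{
    // Rank token ids by logit and keep the k largest.
    std::vector<int> ids(logits.size());
    for (std::size_t i = 0; i < ids.size(); ++i) ids[i] = static_cast<int>(i);
    k = std::min(k, ids.size());
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
        [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the k surviving logits, scaled by temperature.
    std::vector<double> probs(k);
    const double max_logit = logits[ids[0]] / temperature;
    double denom = 0;
    for (std::size_t i = 0; i < k; ++i)
    {
        probs[i] = std::exp(logits[ids[i]] / temperature - max_logit);
        denom += probs[i];
    }
    for (auto& p : probs) p /= denom;

    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return ids[dist(rng)];
}
```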

Evaluation metrics:

  • edit distance (Levenshtein) with normalization
  • token overlap metrics (precision, recall, F1-score)
  • n-gram overlap (BLEU-like) for structural similarity
  • compute_text_similarity() combining all metrics
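
For the overlap metrics, the sketch below computes token-level precision, recall, and F1 between a candidate and a reference sequence, which is the standard definition of these quantities; how compute_text_similarity() aggregates them is up to the library code.

```cpp
// Token-overlap precision/recall/F1 between a candidate and a reference,
// counting each token with multiplicity (multiset intersection).
#include <vector>
#include <string>
#include <map>

struct overlap_scores { double precision, recall, f1; };

overlap_scores token_overlap(
    const std::vector<std::string>& candidate,
    const std::vector<std::string>& reference)
{
    std::map<std::string, long> ref_counts;
    for (const auto& tok : reference) ++ref_counts[tok];

    long matched = 0;
    for (const auto& tok : candidate)
    {
        auto it = ref_counts.find(tok);
        if (it != ref_counts.end() && it->second > 0) { --it->second; ++matched; }
    }

    overlap_scores s{0, 0, 0};
    if (!candidate.empty()) s.precision = static_cast<double>(matched) / candidate.size();
    if (!reference.empty()) s.recall    = static_cast<double>(matched) / reference.size();
    if (s.precision + s.recall > 0)
        s.f1 = 2 * s.precision * s.recall / (s.precision + s.recall);
    return s;
}
```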

Preprocessing:

  • detect_file_type() supporting 30+ formats via magic numbers and entropy analysis

Complete transformer implementations

Canonical transformer (canonical_transformer namespace):

  • explicit Q, K, V projections for modularity and research
  • transformer_block combining attention and feed-forward networks
  • transformer_stack for building deep architectures
  • support for dynamic sequence lengths through dimension inference

Fused transformer (fused_transformer namespace):

  • combined QKV projection for memory and compute efficiency
  • optimized for production deployment scenarios
  • compatible API with canonical variant
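
The difference between the two variants is essentially where the Q, K, V projections live: the canonical version keeps three separate weight matrices, while the fused version concatenates them so one matrix multiplication produces all three. The sketch below shows that idea with plain dlib::matrix; shapes and names are illustrative, not the namespace's actual types.

```cpp
// Conceptual fused QKV projection: one (d_model x 3*d_model) weight matrix,
// a single matmul, then a split into Q, K, V column slices.
#include <dlib/matrix.h>

void fused_qkv(
    const dlib::matrix<float>& x,      // (seq_len, d_model)
    const dlib::matrix<float>& W_qkv,  // (d_model, 3*d_model)
    dlib::matrix<float>& Q,
    dlib::matrix<float>& K,
    dlib::matrix<float>& V)
{
    const long d_model = x.nc();
    dlib::matrix<float> qkv = x * W_qkv;                       // (seq_len, 3*d_model)
    Q = dlib::subm(qkv, 0, 0,           qkv.nr(), d_model);    // first d_model columns
    K = dlib::subm(qkv, 0, d_model,     qkv.nr(), d_model);    // next d_model columns
    V = dlib::subm(qkv, 0, 2 * d_model, qkv.nr(), d_model);    // last d_model columns
}
```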

Loss functions

Cross-entropy per logit (loss_cross_entropy_per_logit):

  • specialized loss for sequence models working directly with linear layer output
  • computes loss only at last sequence position
  • avoids dimension flattening while preserving sequence structure
  • numerically stable via log-sum-exp trick
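
The numerically stable form referenced above is the standard log-sum-exp identity: for logits z and target class y, loss = logsumexp(z) - z_y, with the maximum subtracted before exponentiation. A minimal sketch for one sequence position follows; it mirrors the math, not the loss layer's code.

```cpp
// Stable cross-entropy for a single position: -log softmax(logits)[target],
// computed as logsumexp(logits) - logits[target] with the max factored out.
#include <vector>
#include <algorithm>
#include <cmath>

double cross_entropy_from_logits(const std::vector<float>& logits, std::size_t target)
{
    float max_logit = logits[0];
    for (float z : logits) max_logit = std::max(max_logit, z);

    double sum_exp = 0;
    for (float z : logits) sum_exp += std::exp(z - max_logit);

    const double log_sum_exp = max_logit + std::log(sum_exp);
    return log_sum_exp - logits[target];
}
```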

Example programs

Four progressive examples demonstrate the capabilities:

slm_basic_train_ex.cpp: character-level transformer training on Shakespeare text. Demonstrates fundamental attention mechanics and memorization capability.

slm_advanced_train_ex.cpp: BPE tokenization with compact architecture. Introduces specialized loss function and byte-for-byte verification.

slm_mixture_of_experts_ex.cpp: sparse conditional computation with production-grade utilities. Demonstrates shuffle and augmentation utilities for robust training.

slm_chatbot_ex.cpp: conversational AI training pipeline with a two-stage approach. Demonstrates base language model pre-training followed by supervised fine-tuning on question-answer pairs. Includes stochastic text generation with configurable sampling strategies (temperature, repetition penalty, nucleus sampling) and layer-wise learning rate multipliers for efficient fine-tuning. Shows a practical implementation of an interactive chatbot with context management.


Technical design

Matrix plane processing

Traditional Dlib layers operate channel-wise on 4D tensors. The extensions introduce plane-wise processing where (rows, cols) dimensions form semantic units for sequence data. This enables:

  • natural representation: (batch, 1, sequence_length, embedding_dim)
  • efficient attention computation over spatial planes
  • seamless integration with existing Dlib computational graph
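
Concretely, a batch of token embeddings occupies a Dlib tensor as shown below, with the sequence along the row dimension and the embedding along the column dimension. This is a plain usage sketch of the existing dlib::resizable_tensor API, with illustrative sizes.

```cpp
// A batch of 8 sequences, each of 128 tokens with 512-dimensional embeddings,
// laid out as (samples, k, rows, cols) = (batch, 1, sequence_length, embedding_dim).
#include <dlib/dnn.h>
#include <iostream>

int main()
{
    dlib::resizable_tensor t;
    t.set_size(8, 1, 128, 512);

    // Each (rows, cols) plane is one whole sequence, so attention and the
    // plane-wise linear layer can treat it as a (seq_len x embed_dim) matrix.
    std::cout << "planes: "    << t.num_samples() * t.k() << "\n"
              << "seq_len: "   << t.nr() << "\n"
              << "embed_dim: " << t.nc() << "\n";
    return 0;
}
```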

Implementation approach

All components follow Dlib's design patterns:

  • header-only implementations where appropriate
  • template-based abstractions for compile-time optimization
  • compatibility with existing training infrastructure (dnn_trainer, optimizers, serialization)
  • comprehensive inline documentation following Dlib's conventions

Testing and validation

The example programs demonstrate:

  • near-perfect memorization of the training data (99.99% accuracy in the basic example)
  • byte-for-byte reproduction capability (advanced example)
  • balanced expert utilization in MoE (coefficient of variation < 0.3)

Main files modified/added

New headers:

  • dlib/dnn/transformer.h - complete transformer implementations
  • dlib/dnn/layers_transformer.h - specialized layers for sequence processing
  • dlib/dnn/language_model_data.h - utilities for dataset preparation and evaluation
  • dlib/tokenizer/bpe_tokenizer.h - byte-pair encoding tokenization
  • dlib/dnn/solvers.h - AdamW optimizer addition

New examples:

  • examples/slm_basic_train_ex.cpp
  • examples/slm_advanced_train_ex.cpp
  • examples/slm_mixture_of_experts_ex.cpp
  • examples/slm_chatbot_ex.cpp
  • examples/slm_data.h - internal datasets for examples

Abstract documentation:

  • docs/layers_abstract.h - layer specifications and usage patterns
  • docs/transformer_abstract.h - transformer architecture documentation
  • docs/language_model_data_abstract.h - language modeling utility documentation
  • docs/solvers_abstract.h - AdamW optimizer specification

Extended documentation

For more details, see the dedicated repository: https://github.com/Cydral/Dlib-Transformer-extensions


This contribution establishes official transformer support in Dlib, extending the library into modern natural language processing while maintaining its core values of simplicity, performance, and production readiness. The groundwork laid here enables upcoming vision transformer implementations and multimodal architectures, positioning Dlib as a comprehensive framework for contemporary deep learning applications.
