Cydral commented on Nov 28, 2025

Official transformer architecture support for Dlib

This pull request consolidates and stabilizes the transformer-related layers and components developed throughout 2024 and 2025. This substantial contribution introduces official support for modern language modeling in Dlib, positioning the library as a reference implementation for building neural networks for natural language processing.

The extensions have been iteratively refined over the past year, with each component tested and validated across multiple architectures and use cases. This work establishes the foundation for upcoming multimodal capabilities, with active development underway for vision transformers and combined text-image processing.

Future releases will introduce examples demonstrating transformer architectures for image processing, followed by multimodal fusion combining textual and visual information. This PR represents an important milestone that could justify a new version of Dlib to mark the official introduction of these features.


Overview

This pull request introduces complete transformer architecture support to Dlib, enabling modern language modeling capabilities while maintaining Dlib's philosophy of simple APIs and production-ready implementations. All components are written in standard C++14 for cross-platform compatibility.

Major additions

Core architectural components

Attention mechanisms:

  • multi-head self-attention with scaled dot-product computation
  • canonical and fused transformer variants for research and production use
  • causal masking for autoregressive generation
  • rotary positional embeddings (RoPE) with YaRN extension for dynamic sequence length support
  • absolute positional encodings for traditional transformer architectures
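
To make the attention mechanics concrete, the sketch below shows scaled dot-product attention with a causal mask, written against plain dlib::matrix rather than the new layers. It is a minimal conceptual sketch, not the PR's implementation; the function name and shapes are illustrative only.

```cpp
// Minimal sketch of single-head scaled dot-product attention with a causal
// mask, expressed with dlib::matrix for illustration only.
#include <dlib/matrix.h>
#include <algorithm>
#include <cmath>
#include <limits>

dlib::matrix<float> causal_attention(
    const dlib::matrix<float>& Q,   // (seq_len, head_dim)
    const dlib::matrix<float>& K,   // (seq_len, head_dim)
    const dlib::matrix<float>& V)   // (seq_len, head_dim)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(K.nc()));
    dlib::matrix<float> scores = Q * dlib::trans(K) * scale;   // (seq_len, seq_len)

    // Causal mask: position i may only attend to positions j <= i.
    for (long i = 0; i < scores.nr(); ++i)
        for (long j = i + 1; j < scores.nc(); ++j)
            scores(i, j) = -std::numeric_limits<float>::infinity();

    // Row-wise softmax, with the row maximum subtracted for stability.
    for (long i = 0; i < scores.nr(); ++i)
    {
        float m = scores(i, 0);
        for (long j = 1; j < scores.nc(); ++j) m = std::max(m, scores(i, j));
        float denom = 0;
        for (long j = 0; j < scores.nc(); ++j)
        {
            scores(i, j) = std::exp(scores(i, j) - m);
            denom += scores(i, j);
        }
        for (long j = 0; j < scores.nc(); ++j) scores(i, j) /= denom;
    }
    return scores * V;   // (seq_len, head_dim)
}
```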

Specialized layers:

  • linear layer with plane-wise matrix multiplication for sequence processing
  • rms_norm layer implementing efficient RMS normalization
  • reshape_to layer for dimension manipulation without data copying
  • token_embeddings layer combining embedding lookup with positional encoding
  • tril layer for triangular mask generation
  • transpose and multm_prev layers for attention computation
  • dropout_rate layer with configurable per-layer dropout schedules
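
As background for the rms_norm layer, here is a minimal sketch of RMS normalization as defined in the RMSNorm paper (Zhang & Sennrich, 2019): each vector is scaled by the reciprocal of its root mean square and multiplied by a learned per-element gain. It illustrates the computation only; the layer's actual interface lives in layers_transformer.h.

```cpp
// Conceptual sketch of RMS normalization for a single embedding vector.
// gamma is the learned per-element gain; eps guards against division by zero.
#include <vector>
#include <cmath>

std::vector<float> rms_normalize(
    const std::vector<float>& x,
    const std::vector<float>& gamma,
    float eps = 1e-5f)
{
    float mean_sq = 0;
    for (float v : x) mean_sq += v * v;
    mean_sq /= x.size();

    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * gamma[i];
    return y;
}
```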

Advanced architectures:

  • mixture-of-experts (MoE) with dynamic expert routing and load balancing
  • hierarchical reasoning model (HRM) with dual recurrent modules
  • adaptive computation time (ACT) for dynamic computation allocation
  • SwiGLU gated activation for improved feed-forward networks
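
For the SwiGLU activation listed above, the following is a minimal sketch of the gated feed-forward computation from the "GLU Variants Improve Transformer" formulation: the hidden activation is swish(xW1) multiplied element-wise by a second projection xW2, followed by an output projection. Matrix names and shapes are illustrative only, not the PR's code.

```cpp
// Conceptual SwiGLU feed-forward block: y = (swish(x*W1) .* (x*W2)) * W3,
// where swish(z) = z * sigmoid(z). Shapes: x is (seq_len, d_model),
// W1 and W2 are (d_model, d_ff), W3 is (d_ff, d_model).
#include <dlib/matrix.h>
#include <cmath>

dlib::matrix<float> swiglu_ffn(
    const dlib::matrix<float>& x,
    const dlib::matrix<float>& W1,
    const dlib::matrix<float>& W2,
    const dlib::matrix<float>& W3)
{
    dlib::matrix<float> gate = x * W1;   // (seq_len, d_ff)
    dlib::matrix<float> up   = x * W2;   // (seq_len, d_ff)

    // Apply swish to the gate and multiply element-wise with the up projection.
    for (long r = 0; r < gate.nr(); ++r)
        for (long c = 0; c < gate.nc(); ++c)
        {
            const float z = gate(r, c);
            const float swish = z / (1.0f + std::exp(-z));
            gate(r, c) = swish * up(r, c);
        }
    return gate * W3;                    // (seq_len, d_model)
}
```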

Optimization infrastructure

AdamW optimizer (dlib/dnn/solvers.h):

  • decoupled weight decay regularization for improved generalization
  • based on "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, ICLR 2019)
  • standard optimizer for large language models and transformer architectures
  • per-layer learning rate and weight decay multipliers
  • production-ready implementation with proper bias correction
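
The decoupled update rule from the cited paper is summarized below. This is a sketch of the published AdamW algorithm applied to one flat parameter vector, not the solver code added in dlib/dnn/solvers.h; names and default hyperparameters are illustrative.

```cpp
// One AdamW step, following Loshchilov & Hutter: standard Adam moment updates,
// but the weight decay term lambda*w is applied directly to the weights
// instead of being folded into the gradient.
#include <vector>
#include <cmath>

void adamw_step(
    std::vector<float>& w,            // parameters
    const std::vector<float>& g,      // gradients
    std::vector<float>& m,            // first moment (same size as w)
    std::vector<float>& v,            // second moment (same size as w)
    long t,                           // 1-based step count
    float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
    float eps = 1e-8f, float lambda = 0.01f)
{
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < w.size(); ++i)
    {
        m[i] = beta1 * m[i] + (1 - beta1) * g[i];
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
        const float m_hat = m[i] / bc1;   // bias-corrected first moment
        const float v_hat = v[i] / bc2;   // bias-corrected second moment
        // Decoupled weight decay: lambda multiplies w[i], not the gradient.
        w[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + lambda * w[i]);
    }
}
```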

Language modeling utilities

Dataset preparation (language_model_data.h):

  • build_single_token_prediction_dataset() for autoregressive training
  • build_multi_token_prediction_dataset() for sequence-to-sequence tasks
  • shuffle_training_dataset() for data randomization
  • augment_training_dataset() for noise injection and robustness improvement
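
To make the dataset helpers concrete, the sketch below shows the sliding-window construction that single-token prediction implies: each training sample is a fixed-length context paired with the token that follows it. The actual helpers in language_model_data.h may take different types and arguments; this is a conceptual illustration only.

```cpp
// Conceptual sketch of building (context, next-token) pairs from a token
// stream for autoregressive training.
#include <vector>
#include <utility>

std::vector<std::pair<std::vector<int>, int>> make_next_token_samples(
    const std::vector<int>& tokens,
    std::size_t context_len)
{
    std::vector<std::pair<std::vector<int>, int>> samples;
    if (tokens.size() <= context_len)
        return samples;

    // Slide a window of context_len tokens over the stream; the token right
    // after the window is the prediction target.
    for (std::size_t i = 0; i + context_len < tokens.size(); ++i)
    {
        std::vector<int> context(tokens.begin() + i,
                                 tokens.begin() + i + context_len);
        samples.emplace_back(std::move(context), tokens[i + context_len]);
    }
    return samples;
}
```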

Inference management:

  • inference_context class for autoregressive generation with sliding window
  • stochastic text generation with temperature, top-k, nucleus sampling, and repetition penalty
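
As an illustration of the generation controls mentioned above, the following sketch applies temperature scaling and top-k filtering to a logit vector and samples a token; nucleus sampling and repetition penalty follow the same pattern. It shows the underlying idea only, not the inference_context API.

```cpp
// Conceptual temperature + top-k sampling from a vector of logits.
#include <vector>
#include <algorithm>
#include <cmath>
#include <random>

int sample_top_k(const std::vector<float>& logits, float temperature,
                 std::size_t k, std::mt19937& rng)
{
    // Rank token ids by logit and keep the k largest.
    std::vector<int> ids(logits.size());
    for (std::size_t i = 0; i < ids.size(); ++i) ids[i] = static_cast<int>(i);
    k = std::min(k, ids.size());
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
        [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the k surviving logits, scaled by temperature.
    std::vector<double> probs(k);
    const double max_logit = logits[ids[0]] / temperature;
    double denom = 0;
    for (std::size_t i = 0; i < k; ++i)
    {
        probs[i] = std::exp(logits[ids[i]] / temperature - max_logit);
        denom += probs[i];
    }
    for (auto& p : probs) p /= denom;

    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return ids[dist(rng)];
}
```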

Evaluation metrics:

  • edit distance (Levenshtein) with normalization
  • token overlap metrics (precision, recall, F1-score)
  • n-gram overlap (BLEU-like) for structural similarity
  • compute_text_similarity() combining all metrics
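
For the overlap metrics, the sketch below computes token-level precision, recall, and F1 between a candidate and a reference sequence, which is the standard definition of these quantities; how compute_text_similarity() aggregates them is up to the library code.

```cpp
// Token-overlap precision/recall/F1 between a candidate and a reference,
// counting each token with multiplicity (multiset intersection).
#include <vector>
#include <string>
#include <map>

struct overlap_scores { double precision, recall, f1; };

overlap_scores token_overlap(
    const std::vector<std::string>& candidate,
    const std::vector<std::string>& reference)
{
    std::map<std::string, long> ref_counts;
    for (const auto& tok : reference) ++ref_counts[tok];

    long matched = 0;
    for (const auto& tok : candidate)
    {
        auto it = ref_counts.find(tok);
        if (it != ref_counts.end() && it->second > 0) { --it->second; ++matched; }
    }

    overlap_scores s{0, 0, 0};
    if (!candidate.empty()) s.precision = static_cast<double>(matched) / candidate.size();
    if (!reference.empty()) s.recall    = static_cast<double>(matched) / reference.size();
    if (s.precision + s.recall > 0)
        s.f1 = 2 * s.precision * s.recall / (s.precision + s.recall);
    return s;
}
```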

Preprocessing:

  • detect_file_type() supporting 30+ formats via magic numbers and entropy analysis

Complete transformer implementations

Canonical transformer (canonical_transformer namespace):

  • explicit Q, K, V projections for modularity and research
  • transformer_block combining attention and feed-forward networks
  • transformer_stack for building deep architectures
  • support for dynamic sequence lengths through dimension inference

Fused transformer (fused_transformer namespace):

  • combined QKV projection for memory and compute efficiency
  • optimized for production deployment scenarios
  • compatible API with canonical variant
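
The difference between the two variants is essentially where the Q, K, V projections live: the canonical version keeps three separate weight matrices, while the fused version concatenates them so one matrix multiplication produces all three. The sketch below shows that idea with plain dlib::matrix; shapes and names are illustrative, not the namespace's actual types.

```cpp
// Conceptual fused QKV projection: one (d_model x 3*d_model) weight matrix,
// a single matmul, then a split into Q, K, V column slices.
#include <dlib/matrix.h>

void fused_qkv(
    const dlib::matrix<float>& x,      // (seq_len, d_model)
    const dlib::matrix<float>& W_qkv,  // (d_model, 3*d_model)
    dlib::matrix<float>& Q,
    dlib::matrix<float>& K,
    dlib::matrix<float>& V)
{
    const long d_model = x.nc();
    dlib::matrix<float> qkv = x * W_qkv;                       // (seq_len, 3*d_model)
    Q = dlib::subm(qkv, 0, 0,           qkv.nr(), d_model);    // first d_model columns
    K = dlib::subm(qkv, 0, d_model,     qkv.nr(), d_model);    // next d_model columns
    V = dlib::subm(qkv, 0, 2 * d_model, qkv.nr(), d_model);    // last d_model columns
}
```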

Loss functions

Cross-entropy per logit (loss_cross_entropy_per_logit):

  • specialized loss for sequence models working directly with linear layer output
  • computes loss only at last sequence position
  • avoids dimension flattening while preserving sequence structure
  • numerically stable via log-sum-exp trick
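
The numerically stable form referenced above is the standard log-sum-exp identity: for logits z and target class y, loss = logsumexp(z) - z_y, with the maximum subtracted before exponentiation. A minimal sketch for one sequence position follows; it mirrors the math, not the loss layer's code.

```cpp
// Stable cross-entropy for a single position: -log softmax(logits)[target],
// computed as logsumexp(logits) - logits[target] with the max factored out.
#include <vector>
#include <algorithm>
#include <cmath>

double cross_entropy_from_logits(const std::vector<float>& logits, std::size_t target)
{
    float max_logit = logits[0];
    for (float z : logits) max_logit = std::max(max_logit, z);

    double sum_exp = 0;
    for (float z : logits) sum_exp += std::exp(z - max_logit);

    const double log_sum_exp = max_logit + std::log(sum_exp);
    return log_sum_exp - logits[target];
}
```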

Example programs

Four progressive examples demonstrate the capabilities:

slm_basic_train_ex.cpp: character-level transformer training on Shakespeare text. Demonstrates fundamental attention mechanics and memorization capability.

slm_advanced_train_ex.cpp: BPE tokenization with compact architecture. Introduces specialized loss function and byte-for-byte verification.

slm_mixture_of_experts_ex.cpp: sparse conditional computation with production-grade utilities. Demonstrates shuffle and augmentation utilities for robust training.

slm_chatbot_ex.cpp: conversational AI training pipeline with a two-stage approach. Demonstrates base language model pre-training followed by supervised fine-tuning on question-answer pairs. Includes stochastic text generation with configurable sampling strategies (temperature, repetition penalty, nucleus sampling) and layer-wise learning rate multipliers for efficient fine-tuning. Shows a practical implementation of an interactive chatbot with context management.


Technical design

Matrix plane processing

Traditional Dlib layers operate channel-wise on 4D tensors. The extensions introduce plane-wise processing where (rows, cols) dimensions form semantic units for sequence data. This enables:

  • natural representation: (batch, 1, sequence_length, embedding_dim)
  • efficient attention computation over spatial planes
  • seamless integration with existing Dlib computational graph
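
Concretely, a batch of token embeddings occupies a Dlib tensor as shown below, with the sequence along the row dimension and the embedding along the column dimension. This is a plain usage sketch of the existing dlib::resizable_tensor API, with illustrative sizes.

```cpp
// A batch of 8 sequences, each of 128 tokens with 512-dimensional embeddings,
// laid out as (samples, k, rows, cols) = (batch, 1, sequence_length, embedding_dim).
#include <dlib/dnn.h>
#include <iostream>

int main()
{
    dlib::resizable_tensor t;
    t.set_size(8, 1, 128, 512);

    // Each (rows, cols) plane is one whole sequence, so attention and the
    // plane-wise linear layer can treat it as a (seq_len x embed_dim) matrix.
    std::cout << "planes: "    << t.num_samples() * t.k() << "\n"
              << "seq_len: "   << t.nr() << "\n"
              << "embed_dim: " << t.nc() << "\n";
    return 0;
}
```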

Implementation approach

All components follow Dlib's design patterns:

  • header-only implementations where appropriate
  • template-based abstractions for compile-time optimization
  • compatibility with existing training infrastructure (dnn_trainer, optimizers, serialization)
  • comprehensive inline documentation following Dlib's conventions

Testing and validation

The example programs demonstrate:

  • near-perfect memorization of the training data (99.99% accuracy in the basic example)
  • byte-for-byte reproduction capability (advanced example)
  • balanced expert utilization in MoE (coefficient of variation < 0.3)

Main files modified/added

New headers:

  • dlib/dnn/transformer.h - complete transformer implementations
  • dlib/dnn/layers_transformer.h - specialized layers for sequence processing
  • dlib/dnn/language_model_data.h - utilities for dataset preparation and evaluation
  • dlib/tokenizer/bpe_tokenizer.h - byte-pair encoding tokenization
  • dlib/dnn/solvers.h - AdamW optimizer addition

New examples:

  • examples/slm_basic_train_ex.cpp
  • examples/slm_advanced_train_ex.cpp
  • examples/slm_mixture_of_experts_ex.cpp
  • examples/slm_chatbot_ex.cpp
  • examples/slm_data.h - internal datasets for examples

Abstract documentation:

  • docs/layers_abstract.h - layer specifications and usage patterns
  • docs/transformer_abstract.h - transformer architecture documentation
  • docs/language_model_data_abstract.h - language modeling utility documentation
  • docs/solvers_abstract.h - AdamW optimizer specification

Extended documentation

For more details, see the dedicated repository: https://github.com/Cydral/Dlib-Transformer-extensions


This contribution establishes official transformer support in Dlib, extending the library into modern natural language processing while maintaining its core values of simplicity, performance, and production readiness. The groundwork laid here enables upcoming vision transformer implementations and multimodal architectures, positioning Dlib as a comprehensive framework for contemporary deep learning applications.
