A custom high-performance C++ inference engine for transformer-based language models. This engine is designed to run a 30 million parameter model trained on the TinyStories dataset.
This project is specifically developed for macOS. To build and run:
```bash
# Step 1: Build the project
make

# Step 2: Run inference
./a.out
```

| Platform | Status | Notes |
|---|---|---|
| macOS (Apple Silicon) | Supported | Uses NEON SIMD instructions |
| macOS (Intel) | Supported | Uses SSE/AVX SIMD instructions |
| Linux | Not Tested | May require Makefile adjustments for SIMD flags |
| Windows | Not Tested | May require Makefile adjustments for SIMD flags |
The Makefile uses -march=native, which enables CPU-specific SIMD instructions: NEON on Apple Silicon Macs and SSE/AVX on Intel Macs. For Linux or Windows, you may need to modify the Makefile to use SIMD flags appropriate for your CPU architecture.
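To confirm which SIMD path -march=native selects on a given machine, a small standalone check (not part of the engine's sources) can inspect the architecture macros the compiler defines:

```cpp
// simd_check.cpp -- standalone sketch, not part of the engine.
// Build with the project's flags, e.g.: g++ -std=c++17 -O3 -march=native simd_check.cpp
#include <cstdio>

int main() {
#if defined(__ARM_NEON)
    std::puts("NEON available (ARM / Apple Silicon)");
#endif
#if defined(__AVX2__)
    std::puts("AVX2 available");
#elif defined(__SSE2__)
    std::puts("SSE2 available");
#endif
#if !defined(__ARM_NEON) && !defined(__SSE2__) && !defined(__AVX2__)
    std::puts("No SIMD macros detected; check the -march flags");
#endif
    return 0;
}
```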
This project implements a complete inference pipeline for decoder-only transformer models. The engine features optimized matrix multiplication with SIMD instructions, a GPT-2 compatible Byte Pair Encoding tokenizer, and a modular architecture that separates tensor operations, model components, and tokenization.
| Feature | Description |
|---|---|
| SIMD Optimization | Matrix multiplication uses SIMD instructions (NEON or SSE/AVX) enabled via compiler flags |
| Tiled MatMul | Cache-efficient 32x32 tiling for improved memory access patterns (see the sketch after this table) |
| Multi-Head Attention | Full implementation with causal masking |
| Layer Normalization | Pre-normalization architecture support |
| BPE Tokenizer | GPT-2 compatible byte-level BPE tokenization |
| Positional Encodings | Sinusoidal positional embeddings |
| Temperature Sampling | Configurable sampling with temperature scaling |
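For illustration, a simplified version of the 32x32 tiled matrix multiplication listed above is sketched below; it shows only the technique, and the engine's actual Tensor::operator* may differ in details:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;

// Sketch of cache-friendly tiled matmul: C (n x p) = A (n x m) * B (m x p),
// all matrices stored row-major in flat vectors. C must already be sized to n * p.
void tiled_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n, std::size_t m, std::size_t p) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)
        for (std::size_t k0 = 0; k0 < m; k0 += TILE)
            for (std::size_t j0 = 0; j0 < p; j0 += TILE)
                // Process one 32x32 block at a time so the working set stays in cache.
                for (std::size_t i = i0; i < std::min(i0 + TILE, n); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, m); ++k) {
                        const float a = A[i * m + k];
                        for (std::size_t j = j0; j < std::min(j0 + TILE, p); ++j)
                            C[i * p + j] += a * B[k * p + j];
                    }
}
```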
```
Inference/
├── engine/
│ ├── tensor.hpp # Tensor class with matmul, softmax, LayerNorm
│ ├── tokenizer.hpp # GPT-2 BPE tokenizer implementation
│ ├── runner.hpp # Model, Transformer, Runner classes
│ ├── parser.hpp # Weight loading from .npy files
│ ├── npy.hpp # NumPy file parser (external library)
│ └── json.hpp # JSON parser (external library)
├── weights/
│ ├── embeddings.weight.npy
│ ├── out.weight.npy
│ ├── out.bias.npy
│ └── transforms.*.npy # Transformer block weights
├── gpt2_vocab.json # Tokenizer vocabulary
├── merges.txt # BPE merge rules
├── tokenizer.py # Python script to export tokenizer
├── export_tokenizer.py # Alternative tokenizer export script
├── convert.py # Model conversion utilities
├── run.cpp # Main entry point
├── Makefile # Build configuration
└── README.md # This file
```
The engine supports a decoder-only transformer with the following configuration:
| Parameter | Value |
|---|---|
| Parameters | ~30M |
| d_model | 256 |
| Number of Heads | 8 |
| Number of Blocks | 6 |
| d_k (per head) | 32 |
| Vocabulary Size | 50257 |
| Max Sequence Length | 128 |
| Training Dataset | TinyStories |
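Note that d_k is simply d_model divided by the number of heads (256 / 8 = 32). For reference, a hypothetical struct mirroring this table; the engine's actual configuration in runner.hpp may be organized differently:

```cpp
// Hypothetical illustration of the configuration above, not the engine's actual API.
struct ModelConfig {
    int vocab_size  = 50257;
    int d_model     = 256;
    int n_heads     = 8;
    int n_blocks    = 6;
    int max_seq_len = 128;
    int d_k() const { return d_model / n_heads; }  // 256 / 8 = 32 per head
};
```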
Requirements:
- C++17-compatible compiler (g++ or clang++)
- Python 3.x with the transformers library (for tokenizer export)
```bash
make
```

The Makefile uses the following optimization flags:
| Flag | Purpose |
|---|---|
| -std=c++17 | C++17 standard |
| -O3 | Maximum optimization level |
| -march=native | Enable CPU-specific SIMD instructions |
| -ffast-math | Aggressive floating-point optimizations |
| -funroll-loops | Loop unrolling for performance |
Before running inference, export the GPT-2 tokenizer vocabulary:
```bash
python tokenizer.py
```

This creates gpt2_vocab.json and merges.txt in the project root.
```bash
./a.out
```

The Runner class initializes the model with weights from the weights/ directory and performs autoregressive generation.
Edit the run() function in runner.hpp to modify:
| Parameter | Default | Description |
|---|---|---|
| prompt | "Hello, how are you?" | Input text for generation |
| max_new_tokens | 5 | Maximum tokens to generate |
| temperature | 0.8 | Sampling temperature (0 = greedy; see the sampling sketch after this table) |
| seq_len | 128 | Maximum context length |
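The temperature parameter above controls how the next token is drawn from the logits: dividing the logits by the temperature before the softmax flattens (T > 1) or sharpens (T < 1) the distribution, and T = 0 falls back to greedy argmax. A minimal sketch of this kind of temperature sampling, assuming a flat vector of logits for the last position; the engine's actual sampling code in runner.hpp may differ:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of temperature sampling over one row of logits (hypothetical helper,
// not the engine's actual API).
int sample_token(const std::vector<float>& logits, float temperature, std::mt19937& rng) {
    // temperature == 0 degenerates to greedy argmax.
    if (temperature <= 0.0f)
        return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());

    // Scale logits by 1/temperature, then take a numerically stable softmax.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;

    // Draw one token index from the resulting distribution.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```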
Example output:

```
Model config: vocab=50257, d_model=256, heads=8, blocks=6
Loaded vocab: 50257 tokens
Loaded merges: 50000 BPE merge rules
Model initialized!
Tokenized: 6 tokens
Generating...
after embeddings: 6x256
after block 0: 6x256
after block 1: 6x256
after block 2: 6x256
after block 3: 6x256
after block 4: 6x256
after block 5: 6x256
logits: 6x50257
=== Generated Text ===
Hello, how are you? Once upon a time...
======================
```
The Tensor class implements the following operations:
| Operation | Method | Description |
|---|---|---|
| Matrix Multiplication | operator* | Tiled matmul with 32x32 blocks |
| Addition | operator+ | Element-wise addition |
| Division | operator/ | Scalar division |
| Transpose | t() | Matrix transpose |
| Softmax | softmax() | Row-wise softmax with numerical stability (see the sketch after this table) |
| Layer Normalization | LayerNorm() | Normalization with learnable parameters |
| Causal Mask | mask() | Lower triangular mask for attention |
| Concatenation | concat_horizontal() | Horizontal tensor concatenation |
| Bias Addition | add_bias() | Add bias vector to each row |
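The numerical stability mentioned for softmax() usually comes from subtracting each row's maximum before exponentiating, so exp() never overflows. A minimal sketch of such a row-wise softmax over a row-major buffer; the actual Tensor::softmax() in tensor.hpp may differ:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of numerically stable row-wise softmax on a row-major (rows x cols) buffer.
void softmax_rows(std::vector<float>& data, std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float* row = data.data() + r * cols;
        // Subtracting the row max keeps exp() in a safe range without changing the result.
        const float row_max = *std::max_element(row, row + cols);
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            row[c] = std::exp(row[c] - row_max);
            sum += row[c];
        }
        for (std::size_t c = 0; c < cols; ++c) row[c] /= sum;
    }
}
```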
Weights are stored in NumPy .npy format and loaded at runtime. The naming convention follows:
```
transforms.{block}.{component}.{head}.{weight|bias}.npy
```
| Component | Description |
|---|---|
| q | Query projection |
| k | Key projection |
| v | Value projection |
| join | Output projection (after attention) |
| ffn.0 | FFN intermediate layer |
| ffn.3 | FFN output layer |
| norm1 | Pre-attention layer norm |
| norm2 | Pre-FFN layer norm |
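For illustration only, a hypothetical helper that assembles such a file name, assuming every component carries a head index as the pattern suggests (which may not hold for the ffn and norm weights); the actual path handling in parser.hpp may differ:

```cpp
#include <string>

// Hypothetical helper following the transforms.{block}.{component}.{head}.{weight|bias}.npy pattern.
std::string weight_path(int block, const std::string& component, int head, bool is_bias) {
    return "weights/transforms." + std::to_string(block) + "." + component + "." +
           std::to_string(head) + "." + (is_bias ? "bias" : "weight") + ".npy";
}

// Example: weight_path(0, "q", 3, false) -> "weights/transforms.0.q.3.weight.npy"
```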
The following optimizations contribute to inference performance:
| Optimization | Impact |
|---|---|
| Tiled matrix multiplication | Reduces cache misses |
| SIMD via -march=native | 2-4x speedup on modern CPUs |
| -ffast-math | Enables vectorization of floating-point ops |
| -funroll-loops | Reduces loop overhead |
| Const references | Avoids unnecessary copies |
| Reserve for vectors | Prevents reallocations |
The engine includes comprehensive error handling:
- Shape mismatch detection for all tensor operations (see the sketch after this list)
- Try-catch blocks around attention and forward passes
- Validation of weight dimensions during initialization
- Graceful handling of missing tokenizer files
- Detailed error messages with tensor shapes
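A minimal sketch of the kind of shape check described above, throwing an error that reports both operand shapes; the actual checks in tensor.hpp may be more extensive:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch: validate matmul operand shapes and report both shapes on mismatch.
void check_matmul_shapes(std::size_t a_rows, std::size_t a_cols,
                         std::size_t b_rows, std::size_t b_cols) {
    if (a_cols != b_rows) {
        throw std::runtime_error(
            "matmul shape mismatch: (" + std::to_string(a_rows) + "x" + std::to_string(a_cols) +
            ") * (" + std::to_string(b_rows) + "x" + std::to_string(b_cols) + ")");
    }
}
```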
Known limitations:
| Limitation | Notes |
|---|---|
| GPU Support | None; CPU-only implementation |
| Batch Size | Supports batch size of 1 |
| Precision | FP32 only (no FP16/INT8 quantization) |
| Context | Fixed maximum context length |
| Threading | Single-threaded execution |
External dependencies:
| Library | Purpose | Source |
|---|---|---|
| npy.hpp | NumPy file parsing | GitHub (external) |
| json.hpp | JSON parsing | GitHub (external) |
| transformers | Tokenizer export | HuggingFace (Python) |
This project is for educational and research purposes.
Acknowledgments:
- TinyStories dataset for model training
- HuggingFace transformers for tokenizer reference
- GPT-2 architecture as the model foundation
Note: AI assistance was used during development to locate and fix bugs, add error handling, and write boilerplate code. The npy.hpp and json.hpp files were not written by the author; they are external libraries obtained from GitHub. The tokenizer.hpp implementation was written by Claude Sonnet 4.6.