Transformer Encoder from Scratch

This project is a self-educational implementation of a Transformer-based encoder-only architecture, built entirely from scratch using PyTorch. It demonstrates a practical understanding of core Transformer mechanics, including tokenization, attention mechanisms, and model training workflows, all implemented without high-level libraries such as Hugging Face Transformers. The project also includes data preprocessing, a custom WordPiece tokenizer, and evaluation on real-world datasets.


📁 Project Structure

.
├── data/
│   ├── datasets/
│   │   ├── data_science/
│   │   └── tiny_shakespeare/
│   ├── plots/
│   │   └── training_metrics_plot.png
│   ├── pre-trained_models/
│   │   └── model.safetensors
│   └── tokenizer/
│       └── wordpiece_tokenizer.json
│
├── src/
│   ├── model/
│   │   ├── attention.py
│   │   ├── model.py
│   │   └── utils.py
│   │
│   ├── tokenizer/
│   │   ├── from_lib.py
│   │   ├── tokenizer.py
│   │   ├── vocabluary_creation.py
│   │   └── tokenization_algorithms.py
│   │
│   └── training/
│       ├── dataset.py
│       ├── training.py
│       ├── testing.py
│       └── utils.py

📚 Datasets

Two publicly available text datasets were used for binary classification:

  1. Tiny Shakespeare (source: Kaggle)

  2. Towards Data Science Articles (source: Kaggle)

Each dataset was:

  • Converted to plain text
  • Split by comma into sentences
  • Balanced (50/50 class split)
  • Divided into train, val, and test sets (60/20/20)

The final training set contains ~10,000 sentences (equally split between datasets).
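
The snippet below is a minimal sketch of this preprocessing, assuming sentences for both classes are already loaded as Python lists; the split_sentences and balanced_split helpers are illustrative and are not the project's actual scripts.

import random

def split_sentences(text: str) -> list[str]:
    # The README describes splitting the plain text by comma.
    return [s.strip() for s in text.split(",") if s.strip()]

def balanced_split(class_a: list[str], class_b: list[str], seed: int = 42):
    # Balance the two classes (50/50) by truncating to the smaller one.
    n = min(len(class_a), len(class_b))
    rng = random.Random(seed)
    samples = [(s, 0) for s in class_a[:n]] + [(s, 1) for s in class_b[:n]]
    rng.shuffle(samples)
    # 60/20/20 split into train, val, and test.
    n_train, n_val = int(0.6 * len(samples)), int(0.2 * len(samples))
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test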


🔤 Tokenizer

Tokenization is based on the WordPiece algorithm (similar to BERT), with a vocabulary of 1000 tokens plus the following special tokens:

  • [PAD]
  • [CLS]
  • [MASK]
  • [UNK]

Implementation Highlights (src/tokenizer/tokenizer.py):

  • Custom Tokenizer class that supports:

    • Tokenization into WordPiece subwords
    • Optional padding (left or right)
    • Tensor or raw list output
    • [CLS] token prepending
    • Generation of attention masks (excluding padding from self-attention)
  • Vocabulary generation logic is provided in from_lib.py
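
For illustration, here is a minimal sketch of the [CLS] prepending, padding, and attention-mask behaviour described above; the PAD/CLS ids and the pad_and_mask helper are hypothetical and do not mirror the actual Tokenizer API in src/tokenizer/tokenizer.py.

import torch

# Assumed toy ids; the real vocabulary lives in data/tokenizer/wordpiece_tokenizer.json.
PAD, CLS = 0, 1

def pad_and_mask(token_ids: list[list[int]], pad_left: bool = False):
    # Prepend [CLS], pad every sequence to the batch maximum, and build an
    # attention mask so [PAD] positions are excluded from self-attention.
    seqs = [[CLS] + ids for ids in token_ids]
    max_len = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        padding = [PAD] * (max_len - len(s))
        ids.append(padding + s if pad_left else s + padding)
        ones, zeros = [1] * len(s), [0] * (max_len - len(s))
        mask.append(zeros + ones if pad_left else ones + zeros)
    return torch.tensor(ids), torch.tensor(mask)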


🧠 Transformer Encoder Architecture

The model follows the original Transformer encoder design (as proposed in "Attention Is All You Need"), with modifications for text classification.

Architecture Overview:

  1. Embedding Layer

    • Token + positional encoding
    • Sinusoidal (non-trainable) positional encodings
  2. Encoder Stack: 6 layers of:

    • Multi-Head Self-Attention (8 heads)
    • Residual connection + LayerNorm
    • Feedforward block (2-layer MLP with GeLU)
    • Residual connection + LayerNorm
  3. Classification Head

    • Only the [CLS] token representation is used
    • MLP outputs logits for two classes (Shakespeare vs TDS)
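
As a rough sketch of one such encoder layer (not the repository's code, which implements attention by hand in src/model/attention.py), the block below uses PyTorch's built-in nn.MultiheadAttention; the model width d_model and feedforward size d_ff are assumed values.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Multi-head self-attention (8 heads) and a 2-layer GeLU MLP, each
        # wrapped in a residual connection + LayerNorm (post-norm, as in the
        # original Transformer).
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pad_mask: True at [PAD] positions, shape (batch, seq_len).
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))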

Technical Focus:

  • Careful attention to tensor dimensions in:

    • Scaled dot-product attention
    • Head splitting/merging
    • Padding mask broadcasting
  • Attention masks ensure [PAD] tokens are ignored in all attention operations
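
The following is a minimal, from-scratch version of that masked scaled dot-product step, shown only to illustrate the head and mask broadcasting; tensor names and shapes are assumptions, not the repository's exact implementation.

import math
import torch

def scaled_dot_product_attention(q, k, v, attn_mask):
    # q, k, v: (batch, heads, seq_len, d_head) after splitting d_model into heads.
    # attn_mask: (batch, seq_len) with 1 for real tokens and 0 for [PAD].
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Broadcast the padding mask over the head and query dimensions:
    # (batch, seq_len) -> (batch, 1, 1, seq_len).
    scores = scores.masked_fill(attn_mask[:, None, None, :] == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v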


⚙️ Training Setup

from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="max", patience=4, threshold=1e-6)

  • Metrics (train/val accuracy) plotted in data/plots/training_metrics_plot.png
  • Training reaches plateau after just 2–4 epochs
  • Training accuracy approaches 100% by epoch 27
  • Best model saved as: data/pre-trained_models/model.safetensors
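
A schematic epoch loop under this setup might look as follows; model, the data loaders, num_epochs, and the evaluate_accuracy helper are assumed to be defined elsewhere (e.g. in src/training/), so this is a sketch rather than the project's training.py.

# Minimal epoch loop using the loss_fn, optimizer, and scheduler defined above.
for epoch in range(num_epochs):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)  # assumed model signature
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

    val_acc = evaluate_accuracy(model, val_loader)
    # ReduceLROnPlateau with mode="max" lowers the LR when val accuracy stops improving.
    scheduler.step(val_acc)
    print(f"epoch {epoch}: val_acc={val_acc:.4f}")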

Figure: training and validation accuracy over epochs (data/plots/training_metrics_plot.png)


🧪 Evaluation

Evaluation was done on the held-out test set (20% of the data). Below is the classification report:

                precision    recall  f1-score   support

Towards Data Science      0.99      0.98      0.99      2357
   Tiny Shakespeare       0.98      0.99      0.98      1798

           accuracy                           0.98      4155
          macro avg       0.98      0.98      0.98      4155
       weighted avg       0.98      0.98      0.98      4155

📌 Test Accuracy: 98.39%

This indicates the model generalizes well despite its simplicity and relatively small vocabulary.
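
A report in this format can be produced with scikit-learn; the sketch below assumes y_true and y_pred have already been collected by running the trained model over the test set, and is not the project's testing.py.

from sklearn.metrics import accuracy_score, classification_report

# y_true and y_pred are assumed lists of class indices for the held-out test set.
target_names = ["Towards Data Science", "Tiny Shakespeare"]
print(classification_report(y_true, y_pred, target_names=target_names))
print(f"Test accuracy: {accuracy_score(y_true, y_pred):.2%}")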


📦 Installation

This project uses PDM (Python Development Master) for dependency management.

  1. Install pdm:

    pip install pdm
  2. Install dependencies:

    pdm install
  3. Run training/inference scripts as needed.


🧠 Skills Demonstrated

  • In-depth understanding of Transformer internals
  • Manual implementation of a WordPiece tokenizer
  • Tensor manipulation and multi-head attention scaling
  • Handling padding and masking in attention
  • Custom training loop with learning rate scheduling and logging
  • Real-world evaluation using accuracy, precision, recall, and F1-score

📜 License

This project is licensed for educational and non-commercial use. Refer to LICENSE.


🚀 Future Work

  • Implement vocabulary learning using the WordPiece algorithm from scratch
  • Train the model using Masked Language Modeling (MLM) (à la BERT) on a different dataset
