This project is an educational implementation of an encoder-only Transformer architecture, built entirely from scratch in PyTorch. It demonstrates a practical understanding of core Transformer mechanics, including tokenization, attention mechanisms, and model training workflows, all implemented without high-level libraries such as HuggingFace Transformers. The project also includes data preprocessing, a custom WordPiece tokenizer, and evaluation on real-world datasets.
```
.
├── data/
│   ├── datasets/
│   │   ├── data_science/
│   │   └── tiny_shakespeare/
│   ├── plots/
│   │   └── training_metrics_plot.png
│   ├── pre-trained_models/
│   │   └── model.safetensors
│   └── tokenizer/
│       └── wordpiece_tokenizer.json
│
├── src/
│   ├── model/
│   │   ├── attention.py
│   │   ├── model.py
│   │   └── utils.py
│   │
│   ├── tokenizer/
│   │   ├── from_lib.py
│   │   ├── tokenizer.py
│   │   ├── vocabluary_creation.py
│   │   └── tokenization_algorithms.py
│   │
│   └── training/
│       ├── dataset.py
│       ├── training.py
│       ├── testing.py
│       └── utils.py
```
Two publicly available text datasets were used for binary classification: **Tiny Shakespeare** and a collection of **Towards Data Science** articles.
Each dataset was:
- Converted to plain text
- Split by comma into sentences
- Balanced (50/50 class split)
- Divided into `train`, `val`, and `test` sets (60/20/20)
The final training set contains ~10,000 sentences (equally split between datasets).
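A minimal sketch of these preprocessing steps, assuming the pipeline described above (the file names, label ids, and helper names below are illustrative, not taken from the repository):

```python
import random

def build_splits(sentences: list[str], label: int, seed: int = 42):
    """Shuffle labeled sentences and divide them 60/20/20 into train/val/test."""
    random.seed(seed)
    examples = [(s.strip(), label) for s in sentences if s.strip()]
    random.shuffle(examples)
    n = len(examples)
    train_end, val_end = int(0.6 * n), int(0.8 * n)
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]

# Plain-text corpora are split on commas into "sentences", then balanced
# by truncating the larger class to the size of the smaller one (50/50 split).
tds = open("data_science.txt", encoding="utf-8").read().split(",")
shakespeare = open("tiny_shakespeare.txt", encoding="utf-8").read().split(",")
n = min(len(tds), len(shakespeare))
tds_train, tds_val, tds_test = build_splits(tds[:n], label=0)
shk_train, shk_val, shk_test = build_splits(shakespeare[:n], label=1)
train_set = tds_train + shk_train   # ~10,000 sentences in total
```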
Tokenization is based on the WordPiece algorithm (similar to BERT), with a vocabulary of 1000 tokens plus the following special tokens:
`[PAD]`, `[CLS]`, `[MASK]`, `[UNK]`
- Custom `Tokenizer` class that supports:
  - Tokenization into WordPiece subwords
  - Optional padding (`left` or `right`)
  - Tensor or raw list output
  - `[CLS]` token prepending
  - Generation of attention masks (excluding padding from self-attention)
- Vocabulary generation logic is provided in `from_lib.py`
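A hypothetical usage sketch of this interface; the import path, constructor arguments, and return keys are assumptions based on the feature list above, not necessarily the actual API:

```python
from src.tokenizer.tokenizer import Tokenizer  # assumed import path

# Assumed constructor signature; the real class may differ.
tokenizer = Tokenizer(vocab_file="data/tokenizer/wordpiece_tokenizer.json")

encoded = tokenizer(
    "Attention is all you need",
    padding="right",        # or "left"
    max_length=64,
    return_tensors=True,    # PyTorch tensors instead of raw Python lists
)

# Expected outputs: [CLS]-prefixed WordPiece token ids plus an attention mask
# that excludes [PAD] positions from self-attention.
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
```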
The model mimics the original Transformer encoder design (as proposed in *Attention Is All You Need*), with modifications for text classification.
- **Embedding Layer**
  - Token + positional encoding
  - Sinusoidal (non-trainable) positional encodings
- **Encoder Stack**: 6 layers, each with:
  - Multi-Head Self-Attention (8 heads)
  - Residual connection + LayerNorm
  - Feedforward block (2-layer MLP with GeLU)
  - Residual connection + LayerNorm
- **Classification Head**
  - Only the `[CLS]` token representation is used
  - MLP outputs logits for two classes (Shakespeare vs. TDS)
- Careful attention to tensor dimensions in:
  - Scaled dot-product attention
  - Head splitting/merging
  - Padding mask broadcasting
- Attention masks ensure `[PAD]` tokens are ignored in all attention operations
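A minimal sketch of the masked multi-head self-attention described above, illustrating head splitting/merging and padding-mask broadcasting (hyperparameters such as `d_model` and the module layout are illustrative; the repository's `attention.py` may be organized differently):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative 8-head self-attention with padding-mask broadcasting."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 = real token, 0 = [PAD]
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, n_heads, seq_len, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))

        # Scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (b, h, t, t)
        # Broadcast the padding mask over heads and query positions
        mask = attention_mask[:, None, None, :]                     # (b, 1, 1, t)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1)

        # Merge heads back: (batch, seq_len, d_model)
        context = (weights @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.out(context)
```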
```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="max", patience=4, threshold=1e-6)
```

- Metrics (train/val accuracy) plotted in `data/plots/training_metrics_plot.png`
- Training reaches a plateau after just 2–4 epochs
- Training accuracy approaches 100% by epoch 27
- Best model saved as `data/pre-trained_models/model.safetensors`
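A condensed sketch of how these pieces could fit into a training loop with validation-driven LR scheduling (the data loaders, model call signature, and metric bookkeeping are simplified assumptions, not the repository's actual `training.py`):

```python
import torch

def run_epoch(model, loader, loss_fn, optimizer=None, device="cpu"):
    """One pass over a DataLoader; trains when an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    correct, total = 0, 0
    with torch.set_grad_enabled(training):
        for input_ids, attention_mask, labels in loader:
            input_ids, attention_mask, labels = (
                t.to(device) for t in (input_ids, attention_mask, labels)
            )
            logits = model(input_ids, attention_mask)
            loss = loss_fn(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

for epoch in range(30):
    train_acc = run_epoch(model, train_loader, loss_fn, optimizer)
    val_acc = run_epoch(model, val_loader, loss_fn)
    scheduler.step(val_acc)  # mode="max": reduce LR when validation accuracy stops improving
```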
Evaluation was done on the held-out test set (20% of the data). Below is the classification report:
```
                      precision    recall  f1-score   support

Towards Data Science       0.99      0.98      0.99      2357
    Tiny Shakespeare       0.98      0.99      0.98      1798

            accuracy                           0.98      4155
           macro avg       0.98      0.98      0.98      4155
        weighted avg       0.98      0.98      0.98      4155
```
📌 Test Accuracy: 98.39%
This indicates the model generalizes well despite its simplicity and relatively small vocabulary.
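The report above follows scikit-learn's `classification_report` format; a sketch of how such a report could be produced on the held-out test set (reusing the assumed model and loader names from the training sketch above):

```python
import torch
from sklearn.metrics import classification_report

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for input_ids, attention_mask, labels in test_loader:
        logits = model(input_ids, attention_mask)
        all_preds.extend(logits.argmax(dim=-1).tolist())
        all_labels.extend(labels.tolist())

# Label order (0 = TDS, 1 = Shakespeare) is assumed for illustration.
print(classification_report(
    all_labels, all_preds,
    target_names=["Towards Data Science", "Tiny Shakespeare"],
))
```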
This project uses PDM (Python Development Master) for dependency management.
- Install `pdm`: `pip install pdm`
- Install dependencies: `pdm install`
- Run training/inference scripts as needed.
- In-depth understanding of Transformer internals
- Manual implementation of tokenizer
- Tensor manipulation and multi-head attention scaling
- Handling padding and masking in attention
- Custom training loop with learning rate scheduling and logging
- Real-world evaluation using accuracy, precision, recall, and F1-score
This project is licensed for educational and non-commercial use. Refer to LICENSE.
- Implement vocabulary learning using the WordPiece algorithm from scratch
- Train the model with Masked Language Modeling (MLM), à la BERT, on a different dataset
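For reference, BERT-style MLM replaces a random ~15% of input tokens and trains the model to predict the originals. This is not implemented in the repository; the sketch below shows the standard masking step, with token ids and special-token handling as assumptions:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                special_ids: set[int], mlm_prob: float = 0.15):
    """Return (masked_inputs, labels) for BERT-style masked language modeling."""
    labels = input_ids.clone()
    # Choose ~15% of non-special positions as prediction targets
    candidates = torch.rand(input_ids.shape) < mlm_prob
    for sid in special_ids:          # never mask [PAD], [CLS], etc.
        candidates &= input_ids != sid
    labels[~candidates] = -100       # ignored by nn.CrossEntropyLoss

    masked = input_ids.clone()
    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged
    replace = candidates & (torch.rand(input_ids.shape) < 0.8)
    masked[replace] = mask_token_id
    randomize = candidates & ~replace & (torch.rand(input_ids.shape) < 0.5)
    masked[randomize] = torch.randint(vocab_size, (int(randomize.sum()),))
    return masked, labels
```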