This project is an educational implementation of an encoder-only Transformer architecture, built entirely from scratch in PyTorch. It demonstrates a practical understanding of core Transformer mechanics, including tokenization, attention mechanisms, and model training workflows, all implemented without high-level libraries such as HuggingFace Transformers. The project also includes data preprocessing, a custom WordPiece tokenizer, and evaluation on real-world datasets.
```
.
├── data/
│   ├── datasets/
│   │   ├── data_science/
│   │   └── tiny_shakespeare/
│   ├── plots/
│   │   └── training_metrics_plot.png
│   ├── pre-trained_models/
│   │   └── model.safetensors
│   └── tokenizer/
│       └── wordpiece_tokenizer.json
│
├── src/
│   ├── model/
│   │   ├── attention.py
│   │   ├── model.py
│   │   └── utils.py
│   │
│   ├── tokenizer/
│   │   ├── from_lib.py
│   │   ├── tokenizer.py
│   │   ├── vocabluary_creation.py
│   │   └── tokenization_algorithms.py
│   │
│   └── training/
│       ├── dataset.py
│       ├── training.py
│       ├── testing.py
│       └── utils.py
```
Two publicly available text datasets were used for binary classification: **Tiny Shakespeare** and a collection of **Towards Data Science** articles.
Each dataset was:
- Converted to plain text
- Split by comma into sentences
- Balanced (50/50 class split)
- Divided into `train`, `val`, and `test` sets (60/20/20)
The final training set contains ~10,000 sentences (equally split between datasets).
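A minimal sketch of these preprocessing steps, assuming the pipeline described above (the file names, label ids, and helper names below are illustrative, not taken from the repository):

```python
import random

def build_splits(sentences: list[str], label: int, seed: int = 42):
    """Shuffle labeled sentences and divide them 60/20/20 into train/val/test."""
    random.seed(seed)
    examples = [(s.strip(), label) for s in sentences if s.strip()]
    random.shuffle(examples)
    n = len(examples)
    train_end, val_end = int(0.6 * n), int(0.8 * n)
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]

# Plain-text corpora are split on commas into "sentences", then balanced
# by truncating the larger class to the size of the smaller one (50/50 split).
tds = open("data_science.txt", encoding="utf-8").read().split(",")
shakespeare = open("tiny_shakespeare.txt", encoding="utf-8").read().split(",")
n = min(len(tds), len(shakespeare))
tds_train, tds_val, tds_test = build_splits(tds[:n], label=0)
shk_train, shk_val, shk_test = build_splits(shakespeare[:n], label=1)
train_set = tds_train + shk_train   # ~10,000 sentences in total
```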
Tokenization is based on the WordPiece algorithm (similar to BERT), with a vocabulary of 1000 tokens plus the following special tokens:
`[PAD]`, `[CLS]`, `[MASK]`, `[UNK]`
- Custom `Tokenizer` class that supports:
  - Tokenization into WordPiece subwords
  - Optional padding (`left` or `right`)
  - Tensor or raw list output
  - `[CLS]` token prepending
  - Generation of attention masks (excluding padding from self-attention)
- Vocabulary generation logic is provided in `from_lib.py`
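A hypothetical usage sketch of this interface; the import path, constructor arguments, and return keys are assumptions based on the feature list above, not necessarily the actual API:

```python
from src.tokenizer.tokenizer import Tokenizer  # assumed import path

# Assumed constructor signature; the real class may differ.
tokenizer = Tokenizer(vocab_file="data/tokenizer/wordpiece_tokenizer.json")

encoded = tokenizer(
    "Attention is all you need",
    padding="right",        # or "left"
    max_length=64,
    return_tensors=True,    # PyTorch tensors instead of raw Python lists
)

# Expected outputs: [CLS]-prefixed WordPiece token ids plus an attention mask
# that excludes [PAD] positions from self-attention.
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
```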
The model mimics the original Transformer encoder design (as proposed in *Attention Is All You Need*), with modifications for text classification.
- **Embedding Layer**
  - Token + positional encoding
  - Sinusoidal (non-trainable) positional encodings
- **Encoder Stack**: 6 layers, each with:
  - Multi-Head Self-Attention (8 heads)
  - Residual connection + LayerNorm
  - Feedforward block (2-layer MLP with GeLU)
  - Residual connection + LayerNorm
- **Classification Head**
  - Only the `[CLS]` token representation is used
  - MLP outputs logits for two classes (Shakespeare vs. TDS)
- Careful attention to tensor dimensions in:
  - Scaled dot-product attention
  - Head splitting/merging
  - Padding mask broadcasting
- Attention masks ensure `[PAD]` tokens are ignored in all attention operations
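A minimal sketch of the masked multi-head self-attention described above, illustrating head splitting/merging and padding-mask broadcasting (hyperparameters such as `d_model` and the module layout are illustrative; the repository's `attention.py` may be organized differently):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative 8-head self-attention with padding-mask broadcasting."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 = real token, 0 = [PAD]
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, n_heads, seq_len, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))

        # Scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (b, h, t, t)
        # Broadcast the padding mask over heads and query positions
        mask = attention_mask[:, None, None, :]                     # (b, 1, 1, t)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1)

        # Merge heads back: (batch, seq_len, d_model)
        context = (weights @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.out(context)
```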
```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="max", patience=4, threshold=1e-6)
```

- Metrics (train/val accuracy) plotted in `data/plots/training_metrics_plot.png`
- Training reaches a plateau after just 2–4 epochs
- Training accuracy approaches 100% by epoch 27
- Best model saved as `data/pre-trained_models/model.safetensors`
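A condensed sketch of how these pieces could fit into a training loop with validation-driven LR scheduling (the data loaders, model call signature, and metric bookkeeping are simplified assumptions, not the repository's actual `training.py`):

```python
import torch

def run_epoch(model, loader, loss_fn, optimizer=None, device="cpu"):
    """One pass over a DataLoader; trains when an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    correct, total = 0, 0
    with torch.set_grad_enabled(training):
        for input_ids, attention_mask, labels in loader:
            input_ids, attention_mask, labels = (
                t.to(device) for t in (input_ids, attention_mask, labels)
            )
            logits = model(input_ids, attention_mask)
            loss = loss_fn(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

for epoch in range(30):
    train_acc = run_epoch(model, train_loader, loss_fn, optimizer)
    val_acc = run_epoch(model, val_loader, loss_fn)
    scheduler.step(val_acc)  # mode="max": reduce LR when validation accuracy stops improving
```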
Evaluation was done on the held-out test set (20% of the data). Below is the classification report:
```
                      precision    recall  f1-score   support

Towards Data Science       0.99      0.98      0.99      2357
    Tiny Shakespeare       0.98      0.99      0.98      1798

            accuracy                           0.98      4155
           macro avg       0.98      0.98      0.98      4155
        weighted avg       0.98      0.98      0.98      4155
```
📌 Test Accuracy: 98.39%
This indicates the model generalizes well despite its simplicity and relatively small vocabulary.
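The report above follows scikit-learn's `classification_report` format; a sketch of how such a report could be produced on the held-out test set (reusing the assumed model and loader names from the training sketch above):

```python
import torch
from sklearn.metrics import classification_report

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for input_ids, attention_mask, labels in test_loader:
        logits = model(input_ids, attention_mask)
        all_preds.extend(logits.argmax(dim=-1).tolist())
        all_labels.extend(labels.tolist())

# Label order (0 = TDS, 1 = Shakespeare) is assumed for illustration.
print(classification_report(
    all_labels, all_preds,
    target_names=["Towards Data Science", "Tiny Shakespeare"],
))
```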
This project uses PDM (Python Development Master) for dependency management.
- Install `pdm`: `pip install pdm`
- Install dependencies: `pdm install`
- Run training/inference scripts as needed.
- In-depth understanding of Transformer internals
- Manual implementation of tokenizer
- Tensor manipulation and multi-head attention scaling
- Handling padding and masking in attention
- Custom training loop with learning rate scheduling and logging
- Real-world evaluation using accuracy, precision, recall, and F1-score
This project is licensed for educational and non-commercial use. Refer to LICENSE.
- Implement vocabulary learning using the WordPiece algorithm from scratch
- Train the model with Masked Language Modeling (MLM), à la BERT, on a different dataset
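For reference, BERT-style MLM replaces a random ~15% of input tokens and trains the model to predict the originals. This is not implemented in the repository; the sketch below shows the standard masking step, with token ids and special-token handling as assumptions:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                special_ids: set[int], mlm_prob: float = 0.15):
    """Return (masked_inputs, labels) for BERT-style masked language modeling."""
    labels = input_ids.clone()
    # Choose ~15% of non-special positions as prediction targets
    candidates = torch.rand(input_ids.shape) < mlm_prob
    for sid in special_ids:          # never mask [PAD], [CLS], etc.
        candidates &= input_ids != sid
    labels[~candidates] = -100       # ignored by nn.CrossEntropyLoss

    masked = input_ids.clone()
    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged
    replace = candidates & (torch.rand(input_ids.shape) < 0.8)
    masked[replace] = mask_token_id
    randomize = candidates & ~replace & (torch.rand(input_ids.shape) < 0.5)
    masked[randomize] = torch.randint(vocab_size, (int(randomize.sum()),))
    return masked, labels
```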