A PyTorch implementation of the Transformer architecture for English to Nepali translation, based on "Attention Is All You Need" (Vaswani et al., 2017).
```
.
├── dataset/
│   ├── dataset.py                    # Dataset loading and preprocessing
│   ├── data_cleaning.py              # Data cleaning utilities
│   └── __init__.py
│
├── english_nepali_translation_dataset/
│   ├── data-00000-of-00001.arrow     # Preprocessed dataset (Arrow format)
│   ├── dataset_info.json             # Dataset metadata
│   └── state.json                    # Dataset state information
│
├── training/
│   ├── config.py                     # Training configuration and hyperparameters
│   ├── train.py                      # Main training script
│   ├── utils.py                      # Training utilities (metrics, logging, etc.)
│   └── __init__.py
│
└── transformer_model/
    ├── blocks.py                     # Encoder/Decoder blocks
    ├── components.py                 # Core components (Embeddings, LayerNorm, etc.)
    ├── transformer.py                # Complete Transformer architecture
    └── __init__.py
```
- Full Transformer Implementation: Complete encoder-decoder architecture
- Multi-Head Attention: Parallel attention mechanisms for richer representations
- Positional Encoding: Sinusoidal positional embeddings (sketched after this list)
- Layer Normalization: Pre-LN architecture for stable training
- Modular Design: Easy to modify and extend individual components
- English-Nepali Translation: Specialized for English to Nepali language pairs
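
As a quick illustration of the sinusoidal positional encoding listed above, here is a minimal PyTorch sketch. The class name, default `max_len`, and tensor shapes are assumptions for illustration and may differ from the project's own implementation in `transformer_model/components.py`:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Hypothetical sketch: adds fixed sine/cosine position signals to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                         # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                         # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                          # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings, already scaled by sqrt(d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```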
- GPU (Strongly Recommended): NVIDIA CUDA-enabled GPU with ≥ 16 GB VRAM for effective training of the Transformer model. GPUs with 8 GB VRAM are generally insufficient for full training and may result in out-of-memory errors.
- CPU: CPU-only execution is possible but impractically slow for training.
- System Memory: Minimum 16 GB RAM; 32 GB or more recommended.
Due to hardware constraints, the model could not be fully trained. Consequently, the reported results and accuracy do not reflect the model’s full potential. Training on higher-end GPU hardware is expected to significantly improve performance.
- Clone the repository:

```bash
git clone https://github.com/SangamSilwal/EnNe-NMT-Transformer.git
cd EnNe-NMT-Transformer
```

- Install all dependencies:

```bash
pip install -r requirements.txt
```

- Run training:

```bash
python -m training.train
```

The training configuration lives in `config.py` inside the `training` directory and can be changed as per your needs:

```python
def get_config():
    return {
        "batch_size": _,
        "num_epochs": _,
        "lr": _,
        "seq_len": _,
        "d_model": _,
        "lang_src": "en",
        "lang_tgt": "ne",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": None,
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel",
    }
```

1. Input Layer
- Token embeddings scaled by √d_model
- Sinusoidal positional encoding
- Dropout for regularization
2. Encoder (6 layers)
- Multi-head self-attention (8 heads)
- Position-wise feed-forward network
- Residual connections + layer normalization (pre-LN; see the sketch after this list)
3. Decoder (6 layers)
- Masked multi-head self-attention
- Multi-head cross-attention to encoder
- Position-wise feed-forward network
- Residual connections + layer normalization
4. Output Layer
- Linear projection to vocabulary
- Log-softmax for probability distribution
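
To make the pre-LN residual pattern concrete, the sketch below shows one encoder block built on PyTorch's stock `nn.MultiheadAttention`. The project implements these pieces from scratch in `transformer_model/blocks.py` and `components.py`, so the class and argument names here are illustrative only, not the repository's API:

```python
from typing import Optional

import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    """Illustrative pre-LN block: x + Dropout(Sublayer(LayerNorm(x))) for each sublayer."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Self-attention sublayer: normalize first (pre-LN), then add the residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        x = x + self.dropout(attn_out)
        # Position-wise feed-forward sublayer with the same residual pattern.
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x
```

The decoder block follows the same pattern with an additional masked self-attention sublayer and a cross-attention sublayer over the encoder output.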
| Parameter | Value | Description |
|---|---|---|
| d_model | 512 | Model dimension |
| N | 6 | Number of encoder/decoder layers |
| h | 8 | Number of attention heads |
| d_ff | 2048 | Feed-forward dimension (4 × d_model) |
| dropout | 0.1 | Dropout rate |
| max_seq_len | 512 | Maximum sequence length |
| batch_size | 32 | Training batch size |
| learning_rate | 0.0001 | Initial learning rate |
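
For a rough sense of model size, the snippet below estimates the parameter count implied by these hyperparameters using the standard per-layer formulas for the original Transformer. It ignores embeddings, the output projection, and layer norms, so treat it as a back-of-the-envelope figure rather than a number measured from this repository:

```python
# Rough estimate only; not computed from this repository's code.
d_model, n_layers, d_ff = 512, 6, 2048

# Per layer: 4 attention projections (Q, K, V, output) + 2 feed-forward matrices, with biases.
attn_params = 4 * (d_model * d_model + d_model)
ff_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
per_encoder_layer = attn_params + ff_params
per_decoder_layer = 2 * attn_params + ff_params   # self-attention + cross-attention

total = n_layers * (per_encoder_layer + per_decoder_layer)
print(f"~{total / 1e6:.1f}M parameters in the encoder/decoder stacks (excluding embeddings)")
# -> ~44.1M
```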
The project uses an English-Nepali parallel corpus stored in Apache Arrow format.
- Text cleaning and normalization
- Tokenization (BPE/WordPiece)
- Vocabulary building
- Sequence padding and masking
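
Since the saved dataset directory follows the Hugging Face `datasets` on-disk layout (Arrow shard plus `dataset_info.json` and `state.json`), it can presumably be loaded with `load_from_disk`. The sketch below pairs that with a BPE tokenizer from the `tokenizers` library; the column name `"en"`, vocabulary size, and special tokens are assumptions for illustration, not values taken from the repository:

```python
from datasets import load_from_disk
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Load the preprocessed parallel corpus (Hugging Face `datasets` save_to_disk layout).
ds = load_from_disk("english_nepali_translation_dataset")

def train_tokenizer(sentences, vocab_size=30000):
    # Vocabulary size and special tokens are assumed defaults, not the project's settings.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size,
                         special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer

# Example (assuming an "en" column exists in the dataset):
# tokenizer_en = train_tokenizer(example["en"] for example in ds)
# tokenizer_en.save("tokenizer_en.json")
```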
- Optimizer: Adam with β1=0.9, β2=0.98, ε=10⁻⁹ (see the training sketch after this list)
- Loss Function: Cross-entropy with label smoothing
- Gradient Clipping: Max norm = 1.0
- Checkpointing: Save best model based on validation loss
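
A minimal sketch of this training setup is shown below, assuming a padding-token id and a `model(src, tgt)` call signature; the label-smoothing value of 0.1 is also an assumption, so check `training/train.py` for the authoritative version:

```python
import torch
import torch.nn as nn

def configure_training(model: nn.Module, pad_token_id: int, lr: float = 1e-4):
    """Adam + label-smoothed cross-entropy as described above (smoothing value assumed)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-9)
    criterion = nn.CrossEntropyLoss(ignore_index=pad_token_id, label_smoothing=0.1)
    return optimizer, criterion

def training_step(model, optimizer, criterion, src, tgt_in, tgt_out):
    # model(src, tgt_in) -> (batch, seq_len, vocab) logits; call signature is assumed.
    optimizer.zero_grad()
    logits = model(src, tgt_in)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # max norm = 1.0
    optimizer.step()
    return loss.item()
```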
Models are automatically saved:
```
weights/
├── model_epoch_1.pth
├── model_epoch_5.pth
├── model_epoch_10.pth
└── best_model.pth        # Best validation loss
```
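
A minimal checkpointing pattern that matches this layout might look like the following; the dictionary keys and tracked metric are assumptions rather than the exact format used by `training/train.py`:

```python
import torch

def save_checkpoint(model, optimizer, epoch, val_loss, path):
    # Checkpoint keys are illustrative; the project's format may differ.
    torch.save({"epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "val_loss": val_loss}, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"], ckpt["val_loss"]
```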
This model has been trained for a limited number of epochs due to hardware constraints. For production-quality translations, the model would benefit from:
- Training for 20-30+ epochs
- Larger batch sizes (64-128)
- More powerful GPU hardware (A100, V100, or multi-GPU setup)
- Extended training time (several days)
The current model demonstrates the architecture and training pipeline but may not achieve optimal translation quality. Consider this as a proof-of-concept implementation that can be scaled up with appropriate computational resources.
```python
# Small model (faster training, less memory)
config = {
    'd_model': 256,
    'num_layers': 4,
    'num_heads': 4,
    'd_ff': 1024,
    'batch_size': 64
}

# Large model (better performance, more memory)
config = {
    'd_model': 1024,
    'num_layers': 12,
    'num_heads': 16,
    'd_ff': 4096,
    'batch_size': 16
}
```

- Paper: Attention Is All You Need - Vaswani et al., 2017
- Tutorial: The Annotated Transformer
For questions or feedback:
- Email: sangamsilwal2062@gmail.com
- GitHub Issues: Create an issue