English-Nepali Neural Machine Translation

Built with Python, PyTorch, NumPy, Pandas, and Hugging Face libraries.

A PyTorch implementation of the Transformer architecture for English to Nepali translation, based on "Attention Is All You Need" (Vaswani et al., 2017).

[Figure: Transformer architecture diagram]

Project Structure

.
├── dataset/
│   ├── dataset.py           # Dataset loading and preprocessing
│   ├── data_cleaning.py     # Data cleaning utilities
│   └── __init__.py
│
├── english_nepali_translation_dataset/
│   ├── data-00000-of-00001.arrow    # Preprocessed dataset (Arrow format)
│   ├── dataset_info.json            # Dataset metadata
│   └── state.json                   # Dataset state information
│
├── training/
│   ├── config.py            # Training configuration and hyperparameters
│   ├── train.py             # Main training script
│   ├── utils.py             # Training utilities (metrics, logging, etc.)
│   └── __init__.py
│
└── transformer_model/
    ├── blocks.py            # Encoder/Decoder blocks
    ├── components.py        # Core components (Embeddings, LayerNorm, etc.)
    ├── transformer.py       # Complete Transformer architecture
    └── __init__.py

Features

  • Full Transformer Implementation: Complete encoder-decoder architecture
  • Multi-Head Attention: Parallel attention mechanisms for richer representations
  • Positional Encoding: Sinusoidal positional embeddings
  • Layer Normalization: Pre-LN architecture for stable training
  • Modular Design: Easy to modify and extend individual components
  • English-Nepali Translation: Specialized for English to Nepali language pairs

Requirements

Hardware

GPU (Strongly Recommended): NVIDIA CUDA-enabled GPU with ≥ 16 GB VRAM for effective training of the Transformer model. GPUs with 8 GB VRAM are generally insufficient for full training and may result in out-of-memory errors.

CPU: CPU-only execution is possible but impractically slow for training.

System Memory: Minimum 16 GB RAM, recommended 32 GB or more.

Limitation

Due to hardware constraints, the model could not be fully trained. Consequently, the reported results and accuracy do not reflect the model’s full potential. Training on higher-end GPU hardware is expected to significantly improve performance.

Installation

  1. Clone the repository:
git clone https://github.com/SangamSilwal/EnNe-NMT-Transformer.git
cd EnNe-NMT-Transformer
  2. Install all dependencies:
pip install -r requirements.txt

Quick Start

Training

python -m training.train

Custom Training Configuration

def get_config():
    return {
        "batch_size": 32,        # default from the hyperparameter table below
        "num_epochs": 20,        # example value; increase for better quality
        "lr": 1e-4,              # initial learning rate
        "seq_len": 512,          # maximum sequence length
        "d_model": 512,          # model dimension
        "lang_src": "en",
        "lang_tgt": "ne",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": None,         # set to a checkpoint name to resume training
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel",
    }
# You can change the configuration as per your need.
# config.py is inside the training directory.

Model Architecture

Transformer Components

1. Input Layer

  • Token embeddings scaled by √d_model
  • Sinusoidal positional encoding
  • Dropout for regularization
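
A minimal sketch of this input layer, following the standard formulation from the paper. Class names are illustrative and may not match transformer_model/components.py exactly.

import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Token embeddings scaled by sqrt(d_model), as in Vaswani et al. (2017)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding followed by dropout."""
    def __init__(self, d_model: int, seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # shape: (1, seq_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1), :]
        return self.dropout(x)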

2. Encoder (6 layers)

  • Multi-head self-attention (8 heads)
  • Position-wise feed-forward network
  • Residual connections + layer normalization
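
A sketch of a single encoder layer in the pre-LN style described above. It uses PyTorch's built-in nn.MultiheadAttention purely for illustration; the repository implements its own attention and encoder blocks in transformer_model/blocks.py.

import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Self-attention and feed-forward sub-layers, each wrapped in a
    residual connection with pre-layer normalization."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_key_padding_mask=None):
        # Pre-LN: normalize, attend, then add the residual
        h_ = self.norm1(x)
        attn_out, _ = self.self_attn(h_, h_, h_, key_padding_mask=src_key_padding_mask)
        x = x + self.dropout(attn_out)
        # Position-wise feed-forward with its own residual
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x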

3. Decoder (6 layers)

  • Masked multi-head self-attention
  • Multi-head cross-attention to encoder
  • Position-wise feed-forward network
  • Residual connections + layer normalization
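
The masked self-attention relies on a causal mask so that each target position can only attend to itself and earlier positions. A minimal sketch; the True-means-keep convention is an assumption and may differ from the repository's own mask helper.

import torch

def causal_mask(size: int) -> torch.Tensor:
    """Boolean mask that blocks attention to future positions
    (lower triangle allowed, upper triangle masked out)."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# Example for a length-4 target sequence:
# causal_mask(4) ->
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])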

4. Output Layer

  • Linear projection to vocabulary
  • Log-softmax for probability distribution
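
A minimal sketch of this projection head (the class name is illustrative). Because it returns log-probabilities, it pairs naturally with a negative log-likelihood style loss.

import torch
import torch.nn as nn

class ProjectionLayer(nn.Module):
    """Projects decoder outputs to vocabulary log-probabilities."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size)
        return torch.log_softmax(self.proj(x), dim=-1)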

Default Hyperparameters

Parameter       Value    Description
d_model         512      Model dimension
N               6        Number of encoder/decoder layers
h               8        Number of attention heads
d_ff            2048     Feed-forward dimension (4 × d_model)
dropout         0.1      Dropout rate
max_seq_len     512      Maximum sequence length
batch_size      32       Training batch size
learning_rate   0.0001   Initial learning rate
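
For scale, these defaults map directly onto the constructor of PyTorch's built-in nn.Transformer. The snippet below only illustrates the sizes; the repository assembles the model from its own modules in transformer_model/.

import torch.nn as nn

reference_model = nn.Transformer(
    d_model=512,            # model dimension
    nhead=8,                # attention heads
    num_encoder_layers=6,   # N
    num_decoder_layers=6,   # N
    dim_feedforward=2048,   # d_ff
    dropout=0.1,
    batch_first=True,
)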

Dataset

The project uses an English-Nepali parallel corpus stored in Apache Arrow format.

Preprocessing Pipeline

  1. Text cleaning and normalization
  2. Tokenization (BPE/WordPiece)
  3. Vocabulary building
  4. Sequence padding and masking
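
As an illustration of step 2, a BPE tokenizer can be trained with the Hugging Face tokenizers library and saved under the tokenizer_{0}.json naming pattern used in the config. The special-token names below are assumptions and may differ from those used in dataset/dataset.py.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(sentences, lang):
    """Train a BPE tokenizer on an iterable of sentences and save it
    as tokenizer_<lang>.json."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    tokenizer.save(f"tokenizer_{lang}.json")
    return tokenizer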

Training Features

  • Optimizer: Adam with β1=0.9, β2=0.98, ε=10⁻⁹
  • Loss Function: Cross-entropy with label smoothing
  • Gradient Clipping: Max norm = 1.0
  • Checkpointing: Save best model based on validation loss
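
A condensed sketch of this recipe, assuming the model's forward pass returns unnormalized logits (if the projection layer already applies log-softmax, nn.NLLLoss on the log-probabilities is the matching criterion). The actual loop lives in training/train.py, and the batch keys below are illustrative.

import torch
import torch.nn as nn

def make_optimizer_and_loss(model, pad_token_id):
    """Adam with the paper's betas/epsilon, plus label-smoothed cross-entropy.
    The 0.1 smoothing value is an assumption; the README only states that
    label smoothing is used."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_token_id, label_smoothing=0.1)
    return optimizer, loss_fn

def train_step(model, batch, optimizer, loss_fn):
    """Single update with gradient clipping at max norm 1.0."""
    optimizer.zero_grad()
    logits = model(batch["encoder_input"], batch["decoder_input"])  # (B, T, vocab)
    loss = loss_fn(logits.view(-1, logits.size(-1)), batch["label"].view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()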

Checkpoints

Models are automatically saved:

weights/
├── model_epoch_1.pth
├── model_epoch_5.pth
├── model_epoch_10.pth
└── best_model.pth          # Best validation loss
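
A hypothetical save/load helper in this spirit; together with the preload option in the config it allows training to resume from a saved epoch. The actual checkpointing logic lives in training/train.py.

import torch

def save_checkpoint(model, optimizer, epoch, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"]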

Important Note

This model has been trained for a limited number of epochs due to hardware constraints. For production-quality translations, the model would benefit from:

  • Training for 20-30+ epochs
  • Larger batch sizes (64-128)
  • More powerful GPU hardware (A100, V100, or multi-GPU setup)
  • Extended training time (several days)

The current model demonstrates the architecture and training pipeline but may not achieve optimal translation quality. Consider this as a proof-of-concept implementation that can be scaled up with appropriate computational resources.

Customization

Change Model Size

# Small model (faster training, less memory)
config = {
    'd_model': 256,
    'num_layers': 4,
    'num_heads': 4,
    'd_ff': 1024,
    'batch_size': 64
}

# Large model (better performance, more memory)
config = {
    'd_model': 1024,
    'num_layers': 12,
    'num_heads': 16,
    'd_ff': 4096,
    'batch_size': 16
}

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.03762

Contact

For questions or feedback, please open an issue on the GitHub repository.

