MultiFrame-LPR: Multi-Frame License Plate Recognition

Python 3.11 | PyTorch 2.10 | License: MIT

State-of-the-art multi-frame OCR system for low-resolution license plate recognition

Enhanced with Phase 1 Improvements: Beam Search, Label Smoothing, Larger Transformer, Cosine Annealing


🎯 Overview

MultiFrame-LPR is a deep learning system designed for the ICPR 2026 Challenge on Low-Resolution License Plate Recognition. It processes 5 consecutive frames per track to achieve robust character recognition even in challenging conditions.

Current Performance

| Version | Validation Accuracy | Key Features |
|---|---|---|
| Baseline | 77.90% | ResNet34 + Transformer + STN |
| Phase 1 Enhanced | 80-82% (target) | + Beam Search + Label Smoothing + Larger Model |

Key Achievements

  • ✅ 77.90% baseline accuracy with ResTranOCR
  • ✅ Multi-frame fusion with learned attention
  • ✅ Spatial alignment via STN (Spatial Transformer Network)
  • ✅ Transformer-based sequence modeling
  • ✅ End-to-end trainable with CTC loss
  • 🆕 Phase 1 Improvements: Beam search decoding, label smoothing, larger transformer (6 layers, 12 heads), cosine annealing scheduler


🆕 What's New in Phase 1

Recent Improvements (Expected +2-4% Accuracy)

1. Beam Search CTC Decoding ⭐⭐⭐

  • What: Maintains top-K hypotheses instead of greedy decoding
  • Benefit: Better global sequence selection (+1-2% accuracy)
  • Config: USE_BEAM_SEARCH = True, BEAM_WIDTH = 5

2. Label Smoothing ⭐⭐⭐

  • What: Prevents overconfidence on training data
  • Benefit: Better generalization and calibrated confidence (+0.5-1% accuracy)
  • Config: USE_LABEL_SMOOTHING = True, LABEL_SMOOTHING = 0.1

3. Larger Transformer ⭐⭐

  • What: 6 layers, 12 heads (from 3 layers, 8 heads)
  • Benefit: Better capacity for complex patterns (+1-2% accuracy)
  • Config: TRANSFORMER_LAYERS = 6, TRANSFORMER_HEADS = 12

4. Cosine Annealing with Warm Restarts ⭐⭐⭐

  • What: Periodic learning rate restarts
  • Benefit: Escape local minima, better convergence (+0.5-1% accuracy)
  • Config: USE_COSINE_ANNEALING = True, T_0 = 10, T_MULT = 2

5. Test-Time Augmentation (TTA)

  • What: Average predictions over augmented versions
  • Benefit: More robust predictions (+0.5-1.5% accuracy)
  • Usage: predict_with_tta() function available in src/utils/postprocess.py

✨ Features

Core Features

  • Multi-Frame Processing: Leverages temporal information from 5 frames
  • Spatial Alignment: STN automatically corrects rotation, scale, and perspective
  • Attention Fusion: Learns to weight frames by quality
  • Transformer Encoder: Captures character dependencies
  • CTC Decoding: No character-level alignment needed
  • 🆕 Beam Search: Better sequence decoding
  • 🆕 Label Smoothing: Improved generalization

Technical Features

  • Mixed precision training (AMP)
  • Data augmentation pipeline
  • Synthetic LR generation from HR images
  • 70/30 train/validation split (improved from 90/10)
  • Comprehensive logging and checkpointing
  • 🆕 Cosine annealing scheduler with warm restarts
  • 🆕 Larger transformer (~45M total model parameters)

🏗️ Model Architecture

ResTranOCR Pipeline (Enhanced)

5 Frames → STN → ResNet-34 → Attention Fusion → Transformer (6 layers, 12 heads) → CTC Head → Beam Search → Text

Components

  1. STN (Spatial Transformer Network): Geometric alignment
  2. ResNet-34: Visual feature extraction (modified strides for OCR)
  3. AttentionFusion: Multi-frame fusion with learned weights
  4. Transformer: 6-layer encoder with 12 attention heads (enhanced)
  5. CTC Head: Character classification with blank token
  6. 🆕 Beam Search Decoder: Improved sequence decoding
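
The attention fusion step (component 3) can be pictured as a softmax-weighted average over the five per-frame feature vectors. The sketch below is only an illustration in plain Python: the function name `fuse_frames` is hypothetical, and in the real model the per-frame scores would presumably come from a learned layer rather than being passed in by hand.

```python
import math

def fuse_frames(features, scores):
    """Fuse per-frame feature vectors using softmax weights.

    features: one feature vector (list of floats) per frame.
    scores:   one quality logit per frame (assumed to be produced by a learned layer).
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

# Equal scores reduce to a plain average of the frames.
fused, weights = fuse_frames([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
print(fused, weights)
```

A frame with a higher score receives a larger weight, so sharper frames can dominate the fused representation.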

Total Parameters: ~45M (increased from 31M for better capacity)

📖 Detailed Architecture: See explain_model.md

📖 Improvement Guide: See suggest_improve_model.md


📦 Requirements

System Requirements

  • OS: Windows, Linux, or macOS
  • GPU: NVIDIA GPU with CUDA support (recommended)
    • Minimum: 6GB VRAM (increased for larger model)
    • Recommended: 8GB+ VRAM (RTX 3060 or better)
  • RAM: 16GB+ recommended
  • Storage: 10GB+ free space

Software Requirements

  • Python: 3.11.x (strictly required)
  • CUDA: 11.8 or 12.x (for GPU acceleration)
  • Package Manager: uv (recommended) or pip

🚀 Installation

Option 1: Using UV Package Manager (Recommended)

Step 1: Install UV

Windows (PowerShell):

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Linux/macOS:

curl -LsSf https://astral.sh/uv/install.sh | sh

Step 2: Clone Repository

git clone https://github.com/yourusername/MultiFrame-LPR.git
cd MultiFrame-LPR

Step 3: Install Dependencies

# Create virtual environment and install dependencies
uv sync

# Activate virtual environment
# Windows
.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate

Option 2: Using Pip

Step 1: Clone Repository

git clone https://github.com/yourusername/MultiFrame-LPR.git
cd MultiFrame-LPR

Step 2: Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows
venv\Scripts\activate

# Linux/macOS
source venv/bin/activate

Step 3: Install PyTorch

CUDA 12.8 (NVIDIA GPU):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

CUDA 11.8 (older NVIDIA GPU):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

CPU Only (no GPU):

pip install torch torchvision

Step 4: Install Other Dependencies

pip install albumentations opencv-python tqdm numpy matplotlib pandas seaborn pillow

✅ Verify Installation

python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}')"

Expected output:

PyTorch: 2.10.0+cu128
CUDA Available: True

Run quick tests:

# Run model sanity tests
python -X utf8 test/quick_test.py

# Run dataset tests
python -X utf8 test/test_dataset.py

📁 Dataset Preparation

Dataset Structure

Organize your data following this structure:

data/
├── train/
│   ├── Scenario-A/
│   │   ├── Brazilian/
│   │   │   └── track_00001/
│   │   │       ├── lr-001.png (or .jpg)
│   │   │       ├── lr-002.png
│   │   │       ├── lr-003.png
│   │   │       ├── lr-004.png
│   │   │       ├── lr-005.png
│   │   │       ├── hr-001.png (optional, for synthetic LR)
│   │   │       ├── hr-002.png
│   │   │       ├── hr-003.png
│   │   │       ├── hr-004.png
│   │   │       ├── hr-005.png
│   │   │       └── annotations.json
│   │   └── Mercosur/
│   │       └── track_00002/
│   │           └── ...
│   └── Scenario-B/
│       ├── Brazilian/
│       └── Mercosur/
└── public_test/  (optional, for testing)
    └── track_xxxxx/
        ├── lr-001.png (or .jpg)
        ├── lr-002.png
        ├── lr-003.png
        ├── lr-004.png
        └── lr-005.png
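
Before training, it can help to confirm that every track folder actually holds all five low-resolution frames. The following is a minimal sketch that assumes only the layout shown above; `find_tracks` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

def find_tracks(root, frames=5):
    """Return track directories that contain lr-001 ... lr-00N as PNG or JPG."""
    tracks = []
    for track_dir in sorted(Path(root).rglob("track_*")):
        if not track_dir.is_dir():
            continue
        has_all_frames = all(
            any((track_dir / f"lr-{i:03d}{ext}").exists() for ext in (".png", ".jpg"))
            for i in range(1, frames + 1)
        )
        if has_all_frames:
            tracks.append(track_dir)
    return tracks

if __name__ == "__main__":
    complete = find_tracks("data/train")
    print(f"{len(complete)} complete tracks found")
```

Tracks that are missing a frame are silently skipped here; a stricter check could report them instead.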

Annotations Format

Each annotations.json file should contain:

{
  "plate_text": "ABC1234",
  "plate_layout": "Brazilian",
  "corners": {}
}

Character Set

Supported characters: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ (36 characters)
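
A quick way to catch labelling problems is to check each annotation against the supported character set. This sketch assumes only the JSON fields shown above; `parse_annotation` is a hypothetical helper, not a function from the repository.

```python
import json

CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def parse_annotation(text):
    """Parse one annotations.json payload and validate its plate text."""
    record = json.loads(text)
    plate = record["plate_text"]
    invalid = sorted(set(plate) - set(CHARS))
    if invalid:
        raise ValueError(f"unsupported characters in plate_text: {invalid}")
    return plate, record.get("plate_layout")

sample = '{"plate_text": "ABC1234", "plate_layout": "Brazilian", "corners": {}}'
print(parse_annotation(sample))
```

Lowercase letters or separators such as "-" raise a `ValueError`, since they fall outside the 36-character set.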

Sampling Data

For quick experiments, use the sampling_data subset (200 tracks, 1000 images):

# Generate sampling data (if not already done)
python -X utf8 create_sample_data.py

# Training will be ~10x faster on sampling_data

🏃 Quick Start

1. Quick Test on Sampling Data (Recommended First)

# Test on small dataset (fast, ~10 minutes)
python train.py --data-root sampling_data/train --epochs 10 -n quick_test

2. Train with Default Settings (Full Data)

# Delete old validation split to use new 70/30 split
rm -f data/val_tracks.json

# Train with Phase 1 improvements
python train.py

This will:

  • Use 70/30 train/val split (improved from 90/10)
  • Train ResTranOCR with enhanced Phase 1 features
  • Use larger transformer (6 layers, 12 heads)
  • Apply label smoothing and beam search
  • Save best model to results/restran_best.pth
  • Generate submission file

3. Train with Custom Settings

python train.py \
  --epochs 50 \
  --batch-size 32 \
  --lr 0.001 \
  --transformer-layers 6 \
  --transformer-heads 12 \
  -n my_experiment

4. Submission Mode (Full Training)

python train.py --submission-mode -n final_submission

This will:

  • Train on entire dataset (no validation split)
  • Generate predictions for test data
  • Save to results/submission_final_submission_final.txt

🎓 Training

Basic Training

python train.py

Output:

  • Checkpoints: results/restran_best.pth
  • Submission: results/submission_restran.txt
  • Logs: Console output

Advanced Training Options

Custom Hyperparameters

python train.py \
  --epochs 50 \
  --batch-size 64 \
  --lr 5e-4 \
  --transformer-heads 12 \
  --transformer-layers 6 \
  --num-workers 10 \
  --seed 42

Augmentation Levels

Full augmentation (default, recommended):

python train.py --aug-level full

Light augmentation (faster, less aggressive):

python train.py --aug-level light

Custom Output Directory

python train.py --output-dir experiments/exp_001

Training Configuration

Default hyperparameters in configs/config.py:

# Model Architecture (Phase 1 Enhanced)
TRANSFORMER_HEADS = 12        # Increased from 8
TRANSFORMER_LAYERS = 6        # Increased from 3
TRANSFORMER_FF_DIM = 2048
TRANSFORMER_DROPOUT = 0.1

# Training Hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 5e-4
EPOCHS = 30
GRAD_CLIP = 5.0
SPLIT_RATIO = 0.7            # 70/30 split (improved from 90/10)

# Phase 1 Improvements
USE_LABEL_SMOOTHING = True    # Enable label smoothing
LABEL_SMOOTHING = 0.1         # Smoothing factor
USE_BEAM_SEARCH = True        # Enable beam search
BEAM_WIDTH = 5                # Beam width
USE_COSINE_ANNEALING = True   # Cosine annealing scheduler
T_0 = 10                      # First cycle lasts 10 epochs
T_MULT = 2                    # Each later cycle is 2x longer than the previous one

CLI arguments override config values.
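
One common way to get this behaviour is to give every flag a default of `None` and fall back to the config value when the flag is absent. The sketch below illustrates that pattern only; `DEFAULTS` and `resolve_settings` are hypothetical names, and the actual logic in train.py may differ.

```python
import argparse

# Values copied from the config listing above.
DEFAULTS = {"batch_size": 64, "lr": 5e-4, "epochs": 30}

def resolve_settings(argv):
    """Merge CLI flags over config defaults; flags that were not passed stay None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--lr", type=float, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    args = vars(parser.parse_args(argv))
    # Only flags the user actually supplied replace the config defaults.
    return {
        key: args[key] if args[key] is not None else value
        for key, value in DEFAULTS.items()
    }

print(resolve_settings(["--batch-size", "32"]))
```

Here only `batch_size` changes; `lr` and `epochs` keep their config values.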

Monitoring Training

Training progress shows:

  • Epoch number and progress bar
  • Training loss
  • Validation loss and accuracy
  • Learning rate
  • Best model checkpoints

Example output:

Epoch 1/30: 100%|██████████| 313/313 [02:15<00:00, 2.31it/s]
Train Loss: 2.3456 | Val Loss: 1.9876 | Val Acc: 48.23% | LR: 5.00e-04
  ⭐ Saved Best Model: results/restran_best.pth (48.23%)

Epoch 2/30: 100%|██████████| 313/313 [02:14<00:00, 2.33it/s]
Train Loss: 1.8765 | Val Loss: 1.6543 | Val Acc: 55.10% | LR: 4.50e-04
  ⭐ Saved Best Model: results/restran_best.pth (55.10%)

🧪 Evaluation & Testing

Test Trained Model

python test/test_model.py \
  --checkpoint results/restran_best.pth \
  --data-root data/public_test \
  --output-file predictions.txt \
  --batch-size 32

With visualizations:

python test/test_model.py \
  --checkpoint results/restran_best.pth \
  --data-root data/public_test \
  --output-file predictions.txt \
  --visualize

This generates:

  • Confidence distribution histogram
  • Confidence statistics
  • Prediction length distribution

Evaluate Predictions

Compare predictions against ground truth:

python test/evaluate.py \
  --predictions predictions.txt \
  --ground-truth data/train \
  --output-errors errors.csv \
  --verbose

Output metrics:

  • Exact match accuracy
  • Character-level accuracy
  • Average edit distance
  • Confidence scores (correct vs wrong)
  • Error analysis (top mistakes)
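
For reference, the first three metrics can be computed as in the sketch below. This is an illustration rather than the code in test/evaluate.py; in particular, defining character-level accuracy as 1 minus total edit distance over total ground-truth characters is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def evaluate(preds, truths):
    """Compute exact match, character accuracy (assumed definition), and mean edit distance."""
    n = len(truths)
    dists = [edit_distance(p, t) for p, t in zip(preds, truths)]
    return {
        "exact_match": sum(p == t for p, t in zip(preds, truths)) / n,
        "char_accuracy": 1 - sum(dists) / sum(len(t) for t in truths),
        "avg_edit_distance": sum(dists) / n,
    }

print(evaluate(["ABC1234", "AB1Z3"], ["ABC1234", "AB123"]))
```

In this example one of two plates matches exactly and the other differs by a single substitution.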

Quick Sanity Tests

# Test model initialization and forward pass
python -X utf8 test/quick_test.py

# Test dataset loading
python -X utf8 test/test_dataset.py

# Test train/val split
python -X utf8 test_split.py

🔬 Phase 1 Improvements Explained

1. Beam Search Decoding

Before (Greedy):

Position: 0    1    2    3    4    5
Char:     A -> A -> B -> 1 -> 2 -> 3
          ↓    ↓    ↓    ↓    ↓    ↓
Output:      A    B    1    2    3

Greedy decoding makes the locally optimal choice at each position, then collapses repeated characters.
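
The greedy rule can be written in a few lines: take the best class per position, merge consecutive repeats, then drop blanks. The sketch assumes the blank token sits at index 0 and the characters follow in the order of the supported character set; `greedy_ctc_decode` is a hypothetical name.

```python
from itertools import groupby

CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
BLANK = 0  # assumption: index 0 is the CTC blank, characters start at index 1

def greedy_ctc_decode(frame_ids):
    """Collapse consecutive repeated indices, then remove blanks."""
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    return "".join(CHARS[idx - 1] for idx in collapsed if idx != BLANK)

# Per-position argmax for the diagram above: A A B 1 2 3
print(greedy_ctc_decode([11, 11, 12, 2, 3, 4]))  # prints AB123
```

A genuinely doubled character such as "AA" only survives when a blank separates the two occurrences.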

After (Beam Search, width=5):

Maintains top-5 sequences:
1. "AB123" (score: -2.1) ← Best
2. "AB1Z3" (score: -2.4)
3. "A8123" (score: -2.7)
4. "AB1Z8" (score: -3.1)
5. "AB12S" (score: -3.3)

Considers global sequence probability.

Implementation: Automatically enabled in validation/inference. Set USE_BEAM_SEARCH = True in config.
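
To make the difference from greedy decoding concrete, here is a minimal CTC prefix beam search in plain Python. It is a sketch under assumptions (blank at index 0, raw probabilities instead of log-probabilities, no length normalisation) and is not the implementation in src/utils/postprocess.py.

```python
from collections import defaultdict

BLANK = 0  # assumption: index 0 is the CTC blank; characters start at index 1

def ctc_beam_search(probs, alphabet, beam_width=5):
    """probs[t][c] is the probability of class c at time step t."""
    # Each prefix keeps two masses: paths ending in blank, paths ending in a character.
    beams = {(): (1.0, 0.0)}
    for step in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_char) in beams.items():
            total = p_blank + p_char
            for c, p in enumerate(step):
                if c == BLANK:
                    candidates[prefix][0] += total * p
                elif prefix and prefix[-1] == c:
                    # A repeated class only starts a new character after a blank.
                    candidates[prefix][1] += p_char * p
                    candidates[prefix + (c,)][1] += p_blank * p
                else:
                    candidates[prefix + (c,)][1] += total * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(mass) for prefix, mass in ranked[:beam_width]}
    best = max(beams, key=lambda prefix: sum(beams[prefix]))
    return "".join(alphabet[c - 1] for c in best)

# Two time steps, classes [blank, "A"]: greedy picks blank twice and returns "",
# but the paths that collapse to "A" carry more total probability (0.64 vs 0.36).
print(ctc_beam_search([[0.6, 0.4], [0.6, 0.4]], alphabet="A"))
```

Because each prefix sums the probability of every alignment that collapses to it, the search can prefer a sequence that never wins any single time step outright.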


2. Label Smoothing

Before (Hard Labels):

Target: "A"
Probability: [0.0, 1.0, 0.0, 0.0, ...]  # One-hot
             [blank, A,   B,   C,  ...]

Model becomes overconfident.

After (Smoothed Labels, α=0.1):

Target: "A"
Probability: [0.003, 0.903, 0.003, 0.003, ...]  # 37 classes: 1 - α + α/37 ≈ 0.903
             [blank,   A,     B,     C,   ...]

Encourages less extreme predictions, better generalization.

Implementation: Automatically used during training. Set USE_LABEL_SMOOTHING = True in config.
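
The smoothed target can be reproduced with a few lines of arithmetic. The sketch below uses the uniform-smoothing convention (the α mass is spread over all classes, including the target) and assumes 37 classes, meaning 36 characters plus blank. How the trainer combines smoothing with CTC loss is not shown here.

```python
def smooth_one_hot(target_idx, num_classes, alpha=0.1):
    """Return a label-smoothed distribution: (1 - alpha) * one_hot + alpha / K."""
    off_value = alpha / num_classes
    dist = [off_value] * num_classes
    dist[target_idx] = 1.0 - alpha + off_value
    return dist

# 37 classes is an assumption: 36 characters plus the blank token.
dist = smooth_one_hot(target_idx=1, num_classes=37, alpha=0.1)
print(round(dist[1], 3), round(dist[0], 3))  # prints 0.903 0.003
```

The entries still sum to 1, so the result remains a valid probability distribution.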


3. Larger Transformer

Before:

  • 3 layers, 8 heads
  • 31M parameters
  • Limited capacity

After:

  • 6 layers, 12 heads
  • 45M parameters
  • Better pattern learning

Benefits:

  • Deeper context understanding
  • Better long-range dependencies
  • More expressive power

Trade-off: Training is roughly 1.3x slower, in exchange for the expected +1-2% accuracy gain.


4. Cosine Annealing with Warm Restarts

Before (OneCycleLR):

LR
│     ╱╲
│    ╱  ╲
│   ╱    ╲___
└──────────── Epochs

Single cycle, may get stuck.

After (Cosine Annealing):

LR
│  ╱╲    ╱╲      ╱╲
│ ╱  ╲  ╱  ╲    ╱  ╲
│╱    ╲╱    ╲  ╱    ╲
└────────────────────── Epochs
  T_0   T_0*2  T_0*4

Periodic restarts help escape local minima.

Implementation: Set USE_COSINE_ANNEALING = True in config.
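
The schedule itself is a short formula. This sketch evaluates the per-epoch learning rate using the config values T_0 = 10, T_MULT = 2, ETA_MIN = 1e-6 and a base rate of 5e-4. `cosine_warm_restart_lr` is a hypothetical helper, and the sketch assumes the scheduler is stepped once per epoch.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=5e-4, eta_min=1e-6, t_0=10, t_mult=2):
    """Learning rate at a 0-based epoch under cosine annealing with warm restarts."""
    cycle_len, t_cur = t_0, epoch
    while t_cur >= cycle_len:  # walk past the cycles that are already complete
        t_cur -= cycle_len
        cycle_len *= t_mult
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)

for epoch in (0, 5, 9, 10, 29):
    print(epoch, f"{cosine_warm_restart_lr(epoch):.2e}")
```

With these values the first cycle covers epochs 0-9 and the second covers epochs 10-29, so a default 30-epoch run sees exactly one restart, at epoch 10.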


5. Test-Time Augmentation (TTA)

Usage:

from src.utils.postprocess import predict_with_tta

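# Load model (load_model, images, and idx2char are not defined in this snippet;
# they are assumed to come from your own inference code)
model = load_model('results/restran_best.pth')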

# Predict with TTA (5 augmented versions)
results = predict_with_tta(
    model,
    images,
    idx2char,
    num_augments=5,
    use_beam_search=True,
    beam_width=5
)

Benefits:

  • More robust predictions
  • +0.5-1.5% accuracy
  • Slight increase in inference time
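
The averaging step behind TTA can be sketched independently of the model: run each augmented view through the network, then average the per-time-step class probabilities before decoding. The function below is a hypothetical illustration of that averaging only; it does not reproduce the internals of predict_with_tta().

```python
def average_predictions(views):
    """Average class probabilities across augmented views.

    views: list of [T][C] probability tables, one per augmented input,
           all with the same number of time steps T and classes C.
    """
    n = len(views)
    steps = len(views[0])
    classes = len(views[0][0])
    return [
        [sum(view[t][c] for view in views) / n for c in range(classes)]
        for t in range(steps)
    ]

# Two views that disagree on a single time step.
averaged = average_predictions([[[0.2, 0.8]], [[0.6, 0.4]]])
print(averaged)
```

The averaged table can then be passed to a greedy or beam search decoder like any single prediction.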

⚙️ Configuration

Config File: configs/config.py

Key parameters:

# Model Architecture (Phase 1 Enhanced)
MODEL_TYPE = "restran"
USE_STN = True
TRANSFORMER_HEADS = 12          # Phase 1: Increased
TRANSFORMER_LAYERS = 6          # Phase 1: Increased
TRANSFORMER_FF_DIM = 2048
TRANSFORMER_DROPOUT = 0.1

# Data
DATA_ROOT = "data/train"
IMG_HEIGHT = 32
IMG_WIDTH = 128
CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Training
BATCH_SIZE = 64
LEARNING_RATE = 5e-4
EPOCHS = 30
GRAD_CLIP = 5.0
SPLIT_RATIO = 0.7               # Phase 1: 70/30 split

# Phase 1 Improvements
USE_LABEL_SMOOTHING = True      # Label smoothing
LABEL_SMOOTHING = 0.1
USE_BEAM_SEARCH = True          # Beam search decoding
BEAM_WIDTH = 5
USE_COSINE_ANNEALING = True     # Cosine annealing scheduler
T_0 = 10
T_MULT = 2
ETA_MIN = 1e-6

Modifying Configuration

Option 1: Edit configs/config.py (permanent changes)

Option 2: Use CLI arguments (temporary overrides)

python train.py --batch-size 32 --epochs 50 --lr 0.001 --transformer-layers 6 --transformer-heads 12

Option 3: Disable Phase 1 features (if needed)

# In configs/config.py
USE_LABEL_SMOOTHING = False  # Disable label smoothing
USE_BEAM_SEARCH = False      # Use greedy decoding
USE_COSINE_ANNEALING = False # Use OneCycleLR
TRANSFORMER_LAYERS = 3       # Use smaller model
TRANSFORMER_HEADS = 8

📂 Project Structure

MultiFrame-LPR/
├── configs/
│   ├── config.py              # Configuration with Phase 1 settings
│   └── __init__.py
├── src/
│   ├── data/
│   │   ├── dataset.py         # MultiFrameDataset (PNG+JPG support, 70/30 split)
│   │   ├── transforms.py      # Augmentation pipelines
│   │   └── __init__.py
│   ├── models/
│   │   ├── components.py      # STN, Fusion, ResNet, PositionalEncoding
│   │   ├── restran.py         # ResTranOCR model (6 layers, 12 heads)
│   │   └── __init__.py
│   ├── training/
│   │   ├── trainer.py         # Trainer (Phase 1: Label smoothing, beam search)
│   │   └── __init__.py
│   ├── utils/
│   │   ├── common.py          # seed_everything
│   │   ├── postprocess.py     # Phase 1: Beam search, TTA
│   │   └── __init__.py
│   └── __init__.py
├── test/
│   ├── test_model.py          # Model testing script
│   ├── evaluate.py            # Prediction evaluation
│   ├── quick_test.py          # Sanity tests
│   ├── test_dataset.py        # Dataset tests
│   └── test_split.py          # Test train/val split
├── data/
│   ├── train/                 # Training data
│   ├── public_test/           # Test data
│   └── val_tracks.json        # Validation split (70/30)
├── sampling_data/             # Sample data (200 tracks, 1000 images)
│   ├── train/
│   └── README.md
├── results/                   # Output directory
│   ├── *_best.pth             # Model checkpoints
│   └── submission_*.txt       # Submission files
├── experiments/               # Experiment logs
├── train.py                   # Main training script
├── run_ablation.py            # Best config training
├── create_sample_data.py      # Create sampling data
├── test_split.py              # Test data split
├── test_training_quick.py     # Quick training test
├── verify_sampling.py         # Verify sampling data
├── pyproject.toml             # Dependencies
├── README.md                  # This file
├── CLAUDE.md                  # Claude Code instructions
├── CHANGES_SUMMARY.md         # Recent changes summary
├── explain_model.md           # Architecture documentation
├── suggest_improve_model.md   # Improvement suggestions
├── tutorial.md                # Function reference
└── TEST_RESULTS.md            # Test results

🔧 Troubleshooting

Common Issues

1. CUDA Out of Memory

Error: RuntimeError: CUDA out of memory

Solution:

# Reduce batch size
python train.py --batch-size 32  # or 16

# Reduce model size (if needed)
# Edit configs/config.py:
TRANSFORMER_LAYERS = 3  # instead of 6
TRANSFORMER_HEADS = 8   # instead of 12

2. Unicode Encoding Error (Windows)

Error: UnicodeEncodeError: 'charmap' codec can't encode character

Solution:

# Use UTF-8 mode
python -X utf8 train.py

# Or set environment variable
set PYTHONIOENCODING=utf-8
python train.py

3. Validation Dataset Empty

Error: Validation shows 0 samples

Solution:

# Delete old split file
rm -f data/val_tracks.json

# Retrain (will create new 70/30 split)
python train.py

4. Slower Training with Phase 1

Issue: Training is slower with larger model

Solution:

# Option 1: Use sampling data for faster iteration
python train.py --data-root sampling_data/train --epochs 10

# Option 2: Reduce model size
# Edit configs/config.py:
TRANSFORMER_LAYERS = 4  # Compromise between 3 and 6
TRANSFORMER_HEADS = 10  # Compromise between 8 and 12 (must evenly divide the transformer embedding dimension)

# Option 3: Increase batch size (if GPU allows)
python train.py --batch-size 128

📊 Performance

Validation Results

| Configuration | Val Accuracy | Parameters | Training Time* | Inference** |
|---|---|---|---|---|
| CRNN (no STN) | 74.45% | ~25M | ~1.5h | ~50 FPS |
| CRNN + STN | 75.65% | ~26M | ~1.7h | ~48 FPS |
| ResTran (no STN) | 75.80% | ~30M | ~2.0h | ~45 FPS |
| ResTran + STN (Baseline) | 77.90% | 31M | ~2.2h | ~50 FPS |
| Phase 1 Enhanced | 80-82% (target) | 45M | ~3.0h | ~40 FPS |

*On NVIDIA GTX 1650, 30 epochs, batch size 64
**Single-sample inference on GTX 1650

Phase 1 Improvements Breakdown

| Improvement | Expected Gain | Effort | Status |
|---|---|---|---|
| Beam Search | +1-2% | 2-3 hours | ✅ Implemented |
| Label Smoothing | +0.5-1% | 1 hour | ✅ Implemented |
| Larger Transformer | +1-2% | 5 minutes | ✅ Implemented |
| Cosine Annealing | +0.5-1% | 30 minutes | ✅ Implemented |
| TTA (optional) | +0.5-1.5% | 2 hours | ✅ Available |
| Total Expected | +2-4% | ~1-2 weeks | ✅ Complete |

Memory Usage

  • Training: ~6-8GB VRAM (batch size 64, Phase 1 model)
  • Inference: ~3-4GB VRAM (batch size 32)
  • Model Size: ~180MB (FP32), ~90MB (FP16)
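
The model-size figures follow directly from the parameter count: about 45M parameters at 4 bytes each in FP32, or 2 bytes each in FP16. A minimal check (weights only, ignoring optimizer state and any checkpoint metadata):

```python
def weights_size_mb(num_params, bytes_per_param):
    """Size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6

print(weights_size_mb(45_000_000, 4))  # FP32, prints 180.0
print(weights_size_mb(45_000_000, 2))  # FP16, prints 90.0
```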


🎯 Tips & Best Practices

Training Tips

  1. Start with sampling data for quick iteration
  2. Monitor validation accuracy to detect overfitting
  3. Use Phase 1 improvements for best results (enabled by default)
  4. Save checkpoints frequently in case of interruption
  5. Delete val_tracks.json when changing split ratio
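
Tip 5 matters because a cached split no longer reflects a changed ratio. A deterministic track-level split can be sketched as below. `split_tracks` is a hypothetical helper, and the seed and the exact way the repository stores the split in val_tracks.json are assumptions not shown here.

```python
import random

def split_tracks(track_ids, ratio=0.7, seed=42):
    """Shuffle track IDs with a fixed seed and split them into train/val lists."""
    ids = sorted(track_ids)              # sort first so input order does not matter
    random.Random(seed).shuffle(ids)     # local RNG keeps the split reproducible
    cut = round(len(ids) * ratio)
    return ids[:cut], ids[cut:]

tracks = [f"track_{i:05d}" for i in range(200)]
train_ids, val_ids = split_tracks(tracks)
print(len(train_ids), len(val_ids))  # prints 140 60
```

Splitting by track rather than by image keeps all five frames of a plate on the same side of the split.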

Data Preparation Tips

  1. Ensure consistent naming: lr-001.png/jpg to lr-005.png/jpg
  2. Validate annotations: Check plate_text format
  3. Balance scenarios: Include both Scenario-A and Scenario-B
  4. Check image quality: Avoid corrupted or empty images
  5. Support both formats: PNG and JPG files work automatically

Inference Tips

  1. Use beam search for best accuracy (enabled by default)
  2. Enable TTA for critical predictions
  3. Batch inference for speed (batch size 32)
  4. Monitor confidence scores to filter low-quality predictions
  5. Ensemble predictions from multiple checkpoints for best results

Performance Optimization

  1. GPU Memory: Reduce batch size if OOM errors occur
  2. Training Speed: Use --num-workers 10 for faster data loading
  3. Model Size: Adjust TRANSFORMER_LAYERS (3-6) based on GPU capacity
  4. Inference Speed: Use smaller beam width (3-5) for faster decoding

🚀 Next Steps

Phase 2 Improvements (Coming Soon)

Target: +1-3% additional improvement

  1. ⏳ SE blocks in ResNet
  2. ⏳ Multi-head attention fusion
  3. ⏳ Multi-scale feature fusion
  4. ⏳ Language model integration

See suggest_improve_model.md for full roadmap.


🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • ICPR 2026 Challenge for providing the dataset and benchmark
  • PyTorch Team for the deep learning framework
  • Albumentations for the augmentation library
  • Research papers: STN, ResNet, Transformer, CTC
  • Phase 1 Improvements inspired by recent OCR research

📧 Contact

For questions or issues:


📈 Changelog

Version 2.0.0 (Phase 1 Enhanced) - 2026-02-09

🆕 New Features

  • ✅ Beam search CTC decoding (beam width 5)
  • ✅ Label smoothing loss (α=0.1)
  • ✅ Larger transformer (6 layers, 12 heads)
  • ✅ Cosine annealing with warm restarts
  • ✅ Test-time augmentation support
  • ✅ PNG + JPG image support
  • ✅ 70/30 train/val split (improved from 90/10)

🔧 Improvements

  • Better generalization with label smoothing
  • More stable training with cosine annealing
  • Better capacity with larger transformer
  • Improved validation dataset (now non-empty)

🐛 Bug Fixes

  • Fixed validation dataset empty issue
  • Fixed image loading (now supports both PNG and JPG)
  • Fixed train/val split to use all tracks

📊 Performance

  • Target accuracy: 80-82% (from 77.90%)
  • Expected gain: +2-4%
  • Model size: 45M parameters (from 31M)

Version 1.0.0 (Baseline) - 2026-02-08

  • ✅ Initial release
  • ✅ ResTranOCR model (ResNet34 + Transformer + STN)
  • ✅ Multi-frame processing (5 frames)
  • ✅ Attention-based fusion
  • ✅ CTC loss training
  • ✅ Mixed precision support
  • ✅ 77.90% validation accuracy

Built with ❤️ for the ICPR 2026 Challenge

Enhanced with Phase 1 Improvements 🚀
