State-of-the-art multi-frame OCR system for low-resolution license plate recognition
Enhanced with Phase 1 Improvements: Beam Search, Label Smoothing, Larger Transformer, Cosine Annealing
MultiFrame-LPR is a deep learning system designed for the ICPR 2026 Challenge on Low-Resolution License Plate Recognition. It processes 5 consecutive frames per track to achieve robust character recognition even in challenging conditions.
| Version | Validation Accuracy | Key Features |
|---|---|---|
| Baseline | 77.90% | ResNet34 + Transformer + STN |
| Phase 1 Enhanced | 80-82% (target) | + Beam Search + Label Smoothing + Larger Model |
- ✅ 77.90% baseline accuracy with ResTranOCR
- ✅ Multi-frame fusion with learned attention
- ✅ Spatial alignment via STN (Spatial Transformer Network)
- ✅ Transformer-based sequence modeling
- ✅ End-to-end trainable with CTC loss
- New in Phase 1: beam search decoding, label smoothing, larger transformer (6 layers, 12 heads), cosine annealing scheduler
- What's New in Phase 1
- Features
- Model Architecture
- Requirements
- Installation
- Dataset Preparation
- Quick Start
- Training
- Evaluation & Testing
- Phase 1 Improvements Explained
- Configuration
- Project Structure
- Troubleshooting
- Performance
- Documentation
- What: Beam search decoding maintains the top-K hypotheses instead of decoding greedily
- Benefit: Better global sequence selection (expected +1-2% accuracy)
- Config: USE_BEAM_SEARCH = True, BEAM_WIDTH = 5
- What: Label smoothing prevents overconfidence on training data
- Benefit: Better generalization and calibrated confidence (expected +0.5-1% accuracy)
- Config: USE_LABEL_SMOOTHING = True, LABEL_SMOOTHING = 0.1
- What: Larger transformer with 6 layers and 12 heads (up from 3 layers, 8 heads)
- Benefit: More capacity for complex patterns (expected +1-2% accuracy)
- Config: TRANSFORMER_LAYERS = 6, TRANSFORMER_HEADS = 12
- What: Cosine annealing with periodic learning rate restarts
- Benefit: Helps escape local minima and improves convergence (expected +0.5-1% accuracy)
- Config: USE_COSINE_ANNEALING = True, T_0 = 10, T_MULT = 2
- What: Test-time augmentation (TTA) averages predictions over augmented versions of the input
- Benefit: More robust predictions (expected +0.5-1.5% accuracy)
- Usage: predict_with_tta() function available in src/utils/postprocess.py
- Multi-Frame Processing: Leverages temporal information from 5 frames
- Spatial Alignment: STN automatically corrects rotation, scale, and perspective
- Attention Fusion: Learns to weight frames by quality
- Transformer Encoder: Captures character dependencies
- CTC Decoding: No character-level alignment needed
- Beam Search (new in Phase 1): Better sequence decoding
- Label Smoothing (new in Phase 1): Improved generalization
- Mixed precision training (AMP)
- Data augmentation pipeline
- Synthetic LR generation from HR images
- 70/30 train/validation split (improved from 90/10)
- Comprehensive logging and checkpointing
- Cosine annealing scheduler with warm restarts (new in Phase 1)
- Larger transformer (new in Phase 1; total model size ~45M parameters)
5 Frames → STN → ResNet-34 → Attention Fusion → Transformer (6 layers, 12 heads) → CTC Head → Beam Search → Text
- STN (Spatial Transformer Network): Geometric alignment
- ResNet-34: Visual feature extraction (modified strides for OCR)
- AttentionFusion: Multi-frame fusion with learned weights
- Transformer: 6-layer encoder with 12 attention heads (enhanced)
- CTC Head: Character classification with blank token
- Beam Search Decoder (new in Phase 1): Improved sequence decoding
Total Parameters: ~45M (increased from 31M for better capacity)
Detailed Architecture: See explain_model.md
Improvement Guide: See suggest_improve_model.md
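As a rough illustration of the attention fusion step in the component list, the sketch below weights five per-frame feature vectors by a softmax over per-frame quality scores. The scores are hand-picked here; in the model they would come from a learned scoring layer. The name `fuse_frames` and the toy feature values are hypothetical and are not taken from `src/models/components.py`.

```python
import math

def fuse_frames(features, scores):
    """Fuse per-frame feature vectors with softmax(scores) as weights."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

# Five frames with 2-dimensional toy features; frame 2 is given a low quality score.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
scores = [2.0, 1.0, -3.0, 1.5, 2.0]
fused, weights = fuse_frames(features, scores)
```

A frame with a low score contributes almost nothing to the fused vector, which is the behaviour the "weight frames by quality" description suggests.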
- OS: Windows, Linux, or macOS
- GPU: NVIDIA GPU with CUDA support (recommended)
- Minimum: 6GB VRAM (increased for larger model)
- Recommended: 8GB+ VRAM (RTX 3060 or better)
- RAM: 16GB+ recommended
- Storage: 10GB+ free space
- Python: 3.11.x (strictly required)
- CUDA: 11.8 or 12.x (for GPU acceleration)
- Package Manager: uv (recommended) or pip
Windows (PowerShell):
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Linux/macOS:
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/yourusername/MultiFrame-LPR.git
cd MultiFrame-LPR
# Create virtual environment and install dependencies
uv sync
# Activate virtual environment
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate
git clone https://github.com/yourusername/MultiFrame-LPR.git
cd MultiFrame-LPR
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
CUDA 12.8 (NVIDIA GPU):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
CUDA 11.8 (older NVIDIA GPU):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
CPU Only (no GPU):
pip install torch torchvision
pip install albumentations opencv-python tqdm numpy matplotlib pandas seaborn pillow
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}')"
Expected output:
PyTorch: 2.10.0+cu128
CUDA Available: True
Run quick tests:
# Run model sanity tests
python -X utf8 test/quick_test.py
# Run dataset tests
python -X utf8 test/test_dataset.py
Organize your data following this structure:
data/
├── train/
│   ├── Scenario-A/
│   │   ├── Brazilian/
│   │   │   └── track_00001/
│   │   │       ├── lr-001.png (or .jpg)
│   │   │       ├── lr-002.png
│   │   │       ├── lr-003.png
│   │   │       ├── lr-004.png
│   │   │       ├── lr-005.png
│   │   │       ├── hr-001.png (optional, for synthetic LR)
│   │   │       ├── hr-002.png
│   │   │       ├── hr-003.png
│   │   │       ├── hr-004.png
│   │   │       ├── hr-005.png
│   │   │       └── annotations.json
│   │   └── Mercosur/
│   │       ├── track_00002/
│   │       └── ...
│   └── Scenario-B/
│       ├── Brazilian/
│       └── Mercosur/
└── public_test/ (optional, for testing)
    └── track_xxxxx/
        ├── lr-001.png (or .jpg)
        ├── lr-002.png
        ├── lr-003.png
        ├── lr-004.png
        └── lr-005.png
Each annotations.json file should contain:
{
  "plate_text": "ABC1234",
  "plate_layout": "Brazilian",
  "corners": {}
}
Supported characters: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ (36 characters)
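Before training, it can help to check each annotation against the format and character set described here. The following is a minimal sketch, not part of the repository: the function name `check_annotation` is hypothetical, and treating `plate_layout` as mandatory is an assumption based on the example file.

```python
import json

SUPPORTED_CHARS = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def check_annotation(raw_json):
    """Return a list of problems found in one annotations.json payload."""
    problems = []
    data = json.loads(raw_json)
    text = data.get("plate_text", "")
    if not text:
        problems.append("plate_text is missing or empty")
    # Anything outside the 36 supported characters cannot be predicted by the model.
    bad = sorted(set(text) - SUPPORTED_CHARS)
    if bad:
        problems.append("unsupported characters: " + "".join(bad))
    if "plate_layout" not in data:  # assumed to be required
        problems.append("plate_layout is missing")
    return problems

good = '{"plate_text": "ABC1234", "plate_layout": "Brazilian", "corners": {}}'
bad = '{"plate_text": "abc-1234"}'
print(check_annotation(good))
print(check_annotation(bad))
```

Lower-case letters and separators such as `-` are flagged because they fall outside the supported set; normalizing them (for example with `str.upper()`) would be a separate preprocessing decision.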
For quick experiments, use the sampling_data (200 tracks, 1000 images):
# Generate sampling data (if not already done)
python -X utf8 create_sample_data.py
# Training will be ~10x faster on sampling_data
# Test on small dataset (fast, ~10 minutes)
python train.py --data-root sampling_data/train --epochs 10 -n quick_test
# Delete old validation split to use new 70/30 split
rm -f data/val_tracks.json
# Train with Phase 1 improvements
python train.py
This will:
- Use 70/30 train/val split (improved from 90/10)
- Train ResTranOCR with enhanced Phase 1 features
- Use larger transformer (6 layers, 12 heads)
- Apply label smoothing and beam search
- Save best model to results/restran_best.pth
- Generate submission file
python train.py \
--epochs 50 \
--batch-size 32 \
--lr 0.001 \
--transformer-layers 6 \
--transformer-heads 12 \
-n my_experiment
python train.py --submission-mode -n final_submission
This will:
- Train on entire dataset (no validation split)
- Generate predictions for test data
- Save to results/submission_final_submission_final.txt
python train.py
Output:
- Checkpoints: results/restran_best.pth
- Submission: results/submission_restran.txt
- Logs: Console output
python train.py \
--epochs 50 \
--batch-size 64 \
--lr 5e-4 \
--transformer-heads 12 \
--transformer-layers 6 \
--num-workers 10 \
--seed 42
Full augmentation (default, recommended):
python train.py --aug-level full
Light augmentation (faster, less aggressive):
python train.py --aug-level light
python train.py --output-dir experiments/exp_001
Default hyperparameters in configs/config.py:
# Model Architecture (Phase 1 Enhanced)
TRANSFORMER_HEADS = 12 # Increased from 8
TRANSFORMER_LAYERS = 6 # Increased from 3
TRANSFORMER_FF_DIM = 2048
TRANSFORMER_DROPOUT = 0.1
# Training Hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 5e-4
EPOCHS = 30
GRAD_CLIP = 5.0
SPLIT_RATIO = 0.7 # 70/30 split (improved from 90/10)
# Phase 1 Improvements
USE_LABEL_SMOOTHING = True # Enable label smoothing
LABEL_SMOOTHING = 0.1 # Smoothing factor
USE_BEAM_SEARCH = True # Enable beam search
BEAM_WIDTH = 5 # Beam width
USE_COSINE_ANNEALING = True # Cosine annealing scheduler
T_0 = 10 # Length of the first cycle in epochs; later cycles are scaled by T_MULT
CLI arguments override config values.
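One common way to let CLI arguments override config values is to leave every argparse default as `None` and only overwrite the config entry when a flag was actually passed. The sketch below assumes that pattern; it is not taken from `train.py`, and the names `DEFAULTS` and `resolve_settings` are hypothetical. The flag names and default values come from this README.

```python
import argparse

# Defaults copied from the config listing in this README.
DEFAULTS = {"batch_size": 64, "lr": 5e-4, "epochs": 30}

def resolve_settings(argv):
    """Merge CLI flags over config defaults; unset flags keep the config value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--lr", type=float, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    args = parser.parse_args(argv)
    settings = dict(DEFAULTS)
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings

print(resolve_settings(["--batch-size", "32", "--lr", "0.001"]))
```

Using `None` as the sentinel keeps a single source of truth for defaults, so changing `configs/config.py` does not require touching the parser.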
Training progress shows:
- Epoch number and progress bar
- Training loss
- Validation loss and accuracy
- Learning rate
- Best model checkpoints
Example output:
Epoch 1/30: 100%|██████████| 313/313 [02:15<00:00, 2.31it/s]
Train Loss: 2.3456 | Val Loss: 1.9876 | Val Acc: 48.23% | LR: 5.00e-04
Saved Best Model: results/restran_best.pth (48.23%)
Epoch 2/30: 100%|██████████| 313/313 [02:14<00:00, 2.33it/s]
Train Loss: 1.8765 | Val Loss: 1.6543 | Val Acc: 55.10% | LR: 4.50e-04
Saved Best Model: results/restran_best.pth (55.10%)
python test/test_model.py \
--checkpoint results/restran_best.pth \
--data-root data/public_test \
--output-file predictions.txt \
--batch-size 32
With visualizations:
python test/test_model.py \
--checkpoint results/restran_best.pth \
--data-root data/public_test \
--output-file predictions.txt \
--visualize
This generates:
- Confidence distribution histogram
- Confidence statistics
- Prediction length distribution
Compare predictions against ground truth:
python test/evaluate.py \
--predictions predictions.txt \
--ground-truth data/train \
--output-errors errors.csv \
--verbose
Output metrics:
- Exact match accuracy
- Character-level accuracy
- Average edit distance
- Confidence scores (correct vs wrong)
- Error analysis (top mistakes)
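The first three metrics can be sketched as follows. This is not the code in `test/evaluate.py`; in particular, defining character-level accuracy as one minus the total edit distance divided by the total number of ground-truth characters is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def score(predictions, targets):
    """Exact match, character accuracy (assumed definition), and mean edit distance."""
    pairs = list(zip(predictions, targets))
    distances = [edit_distance(p, t) for p, t in pairs]
    total_chars = sum(len(t) for _, t in pairs)
    return {
        "exact_match": sum(p == t for p, t in pairs) / len(pairs),
        "char_accuracy": 1 - sum(distances) / total_chars,
        "avg_edit_distance": sum(distances) / len(pairs),
    }

metrics = score(["ABC1234", "AB01234"], ["ABC1234", "ABC1234"])
print(metrics)
```

Exact match is the stricter number: a single confused character (here `C` read as `0`) costs a full plate in exact match but only one character in the character-level score.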
# Test model initialization and forward pass
python -X utf8 test/quick_test.py
# Test dataset loading
python -X utf8 test/test_dataset.py
# Test train/val split
python -X utf8 test_split.py
Before (Greedy):
Position: 0    1    2    3    4    5
Char:     A -> A -> B -> 1 -> 2 -> 3
          ↓    ↓    ↓    ↓    ↓    ↓
Output:   A         B    1    2    3
Makes locally optimal choice at each position.
After (Beam Search, width=5):
Maintains top-5 sequences:
1. "AB123" (score: -2.1) ← Best
2. "AB1Z3" (score: -2.4)
3. "A8123" (score: -2.7)
4. "AB1Z8" (score: -3.1)
5. "AB12S" (score: -3.3)
Considers global sequence probability.
Implementation: Automatically enabled in validation/inference. Set USE_BEAM_SEARCH = True in config.
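To make the difference concrete, here is a minimal CTC prefix beam search next to greedy decoding, written in plain Python over per-frame probabilities. It is a sketch of the general technique under simple assumptions (probability domain, no language model, blank at index 0) and is not the implementation in `src/utils/postprocess.py`; the function names are hypothetical.

```python
from collections import defaultdict

def ctc_greedy_decode(probs, blank=0):
    """Pick the best class per frame, then collapse repeats and drop blanks."""
    decoded, prev = [], blank
    for frame in probs:
        k = max(range(len(frame)), key=frame.__getitem__)
        if k != blank and k != prev:
            decoded.append(k)
        prev = k
    return tuple(decoded)

def ctc_beam_search(probs, beam_width=5, blank=0):
    """Prefix beam search: sums the probability of all alignments per prefix."""
    # prefix -> (prob of alignments ending in blank, prob ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            total = p_b + p_nb
            for k, p in enumerate(frame):
                if k == blank:
                    nxt[prefix][0] += total * p
                elif prefix and prefix[-1] == k:
                    nxt[prefix][1] += p_nb * p          # repeat collapses into same prefix
                    nxt[prefix + (k,)][1] += p_b * p    # a blank in between starts a new char
                else:
                    nxt[prefix + (k,)][1] += total * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {pre: tuple(pr) for pre, pr in ranked[:beam_width]}
    best_prefix, best_probs = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best_prefix, sum(best_probs)

# Two frames, classes [blank, 'A']: blank is the most likely class in each frame.
probs = [[0.6, 0.4], [0.6, 0.4]]
greedy = ctc_greedy_decode(probs)
best, best_prob = ctc_beam_search(probs, beam_width=5)
```

In this toy case greedy decoding returns the empty sequence (path probability 0.36), while beam search returns `A`, because the three alignments that collapse to `A` sum to 0.64.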
Before (Hard Labels):
Target: "A"
Probability: [0.0, 1.0, 0.0, 0.0, ...] # One-hot
[blank, A, B, C, ...]
Model becomes overconfident.
After (Smoothed Labels, α=0.1):
Target: "A"
Probability: [0.003, 0.903, 0.003, 0.003, ...] # 0.9 + 0.1/37 for the target, 0.1/37 elsewhere (36 characters + blank)
[blank, A, B, C, ...]
Encourages less extreme predictions, better generalization.
Implementation: Automatically used during training. Set USE_LABEL_SMOOTHING = True in config.
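The smoothed distribution can be checked with a few lines of arithmetic. The sketch assumes the common convention of spreading α uniformly over all classes (37 here: 36 characters plus the CTC blank); how the trainer applies smoothing to the CTC loss is not described in this README, and `smooth_one_hot` is a hypothetical name.

```python
def smooth_one_hot(target_index, num_classes, alpha=0.1):
    """Mix a one-hot target with a uniform distribution: (1 - alpha) * onehot + alpha / K."""
    dist = [alpha / num_classes] * num_classes
    dist[target_index] += 1.0 - alpha
    return dist

# Class order assumed as [blank, A, B, C, ...]; index 1 is "A".
dist = smooth_one_hot(target_index=1, num_classes=37, alpha=0.1)
print(round(dist[1], 3), round(dist[0], 3), round(sum(dist), 6))
```

With these settings the target class keeps about 0.903 of the probability mass and every other class receives about 0.003, so the distribution still sums to 1.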
Before:
- 3 layers, 8 heads
- 31M parameters
- Limited capacity
After:
- 6 layers, 12 heads
- 45M parameters
- Better pattern learning
Benefits:
- Deeper context understanding
- Better long-range dependencies
- More expressive power
Trade-off: Training is about 1.3x slower, in exchange for an expected accuracy gain of +1-2%.
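One practical constraint when changing the head count: standard multi-head attention splits the embedding into equal slices per head, so the number of heads must divide the embedding width evenly (PyTorch's `nn.MultiheadAttention` enforces this). The embedding width used by ResTranOCR is not stated in this README, so the widths below are purely illustrative, and `valid_head_counts` is a hypothetical helper.

```python
def valid_head_counts(d_model, candidates=(8, 10, 12)):
    """Return the candidate head counts that split d_model into equal-sized heads."""
    return [h for h in candidates if d_model % h == 0]

# Illustrative widths only; check the value your model actually uses.
for width in (480, 512, 768):
    print(width, valid_head_counts(width))
```

Running a check like this before editing `TRANSFORMER_HEADS` avoids a shape error at model construction time.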
Before (OneCycleLR):
LR
│    ╱╲
│   ╱  ╲
│  ╱    ╲___
└──────────── Epochs
Single cycle, may get stuck.
After (Cosine Annealing):
LR
│ ╱╲    ╱╲      ╱╲
│╱  ╲  ╱  ╲    ╱  ╲
│    ╲╱    ╲  ╱    ╲
└────────────────────── Epochs
  T_0  T_0*2   T_0*4
Periodic restarts help escape local minima.
Implementation: Set USE_COSINE_ANNEALING = True in config.
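The schedule can be reproduced in a few lines using the standard cosine-annealing-with-warm-restarts formula, `eta_min + (base_lr - eta_min) * (1 + cos(pi * t_cur / t_i)) / 2`. The sketch assumes the scheduler is stepped once per epoch and uses the config values from this README; the function names are hypothetical.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=5e-4, eta_min=1e-6, t_0=10, t_mult=2):
    """Learning rate at a given epoch for cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:      # walk forward through completed cycles
        t_cur -= t_i
        t_i *= t_mult        # each cycle is t_mult times longer than the last
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i)) / 2

def restart_epochs(total_epochs, t_0=10, t_mult=2):
    """Epochs at which a new cycle starts (the LR jumps back to base_lr)."""
    starts, start, length = [], 0, t_0
    while start < total_epochs:
        starts.append(start)
        start += length
        length *= t_mult
    return starts

print(restart_epochs(30))
print(restart_epochs(71))
```

With T_0 = 10 and T_MULT = 2 the cycles last 10, 20, and 40 epochs, so a default 30-epoch run contains exactly two full cycles and restarts only once, at epoch 10.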
Usage:
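from src.utils.postprocess import predict_with_tta
# Load model (load_model is assumed to be your own checkpoint-loading helper; it is not imported here)
model = load_model('results/restran_best.pth')
# Predict with TTA (5 augmented versions)
# images and idx2char are assumed to be prepared beforehand
results = predict_with_tta(
    model,
    images,
    idx2char,
    num_augments=5,
    use_beam_search=True,
    beam_width=5
)
Benefits: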
- More robust predictions
- +0.5-1.5% accuracy
- Slight increase in inference time
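Conceptually, TTA runs the model on several augmented copies of the same input and combines the outputs before decoding. The sketch below averages per-timestep class probabilities across views; this is one plausible reading of "average predictions" and is an assumption, since the internals of `predict_with_tta` are not described in this README. The name `average_views` is hypothetical.

```python
def average_views(view_probs):
    """Average per-timestep class probabilities over augmented views.

    view_probs[v][t][c] is the probability of class c at timestep t for view v.
    """
    n_views = len(view_probs)
    n_steps = len(view_probs[0])
    n_classes = len(view_probs[0][0])
    return [
        [sum(view[t][c] for view in view_probs) / n_views for c in range(n_classes)]
        for t in range(n_steps)
    ]

# Two views, two timesteps, three classes; the views disagree at timestep 1.
views = [
    [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]],
    [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]],
]
averaged = average_views(views)
```

The averaged distributions can then be passed to greedy or beam search decoding as if they came from a single forward pass.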
Key parameters:
# Model Architecture (Phase 1 Enhanced)
MODEL_TYPE = "restran"
USE_STN = True
TRANSFORMER_HEADS = 12 # Phase 1: Increased
TRANSFORMER_LAYERS = 6 # Phase 1: Increased
TRANSFORMER_FF_DIM = 2048
TRANSFORMER_DROPOUT = 0.1
# Data
DATA_ROOT = "data/train"
IMG_HEIGHT = 32
IMG_WIDTH = 128
CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Training
BATCH_SIZE = 64
LEARNING_RATE = 5e-4
EPOCHS = 30
GRAD_CLIP = 5.0
SPLIT_RATIO = 0.7 # Phase 1: 70/30 split
# Phase 1 Improvements
USE_LABEL_SMOOTHING = True # Label smoothing
LABEL_SMOOTHING = 0.1
USE_BEAM_SEARCH = True # Beam search decoding
BEAM_WIDTH = 5
USE_COSINE_ANNEALING = True # Cosine annealing scheduler
T_0 = 10
T_MULT = 2
ETA_MIN = 1e-6
Option 1: Edit configs/config.py (permanent changes)
Option 2: Use CLI arguments (temporary overrides)
python train.py --batch-size 32 --epochs 50 --lr 0.001 --transformer-layers 6 --transformer-heads 12
Option 3: Disable Phase 1 features (if needed)
# In configs/config.py
USE_LABEL_SMOOTHING = False # Disable label smoothing
USE_BEAM_SEARCH = False # Use greedy decoding
USE_COSINE_ANNEALING = False # Use OneCycleLR
TRANSFORMER_LAYERS = 3 # Use smaller model
TRANSFORMER_HEADS = 8
MultiFrame-LPR/
├── configs/
│   ├── config.py                 # Configuration with Phase 1 settings
│   └── __init__.py
├── src/
│   ├── data/
│   │   ├── dataset.py            # MultiFrameDataset (PNG+JPG support, 70/30 split)
│   │   ├── transforms.py         # Augmentation pipelines
│   │   └── __init__.py
│   ├── models/
│   │   ├── components.py         # STN, Fusion, ResNet, PositionalEncoding
│   │   ├── restran.py            # ResTranOCR model (6 layers, 12 heads)
│   │   └── __init__.py
│   ├── training/
│   │   ├── trainer.py            # Trainer (Phase 1: Label smoothing, beam search)
│   │   └── __init__.py
│   ├── utils/
│   │   ├── common.py             # seed_everything
│   │   ├── postprocess.py        # Phase 1: Beam search, TTA
│   │   └── __init__.py
│   └── __init__.py
├── test/
│   ├── test_model.py             # Model testing script
│   ├── evaluate.py               # Prediction evaluation
│   ├── quick_test.py             # Sanity tests
│   ├── test_dataset.py           # Dataset tests
│   └── test_split.py             # Test train/val split
├── data/
│   ├── train/                    # Training data
│   ├── public_test/              # Test data
│   └── val_tracks.json           # Validation split (70/30)
├── sampling_data/                # Sample data (200 tracks, 1000 images)
│   ├── train/
│   └── README.md
├── results/                      # Output directory
│   ├── *_best.pth                # Model checkpoints
│   └── submission_*.txt          # Submission files
├── experiments/                  # Experiment logs
├── train.py                      # Main training script
├── run_ablation.py               # Best config training
├── create_sample_data.py         # Create sampling data
├── test_split.py                 # Test data split
├── test_training_quick.py        # Quick training test
├── verify_sampling.py            # Verify sampling data
├── pyproject.toml                # Dependencies
├── README.md                     # This file
├── CLAUDE.md                     # Claude Code instructions
├── CHANGES_SUMMARY.md            # Recent changes summary
├── explain_model.md              # Architecture documentation
├── suggest_improve_model.md      # Improvement suggestions
├── tutorial.md                   # Function reference
└── TEST_RESULTS.md               # Test results
Error: RuntimeError: CUDA out of memory
Solution:
# Reduce batch size
python train.py --batch-size 32 # or 16
# Reduce model size (if needed)
# Edit configs/config.py:
TRANSFORMER_LAYERS = 3 # instead of 6
TRANSFORMER_HEADS = 8 # instead of 12
Error: UnicodeEncodeError: 'charmap' codec can't encode character
Solution:
# Use UTF-8 mode
python -X utf8 train.py
# Or set environment variable
set PYTHONIOENCODING=utf-8
python train.py
Error: Validation shows 0 samples
Solution:
# Delete old split file
rm -f data/val_tracks.json
# Retrain (will create new 70/30 split)
python train.py
Issue: Training is slower with larger model
Solution:
# Option 1: Use sampling data for faster iteration
python train.py --data-root sampling_data/train --epochs 10
# Option 2: Reduce model size
# Edit configs/config.py:
TRANSFORMER_LAYERS = 4 # Compromise between 3 and 6
TRANSFORMER_HEADS = 10 # Compromise between 8 and 12; with standard multi-head attention the head count must divide the embedding dimension
# Option 3: Increase batch size (if GPU allows)
python train.py --batch-size 128
| Configuration | Val Accuracy | Parameters | Training Time* | Inference** |
|---|---|---|---|---|
| CRNN (no STN) | 74.45% | ~25M | ~1.5h | ~50 FPS |
| CRNN + STN | 75.65% | ~26M | ~1.7h | ~48 FPS |
| ResTran (no STN) | 75.80% | ~30M | ~2.0h | ~45 FPS |
| ResTran + STN (Baseline) | 77.90% | 31M | ~2.2h | ~50 FPS |
| Phase 1 Enhanced | 80-82% (target) | 45M | ~3.0h | ~40 FPS |
*On NVIDIA GTX 1650, 30 epochs, batch size 64
**Single sample inference on GTX 1650
| Improvement | Expected Gain | Effort | Status |
|---|---|---|---|
| Beam Search | +1-2% | 2-3 hours | ✅ Implemented |
| Label Smoothing | +0.5-1% | 1 hour | ✅ Implemented |
| Larger Transformer | +1-2% | 5 minutes | ✅ Implemented |
| Cosine Annealing | +0.5-1% | 30 minutes | ✅ Implemented |
| TTA (optional) | +0.5-1.5% | 2 hours | ✅ Available |
| Total Expected | +2-4% | ~1-2 weeks | ✅ Complete |
- Training: ~6-8GB VRAM (batch size 64, Phase 1 model)
- Inference: ~3-4GB VRAM (batch size 32)
- Model Size: ~180MB (FP32), ~90MB (FP16)
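The model-size figures follow directly from the parameter count: each parameter takes 4 bytes in FP32 and 2 bytes in FP16. A quick check, assuming the ~45M parameter figure from this README and counting weights only (no optimizer state or checkpoint metadata); `weights_size_mb` is a hypothetical helper.

```python
def weights_size_mb(num_params, bytes_per_param):
    """Approximate size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6

fp32_mb = weights_size_mb(45_000_000, 4)
fp16_mb = weights_size_mb(45_000_000, 2)
print(fp32_mb, fp16_mb)
```

Training memory is much larger than these numbers because activations, gradients, and optimizer state are stored as well.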
- Architecture Guide: explain_model.md - Comprehensive model explanation
- Improvement Guide: suggest_improve_model.md - Phase 1-4 improvement roadmap
- Function Reference: tutorial.md - Complete API documentation
- Test Results: TEST_RESULTS.md - Testing output and results
- Recent Changes: CHANGES_SUMMARY.md - Validation fix and improvements
- Claude Instructions: CLAUDE.md - Instructions for Claude Code
- Start with sampling data for quick iteration
- Monitor validation accuracy to detect overfitting
- Use Phase 1 improvements for best results (enabled by default)
- Save checkpoints frequently in case of interruption
- Delete val_tracks.json when changing split ratio
- Ensure consistent naming: lr-001.png/jpg to lr-005.png/jpg
- Validate annotations: Check plate_text format
- Balance scenarios: Include both Scenario-A and Scenario-B
- Check image quality: Avoid corrupted or empty images
- Support both formats: PNG and JPG files work automatically
- Use beam search for best accuracy (enabled by default)
- Enable TTA for critical predictions
- Batch inference for speed (batch size 32)
- Monitor confidence scores to filter low-quality predictions
- Ensemble predictions from multiple checkpoints for best results
- GPU Memory: Reduce batch size if OOM errors occur
- Training Speed: Use --num-workers 10 for faster data loading
- Model Size: Adjust TRANSFORMER_LAYERS (3-6) based on GPU capacity
- Inference Speed: Use a smaller beam width (for example 3 instead of the default 5) for faster decoding
Target: +1-3% additional improvement
- ⏳ SE blocks in ResNet
- ⏳ Multi-head attention fusion
- ⏳ Multi-scale feature fusion
- ⏳ Language model integration
See suggest_improve_model.md for full roadmap.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- ICPR 2026 Challenge for providing the dataset and benchmark
- PyTorch Team for the deep learning framework
- Albumentations for the augmentation library
- Research papers: STN, ResNet, Transformer, CTC
- Phase 1 Improvements inspired by recent OCR research
For questions or issues:
- GitHub Issues: Create an issue
- Email: your.email@example.com
- ✅ Beam search CTC decoding (beam width 5)
- ✅ Label smoothing loss (α=0.1)
- ✅ Larger transformer (6 layers, 12 heads)
- ✅ Cosine annealing with warm restarts
- ✅ Test-time augmentation support
- ✅ PNG + JPG image support
- ✅ 70/30 train/val split (improved from 90/10)
- Better generalization with label smoothing
- More stable training with cosine annealing
- Better capacity with larger transformer
- Improved validation dataset (now non-empty)
- Fixed validation dataset empty issue
- Fixed image loading (now supports both PNG and JPG)
- Fixed train/val split to use all tracks
- Target accuracy: 80-82% (from 77.90%)
- Expected gain: +2-4%
- Model size: 45M parameters (from 31M)
- ✅ Initial release
- ✅ ResTranOCR model (ResNet34 + Transformer + STN)
- ✅ Multi-frame processing (5 frames)
- ✅ Attention-based fusion
- ✅ CTC loss training
- ✅ Mixed precision support
- ✅ 77.90% validation accuracy
Built with ❤️ for the ICPR 2026 Challenge
Enhanced with Phase 1 Improvements