Official PyTorch implementation of Spatial Temporal Reasoning Models (STRMs) for vision-based localization.
STRMs transform first-person perspective (FPP) observations into global map perspective (GMP) and precise geographical coordinates, achieving GPS-level precision for autonomous navigation.
If you use this code in your research, please cite our paper:

```bibtex
@article{lui2025strms,
  title={STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision},
  author={Lui, Hin Wai and Krichmar, Jeffrey L.},
  journal={arXiv preprint arXiv:2503.07939},
  year={2025}
}
```

Paper: [arXiv:2503.07939](https://arxiv.org/abs/2503.07939)
Download the following from Google Drive:
- Datasets (Jackal and Tesla)
- Pretrained model weights
- Satellite images (`jackal_satellite.png` and `tesla_satellite.png`, required for generating Global Map Perspective (GMP) images)
```bash
pip install gdown
gdown --folder FOLDER_URL
```

Note: Download each file individually to avoid rate limits when downloading the entire dataset at once.
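If the folder download hits rate limits, the files can also be fetched one at a time with gdown's Python API. This is only a sketch: the Drive file IDs below are placeholders for the IDs of the shared files.

```python
# Optional alternative to the folder download: fetch files individually
# with gdown's Python API to sidestep rate limits. The file IDs are
# placeholders -- replace them with the IDs of the shared files; the same
# pattern applies to the dataset archives and pretrained weights.
import gdown

files = {
    "jackal_satellite.png": "JACKAL_SAT_FILE_ID",   # placeholder ID
    "tesla_satellite.png": "TESLA_SAT_FILE_ID",     # placeholder ID
}

for output, file_id in files.items():
    gdown.download(f"https://drive.google.com/uc?id={file_id}", output, quiet=False)
```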
After downloading, your directory should look like:
```
strm/
├── data/
│   ├── jackal/
│   └── tesla/
├── trained_weights/
│   └── models/
├── jackal_satellite.png   # Required for GMP image generation
└── tesla_satellite.png    # Required for GMP image generation
```
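As an optional sanity check (not one of the repository's scripts), you can confirm the expected paths exist before running anything:

```python
# Optional sanity check: confirm the downloaded files and folders are
# where the preprocessing and training scripts expect them.
from pathlib import Path

expected = [
    "data/jackal",
    "data/tesla",
    "trained_weights/models",
    "jackal_satellite.png",
    "tesla_satellite.png",
]

for rel in expected:
    status = "ok" if Path(rel).exists() else "MISSING"
    print(f"{status:8s} {rel}")
```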
Clone the repository and set up the environment:

```bash
git clone https://github.com/UCI-CARL/strm.git
cd strm
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install git+https://gitlab.com/paloha/gpuMultiprocessing
```

Requirements:

- Python 3.8+
- PyTorch 2.0+
- CUDA-compatible GPU (recommended)
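A quick, generic way to confirm the installed PyTorch build and GPU visibility (not a repo script):

```python
# Quick environment check: PyTorch version and CUDA availability.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```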
Preprocess the raw datasets:

```bash
python jackal_pre_process.py --source data/jackal
python tesla_pre_process.py --source data/tesla
```

Dataset caches are automatically built during training when needed. You can also build them manually:

```bash
python dataloader.py --source data/jackal --seq_len 24 --seq_delta 10
python dataloader.py --source data/tesla --seq_len 24 --seq_delta 10
```

Note: A new cache is created whenever you use different `seq_len` or `seq_delta` values.
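To make the two cache parameters concrete: each training sample is a window of `seq_len` frames spaced `seq_delta` seconds apart, so the cache contents depend on both values. The snippet below is a generic illustration of that windowing idea, not the repository's `dataloader.py` code.

```python
# Generic illustration of the windowing idea behind seq_len / seq_delta
# (an assumption about intent, not the actual dataloader.py code): pick
# seq_len frame indices whose timestamps are ~seq_delta seconds apart.
import bisect

def sequence_windows(timestamps, seq_len=24, seq_delta=10.0):
    """Yield index windows of seq_len frames spaced ~seq_delta seconds apart."""
    for start in range(len(timestamps)):
        window = [start]
        for k in range(1, seq_len):
            idx = bisect.bisect_left(timestamps, timestamps[start] + k * seq_delta)
            if idx >= len(timestamps):
                break
            window.append(idx)
        if len(window) == seq_len:
            yield window

# Example: frames logged once per second for five minutes.
timestamps = [float(t) for t in range(300)]
print(next(sequence_windows(timestamps)))  # [0, 10, 20, ..., 230]
```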
Train a model on each dataset:

```bash
python vae_train.py --model VAERNN --source data/jackal --seq_len 24 --seq_delta 10 --tag jackal_rnn
python vae_train.py --model VAETransformer --source data/tesla --seq_len 24 --seq_delta 10 --tag tesla_transformer --lr=8e-6
```

Note: Transformer models require a lower learning rate (8e-6) for optimal performance.
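If you want to queue several runs from one script, a plain loop over the documented CLI flags is enough. The configurations below simply mirror the two commands above.

```python
# Launch the two training runs above back to back. This only wraps the
# documented vae_train.py CLI flags from this README.
import subprocess

runs = [
    ["--model", "VAERNN", "--source", "data/jackal", "--tag", "jackal_rnn"],
    ["--model", "VAETransformer", "--source", "data/tesla",
     "--tag", "tesla_transformer", "--lr", "8e-6"],  # transformers use a lower lr
]

for extra in runs:
    cmd = ["python", "vae_train.py", "--seq_len", "24", "--seq_delta", "10", *extra]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```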
Test your setup with a small subset of data:
```bash
python vae_train.py --debug --model VAETransformer --source data/jackal
```

Evaluate models without retraining:
```bash
python vae_train.py --model VAERNN --source data/jackal \
    --load_path trained_weights/models/jackal/VAERNN --skip_train
```

Generate predictions on test data:
```bash
python vae_inference.py --model VAETransformer --source data/tesla \
    --load_path trained_weights/models/tesla/VAETransformer
```

Run a hyperparameter search:

```bash
python vae_hp_search.py --hp_search latent_size
```

For faster experimentation:

```bash
python vae_hp_search.py --hp_search model --dataset_ratio 0.1
```
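The sketch below illustrates the general idea of training on a random fraction of a dataset, which is presumably what `--dataset_ratio` controls; this is an assumption about the flag's intent, shown with a synthetic dataset rather than the repo's dataloader.

```python
# Generic illustration of training on a fraction of a dataset (the idea
# assumed behind --dataset_ratio), using a synthetic TensorDataset.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

full = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 2))
ratio = 0.1

g = torch.Generator().manual_seed(0)                    # reproducible subset
indices = torch.randperm(len(full), generator=g)[: int(ratio * len(full))]
loader = DataLoader(Subset(full, indices.tolist()), batch_size=32, shuffle=True)

print(f"Using {len(indices)} of {len(full)} samples")
```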
Generate all figures from the paper:

```bash
python plot_paths.py --source data/jackal
python plot_paths.py --source data/tesla
python lpc.py
python inference_time_plot.py
python recon_ablation.py
python vae_visualize.py --load_path trained_weights/models/jackal/VAERNN
```

Train and compare VIGOR models against STRM with statistical robustness:
```bash
# Full comparison with 3 seeds (train + evaluate)
python vigor_comparison.py --dataset jackal --seeds 1 2 3 --epochs 30

# For Tesla dataset
python vigor_comparison.py --dataset tesla --seeds 1 2 3 --epochs 30
```

For detailed usage, data format adaptations, and troubleshooting, see VIGOR_COMPARISON.md.
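After the per-seed runs finish, the usual way to report results across seeds is a mean and standard deviation. A minimal sketch, assuming you have collected one error value per seed (the metric name and numbers below are made up; substitute the values produced by `vigor_comparison.py`):

```python
# Summarize a metric across seeds as mean +/- sample standard deviation.
# The per-seed values are placeholders, not real results.
import statistics

localization_error_m = {1: 1.92, 2: 2.05, 3: 1.87}   # hypothetical per-seed errors

values = list(localization_error_m.values())
mean = statistics.mean(values)
std = statistics.stdev(values)
print(f"Localization error: {mean:.2f} +/- {std:.2f} m over {len(values)} seeds")
```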
Compare computational efficiency (GPU memory, parameters, inference time) against VIGOR and TransGeo:
```bash
python benchmark_models.py --dataset jackal --batch_size 32
```

Results are saved to:

- CSV: `trained_weights/vigor_vs_strm/{dataset}/benchmark_results.csv`
- LaTeX table: `trained_weights/vigor_vs_strm/{dataset}/benchmark_table.tex`
| Option | Description | Default |
|---|---|---|
| `--dataset` | Dataset to use (`jackal` or `tesla`) | Required |
| `--batch_size` | Batch size for benchmarking | 32 |
| `--num_warmup` | Number of warmup iterations | 10 |
| `--num_iterations` | Number of benchmark iterations | 100 |
| `--seed` | Model seed index to load | 0 |
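The warmup and iteration counts follow the standard GPU timing pattern: run a few untimed passes so kernels and the caching allocator are initialized, then time many iterations and average. The snippet below is a generic sketch of that pattern with a stand-in model, not the code in `benchmark_models.py`.

```python
# Standard GPU timing pattern behind --num_warmup / --num_iterations:
# untimed warmup passes, then synchronized, averaged timed passes.
# The model here is a stand-in, not one of the repo's architectures.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device).eval()
x = torch.randn(32, 1024, device=device)

num_warmup, num_iterations = 10, 100

with torch.no_grad():
    for _ in range(num_warmup):          # warmup: kernel launches, allocator caching
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # make sure warmup work has finished

    start = time.perf_counter()
    for _ in range(num_iterations):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # wait for all timed work to finish
    elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / num_iterations:.3f} ms per forward pass")
```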
The main training script (`vae_train.py`) supports many configuration options:

| Option | Description | Default |
|---|---|---|
| `--model` | Model architecture (`VAERNN`, `VAETransformer`, `VAEMultiscaleTransformer`) | `VAERNN` |
| `--seq_len` | Sequence length for temporal processing | 24 |
| `--seq_delta` | Time between frames in seconds | 10 |
| `--latent_size` | Size of latent space dimension | 256 |
| `--img_size` | Input image size | 224 |
| `--batch_size` | Batch size for training | 32 |
| `--epochs` | Number of training epochs | 100 |
| `--lr` | Learning rate | 1e-4 |
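To make the options concrete, here is a toy sketch of the data flow they parameterize: a sequence of `seq_len` FPP frames is encoded into a `latent_size`-dimensional code, from which a GMP view and 2D coordinates are produced. This is an illustration only, not the repository's model code; the layer choices are placeholders, and the decoded GMP view is kept tiny to keep the sketch light.

```python
# Toy illustration of the data flow parameterized by seq_len, img_size and
# latent_size. NOT the repository's model code; layers are placeholders.
import torch
import torch.nn as nn

class ToySTRM(nn.Module):
    def __init__(self, latent_size=256, gmp_size=28):
        super().__init__()
        self.gmp_size = gmp_size
        # Per-frame encoder: one FPP image -> feature vector
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent_size),
        )
        # Temporal module (a GRU here; the repo also offers Transformer variants)
        self.temporal = nn.GRU(latent_size, latent_size, batch_first=True)
        # Heads: decode a (tiny) GMP view and regress 2D coordinates
        self.gmp_head = nn.Linear(latent_size, 3 * gmp_size * gmp_size)
        self.coord_head = nn.Linear(latent_size, 2)

    def forward(self, fpp_seq):                          # (B, seq_len, 3, H, W)
        b, t = fpp_seq.shape[:2]
        feats = self.frame_encoder(fpp_seq.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(feats)                      # summarize the sequence
        z = h[-1]                                        # (B, latent_size)
        gmp = self.gmp_head(z).view(b, 3, self.gmp_size, self.gmp_size)
        return gmp, self.coord_head(z)                   # GMP view, (x, y)

fpp = torch.randn(2, 24, 3, 224, 224)                    # a batch of FPP sequences
gmp, coords = ToySTRM()(fpp)
print(gmp.shape, coords.shape)                           # (2, 3, 28, 28) (2, 2)
```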
For a complete list of options:
```bash
python vae_train.py --help
```

This project is licensed under the MIT License - see the LICENSE file for details.
This work builds upon research in vision-based localization and spatial-temporal reasoning. We thank the authors of VIGOR and TransGeo for their open-source implementations.