A box-referring VQA dataset and benchmark for evaluating vision-language models on spatial and temporal reasoning in autonomous driving scenarios.
Box-QAymo addresses a critical gap in autonomous driving AI: the ability to understand and respond to user queries about specific objects in complex driving scenes. Rather than relying on full-scene descriptions, our dataset enables users to express intent by drawing bounding boxes around objects of interest, providing a fast and intuitive interface for focused queries.
Project Page
Current vision-language models (VLMs) struggle with localized, user-driven queries in real-world autonomous driving scenarios. Existing datasets focus on:
- Full-scene descriptions without spatial specificity
- Waypoint prediction rather than interpretable communication
- Idealized assumptions that don't reflect real user needs
Box-QAymo enables:
- Spatial reasoning about user-specified objects via bounding boxes
- Temporal understanding of object motion and inter-object dynamics
- Hierarchical evaluation from basic perception to complex spatiotemporal reasoning
- Real-world complexity with crowd-sourced fine-grained annotations
- 202 driving scenes from Waymo Open Dataset validation split
- 50% of objects enhanced with crowd-sourced fine-grained semantic labels
- Hierarchical question taxonomy spanning 3 complexity levels:
  - Binary sanity checks (movement status, orientation)
  - Instance-grounded questions (fine-grained classification, color recognition)
  - Motion reasoning (trajectory analysis, relative motion, path conflicts)
- Robust quality control through negative sampling, temporal consistency, and difficulty stratification
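
Path-conflict questions need some geometric test over object trajectories. The dataset's own test is not described here, so the following is only a minimal sketch under a constant-velocity assumption: it computes the time and distance of closest approach between two objects in 2D. The function names and the 2 m threshold are hypothetical.

```python
import math


def closest_approach(p_a, v_a, p_b, v_b):
    """Return (t, d): the time t >= 0 of closest approach and the gap d at t.

    Assumes both objects keep a constant 2D velocity. This is an
    illustrative assumption, not the dataset's documented method.
    """
    # Position and velocity of B relative to A.
    rx, ry = p_b[0] - p_a[0], p_b[1] - p_a[1]
    vx, vy = v_b[0] - v_a[0], v_b[1] - v_a[1]
    speed_sq = vx * vx + vy * vy
    # With zero relative velocity the gap never changes. Otherwise minimise
    # |r + v*t|^2 and clamp to t >= 0 so only the future is considered.
    t = 0.0 if speed_sq == 0.0 else max(0.0, -(rx * vx + ry * vy) / speed_sq)
    d = math.hypot(rx + vx * t, ry + vy * t)
    return t, d


def on_collision_course(p_a, v_a, p_b, v_b, threshold_m=2.0):
    """Hypothetical rule: flag a conflict if the closest gap is within threshold_m."""
    _, d = closest_approach(p_a, v_a, p_b, v_b)
    return d <= threshold_m


# Ego vehicle heading +x at 10 m/s; another object crossing its path at 10 m/s.
t, d = closest_approach((0.0, 0.0), (10.0, 0.0), (50.0, -50.0), (0.0, 10.0))
```

Real trajectories are not constant-velocity, so a rule like this would at best be a first approximation of the motion-reasoning labels.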
| Category | Description | Example |
|---|---|---|
| VLM Sanity Check | Binary questions testing basic scene understanding | "Are there any stationary vehicles?" |
| Instance-Grounded | Questions about specific box-referred objects | "What type of object is in the red box?" |
| Motion Reasoning | Spatiotemporal understanding across frames | "Are the ego vehicle and truck on a collision course?" |
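
One way to represent a question from any of the three categories is a small record holding the category, the question text, the answer options, and the referred object. The sketch below is illustrative only: `QuestionCategory`, `VQASample`, and their fields are hypothetical names, not the schema produced by `vqa_generator.py`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QuestionCategory(Enum):
    SANITY_CHECK = "VLM Sanity Check"
    INSTANCE_GROUNDED = "Instance-Grounded"
    MOTION_REASONING = "Motion Reasoning"


@dataclass
class VQASample:
    category: QuestionCategory
    question: str
    options: list[str]
    answer: str
    object_id: Optional[str] = None  # None for scene-level questions

    def __post_init__(self):
        # Guard against samples whose ground truth is not selectable.
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")

    def to_prompt(self) -> str:
        """Render the question followed by lettered options (A, B, C, D)."""
        letters = "ABCD"
        lines = [self.question]
        lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)


sample = VQASample(
    category=QuestionCategory.INSTANCE_GROUNDED,
    question="What type of object is in the red box?",
    options=["Truck", "Bus", "Sedan"],
    answer="Truck",
    object_id="obj_001",  # hypothetical identifier
)
print(sample.to_prompt())
```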
- Python 3.10
- Access to Waymo Open Dataset v1.4.3
1. Clone and install requirements:

       git clone https://github.com/your-username/box-qaymo
       cd box-qaymo
       pip install -r requirements.txt

2. Download required data:
   - Waymo Open Dataset v1.4.3 validation tfrecords
   - Metadata files from Google Drive

3. Process Waymo data:

       python waymo_extract.py --validation-path /path/to/waymo --output-dir /path/to/output

4. Generate VQA dataset:

       python vqa_generator.py --dataset_path /path/to/output

Our three-stage construction methodology:
- Base dataset: Waymo Open Dataset (superior scene diversity and LiDAR density)
- Crowd-sourced labeling: Fine-grained semantic categories following Argoverse 2.0 taxonomy
- Multi-view annotation: 3×3 object galleries from best visibility crops
- Color attributes: Vehicle color labels for color-based reasoning
- Visual markers: Red bounding boxes instead of numerical coordinates
- Hierarchical complexity: From binary questions to complex spatiotemporal reasoning
- Motion analysis: Both implicit (single-frame) and explicit (multi-frame) approaches
- Negative sampling: Balanced positive/negative examples
- Temporal consistency: Logical consistency across frame sequences
- Difficulty stratification: Granular complexity levels
- Answer formats: 2-4 option multiple choice to prevent binary guessing
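
Negative sampling and the 2-4 option answer format can be illustrated with a short sketch. The helpers below are assumptions made for illustration (`balance_binary` and `build_options` are hypothetical names, not functions from this repository): the first downsamples the majority answer so yes/no questions are balanced, and the second mixes the correct answer with randomly sampled distractors.

```python
import random


def balance_binary(samples, rng):
    """Downsample the majority class so 'yes' and 'no' answers are equal in number.

    `samples` is a list of (question, answer) pairs with answer in {'yes', 'no'}.
    """
    yes = [s for s in samples if s[1] == "yes"]
    no = [s for s in samples if s[1] == "no"]
    n = min(len(yes), len(no))
    balanced = rng.sample(yes, n) + rng.sample(no, n)
    rng.shuffle(balanced)
    return balanced


def build_options(correct, candidates, rng, n_options=4):
    """Return a shuffled list holding `correct` plus up to n_options - 1 distractors."""
    if not 2 <= n_options <= 4:
        raise ValueError("n_options must be between 2 and 4")
    distractors = [c for c in candidates if c != correct]
    k = min(n_options - 1, len(distractors))
    if k < 1:
        raise ValueError("need at least one distractor")
    options = [correct] + rng.sample(distractors, k)
    rng.shuffle(options)
    return options


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [("q1", "yes"), ("q2", "yes"), ("q3", "yes"), ("q4", "no")]
balanced = balance_binary(data, rng)
options = build_options("red", ["red", "white", "black", "blue", "silver"], rng)
```

Downsampling discards data; an alternative under the same assumptions would be to generate extra negative questions until the classes match.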
Our hierarchical evaluation protocol systematically tests VLM capabilities:
Binary Sanity Checks → Instance-Grounded Questions → Motion Reasoning
    (Basic VLM)              (Spatial Focus)         (Temporal Understanding)
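
A per-level report for this protocol can be produced by grouping predictions by level and computing accuracy for each. The sketch below assumes a hypothetical record layout of (level, prediction, ground truth); it is not the output format of the provided scripts.

```python
from collections import defaultdict

# Levels in the order the protocol lists them.
LEVELS = [
    "Binary Sanity Checks",
    "Instance-Grounded Questions",
    "Motion Reasoning",
]


def accuracy_by_level(records):
    """Return {level: accuracy} for every level that has at least one record."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, pred, truth in records:
        total[level] += 1
        correct[level] += int(pred == truth)
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


records = [
    ("Binary Sanity Checks", "yes", "yes"),
    ("Binary Sanity Checks", "no", "yes"),
    ("Instance-Grounded Questions", "Truck", "Truck"),
    ("Motion Reasoning", "yes", "no"),
]
report = accuracy_by_level(records)
```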
Evaluate your models using our provided scripts:

    # LLaVA evaluation
    python llava_predict.py --dataset_path /path/to/data

    # Qwen-VL evaluation
    python qwenvl_predict.py --dataset_path /path/to/data

    # Evaluate all models
    python eval_all_csv.py --dataset_path /path/to/data

Our comprehensive evaluation reveals significant limitations in current VLMs:
- Spatial referencing: Models struggle to correctly identify box-referred objects
- Fine-grained classification: Poor performance on detailed object categorization
- Motion understanding: Difficulty with temporal reasoning and trajectory analysis
- Real-world gap: Performance drops significantly under realistic conditions
Box-QAymo enables research in:
- Interpretable autonomous driving systems
- Spatial-aware vision-language models
- Human-AI interaction in safety-critical domains
- Temporal reasoning in dynamic environments
- User intent understanding through visual references
If you use Box-QAymo in your research, please cite:

    @article{box_qaymo_2025,
      title={Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving},
      author={Etchegaray, Djamahl and Fu, Yuxia and Huang, Zi and Luo, Yadan},
      journal={arXiv preprint},
      year={2025}
    }

Core Components
- WaymoDatasetLoader: Extracts and processes Waymo scenes
- SceneInfo: Complete scene representation with temporal data
- ObjectInfo: Enhanced object annotations with fine-grained labels

- BasePromptGenerator: Abstract base for question generators
- ObjectBinaryPromptGenerator: Binary sanity check questions
- ObjectDrawnBoxPromptGenerator: Instance-grounded questions
- EgoRelativeObjectTrajectoryPromptGenerator: Motion reasoning questions

- MultipleChoiceMetric: Accuracy, Recall, Precision, F1 evaluation for MCQ
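
For multiple-choice answers, precision, recall, and F1 are usually computed per answer class and then averaged. The following is a generic sketch of that computation with macro averaging. It is not the actual `MultipleChoiceMetric` implementation, whose interface and averaging scheme are not described here, and `multiple_choice_scores` is a hypothetical name.

```python
def multiple_choice_scores(predictions, targets):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal length")

    pairs = list(zip(predictions, targets))
    classes = sorted(set(predictions) | set(targets))
    precisions, recalls, f1s = [], [], []

    for c in classes:
        # One-vs-rest counts for answer class c.
        tp = sum(p == c and t == c for p, t in pairs)
        fp = sum(p == c and t != c for p, t in pairs)
        fn = sum(p != c and t == c for p, t in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": sum(p == t for p, t in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


scores = multiple_choice_scores(
    predictions=["yes", "yes", "no", "no"],
    targets=["yes", "no", "no", "no"],
)
```

Macro averaging weights every answer class equally, which matters when one option is far more common than the others; a micro or weighted average would be an equally valid design choice.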
We welcome contributions! Areas of particular interest:
- New question types and complexity levels
- Additional evaluation metrics
- Model integration scripts
- Analysis tools and visualizations
MIT License - see LICENSE file for details.
Note: This project processes data from the Waymo Open Dataset, which requires separate licensing from Waymo.
- Waymo Open Dataset team for providing high-quality autonomous driving data
- Crowd-sourcing annotators for fine-grained semantic labels
- Argoverse team for the semantic taxonomy
Ready to evaluate your VLM on real-world driving scenarios? Get started with Box-QAymo today!