Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input–output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception–question–prediction–action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.
[2025/11/13] CoT4AD paper is under review!
- [ ] CoT4AD Inference Framework
- [ ] Open-loop Evaluation
- [ ] Closed-loop Evaluation
- [ ] CoT4AD Checkpoint
- [ ] CoT4AD Training Framework
git clone https://github.com/wzh506/CoT4AD.git
cd CoT4AD
conda create -n cot python=3.8 -y
conda activate cot
pip install torch==2.4.1+cu118 torchvision==0.19.1+cu118 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install -v -e .
pip install -r requirements.txt
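After installation, you can optionally sanity-check the environment; the snippet below only verifies the CUDA-enabled PyTorch build installed above and does not depend on any CoT4AD-specific module.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"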
You can refer to here to prepare the Bench2Drive dataset.
CoT4AD uses the pretrained 2D LLM weights and the vision encoder + projector weights provided by OmniDrive.
cd /path/to/CoT4AD
mkdir ckpts
The vision encoder + projector weights are extracted from ckpts/pretrain_qformer/, which is pretrained using LLaVA data.
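As a rough sketch of the expected layout (only pretrain_qformer/ is named above; the LLM weight directory name is a placeholder and should match the OmniDrive release you download):
ckpts/pretrain_qformer/    # vision encoder + projector weights (pretrained on LLaVA data)
ckpts/<2d_llm_weights>/    # pretrained 2D LLM weights from OmniDrive (placeholder name)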
To help reproduce the results of CoT4AD, our Chat-B2D dataset is provided here.
CoT4AD follows a three-stage training process. For stage 1, download the Chat-B2D dataset and put it under the data/ directory:
unzip Chat-B2D.zip -d data/
We use the Chat-B2D data for stage 1 pre-training:
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage1_train.py $GPUS
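For example, on a single node with 8 GPUs (the GPU count is illustrative; set it to match your machine):
GPUS=8
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage1_train.py $GPUS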
After stage 1 training is completed, you can start stage 2 training with the following command (remember to update load_from in the config):
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage2_train.py $GPUS
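A minimal sketch of the load_from update, assuming load_from is a top-level assignment in the config and that stage 1 saved its checkpoints under work_dirs/; both the sed pattern and the checkpoint path below are placeholders for your actual run.
sed -i "s|^load_from = .*|load_from = 'work_dirs/cot_stage1_train/latest.pth'|" adzoo/cot/configs/cot_stage2_train.py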
You can perform an open-loop evaluation of CoT4AD with the following command:
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_infer.py [--PATH_CHECKPOINTS] 1
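As a concrete example on a single GPU, assuming the second argument is the checkpoint path as the placeholder above suggests (the file name below is hypothetical; replace it with your own checkpoint):
CKPT=ckpts/cot4ad_stage2.pth
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_infer.py $CKPT 1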
You can also perform CoT inference with CoT4AD using the following command (this might be quite slow):
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_cot.py [--PATH_CHECKPOINTS] 1
We recommend running CoT4AD inference on an NVIDIA A800 or another GPU with more than 32 GB of memory (inference runs in FP32 by default).
CoT4AD also supports FP16 inference with almost the same performance; for FP16 inference, we recommend a GPU with more than 17 GB of memory:
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_fp16.py [--PATH_CHECKPOINTS] 1
You can refer to here to clone the Bench2Drive evaluation tools and prepare CARLA for them.
Follow here to use the Bench2Drive evaluation tools.
Note that you should first verify the correctness of the team agent; you need to set GPU_RANK, TEAM_AGENT, and TEAM_CONFIG in the eval scripts.
You can set TEAM_CONFIG as follows for closed-loop evaluation:
TEAM_CONFIG=adzoo/cot/configs/cot_stage3_agent.py+[CHECKPOINT_PATH]
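A hedged sketch of the relevant variables in the Bench2Drive eval script; the GPU index, team-agent path, and checkpoint path are placeholders, and only the config half of TEAM_CONFIG comes from this repository.
GPU_RANK=0                                        # GPU index used for evaluation
TEAM_AGENT=/path/to/bench2drive/team_agent.py     # placeholder team-agent entry point
TEAM_CONFIG=adzoo/cot/configs/cot_stage3_agent.py+/path/to/cot4ad_checkpoint.pth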
The results of UniAD & VAD are refer to the official results of Bench2DriveZoo
| Method | L2 (m) @ 2s | Driving Score | Success Rate (%) | Config | Download | Eval Json |
|---|---|---|---|---|---|---|
| UniAD-Tiny | 0.80 | 40.73 | 13.18 | config | Hugging Face/Baidu Cloud | Json |
| UniAD-Base | 0.73 | 45.81 | 16.36 | config | Hugging Face/Baidu Cloud | Json |
| VAD | 0.91 | 42.35 | 15.00 | config | Hugging Face/Baidu Cloud | Json |
| CoT4AD | 0.68 | 77.74 | 54.62 | config | Hugging Face | Json |
If this work is helpful for your research, please consider citing:
@article{wang2025cot4ad,
title={CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving},
<!-- author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Chong Zhang and Dingyuan Zhang and Hongwei Xie and Bing Wang and Xiang Bai}, -->
<!-- journal={arXiv:2503.19755}, -->
year={2025}
}