Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input–output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception–question–prediction–action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.
[2025/11/13] CoT4AD paper is under review!
- [ ] CoT4AD Inference Framework
- [ ] Open-loop Evaluation
- [ ] Closed-loop Evaluation
- [ ] CoT4AD Checkpoint
- [ ] CoT4AD Training Framework
git clone https://github.com/wzh506/CoT4AD.git
cd CoT4AD
conda create -n cot python=3.8 -y
conda activate cot
pip install torch==2.4.1+cu118 torchvision==0.19.1+cu118 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install -v -e .
pip install -r requirements.txt
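After installation, you can optionally sanity-check the environment; the snippet below only verifies the CUDA-enabled PyTorch build installed above and does not depend on any CoT4AD-specific module.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"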
You can refer to here to prepare the Bench2Drive dataset.
CoT4AD uses the pretrained 2D LLM weights and the vision encoder + projector weights provided by OmniDrive.
cd /path/to/CoT4AD
mkdir ckpts
The vision encoder + projector weights are extracted from ckpts/pretrain_qformer/, which is pretrained using LLaVA data.
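As a rough sketch of the expected layout (only pretrain_qformer/ is named above; the LLM weight directory name is a placeholder and should match the OmniDrive release you download):
ckpts/pretrain_qformer/    # vision encoder + projector weights (pretrained on LLaVA data)
ckpts/<2d_llm_weights>/    # pretrained 2D LLM weights from OmniDrive (placeholder name)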
To help reproduce the results of CoT4AD, our Chat-B2D dataset is provided here.
CoT4AD follows a three-stage training process. For stage 1, download the Chat-B2D dataset and put it under the data/ directory:
unzip Chat-B2D.zip -d data/
We use the Chat-B2D data for stage 1 pre-training:
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage1_train.py $GPUS
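For example, on a single node with 8 GPUs (the GPU count is illustrative; set it to match your machine):
GPUS=8
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage1_train.py $GPUS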
After stage 1 training is completed, you can start stage 2 training with the following command (remember to update load_from in the config):
./adzoo/cot/cot_dist_train.sh adzoo/cot/configs/cot_stage2_train.py $GPUS
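A minimal sketch of the load_from update, assuming load_from is a top-level assignment in the config and that stage 1 saved its checkpoints under work_dirs/; both the sed pattern and the checkpoint path below are placeholders for your actual run.
sed -i "s|^load_from = .*|load_from = 'work_dirs/cot_stage1_train/latest.pth'|" adzoo/cot/configs/cot_stage2_train.py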
You can perform an open-loop evaluation of CoT4AD with the following command:
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_infer.py [--PATH_CHECKPOINTS] 1
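As a concrete example on a single GPU, assuming the second argument is the checkpoint path as the placeholder above suggests (the file name below is hypothetical; replace it with your own checkpoint):
CKPT=ckpts/cot4ad_stage2.pth
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_infer.py $CKPT 1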
You can also perform CoT inference with CoT4AD using the following command (this might be quite slow):
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_cot.py [--PATH_CHECKPOINTS] 1
We recommend running CoT4AD inference on an NVIDIA A800 or another GPU with more than 32 GB of memory (inference runs in FP32 by default).
CoT4AD also supports FP16 inference with almost the same performance; for FP16 inference, we recommend a GPU with more than 17 GB of memory:
./adzoo/cot/cot_dist_eval.sh adzoo/cot/configs/cot_stage3_fp16.py [--PATH_CHECKPOINTS] 1
You can refer to here to clone the Bench2Drive evaluation tools and prepare CARLA for them.
Follow here to use the Bench2Drive evaluation tools.
Note that you should first verify the correctness of the team agent; you need to set GPU_RANK, TEAM_AGENT, and TEAM_CONFIG in the eval scripts.
You can set TEAM_CONFIG as follows for closed-loop evaluation:
TEAM_CONFIG=adzoo/cot/configs/cot_stage3_agent.py+[CHECKPOINT_PATH]
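A hedged sketch of the relevant variables in the Bench2Drive eval script; the GPU index, team-agent path, and checkpoint path are placeholders, and only the config half of TEAM_CONFIG comes from this repository.
GPU_RANK=0                                        # GPU index used for evaluation
TEAM_AGENT=/path/to/bench2drive/team_agent.py     # placeholder team-agent entry point
TEAM_CONFIG=adzoo/cot/configs/cot_stage3_agent.py+/path/to/cot4ad_checkpoint.pth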
The results of UniAD & VAD are refer to the official results of Bench2DriveZoo
| Method | L2 (m) @ 2s | Driving Score | Success Rate (%) | Config | Download | Eval Json |
|---|---|---|---|---|---|---|
| UniAD-Tiny | 0.80 | 40.73 | 13.18 | config | Hugging Face/Baidu Cloud | Json |
| UniAD-Base | 0.73 | 45.81 | 16.36 | config | Hugging Face/Baidu Cloud | Json |
| VAD | 0.91 | 42.35 | 15.00 | config | Hugging Face/Baidu Cloud | Json |
| CoT4AD | 0.68 | 77.74 | 54.62 | config | Hugging Face | Json |
If this work is helpful for your research, please consider citing:
@article{wang2025cot4ad,
title={CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving},
<!-- author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Chong Zhang and Dingyuan Zhang and Hongwei Xie and Bing Wang and Xiang Bai}, -->
<!-- journal={arXiv:2503.19755}, -->
year={2025}
}