wm-research/recogdrive

 
 

ReCogDrive

A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li1,2*, Kaixin Xiong2*, Xiangyu Guo1,2, Fang Li2, Sixu Yan1, Gangwei Xu1,2,
Lijun Zhou2, Long Chen2, Haiyang Sun2†, Bing Wang2, Guang Chen2,
Hangjun Ye2, Wenyu Liu1, Xinggang Wang1✉

1Huazhong University of Science and Technology
2Xiaomi EV

(*) Equal contribution. (†) Project leader. (✉) Corresponding author.

arXiv 2025

Paper PDF | Project Page | Hugging Face weights | Hugging Face datasets

News

  • Aug. 24th, 2025: We have released all driving pretraining QA data, covering 12 driving datasets and our own annotated NAVSIM data. We have also rewritten the scoring, filtering, and evaluation for the open-source data. If it's helpful to you, feel free to star and cite our work! 🚗💨
  • Aug. 21st, 2025: We have released the initial version of the code and weights on NAVSIM, along with documentation and training/evaluation scripts. We will also release the revised paper and the pretraining datasets later this month or next month. Please stay tuned! ☕️
  • Jun. 11th, 2025: We released our paper on arXiv. Code and models are coming soon. Please stay tuned! ☕️

Updates

  • Release Paper
  • Release Full Models and Training/Evaluation Framework
  • Release Full Driving QA Datasets
  • Release updated paper

Abstract

Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) the tendency of imitation learning to capture the average behavior present in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use large-scale driving question-answering datasets to train the VLMs, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner using reinforcement learning with the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 90.5 and setting a new state of the art that surpasses the previous vision-only SOTA by 6.5 PDMS.

Qualitative Results on NAVSIM Navtest

This visualization highlights ReCogDrive’s ability to generate smooth trajectories, accurate scene summaries, and clear driving instructions. By identifying key objects such as vehicles and traffic signals, it achieves robust end-to-end autonomous driving with enhanced cognition.

Getting Started

Checkpoint

Results on NAVSIM

| Method | Model Size | Training Stage | PDMS | Weight Download |
| --- | --- | --- | --- | --- |
| ReCogDrive(VLM)-Base | 2B | Stage 1 | 84.8 | Model |
| ReCogDrive-Base | 2B + 35M | Stage 1&2&3 | 90.3 | Model |
| ReCogDrive(VLM)-Large | 8B | Stage 1 | 86.8 | Model |
| ReCogDrive-Large | 8B + 35M | Stage 1&2&3 | 90.5 | Model |
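
For readers unfamiliar with the PDMS column, here is a minimal sketch of how a PDM Score is aggregated. It assumes the formulation described for the NAVSIM benchmark: two multiplicative penalties, no at-fault collision (NC) and drivable area compliance (DAC), multiplied by a weighted average of ego progress (EP), time-to-collision (TTC), and comfort (C) with weights 5, 5, and 2. The function and argument names are illustrative and are not taken from this repository. The table values appear to be this score on a 0-100 scale.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Aggregate sub-scores in [0, 1] into a single PDM Score in [0, 1].

    Assumed formulation (NAVSIM benchmark):
        PDMS = NC * DAC * (5*EP + 5*TTC + 2*C) / 12
    """
    subscores = {"nc": nc, "dac": dac, "ep": ep, "ttc": ttc, "comfort": comfort}
    for name, value in subscores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # Weighted average of the three "soft" metrics.
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # NC and DAC act as hard penalties: a zero in either zeroes the score.
    return nc * dac * weighted


if __name__ == "__main__":
    # Hypothetical sub-scores for a single scenario, reported on a 0-100 scale.
    print(round(100 * pdm_score(nc=1.0, dac=1.0, ep=0.8, ttc=1.0, comfort=1.0), 1))
```

Because NC and DAC multiply the result, a single at-fault collision or off-road excursion dominates the score no matter how well the trajectory does on progress or comfort.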

Driving Pretraining Datasets

| Datasets | Source | Rewritten and Filtered Annotations (JSONL) |
| --- | --- | --- |
| NAVSIM-Traj | - | JSONL |
| NAVSIM-ReCogDrive | - | JSONL |
| DriveLM | link | - |
| Nuinstruct | link | JSONL |
| NuscenesQA | link | JSONL |
| Omnidrive | link | JSONL |
| Senna | link | JSONL |
| LingoQA | link | JSONL |
| Drama | link | JSONL |
| MapLM | link | JSONL |
| Talk2Car | link | JSONL |
| Drivegpt4 | link | JSONL |
| CODA-LM | link | JSONL |
| SUTD | link | JSONL |

ReCogDrive is pretrained on 12 open-source driving datasets. For most of these datasets, we used Qwen2.5VL-72B to re-annotate the answers, applied standardized scoring, and filtered the results to obtain 12 high-quality QA datasets. In addition, we built an automated annotation pipeline on NAVSIM that generated 752k QA pairs. These resources enable VLMs to better adapt to driving scenarios.

We open-sourced these high-quality driving QA datasets in the hope of supporting research on Vision-Language-Action (VLA) for driving. If the official maintainers of any dataset prefer that we do not release the JSON annotations, we will remove them immediately. Please note that if you use these datasets, you must comply with the original licenses of the respective datasets. We emphasize that our usage of these datasets is solely for academic research purposes, with no commercial applications involved.
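
As a starting point for working with the released annotations, the sketch below reads a JSONL file (one JSON object per line) and counts which top-level keys appear. The record schema is not documented here, so the code makes no assumption about field names. The helper names `iter_jsonl` and `summarize_keys` and the file path in the usage example are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str | Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            yield record


def summarize_keys(path: str | Path) -> dict[str, int]:
    """Count how often each top-level key occurs, to inspect an unknown schema."""
    counts: dict[str, int] = {}
    for record in iter_jsonl(path):
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical path: replace with a downloaded annotation file.
    annotations = Path("annotations.jsonl")
    if annotations.exists():
        print(summarize_keys(annotations))
```

Streaming line by line keeps memory use flat, which matters for files holding hundreds of thousands of QA pairs.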

Contact

If you have any questions, please contact Yongkang Li via email (liyk@hust.edu.cn) or WeChat (liyk_0803).

Acknowledgement

ReCogDrive is greatly inspired by the following outstanding contributions to the open-source community: NAVSIM, DPPO, LightningDiT, DiffusionDrive, Senna, GR00T.

Citation

If you find ReCogDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{li2025recogdrive,
  title={ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving},
  author={Li, Yongkang and Xiong, Kaixin and Guo, Xiangyu and Li, Fang and Yan, Sixu and Xu, Gangwei and Zhou, Lijun and Chen, Long and Sun, Haiyang and Wang, Bing and others},
  journal={arXiv preprint arXiv:2506.08052},
  year={2025}
}
