A Survey on Multimodal Large Language Models
for Autonomous Driving
August 22, 2025
Abstract
Large Language Models (LLMs) such as GPT-4, PaLM-2, and LLaMA-
2 have demonstrated remarkable human-like reasoning capabilities. Their
integration into autonomous driving systems, forming Multimodal Large
Language Models (MLLMs) that process diverse data types (images, video,
LiDAR, maps, speech), holds the potential to revolutionize the core pillars
of autonomous driving: perception, planning, and control. This paper pro-
vides a comprehensive overview of the development of MLLMs and their
application in autonomous driving. We survey their use in perception,
planning, control, and industry applications, review relevant datasets and
benchmarks, summarize the inaugural LLVM-AD workshop, and discuss
open challenges and future directions. While significant barriers in la-
tency, safety, and data limitations remain, MLLMs represent a promising
pathway towards achieving higher levels of automation and more intuitive
human-vehicle interaction.
1 Introduction
The advent of Large Language Models (LLMs) like GPT-4, PaLM-2, and LLaMA-
2 has marked a significant milestone in artificial intelligence, showcasing emer-
gent abilities in reasoning, in-context learning, and complex problem-solving.
These capabilities are now being explored for safety-critical applications such as
autonomous driving (AD). By fusing LLMs with multimodal data—including
images, video, LiDAR, HD maps, and speech—Multimodal Large Language
Models (MLLMs) are poised to transform the three foundational pillars of au-
tonomous driving:
• Perception: Advanced understanding of dynamic scenes using sensor
fusion.
• Planning: Sophisticated reasoning about routes, interactions, and long-
term goals.
• Control: Customizable motion generation and explainable action execu-
tion.
2 Development of Autonomous Driving
The pursuit of autonomous driving has evolved over decades. Early milestones
include Carnegie Mellon’s Autonomous Land Vehicle (ALV) in the 1980s and the
seminal DARPA Grand Challenge in 2005, won by Stanford’s Stanley. The SAE
Levels of Driving Automation (0-5), standardized in 2014, provide a framework
for classifying technological progress.
The deep learning era catalyzed advancement, with Convolutional Neu-
ral Networks (CNNs) revolutionizing object detection, Deep Neural Networks
(DNNs) enhancing scene understanding, and Deep Reinforcement Learning (DRL)
enabling adaptive decision-making.
The current landscape features widespread Level 1-2 Advanced Driver-Assistance
Systems (ADAS) like Tesla Autopilot, NVIDIA DRIVE, and Baidu Apollo.
Level 3+ robotaxi services are being deployed by companies like Waymo, Cruise,
Zoox, and Baidu. However, significant limitations persist, including performance degradation in rare scenarios and adverse weather conditions, safety concerns, and the “black-box” nature of DNNs.
Current research trends focus on Trustworthy AI (explainability, adversarial
robustness), Vehicle-to-Everything (V2X) communication, and increasingly, the
integration of LLMs for their superior reasoning and potential to enhance safety.
3 Development of Multimodal Language Models
3.1 Language Models
The evolution of Language Models began with rule-based NLP in the 1960s,
progressed to statistical models (N-gram, HMM) in the 1990s, and shifted to
neural models (RNN, LSTM) in the 2000s. The introduction of Word2Vec in
2013 provided dense semantic embeddings, but the true paradigm shift arrived
with the Transformer architecture in 2017, paving the way for modern LLMs.
3.2 Advancements in LLMs
Models like GPT-3, PaLM, LLaMA, and GPT-4, with billions of parameters, ex-
hibit emergent abilities such as in-context learning and chain-of-thought reason-
ing. A new frontier is embodied AI, where LLMs are integrated with perception
and action loops in environments (e.g., Voyager, VoxPoser).
3.3 Multimodal Models Evolution
Early multimodal research focused on tasks like image captioning, Visual Ques-
tion Answering (VQA), and scene understanding. Architectures combining
CNNs and RNNs were common. The field advanced with large-scale pretraining
on multimodal data, leading to models like CLIP, BLIP, and Flamingo. The rise
of generative models, including DALL·E and Stable Diffusion, further expanded
capabilities.
3.4 Emergence of MLLMs
The fusion of LLMs with vision encoders created powerful MLLMs such as
LLaVA, PaLM-E, Video-LLaMA, and GPT-4V. Key enabling techniques in-
clude:
• Multimodal instruction tuning (a schematic data record is sketched after this list)
• Multimodal in-context learning
• Multimodal chain-of-thought reasoning
• LLM-aided visual reasoning
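To make the first of these techniques concrete, the snippet below sketches what a single multimodal instruction-tuning record could look like, loosely following the conversation-style format popularized by LLaVA. The field names, image path, and dialogue content are illustrative assumptions rather than a prescribed schema.

# A schematic multimodal instruction-tuning record (illustrative field names).
# The <image> placeholder marks where the vision encoder's tokens are spliced
# into the language model's input during training.
sample = {
    "id": "scene_0042",                    # hypothetical example ID
    "image": "frames/front_cam_0042.jpg",  # hypothetical path to a camera frame
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nIs it safe to change into the left lane?",
        },
        {
            "from": "gpt",
            "value": (
                "No. A motorcycle is approaching quickly in the left lane, "
                "roughly 15 m behind the ego vehicle, so a lane change now "
                "would cut it off."
            ),
        },
    ],
}

if __name__ == "__main__":
    # During fine-tuning, many such records teach the MLLM to follow
    # driving-related instructions grounded in the visual input.
    print(sample["conversations"][0]["value"])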
4 MLLMs for Autonomous Driving
4.1 Perception
Traditional perception systems are limited to a predefined set of object cate-
gories. MLLMs overcome this by learning from raw, unstructured text descrip-
tions fused with multimodal sensor data. Examples include:
• Talk2BEV: Fuses Bird’s-Eye View (BEV) maps with language for complex reasoning (a minimal prompt sketch follows this list).
• DriveGPT4: Processes driving videos to generate textual descriptions
and driving responses.
• HiLM-D: Detects hazardous scenarios using high-resolution input.
• Generative World Models: Models like GAIA-1 (Wayve) and UniSim
(Google) simulate realistic driving scenarios for training and testing.
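As an illustration of the Talk2BEV-style idea above, the following sketch serializes a hand-written BEV object list into a text prompt that a language model could reason over. The object schema and the query_llm stub are assumptions made for illustration; they do not reproduce the API of any of the systems cited above.

from typing import List, Dict

def bev_objects_to_prompt(objects: List[Dict], question: str) -> str:
    """Serialize BEV detections into a textual scene description plus a question.

    Each object dict is assumed to carry a class label, an (x, y) position in
    ego-centric metres, and a speed in m/s; this schema is purely illustrative.
    """
    lines = ["You are a driving assistant. Bird's-eye-view scene:"]
    for i, obj in enumerate(objects):
        lines.append(
            f"  object {i}: {obj['label']} at x={obj['x']:+.1f} m, "
            f"y={obj['y']:+.1f} m, speed={obj['speed']:.1f} m/s"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for an MLLM call (local or hosted model).
    return "Placeholder answer."

if __name__ == "__main__":
    scene = [
        {"label": "pedestrian", "x": 4.2, "y": 1.0, "speed": 1.3},
        {"label": "car", "x": 18.5, "y": -3.5, "speed": 12.0},
    ]
    prompt = bev_objects_to_prompt(scene, "Should the ego vehicle yield before turning right?")
    print(prompt)
    print(query_llm(prompt))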
4.2 Planning and Control
MLLMs provide a natural language interface for human-in-the-loop planning
and control, enhancing transparency and adaptability.
• Drive as You Speak (DaYS): Uses GPT-4 to translate natural language
commands into executable driving plans.
• SurrealDriver: Leverages LLM agents to simulate realistic human driv-
ing behaviors.
• GPT-Driver: Reformulates motion planning as a language modeling task (a minimal sketch of this formulation appears at the end of this subsection).
• LanguageMPC: Adapts Model Predictive Control (MPC) parameters
based on LLM reasoning.
A critical advantage is explainability; MLLMs can generate text explanations
for their driving decisions, building crucial trust.
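To illustrate the planning-as-language-modeling formulation exemplified by GPT-Driver, the sketch below phrases a planning query as text and parses the model’s reply back into numeric waypoints. The prompt wording, the waypoint line format, and the ask_planner_llm stub are assumptions for illustration, not the published method.

import re
from typing import List, Tuple

def build_planning_prompt(scene_summary: str, horizon_s: float = 3.0) -> str:
    """Phrase the planning problem as a text completion task (illustrative)."""
    return (
        "You are a motion planner.\n"
        f"Scene: {scene_summary}\n"
        f"Output {int(horizon_s)} future waypoints for the ego vehicle as "
        "lines of the form 'WP t=<s> x=<m> y=<m>'."
    )

def parse_waypoints(reply: str) -> List[Tuple[float, float, float]]:
    """Extract (t, x, y) triples from the model's textual reply."""
    pattern = r"WP\s+t=([-\d.]+)\s+x=([-\d.]+)\s+y=([-\d.]+)"
    return [tuple(map(float, m)) for m in re.findall(pattern, reply)]

def ask_planner_llm(prompt: str) -> str:
    # Hypothetical stand-in for the LLM call; returns a canned reply here.
    return "WP t=1.0 x=5.2 y=0.1\nWP t=2.0 x=10.4 y=0.3\nWP t=3.0 x=15.8 y=0.6"

if __name__ == "__main__":
    prompt = build_planning_prompt("two-lane road, lead vehicle 20 m ahead at 8 m/s")
    trajectory = parse_waypoints(ask_planner_llm(prompt))
    print(trajectory)  # [(1.0, 5.2, 0.1), (2.0, 10.4, 0.3), (3.0, 15.8, 0.6)]

Turning structured text back into numbers is where such pipelines tend to be most fragile, which is one reason output validation and constrained decoding are active concerns for language-based planners.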
4.3 Industry Applications
Industry leaders are actively developing and deploying MLLMs:
• Wayve: LINGO-1 for explainable driving, GAIA-1 as a generative simu-
lator.
• Tencent: MAPLM, a large-scale dataset and model for map and traffic
scene understanding.
• Waymo: MotionLM, framing trajectory prediction as language modeling.
• Bosch and HKUST: Applying MLLMs for traffic risk prediction.
5 Datasets and Benchmarks
Foundation vision datasets like KITTI, nuScenes, and the Waymo Open Dataset
have been instrumental. The rise of MLLMs has spurred the creation of multimodal-
language driving datasets:
• Talk2Car: Natural language commands for object referral.
• nuScenes-QA: Question-Answer pairs based on the nuScenes dataset.
• DriveLM: Combines driving scenes with a language structure for reason-
ing.
• MAPLM (Tencent): A large-scale dataset (2M frames) with aligned HD
maps, LiDAR, camera images, and text descriptions.
A significant gap remains: existing datasets lack the scale, diversity, and linguistic complexity needed to support human-level scene comprehension.
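To make the structure of such multimodal-language driving data concrete, the sketch below defines a minimal record type that links sensor files to question-answer text. The field names are illustrative assumptions and do not reproduce any particular dataset’s schema.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DrivingQARecord:
    """One multimodal-language sample (illustrative schema, not a real dataset's)."""
    scene_id: str                     # identifier of the driving scene
    camera_paths: List[str]           # surround-view or front-camera image files
    lidar_path: Optional[str]         # LiDAR sweep, if available
    hd_map_path: Optional[str]        # HD-map tile or vectorized lanes, if available
    question: str                     # natural-language query about the scene
    answer: str                       # reference answer used for training/evaluation
    tags: List[str] = field(default_factory=list)  # e.g., ["intersection", "night"]

example = DrivingQARecord(
    scene_id="scene_0001",
    camera_paths=["cam_front/0001.jpg"],
    lidar_path="lidar/0001.bin",
    hd_map_path=None,
    question="How many lanes does the road ahead have?",
    answer="Three lanes, with the rightmost marked as a bus lane.",
    tags=["urban", "daytime"],
)

if __name__ == "__main__":
    print(example.question, "->", example.answer)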
6 LLVM-AD Workshop (WACV 2024)
The first workshop on Large Language and Vision Models for Autonomous Driv-
ing (LLVM-AD) was held at WACV 2024. Key contributions included:
• Research on human-like reasoning (Drive as You Speak, Drive Like a Hu-
man).
• Focus on human-centric autonomous systems and user interaction.
• Applications in UAV planning and object detection.
The workshop also launched two open-source datasets: MAPLM-QA (for
question-answering) and UCU (a driver command dataset).
7 Discussion and Future Directions
Several critical avenues for future work exist:
• New Datasets: Large-scale datasets encompassing multi-modal data
(panoramic images, LiDAR, HD maps) with rich, aligned textual annota-
tions are urgently needed.
• Hardware Challenges: The computational latency and power consumption of LLMs are prohibitive for real-time driving. Research into model compression, quantization, and efficient inference is crucial (a minimal quantization sketch follows this list).
• HD Maps Encoding: Developing effective methods to encode com-
plex HD map structures into a language that LLMs can understand (e.g.,
Tesla’s “language of lanes”, Baidu’s ERNIE-GeoL, Tencent’s THMA).
• User-Vehicle Interaction: Leveraging MLLMs to interpret multi-modal
human inputs (speech, gestures, gaze) to detect driver state (e.g., distrac-
tion) and adapt vehicle behavior accordingly.
• Personalized Driving: Adapting driving policies to individual user pref-
erences (e.g., aggressive vs. cautious styles) through natural language
interaction.
• Trust and Safety: Ensuring MLLMs can provide verifiable explanations for decisions (e.g., “why was overtaking safe?”), estimate uncertainty, and
reliably handle edge cases.
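As a small, concrete illustration of the quantization direction mentioned under Hardware Challenges, the sketch below applies PyTorch’s post-training dynamic quantization to a toy stand-in model. The toy architecture is an assumption; quantizing a full MLLM for in-vehicle deployment involves far more engineering than this, but the API call pattern is representative.

import torch
import torch.nn as nn

# Toy stand-in for a much larger language-model block (illustrative only).
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
model.eval()

# Post-training dynamic quantization: weights of Linear layers are stored in
# int8 and dequantized on the fly, reducing memory footprint and often latency
# on CPU. Static quantization, weight-only schemes, pruning, and distillation
# are complementary options depending on the target automotive hardware.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    out_fp32 = model(x)
    out_int8 = quantized(x)

print("max abs difference:", (out_fp32 - out_int8).abs().max().item())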
8 Conclusion
Multimodal Large Language Models represent a paradigm shift in autonomous
driving by merging the powerful reasoning capabilities of LLMs with rich, multi-
sensor data. They offer the potential to create systems that not only understand
complex traffic scenes and plan effectively but also interact with humans in a
natural and intuitive manner. While significant challenges in latency, safety
verification, and data scarcity remain, MLLMs are a rapidly evolving technology
with transformative potential for achieving robust SAE Level 4-5 autonomy.