A Survey on Multimodal Large Language Models
for Autonomous Driving
August 22, 2025
Abstract
Large Language Models (LLMs) such as GPT-4, PaLM-2, and LLaMA-
2 have demonstrated remarkable human-like reasoning capabilities. Their
integration into autonomous driving systems, forming Multimodal Large
Language Models (MLLMs) that process diverse data types (images, video,
LiDAR, maps, speech), holds the potential to revolutionize the core pillars
of autonomous driving: perception, planning, and control. This paper pro-
vides a comprehensive overview of the development of MLLMs and their
application in autonomous driving. We survey their use in perception,
planning, control, and industry applications, review relevant datasets and
benchmarks, summarize the inaugural LLVM-AD workshop, and discuss
open challenges and future directions. While significant barriers in la-
tency, safety, and data limitations remain, MLLMs represent a promising
pathway towards achieving higher levels of automation and more intuitive
human-vehicle interaction.
1 Introduction
The advent of Large Language Models (LLMs) like GPT-4, PaLM-2, and LLaMA-
2 has marked a significant milestone in artificial intelligence, showcasing emer-
gent abilities in reasoning, in-context learning, and complex problem-solving.
These capabilities are now being explored for safety-critical applications such as
autonomous driving (AD). By fusing LLMs with multimodal data—including
images, video, LiDAR, HD maps, and speech—Multimodal Large Language
Models (MLLMs) are poised to transform the three foundational pillars of au-
tonomous driving:
• Perception: Advanced understanding of dynamic scenes using sensor
fusion.
• Planning: Sophisticated reasoning about routes, interactions, and long-
term goals.
• Control: Customizable motion generation and explainable action execu-
tion.
2 Development of Autonomous Driving
The pursuit of autonomous driving has evolved over decades. Early milestones
include Carnegie Mellon’s Autonomous Land Vehicle (ALV) in the 1980s and the
seminal DARPA Grand Challenge in 2005, won by Stanford’s Stanley. The SAE
Levels of Driving Automation (0-5), standardized in 2014, provide a framework
for classifying technological progress.
The deep learning era catalyzed advancement, with Convolutional Neu-
ral Networks (CNNs) revolutionizing object detection, Deep Neural Networks
(DNNs) enhancing scene understanding, and Deep Reinforcement Learning (DRL)
enabling adaptive decision-making.
The current landscape features widespread Level 1-2 Advanced Driver-Assistance
Systems (ADAS) like Tesla Autopilot, NVIDIA DRIVE, and Baidu Apollo.
Level 3+ robotaxi services are being deployed by companies like Waymo, Cruise,
Zoox, and Baidu. However, significant limitations persist, including performance degradation in rare scenarios and adverse weather conditions, safety concerns, and the “black-box” nature of DNNs.
Current research trends focus on Trustworthy AI (explainability, adversarial
robustness), Vehicle-to-Everything (V2X) communication, and increasingly, the
integration of LLMs for their superior reasoning and potential to enhance safety.
3 Development of Multimodal Language Models
3.1 Language Models
The evolution of Language Models began with rule-based NLP in the 1960s,
progressed to statistical models (N-gram, HMM) in the 1990s, and shifted to
neural models (RNN, LSTM) in the 2000s. The introduction of Word2Vec in
2013 provided dense semantic embeddings, but the true paradigm shift arrived
with the Transformer architecture in 2017, paving the way for modern LLMs.
3.2 Advancements in LLMs
Models like GPT-3, PaLM, LLaMA, and GPT-4, with billions of parameters, ex-
hibit emergent abilities such as in-context learning and chain-of-thought reason-
ing. A new frontier is embodied AI, where LLMs are integrated with perception
and action loops in environments (e.g., Voyager, VoxPoser).
3.3 Multimodal Models Evolution
Early multimodal research focused on tasks like image captioning, Visual Ques-
tion Answering (VQA), and scene understanding. Architectures combining
CNNs and RNNs were common. The field advanced with large-scale pretraining
on multimodal data, leading to models like CLIP, BLIP, and Flamingo. The rise
of generative models, including DALL·E and Stable Diffusion, further expanded
capabilities.
3.4 Emergence of MLLMs
The fusion of LLMs with vision encoders created powerful MLLMs such as
LLaVA, PaLM-E, Video-LLaMA, and GPT-4V. Key enabling techniques in-
clude:
• Multimodal instruction tuning (a schematic data record is sketched after this list)
• Multimodal in-context learning
• Multimodal chain-of-thought reasoning
• LLM-aided visual reasoning
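To make the first of these techniques concrete, the snippet below sketches what a single multimodal instruction-tuning record could look like, loosely following the conversation-style format popularized by LLaVA. The field names, image path, and dialogue content are illustrative assumptions rather than a prescribed schema.

# A schematic multimodal instruction-tuning record (illustrative field names).
# The <image> placeholder marks where the vision encoder's tokens are spliced
# into the language model's input during training.
sample = {
    "id": "scene_0042",                    # hypothetical example ID
    "image": "frames/front_cam_0042.jpg",  # hypothetical path to a camera frame
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nIs it safe to change into the left lane?",
        },
        {
            "from": "gpt",
            "value": (
                "No. A motorcycle is approaching quickly in the left lane, "
                "roughly 15 m behind the ego vehicle, so a lane change now "
                "would cut it off."
            ),
        },
    ],
}

if __name__ == "__main__":
    # During fine-tuning, many such records teach the MLLM to follow
    # driving-related instructions grounded in the visual input.
    print(sample["conversations"][0]["value"])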
4 MLLMs for Autonomous Driving
4.1 Perception
Traditional perception systems are limited to a predefined set of object cate-
gories. MLLMs overcome this by learning from raw, unstructured text descrip-
tions fused with multimodal sensor data. Examples include:
• Talk2BEV: Fuses Bird’s-Eye View (BEV) maps with language for complex reasoning (a minimal prompt sketch follows this list).
• DriveGPT4: Processes driving videos to generate textual descriptions
and driving responses.
• HiLM-D: Detects hazardous scenarios using high-resolution input.
• Generative World Models: Models like GAIA-1 (Wayve) and UniSim
(Google) simulate realistic driving scenarios for training and testing.
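As an illustration of the Talk2BEV-style idea above, the following sketch serializes a hand-written BEV object list into a text prompt that a language model could reason over. The object schema and the query_llm stub are assumptions made for illustration; they do not reproduce the API of any of the systems cited above.

from typing import List, Dict

def bev_objects_to_prompt(objects: List[Dict], question: str) -> str:
    """Serialize BEV detections into a textual scene description plus a question.

    Each object dict is assumed to carry a class label, an (x, y) position in
    ego-centric metres, and a speed in m/s; this schema is purely illustrative.
    """
    lines = ["You are a driving assistant. Bird's-eye-view scene:"]
    for i, obj in enumerate(objects):
        lines.append(
            f"  object {i}: {obj['label']} at x={obj['x']:+.1f} m, "
            f"y={obj['y']:+.1f} m, speed={obj['speed']:.1f} m/s"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for an MLLM call (local or hosted model).
    return "Placeholder answer."

if __name__ == "__main__":
    scene = [
        {"label": "pedestrian", "x": 4.2, "y": 1.0, "speed": 1.3},
        {"label": "car", "x": 18.5, "y": -3.5, "speed": 12.0},
    ]
    prompt = bev_objects_to_prompt(scene, "Should the ego vehicle yield before turning right?")
    print(prompt)
    print(query_llm(prompt))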
4.2 Planning and Control
MLLMs provide a natural language interface for human-in-the-loop planning
and control, enhancing transparency and adaptability.
• Drive as You Speak (DaYS): Uses GPT-4 to translate natural language
commands into executable driving plans.
• SurrealDriver: Leverages LLM agents to simulate realistic human driv-
ing behaviors.
• GPT-Driver: Reformulates motion planning as a language modeling task (a minimal sketch of this formulation appears at the end of this subsection).
• LanguageMPC: Adapts Model Predictive Control (MPC) parameters
based on LLM reasoning.
A critical advantage is explainability; MLLMs can generate text explanations
for their driving decisions, building crucial trust.
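To illustrate the planning-as-language-modeling formulation exemplified by GPT-Driver, the sketch below phrases a planning query as text and parses the model’s reply back into numeric waypoints. The prompt wording, the waypoint line format, and the ask_planner_llm stub are assumptions for illustration, not the published method.

import re
from typing import List, Tuple

def build_planning_prompt(scene_summary: str, horizon_s: float = 3.0) -> str:
    """Phrase the planning problem as a text completion task (illustrative)."""
    return (
        "You are a motion planner.\n"
        f"Scene: {scene_summary}\n"
        f"Output {int(horizon_s)} future waypoints for the ego vehicle as "
        "lines of the form 'WP t=<s> x=<m> y=<m>'."
    )

def parse_waypoints(reply: str) -> List[Tuple[float, float, float]]:
    """Extract (t, x, y) triples from the model's textual reply."""
    pattern = r"WP\s+t=([-\d.]+)\s+x=([-\d.]+)\s+y=([-\d.]+)"
    return [tuple(map(float, m)) for m in re.findall(pattern, reply)]

def ask_planner_llm(prompt: str) -> str:
    # Hypothetical stand-in for the LLM call; returns a canned reply here.
    return "WP t=1.0 x=5.2 y=0.1\nWP t=2.0 x=10.4 y=0.3\nWP t=3.0 x=15.8 y=0.6"

if __name__ == "__main__":
    prompt = build_planning_prompt("two-lane road, lead vehicle 20 m ahead at 8 m/s")
    trajectory = parse_waypoints(ask_planner_llm(prompt))
    print(trajectory)  # [(1.0, 5.2, 0.1), (2.0, 10.4, 0.3), (3.0, 15.8, 0.6)]

Turning structured text back into numbers is where such pipelines tend to be most fragile, which is one reason output validation and constrained decoding are active concerns for language-based planners.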
4.3 Industry Applications
Industry leaders are actively developing and deploying MLLMs:
• Wayve: LINGO-1 for explainable driving, GAIA-1 as a generative simu-
lator.
• Tencent: MAPLM, a large-scale dataset and model for map and traffic
scene understanding.
• Waymo: MotionLM, framing trajectory prediction as language modeling.
• Bosch and HKUST: Applying MLLMs for traffic risk prediction.
5 Datasets and Benchmarks
Foundation vision datasets like KITTI, nuScenes, and the Waymo Open Dataset
have been instrumental. The rise of MLLMs has spurred the creation of multimodal-
language driving datasets:
• Talk2Car: Natural language commands for object referral.
• nuScenes-QA: Question-Answer pairs based on the nuScenes dataset.
• DriveLM: Combines driving scenes with a language structure for reason-
ing.
• MAPLM (Tencent): A large-scale dataset (2M frames) with aligned HD
maps, LiDAR, camera images, and text descriptions.
A significant gap remains: existing datasets lack the scale, diversity, and linguistic complexity needed to support human-level scene comprehension.
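To make the structure of such multimodal-language driving data concrete, the sketch below defines a minimal record type that links sensor files to question-answer text. The field names are illustrative assumptions and do not reproduce any particular dataset’s schema.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DrivingQARecord:
    """One multimodal-language sample (illustrative schema, not a real dataset's)."""
    scene_id: str                     # identifier of the driving scene
    camera_paths: List[str]           # surround-view or front-camera image files
    lidar_path: Optional[str]         # LiDAR sweep, if available
    hd_map_path: Optional[str]        # HD-map tile or vectorized lanes, if available
    question: str                     # natural-language query about the scene
    answer: str                       # reference answer used for training/evaluation
    tags: List[str] = field(default_factory=list)  # e.g., ["intersection", "night"]

example = DrivingQARecord(
    scene_id="scene_0001",
    camera_paths=["cam_front/0001.jpg"],
    lidar_path="lidar/0001.bin",
    hd_map_path=None,
    question="How many lanes does the road ahead have?",
    answer="Three lanes, with the rightmost marked as a bus lane.",
    tags=["urban", "daytime"],
)

if __name__ == "__main__":
    print(example.question, "->", example.answer)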
6 LLVM-AD Workshop (WACV 2024)
The first workshop on Large Language and Vision Models for Autonomous Driv-
ing (LLVM-AD) was held at WACV 2024. Key contributions included:
• Research on human-like reasoning (Drive as You Speak, Drive Like a Hu-
man).
• Focus on human-centric autonomous systems and user interaction.
• Applications in UAV planning and object detection.
The workshop also launched two open-source datasets: MAPLM-QA (for
question-answering) and UCU (a driver command dataset).
7 Discussion and Future Directions
Several critical avenues for future work exist:
• New Datasets: Large-scale datasets encompassing multi-modal data
(panoramic images, LiDAR, HD maps) with rich, aligned textual annota-
tions are urgently needed.
• Hardware Challenges: The computational latency and power consumption of LLMs are prohibitive for real-time driving. Research into model compression, quantization, and efficient inference is crucial (a minimal quantization sketch follows this list).
• HD Maps Encoding: Developing effective methods to encode com-
plex HD map structures into a language that LLMs can understand (e.g.,
Tesla’s “language of lanes”, Baidu’s ERNIE-GeoL, Tencent’s THMA).
• User-Vehicle Interaction: Leveraging MLLMs to interpret multi-modal
human inputs (speech, gestures, gaze) to detect driver state (e.g., distrac-
tion) and adapt vehicle behavior accordingly.
• Personalized Driving: Adapting driving policies to individual user pref-
erences (e.g., aggressive vs. cautious styles) through natural language
interaction.
• Trust and Safety: Ensuring MLLMs can provide verifiable explanations for decisions (e.g., “why was overtaking safe?”), estimate uncertainty, and
reliably handle edge cases.
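As a small, concrete illustration of the quantization direction mentioned under Hardware Challenges, the sketch below applies PyTorch’s post-training dynamic quantization to a toy stand-in model. The toy architecture is an assumption; quantizing a full MLLM for in-vehicle deployment involves far more engineering than this, but the API call pattern is representative.

import torch
import torch.nn as nn

# Toy stand-in for a much larger language-model block (illustrative only).
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
model.eval()

# Post-training dynamic quantization: weights of Linear layers are stored in
# int8 and dequantized on the fly, reducing memory footprint and often latency
# on CPU. Static quantization, weight-only schemes, pruning, and distillation
# are complementary options depending on the target automotive hardware.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    out_fp32 = model(x)
    out_int8 = quantized(x)

print("max abs difference:", (out_fp32 - out_int8).abs().max().item())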
8 Conclusion
Multimodal Large Language Models represent a paradigm shift in autonomous
driving by merging the powerful reasoning capabilities of LLMs with rich, multi-
sensor data. They offer the potential to create systems that not only understand
complex traffic scenes and plan effectively but also interact with humans in a
natural and intuitive manner. While significant challenges in latency, safety
verification, and data scarcity remain, MLLMs are a rapidly evolving technology
with transformative potential for achieving robust SAE Level 4-5 autonomy.