Review 2 Report
Project
21BCE2460 TUSHAR
21BCE2456 AMRIT RAJ PARAMHANS
KALYANARAMAN P
Professor Grade 1
School of Computer Science and Engineering (SCOPE)
B.Tech.
February 2025
ABSTRACT
Language models are well equipped to understand contextual and sequential information,
enabling autonomous systems to approximate human-like reasoning in intricate and dynamic
situations. They also play a significant part in human-vehicle interaction by providing natural
language descriptions of driving actions, building passenger confidence, and enabling V2X
(vehicle-to-everything) communication.
We review the advancements of language models in prominent areas such as traffic behavior
prediction, scene understanding, and intent recognition, considering the challenges posed by
computational requirements, latency, and robustness in safety-critical use cases. Ethical
factors such as avoiding bias and promoting transparency are also addressed. By bridging the
perception, reasoning, and communication gaps, language models can potentially play a huge
role in making autonomous driving systems more reliable, comprehensible, and user-friendly,
thereby paving the way for more secure and intelligent mobility solutions.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Background
Autonomous driving has developed very quickly in the last decade, taking advantage of
advances in artificial intelligence (AI), sensor technology, and computational hardware.
Conventional autonomous driving systems are based on convolutional neural networks
(CNNs) and reinforcement learning (RL) to sense their surroundings and make decisions in
real-time. These systems use modular architectures in which perception, prediction, and
planning are managed by independent components, which tends to create information
bottlenecks and restrict cross-modal comprehension.
Computer vision methods analyze visual data from cameras to identify objects, lane lines,
and traffic lights. LiDAR sensors capture 3D spatial data, whereas radar systems measure the
velocity and distance of nearby objects. Although effective, these traditional methods have
several critical shortcomings:
Lack of Contextual Reasoning: Classical AI systems process inputs in isolation, without
understanding the wider situational context, such as how an unexpected lane change might
impact downstream traffic flow.
Recent advancements in large language models (LLMs), such as GPT, BERT, and their
multimodal successors, have opened new frontiers for autonomous driving. LLMs have
shown outstanding reasoning capabilities across many domains, making them candidates for
addressing major gaps in present autonomous systems.
Specifically, Vision-Language Models (VLMs) combine visual and textual inputs, allowing
for deeper scene understanding and more natural-looking decision-making processes. By
processing multimodal information — like camera views, LiDAR point clouds, and natural
language traffic reports — VLMs can:
Interpret intricate driving scenarios using natural language descriptions.
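To make the intended processing concrete, the following is a minimal sketch of late fusion between a camera feature vector and an embedded natural-language traffic report. The encoder dimensions, the action set, and the module names are illustrative assumptions, not the final project architecture.

```python
# Minimal late-fusion sketch: camera features and an embedded traffic report are
# projected into a shared space and fused to score a small set of manoeuvres.
# Dimensions and the action set are illustrative assumptions only.
import torch
import torch.nn as nn

class SimpleVLMFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_actions=5):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project camera features
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project text embedding
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),            # e.g. keep lane, slow down, ...
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

# Placeholder features standing in for upstream image and text encoders.
logits = SimpleVLMFusion()(torch.randn(1, 512), torch.randn(1, 512))
```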
1.2 Motivation
The motivation for this project lies in moving beyond the limits of existing autonomous
driving systems by harnessing the capabilities of Vision-Language Models. With autonomous
cars moving toward full deployment, a few crucial issues need to be solved:
Cross-Modal Fusion for Robust Perception: VLMs integrate heterogeneous sensor inputs,
eliminating silos between vision (camera), depth (LiDAR), and language information.
Holistic perception reduces the likelihood of information loss and improves scene
understanding.
Scalable and Adaptive Systems: Scaling autonomous vehicles across geographies needs
flexible models. VLMs, with diverse training data, show improved generalization to unseen
driving environments.
Through the creation of an end-to-end VLM-driven autonomous driving system, this project
aims to change the way self-driving cars sense, plan, and engage with the world. The
integrated architecture will not only improve driving efficiency and safety but also open up
possibilities for more natural human-AI collaboration on the road.
The project revolves around applying language models to autonomous vehicles for the purposes described below.
Recent work indicates the potential of large language models (LLMs) to overcome the
aforementioned limitations. LLMs such as GPT and BERT have demonstrated strong
reasoning ability across diverse areas of application, from robotics and automation to
autonomous driving. In autonomous driving, LLMs can interpret multimodal sensor data,
predict traffic conditions, and offer natural language responses for improved human-vehicle
interaction.
The literature has explored using LLMs in self-driving vehicles for a range of such tasks, as surveyed in the following section.
2. LITERATURE REVIEW: Vision-Language Models for Autonomous Driving
Key Research:
1. Xu et al. (2023). "DriveGPT4: Interpretable End-to-end Autonomous Driving via Large
Language Model."
- Integrates LLMs with vision encoders for interpretable driving decisions
- Demonstrates improved performance in complex urban scenarios
5. Hu et al. (2023). "GAIA-1: A Generative World Model for Autonomous Driving."
- Presents a world model that combines generative and predictive capabilities
- Demonstrates superior performance in long-horizon planning scenarios
Key Research:
1. Li et al. (2022). "BEVFormer: Learning Bird's-Eye-View Representation from Multi-
Camera Images via Spatiotemporal Transformers."
- Transforms multi-view images to bird's-eye-view (BEV) representations
- Shows significant improvements in 3D object detection and map segmentation
2. Liu et al. (2022). "PETR: Position Embedding Transformation for Multi-View 3D Object
Detection."
- Introduces position-aware vision transformers for 3D perception
- Achieves competitive results with reduced computational complexity
3. Li et al. (2022). "DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object
Detection."
- Proposes a transformer-based fusion architecture for LiDAR and camera data
- Demonstrates superior performance in adverse weather conditions
5. Yang et al. (2023). "SigLIP: Simple Gated Linear Projections Improve Vision-Language
Models."
- Demonstrates how gated linear projections improve vision-language models
- Shows particular benefits for fine-grained perception tasks relevant to driving
Key Research:
1. Kim et al. (2023). "Language Models Enable Simple Systems for Generating Interactive
3D Scenes."
- Demonstrates how LLMs can understand and generate 3D scene descriptions
- Provides foundations for language-guided scene understanding in driving
2. Xu et al. (2023). "LLM-Driver: Imitation Learning of Rule-Guided Driving with Language
Models."
- Uses LLMs to interpret traffic rules and generate driving policies
- Shows improved compliance with traffic regulations in simulated environments
5. Malla et al. (2023). "DRAMA: Joint Risk Localization and Captioning in Driving."
- Combines risk detection with natural language explanations
- Validates on real-world driving datasets with annotated risk scenarios
Key Research:
1. Chen et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence
Modeling."
- Frames RL as a sequence modeling problem
- Shows improved performance in long-horizon driving tasks
2. Janner et al. (2022). "Planning with Diffusion for Flexible Behavior Synthesis."
- Uses diffusion models for planning in autonomous driving
- Demonstrates robust performance in complex multi-agent scenarios
3. Zhang et al. (2023). "UniAD++: Unified Action Space for Autonomous Driving."
- Proposes a universal action representation for RL in driving
- Shows improved transfer learning across different driving environments
Key Research:
1. Sun et al. (2022). "SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain
Adaptation."
- Provides a large-scale synthetic dataset with diverse weather and lighting conditions
- Demonstrates improved model robustness when trained on this data
2. Amini et al. (2023). "VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing
and Policy Learning for Autonomous Vehicles."
- Presents an open-source simulator for generating multimodal synthetic driving data
- Shows the importance of diverse synthetic data for robust model training
- Introduces a unified simulator for multiple sensor modalities
- Shows the value of synthetic sensor data in improving perception robustness
5. Prakash et al. (2023). "FUSION: Generating Realistic Traffic Scenarios for Autonomous
Driving Evaluation."
- Proposes a framework for generating diverse and realistic traffic scenarios
- Demonstrates effectiveness in identifying edge cases for autonomous systems
Key Research:
1. Gupta et al. (2023). "ViP-LLaVA: Making Large Multimodal Models Understand
Vehicles."
- Specializes a vision-language model for vehicle understanding
- Demonstrates improved reasoning about vehicle behavior and attributes
2. Liu et al. (2023). "LAVILA: LAnguage VIsion LAnguage Model for Vision and Language
Understanding."
- Proposes a cyclic architecture for vision-language understanding
- Shows strong performance on driving scene understanding tasks
4. Chen et al. (2023). "MultiPath++: Efficient Information Fusion and Trajectory Prediction
for Autonomous Driving."
- Presents an efficient architecture for fusing multiple sensor modalities
- Achieves state-of-the-art prediction accuracy on the Waymo Open Dataset
- Fuses visual, LiDAR, and textual inputs for end-to-end planning
- Shows improved navigation in complex urban environments
Key Research:
1. Caesar et al. (2020). "nuScenes: A multimodal dataset for autonomous driving."
- Introduces a large-scale dataset with rich annotations for context understanding
- Provides benchmarks for holistic scene understanding tasks
2. Seff et al. (2023). "MT-DETR: Detecting Multiple Object Types with One Detector."
- Proposes a unified architecture for detecting traffic elements, road users, and
infrastructure
- Demonstrates improved scene understanding with reduced computational resources
5. Wang et al. (2023). "BeMapNet: Building Extraction from LiDAR and Dialog-guided
Scene Understanding."
- Combines LiDAR data with language guidance for improved building extraction
- Shows applications in improving context awareness for autonomous driving
Key Research:
1. Li et al. (2023). "TinyVLM: A Smaller and Faster Vision-Language Model for
Autonomous Driving."
- Introduces a compact vision-language model optimized for edge devices
- Achieves 10x speedup with minimal performance degradation
3. Guo et al. (2024). "EdgeVLM: Efficient Vision-Language Models for Edge Devices."
- Presents a hardware-aware architecture for edge-deployed VLMs
- Shows real-time performance on NVIDIA Jetson platforms
Key Research:
1. Corso et al. (2023). "Neural Simplex Architecture for Safe Reinforcement Learning-Based
Autonomous Driving."
- Introduces a safety architecture that monitors and corrects unsafe actions
- Demonstrates improved safety without sacrificing performance
2. Sun et al. (2023). "Minding the Gap: Safety Validation of Vision-Language Models for
Autonomous Driving."
- Proposes a framework for identifying and mitigating hallucinations in VLMs
- Shows improved reliability in safety-critical scenarios
Key Research:
1. Mani et al. (2023). "Natural Language Interfaces for Autonomous Driving."
- Explores how natural language can facilitate human-vehicle interaction
- Demonstrates improved user trust and acceptance
2. Liu et al. (2023). "ExplainDrive: Natural Language Explanations for Autonomous Driving
Decisions."
- Presents a system that generates natural language explanations for vehicle behaviors
- Shows improved user understanding and trust in autonomous systems
3. Hayes et al. (2023). "Human-AI Shared Control for Autonomous Driving with Vision-
Language Models."
- Explores how VLMs can enable more intuitive shared control interfaces
- Demonstrates improved takeover performance in critical scenarios
Executive Summary
This comprehensive project proposal outlines the development of a Vision-Language Model
(VLM) based autonomous driving system. We identify critical research gaps in current
approaches, establish clear objectives, provide a detailed problem statement, and present a
comprehensive project plan. The document also includes functional and non-functional
requirements, a multi-faceted feasibility analysis, and complete system specifications.
1 Theoretical Gaps
1. Limited Cross-Modal Understanding: Current systems struggle to establish deep semantic
connections between visual road scenarios and natural language instructions.
2. Fragmented Architecture Design: Most existing systems use separate modules for
perception, prediction, and planning, creating information bottlenecks at module boundaries.
3. Insufficient Context Retention: Current models lack mechanisms to maintain long-term
context across complex driving scenarios.
4. Regulatory Compliance Hurdles: Lack of standardized evaluation metrics for VLM-based
driving systems that align with emerging regulations.
2. Create a unified architecture that seamlessly integrates perception, prediction, and planning
through multimodal fusion.
3. Implement TensorRT optimization that reduces model size by 70% with less than 2%
performance degradation (see the export sketch after this list).
4. Create an explainable AI framework that provides natural language justification for driving
decisions.
5. Develop a robust object detection and tracking system with 98%+ accuracy for safety-
critical objects.
3. Create new benchmarks and evaluation metrics specifically designed for VLM-based
driving systems.
4. Explore effective knowledge distillation methods for transferring capabilities from large
VLMs to edge-deployable models.
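Objective 3 and the knowledge distillation item above both concern compressing the model for deployment. The following is a hedged sketch of one plausible path (PyTorch to ONNX, then an offline TensorRT engine build); the file names and the toy network are placeholders rather than the project's actual model, and the quoted 70% size reduction would ultimately depend on INT8 calibration and validation.

```python
# Hedged sketch of the planned TensorRT optimization path: export a decision
# head to ONNX, then build a reduced-precision engine offline. File names and
# the example network are placeholders, not the project's final model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 5)).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "decision_head.onnx",
                  input_names=["features"], output_names=["action_logits"],
                  opset_version=17)

# Offline engine build (run on the target device, e.g. a Jetson):
#   trtexec --onnx=decision_head.onnx --saveEngine=decision_head_fp16.engine --fp16
# FP16 roughly halves weight storage; approaching the targeted ~70% reduction
# would likely require INT8 calibration (--int8) plus accuracy validation.
```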
Core Challenges
Current autonomous driving systems face fundamental limitations in their ability to
understand and interpret complex driving environments. Traditional modular architectures
create information bottlenecks, while recent end-to-end approaches struggle with
explainability and reliability. Vision-Language Models offer promising capabilities for
holistic scene understanding, but existing implementations face significant challenges in real-
time performance, edge deployment, and safety verification.
1. Information Integration Problem: How to effectively combine visual perception, semantic
understanding, and spatial reasoning for robust driving decisions.
4. Safety Verification Problem: How to provide formal guarantees about system behavior in
safety-critical scenarios.
Phase 3: Decision Module and Training Pipeline (Months 8-11)
1. Reinforcement learning environment setup
2. Policy network architecture development
3. Reward function design and validation
4. Training pipeline implementation
5. Initial end-to-end system integration
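As a concrete illustration of steps 1-4, the sketch below shows one possible reward-shaping scheme and how a standard PPO trainer could be invoked. The weights, observation fields, and the `make_carla_env()` wrapper are assumptions for illustration, not decided design values.

```python
# Illustrative reward shaping for the driving policy; weights and observation
# fields are assumptions, not tuned project values.
def driving_reward(obs: dict) -> float:
    """Combine progress, comfort, and safety terms into a scalar reward."""
    r_progress = 0.1 * obs["speed_along_route"]        # encourage forward progress
    r_comfort = -0.5 * abs(obs["longitudinal_accel"])   # penalize harsh accel/braking
    r_lane = -1.0 * abs(obs["lane_center_offset_m"])    # penalize lane deviation
    r_collision = -100.0 if obs["collision"] else 0.0   # dominant safety penalty
    return r_progress + r_comfort + r_lane + r_collision

# Training-loop sketch with stable-baselines3 PPO, assuming make_carla_env()
# wraps the CARLA simulator as a Gymnasium environment (hypothetical helper).
# from stable_baselines3 import PPO
# env = make_carla_env()
# policy = PPO("MlpPolicy", env, verbose=1)
# policy.learn(total_timesteps=1_000_000)
```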
Work Breakdown Structure (WBS)
3. REQUIREMENT ANALYSIS
3.1 Requirements
3.1.1 Functional Requirements
3.1.1.1 Perception Requirements
1. The system shall detect and classify all vehicles within a 100-meter radius with 99%
accuracy.
2. The system shall identify and categorize all traffic signs and signals with 98%
accuracy under standard lighting conditions.
3. The system shall detect pedestrians and cyclists with 99.5% accuracy within a 50-
meter radius.
4. The system shall accurately map lane markings and road boundaries with 95%
precision, even in cases of partial occlusion.
5. The system shall classify road surface conditions (dry, wet, icy) with 90% accuracy.
6. The system shall detect and track dynamic objects through occlusions for up to 3
seconds (see the tracking sketch after this list).
7. The system shall identify construction zones and temporary traffic patterns with 90%
accuracy.
8. The system shall recognize and respond to emergency vehicles with 99% accuracy.
9. The system shall maintain perception capabilities in adverse weather conditions (rain,
fog, snow) with a minimum detection accuracy of 85%.
10. The system shall recognize hand signals from traffic officers and construction workers
with 90% accuracy.
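Requirement 6 above implies a track-management policy. The sketch below shows one minimal interpretation: a track is coasted on a constant-velocity model while unmatched and dropped after 3 seconds of occlusion. The data structures are illustrative, not the project's tracker.

```python
# Minimal sketch for requirement 6: keep a track alive through occlusion for up
# to 3 s, coasting on its last velocity, then drop it. Structures are illustrative.
from dataclasses import dataclass
from typing import Optional

OCCLUSION_TIMEOUT_S = 3.0

@dataclass
class Track:
    x: float          # last estimated position, metres
    y: float
    vx: float         # last estimated velocity, m/s
    vy: float
    time_since_update: float = 0.0

def step_track(track: Track, dt: float, matched: bool) -> Optional[Track]:
    """Advance one frame; return None once the occlusion exceeds 3 s."""
    if matched:
        track.time_since_update = 0.0
        return track
    track.time_since_update += dt
    if track.time_since_update > OCCLUSION_TIMEOUT_S:
        return None                      # occlusion too long: drop the track
    track.x += track.vx * dt             # constant-velocity coasting
    track.y += track.vy * dt
    return track
```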
3.1.1.2 Planning and Decision-Making Requirements
1. The system shall generate safe trajectories that comply with all traffic regulations.
2. The system shall maintain safe following distances, adjusted for speed and road
conditions (see the headway sketch after this list).
3. The system shall execute lane changes safely with proper signaling when appropriate.
4. The system shall navigate intersections according to traffic signals and right-of-way
rules.
5. The system shall identify and respond appropriately to emergency vehicles,
prioritizing their right-of-way.
6. The system shall plan paths that minimize unnecessary lane changes and ensure
passenger comfort.
7. The system shall navigate construction zones and road closures by following
temporary traffic patterns.
8. The system shall recognize and adapt to local driving customs and regulations.
9. The system shall generate alternate routes when encountering unexpected road
closures or traffic congestion.
10. The system shall make appropriate speed adjustments based on road conditions,
visibility, and traffic flow.
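Requirement 2 above leaves the adjustment rule open; a common choice is a time-headway policy, scaled up for degraded road conditions. The headway values below are illustrative assumptions, not calibrated project parameters.

```python
# Illustrative time-headway rule for requirement 2: following distance grows
# with speed and with worsening road conditions. Headway values are assumed.
HEADWAY_S = {"dry": 2.0, "wet": 3.0, "icy": 5.0}   # seconds of headway per condition
MIN_GAP_M = 5.0                                     # standstill gap, metres

def safe_following_distance(speed_mps: float, road_condition: str) -> float:
    """Return the minimum gap (m) to the lead vehicle."""
    headway = HEADWAY_S.get(road_condition, HEADWAY_S["wet"])
    return MIN_GAP_M + headway * speed_mps

# Example: 20 m/s (72 km/h) on a wet road -> 5 + 3 * 20 = 65 m.
print(safe_following_distance(20.0, "wet"))
```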
3.1.1.3 Control and Execution Requirements
1. The system shall execute smooth acceleration and deceleration within comfortable g-
force limits (0.3g max under normal conditions; see the limiter sketch after this list).
2. The system shall maintain lane centering with a deviation of less than 10 cm under
normal conditions.
3. The system shall execute precision maneuvers for parking operations with a
positioning error of less than 5 cm.
4. The system shall maintain velocity within 2 km/h of the target speed under normal
conditions.
5. The system shall execute emergency braking with maximum available deceleration
when required for collision avoidance.
6. The system shall provide smooth transitions between different driving modes without
jerky movements.
7. The system shall maintain stability control during adverse road conditions.
8. The system shall execute precise turning maneuvers with appropriate speed
adjustments.
9. The system shall manage regenerative braking and propulsion systems for optimal
energy efficiency.
10. The system shall execute vehicle-to-infrastructure (V2I) commands with high
reliability.
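Requirement 1 above fixes the comfort bound at 0.3 g; the sketch below clamps the commanded acceleration to that bound and rate-limits it to avoid the jerky transitions ruled out by requirement 6. The jerk limit is an illustrative assumption.

```python
# Sketch for requirements 1 and 6: clamp acceleration to +/-0.3 g and limit its
# rate of change so mode transitions stay smooth. The jerk limit is assumed.
G = 9.81                  # m/s^2
A_MAX = 0.3 * G           # comfort bound from requirement 1
JERK_MAX = 2.0            # m/s^3, assumed

def limit_acceleration(a_cmd: float, a_prev: float, dt: float) -> float:
    """Return a comfort- and jerk-limited acceleration command."""
    a = max(-A_MAX, min(A_MAX, a_cmd))                         # clamp magnitude
    max_step = JERK_MAX * dt
    return max(a_prev - max_step, min(a_prev + max_step, a))   # limit rate of change
```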
3.1.1.4 Safety and Fallback Requirements
1. The system shall detect internal failures and execute appropriate fallback strategies.
2. The system shall maintain a minimum risk condition when operational capabilities are
compromised.
3. The system shall execute a safe handover to the human driver with a minimum 10-
second warning time.
4. The system shall continuously monitor driver attention during Level 3 operation.
5. The system shall maintain redundant sensing capabilities for critical perception tasks.
6. The system shall log all critical events and decisions for post-event analysis.
7. The system shall validate command integrity before execution.
8. The system shall implement limp-home mode capabilities for hardware degradation
scenarios.
9. The system shall provide clear operational status indicators to the driver.
10. The system shall implement geofencing to prevent operation outside validated
operational domains.
3.1.1.5 User Interface Requirements
1. The system shall accept natural language commands for destination setting and
preference adjustment.
2. The system shall provide natural language explanations for driving decisions when
requested (see the sketch after this list).
3. The system shall display operational status through a simple, intuitive interface.
4. The system shall provide clear indicators of perception confidence and limitations.
5. The system shall support multimodal user interactions (voice, touch, gesture).
6. The system shall provide customizable notification preferences for system events.
7. The system shall support user profiles with personalized driving preferences.
8. The system shall provide accessible interfaces for users with disabilities.
9. The system shall offer multilingual support for interface and voice commands.
10. The system shall provide real-time visualization of system perception for driver
monitoring.
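For requirement 2 above, the full system would generate explanations with the language model; the sketch below shows a simple templated fallback that turns the current decision state into a sentence. The field names are assumptions for illustration.

```python
# Templated fallback for on-request explanations (UI requirement 2). In the full
# system the VLM would generate this text; field names here are illustrative.
def explain_decision(decision: dict) -> str:
    reasons = ", ".join(decision.get("reasons", ["no specific trigger"]))
    return (f"I chose to {decision['action']} because {reasons}. "
            f"The current speed target is {decision['target_speed_kmh']} km/h.")

print(explain_decision({
    "action": "slow down",
    "reasons": ["a pedestrian is near the crossing", "the signal ahead is amber"],
    "target_speed_kmh": 25,
}))
```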
6. The system shall maintain a mean time between failures (MTBF) of at least 10,000
hours.
7. The system shall implement data integrity checks for all sensor inputs and internal
communications.
Safety and Security Requirements
1. The system shall comply with ISO 26262 ASIL D requirements for functional safety.
2. The system shall implement secure boot and runtime attestation.
3. All external communications shall use TLS 1.3 or equivalent encryption.
4. The system shall implement access controls for maintenance and update functions.
5. The system shall detect and mitigate sensor spoofing attacks.
6. The system shall implement a defense-in-depth security architecture.
7. The system shall log all security events for audit purposes.
Maintainability Requirements
1. The system architecture shall be modular to allow component-level updates.
2. The system shall support over-the-air (OTA) updates for software components.
3. The system shall maintain comprehensive logging for diagnostic purposes.
4. The system shall implement versioning for all software components and models.
5. The system shall provide API documentation for all interfaces.
6. The system shall support remote diagnostics capabilities.
7. The system codebase shall maintain unit test coverage of at least 85%.
Scalability Requirements
1. The system architecture shall support additional sensor types without requiring a core
redesign.
2. The system shall scale to support various vehicle platforms with minimal adaptation.
3. The perception pipeline shall support configurable resolution scaling based on
available resources.
4. The system shall support both centralized and distributed computing architectures.
5. The system shall implement dynamic resource allocation based on operational
demands.
6. The training pipeline shall support distributed training across multiple nodes.
7. The system shall support geographic expansion with minimal reconfiguration.
Usability Requirements
1. The system shall provide an interface with a System Usability Scale (SUS) score of at
least 80.
2. The system shall respond to user commands within 500 ms.
3. The system shall require minimal training for basic operation (less than 30 minutes).
4. The system shall provide clear error messages understandable by non-technical users.
5. The system interface shall be usable while wearing gloves.
6. The system displays shall be readable in direct sunlight.
7. The system shall follow established automotive UX guidelines for consistency.
• USB: 2× USB 3.0, 1× USB-C
• HDMI/DisplayPort: 1× HDMI 2.0 or DisplayPort 1.4
• Audio: 3.5mm headphone/microphone jack
Power Supply
• Minimum: 300W
• Recommended: 500W (for systems with dedicated GPUs)
Additional Hardware
• Webcam: 720p or 1080p
• Microphone: Built-in or external
• Keyboard & Mouse: Standard or ergonomic
Virtualization & Containerization
• Virtual Machines: VMware, VirtualBox
• Containers: Docker, Kubernetes
Security Software
• Antivirus: Windows Defender, Norton, McAfee
• Firewall: Built-in OS firewall or third-party solutions
Collaboration & Project Management
• Communication: Slack, Microsoft Teams, Zoom
• Project Management: Trello, Asana, Jira
Cloud Services
• Storage: Google Drive, Dropbox, OneDrive
• Compute: AWS, Google Cloud Platform, Microsoft Azure
Other Utilities
• File Compression: WinRAR, 7-Zip
• Media Players: VLC Media Player
• PDF Reader: Adobe Acrobat Reader, Foxit Reader
3.2 Feasibility Study
3. Communications Infrastructure:
- CAN bus interface supports required message rates for vehicle control.
- Ethernet backbone provides sufficient bandwidth for sensor data transmission.
- V2X communication modules meet latency requirements for urban environments.
- RL algorithms (PPO, DQN) have shown effectiveness in similar control scenarios.
2. Integration Analysis:
- ROS2 middleware provides sufficient real-time capabilities for component integration (see the node sketch after this subsection).
- TensorRT optimization pathway is well-established for similar model architectures.
- Cross-modal fusion techniques have been validated in recent research literature.
3. Development Environment:
- CI/CD pipeline supports required development workflow.
- Simulation environments (CARLA, AirSim) provide suitable testing platforms.
- Automated testing infrastructure is available for continuous validation.
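To illustrate the ROS2 integration path noted above, the sketch below shows a minimal rclpy node that subscribes to a front-camera topic and republishes a scene caption. The topic names and the `caption_image()` helper are hypothetical placeholders standing in for the VLM inference step.

```python
# Hedged ROS 2 integration sketch: subscribe to the front camera and publish a
# scene caption. Topic names and caption_image() are illustrative placeholders.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String

def caption_image(msg: Image) -> str:
    # Placeholder for the VLM inference step; the real system would run the model here.
    return "two vehicles ahead, pedestrian waiting at the crossing"

class SceneCaptionNode(Node):
    def __init__(self):
        super().__init__("scene_caption_node")
        self.create_subscription(Image, "/camera/front/image_raw", self.on_image, 10)
        self.pub = self.create_publisher(String, "/vlm/scene_caption", 10)

    def on_image(self, msg: Image):
        out = String()
        out.data = caption_image(msg)
        self.pub.publish(out)

def main():
    rclpy.init()
    rclpy.spin(SceneCaptionNode())

if __name__ == "__main__":
    main()
```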
Development Costs
Cost-Benefit Analysis: Initial costs will involve the purchase of sensors, cameras, and
development resources for the machine learning model and integration. However, the long-
term savings from reduced accident rates, insurance claims, and potential legal liabilities will
offset these initial expenses.
• Budget:
- Hardware Costs: 30,000 INR (sensors, cameras, and processing devices)
- Software Development: 20,000 INR (algorithm development, system integration)
- Maintenance: 5,000 INR annually (software updates and sensor recalibration)
• Return on Investment (ROI): The reduction in accident-related costs, along with potential
partnerships with automotive companies, can lead to profitability within 1-2 years.
Monetization opportunities include licensing the technology to car manufacturers and
providing a subscription model for system updates.
• Funding: Potential sources of funding include partnerships with automotive companies,
government transportation safety programs, and grants for innovations aimed at reducing
traffic accidents.
3.2.5 Legal Feasibility
2. Compliance Requirements:
- Type approval processes differ by jurisdiction
- Data recording and sharing requirements
- Cybersecurity certification requirements
- Safety validation frameworks
- Insurance and liability considerations
2. Regulatory Evolution:
- Risk: Changing regulations may require system modifications
- Mitigation: Modular design allowing for regulatory compliance updates
- Mitigation: Active participation in regulatory development forums
4. SYSTEM DESIGN
4.2 System Architecture Diagram
4.3 Activity Diagram
References:
2. "Drive Like a Human: Rethinking Autonomous Driving with Large Language Models"
Authors: Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao
Link: https://arxiv.org/abs/2307.07162
4. "DriveLLM: Charting The Path Toward Full Autonomous Driving with Large Language Models"
Link: https://www.researchgate.net/publication/375014265_DriveLLM_Charting_The_Path_Toward_Full_Autonomous_Driving_with_Large_Language_Models
6. "Engineering Safety Requirements for Autonomous Driving with Large Language Models"
Link: https://community.openai.com/t/research-on-llms-performing-system-development/705503
7. "Drive As You Speak: Enabling Human-Like Interaction With Large Language Models in Autonomous Driving"
Authors: Can Cui, Yunsheng Ma, Xu Cao, et al.
Link: https://openaccess.thecvf.com/content/WACV2024W/LLVM-AD/papers/Cui_Drive_As_You_Speak_Enabling_Human-Like_Interaction_With_Large_Language_WACVW_2024_paper.pdf
14. "Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles"
Authors: Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang
Link: https://arxiv.org/abs/2310.08034
16. "Hybrid Reasoning Based on Large Language Models for Autonomous Car Driving"
Authors: Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, Achim Rettberg
Link: https://arxiv.org/abs/2402.13602
17. "How Large Language Models (LLMs) Are Coming for Self-Driving Cars"
Author: Jamie Shotton
Link: https://www.autonomousvehicleinternational.com/features/feature-how-large-language-models-llms-are-coming-for-self-driving-cars.html