
ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception

Ahmad M. Nazar, Abdulkadir Celik, Mohamed Y. Selim, Asmaa Abdallah, Daji Qiao, Ahmed M. Eltawil

Abstract—Large language models (LLMs) hold significant promise in advancing network management and orchestration in 6G and beyond networks. However, existing LLMs are limited in domain-specific knowledge and their ability to handle multi-modal sensory data, which is critical for real-time situational awareness in dynamic wireless environments. This paper addresses this gap by introducing ENWAR¹, an ENvironment-aWARe retrieval-augmented generation-empowered multi-modal LLM framework. ENWAR seamlessly integrates multi-modal sensory inputs to perceive, interpret, and cognitively process complex wireless environments to provide human-interpretable situational awareness. ENWAR is evaluated on the GPS, LiDAR, and camera modality combinations of the DeepSense6G dataset with state-of-the-art LLMs such as Mistral-7b/8x7b and Llama3.1-8/70/405b. Compared to the general and often superficial environmental descriptions of these vanilla LLMs, ENWAR delivers richer spatial analysis, accurately identifies positions, analyzes obstacles, and assesses line-of-sight between vehicles. Results show that ENWAR achieves key performance indicators of up to 70% relevancy, 55% context recall, 80% correctness, and 86% faithfulness, demonstrating its efficacy in multi-modal perception and interpretation.

¹Enwar is a common name in Turkic and Arabic cultures, meaning more enlightened, insightful, and intellectual; herein referring to a multi-modal LLM providing deep situational and contextual insights into the environment.

INTRODUCTION

GENERATIVE artificial intelligence (AI), with its unparalleled capability to generate, synthesize, and adapt data, is poised to play a pivotal role in the evolution of 6G and beyond networks [1]. Among various generative models, LLMs have proven to be the most revolutionary, fundamentally changing how machines understand and produce human language. Built on sophisticated generative transformers and driven by attention mechanisms, LLMs leverage vast pre-training on diverse datasets to excel in tasks such as natural language processing, decision support, and beyond [2]. LLMs' adaptability and scalability are particularly well-suited for complex systems operating in dynamic environments, making them valuable assets for advancing AI-native wireless systems and enhancing the cognitive capabilities of next-generation networks. Hence, they have the potential to revolutionize decision-making, resource management, and intelligent optimization of 6G networks, eventually paving the way for zero-touch network and service management (ZSM).

However, the technical demands of next-generation networks differ greatly from legacy generations. Future networks are expected to operate with massive antenna arrays at significantly higher frequencies, wherein wireless channels become less probabilistic and more deterministic and exhibit geometric propagation characteristics. This shift introduces daunting mobility challenges such as tracking narrow beams, blockage mitigation through timely handover management, and seamless service migration. Sensing functionalities and environmental awareness are essential for ZSM to effectively navigate this new terrain.

In this context, multi-modal integrated sensing and communication (ISAC) represents the coherent fusion of disparate data streams from various sensors (e.g., LiDARs, radars, cameras, GPS, etc.), unlocking critical capabilities such as environment mapping, object/human detection and classification, urban planning, localization, and tracking. These sensing functionalities collectively lay the foundations of digital twins (DTs): dynamic and near real-time virtual replicas of 6G networks, providing contextual and site-specific insights into the spatio-temporal characteristics of the wireless environment [3]. DTs are crucial in optimizing network performance, enabling real-time decision-making, and enhancing overall situational awareness, making them an integral component of the future telecom ecosystem.

Nonetheless, LLMs predominantly operate in a text-based modality, which restricts their ability to interact with multi-modal sensory data—a critical need in situation-aware networks where real-world comprehension goes beyond textual inputs. Moreover, LLMs' possession of general knowledge over a massive training data corpus often falls short on domain-specific tasks and contexts. This limitation arises from their fundamental reliance on probabilistic pattern recognition rather than true understanding or reasoning [4]. To overcome these deficiencies, retrieval-augmented generation (RAG) frameworks have emerged as a potential solution to enhance the generative process by integrating external knowledge retrieval, allowing LLMs to tap into domain-specific databases or real-time knowledge sources. This augmentation provides more accurate, contextually relevant responses, bridging the gap between generic LLMs and specialized 6G network needs. While RAG addresses the challenge of domain expertise, it does not entirely resolve the critical need for multi-modal ISAC to realize situation-aware 6G networks.

The integration of LLMs into wireless networks was initially explored in [5], [6], which primarily focus on textual data and overlook RAG capabilities, limiting their application to telecom chatbots. WirelessLLM demonstrates how domain-specific knowledge can be incorporated into LLMs to enhance performance in tasks such as spectrum sensing and protocol understanding [7]. Similarly, Xu et al. propose a framework for edge LLMs that are divided into perception, grounding, and alignment modules to optimize 6G-related tasks [8]. While the potential of multi-modal large language models (MLLMs) has been discussed in [9], [10], these vision papers lack comprehensive case studies and proof-of-concept implementations to substantiate MLLMs' effectiveness in real-world multi-modal environments.
Fig. 1: ENWAR workflows: multi-modal RAG formation (Steps A-C); and prompt interpretation, knowledge retrieval, and response generation (Steps 1-5).

We address this gap in the wireless literature by introducing ENWAR, an ENvironment-aWARe RAG-empowered MLLM framework that leverages multi-modal sensory data to perceive, interpret, and cognitively process complex wireless environments. ENWAR's human-interpretable situational awareness is crucial for both sensing and communication applications, where real-time environmental perception can significantly enhance system performance and reliability.

In the following sections, we first outline the workflow of ENWAR and introduce key performance indicators (KPIs), namely answer relevancy, context recall, correctness score, and faithfulness. ENWAR's performance is evaluated across Mistral-7b/8x7b and Llama3.1-8/70/405b models on GPS, LiDAR, and camera modalities of vehicle-to-vehicle scenarios in the DeepSense6G dataset [11]. While vanilla LLMs provide general and often superficial environment descriptions, ENWAR delivers contextually rich analysis of spatial dynamics by accurately identifying the positions and distances of entities (vehicles, cyclists, pedestrians, etc.), analyzing potential obstacles, and assessing line-of-sight between communicating vehicles. Numerical results compare various modality combinations across different LLM versions and demonstrate that ENWAR achieves up to 70% relevancy, 55% context recall, 80% correctness, and 86% faithfulness. The paper concludes with an exploration of future research directions and the potential of ENWAR to advance multi-modal perception and cognition in wireless systems.

AN OVERVIEW OF ENWAR FRAMEWORK

As illustrated in Fig. 1, ENWAR comprises two primary workflow pipelines: 1) multi-modal RAG formation (Steps A-C) and 2) prompt interpretation, knowledge retrieval, and response generation (Steps 1-5), which are described in the following sections along with KPIs.

Multi-Modal RAG Formation

Ⓐ Dataset Preprocessing and Modality Transformation: ENWAR is designed to seamlessly accommodate diverse sensor modalities by preprocessing and transforming them into a unified textual format that can be effectively processed by LLMs. For instance, GPS data undergoes transformation from raw spatial coordinates into textual descriptions that provide insights such as relative distances, directional bearings, and movement patterns, offering a richer contextual understanding of spatial relationships (a minimal sketch of such a transformation follows this step's description).

Visual data are processed through an image-to-text conversion model that extracts key visual elements and translates them into LLM-interpretable natural language descriptions. The use of instructional prompts ensures that the generated textual outputs are contextually relevant and sufficiently detailed to accurately represent the visual information.

Point cloud data from LiDARs, another complex modality, is processed by feature extraction models (e.g., ResNet) to identify salient environmental elements. Object detection and classification systems are then employed to recognize key entities (e.g., pedestrians, vehicles), which are subsequently converted into textual descriptions.

The final step in the preprocessing pipeline involves synthesizing the transformed data from all modalities into a unified textual representation. By consolidating various sensory data into a textual format (e.g., JSON), ENWAR ensures that LLMs can cohesively process and interpret multi-modal inputs, enhancing the model's ability to generate contextually aware and reliable outputs. This synthesis is pivotal for enabling the framework to make informed decisions in complex environments.
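To make Step Ⓐ concrete, the following is a minimal sketch of the kind of GPS-to-text transformation described above, using the haversine formula for relative distance and an initial-bearing calculation. The function name and description template are illustrative assumptions, not ENWAR's actual implementation.

```python
import math

def gps_to_text(rx: tuple, tx: tuple) -> str:
    """Describe the transmitter's position relative to the receiver from
    two (latitude, longitude) fixes, as in Step A's GPS transformation."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*rx, *tx))
    dlat, dlon = lat2 - lat1, lon2 - lon1

    # Haversine great-circle distance in meters (Earth radius ~6371 km).
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    dist_m = 2 * 6_371_000 * math.asin(math.sqrt(a))

    # Initial bearing from receiver to transmitter, normalized to [0, 360).
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    compass = "N NE E SE S SW W NW".split()[int(((bearing + 22.5) % 360) // 45)]

    return (f"The transmitter vehicle is about {dist_m:.0f} m away, "
            f"bearing {bearing:.0f} degrees ({compass}) from the receiver.")

# Example fixes; prints a sentence ready to drop into the unified text format.
print(gps_to_text((33.4255, -111.9400), (33.4260, -111.9380)))
```

Chaining such sentences over successive fixes would yield the movement-pattern descriptions mentioned above.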
Ⓑ Text Chunking and Embedding: ENWAR's next critical step is to segment the sensory data into manageable chunks and convert these chunks into numerical embeddings. In this way, LLMs can efficiently process and interpret textual information, especially when handling large datasets from diverse sensor modalities.

Chunking involves breaking down the preprocessed text into smaller, contextually coherent units. This is essential as LLMs have token limits, meaning that excessively large text inputs cannot be processed effectively. Segmentation ensures that the model can focus on relevant parts of the data without losing contextual integrity. For instance, GPS data may be chunked based on time intervals or location changes, while visual and point cloud descriptions could be divided based on objects detected or spatial regions.

Once the data is chunked, it is passed to a General Text Embeddings (GTE) model to convert each chunk into a dense vectorized format—a numerical representation of the text that captures its semantic content. These embeddings serve as a structured and machine-readable format that encodes the underlying meaning of the text. In other words, vectorization enables LLMs to tokenize and process the data, establishing relationships between different chunks based on their semantic similarity.
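A minimal sketch of this chunk-and-embed step is shown below, assuming fixed-size character chunks with overlap and a GTE-family encoder loaded through the sentence-transformers library; the helper name and file name are illustrative. The concrete values mirror the setup described later in the paper (1,024-character chunks, a 100-character overlap, and gte-large-en-v1.5).

```python
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 1024, overlap: int = 100) -> list:
    """Split preprocessed sensor text into overlapping character chunks so
    each piece fits the encoder while context carries across boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# GTE-family encoder; the setup section uses Alibaba-NLP's gte-large-en-v1.5.
encoder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

scene_text = open("scene_description.txt").read()  # unified textual scene data
chunks = chunk_text(scene_text)
embeddings = encoder.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (num_chunks, embedding_dim), one dense vector per chunk
```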
Ⓒ Domain-Specific Knowledge Base Generation: ENWAR's ability to deliver precise and context-aware responses is largely dependent on its robust domain-specific knowledge base. By constructing a knowledge base comprising embeddings generated from a variety of sensor modalities, ENWAR ensures that the system is equipped with contextually rich and diverse information about the environment. To ensure optimal performance, the knowledge base is indexed in a way that allows the RAG framework to retrieve relevant data efficiently, as explained in the following sections. This structured knowledge base enables real-time decision-making and ensures that ENWAR remains adaptable and responsive to a wide range of scenarios, enhancing its performance in dynamic and complex wireless environments.

Prompt Interpretation and Response Generation

① Prompt Preprocessing and Modality Transformation: This step closely mirrors the procedures in Step Ⓐ: the user prompt is preprocessed by transforming its components and any real-time multi-modal sensory data into a unified, standardized format suitable for LLMs. This ensures the prompt is properly aligned with the knowledge base, allowing for seamless interaction with the model's retrieval mechanisms.

② Prompt Text Embedding: Similarly, this step follows the procedures in Step Ⓑ: the preprocessed prompt is converted into numerical embeddings, ensuring that it can be efficiently processed and compared to the vectorized data in the knowledge base. This transformation facilitates accurate retrieval of relevant information, streamlining the prompt's interaction with the model's generative components.

③ Semantic Search and Knowledge Retrieval: Once the user prompt has been transformed into embeddings, ENWAR performs semantic search to retrieve the most relevant information from its domain-specific knowledge base. This process identifies entries that closely match the prompt by calculating the semantic similarity between the prompt and the embedded data in the knowledge base. As detailed next, the top-ranked results, which are contextually aligned with the prompt, are then selected for further processing.

④ Result Ranking: ENWAR ensures relevance by ranking results according to their section headers, prioritizing the most contextually appropriate portions of the knowledge base. This refined search mechanism optimizes retrieval by focusing on the most pertinent content. Since some contexts may have similar vectorized embeddings, ENWAR concentrates on the top-p percentile to effectively filter out less relevant data, with p = 95 used throughout the system to anchor the retrieval process in the highest-ranking results. This approach enhances both the precision and relevance of the retrieved information, resulting in more accurate and contextually appropriate responses.
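The following sketch illustrates Steps ③ and ④ under the assumption of L2-normalized embeddings: knowledge-base chunks are ranked by cosine similarity to the prompt, and only those at or above the p-th percentile of scores are kept, with p = 95 as stated above. The section-header prioritization is omitted, and all names are illustrative.

```python
import numpy as np

def retrieve(prompt_emb: np.ndarray, kb_embs: np.ndarray, chunks: list,
             p: float = 95.0) -> list:
    """Rank knowledge-base chunks by cosine similarity to the prompt and keep
    only those at or above the p-th percentile of scores (p = 95 in ENWAR)."""
    scores = kb_embs @ prompt_emb                  # cosine similarity for unit vectors
    cutoff = np.percentile(scores, p)              # anchor in the top-p percentile
    keep = [i for i in np.argsort(-scores) if scores[i] >= cutoff]
    return [(chunks[i], float(scores[i])) for i in keep]

# Toy example: random unit vectors stand in for real chunk embeddings.
rng = np.random.default_rng(0)
kb = rng.normal(size=(200, 16))
kb /= np.linalg.norm(kb, axis=1, keepdims=True)
query = kb[3] + 0.05 * rng.normal(size=16)
query /= np.linalg.norm(query)
print(retrieve(query, kb, [f"chunk-{i}" for i in range(200)])[:3])
```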
⑤ Response Generation: Once the top-ranked results from the semantic search are identified, they provide the essential context for the LLM to generate a coherent and contextually appropriate response. These results serve as the foundation upon which the LLM builds its output, ensuring that the generated response is both accurate and relevant to the user's prompt.

The LLM processes the vectorized embedding of the user prompt along with the retrieved context from the knowledge base. It integrates information from multiple sources, such as GPS coordinates, LiDAR data, and visual descriptions, to construct a detailed representation of the environment. This may involve detecting vehicles and their locations, and describing physical aspects of the surroundings in relation to the prompt. To further enhance the generation process, ENWAR employs top-p sampling to strike a balance between accuracy and diversity in responses, effectively filtering out irrelevant outputs while maintaining contextual richness.

Beyond simple description, the LLM can infer interactions between various elements in the environment. For instance, using GPS data and environmental context, the model might predict how vehicles interact or how the physical surroundings could influence those interactions. Furthermore, the LLM ensures that the response aligns with the specific instructions or tasks provided by the user. In the context of ENWAR, this means delivering comprehensive environmental awareness by describing key entities, their locations, and how they may interact within the given environment. By synthesizing all relevant data and ensuring it is grounded in the retrieved context, the LLM generates a detailed and actionable response, providing insights necessary to make informed decisions, particularly in dynamic and complex scenarios.
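A hedged sketch of this grounded-generation step follows: retrieved chunks are prepended to the user prompt, and the response is sampled with top-p (nucleus) sampling. The model checkpoint and prompt template are illustrative stand-ins from the model families the paper evaluates; ENWAR's exact instructional prompt is shown in Fig. 2.

```python
from transformers import pipeline

# Illustrative checkpoint; ENWAR evaluates the Mistral and Llama3.1 families.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def generate_response(user_prompt: str, retrieved_chunks: list) -> str:
    """Ground the LLM in retrieved context and sample with top-p (Step 5)."""
    context = "\n".join(retrieved_chunks)
    rag_prompt = (
        "Using only the context below, describe the wireless environment, "
        "key entities, and their likely interactions.\n"
        f"Context:\n{context}\n\nQuestion: {user_prompt}\nAnswer:"
    )
    out = generator(rag_prompt, max_new_tokens=256, do_sample=True,
                    top_p=0.95, return_full_text=False)
    return out[0]["generated_text"]
```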
Key Performance Indicators

Evaluating the performance of ENWAR requires assessing its output based on both general benchmarks and domain-specific metrics. Standard benchmarks such as the General Language Understanding Evaluation (GLUE) and Massive Multitask Language Understanding (MMLU) offer a broad assessment of an LLM's capabilities across various metrics such as answer relevancy, factual correctness, and hallucination avoidance. However, RAG-based systems require a more targeted evaluation due to their reliance on domain-specific contexts, and in our case, specifically tailored multi-modal data. Following the RAGAS framework [https://docs.ragas.io], we ensure a comprehensive and accurate evaluation of ENWAR's performance through the following KPIs:

• Answer Relevancy (AR) measures how well ENWAR's responses align with the user's prompt and context. Denoting $\vec{E}_{p_i}$ and $\vec{E}_{t_i}$ as the embedding vectors of the $i$th generated prompt and the relevant ground truth of the data sample, respectively, the cosine/semantic similarity, $-1 \leq \cos(\vec{E}_{p_i}, \vec{E}_{t_i}) \leq 1$, measures how semantically similar two texts are based on their vector representations, i.e., 1: perfectly similar (aligned), 0: no similarity, -1: completely dissimilar (opposite). Accordingly, AR is evaluated by calculating the average cosine similarity as follows:

$$\mathrm{AR} = \frac{1}{N}\sum_{i=1}^{N} \cos\!\left(\vec{E}_{p_i}, \vec{E}_{t_i}\right) = \frac{1}{N}\sum_{i=1}^{N} \frac{\vec{E}_{p_i} \cdot \vec{E}_{t_i}}{\|\vec{E}_{p_i}\|_2 \, \|\vec{E}_{t_i}\|_2}. \tag{1}$$

• Context Recall assesses the degree to which ENWAR's retrieved context aligns with the ground truth. It evaluates whether the system correctly recalls information from the knowledge base that is relevant to the prompt and verifies how much of the response can be attributed to the correct context. It is calculated by normalizing the alignment extent of the retrieved contexts within the ground truth by the number of sentences in the ground truth.

• Correctness Score measures the factual correctness and semantic similarity of the generated responses. Denoting the embedding vector of the $i$th generated answer by $\vec{E}_{a_i}$ and the F1 score as a metric of factual correctness, the overall correctness score is given by

$$\mathrm{Correctness} = \omega \cos\!\left(\vec{E}_{a_i}, \vec{E}_{t_i}\right) + (1 - \omega) F_1, \tag{2}$$

where the weighting parameter $0 \leq \omega \leq 1$ (RAGAS sets $\omega = 0.25$ by default) ensures that response assessments are both factually accurate and contextually appropriate.

• Faithfulness evaluates the consistency of the generated answers with the retrieved context. A response is considered faithful if all claims align with the retrieved data, ensuring the output does not contain unsupported or fabricated information. Faithfulness checks whether the claims made in the output can be logically deduced from the given context and is given by

$$\mathrm{Faithfulness} = \frac{|N_{G_c}|}{|N_C|}, \tag{3}$$

where $N_{G_c}$ is the number of claims in the generated answer that can be inferred from the given context and $N_C$ is the total number of claims in the generated answer.
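For reference, Eqs. (1)-(3) reduce to a few lines of code. In the sketch below, embeddings are assumed to be precomputed, and the F1 factual-correctness score and the claim counts are treated as inputs; in practice RAGAS derives them with an LLM-based judge.

```python
import numpy as np

def cos_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity in [-1, 1] between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer_relevancy(prompt_embs, truth_embs) -> float:
    """Eq. (1): average cosine similarity over N samples."""
    return float(np.mean([cos_sim(p, t) for p, t in zip(prompt_embs, truth_embs)]))

def correctness(answer_emb, truth_emb, f1: float, omega: float = 0.25) -> float:
    """Eq. (2): blend of semantic similarity and factual F1 (RAGAS default omega)."""
    return omega * cos_sim(answer_emb, truth_emb) + (1.0 - omega) * f1

def faithfulness(supported_claims: int, total_claims: int) -> float:
    """Eq. (3): fraction of the answer's claims inferable from the context."""
    return supported_claims / total_claims
```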
ENWAR SETUP BREAKDOWN AND A CASE STUDY

This section provides a detailed breakdown of the ENWAR setup and offers a qualitative performance comparison with vanilla LLMs, highlighting the advantages of integrating multi-modal data and knowledge retrieval within ENWAR.

A Breakdown of ENWAR Setup

• DeepSense6G Dataset: To evaluate ENWAR's performance, we utilize a large-scale, real-world multi-modal sensing and communication dataset [11]. The DeepSense6G dataset offers a robust platform for testing ENWAR's ability to interpret complex spatial and environmental data. For our evaluation, we focus on Scenario 36, which includes the GPS coordinates of a vehicle equipped with four signal receivers, its captured 360-degree LiDAR point clouds and front-back camera frames, and the GPS coordinates of another vehicle equipped with a transmitter. We meticulously identified a total of 180 scenes/samples, 30 of which are used for testing, that capture urban environments with varying numbers of pedestrians, cyclists, and vehicles; an exemplary scene is shown in Fig. 2.

• Modality Transformation and Information Extraction: For all the selected scenes, ENWAR extracts latitude and longitude coordinates from GPS inputs to determine the positions and relative bearings of the two vehicles, which are then converted into textual format for seamless integration into ENWAR's prompt. For front-rear images, ENWAR performs image-to-text transformation to generate a textual description of the visual content by using InstructBLIP trained on Vicuna-7b and optimized for visual-tuned instructions [12], as sketched below.

For LiDAR point clouds, ENWAR leverages the super fast accurate 3D (SFA3D) model for object detection and analysis [https://github.com/maudzung/Super-Fast-Accurate-3D-Object-Detection]. SFA3D was modified to extract object information, including locations and bearings relative to the sensor, and to convert this data into text describing the environment. SFA3D utilizes a ResNet-based keypoint feature pyramid network (KFPN) for reliable LiDAR object detection, transforming 3D point clouds into bird's-eye-view images, which are then processed to identify objects with high confidence, providing detailed information such as positions, dimensions, and orientations.

The extracted information from each modality is hard-coded into a template [c.f., white text highlighted with black background in Fig. 2] to be utilized during the prompting and grounding process. Since object types and potential blockages are not readily available as labels within the DeepSense6G dataset, we manually create ground truth text for all 180 scenes by correcting the extracted information if necessary and/or adding missing details.
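As a hedged illustration of the image-to-text step above, the snippet below queries the public InstructBLIP Vicuna-7b checkpoint through Hugging Face transformers. The instruction wording and file name are placeholders rather than ENWAR's exact prompt, and in practice the model would run on a GPU in reduced precision.

```python
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"  # public InstructBLIP checkpoint
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("front_camera_frame.jpg")  # placeholder camera frame
instruction = ("Describe the road scene, listing vehicles, pedestrians, and "
               "cyclists with their approximate positions relative to the camera.")
inputs = processor(images=image, text=instruction, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```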
• Instructional Text Prompt: The prompt includes the extracted information of each scene and specifies predefined tasks, guiding ENWAR to accurately analyze the wireless environment, detect potential blockages, and generate relevant insights based on the processed multi-modal inputs [c.f., yellow box in Fig. 2].

• Chunking and Embedding: ENWAR utilizes the gte-large-en-v1.5 embedding model from Alibaba-NLP to vectorize ground truth samples for knowledge base creation [13]. This model supports a tokenized context length of up to 8,192 tokens. To maintain continuity between data segments, the transformed textual data is divided into chunks of 1,024 characters, with a 100-character overlap to ensure context is preserved across boundaries.
Fig. 2: Illustration of the case study scene with raw data, extracted information, generated prompt, and responses from Vanilla Llama and ENWAR.

TABLE I: KPI comparison for the scene in Fig. 2.

KPIs             Relevancy    Correctness    Faithfulness
Vanilla Llama    70.3%        54.3%          42.2%
ENWAR            81.2%        76.9%          68.6%

• Knowledge Base Creation and Performance Evaluation: ENWAR utilizes the Facebook AI Similarity Search (FAISS) library [https://ai.meta.com/tools/faiss/] to create 7 distinct knowledge bases—one for each modality combination—and performs efficient search/retrieval through the top-95% ranking explained above. ENWAR's performance was thoroughly evaluated by running the RAGAS framework across LLMs such as Mistral-7b/8x7b [https://mistral.ai] and Llama3.1-8/70/405b [https://www.llama.com], with model sizes ranging from 7 billion to 405 billion parameters. For comparison, baseline versions of these LLMs were used to benchmark the performance of vanilla LLMs against ENWAR, particularly in generating detailed and accurate responses.
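A minimal FAISS sketch of this knowledge-base step is given below, using an inner-product index over L2-normalized embeddings, which makes search scores equal to cosine similarity. The flat index type and helper names are our illustrative choices, as the exact FAISS configuration is not specified.

```python
import faiss
import numpy as np

def build_kb(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Build one index per modality combination; inner product over
    L2-normalized vectors is equivalent to cosine similarity."""
    embs = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def search_kb(index: faiss.IndexFlatIP, chunks: list,
              query_emb: np.ndarray, k: int = 5) -> list:
    """Return the k most similar chunks with their cosine scores."""
    q = np.ascontiguousarray(query_emb.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```

IndexFlatIP performs exact search, which is adequate at this corpus size; larger deployments could swap in an IVF or HNSW index to trade exactness for speed.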
A Comparative Analysis of an Inference Case Study

In this section, we evaluate the perception capabilities of Vanilla Llama and ENWAR in processing, analyzing, and interpreting spatial relationships between objects, as well as inferring potential obstacles between two units. As shown in Fig. 2, both models were tasked with generating a detailed description of a busy city street scene. To provide more descriptive insights, we selected a specific scenario featuring congested traffic, including cars, motorcycles, pedestrians, and other stationary objects along the road.

While Vanilla Llama provides a general description of the scene and acknowledges the presence of various entities, its response remains superficial. It offers only basic information about the distances and directions of objects, largely reiterating the extracted data without analyzing the relative positions of vehicles or obstacles and failing to infer how these elements might impact movement or communication between the units.

By contrast, as highlighted and underlined in Fig. 2, ENWAR delivers a detailed, contextually rich breakdown of the spatial dynamics from Unit 1's perspective. It accurately identifies the positions and distances of nearby vehicles, pedestrians, and cyclists, providing a clear analysis of potential obstacles and suggesting maneuvering strategies for Unit 1 in the congested environment. Crucially, it assesses line-of-sight communication between the units and identifies any obstructions, essential for effective coordination. ENWAR's ability to infer network interactions, evaluate line-of-sight, and propose navigation strategies highlights its advanced environment sensing capabilities, setting it apart from Vanilla Llama's more generic output.

At this stage, it is crucial to compare the corresponding KPIs for the scene depicted in Fig. 2. As shown in Table I, wherein context recall is omitted as it is a RAG-specific metric, ENWAR consistently outperforms Vanilla Llama in delivering more contextually aligned, accurate, and faithful responses. This underscores its superior ability to interpret complex environments and provide reliable insights, largely due to ENWAR's seamless integration of multi-modal data and its robust RAG framework. ENWAR's single-modality inference times are 100 ms, 100 ms, and 2.5 s for GPS, LiDAR, and camera inputs, respectively. Although image-to-text translation dominates the processing time, it is important to recognize that scene elements such as traffic conditions, weather, landscape, and area classification (urban, suburban, rural)—as illustrated in Fig. 2—typically exhibit limited variability within the order of seconds. That is, visual data updates may not require constant reprocessing at the same frequency as other modalities. Unlike the narrative-driven information extracted from camera inputs, LiDAR and GPS provide quantitative measures within the order of milliseconds, allowing ENWAR to utilize windowing for efficient tracking and prediction of environmental changes. These inference times can be further reduced through well-crafted hierarchical LLM architectures, as outlined in the final section.
Fig. 3: KPI comparison of LLMs across modality combinations: the first and second rows present absolute KPI [%] and KPI normalized per billion parameters of each LLM [%pb], respectively.

KPI EVALUATION OF STATE-OF-THE-ART LLMS ON MODALITY COMBINATIONS

This section evaluates the performance of various state-of-the-art LLMs across different modality combinations, which is presented in Fig. 3 and discussed in the following subsections.

Modality Combination Comparison

For single-modality evaluations, the general trend shows GPS < LiDAR < CAM in terms of performance across all KPIs. GPS alone provides limited contextual information, resulting in the lowest scores, while CAM proves to be the most effective single modality, offering richer visual context that significantly enhances answer relevancy, correctness, and faithfulness.

When dual modalities are combined, the trends observed in the single-modality evaluations continue. Specifically, when camera or GPS is paired with LiDAR, CAM+LiDAR consistently outperforms GPS+LiDAR across all KPIs. This reflects the stronger impact of visual data on the models' ability to generate contextually rich and accurate responses. As expected, the integration of all three modalities yields the highest performance across every KPI. The fusion of spatial, depth, and visual information allows the models to deliver the most comprehensive and accurate responses, further emphasizing the value of multi-modal data integration for advanced situational awareness.

LLM Type and Size Comparison

Across all modality permutations, an increase in parameter size correlates with improved absolute KPI values. Larger models consistently outperform their smaller counterparts across all metrics, reflecting the advantage of parameter scaling in LLM performance. However, it is also notable that the rate of KPI improvement slows and begins to saturate as the parameter space grows. This indicates diminishing returns at higher parameter counts, with the largest models showing only marginal gains compared to their slightly smaller counterparts. In terms of model comparisons, the performance differences between Mistral 7b and Llama 8b are minimal, indicating that these two models are comparable in terms of their effectiveness across KPIs and modality combinations. The second row of Fig. 3 reveals a noticeable observation: despite the larger models providing better overall absolute KPIs, the efficiency (i.e., performance per billion parameters) of adding more parameters decreases significantly, potentially indicating overfitting and interesting research directions covered in the next section. Another promising way of reducing inference latency might be training baby LMs to operate directly on the sensory data at the edge to form a local RAG, eliminating the need for the intermediary steps of modality transformation and information extraction.
CONCLUSION AND FUTURE DIRECTIONS

As a RAG-empowered multi-modal LLM framework, ENWAR can address some of the key challenges in next-generation networks by enabling situation-aware network management through multi-modal perception. By preprocessing and integrating various sensory data, ENWAR enhances its ability to interpret complex wireless environments and deliver contextually rich, human-interpretable insights. In spite of promising preliminary results, there is still room for improvement through several architectural enhancements depending on the target applications, which are discussed below.

• Hierarchical and Federated LLM Architectures: For mission-critical and time-sensitive tasks, inference time and model efficiency can be significantly improved by adopting a federated LLM architecture that integrates smaller, edge-based "baby LMs" with full-scale LLMs in the cloud. Baby LMs are designed for near-real-time operation at the edge, reducing reliance on cloud infrastructure. By employing model pruning and quantization techniques, these models remain lightweight and efficient, focusing on immediate, critical tasks. More complex computations are offloaded to cloud-based LLMs, providing both speed at the edge and scalability in the cloud. Further latency reduction can be achieved by training baby LMs to operate directly on the sensory data at the edge to form a local RAG, which bypasses the intermediary steps of modality transformation and information extraction.

• Serverless LLM Architectures: Serverless architectures allow cloud-based LLMs to dynamically scale resources based on demand, making them ideal for non-time-sensitive tasks such as data aggregation, post-event analysis, or batch processing. These event-driven systems automatically allocate resources only when required, eliminating idle costs and improving cost-efficiency. Although serverless architectures may introduce minor latency due to cold starts, they are well-suited for applications where real-time processing is not essential. Tasks requiring periodic, large-scale computations can be efficiently managed in the cloud without continuous resource allocation.

• Cooperative and Adaptive RAG Formation: Given the importance of RAG in the ENWAR framework, a distributed and adaptive RAG approach could maintain a global knowledge base, aggregating local knowledge bases across the hierarchical LLM structure. This collaborative knowledge system enables baby LMs to efficiently retrieve relevant, up-to-date information without needing to store large amounts of data locally. Adaptive learning techniques further enhance ENWAR by continuously refining its understanding of dynamic environments, ensuring effective processing of multi-modal inputs. By dynamically updating the global knowledge base and leveraging adaptive learning, ENWAR can balance performance improvements with model efficiency, mitigating the risk of overfitting. This adaptive RAG approach optimizes both knowledge retrieval and system adaptability, allowing local models to respond to real-time data while benefiting from the shared global context. It also mitigates the diminishing returns of scaling large models, optimizes resource usage via serverless computing, and ensures continuous learning for improved performance in complex environments.

REFERENCES

[1] A. Celik and A. M. Eltawil, "At the dawn of generative AI era: A tutorial-cum-survey on new frontiers in 6G wireless intelligence," IEEE Open J. Commun. Soc., vol. 5, pp. 2433–2489, 2024.
[2] A. Vaswani et al., "Attention is all you need," in Adv. Neural Inf. Process. Syst., 2017.
[3] A. Alkhateeb, S. Jiang, and G. Charan, "Real-time digital twins: Vision and research directions for 6G and beyond," IEEE Commun. Mag., vol. 61, no. 11, pp. 128–134, 2023.
[4] E. M. Bender et al., "On the dangers of stochastic parrots: Can language models be too big?" in Proc. ACM FAccT, 2021, pp. 610–623.
[5] A. Maatouk et al., "TeleQnA: A benchmark dataset to assess large language models telecommunications knowledge," arXiv preprint arXiv:2310.15051, 2023.
[6] S. Tarkoma, R. Morabito, and J. Sauvola, "AI-native interconnect framework for integration of large language model technologies in 6G systems," arXiv preprint arXiv:2311.05842, 2023.
[7] J. Shao et al., "WirelessLLM: Empowering large language models towards wireless intelligence," arXiv preprint arXiv:2405.17053, 2024.
[8] M. Xu et al., "When large language model agents meet 6G networks: Perception, grounding, and alignment," 2024.
[9] L. Bariah et al., "Large generative AI models for telecom: The next big thing?" IEEE Commun. Mag., pp. 1–7, 2024.
[10] S. Xu et al., "Large multi-modal models (LMMs) as universal foundation models for AI-native wireless systems," IEEE Network, vol. 38, no. 5, pp. 10–20, 2024.
[11] J. Morais et al., "DeepSense-V2V: A vehicle-to-vehicle multi-modal sensing, localization, and communications dataset," 2023.
[12] W. Dai et al., "InstructBLIP: Towards general-purpose vision-language models with instruction tuning," 2023.
[13] Z. Li et al., "Towards general text embeddings with multi-stage contrastive learning," arXiv preprint arXiv:2308.03281, 2023.

Ahmad M. Nazar is currently pursuing a Ph.D. degree in computer engineering at Iowa State University (ISU), Ames, IA, USA.

Abdulkadir Celik received a Ph.D. in co-majors of electrical engineering and computer engineering from Iowa State University, Ames, IA, USA, in 2016. He is currently a senior research scientist at KAUST.

Mohamed Y. Selim received a Ph.D. in computer engineering from Iowa State University, Ames, IA, USA, in 2016, where he is currently an associate teaching professor.

Asmaa Abdallah received a Ph.D. in electrical engineering from the American University of Beirut, Beirut, Lebanon, in 2020. She is currently a research scientist at KAUST.

Daji Qiao received a Ph.D. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA. He is currently a full professor at Iowa State University.

Ahmed M. Eltawil received a Ph.D. degree in electrical engineering from the University of California, Los Angeles, CA, USA, in 2003. He is currently a full professor at KAUST.
