ENWAR

Abstract—Large language models (LLMs) hold significant promise in advancing network management and orchestration in 6G and beyond networks. However, existing LLMs are limited in domain-specific knowledge and their ability to handle multi-modal sensory data, which is critical for real-time situational awareness in dynamic wireless environments. This paper addresses this gap by introducing ENWAR, an ENvironment-aWARe retrieval augmented generation-empowered multi-modal […]

[…] blockage mitigation through timely handover management, and seamless service migration. Sensing functionalities and environmental awareness are essential for ZSM to effectively navigate this new terrain.

In this context, multi-modal integrated sensing and communication (ISAC) represents a coherent fusion of disparate […]
world multi-modal environments. We address this gap in the wireless literature by introducing ENWAR, an ENvironment-aWARe RAG-empowered MLLM framework that leverages multi-modal sensory data to perceive, interpret, and cognitively process complex wireless environments. ENWAR's human-interpretable situational awareness is crucial for both sensing and communication applications, where real-time environmental perception can significantly enhance system performance and reliability.

In the following sections, we first outline the workflow of ENWAR and introduce key performance indicators (KPIs), namely answer relevancy, context recall, correctness score, and faithfulness. ENWAR's performance is evaluated across Mistral-7b/8x7b and Llama3.1-8/70/405b models on GPS, LiDAR, and camera modalities of vehicle-to-vehicle scenarios in the DeepSense6G dataset [11]. While vanilla LLMs provide general and often superficial environment descriptions, ENWAR delivers contextually rich analyses of spatial dynamics by accurately identifying the positions and distances of entities (vehicles, cyclists, pedestrians, etc.), analyzing potential obstacles, and assessing line-of-sight between communicating vehicles. Numerical results compare various modality combinations across different LLM versions and demonstrate that ENWAR achieves up to 70% relevancy, 55% context recall, 80% correctness, and 86% faithfulness. The paper concludes with an exploration of future research directions and the potential of ENWAR to advance multi-modal perception and cognition in wireless systems.

AN OVERVIEW OF ENWAR FRAMEWORK

As illustrated in Fig. 1, ENWAR comprises two primary workflow pipelines: 1) multi-modal RAG formation (Steps A-C) and 2) prompt interpretation, knowledge retrieval, and response generation (Steps 1-5), both of which are described in the following sections along with the KPIs.

Multi-Modal RAG Formation

Ⓐ Dataset Preprocessing and Modality Transformation: ENWAR is designed to seamlessly accommodate diverse sensor modalities by preprocessing and transforming them into a unified textual format that can be effectively processed by LLMs. For instance, GPS data undergoes transformation from raw spatial coordinates into textual descriptions that provide insights such as relative distances, directional bearings, and movement patterns, offering a richer contextual understanding of spatial relationships.

Visual data are processed through an image-to-text conversion model that extracts key visual elements and translates them into LLM-interpretable natural language descriptions. The use of instructional prompts ensures that the generated textual outputs are contextually relevant and sufficiently detailed to accurately represent the visual information.

Point cloud data from LiDARs, another complex modality, is processed by feature extraction models (e.g., ResNet) to identify salient environmental elements. Object detection and classification systems are then employed to recognize key entities (e.g., pedestrians, vehicles), which are subsequently converted into textual descriptions.

The final step in the preprocessing pipeline involves synthesizing the transformed data from all modalities into a unified textual representation. By consolidating the various sensory data into a common textual format (e.g., JSON), ENWAR ensures that LLMs can cohesively process and interpret multi-modal inputs, enhancing the model's ability to generate contextually aware and reliable outputs. This synthesis is pivotal for enabling the framework to make informed decisions in complex environments.
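To make Step Ⓐ concrete, the following minimal Python sketch shows one way a pair of GPS fixes could be turned into a textual spatial description and merged with the other modality descriptions into a single JSON scene record. The helper names, field names, and coordinates are illustrative assumptions for this sketch, not ENWAR's actual implementation.

```python
import json
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS fixes."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing (degrees clockwise from north) from point 1 to point 2."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0

def gps_to_text(rx, tx):
    """Turn two (lat, lon) fixes into a natural-language spatial description."""
    dist = haversine_m(*rx, *tx)
    brg = bearing_deg(*rx, *tx)
    return (f"The transmitter vehicle is approximately {dist:.1f} m from the "
            f"receiver vehicle, at a bearing of {brg:.0f} degrees from north.")

# Hypothetical scene record consolidating all modalities into one JSON document.
scene = {
    "scene_id": 42,
    "gps_description": gps_to_text((33.4202, -111.9281), (33.4210, -111.9275)),
    "camera_description": "<image-to-text output for front/rear frames>",
    "lidar_description": "<object list extracted from the point cloud>",
}
print(json.dumps(scene, indent=2))
```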
Ⓑ Text Chunking and Embedding: ENWAR's next critical step is to segment the sensory data into manageable chunks and convert these chunks into numerical embeddings.
In this way, LLMs can efficiently process and interpret textual information, especially when handling large datasets from diverse sensor modalities.

Chunking involves breaking down the preprocessed text into smaller, contextually coherent units. This is essential as LLMs have token limits, meaning that excessively large text inputs cannot be processed effectively. Segmentation ensures that the model can focus on relevant parts of the data without losing contextual integrity. For instance, GPS data may be chunked based on time intervals or location changes, while visual and point cloud descriptions could be divided based on objects detected or spatial regions.

Once the data is chunked, it is passed to a General Text Embeddings (GTE) model to convert each chunk into a dense vectorized format: a numerical representation of the text that captures its semantic content. These embeddings serve as a structured and machine-readable format that encodes the underlying meaning of the text. In other words, vectorization enables LLMs to tokenize and process the data, establishing relationships between different chunks based on their semantic similarity.
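A minimal sketch of this chunk-and-embed step is given below, assuming a character-based splitter with overlap and a sentence-transformers-style embedding model. The model identifier and chunk sizes here are placeholders; the concrete choices used in our setup are reported in the case-study section.

```python
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows to preserve context."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Placeholder model id; any GTE-style text-embedding model with a
# sentence-transformers interface could be used here.
embedder = SentenceTransformer("thenlper/gte-large")

# Unified textual scene description (illustrative stand-in for a real sample).
scene_text = "The transmitter vehicle is roughly 25 m north-east of the receiver. " * 40
chunks = chunk_text(scene_text)
embeddings = embedder.encode(chunks, normalize_embeddings=True)  # (num_chunks, dim)
```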
Ⓒ Domain-Specific Knowledge Base Generation: ENWAR's ability to deliver precise and context-aware responses is largely dependent on its robust domain-specific knowledge base. By constructing a knowledge base comprising embeddings generated from a variety of sensor modalities, ENWAR ensures that the system is equipped with contextually rich and diverse information about the environment. To ensure optimal performance, the knowledge base is indexed in a way that allows the RAG framework to retrieve relevant data efficiently, as explained in the following sections. This structured knowledge base enables real-time decision-making and ensures that ENWAR remains adaptable and responsive to a wide range of scenarios, enhancing its performance in dynamic and complex wireless environments.

Prompt Interpretation and Response Generation

① Prompt Preprocessing and Modality Transformation: This step closely mirrors the procedures in Step Ⓐ: the user prompt is preprocessed by transforming its components and any real-time multi-modal sensory data into a unified, standardized format suitable for LLMs. This ensures the prompt is properly aligned with the knowledge base, allowing for seamless interaction with the model's retrieval mechanisms.

② Prompt Text Embedding: Similarly, this step follows the procedures in Step Ⓑ: the preprocessed prompt is converted into numerical embeddings, ensuring that it can be efficiently processed and compared to the vectorized data in the knowledge base. This transformation facilitates accurate retrieval of relevant information, streamlining the prompt's interaction with the model's generative components.

③ Semantic Search and Knowledge Retrieval: Once the user prompt has been transformed into embeddings, ENWAR performs a semantic search to retrieve the most relevant information from its domain-specific knowledge base. This process identifies entries that closely match the prompt by calculating the semantic similarity between the prompt and the embedded data in the knowledge base. As detailed next, the top-ranked results, which are contextually aligned with the prompt, are then selected for further processing.

④ Result Ranking: ENWAR ensures relevance by ranking results according to their section headers, prioritizing the most contextually appropriate portions of the knowledge base. This refined search mechanism optimizes retrieval by focusing on the most pertinent content. Since some contexts may have similar vectorized embeddings, ENWAR concentrates on the top-p percentile to effectively filter out less relevant data, with p = 95 used throughout the system to anchor the retrieval process in the highest-ranking results. This approach enhances both the precision and relevance of the retrieved information, resulting in more accurate and contextually appropriate responses.
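Steps ③ and ④ can be sketched in plain NumPy as follows: cosine similarity ranks the knowledge-base chunks against the prompt embedding, and only chunks at or above the 95th percentile of similarity are kept. This is a simplified reading of the top-p = 95 filtering described above; ENWAR's exact ranking rules (e.g., the use of section headers) may differ.

```python
import numpy as np

def retrieve(prompt_emb: np.ndarray, kb_embs: np.ndarray, kb_chunks: list[str],
             percentile: float = 95.0) -> list[str]:
    """Rank knowledge-base chunks by cosine similarity to the prompt embedding
    and keep only those at or above the given similarity percentile."""
    # Cosine similarity between the prompt and every chunk embedding.
    sims = kb_embs @ prompt_emb / (
        np.linalg.norm(kb_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-12
    )
    threshold = np.percentile(sims, percentile)
    keep = np.where(sims >= threshold)[0]
    # Return the surviving chunks, most similar first.
    return [kb_chunks[i] for i in keep[np.argsort(-sims[keep])]]
```

With percentile = 95, roughly the top 5% most similar chunks survive and are handed to the generation step.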
⑤ Response Generation: Once the top-ranked results from the semantic search are identified, they provide the essential context for the LLM to generate a coherent and contextually appropriate response. These results serve as the foundation upon which the LLM builds its output, ensuring that the generated response is both accurate and relevant to the user's prompt.

The LLM processes the vectorized embedding of the user prompt along with the retrieved context from the knowledge base. It integrates information from multiple sources, such as GPS coordinates, LiDAR data, and visual descriptions, to construct a detailed representation of the environment. This may involve detecting vehicles and their locations, and describing physical aspects of the surroundings in relation to the prompt. To further enhance the generation process, ENWAR employs top-p sampling to strike a balance between accuracy and diversity in responses, effectively filtering out irrelevant outputs while maintaining contextual richness.

Beyond simple description, the LLM can infer interactions between various elements in the environment. For instance, using GPS data and environmental context, the model might predict how vehicles interact or how the physical surroundings could influence those interactions. Furthermore, the LLM ensures that the response aligns with the specific instructions or tasks provided by the user. In the context of ENWAR, this means delivering comprehensive environmental awareness by describing key entities, their locations, and how they may interact within the given environment. By synthesizing all relevant data and ensuring it is grounded in the retrieved context, the LLM generates a detailed and actionable response, providing the insights necessary to make informed decisions, particularly in dynamic and complex scenarios.
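The sketch below shows how the retrieved chunks and the user prompt might be assembled into a grounded prompt before being passed to an instruction-tuned model with top-p sampling. The prompt wording, example context, and question are assumptions for illustration, not ENWAR's exact template.

```python
def build_grounded_prompt(user_prompt: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user's question into one grounded prompt."""
    context = "\n\n".join(f"[Context {i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "You are a wireless-environment analyst. Answer strictly based on the "
        "context below.\n\n"
        f"{context}\n\n"
        f"Question: {user_prompt}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "Is the line-of-sight between the two communicating vehicles blocked?",
    retrieved_chunks=["Scene 12: a truck is roughly 8 m ahead of the receiver vehicle ..."],
)
# `prompt` would then be passed to the chosen LLM (e.g., a locally served
# Llama3.1 or Mistral model) with sampling enabled and top_p = 0.95.
print(prompt)
```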
Key Performance Indicators

Evaluating the performance of ENWAR requires assessing its output based on both general benchmarks and domain-specific metrics. Standard benchmarks such as the General Language Understanding Evaluation (GLUE) and Massive Multitask Language Understanding (MMLU) offer a broad assessment of an LLM's capabilities across various metrics such as answer relevancy, factual correctness, and hallucination avoidance.
However, RAG-based systems require a more targeted evaluation due to their reliance on domain-specific contexts and, in our case, specifically tailored multi-modal data. Following the RAGAS framework [https://docs.ragas.io], we ensure a comprehensive and accurate evaluation of ENWAR's performance through the following KPIs:

• Answer Relevancy (AR) measures how well ENWAR's responses align with the user's prompt and context. Denoting $\vec{E}_{p_i}$ and $\vec{E}_{t_i}$ as the embedding vectors of the $i$-th generated prompt and the relevant ground truth of the data sample, respectively, the cosine/semantic similarity $-1 \leq \cos(\vec{E}_{p_i}, \vec{E}_{t_i}) \leq 1$ measures how semantically similar two texts are based on their vector representations, i.e., 1: perfectly similar (aligned), 0: no similarity, -1: completely dissimilar (opposite). Accordingly, AR is evaluated by calculating the average cosine similarity as follows:

$$\mathrm{AR} = \frac{1}{N}\sum_{i=1}^{N} \cos\left(\vec{E}_{p_i}, \vec{E}_{t_i}\right) = \frac{1}{N}\sum_{i=1}^{N} \frac{\vec{E}_{p_i} \cdot \vec{E}_{t_i}}{\lVert\vec{E}_{p_i}\rVert_2 \, \lVert\vec{E}_{t_i}\rVert_2}. \qquad (1)$$

• Context Recall assesses the degree to which ENWAR's retrieved context aligns with the ground truth. It evaluates whether the system correctly recalls information from the knowledge base that is relevant to the prompt and verifies how much of the response can be attributed to the correct context. It is calculated as the number of ground-truth sentences that can be attributed to the retrieved context, normalized by the total number of sentences in the ground truth.

• Correctness Score measures the factual correctness and semantic similarity of the generated responses. Denoting the embedding vector of the $i$-th generated answer by $\vec{E}_{a_i}$ and using the F1 score as a metric of factual correctness, the overall correctness score is given by

$$\mathrm{Correctness} = \omega \cos\left(\vec{E}_{a_i}, \vec{E}_{t_i}\right) + (1-\omega) F_1, \qquad (2)$$

where the weighting parameter $0 \leq \omega \leq 1$ (RAGAS sets $\omega = 0.25$ by default) ensures that response assessments are both factually accurate and contextually appropriate.

• Faithfulness evaluates the consistency of the generated answers with the retrieved context. A response is considered faithful if all of its claims align with the retrieved data, ensuring the output does not contain unsupported or fabricated information. Faithfulness checks whether the claims made in the output can be logically deduced from the given context and is given by

$$\mathrm{Faithfulness} = \frac{|N_{G_c}|}{|N_C|}, \qquad (3)$$

where $N_{G_c}$ is the number of claims in the generated answer that can be inferred from the given context and $N_C$ is the total number of claims in the generated answer.
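Assuming the relevant embedding vectors and claim/sentence counts have already been extracted (in practice the RAGAS library computes these quantities), the KPIs above reduce to a few lines of NumPy; the function names below are illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def answer_relevancy(prompt_embs: np.ndarray, truth_embs: np.ndarray) -> float:
    """Eq. (1): average cosine similarity over N samples."""
    return float(np.mean([cosine(p, t) for p, t in zip(prompt_embs, truth_embs)]))

def context_recall(attributed_gt_sentences: int, total_gt_sentences: int) -> float:
    """Ground-truth sentences attributable to the retrieved context, normalized."""
    return attributed_gt_sentences / total_gt_sentences

def correctness(ans_emb: np.ndarray, truth_emb: np.ndarray, f1: float,
                omega: float = 0.25) -> float:
    """Eq. (2): weighted mix of semantic similarity and factual F1."""
    return omega * cosine(ans_emb, truth_emb) + (1.0 - omega) * f1

def faithfulness(claims_supported_by_context: int, total_claims: int) -> float:
    """Eq. (3): fraction of generated claims inferable from the retrieved context."""
    return claims_supported_by_context / total_claims
```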
ENWAR SETUP BREAKDOWN AND A CASE STUDY

This section provides a detailed breakdown of the ENWAR setup and offers a qualitative performance comparison with vanilla LLMs, highlighting the advantages of integrating multi-modal data and knowledge retrieval within ENWAR.

A Breakdown of ENWAR Setup

• DeepSense6G Dataset: To evaluate ENWAR's performance, we utilize a large-scale, real-world multi-modal sensing and communication dataset [11]. The DeepSense6G dataset offers a robust platform for testing ENWAR's ability to interpret complex spatial and environmental data. For our evaluation, we focus on Scenario 36, which includes the GPS coordinates of a vehicle equipped with four signal receivers, its captured 360-degree LiDAR point clouds and front-back camera frames, and the GPS coordinates of another vehicle equipped with a transmitter. We meticulously identified a total of 180 scenes/samples, 30 of which are used for testing, that capture urban environments with varying numbers of pedestrians, cyclists, and vehicles; an exemplary scene is shown in Fig. 2.

• Modality Transformation and Information Extraction: For all the selected scenes, ENWAR extracts latitude and longitude coordinates from GPS inputs to determine the positions and relative bearings of the two vehicles, which are then converted into textual format for seamless integration into ENWAR's prompt. For the front and rear images, ENWAR performs an image-to-text transformation to generate a textual description of the visual content using InstructBLIP trained on Vicuna-7b and optimized for visual-tuned instructions [12].

For LiDAR point clouds, ENWAR leverages the super fast accurate 3D (SFA3D) model for object detection and analysis [https://github.com/maudzung/Super-Fast-Accurate-3D-Object-Detection]. SFA3D was modified to extract object information, including locations and bearings relative to the sensor, and to convert this data into text describing the environment. SFA3D utilizes a ResNet-based keypoint feature pyramid network (KFPN) for reliable LiDAR object detection, transforming 3D point clouds into bird's-eye-view images, which are then processed to identify objects with high confidence, providing detailed information such as positions, dimensions, and orientations.

The extracted information from each modality is hard-coded into a template [c.f., white text highlighted with black background in Fig. 2] to be utilized during the prompting and grounding process. Since object types and potential blockages are not readily available as labels within the DeepSense6G dataset, we manually create ground truth text for all 180 scenes by correcting the extracted information where necessary and/or adding missing details.
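As an illustration of this modality-to-text step, the snippet below turns a list of detected objects (class, range, and bearing relative to the sensor, roughly what an SFA3D-style detector can provide) into the kind of sentence that is slotted into the prompt template. The detection format and sentence wording are assumptions for illustration, not the exact fields or template used by ENWAR.

```python
def detections_to_text(detections: list[dict]) -> str:
    """Render per-object detections into a natural-language environment summary."""
    parts = []
    for d in detections:
        parts.append(
            f"a {d['cls']} about {d['range_m']:.1f} m away at a bearing of "
            f"{d['bearing_deg']:.0f} degrees"
        )
    return "The LiDAR detects " + "; ".join(parts) + "."

# Hypothetical detections for one scene.
scene_detections = [
    {"cls": "car", "range_m": 12.3, "bearing_deg": 355.0},
    {"cls": "pedestrian", "range_m": 6.8, "bearing_deg": 40.0},
    {"cls": "cyclist", "range_m": 15.1, "bearing_deg": 120.0},
]
print(detections_to_text(scene_detections))
```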
• Instructional Text Prompt: The prompt includes the extracted information of each scene and specifies predefined tasks, guiding ENWAR to accurately analyze the wireless environment, detect potential blockages, and generate relevant insights based on the processed multi-modal inputs [c.f., yellow box in Fig. 2].
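The exact instruction wording used in the case study appears in Fig. 2; the template below is only an illustrative stand-in showing how the per-scene extracted information and the predefined tasks might be combined.

```python
INSTRUCTION_TEMPLATE = """\
Scene information:
{scene_information}

Tasks:
1. Describe the key entities (vehicles, cyclists, pedestrians) and their positions.
2. Assess whether the line-of-sight between the transmitter and receiver vehicles
   is clear or potentially blocked, and by what.
3. Summarize any implications for the wireless link.
"""

prompt = INSTRUCTION_TEMPLATE.format(
    scene_information="Receiver heading north; transmitter roughly 25 m to the "
                      "north-east; LiDAR detects a truck about 8 m ahead ..."
)
```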
• Chunking and Embedding: ENWAR utilizes the gte-large-en-v1.5 embedding model from Alibaba-NLP to vectorize the ground truth samples for knowledge base creation [13]. This model supports a tokenized context length of up to 8,192 tokens. To maintain continuity between data segments, the transformed textual data is divided into chunks of 1,024 characters, with a 100-character overlap to ensure context is preserved across chunk boundaries.
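Tying this back to the chunk-and-embed sketch from the Multi-Modal RAG Formation section, the concrete configuration described here would look roughly as follows; loading gte-large-en-v1.5 through sentence-transformers with trust_remote_code enabled is an assumption about the packaging of the public model release, not a detail stated above.

```python
from sentence_transformers import SentenceTransformer

# Chunk sizes as reported in the setup: 1,024 characters with 100-character overlap.
CHUNK_SIZE, OVERLAP = 1024, 100

embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

# Illustrative stand-in for a manually curated ground-truth scene description.
ground_truth_text = "Scene 17: the receiver vehicle travels north on an urban road ... " * 60
chunks = [
    ground_truth_text[i:i + CHUNK_SIZE]
    for i in range(0, len(ground_truth_text), CHUNK_SIZE - OVERLAP)
]
knowledge_base = embedder.encode(chunks, normalize_embeddings=True)
```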
Fig. 2: Illustration of the case study scene with raw data, extracted information, generated prompt, and responses from vanilla Llama and ENWAR.
Fig. 3: KPI comparison of LLMs across modality combinations: the first and second rows present absolute KPIs [%] and KPIs normalized per billion parameters of each LLM [%pb], respectively. (Panels report Relevancy, Correctness, and Faithfulness for the M7b, M46b, L8b, L70b, and L405b models.)
[…] camera inputs, respectively. Although image-to-text translation dominates the processing time, it is important to recognize that scene elements such as traffic conditions, weather, landscape, and area classification (urban, suburban, rural), as illustrated in Fig. 2, typically exhibit limited variability on the order of seconds. That is, visual data updates may not require constant reprocessing at the same frequency as other modalities. Unlike the narrative-driven information extracted from camera inputs, LiDAR and GPS provide quantitative measures on the order of milliseconds, allowing ENWAR to utilize windowing for efficient tracking and prediction of environmental changes. These inference times can be further reduced through well-crafted hierarchical LLM architectures, as outlined in the final section.

KPI EVALUATION OF STATE-OF-THE-ART LLMS ON MODALITY COMBINATIONS

This section evaluates the performance of various state-of-the-art LLMs across different modality combinations, which is presented in Fig. 3 and discussed in the following subsections.

Modality Combination Comparison

For single-modality evaluations, the general trend shows GPS < LiDAR < CAM in terms of performance across all KPIs. GPS alone provides limited contextual information, resulting in the lowest scores, while CAM proves to be the most effective single modality, offering richer visual context that significantly enhances answer relevancy, correctness, and faithfulness.

When dual modalities are combined, the trends observed in the single-modality evaluations continue. Specifically, when camera or GPS is paired with LiDAR, CAM+LiDAR consistently outperforms GPS+LiDAR across all KPIs. This reflects the stronger impact of visual data on the models' ability to generate contextually rich and accurate responses. As expected, the integration of all three modalities yields the highest performance across every KPI. The fusion of spatial, depth, and visual information allows the models to deliver the most comprehensive and accurate responses, further emphasizing the value of multi-modal data integration for advanced situational awareness.

LLM Type and Size Comparison

Across all modality permutations, an increase in parameter size correlates with improved absolute KPI values. Larger models consistently outperform their smaller counterparts across all metrics, reflecting the advantage of parameter scaling in LLM performance. However, it is also notable that the rate of KPI improvement slows and begins to saturate as the parameter space grows. This indicates diminishing returns at higher parameter counts, with the largest models showing only marginal gains compared to their slightly smaller counterparts. In terms of model comparisons, the performance differences between Mistral 7b and Llama 8b are minimal, indicating that these two models are comparable in terms of their effectiveness across KPIs and modality combinations. The second row of Fig. 3 reveals a noticeable observation: despite the larger models providing better overall absolute KPIs, the efficiency (i.e., performance per billion parameters) of adding more parameters decreases significantly, potentially indicating overfitting; this points to interesting research directions covered in the next section.
Another promising direction for reducing inference latency is training baby LMs to operate directly on the sensory data at the edge, forming a local RAG that eliminates the intermediary steps of modality transformation and information extraction.

CONCLUSION AND FUTURE DIRECTIONS

[…] knowledge retrieval and system adaptability, allowing local models to respond to real-time data while benefiting from the shared global context. It also mitigates the diminishing returns of scaling large models, optimizes resource usage via serverless computing, and ensures continuous learning for improved performance in complex environments.