o What are the concepts discussed in the papers?
General principles
● Modality correlation: It is important to leverage the correlation between the signals (and their noise) in different modalities. This is the motivation for sensor fusion, as it allows for better overall performance than in the unimodal case. However, correlation also implies information redundancy, and thus increased computation.
● Primary vs. secondary modalities: In certain tasks one modality may carry most of the information, whereas the others provide useful cues on top of the primary one (e.g. depth cues on top of RGB).
● Computational constraints: Processing different modalities may incur different computational costs, e.g. images are 2D while audio and text are 1D.
● Fusion operations: concatenation, addition, weighted addition, etc. How are features of different modalities fused together in the network? (A minimal sketch of these operations appears after this list.)
● Fusion stage: at which stage of the network are the features fused? How much processing is applied to each individual modality before the information is combined to look for correlation patterns, semantic meaning, etc.?
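A minimal sketch of the fusion operations named above, assuming per-modality feature tensors with the same batch size; the function and dimensions are illustrative and not taken from any of the papers:

    import torch

    def fuse(features, mode="concat", weights=None):
        # features: list of per-modality tensors, each of shape (batch, dim)
        if mode == "concat":
            return torch.cat(features, dim=-1)               # (batch, sum of dims)
        if mode == "add":
            return torch.stack(features, dim=0).sum(dim=0)   # requires equal dims
        if mode == "weighted":
            w = torch.softmax(torch.as_tensor(weights, dtype=features[0].dtype), dim=0)
            return sum(wi * f for wi, f in zip(w, features))
        raise ValueError(f"unknown fusion mode: {mode}")

    # Example: RGB and depth features of the same width.
    rgb, depth = torch.randn(8, 128), torch.randn(8, 128)
    fused = fuse([rgb, depth], mode="weighted", weights=[0.7, 0.3])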
Multimodal Fusion on Low-quality Data: A Comprehensive Survey (paper 1)
● Properties and challenges of multi-modal datasets:
1. Noisy: the influence of arbitrary noise needs to be mitigated. Types of noise: feature-dependent (modality-specific) or cross-modal (misaligned labels, ‘semantic’ noise). The first category is handled using variation-based noise reduction, the second using filtering, rectification or noise-adaptive regularization techniques.
2. Incomplete: measurements for certain modalities may be missing, as a result of sensor errors, economic costs, etc. Both imputation-based and imputation-free techniques exist (a toy imputation sketch appears after this list).
3. Imbalanced: the influence of bias needs to be mitigated (i.e. when one modality is more helpful than the others, the model may end up utilizing only that one), as well as discrepancies between modalities. These discrepancies may concern convergence speed (different modalities converge at different rates during learning) or quality (some sensors may be better/worse and noisier than others). Techniques involve modifying learning objectives, scheduling learning rates per modality, masking the high-quality modality, etc.
4. Dynamically varying quality: quality variation is unavoidable in real-world applications (e.g. RGB is worse than thermal in low-light conditions, and vice versa in well-lit conditions). This demands dynamic multimodal fusion, which adapts to the changing quality of the multimodal data by fusing features from the different modalities adaptively.
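As a toy illustration of the ‘incomplete’ case, a minimal imputation-based sketch (zero or mean filling of a missing modality); the strategy shown is my own illustrative choice rather than one prescribed by the survey:

    import numpy as np

    def impute_missing(modality_batch, observed_mask, strategy="mean"):
        # modality_batch: (batch, dim) features; rows where observed_mask is False are missing.
        filled = modality_batch.copy()
        if strategy == "zero":
            filled[~observed_mask] = 0.0
        elif strategy == "mean":
            # Replace missing rows with the mean of the observed rows.
            filled[~observed_mask] = modality_batch[observed_mask].mean(axis=0)
        return filled

    audio = np.random.randn(6, 32)
    observed = np.array([True, True, False, True, False, True])
    audio_imputed = impute_missing(audio, observed, strategy="mean")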
Dynamic Multimodal Fusion (paper 2)
● Dynamic sensor fusion: generate data-dependent forward paths on the fly.
● Modality-level vs. fusion-level decisions: two different implementations, each utilizing a gating-network architecture for reduced computation, where the choice of gating network depends on the task at hand.
● Modality-level approach: the gating network chooses a branch, which corresponds to a model that performs inference using a subset of the available modalities. This works better for simpler tasks involving a ‘main’ modality, such as emotion recognition on a dataset involving text, audio and video, with text being the most informative modality of the three.
● Fusion-level approach: fusion blocks are interleaved between feature-extraction blocks, and the gating network decides the fusion operation for each such block. Fusion operations are identity, concatenation, addition or weighted sum. This approach is useful for more difficult tasks such as semantic segmentation given RGB-D modalities.
● Control over computational cost: because all forward-path branches are known ahead of time, their relative computational cost can be estimated. This allows for a resource-aware loss function with a regularization weight λ, which can prioritize low computation over precision, or vice versa (see the sketch after this list).
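A hedged sketch of how such a resource-aware objective could look in code; the soft-gating formulation and the way branch costs enter are assumptions for illustration, not the paper's exact loss:

    import torch

    def resource_aware_loss(task_loss, gate_probs, branch_costs, lam=0.1):
        # gate_probs: (batch, num_branches) soft gating outputs.
        # branch_costs: (num_branches,) precomputed cost of each branch (e.g. MAdds).
        # lam trades accuracy (small lam) against computation savings (large lam).
        expected_cost = (gate_probs * branch_costs).sum(dim=-1).mean()
        return task_loss + lam * expected_cost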
InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and
Multi-sensory Processing (paper 3)
● Denoising: filter out intent-irrelevant information (‘semantic’ noise)
● Bottleneck: distill the relevant part by projecting features into a lower-dimensional space and then re-projecting back to the original feature space. This denoises the features extracted by the respective encoder modules per modality.
● Saliency Preservation: retain as much intent-relevant information as possible.
● Mutual information: used to establish a loss function for saliency preservation, whereby minimizing the difference in conditional entropy H(y|f) − H(y|f′), where f is the extracted feature vector and f′ is its denoised version, is equivalent to minimizing the KL-divergence between the predictions conditioned on the two.
● Kurtosis: a measure of how heavy-tailed a distribution is: higher kurtosis indicates higher density in the distribution’s tails. Here it is used as a measure of the distribution of model predictions over samples, where minimizing kurtosis reduces ‘edge-case’ behavior in the model (see the sketch after this list).
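A minimal sketch of kurtosis as a penalty over per-sample predictions; this is my own illustrative formulation, not the exact regularizer from the paper:

    import torch

    def kurtosis(x, eps=1e-8):
        # Fourth standardized moment of a 1-D tensor: E[(x - mu)^4] / sigma^4.
        mu = x.mean()
        var = x.var(unbiased=False) + eps
        return ((x - mu) ** 4).mean() / var ** 2

    # Example: penalize heavy-tailed confidence across a batch of predictions.
    confidences = torch.softmax(torch.randn(64, 10), dim=-1).max(dim=-1).values
    kurtosis_penalty = kurtosis(confidences)  # could be added to the loss with some weight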
o What kind of modalities are considered?
Tasks and related modalities
Emotion recognition / sentiment analysis (text, visual, audio)
Movie genre classification (image, text)
Semantic segmentation (RGB, depth images)
Multimodal intent detection (text, visual, audio)
Medical imaging (MRI, PET, CSF)
Urban area classification (hyperspectral imaging (HSI), LiDAR)
o How are the multimodal learning systems built? What are the key
instructional design principles derived?
Paper 2
● Gating network provides modality-level or fusion-level decisions; modality-level DynMM and fusion-level DynMM target different granularity levels, depending on the task at hand (modality-level is coarse, fusion-level is fine).
Modality-level
● classical Mixture-of-Experts (MoE) framework, where each expert specializes in a subset
of modalities.
● A gating network, denoted G(x), decides which of these expert networks should be activated for a given input.
● G produces a one-hot encoding, i.e., only one branch is selected per instance (for reduced computational cost); see the sketch after this list.
● Examples of possible gating networks: multi-layer perceptron, transformer, CNN
depending on the task at hand
● G can take intermediate features per modality as inputs
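A hedged sketch of this modality-level gating idea, assuming a small MLP gate, a Gumbel-softmax for differentiable hard selection, and expert modules that each consume their own subset of modalities; all names and widths are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityLevelGate(nn.Module):
        # Picks exactly one expert branch per instance.
        def __init__(self, gate_in_dim, experts):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(gate_in_dim, 64), nn.ReLU(),
                                      nn.Linear(64, len(experts)))
            self.experts = nn.ModuleList(experts)  # each expert handles a subset of modalities

        def forward(self, gate_input, modality_inputs):
            logits = self.gate(gate_input)
            # Hard one-hot selection; Gumbel-softmax keeps it differentiable during training.
            one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (batch, num_experts)
            # All experts are run here for simplicity of the sketch.
            outputs = torch.stack([e(modality_inputs) for e in self.experts], dim=1)
            return (one_hot.unsqueeze(-1) * outputs).sum(dim=1)

At inference time only the branch selected by the gate would actually be executed, which is where the compute savings come from.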
Fusion-level
● Instead of completely skipping the computation for some modality, it is better to harness it at certain stages.
● interlace static feature extraction blocks (MLP/attention) with dynamic fusion cells
● A fusion cell can be implemented as any function that fuses multimodal features, such as a simple identity mapping (O_i = x_1), addition (O_i = x_1 + x_2 + ... + x_M), concatenation (O_i = [x_1, x_2, ..., x_M]) or self-attention; see the sketch after this list.
● Global gating network: the same network is used for all fusion-level decisions and thus operates on all intermediate features. This is more efficient to train.
● Helpful in tasks where the final prediction is mainly based on a dominant modality, i.e. keep extracting features from the main modality while using the others as ‘cues’ (RGB-Depth, for example), and control how and when the auxiliary modality comes in to assist the main prediction process.
● Used for semantic segmentation (i.e., a dense prediction problem).
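A hedged sketch of one such fusion cell, where a small gate picks among identity, addition and concatenation-plus-projection; the concrete layers and widths are assumptions, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicFusionCell(nn.Module):
        # Gate selects one fusion operation per instance: identity, addition, or
        # concatenation followed by a projection back to the common width.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, 3)    # scores for [identity, add, concat]
            self.proj = nn.Linear(2 * dim, dim)  # maps concatenated features back to dim

        def forward(self, x_main, x_aux):
            gate_in = torch.cat([x_main, x_aux], dim=-1)
            choice = F.gumbel_softmax(self.gate(gate_in), hard=True)  # (batch, 3)
            candidates = torch.stack([x_main,               # identity: main modality only
                                      x_main + x_aux,       # addition
                                      self.proj(gate_in)],  # concatenation + projection
                                     dim=1)                 # (batch, 3, dim)
            return (choice.unsqueeze(-1) * candidates).sum(dim=1)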
Paper 3
● This network performs static multimodal fusion (the computation path is fixed).
● Uses separate modality-specific Transformers to capture the features of the distinct modalities, then fuses (concatenates) the features.
● The denoising bottleneck comprises projection layers into a lower-dimensional space, followed by re-projection back to the original feature space. This distills intent-relevant information while filtering out irrelevant information (see the sketch below).
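A minimal sketch of such a projection bottleneck (down-project, non-linearity, up-project); the layer types and sizes are illustrative assumptions:

    import torch.nn as nn

    class DenoisingBottleneck(nn.Module):
        # Compress fused features to a narrow space and re-project to the original width.
        def __init__(self, feat_dim, bottleneck_dim):
            super().__init__()
            self.down = nn.Linear(feat_dim, bottleneck_dim)  # discard capacity for noise
            self.up = nn.Linear(bottleneck_dim, feat_dim)    # restore original dimensionality
            self.act = nn.ReLU()

        def forward(self, fused_features):
            return self.up(self.act(self.down(fused_features)))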
o How do these principles enhance learning in multimodal environments?
Paper 2
● Does not directly enhance learning, but rather enhances control over the computational cost of inference via a regularization parameter λ added to the training loss, roughly of the form L = L_task + λ·C, where C is the computation incurred by the chosen forward path: C(E_i) denotes the computation cost (e.g., MAdds) of executing an expert network E_i, and C(O_{i,j}) represents the computation cost of the i-th fusion operation in the j-th cell.
● For example, in an emotion recognition task, the dynamic model can activate only its text branch and skip the paths corresponding to the other two modalities, whereas in a ‘harder’ case it may rely on all modalities, leading to heavier computation.
● Fusion-level DynMM is useful in tasks where the final prediction is mainly based on a dominant modality while the others are used as ‘cues’ (e.g. RGB-Depth). In this mode it is possible to control how and when the auxiliary modality comes in to assist the main prediction process.
Paper 3
● The denoising bottleneck module allows the model to retain intent-relevant information while discarding irrelevant information.
● The loss function combines foundational supervision components (comparison of model outputs with targets) with regularization terms (saliency preservation, reduced kurtosis). The first three loss terms are ‘foundational’ in the sense that they supervise the model output with regard to a target; more specifically, L_f and L_f~ supervise the features and L_{modality} gives the text modality dominance (a hedged sketch of such a combined objective appears after this list).
● The denoising bottleneck module also serves as data augmentation during training, making the model more robust to noisy input.
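A hedged sketch of how such a combined objective might be assembled; the term names follow the notes above (L_f, L_f~, L_{modality}), but the cross-entropy supervision, the KL formulation of saliency preservation, and the weights alpha/beta are my own illustrative assumptions:

    import torch.nn.functional as F

    def total_loss(logits_f, logits_f_denoised, logits_text, targets,
                   kurtosis_term, alpha=1.0, beta=0.1):
        l_f = F.cross_entropy(logits_f, targets)                 # supervise raw fused features
        l_f_tilde = F.cross_entropy(logits_f_denoised, targets)  # supervise denoised features
        l_modality = F.cross_entropy(logits_text, targets)       # keep the text modality dominant
        # Saliency preservation: predictions from denoised features stay close to
        # predictions from the original features (KL between the two distributions).
        l_saliency = F.kl_div(F.log_softmax(logits_f_denoised, dim=-1),
                              F.softmax(logits_f, dim=-1), reduction="batchmean")
        return l_f + l_f_tilde + l_modality + alpha * l_saliency + beta * kurtosis_term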
o Summarize the empirical evidence supporting the effectiveness of the design
principles mentioned in the article.
Paper 2
● DynMM is used in 3 tasks (movie genre classification, sentiment analysis, semantic
segmentation) involving different datasets with different modalities.
● modality-level DynMM is used for the first two tasks and fusion-level for the third.
● Different variants of the model are obtained per task by using different values of the regularization hyperparameter, yielding a different computation-precision trade-off for each variant.
● The light-computation variants all do well compared to the SOTA baselines with regard to the classification metrics (F1, MAE, mIoU, etc.). This allows for an almost 50% reduction in computation while remaining close to the baselines’ performance. The heavy-computation variants, on the other hand, exceed the SOTA baselines.
● The model is also shown to be more robust than other SOTA baselines via Gaussian noise injection in the more difficult semantic segmentation task.
Paper 3
● The model does as well as or better than SOTA baselines on the task of MID (multimodal intent detection) over two multimodal datasets. The model also has a lower perplexity measure.
● The model does much better than SOTA baselines on uncommon intent categories, thanks to kurtosis regularization.
● Ablation studies show that the full loss function (comprising foundational and regularization terms) does significantly better than subsets of its components over both datasets.
● Modality-corruption and low-resource studies also show that the model achieves high performance compared to the baselines.
● The generalizability of the model is examined via cross-architecture and cross-task scenarios. These reveal that the underlying sensor-fusion architecture can still be improved, and that the model achieves competitive performance against multimodal sentiment analysis (MSA) baselines.
o Discuss any limitations or gaps in the empirical studies reviewed.
Paper 2
● Different gating network modules (MLP/Transformer/convolutional) were chosen for
different tasks. What were the motivations for this? Is this modality or task dependent?
● Given certain computational or precision requirements, how could one go about
choosing a regularization parameter λ which would yield a suitable model?
● What is the difference in computational cost between the modality-level and fusion-level implementations on the same task?
● Could the two approaches be integrated by using a fusion-level variant within the framework of a modality-level variant? Would this be useful at all?
Paper 3
● The proposed architecture is evaluated only on multimodal datasets that include text, also within the cross-task study. How does the model compare to the baselines on tasks involving no text at all (meaning the loss function would need to be modified)?
● What is the effect of the denoising bottleneck in the case of highly ‘misaligned’ samples (i.e. the text expresses one intent, the visuals another, and the audio a third)? Will it automatically prioritize the ‘main’ modality in this case?
Using Heterogeneous Multilevel Swarms of UAVs and High-Level Data
Fusion to Support Situation Management in Surveillance Scenarios (paper 4)
o What future research directions does the article suggest for multimodal sensor/information fusion?
● multi-sensor multi-source data fusion: fuse information delivered by heterogeneous
swarms of UAVs
● Decentralized, distributed frameworks for high-level information fusion allow for a decreased workload per unit.
● Information sharing between UAVs: different sensors can be ‘aware’ of each other, saving higher-level computation.
● High-level data fusion: situation assessment, relationships between objects, impact assessment. The ‘Object-Oriented World Model’ provides a framework within which to complete such tasks, perhaps demanding different information-fusion approaches.
o How can these directions be applied to the field of multimodal sensor fusion
in dynamic environments?
● A decentralized framework demands a dynamic multimodal fusion approach, as some
drones may be unavailable (as a result of loss, refueling, etc…)
● Similarly, a higher-level operator (intelligence control, etc..) can choose a subset of
drones in the swarm to collect data, depending on position, availability, and sensors.
● Fusion flexibility: low-quality data can also be collected and utilized. Missing data can be
imputed if necessary.
● UAVs are autonomous, and can thus perform computationally light sensor fusion themselves if necessary (in case the higher-level operator is unavailable).
o What are the challenges presented by dynamic environments and varying
topologies?
● Real-time (online) implementation of sensor-fusion algorithms needs to be computationally efficient and robust.
● A dynamic environment means that new events may occur at any time. This demands quick reactions to changes (for example, a drone in the swarm falling).
● Dynamic movement requires a mobility model. This is additional computation for each
drone (swarm needs to navigate, keep distance).
● Varying topology means the sensor-fusion model can’t be ‘optimized’ with regard to a particular modality. For example, in certain topologies less light may be available, or certain sensors may be preferable to others.
(Questions of mine)
● How much computing power do the drones in question have? Are the algorithms
mentioned in the paper (optical flow, projective image transformation, motion detection
etc..) implemented by the drones themselves?
● What kinds of sensors are used? Does a heterogeneous fleet necessarily mean that these could change from drone to drone?
● Can OCR be integrated into the fusion process? Would this be accomplished by individual drones or by a higher-level operator?