Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

📚 Introduction

Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks.
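
To make the idea above concrete, the following is a minimal sketch of how an inter-modal distance-invariant scheme could assign relative distances. It is not the repository's implementation: the function name `effective_distances`, the `anchor` value, and the use of 1D token indices (instead of the 3D positions of Multimodal RoPE) are all illustrative assumptions.

```python
def effective_distances(modalities, anchor=1):
    """Toy relative-distance table under causal attention.

    modalities: list of modality labels, one per token (e.g. "img" / "txt").
    anchor: hypothetical fixed distance used for every inter-modal pair.
    Returns an n x n table; entry [q][k] is the distance a query at index q
    would "see" to a key at index k, or None where k > q (masked).
    """
    n = len(modalities)
    dist = [[None] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):
            if modalities[q] == modalities[k]:
                # Intra-modal pair: keep the natural relative distance.
                dist[q][k] = q - k
            else:
                # Inter-modal pair: distance is pinned to the anchor,
                # so it does not grow with the context length.
                dist[q][k] = anchor
    return dist


# Three visual tokens followed by five text tokens.
tokens = ["img"] * 3 + ["txt"] * 5
d = effective_distances(tokens)
last_text_to_first_image = d[7][0]   # anchored, independent of length
last_text_to_first_text = d[7][3]    # natural distance 7 - 3
```

With plain sequential positions the last text token would sit 7 steps from the first visual token, and that gap would keep growing as more text is appended. In this sketch the gap stays at `anchor`, while distances within each modality are unchanged.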

🔥 News

2026/03/10: Training and evaluation code is now available! For stability, we currently release only the implementation with eager attention.

🛠️ Install

Create the environment

```
conda create -n dipe python=3.12
conda activate dipe
```

Install PyTorch

```
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
```
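
After installation, it can be useful to confirm that the environment actually contains the pinned versions. The snippet below is an optional sketch using only the standard library; the helper name `check_pins` and the `PINNED` table are hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install command above.
PINNED = {"torch": "2.7.1", "torchvision": "0.22.1", "torchaudio": "2.7.1"}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # Ignore a local build suffix such as "+cu126" when comparing.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


if __name__ == "__main__":
    for pkg, (found, ok) in check_pins(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{pkg}: installed={found} pinned={PINNED[pkg]} {status}")
```

Any package reported as `MISMATCH` is either missing or installed at a different version than the one pinned above.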

Install the remaining requirements
```
pip install -r requirements.txt
```

Please follow LLaVA-Next to obtain the original training datasets, and configure them in `DIPE/qwenvl/data/__init__.py`.

Evaluations are implemented using VLMEvalKit. Please refer to `DIPE/long_context_vqa` for distractor text generation.

▶️ Results

♥️ Acknowledgement

We thank Qwen3-VL, Transformers, LLaVA-Next, and VLMEvalKit, which we used to build our training and evaluation code.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```
@article{chen2026dipe,
    title={Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding},
    author={Chen, Lin and Ni, Bolin and Yang, Qi and Wang, Zili and Ding, Kun and Wang, Ying and Peng, Houwen and Xiang, Shiming},
    journal={arXiv preprint arXiv:2603.10863},
    year={2026},
}
```
