A curated reading list & model zoo for the era of Born-Native multimodal foundation models.
Companion paper: "Toward Native Multimodal Modeling: A Roadmap"
This repository systematically tracks the structural transition from Modular Assembly (late-fusion or grafted compositions that suffer from a fundamental blindness to raw sensory signals) to Native Multimodal Modeling (NMM), where multiple modalities are intrinsically integrated into a unified transformer space or joint backbone.
⭐ Star this repo to track the latest landmark works. PRs are warmly welcomed for any model we may have missed.
We formalize the NMM ecosystem along two dimensions: Integration Depth (mid-fusion vs. early-fusion) and Functional Input-Output Duality:
| # | Paradigm | Input → Output | Core Idea |
|---|---|---|---|
| 🟦 | M2T (Multi-to-Text) | multimodal → text | Ground cross-modal inputs into purely linguistic responses for reasoning. |
| 🟩 | M2G (Multi-to-Target) | multimodal → modality-specific | Direct synthesis of modality-specific outputs through native representations to achieve temporal & acoustic coherence. |
| 🟪 | M2M (Multi-to-Multi) | multimodal → multimodal | A unified paradigm where understanding and generation naturally coexist as reciprocal projections within a single network. |
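
The taxonomy in the table above can be sketched as a small data model. This is a minimal illustration, not the paper's formal definition: the class names, the `classify` helper, and the rule "text only means M2T, one non-text modality means M2G, several modalities mean M2M" are assumptions made here for clarity, and the example entries are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Paradigm(Enum):
    M2T = "Multi-to-Text"
    M2G = "Multi-to-Target"
    M2M = "Multi-to-Multi"


class IntegrationDepth(Enum):
    MID_FUSION = "mid-fusion"
    EARLY_FUSION = "early-fusion"


def classify(outputs):
    """Assumed simplification: pick a paradigm from the set of output modalities."""
    outputs = frozenset(outputs)
    if not outputs:
        raise ValueError("a model must emit at least one modality")
    if outputs == {"text"}:
        return Paradigm.M2T      # multimodal -> text
    if len(outputs) == 1:
        return Paradigm.M2G      # multimodal -> one modality-specific target
    return Paradigm.M2M          # multimodal -> multimodal


@dataclass(frozen=True)
class ModelEntry:
    """One row of the reading list, placed on both taxonomy axes."""
    name: str
    outputs: frozenset
    depth: IntegrationDepth

    @property
    def paradigm(self):
        return classify(self.outputs)


# Hypothetical entries, not real models from the list.
entries = [
    ModelEntry("example-vlm", frozenset({"text"}), IntegrationDepth.MID_FUSION),
    ModelEntry("example-video-gen", frozenset({"video"}), IntegrationDepth.MID_FUSION),
    ModelEntry("example-unified", frozenset({"text", "image"}), IntegrationDepth.EARLY_FUSION),
]
for entry in entries:
    print(entry.name, entry.paradigm.name, entry.depth.value)
```

Counting output modalities is only a rough proxy. A curated list may place a model by its design intent rather than by how many modalities it emits, so treat the rule as a starting point for sorting entries.
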
Native scaling frameworks that ground cross-modal inputs into linguistic streams for logical reasoning.
Modularly assembled via shallow projectors; blind to raw sensory signals.
- LLaVA [Liu et al., 2023]: 💻 GitHub · Paper
- DeepSeek-VL [Lu et al., 2024]: 💻 GitHub · Paper
- Qwen-Image [Wu et al., 2025]: 💻 GitHub · Blog
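
The "shallow projector" pattern can be made concrete with a toy sketch. In projector-based designs such as LLaVA, features from a separately trained vision encoder are mapped by a small learned layer into the language model's embedding space and concatenated with the text token embeddings. The function names, dimensions, and plain Python lists standing in for tensors below are illustrative assumptions, not code from any listed repository.

```python
def linear_project(features, weight, bias):
    """Map each vision feature vector (length d_v) to a vector of length d_llm.

    weight has d_llm rows of length d_v; bias has length d_llm.
    """
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def late_fusion_sequence(vision_features, text_embeddings, weight, bias):
    """Build the language model input: projected visual tokens, then text tokens.

    The language model never sees pixels here, only the encoder's features
    after projection.
    """
    visual_tokens = linear_project(vision_features, weight, bias)
    return visual_tokens + text_embeddings


# Toy sizes (assumed): 2 image patches with d_v = 4, d_llm = 2, 3 text tokens.
vision_features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
bias = [0.5, -0.5]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = late_fusion_sequence(vision_features, text_embeddings, weight, bias)
print(len(sequence), "tokens of width", len(sequence[0]))
```

The projector is the only component that bridges the two modalities in this sketch, which is why such designs are described above as modular assemblies.
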
Foundational pioneers maintaining explicit, modality-aware boundaries.
- CogVLM [Wang et al., 2023]: 💻 GitHub
- Qwen-Audio [Chu et al., 2023]: 💻 GitHub · Project Page
Large-scale, state-of-the-art mid-fusion architectures:
- Qwen2.5-VL [Qwen Team, 2025]: 💻 GitHub · Blog
- Qwen3-VL [Qwen Team, 2025]: 💻 GitHub · Paper
- InternVL-3.5 [Chen et al., 2025]: 💻 GitHub · 🤗 HF Collection
Scale-driven industrial mid-fusion implementations:
- GLM-4.5V / GLM-V [ZhipuAI, 2025–2026]: 💻 GitHub · 🤗 HF Model
- Kimi K2 / K2.5 [Moonshot AI, 2025–2026]: Project Page · 💻 GitHub Org
- MiniCPM-V 4.x [Yu et al., 2025]: 💻 GitHub
- Nemotron 3 Nano Omni [NVIDIA, 2026]: 💻 GitHub · Paper
- MiMo-V2.5 [Xiaomi MiMo Team, 2026]: 💻 GitHub · Project Page
- Gemma-4 / Qwen3.6: Timeline benchmarks driving advanced contextual reasoning (forthcoming).
Bypassing traditional post-hoc decoders to synthesize photorealistic spatiotemporal physics or continuous speech directly.
- Wan 2.2-T2V-A14B [Wan Team, 2025]: 🤗 HF Model. Unifies video patches into native generation spaces with continuous physics.
- HunyuanVideo & HunyuanVideo-1.5 [Tencent, 2024–2025]: 💻 GitHub · 🤗 HF Model (1.5)
- Kling-Omni [Kuaishou, 2025]: Project Page
- OmniVoice [Zhu et al., 2026]: 💻 GitHub · Project Page
- MiniCPM-o 2.6 / 4.5 [OpenBMB, 2025–2026]: 🤗 HF Model · 💻 GitHub
- Seedream 3.0 [Gao et al., 2025]: Tech Report · Project Page
- HiDream-I1: 💻 GitHub
- LTX-2 / LTX-Video [Lightricks, 2024–2026]: 💻 GitHub
- Ming-Flash-Omni [Ant Group / inclusionAI, 2025]: 💻 GitHub · Paper
Omni-directional unified spaces establishing a symmetric paradigm where comprehension and generation natively coexist.
Born-native designs treating all modalities equivalently via one unified backbone & embedding space.
- Transfusion [Zhou et al., 2024]: Paper
- Chameleon† [Meta AI, 2024]: 💻 GitHub · Paper
- AnyGPT† [Zhan et al., 2024]: 💻 GitHub · Paper
- Moshi† [Défossez et al., 2024]: 💻 GitHub · Paper. Real-time conversational audio-text dual-stream processing.
- Emu3 / Emu3.5† [BAAI, 2024–2025]: Project Page · Paper. Next-token sequence prediction unifying understanding and synthesis.
- BAGEL-7B [ByteDance Seed Team, 2025]: 🤗 HF Model · Project Page · Paper
- OneCAT-3B [Meituan & SJTU, 2025]: 💻 GitHub · 🤗 HF Model
- Show-o2-7B [Xie et al., 2025]: 💻 GitHub · Paper
Collapsing representation boundaries.
- Janus-Pro† [DeepSeek-AI, 2025]: 🤗 HF Model · Paper
- Llama-4 Scout / Maverick [Meta AI, 2025]: Llama Site. Advanced interleaved-scale exploration.
- LLaDA-V [Ml-GSAI, 2025]: 💻 GitHub · Project Page · Paper
- Lance [ByteDance, 2026]: Paper. Leading edge of complete native convergence.
- TUNA-2 [Liu et al., 2026] / Mamoda 2.5 [Shi et al., 2026] / LongCat-Next: Forthcoming.
- Dynin-Omni [Kim et al., 2026]: 💻 GitHub · 🤗 HF Model · Paper · Project Page
† Denotes early exploratory or foundational dual-regime architectures.
Following the structure of §3–§7 of the roadmap paper, the core components of the NMM lifecycle are curated below.
Contributions are very welcome! If a notable native multimodal model is missing or you find an outdated link, please open an Issue or send a Pull Request.
The preferred entry format is:
- **<Model Name>** [<Authors / Team>, <Year>]: [`💻 GitHub`](https://...) · [`Paper`](https://...)

If our formalization, taxonomy, or roadmap framework assists your research, please cite our paper:
@article{TencentYoutuLab2026toward,
title = {Toward Native Multimodal Modeling: A Roadmap},
author = {Siyu An and Junru Lu and Junnan Dong and others},
journal = {arXiv preprint},
year = {2026}
}