Awesome Native Multimodal Modeling (NMM)

A curated reading list & model zoo for the era of Born-Native multimodal foundation models.

📄 Companion paper: "Toward Native Multimodal Modeling: A Roadmap"

Taxonomy · M2T · M2G · M2M · Technical Roadmap · Cite

This repository systematically tracks the structural transition from Modular Assembly — late-fusion / grafted compositions that suffer from a fundamental blindness to raw sensory signals — to Native Multimodal Modeling (NMM), where multiple modalities are intrinsically integrated into a unified transformer space or joint backbone.

⭐ Star this repo to track the latest landmark works. PRs are warmly welcomed for any model we may have missed.

🗺️ The NMM Architectural Taxonomy

We organize the NMM ecosystem along two dimensions, Integration Depth (mid-fusion vs. early-fusion) and Functional Input–Output Duality (which modalities go in and which come out):

#	Paradigm	Input → Output	Core Idea
🟦	M2T — Multi-to-Text	multimodal → text	Ground cross-modal inputs into purely linguistic responses for reasoning.
🟩	M2G — Multi-to-Target	multimodal → modality-specific	Direct synthesis of modality-specific outputs through native representations to achieve temporal & acoustic coherence.
🟪	M2M — Multi-to-Multi	multimodal → multimodal	A unified paradigm where understanding and generation naturally coexist as reciprocal projections within a single network.
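
As a rough illustration of the input-output dimension of the table above, the following sketch maps a model's input and output modality sets to one of the three paradigm labels. The `classify_paradigm` function and its rules are assumptions made for illustration only; the roadmap paper may draw the boundaries differently, for example by also weighing integration depth or by treating joint video-plus-audio output as a single target.

```python
from enum import Enum


class Paradigm(Enum):
    M2T = "Multi-to-Text"
    M2G = "Multi-to-Target"
    M2M = "Multi-to-Multi"


def classify_paradigm(inputs: set, outputs: set) -> Paradigm:
    """Hypothetical rule of thumb that looks only at the output modalities."""
    if not inputs or not outputs:
        raise ValueError("inputs and outputs must be non-empty")
    if outputs == {"text"}:
        return Paradigm.M2T   # multimodal -> text only
    if len(outputs) == 1:
        return Paradigm.M2G   # multimodal -> one non-text target modality
    return Paradigm.M2M       # multimodal -> several modalities


if __name__ == "__main__":
    print(classify_paradigm({"image", "text"}, {"text"}).name)           # M2T
    print(classify_paradigm({"text", "image"}, {"video"}).name)          # M2G
    print(classify_paradigm({"text", "image"}, {"text", "image"}).name)  # M2M
```

The rule is deliberately coarse: it says nothing about how deeply the modalities are fused inside the network, which is the second axis of the taxonomy.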

🟦 1. Multi-to-Text (M2T) Unimodal Generation

Native scaling frameworks that ground cross-modal inputs into linguistic streams for logical reasoning.

🧱 Late-Fusion Baseline References

Modularly assembled via shallow projectors; blind to raw sensory signals.

LLaVA [Liu et al., 2023] — 💻 GitHub · 📄 Paper
DeepSeek-VL [Lu et al., 2024] — 💻 GitHub · 📄 Paper
Qwen-Image [Wu et al., 2025] — 💻 GitHub · 🌐 Blog
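
The "shallow projector" pattern named in this subsection can be sketched in a few lines: features from a separately trained vision encoder pass through a single linear map into the language model's embedding width and are then prepended to the text embeddings. This is a toy, pure-Python sketch of the general pattern; the dimensions, the `project` helper, and the random weights are illustrative assumptions and are not taken from any of the listed models.

```python
import random

random.seed(0)

VISION_DIM = 4   # width of the (frozen) vision encoder output, assumed
TEXT_DIM = 3     # width of the language model embeddings, assumed


def project(features, weight):
    """Apply one linear layer: each VISION_DIM vector becomes a TEXT_DIM vector."""
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*weight)]
        for vec in features
    ]


# Stand-ins for encoder outputs: 2 image patches and 3 text tokens.
patch_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(2)]
text_embeddings = [[random.random() for _ in range(TEXT_DIM)] for _ in range(3)]

# The projector (VISION_DIM x TEXT_DIM) is the only glue between the two modules.
projector = [[random.gauss(0, 0.02) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]

visual_tokens = project(patch_features, projector)
llm_input = visual_tokens + text_embeddings  # 5 vectors, each TEXT_DIM wide
print(len(llm_input), len(llm_input[0]))     # 5 3
```

In this arrangement the language model only ever sees the projected features, never the raw pixels, which is the limitation the taxonomy attributes to late fusion.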

🔗 Mid-Fusion (Naturally Interacted Regime)

Foundational pioneers maintaining explicit, modality-aware boundaries.

CogVLM [Wang et al., 2023] — 💻 GitHub
Qwen-Audio [Chu et al., 2023] — 💻 GitHub · 🌐 Project Page

Large-scale, state-of-the-art mid-fusion architectures:

Qwen2.5-VL [Qwen Team, 2025] — 💻 GitHub · 🌐 Blog
Qwen3-VL [Qwen Team, 2025] — 💻 GitHub · 📄 Paper
InternVL-3.5 [Chen et al., 2025] — 💻 GitHub · 🤗 HF Collection

Scale-driven industrial mid-fusion implementations:

GLM-4.5V / GLM-V [ZhipuAI, 2025–2026] — 💻 GitHub · 🤗 HF Model
Kimi K2 / K2.5 [Moonshot AI, 2025–2026] — 🌐 Project Page · 💻 GitHub Org

🧬 Dense / Native M2T Scaling

MiniCPM-V 4.x [Yu et al., 2025] — 💻 GitHub
Nemotron 3 Nano Omni [NVIDIA, 2026] — 💻 GitHub · 📄 Paper
MiMo-V2.5 [Xiaomi MiMo Team, 2026] — 💻 GitHub · 🌐 Project Page
Gemma-4 / Qwen3.6 — Timeline benchmarks driving advanced contextual reasoning (forthcoming).

🟩 2. Multi-to-Target (M2G) Scenario-based Generation

Bypassing traditional post-hoc decoders to directly synthesize photorealistic, physically coherent video or continuous speech.

🎬 Advanced Video / World Simulators

Wan 2.2-T2V-A14B [Wan Team, 2025] — 🤗 HF Model — Unifies video patches into native generation spaces with continuous physics.
HunyuanVideo & HunyuanVideo-1.5 [Tencent, 2024–2025] — 💻 GitHub · 🤗 HF Model (1.5)
Kling-Omni [Kuaishou, 2025] — 🌐 Project Page

🎙️ Speech-Centric Native Frameworks

OmniVoice [Zhu et al., 2026] — 💻 GitHub · 🌐 Project Page
MiniCPM-o 2.6 / 4.5 [OpenBMB, 2025–2026] — 🤗 HF Model · 💻 GitHub
Seedream 3.0 [Gao et al., 2025] — 📄 Tech Report · 🌐 Project Page
HiDream-I1 — 💻 GitHub

📅 Timeline Milestone Generators

LTX-2 / LTX-Video [Lightricks, 2024–2026] — 💻 GitHub
Ming-Flash-Omni [Ant Group / inclusionAI, 2025] — 💻 GitHub · 📄 Paper

🟪 3. Multi-to-Multi (M2M) Symmetric Modeling

Omni-directional unified spaces establishing a symmetric paradigm where comprehension and generation natively coexist.

🔥 Early-Fusion (Native Convergent Regime)

Born-native designs treating all modalities equivalently via one unified backbone & embedding space.

Transfusion [Zhou et al., 2024] — 📄 Paper
Chameleon ★ [Meta AI, 2024] — 💻 GitHub · 📄 Paper
AnyGPT ★ [Zhan et al., 2024] — 💻 GitHub · 📄 Paper
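
One way to picture "one unified backbone & embedding space" is a single token id space shared by every modality. The sketch below assumes images have already been quantized into discrete codes, as token-based models such as Chameleon do; the vocabulary sizes, the offset scheme, the delimiter ids, and the `interleave` helper are all hypothetical.

```python
TEXT_VOCAB = 1000      # assumed number of text tokens
IMAGE_CODEBOOK = 512   # assumed number of discrete image codes
BOI = TEXT_VOCAB + IMAGE_CODEBOOK       # begin-of-image delimiter id
EOI = TEXT_VOCAB + IMAGE_CODEBOOK + 1   # end-of-image delimiter id
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK + 2


def image_code_to_token(code: int) -> int:
    """Shift an image code past the text range so both share one id space."""
    if not 0 <= code < IMAGE_CODEBOOK:
        raise ValueError("image code out of range")
    return TEXT_VOCAB + code


def interleave(segments):
    """Flatten [("text", ids), ("image", codes), ...] into one token sequence."""
    sequence = []
    for modality, ids in segments:
        if modality == "text":
            sequence.extend(ids)
        elif modality == "image":
            sequence.append(BOI)
            sequence.extend(image_code_to_token(c) for c in ids)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return sequence


tokens = interleave([("text", [5, 17]), ("image", [0, 511]), ("text", [42])])
print(tokens)  # [5, 17, 1512, 1000, 1511, 1513, 42]
```

Because every id indexes the same embedding table of size `UNIFIED_VOCAB`, a single transformer can attend over text and image positions without a modality-specific adapter. Models that keep image latents continuous (Transfusion, for instance) would need a different sketch.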

🔮 Early Unified Predictors

Moshi ★ [Défossez et al., 2024] — 💻 GitHub · 📄 Paper — Real-time conversational audio-text dual-stream processing.
Emu3 / Emu3.5 ★ [BAAI, 2024–2025] — 🌐 Project Page · 📄 Paper — Next-token sequence prediction unifying understanding and synthesis.

🧩 Interleaved Sequence Modeling

BAGEL-7B [ByteDance Seed Team, 2025] — 🤗 HF Model · 🌐 Project Page · 📄 Paper
OneCAT-3B [Meituan & SJTU, 2025] — 💻 GitHub · 🤗 HF Model
Show-o2-7B [Xie et al., 2025] — 💻 GitHub · 📄 Paper

🌌 Bidirectional Unification Frontiers

Collapsing representation boundaries.

Janus-Pro ★ [DeepSeek-AI, 2025] — 🤗 HF Model · 📄 Paper
Llama-4 Scout / Maverick [Meta AI, 2025] — 🌐 Llama Site — Advanced interleaved-scale exploration.
LLaDA-V [Ml-GSAI, 2025] — 💻 GitHub · 🌐 Project Page · 📄 Paper
Lance [ByteDance, 2026] — 📄 Paper — Leading edge of complete native convergence.
TUNA-2 [Liu et al., 2026] / Mamoda 2.5 [Shi et al., 2026] / LongCat-Next — Forthcoming.
Dynin-Omni [Kim et al., 2026] — 💻 GitHub · 🤗 HF Model · 📄 Paper · 🌐 Project Page

★ Denotes early exploratory or foundational dual-regime architectures.

🛠️ The Technical Roadmap Dimensions

Following the structure of Sections 3–7 of the roadmap paper, the core components of the NMM lifecycle are outlined below.

🧩 1. Architecture · §3

Integration Depth Mapping: Structural mechanics of joint multimodal backbones vs. single unified transformer spaces.
Input–Output Decoupling: Eliminating modality-aware boundaries & shallow projectors.

📊 2. Data Curriculum · §4

Interleaved Data Curricula: Pre-training token mixtures combining web-scale text, audio waves, & video streams.
Post-Training Engineering: Multi-modal instruction tuning & alignment token datasets.

🎯 3. Training Strategies · §5

Multi-Objective Loss Recipes: Unifying continuous-discrete objectives (autoregressive next-token prediction + diffusion steps).
Scaling Dynamics: Computed token-allocation strategies to maintain gradient stability at the 1T+ MoE frontier.

⚡ 4. Inference & Deployment · §6

Full-Duplex Orchestration: Dynamic KV-cache eviction & multi-scale attention patterns for real-time interaction (<100 ms).
Hardware-Native Compilation: Distributed CUDA compute kernels for unified cross-modal token routing.

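
To make the "continuous-discrete" objective under Training Strategies more concrete, here is a minimal sketch of one way such a loss could be combined: cross-entropy over next-token logits for discrete positions, plus a mean-squared noise-prediction error for continuous latents, mixed by a scalar weight. The weight `lambda_diff`, the function names, and the toy inputs are assumptions for illustration; the roadmap paper and the individual models may weight or schedule these terms differently.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def noise_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)


def joint_loss(token_logits, token_targets, predicted_noise, true_noise, lambda_diff=1.0):
    """Average next-token loss plus a weighted diffusion-style term."""
    lm = sum(cross_entropy(l, t) for l, t in zip(token_logits, token_targets))
    lm /= len(token_targets)
    return lm + lambda_diff * noise_mse(predicted_noise, true_noise)


loss = joint_loss(
    token_logits=[[0.0, 0.0], [0.0, 0.0]],  # uniform over 2 tokens, so ln 2 each
    token_targets=[0, 1],
    predicted_noise=[0.5, -0.5],
    true_noise=[0.0, 0.0],                   # squared errors 0.25 and 0.25, mean 0.25
    lambda_diff=2.0,
)
print(round(loss, 4))  # 1.1931
```

A single scalar weight is the simplest possible mixing rule; it is used here only to show where the two terms meet, not as a recommendation.
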
🧪 5. Evaluation Benchmarks · §7

Symmetric Evaluation Matrices: Benchmarking systems capable of examining interleaved multi-modal sequences without suffering from target-modality collapse.

🤝 Contributing

Contributions are very welcome! If a notable native multimodal model is missing or you find an outdated link, please open an Issue or send a Pull Request.

The preferred entry format is:

- **<Model Name>** [<Authors / Team>, <Year>] — [`💻 GitHub`](https://...) · [`📄 Paper`](https://...)
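
Contributors who want to check an entry before opening a Pull Request could use something like the sketch below. It is not part of this repository: the regular expressions, the `parse_entry` helper, and the `example.com` URLs are hypothetical, and the pattern only covers the preferred format shown above (bold name, bracketed authors and year, a long dash, then one or more labelled links separated by middle dots).

```python
import re

# \u2014 is the long dash between citation and links; \u2013 allows year
# ranges such as 2025-2026 written with an en dash.
ENTRY_RE = re.compile(
    r"^- \*\*(?P<name>[^*]+)\*\* "
    r"\[(?P<authors>[^\]]+), (?P<year>\d{4}(?:\u2013\d{4})?)\] \u2014 "
    r"(?P<links>.+)$"
)
LINK_RE = re.compile(r"\[`(?P<label>[^`]+)`\]\((?P<url>https://[^)\s]+)\)")


def parse_entry(line: str):
    """Return (name, authors, year, [(label, url), ...]) or None if malformed."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    links = LINK_RE.findall(match["links"])
    if not links:
        return None
    return match["name"], match["authors"], match["year"], links


sample = (
    "- **Example-Model** [Example Team, 2026] \u2014 "
    "[`💻 GitHub`](https://example.com/repo) \u00b7 "
    "[`📄 Paper`](https://example.com/paper)"
)
print(parse_entry(sample))
print(parse_entry("Example-Model, 2026, no links"))  # None
```

Returning `None` for anything that does not match keeps the check strict, so a reviewer sees formatting problems early instead of merging an inconsistent entry.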

✍️ Citation

If our formalization, taxonomy, or roadmap framework helps your research, please cite our paper:

@article{TencentYoutuLab2026toward,
  title   = {Toward Native Multimodal Modeling: A Roadmap},
  author  = {Siyu An and Junru Lu and Junnan Dong and others},
  journal = {arXiv preprint},
  year    = {2026}
}

Maintained by the NMM-Roadmap community · Made with ❤️ for open multimodal research.
