NMM-Roadmap/Awesome-NMM-List

Awesome Native Multimodal Modeling (NMM)

A curated reading list & model zoo for the era of Born-Native multimodal foundation models.

📄 Companion paper: "Toward Native Multimodal Modeling: A Roadmap"

Taxonomy · M2T · M2G · M2M · Technical Roadmap · Cite


This repository systematically tracks the structural transition from Modular Assembly (late-fusion or grafted compositions that remain blind to raw sensory signals) to Native Multimodal Modeling (NMM), where multiple modalities are intrinsically integrated into a unified transformer space or joint backbone.

⭐ Star this repo to track the latest landmark works. PRs are warmly welcomed for any model we may have missed.


🗺️ The NMM Architectural Taxonomy

We formalize the NMM ecosystem through a dual-dimensional lens based on Integration Depth (mid-fusion vs. early-fusion) and Functional Input–Output Duality:

| # | Paradigm | Input → Output | Core Idea |
|---|---|---|---|
| 🟦 | M2T: Multi-to-Text | multimodal → text | Ground cross-modal inputs into purely linguistic responses for reasoning. |
| 🟩 | M2G: Multi-to-Target | multimodal → modality-specific | Direct synthesis of modality-specific outputs through native representations to achieve temporal & acoustic coherence. |
| 🟪 | M2M: Multi-to-Multi | multimodal → multimodal | A unified paradigm where understanding and generation naturally coexist as reciprocal projections within a single network. |
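As a rough illustration of the Input–Output axis of the table above, the sketch below sorts a model into one of the three paradigms from its output modalities alone. The function name `classify_by_outputs`, the modality strings, and the rule itself are assumptions made for this example; the rule is deliberately coarse and is not the formal definition used in the companion paper.

```python
from enum import Enum
from typing import Iterable


class Paradigm(Enum):
    M2T = "Multi-to-Text"
    M2G = "Multi-to-Target"
    M2M = "Multi-to-Multi"


def classify_by_outputs(outputs: Iterable[str]) -> Paradigm:
    """Coarse, hypothetical rule: decide the paradigm from output modalities only.

    All three paradigms take multimodal input, so only the outputs are inspected.
    """
    modalities = {m.lower() for m in outputs}
    if not modalities:
        raise ValueError("a model needs at least one output modality")
    if modalities == {"text"}:
        return Paradigm.M2T  # multimodal -> text
    if len(modalities) == 1:
        return Paradigm.M2G  # multimodal -> one specific non-text modality
    return Paradigm.M2M      # multimodal -> multimodal
```

A model that emits both text and images would land in `Paradigm.M2M` under this rule, while a speech-only generator would land in `Paradigm.M2G`. Integration Depth (mid-fusion vs. early-fusion) is a separate axis that this sketch does not model.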

🟦 1. Multi-to-Text (M2T) Unimodal Generation

Native scaling frameworks that ground cross-modal inputs into linguistic streams for logical reasoning.

🧱 Late-Fusion Baseline References

Modularly assembled via shallow projectors; blind to raw sensory signals.

🔗 Mid-Fusion (Naturally Interacted Regime)

Foundational pioneers maintaining explicit, modality-aware boundaries.

Large-scale, state-of-the-art mid-fusion architectures:

Scale-driven industrial mid-fusion implementations:

🧬 Dense / Native M2T Scaling


🟩 2. Multi-to-Target (M2G) Scenario-based Generation

Bypassing traditional post-hoc decoders to synthesize photorealistic spatiotemporal physics or continuous speech directly.

🎬 Advanced Video / World Simulators

🎙️ Speech-Centric Native Frameworks

📅 Timeline Milestone Generators


🟪 3. Multi-to-Multi (M2M) Symmetric Modeling

Omni-directional unified spaces establishing a symmetric paradigm where comprehension and generation natively coexist.

🔥 Early-Fusion (Native Convergent Regime)

Born-native designs treating all modalities equivalently via one unified backbone & embedding space.

🔮 Early Unified Predictors

🧩 Interleaved Sequence Modeling

🌌 Bidirectional Unification Frontiers

Collapsing representation boundaries.

★ Denotes early exploratory or foundational dual-regime architectures.


🛠️ The Technical Roadmap Dimensions

Following the structure of Sections §3–§7 of the roadmap paper, the core components of the NMM lifecycle are curated below.

🧩 1. Architecture · §3

  • Integration Depth Mapping: Structural mechanics of joint multimodal backbones vs. single unified transformer spaces.

  • Input–Output Decoupling: Eliminating modality-aware boundaries & shallow projectors.

📊 2. Data Curriculum · §4

  • Interleaved Data Curricula: Pre-training token mixtures combining web-scale text, audio waves, & video streams.

  • Post-Training Engineering: Multi-modal instruction tuning & alignment token datasets.
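To make the idea of a pre-training token mixture concrete, here is a minimal sketch that draws the source modality for each training sample in proportion to a mixture weight. The weights in `MIXTURE` and the function name `sample_sources` are hypothetical placeholders; the roadmap does not prescribe these values.

```python
import random

# Hypothetical mixture weights, chosen only for illustration.
MIXTURE = {"text": 0.5, "audio": 0.2, "video": 0.3}


def sample_sources(n, mixture=MIXTURE, seed=0):
    """Draw n source labels with probability proportional to their weight."""
    rng = random.Random(seed)  # seeded so a given schedule is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Calling `sample_sources(8)` yields a list of eight labels drawn from `text`, `audio`, and `video`. A curriculum could vary `mixture` across training stages, which is left out here.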

🎯 3. Training Strategies · §5

  • Multi-Objective Loss Recipes: Unifying continuous-discrete objectives (autoregressive next-token prediction + diffusion steps).

  • Scaling Dynamics: Computed token-allocation strategies to maintain gradient stability at the 1T+ MoE frontier.
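One simple way to combine a discrete and a continuous objective is a weighted sum of a next-token cross-entropy term and a denoising term. The sketch below assumes a noise-prediction mean squared error as the diffusion objective and a single scalar `diffusion_weight`; both are illustrative assumptions, not a recipe taken from the roadmap paper.

```python
import math


def next_token_loss(logits, target):
    """Cross-entropy of one next-token prediction, computed from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise (assumed objective)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


def joint_loss(logits, target, predicted_noise, true_noise, diffusion_weight=1.0):
    """Weighted sum of the discrete (text) and continuous (diffusion) terms."""
    discrete = next_token_loss(logits, target)
    continuous = denoising_loss(predicted_noise, true_noise)
    return discrete + diffusion_weight * continuous
```

With uniform logits over two tokens and a unit noise error, `joint_loss([0.0, 0.0], 0, [1.0], [0.0], diffusion_weight=0.5)` equals `log(2) + 0.5`. In practice the weight is a tuning choice, and the two terms are usually computed on different token positions.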

⚡ 4. Inference & Deployment · §6

  • Full-Duplex Orchestration: Dynamic KV-cache eviction & multi-scale attention patterns for real-time interaction (<100 ms).

  • Hardware-Native Compilation: Distributed CUDA compute kernels for unified cross-modal token routing.
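As a toy illustration of KV-cache eviction, the sketch below bounds a cache by keeping a few pinned entries at the start plus the most recent entries. This is a static positional policy and is simpler than the dynamic eviction named above; the function name `evict` and the parameter `num_pinned` are hypothetical, and a real system might instead score entries by how much attention they receive.

```python
def evict(cache, max_entries, num_pinned=4):
    """Return a copy of cache trimmed to at most max_entries items.

    Keeps the first num_pinned entries and fills the rest with the newest ones.
    """
    if max_entries < num_pinned:
        raise ValueError("max_entries must be at least num_pinned")
    if len(cache) <= max_entries:
        return list(cache)
    recent = max_entries - num_pinned
    # Index from the front so that recent == 0 gives an empty tail.
    tail = cache[len(cache) - recent:]
    return list(cache[:num_pinned]) + list(tail)
```

For example, `evict(list(range(10)), max_entries=6, num_pinned=2)` keeps positions 0 and 1 plus the last four positions. Each list item stands in for one token's key and value tensors.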

🧪 5. Evaluation Benchmarks · §7

  • Symmetric Evaluation Matrices: Benchmarking systems capable of examining interleaved multi-modal sequences without suffering from target-modality collapse.


🤝 Contributing

Contributions are very welcome! If a notable native multimodal model is missing or you find an outdated link, please open an Issue or send a Pull Request.

The preferred entry format is:

- **<Model Name>** [<Authors / Team>, <Year>] — [`💻 GitHub`](https://...) · [`📄 Paper`](https://...)
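For contributors who prepare several entries at once, a small helper along these lines could render a line in the format above. The function name `format_entry` and its parameters are hypothetical and not part of this repository; special characters are written as escapes so the output matches the template regardless of editor encoding.

```python
def format_entry(name, team, year, github_url, paper_url):
    """Render one list entry in the preferred entry format."""
    dash = "\u2014"         # the dash that separates the citation from the links
    dot = "\u00b7"          # middle dot between the two links
    laptop = "\U0001F4BB"   # emoji used in the GitHub link label
    page = "\U0001F4C4"     # emoji used in the Paper link label
    return (
        f"- **{name}** [{team}, {year}] {dash} "
        f"[`{laptop} GitHub`]({github_url}) {dot} [`{page} Paper`]({paper_url})"
    )
```

The result of `format_entry("ExampleModel", "Example Team", 2026, "https://example.com/code", "https://example.com/paper")` can be pasted into the relevant section; the model name and URLs here are placeholders.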

✍️ Citation

If our formalization, taxonomy, or roadmap framework assists your research, please cite our paper:

@article{TencentYoutuLab2026toward,
  title   = {Toward Native Multimodal Modeling: A Roadmap},
  author  = {Siyu An and Junru Lu and Junnan Dong and others},
  journal = {arXiv preprint},
  year    = {2026}
}

Maintained by the NMM-Roadmap community · Made with ❤️ for open multimodal research.
