🤗 Hugging Face | 📖 Technical Blog
MiniCPM-SALA (Sparse Attention and Linear Attention) introduces the first large-scale hybrid architecture that systematically integrates 25% sparse attention (InfLLM-v2) with 75% linear attention (Lightning Attention) for efficient ultra-long context modeling.
By combining high-fidelity long context modeling with globally efficient recurrent computation—and further empowered by HyPE, a hybrid positional embedding scheme—the model scales to million-token context windows while preserving strong length generalization.
- Performance: Compared to dense Transformer baselines (e.g., Qwen3-8B), MiniCPM-SALA achieves up to a 3.5× inference speedup under long-context settings, significantly reducing both compute and KV-cache overhead.
- Methodology: To ensure performance retention, we propose a novel Transformer-to-hybrid distillation recipe, initializing from MiniCPM-4 and applying structured decay and post-training adaptation to effectively transfer dense attention capabilities into the hybrid architecture.
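To make the 25%/75% hybrid layout concrete, here is a purely conceptual sketch of one way such a stack could be interleaved. The layer count and the every-fourth-layer pattern are illustrative assumptions, not the actual MiniCPM-SALA configuration.

```python
# Conceptual sketch (not the actual MiniCPM-SALA implementation): one way to
# lay out a hybrid stack in which roughly 25% of layers use sparse attention
# and the remaining 75% use linear attention. The layer count and the
# interleaving pattern below are illustrative assumptions only.

def hybrid_layer_layout(num_layers: int = 32, sparse_every: int = 4) -> list[str]:
    """Assign an attention type to each layer: every `sparse_every`-th layer
    is sparse (InfLLM-v2-style), the rest are linear (Lightning-style)."""
    return [
        "sparse" if (i + 1) % sparse_every == 0 else "linear"
        for i in range(num_layers)
    ]

layout = hybrid_layer_layout()
print(layout.count("sparse") / len(layout))  # 0.25 -> 25% sparse layers
print(layout[:8])  # ['linear', 'linear', 'linear', 'sparse', ...]
```

The intuition behind such a split is the one stated above: the sparse-attention layers preserve high-fidelity access to distant tokens, while the linear-attention layers keep per-token compute and cache cost roughly constant in sequence length.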
Cook up amazing long-context applications efficiently with MiniCPM-SALA, bringing unparalleled context understanding and speed right to your fingertips!
Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
We support a wide range of users, from individuals to enterprises and researchers.
- Individuals: Enjoy effortless inference with Hugging Face Transformers and minimal setup (see the minimal sketch after this list).
- Enterprises: Achieve high-throughput, scalable performance with vLLM or SGLang.
- Researchers: Leverage advanced frameworks, including Transformers Trainer and LLaMA-Factory, to enable flexible model development and cutting-edge experimentation.
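For individual users, a minimal inference sketch with Hugging Face Transformers might look like the following. The model id is a placeholder assumption; substitute the actual repository name from the Hugging Face page linked above. Custom hybrid architectures typically require `trust_remote_code=True`.

```python
# Minimal local-inference sketch with Hugging Face Transformers.
# The model id is a placeholder (assumption) -- use the actual repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/MiniCPM-SALA"  # placeholder model id (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick bf16/fp16 automatically when available
    device_map="auto",    # place weights on the available GPU(s)
    trust_remote_code=True,
)

prompt = "Summarize the key ideas of hybrid sparse/linear attention."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

For enterprise deployments, the same checkpoint can be served through vLLM or SGLang via their standard serving entry points; consult each project's documentation for hybrid-attention model support.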
Customize your model with your own ingredients. For more detailed fine-tuning instructions, check out the finetune subdirectory and its corresponding README.md.
We provide training methods serving different needs as follows (a minimal Trainer-based skeleton is sketched after the table):
| Framework | Description |
|---|---|
| Transformers Trainer | Most flexible for low-level customization. |
| LLaMA-Factory | Modular fine-tuning toolkit. |
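As a rough illustration of the Transformers Trainer path, the skeleton below fine-tunes a causal LM on a small public text dataset. The model id, dataset, and hyperparameters are illustrative assumptions only; the recipes actually shipped with this cookbook live in the finetune subdirectory and its README.md.

```python
# Generic supervised fine-tuning skeleton with the Transformers Trainer.
# Model id, dataset, and hyperparameters are illustrative assumptions --
# see finetune/README.md for the recipes shipped with this cookbook.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "openbmb/MiniCPM-SALA"  # placeholder model id (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Any plain-text dataset works for this skeleton; wikitext is just an example.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
raw = raw.filter(lambda x: x["text"].strip())  # drop empty lines

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="minicpm-sala-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```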
We love new recipes! Please share your creative dishes:
- Fork the repository
- Create your recipe
- Submit a pull request
- Found a bug? Open an issue
This cookbook is developed by OpenBMB.
This cookbook is served under the Apache-2.0 License - cook freely, share generously! 🍳
If you find our model, code, or paper helpful, please consider citing our papers 📝 and starring us ⭐️!
@article{minicpmsala,
title={{MiniCPM-SALA}: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling},
author={MiniCPM Team},
year={2026}
}