Boost Decoding Efficiency via High-Locality Offloading
NOSA is a trainable sparse attention mechanism designed for KV-cache offloading with an explicit locality constraint, paired with an inference system (NOSI) to realize its efficiency. It improves long-context/long-generation quality over prior offloading baselines while boosting decoding throughput by up to 5.04× vs FullAttn, 1.92× vs InfLLMv2, and 1.83× vs ShadowKV on 1B/3B/8B LLMs.
We train 1B, 3B, and 8B models with each of FullAttn, InfLLMv2, DMA, and NOSA, resulting in 12 models in total. The following models have been released on Hugging Face.
| Model | Link |
|---|---|
| NOSA-1B | NOSA-1B |
| NOSA-3B | NOSA-3B |
| NOSA-8B | NOSA-8B |
Please reach out to us if additional baseline models (FullAttn, InfLLMv2, or DMA) are needed. You may open an issue or contact us directly via email (our email addresses are provided in the paper).
We set up our experimental environment using uv inside Docker. If you need to build the Docker image from scratch, please refer to dependencies/README.md.
First, download our Docker image from ModelScope: huangyx21/nosa-env-docker. Please start from this image, where most dependencies have been pre-installed to greatly simplify environment setup.
```bash
pip install modelscope
modelscope download --model huangyx21/nosa-env-docker nosa_comp.tar --local_dir ./dependencies
docker import ./dependencies/nosa_comp.tar nosa:newest
# Then, please set up the directory mapping in ./dependencies/launch_docker.sh manually.
bash ./dependencies/launch_docker.sh nosa:newest # This takes a while
```
We have pre-installed four virtual environments. You can activate each one by executing the corresponding command below. For vLLM and SGLang evaluations, please refer to their official Docker images.
```bash
source /venv/nosa/bin/activate     # for NOSA, FullAttn, InfLLMv2, DMA
source /venv/shadowkv/bin/activate # for ShadowKV
source /venv/infllm/bin/activate   # for InfLLM
source /venv/arkvale/bin/activate  # for ArkVale
```
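As a quick reference, the method-to-venv mapping above can be summarized in a small helper (not part of the repo; the function and dictionary names are illustrative, the paths are the ones listed above):

```python
# Map each evaluation method to its pre-installed venv (paths from this README).
VENVS = {
    "nosa": "/venv/nosa", "fullattn": "/venv/nosa", "infllmv2": "/venv/nosa",
    "dma": "/venv/nosa", "shadowkv": "/venv/shadowkv",
    "infllm": "/venv/infllm", "arkvale": "/venv/arkvale",
}

def activate_command(method: str) -> str:
    """Return the `source` command that activates the venv for `method`."""
    try:
        venv = VENVS[method.lower()]
    except KeyError:
        raise ValueError(f"unknown method: {method}") from None
    return f"source {venv}/bin/activate"

print(activate_command("ShadowKV"))  # source /venv/shadowkv/bin/activate
```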
For flexibility, two packages are left not pre-installed: the LM-Harness-Eval evaluation framework (for general tasks) and ShadowKV. Please install them as follows, assuming you are in the repository's root directory.
- LM-Harness-Eval:
  ```bash
  source /venv/nosa/bin/activate
  uv pip install -e benchmarks/lm-evaluation-harness
  ```
- ShadowKV:
  ```bash
  source /venv/shadowkv/bin/activate
  export TORCH_CUDA_ARCH_LIST=8.0 # Change to your GPU architecture
  cd dependencies/ShadowKV
  uv pip install -e . --no-build-isolation
  ```
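The value of TORCH_CUDA_ARCH_LIST is your GPU's compute capability in `major.minor` form. A tiny sketch of how to derive it (the helper name is illustrative; on a machine with PyTorch and a visible GPU, the capability tuple can be queried as noted in the docstring):

```python
def arch_entry(capability):
    """Format a CUDA compute capability tuple as a TORCH_CUDA_ARCH_LIST entry.

    With PyTorch installed and a GPU visible, the tuple can be obtained via
    torch.cuda.get_device_capability() -- e.g. an A100 reports (8, 0).
    """
    major, minor = capability
    return f"{major}.{minor}"

print(arch_entry((8, 0)))  # 8.0  ->  export TORCH_CUDA_ARCH_LIST=8.0
```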
Also, please install NOSI as follows.
```bash
uv pip install ./nosi
```
We run all methods on LongBench and HELMET.
- LongBench
  ```bash
  cd benchmarks/LongBench
  # download test data
  bash download_data.sh
  # activate the corresponding venv
  source /venv/nosa/bin/activate
  # run LongBench
  python pred.py --model 8b_nosa_sft
  python eval.py --model 8b_nosa_sft
  cd -
  ```
- HELMET
  ```bash
  cd benchmarks/HELMET
  # download test data
  bash scripts/download_data.sh
  # activate the corresponding venv
  source /venv/nosa/bin/activate
  # run HELMET
  python eval.py --output_dir output
  bash collect_result.sh
  cd -
  ```
For general tasks, run LM-Harness-Eval as follows.
```bash
cd benchmarks/lm-evaluation-harness
# activate the corresponding venv
source /venv/nosa/bin/activate
bash run_nosa.sh && bash run_infllmv2.sh && bash run_full.sh && bash run_dma.sh
cd -
```
Each setting has a test_xxx_pg19.sh script in benchmarks/Efficiency; running it directly reports the decoding throughput.
```bash
cd benchmarks/Efficiency
# activate the corresponding venv
source /venv/nosa/bin/activate
# for example: NOSA+NOSI
bash test_nosa_pg19.sh
cd -
```
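For reference, the decoding throughput these scripts report is conceptually just generated tokens divided by wall-clock decode time. A minimal, framework-agnostic sketch (not the repo's implementation; `generate_step` is a hypothetical stand-in for one decode step):

```python
import time

def decoding_throughput(generate_step, num_tokens):
    """Measure decoding throughput in tokens/s for a per-token decode function."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_step()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Toy stand-in: each "decode step" sleeps 1 ms, so throughput is at most ~1000 tokens/s.
tput = decoding_throughput(lambda: time.sleep(0.001), 100)
print(f"{tput:.0f} tokens/s")
```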
Some content of this repository is adapted from LongBench, HELMET, RULER, lm-evaluation-harness, ShadowKV, ArkVale, InfLLM, and InfLLMv2.
```bibtex
@article{huang2025nosa,
  title={NOSA: Native and Offloadable Sparse Attention},
  author={Huang, Yuxiang and Wang, Pengjie and Han, Jicheng and Zhao, Weilin and Su, Zhou and Sun, Ao and Lyu, Hongya and Zhao, Hengyu and Wang, Yudong and Xiao, Chaojun and Han, Xu and Liu, Zhiyuan},
  journal={arXiv preprint arXiv:2510.13602},
  year={2025}
}
```