NOSA: Native and Offloadable Sparse Attention

Boost Decoding Efficiency via High-Locality Offloading

Overview

NOSA is a trainable sparse attention mechanism designed for KV-cache offloading with an explicit locality constraint, paired with an inference system (NOSI) that realizes its efficiency in practice. It improves long-context and long-generation quality over prior offloading baselines while raising decoding throughput by up to 5.04× over FullAttn, 1.92× over InfLLMv2, and 1.83× over ShadowKV on 1B/3B/8B LLMs.

Framework overview figure: framework_github
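
To make the locality constraint concrete: at each decoding step only a small budget of KV blocks is attended to, and the constraint bounds how much that selection may change between steps, so only a few blocks ever need to be fetched from CPU memory. The sketch below is our own minimal illustration of this idea, not the actual NOSA/NOSI selection rule; the function name, block granularity, and greedy fill-in strategy are assumptions for illustration only.

# Illustrative sketch only -- NOT the NOSA selection rule.
# Idea: pick `budget` KV blocks per step, but allow at most `max_new`
# blocks that were not resident on GPU at the previous step, bounding
# CPU-to-GPU transfers during decoding.
import torch

def select_blocks_with_locality(scores: torch.Tensor,
                                prev_selected: torch.Tensor,
                                budget: int,
                                max_new: int) -> torch.Tensor:
    """scores: [num_blocks] relevance of each KV block to the current query.
    prev_selected: block indices chosen at the previous step (already on GPU);
    assumed to contain at least budget - max_new entries.
    budget: number of blocks to attend to this step.
    max_new: cap on blocks fetched from CPU this step (locality constraint)."""
    num_blocks = scores.numel()
    prev_mask = torch.zeros(num_blocks, dtype=torch.bool, device=scores.device)
    prev_mask[prev_selected] = True

    # Unconstrained top-k candidates.
    top = torch.topk(scores, budget).indices
    is_new = ~prev_mask[top]
    if int(is_new.sum()) <= max_new:
        return top  # already satisfies the locality constraint

    # Keep the max_new highest-scoring new blocks ...
    new_candidates = top[is_new]
    keep_new = new_candidates[torch.topk(scores[new_candidates], max_new).indices]
    # ... and fill the rest of the budget from blocks already resident on GPU.
    keep_old = prev_selected[torch.topk(scores[prev_selected], budget - max_new).indices]
    return torch.cat([keep_new, keep_old])

# Example usage with random scores and 64 blocks:
scores = torch.rand(64)
prev = torch.topk(torch.rand(64), 16).indices
selected = select_blocks_with_locality(scores, prev, budget=16, max_new=4)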

Models

We train 1B, 3B, and 8B models with FullAttn, InfLLMv2, DMA, and NOSA, for a total of 12 models. The following models have been released on Hugging Face.

| Model   | Link    |
|---------|---------|
| NOSA-1B | NOSA-1B |
| NOSA-3B | NOSA-3B |
| NOSA-8B | NOSA-8B |

Please reach out to us if additional baseline models (FullAttn, InfLLMv2, or DMA) are needed. You may open an issue or contact us directly via email (our email addresses are provided in the paper).
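
A minimal loading sketch with Hugging Face transformers is shown below. The repo id thunlp/NOSA-8B is an assumption based on the model names above, and trust_remote_code=True is included in case the checkpoints ship custom modeling code; please check the model pages for the exact ids and recommended loading options.

# Minimal loading sketch; "thunlp/NOSA-8B" is a hypothetical repo id --
# check the Hugging Face model page for the exact id and loading options.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thunlp/NOSA-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # in case the checkpoint ships custom attention code
)

prompt = "Long-context decoding with an offloaded KV cache:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))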

Setup

We set up our experimental environment with uv inside Docker. If you need to build the Docker environment from scratch, please refer to dependencies/README.md.

First, download our Docker image from ModelScope: huangyx21/nosa-env-docker. Please start from this image, where most dependencies have been pre-installed to greatly simplify environment setup.

pip install modelscope
modelscope download --model huangyx21/nosa-env-docker nosa_comp.tar --local_dir ./dependencies
docker import ./dependencies/nosa_comp.tar nosa:newest
# Then, please set up the directory mapping in ./dependencies/launch_docker.sh manually.
bash ./dependencies/launch_docker.sh nosa:newest # This takes a while

We have pre-installed four virtual environments. You can activate each one by executing the corresponding command below. For vLLM and SGLang evaluations, please refer to their official Docker images.

source /venv/nosa/bin/activate # for NOSA, FullAttn, InfLLMv2, DMA
source /venv/shadowkv/bin/activate # for ShadowKV
source /venv/infllm/bin/activate # for InfLLM
source /venv/arkvale/bin/activate # for ArkVale

For flexibility, the evaluation framework lm-evaluation-harness (used for general tasks) and ShadowKV are not pre-installed. Please install these two packages as follows, assuming you are in the repository's root directory.

  • lm-evaluation-harness:
source /venv/nosa/bin/activate
uv pip install -e benchmarks/lm-evaluation-harness
  • ShadowKV:
source /venv/shadowkv/bin/activate
export TORCH_CUDA_ARCH_LIST=8.0 # Change to your GPU architecture
cd dependencies/ShadowKV
uv pip install -e . --no-build-isolation

Also, please install NOSI as follows.

uv pip install ./nosi

Run Experiments

Long-Input Evaluation

We run all methods on LongBench and HELMET.

  • LongBench
cd benchmarks/LongBench

# download test data
bash download_data.sh
# activate the corresponding venv
source /venv/nosa/bin/activate
# run LongBench
python pred.py --model 8b_nosa_sft
python eval.py --model 8b_nosa_sft

cd -
  • HELMET
cd benchmarks/HELMET

# download test data
bash scripts/download_data.sh
# activate the corresponding venv
source /venv/nosa/bin/activate
# run HELMET
python eval.py --output_dir output
bash collect_result.sh

cd -

General Tasks

cd benchmarks/lm-evaluation-harness

# activate the corresponding venv
source /venv/nosa/bin/activate
bash run_nosa.sh && bash run_infllmv2.sh && bash run_full.sh && bash run_dma.sh

cd -

Decoding Efficiency Tests

Each setting has a corresponding test_xxx_pg19.sh script in benchmarks/Efficiency. Running a script directly reports the decoding throughput for that setting; a generic sketch of what such a measurement looks like follows the commands below.

cd benchmarks/Efficiency

# activate the corresponding venv
source /venv/nosa/bin/activate
# for example: NOSA+NOSI
bash test_nosa_pg19.sh

cd -
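
Here, decoding throughput means generated tokens per second. The snippet below is only a generic, self-contained sketch of such a measurement with Hugging Face transformers (the repo id is a placeholder, the timing includes prefill, and a CUDA GPU is assumed); the test_*_pg19.sh scripts remain the authoritative way to reproduce the paper's numbers.

# Generic throughput-measurement sketch; "thunlp/NOSA-8B" is a hypothetical
# repo id, and this timing includes the prompt prefill, unlike the repo's
# dedicated benchmark scripts.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thunlp/NOSA-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompt = "Once upon a time, " * 512  # stand-in for a long book prefix
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
max_new_tokens = 256

torch.cuda.synchronize()  # assumes a CUDA GPU
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"decoding throughput: {generated / elapsed:.2f} tokens/s")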

Acknowledgment

Parts of this repository are adapted from LongBench, HELMET, RULER, lm-evaluation-harness, ShadowKV, ArkVale, InfLLM, and InfLLMv2.

Citation

@article{huang2025nosa,
  title={NOSA: Native and Offloadable Sparse Attention},
  author={Huang, Yuxiang and Wang, Pengjie and Han, Jicheng and Zhao, Weilin and Su, Zhou and Sun, Ao and Lyu, Hongya and Zhao, Hengyu and Wang, Yudong and Xiao, Chaojun and Han, Xu and Liu, Zhiyuan},
  journal={arXiv preprint arXiv:2510.13602},
  year={2025}
}
