Code for AAAI'25 "Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference"
See requirements.txt.
Alternatively, you can use docker file zongyueq/llmss:0.0.2, then with command source ~/miniconda3/bin/activate myenv; conda activate myenv;
Your GPU needs to support nvidia-smi to measure GPU energy consumption.
The SQUAD dataset will be downloaded automatically. To use Spider dataset, download the data from https://yale-lily.github.io/spider and uncompress it under the DSBD directory. Make sure the file path in execution\_accuracy of sampling/utils.py is correct.
python evaluation.py --approx_model_name meta-llama/Llama-3.2-1B --target_model_name meta-llama/Llama-3.1-8B --max_tokens 200 --max_seconds 10000 --log_file /llmss/DSBD/logs/tmp.log --dataset squad --top_k=10 --top_p=0.9 --num_inputs=10
- appox_model_name: path of the draft model
- target_model_name: path of the target model
- max_tokens: the number of tokens to generate (values we used: 100, 200)
- max_seconds: the time limit for each method
- log_file: path to the log file
- dataset: squad or spider
- top_k: k for top k sampling (values we used: 10, 20)
- top_p: p for top p sampling (values we used: 0.8, 0.9)
- num_inputs: the number of inputs to test (values we used: 100, 200)
To run experiments with MT-Bench, please download its repo and replace the decoding function in "gen_model_answer.py" with our "beam_speculative_sampling".
The code is forked from "https://github.com/feifeibear/LLMSpeculativeSampling"