llama : add pipeline parallelism support #6017
Conversation
llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs (ggml-ci)
Do you mind also adding …
Posting some results on 8x A100:

LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./llama-bench \
-m models/codellama-7b/ggml-model-f16.gguf \
-m models/codellama-7b/ggml-model-q8_0.gguf \
-m models/codellama-7b/ggml-model-q4_k.gguf \
-m models/codellama-7b/ggml-model-q4_0.gguf \
    -ngl 99 -p 512,1024,2048,4096,8192 -b 8192

master (1x GPU): build: d8fd0cc (2412)
master (2x GPUs): build: d8fd0cc (2412)
master (4x GPUs): build: d8fd0cc (2412)
master (8x GPUs): build: d8fd0cc (2412)
old (sl/micro-batching, 8x GPUs) with ub=256: build: af789e7 (1861)

new (x1 GPU):
model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 512 | 8237.93 ± 21.75 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 1024 | 7823.11 ± 17.30 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 2048 | 6974.75 ± 9.76 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 4096 | 5594.67 ± 4.54 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 8192 | 4527.03 ± 8.31 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | tg 128 | 73.87 ± 0.16 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 512 | 6728.34 ± 16.10 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 1024 | 7055.80 ± 8.37 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 2048 | 6674.59 ± 9.64 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 4096 | 5496.80 ± 7.27 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 8192 | 4449.37 ± 4.71 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | tg 128 | 114.80 ± 0.33 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 512 | 6125.22 ± 13.83 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 1024 | 6690.14 ± 19.46 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 2048 | 6501.66 ± 6.32 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 4096 | 5427.55 ± 8.96 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 8192 | 4404.22 ± 3.19 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | tg 128 | 129.97 ± 0.30 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 512 | 6109.62 ± 5.88 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 1024 | 6678.93 ± 21.22 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 2048 | 6496.22 ± 14.22 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 4096 | 5418.54 ± 4.24 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | pp 8192 | 4408.39 ± 3.31 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 4096 | tg 128 | 145.15 ± 0.50 |
build: 54cdd47 (2424)
new (x2 GPUs):
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 10090.00 ± 25.47 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 11267.24 ± 20.31 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 11404.60 ± 14.40 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 10493.05 ± 10.40 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 8587.92 ± 4.60 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 72.08 ± 0.21 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 7276.31 ± 13.11 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 8212.10 ± 7.13 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 8474.57 ± 12.77 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 8044.92 ± 16.42 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 6922.49 ± 3.52 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 111.61 ± 0.45 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 6241.75 ± 10.47 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 7085.78 ± 15.74 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 7372.92 ± 5.08 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 7087.04 ± 3.72 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 6186.62 ± 2.76 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 126.62 ± 0.41 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 6198.75 ± 6.54 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 7045.06 ± 7.38 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 7345.98 ± 3.19 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 7083.79 ± 3.68 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 6192.83 ± 2.38 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 140.12 ± 0.65 |
build: 54cdd47 (2424)
new (x4 GPUs):
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 11813.02 ± 22.83 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 15387.27 ± 33.61 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 17609.80 ± 41.84 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 17618.77 ± 18.29 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 15100.46 ± 21.89 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 71.87 ± 0.22 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 8604.28 ± 4.99 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 11357.60 ± 4.19 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 13201.46 ± 18.92 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 13411.17 ± 21.39 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 11882.45 ± 10.51 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 110.12 ± 0.59 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 7401.93 ± 6.81 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 9828.07 ± 11.61 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 11516.67 ± 20.88 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 11881.12 ± 20.03 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 10722.42 ± 24.25 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 124.27 ± 0.74 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 7293.93 ± 11.64 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 9681.32 ± 20.72 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 11390.60 ± 12.97 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 11770.08 ± 14.25 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 10639.98 ± 10.34 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 137.21 ± 0.81 |
build: 54cdd47 (2424)
new (x8 GPUs):
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 12100.57 ± 193.33 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 17591.46 ± 132.00 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 22931.45 ± 144.20 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 25504.36 ± 434.87 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 23820.97 ± 188.92 |
llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 71.87 ± 0.02 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 9266.34 ± 59.90 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 13796.83 ± 29.05 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 17820.71 ± 327.33 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 20105.47 ± 38.39 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 18915.61 ± 58.18 |
llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 108.56 ± 0.90 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 7983.52 ± 84.85 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 12007.61 ± 19.79 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 15755.47 ± 74.45 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 17909.72 ± 28.69 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 17152.45 ± 22.33 |
llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 122.04 ± 0.95 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 512 | 7908.16 ± 5.14 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 1024 | 11764.20 ± 142.62 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 2048 | 15567.45 ± 33.99 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 4096 | 17668.75 ± 59.72 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | pp 8192 | 16992.44 ± 27.34 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | 256 | tg 128 | 135.03 ± 1.31 |
build: 54cdd47 (2424)
ppl (7B), -c 512 -b 2048, -ub 256
- runtime: 30.7s
LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-7b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw
llm_load_tensors: ggml ctx size = 1.00 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 250.12 MiB
llm_load_tensors: CUDA0 buffer size = 1930.16 MiB
llm_load_tensors: CUDA1 buffer size = 1544.12 MiB
llm_load_tensors: CUDA2 buffer size = 1544.12 MiB
llm_load_tensors: CUDA3 buffer size = 1544.12 MiB
llm_load_tensors: CUDA4 buffer size = 1544.12 MiB
llm_load_tensors: CUDA5 buffer size = 1544.12 MiB
llm_load_tensors: CUDA6 buffer size = 1544.12 MiB
llm_load_tensors: CUDA7 buffer size = 1408.23 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 128.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 96.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 128.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.02 MiB
llama_new_context_with_model: graph splits: 9
system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 922.755 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.15 seconds per pass - ETA 0.38 minutes
[1]5.5511,[2]6.0573,[3]6.7593,[4]7.7007,[5]7.8618, ... [651]7.2079,[652]7.2096,[653]7.2143,[654]7.2065,[655]7.2061,
Final estimate: PPL = 7.2061 +/- 0.04271
llama_print_timings: load time = 1602.72 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 15310.80 ms / 335360 tokens ( 0.05 ms per token, 21903.50 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 25860.90 ms / 335361 tokens
real 0m30.712s
user 1m12.300s
sys 0m25.268s
master (1x GPU, 13B):
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
model | size | params | backend | ngl | n_batch | test | t/s |
---|---|---|---|---|---|---|---|
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 4096 | pp 512 | 5111.70 ± 47.07 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 4096 | pp 1024 | 4922.35 ± 35.87 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 4096 | pp 2048 | 4413.17 ± 10.56 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 4096 | tg 128 | 45.28 ± 0.05 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 4096 | pp 512 | 4036.98 ± 25.68 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 4096 | pp 1024 | 4350.98 ± 14.85 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 4096 | pp 2048 | 4162.22 ± 9.86 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 4096 | tg 128 | 68.79 ± 0.04 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 4096 | pp 512 | 3615.92 ± 20.41 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 4096 | pp 1024 | 4062.81 ± 27.69 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 4096 | pp 2048 | 4006.28 ± 11.71 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 4096 | tg 128 | 83.52 ± 0.05 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 4096 | pp 512 | 3608.53 ± 20.18 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 4096 | pp 1024 | 4059.13 ± 20.43 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 4096 | pp 2048 | 4004.22 ± 12.23 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 4096 | tg 128 | 95.35 ± 0.08 |
build: 99b71c0 (2410)
new (x8 GPUs, 13B):
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|---|
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 512 | 8142.20 ± 3.99 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 1024 | 11977.25 ± 18.75 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 2048 | 15377.04 ± 19.31 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 4096 | 16842.23 ± 240.43 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 8192 | 15316.75 ± 11.69 |
llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | tg 128 | 43.96 ± 0.12 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 512 | 5605.35 ± 3.46 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 1024 | 8428.62 ± 0.80 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 2048 | 10686.25 ± 14.22 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 4096 | 11824.00 ± 20.43 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 8192 | 11228.18 ± 12.87 |
llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | tg 128 | 66.32 ± 0.38 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 512 | 4775.66 ± 3.41 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 1024 | 7225.87 ± 4.09 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 2048 | 9251.36 ± 4.16 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 4096 | 10346.24 ± 19.27 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 8192 | 10018.40 ± 6.89 |
llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | tg 128 | 80.95 ± 0.48 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 512 | 4731.63 ± 2.33 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 1024 | 7174.17 ± 4.75 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 2048 | 9167.04 ± 2.94 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 4096 | 10255.81 ± 20.46 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | pp 8192 | 9943.68 ± 11.66 |
llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | 256 | tg 128 | 90.05 ± 0.70 |
build: 54cdd47 (2424)
ppl (13B), -c 512 -b 2048, -ub 256
- runtime: 39.3s
LLAMA_SCHED_MAX_COPIES=8 LLAMA_CUBLAS=1 make -j perplexity && time CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./perplexity -m models/codellama-13b/ggml-model-f16.gguf -ngl 99 -c 512 -b 2048 -ub 256 -f wikitext-2-raw/wiki.test.raw
llm_load_tensors: ggml ctx size = 1.25 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 312.66 MiB
llm_load_tensors: CUDA0 buffer size = 3630.23 MiB
llm_load_tensors: CUDA1 buffer size = 3025.20 MiB
llm_load_tensors: CUDA2 buffer size = 3025.20 MiB
llm_load_tensors: CUDA3 buffer size = 3025.20 MiB
llm_load_tensors: CUDA4 buffer size = 3025.20 MiB
llm_load_tensors: CUDA5 buffer size = 3025.20 MiB
llm_load_tensors: CUDA6 buffer size = 3025.20 MiB
llm_load_tensors: CUDA7 buffer size = 2732.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 240.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 200.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 250.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=8)
llama_new_context_with_model: CUDA0 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 156.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 21.02 MiB
llama_new_context_with_model: graph splits: 9
system_info: n_threads = 126 / 252 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 969.675 ms
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.21 seconds per pass - ETA 0.57 minutes
[1]4.7515,[2]5.5743,[3]6.2923,[4]7.1020,[5]7.3440, ... [651]6.5940,[652]6.5949,[653]6.5985,[654]6.5921,[655]6.5914,
Final estimate: PPL = 6.5914 +/- 0.03810
llama_print_timings: load time = 2846.46 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 22401.71 ms / 335360 tokens ( 0.07 ms per token, 14970.29 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 33071.41 ms / 335361 tokens
real 0m39.310s
user 1m16.005s
sys 0m31.222s
You can usually get better performance with F16 models with … It is also possible to test the performance in a real scenario with …
This breaks Mamba, but it's fixable. To help with fixing, I managed to at least make `main` work with compilade@3e06fca, but `parallel` still triggers an assert with Mamba. I'll investigate further.
@compilade thanks for testing, feel free to push your fixes here directly.
Tested to work correctly with both `main` and `parallel` examples.
add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism; the default increases to 4 (from 2). Changing this value may improve performance for some systems, but increases memory usage.
(force-pushed from 65bbb1a to 89bfa1f)
I like that the input tensors are allocated in the graph; it's much cleaner than before.
But I think the new default size of the logits buffer might be too big.
llama.cpp (outdated):

@@ -12537,7 +12582,8 @@ struct llama_context_params llama_context_default_params() {
    struct llama_context_params result = {
        /*.seed    =*/ LLAMA_DEFAULT_SEED,
        /*.n_ctx   =*/ 512,
-       /*.n_batch =*/ 512,
+       /*.n_batch =*/ 4096,
4096 seems a bit big for a default logical batch size. For a model with a vocab size of 50280, the logits buffer takes `50280*4096*4/1024/1024 = 785.63 MiB`, while with the previous default batch size of 512 it took `50280*512*4/1024/1024 = 98.20 MiB`.

This only depends on the vocab and logical batch sizes, so the logits buffer for a small model like Mamba-130m (a 256.96 MiB model in `f16`) would take 3 times as much memory as the model weights with a default `n_batch` of 4096.

And it doesn't really fit with the default `n_ctx` of 512; an `n_batch` bigger than `n_ctx` won't ever be used completely (unless there's a way I didn't think of), and is thus wasted memory.

I suggest either clamping `n_batch` to `n_ctx`, or (preferably) making the default `n_batch` equal to the default `n_ctx` again.
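As a rough standalone sketch (not part of the PR), the buffer-size arithmetic above can be reproduced like this:

```cpp
#include <cstdio>

int main() {
    const long long n_vocab = 50280; // vocab size from the example above
    const long long sz_f32  = 4;     // logits are stored as 32-bit floats

    // compare the previous (512) and proposed (4096) default logical batch sizes
    const long long batch_sizes[] = {512, 4096};
    for (long long n_batch : batch_sizes) {
        const double mib = (double)(n_vocab * n_batch * sz_f32) / 1024.0 / 1024.0;
        printf("n_batch = %4lld -> logits buffer = %.2f MiB\n", n_batch, mib);
    }
    return 0;
}
```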
A bigger batch size is what allows pipeline parallelism to work. For example, if the application submits a batch of 4096 tokens to `llama_decode`, these will be split into mini-batches of 512 tokens each (`n_ubatch`) and evaluated as a pipeline in parallel between the available GPUs. It could be reduced to 2048 and still get most of the benefit.
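Schematically, the splitting described above amounts to the following (an illustration only, not the actual implementation inside `llama_decode`):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const int n_tokens = 4096; // tokens submitted in one llama_decode call
    const int n_ubatch = 512;  // compute (micro-)batch size

    // llama_decode conceptually walks the batch in n_ubatch-sized chunks;
    // with pipeline parallelism, the first GPU can start on the next chunk
    // while later GPUs are still working on the previous one.
    for (int i = 0; i < n_tokens; i += n_ubatch) {
        const int n = std::min(n_ubatch, n_tokens - i);
        printf("micro-batch: tokens [%d, %d)\n", i, i + n);
    }
    return 0;
}
```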
Pipeline parallelism by default seems desirable. But `n_batch` shouldn't exceed `n_ctx`. Even when passing `--ctx-size 128` to `main`, `n_batch` is still 4096 (from the default), and the 785.63 MiB of logits are still allocated even though they can't be used, since a batch bigger than `n_ctx` will simply not be able to find a big enough KV slot (for Transformer-based models, at least). Clamping `n_batch` to `n_ctx` (with something like `n_batch = std::min(n_batch, n_ctx)`) should fix this.
We can change `llama_new_context_with_model` to limit `n_batch` and `n_ubatch` to `n_ctx`, since there is no advantage to increasing them beyond that.

Ideally, we would set the defaults intelligently according to the hardware of the system, and only increase `n_batch` if the system can actually support pipeline parallelism, which requires several CUDA GPUs and full offload. However, that would require deeper changes.
> But `n_batch` shouldn't exceed `n_ctx`

This is not always the case. When using a model that does not utilize the KV cache (for example, a non-causal embedding model like BERT), we might want to run with `n_ctx = 0, n_batch = 8192`. With 4400153 applied, in such cases we now have to allocate a KV cache due to `n_ctx = 8192`, and it won't be used. Given that, should we revert the clamp change?
We could pre-allocate buffers for `n_max_seq` tokens only during initialization, and increase the size dynamically in `llama_decode` automatically if there is a request for more logits than that.
> We could pre-allocate buffers for `n_max_seq` tokens only during initialization, and increase the size dynamically in `llama_decode` automatically if there is a request for more logits than that.

From what I understand, the logits aren't necessarily contiguous with each other in the output buffer, so yes, pre-allocation and dynamic resizing could be done, but not until the layout of the logits is always made contiguous, with no offset before the first used logits.
We could definitely improve the handling of logits. Even `perplexity` and `imatrix` only need logits for `n_ctx/2` tokens. We could also skip the computation of the output layer for the tokens where logits are not needed. IIRC there was a PR about this a while ago, but it was never merged.
Maybe the `std::vector<float> logits;` can become a `std::vector<std::pair<int, std::vector<float>>>`, where the `int` is the token index in the batch (i.e. `i_batch`).
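A minimal sketch of that proposed layout (the `find_logits` helper is hypothetical, shown here only for illustration):

```cpp
#include <utility>
#include <vector>

// Proposed storage: one logits vector per output token, tagged with the
// token's index in the batch (i_batch).
using indexed_logits = std::vector<std::pair<int, std::vector<float>>>;

// Hypothetical lookup: return the logits for a given batch index, or
// nullptr if no logits were stored for that token.
const std::vector<float> * find_logits(const indexed_logits & logits, int i_batch) {
    for (const auto & entry : logits) {
        if (entry.first == i_batch) {
            return &entry.second;
        }
    }
    return nullptr;
}
```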
Yes, I think we should do something like that, and remove `llama_get_logits` in favor of `llama_get_logits_ith`. However, `logits` is no longer a `std::vector`, since it is allocated in a host buffer, which is necessary for pipeline parallelism; otherwise the copy from the GPU can cause a synchronization. It is also important that the logits are contiguous in memory when possible, to reduce the number of copies for applications such as `perplexity`; there is a significant performance improvement when doing just one `cudaMemcpy` instead of one for each token (which ends up being `n_ctx/2` calls to `cudaMemcpy`).
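To sketch why that matters (illustrative only; `d_logits` and the contiguous layout are assumptions for this example, not the actual llama.cpp code):

```cpp
#include <cuda_runtime.h>

// Copy the logits of n_out tokens (n_vocab floats each) from the device
// to a host buffer. With a contiguous layout this is one large transfer.
void copy_logits_contiguous(float * host, const float * d_logits,
                            int n_out, int n_vocab) {
    cudaMemcpy(host, d_logits, (size_t)n_out * n_vocab * sizeof(float),
               cudaMemcpyDeviceToHost);
    // A non-contiguous layout would instead need one cudaMemcpy per token
    // (n_ctx/2 calls in the perplexity example), each paying the full
    // transfer-launch latency, which is significantly slower.
}
```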
(force-pushed from 202adca to cda49d3)
Here is a recap of the data from @ggerganov:
With master, the performance would be roughly the same as with 1x GPU in all cases.
I noticed the thread sanitizer tests failing, but the errors don't make much sense to me. The errors are intermittent; running the jobs again, the test eventually passes. I suspect that it is an issue with a particular runner. It produces errors such as this: …
@phymbert I've added multi-GPU results for …

@slaren Yes, the thread sanitizer build failures are not related to our code (#5943 (comment))
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs (ggml-ci)
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism (tested to work correctly with both `main` and `parallel` examples)
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism (default increased to 4 from 2; changing this value may improve performance for some systems, but increases memory usage)
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-causal attention
* llama : do not limit n_batch to n_ctx with non-causal attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Does it also support multiple GPUs when llama.cpp is compiled with the Vulkan SDK (for AMD GPUs)?
Pipeline parallelism improves batch processing performance when using multiple GPUs.
Changes:

- Extend `ggml_backend_sched` to make multiple copies automatically, which is useful to reduce synchronization requirements with pipeline parallelism
- `llama_decode` automatically splits batches into multiple smaller batches if they are too big for the configured compute batch size
- The maximum size of a batch passed to `llama_decode` is still limited by `n_batch`, to reduce the size of the logits and embeddings buffers
- Add an `n_ubatch` (`-ub` on the command line) parameter to `llama_context_params`
  - `n_batch` sets the size of the logits and embeddings buffer, which limits the maximum batch size passed to `llama_decode`
  - `n_ubatch` sets the maximum batch size for computation
  - By default, `n_batch` is 4096 and `n_ubatch` is 512
  - Applications can increase `n_batch` without having to update their logic
- Make `llama_decode` asynchronous (see the usage sketch after this list)
  - Synchronization happens automatically when calling `llama_get_logits` and `llama_get_embeddings`
  - Add `llama_synchronize` to force a synchronization manually, which can be useful when measuring the time of `llama_decode`
  - Applications can submit multiple batches with `llama_decode` without synchronizing
  - The values reported in `llama_timings` may not be accurate if the application does not synchronize immediately after calling `llama_decode`
- Add multi-threading support to `ggml_get_rows` (still single-threaded when full offload)
- Add a `LLAMA_SCHED_MAX_COPIES` build parameter to configure the number of copies used in `ggml_backend_sched` when using pipeline parallelism
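As an illustration of the asynchronous behavior, here is a sketch of how an application might time `llama_decode` correctly (it assumes an already-initialized `llama_context * ctx` and a filled `llama_batch`; `timed_decode` is a made-up helper name):

```cpp
#include "llama.h"

#include <chrono>
#include <cstdio>

void timed_decode(llama_context * ctx, llama_batch batch) {
    const auto t_start = std::chrono::steady_clock::now();

    // llama_decode may now return before the GPU work has completed;
    // the batch is internally split into n_ubatch-sized chunks.
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
        return;
    }

    // Without this, t_end would measure only the submission time.
    // Reading results via llama_get_logits/llama_get_embeddings would
    // also synchronize implicitly.
    llama_synchronize(ctx);

    const auto t_end = std::chrono::steady_clock::now();
    printf("decode: %.2f ms\n",
           std::chrono::duration<double, std::milli>(t_end - t_start).count());
}
```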