I set up an AMD W7700 on a Raspberry Pi 5 (16GB) running kernel 6.12.y, following the guide in geerlingguy/raspberry-pi-pcie-devices#680.
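As a sanity check (my own step, not part of the guide), I first confirmed the card is visible to the Vulkan backend; `vulkaninfo` comes from the standard vulkan-tools package:

```bash
# List Vulkan-visible devices and pick out the device name lines;
# the W7700 should show up here before llama.cpp's Vulkan backend can use it.
vulkaninfo --summary | grep -i deviceName
```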
Then I downloaded DeepSeek-R1-Distill-Qwen-14B (Q4_K_M) and tried running it:
```bash
# Follow https://www.jeffgeerling.com/blog/2024/llms-accelerated-egpu-on-raspberry-pi-5
cd models
wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
# Cap the Vulkan backend's per-buffer allocation size (value explained below)
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647
cd ../
./build/bin/llama-cli -m "models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf" -p "Why is the sky blue?" -n 50 -e -ngl 33 -t 4
```
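My understanding (an assumption on my part, going by the variable's name) is that `GGML_VK_FORCE_MAX_ALLOCATION_SIZE` caps how large any single Vulkan buffer allocation can be; the value works out to one byte under 2 GiB:

```bash
# 2147483647 bytes = 2^31 - 1, i.e. just under 2 GiB
echo $(( (1 << 31) - 1 ))             # 2147483647
echo $(( 2147483647 / 1024 / 1024 ))  # ~2047 MiB
```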
But I'm only seeing part of the model being offloaded to the GPU:
```
load_tensors: offloading 33 repeating layers to GPU
load_tensors: offloaded 33/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 5155.22 MiB
load_tensors: CPU_Mapped model buffer size = 3410.82 MiB
```
And token generation performance is pretty bad, with the GPU only utilizing about 10% of its capacity while the CPU sits at 350%+.
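That 33/49 lines up with the `-ngl 33` flag I copied from the blog post, so I suspect the 16 layers left on the CPU explain the load pattern. A sketch of what I'd try next (hedged, not yet tested on my setup): in llama.cpp, any `-ngl` value at or above the model's layer count offloads everything, so `-ngl 99` should push all 49 layers to the GPU, and `radeontop` is one way to watch the AMD card's utilization while it runs:

```bash
# Offload all 49 layers (48 repeating + output); -ngl 99 means "as many as possible"
./build/bin/llama-cli -m "models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf" \
  -p "Why is the sky blue?" -n 50 -e -ngl 99 -t 4

# In another terminal: GPU utilization and VRAM use on the AMD card
radeontop
```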