Interactive terminal workflow to clone llama.cpp, build with CMake, download GGUF weights from Hugging Face using hf download, and run llama-cli / llama-server. Model sources and filename patterns live in models.json next to the script—edit that file to add or remove entries.
| Option | What it does |
|---|---|
| Full Installation | Pick install root → clone → build → download a model from models.json |
| Build llama.cpp only | Compile only (default root ~/llama-cpp-hf) |
| Download model only | Download GGUF into an existing llama.cpp tree |
| Run model | Pick a local .gguf, then chat or start the API server |
| Exit | Quit |
The download step installs huggingface_hub and hf_transfer via pip when needed. hf download expects a working Hugging Face setup; for gated or private models, run huggingface-cli login or set HF_TOKEN.
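For orientation, the download step can be pictured as a single `hf download` invocation. The sketch below only assembles the command and environment and does not run anything. The helper names are hypothetical, and setting `HF_HUB_ENABLE_HF_TRANSFER=1` is an assumption about how hf_transfer is switched on, not something taken from the script.

```python
import os


def build_download_command(repo_id, filename, dest_dir):
    """Return the argv for fetching one file from a Hugging Face repo."""
    return ["hf", "download", repo_id, filename, "--local-dir", dest_dir]


def download_env(base_env=None):
    """Copy the environment and opt in to hf_transfer (assumed, see lead-in)."""
    env = dict(os.environ if base_env is None else base_env)
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return env
```

Both values can be passed to `subprocess.run(cmd, env=env, check=True)`; an `HF_TOKEN` already present in the environment is carried over unchanged.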
- git — clone llama.cpp
- cmake — configure and compile
- pip / pip3 — install Python packages
- python3 — parse `models.json` and helpers
If anything is missing, the script suggests example installs (e.g. brew install … on macOS). You also need free disk space for sources, build artifacts, and models.
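To check the prerequisites by hand before running the script, something like the following may help. It is a minimal sketch, not the script's own check: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and treating `pip` and `pip3` as interchangeable is an assumption based on the list above.

```python
import shutil

# Each tuple lists interchangeable commands; finding any one of them is enough.
REQUIRED_TOOLS = [("git",), ("cmake",), ("pip", "pip3"), ("python3",)]


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the tool groups for which no command was found on PATH."""
    return [group for group in required if not any(which(name) for name in group)]


if __name__ == "__main__":
    for group in missing_tools():
        print("missing:", " or ".join(group))
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.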

    cd /path/to/llama-cpp-hf-setup
    chmod +x setup-llama-cpp-hf.sh
    ./setup-llama-cpp-hf.sh

Run the script from this directory so it finds models.json beside it.
| Item | Default |
|---|---|
| Install root | ~/llama-cpp-hf |
| llama.cpp checkout | $INSTALL_DIR/llama.cpp |
| GGUF download dir | $INSTALL_DIR/llama.cpp/models/ |
| Default context size (when offered) | 65536 (adjust in the run flow) |
| Default API port | 8080 |
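The defaults above fit together roughly as sketched here. The function names are hypothetical, and placing the binaries under `build/bin` inside the checkout is an assumption based on the usual CMake layout. `-m`, `-c`, and `--port` are standard llama-server options; whether the script passes exactly these is not confirmed.

```python
from pathlib import Path

DEFAULT_INSTALL_DIR = Path.home() / "llama-cpp-hf"
DEFAULT_CTX = 65536
DEFAULT_PORT = 8080


def layout(install_dir=DEFAULT_INSTALL_DIR):
    """Derive the checkout, model, and binary directories from the install root."""
    checkout = Path(install_dir) / "llama.cpp"
    return {
        "checkout": checkout,
        "models": checkout / "models",
        "bin": checkout / "build" / "bin",  # assumed build output location
    }


def server_command(model_path, install_dir=DEFAULT_INSTALL_DIR,
                   ctx=DEFAULT_CTX, port=DEFAULT_PORT):
    """Return the argv for starting llama-server with the default settings."""
    server = layout(install_dir)["bin"] / "llama-server"
    return [str(server), "-m", str(model_path), "-c", str(ctx), "--port", str(port)]
```

A 65536-token context can need a lot of memory for larger models, so lowering `ctx` is a reasonable first step if the server fails to start.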
Top level: a JSON object with version and a models array. Each entry describes how to resolve a Hugging Face repo and target filename.
| Field | Meaning |
|---|---|
| `id` | Internal id |
| `label` | Long label for display |
| `list_name` | Short menu name (optional; derived from label if omitted) |
| `hf_repo` | Fixed repo, e.g. org/repo-GGUF |
| `hf_repo_pattern` | Template with {size}, e.g. unsloth/Qwen3.5-{size}-GGUF |
| `artifact_pattern` | Filename template with {size} and {quant} |
| `sizes` | List of { "value", "label" } choices |
| `quants` | List of { "value", "label" } quantization names (e.g. Q4_K_M) |
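To illustrate how these fields combine, here is a minimal sketch of one entry and how a repo and filename might be derived from it. The entry is illustrative only: the `id`, `label`, `version` value, the `artifact_pattern`, and the `9B` size are hypothetical and not checked against Hugging Face. `resolve` is a hypothetical helper, not the script's actual code.

```python
import json

# Illustrative config; every value except the hf_repo_pattern example is made up.
EXAMPLE = json.loads("""
{
  "version": 1,
  "models": [
    {
      "id": "qwen35",
      "label": "Qwen3.5 (Unsloth GGUF)",
      "hf_repo_pattern": "unsloth/Qwen3.5-{size}-GGUF",
      "artifact_pattern": "Qwen3.5-{size}-{quant}.gguf",
      "sizes": [{"value": "9B", "label": "9B"}],
      "quants": [{"value": "Q4_K_M", "label": "Q4_K_M"}]
    }
  ]
}
""")


def resolve(entry, size, quant):
    """Return (repo_id, filename) for one entry and the chosen size and quant."""
    # A fixed hf_repo takes precedence; otherwise fill in the repo template.
    repo = entry.get("hf_repo") or entry["hf_repo_pattern"].replace("{size}", size)
    filename = entry["artifact_pattern"].replace("{size}", size).replace("{quant}", quant)
    return repo, filename
```

With this entry, `resolve(EXAMPLE["models"][0], "9B", "Q4_K_M")` returns `("unsloth/Qwen3.5-9B-GGUF", "Qwen3.5-9B-Q4_K_M.gguf")`. Plain `str.replace` is used instead of `str.format` so that stray braces in a pattern do not raise errors.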
If several rules could match the same basename, the first matching entry in the array wins. To add sources, append new objects to the models array; repo and file names must match what is published on Hugging Face.
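The first-match rule can be sketched as follows. This is one possible reading, assuming a local basename is compared against each entry's `artifact_pattern` with `{size}` and `{quant}` acting as wildcards. The helper names and the sample patterns in the usage note are hypothetical.

```python
import re

_PLACEHOLDER = re.compile(r"(\{size\}|\{quant\})")


def pattern_to_regex(pattern):
    """Turn an artifact_pattern into a regex; placeholders match any non-empty text."""
    pieces = []
    for part in _PLACEHOLDER.split(pattern):
        if part in ("{size}", "{quant}"):
            pieces.append(".+?")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def first_match(entries, basename):
    """Return the first entry whose artifact_pattern matches basename, else None."""
    for entry in entries:
        if pattern_to_regex(entry["artifact_pattern"]).fullmatch(basename):
            return entry
    return None
```

For example, if one entry uses `Qwen3.5-{size}-{quant}.gguf` and a later one uses the broader `{size}-{quant}.gguf`, the file `Qwen3.5-9B-Q4_K_M.gguf` matches both and resolves to whichever comes first. Under this reading, specific patterns belong before general ones.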
- NVIDIA GPU: If `nvidia-smi` is present, you can enable CUDA (-DGGML_CUDA=ON) during install.
- Apple Silicon: On macOS, llama-cli is invoked with `--n-gpu-layers 99` so layers run on Metal instead of CPU-only.
- Existing build: If `build/bin/llama-cli` and `llama-server` exist, you can skip a full rebuild or rebuild from scratch.
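The three notes above amount to a few small decisions, sketched here under stated assumptions. All function names are hypothetical, the `cmake -B build` form is the usual llama.cpp build invocation rather than a quote from the script, and macOS is detected simply by `platform.system()` returning `"Darwin"`.

```python
import platform
import shutil
from pathlib import Path


def cuda_available(which=shutil.which):
    """Treat a discoverable nvidia-smi as the signal that CUDA can be offered."""
    return which("nvidia-smi") is not None


def configure_args(cuda):
    """Return the CMake configure argv, adding the CUDA flag when requested."""
    args = ["cmake", "-B", "build"]
    if cuda:
        args.append("-DGGML_CUDA=ON")
    return args


def cli_gpu_args(system=None):
    """Return the extra llama-cli arguments used on macOS, otherwise none."""
    system = platform.system() if system is None else system
    return ["--n-gpu-layers", "99"] if system == "Darwin" else []


def has_existing_build(checkout):
    """Report whether both binaries already exist under build/bin."""
    bin_dir = Path(checkout) / "build" / "bin"
    return all((bin_dir / name).is_file() for name in ("llama-cli", "llama-server"))
```

Keeping each decision in its own function makes it easy to override one of them, for example forcing a CPU-only build on a machine that has `nvidia-smi` installed.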
- No `hf` command: The download step installs `huggingface_hub`; if it still fails, run `pip install -U huggingface_hub hf_transfer` and ensure `hf` is on `PATH`.
- 401/403 on download: Log in to Hugging Face or set `HF_TOKEN`.
- CMake / link errors: The script unsets some environment variables that break linking; if you build manually, avoid conflicting `LDFLAGS` and similar.
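When building manually, one way to rule out the linking problem is to run CMake with a scrubbed environment. The sketch below is an assumption-laden illustration: only `LDFLAGS` is named in this README, so the other variables in `LINK_VARS` are hypothetical guesses at "and similar", and both helper names are made up.

```python
import os
import shutil

# Hypothetical list: the README names only LDFLAGS; the rest are guesses.
LINK_VARS = ("LDFLAGS", "CPPFLAGS", "CFLAGS", "CXXFLAGS", "LIBRARY_PATH")


def scrubbed_env(env=None, names=LINK_VARS):
    """Return a copy of the environment without the listed build variables."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in names}


def hf_on_path(which=shutil.which):
    """Report whether the hf command can be found on PATH."""
    return which("hf") is not None
```

The result of `scrubbed_env()` can be passed as `env=` to `subprocess.run` for the configure and build commands, leaving the calling shell's environment untouched.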
This repository contains only the installer script and sample config. llama.cpp and model weights are governed by their respective upstream and Hugging Face terms—check those before use.