//examples/llm runs interactive or one-shot text generation from a HuggingFace model repository.
We support the following models, automatically detected from the model_type in the config.json:
- Llama 3.1
- Qwen 3.5
- LFM 2.5
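
To see which model_type a local model directory declares before running the example, a quick check like the following can help. This is a minimal sketch, not part of //examples/llm: the function name model_type_of is made up for illustration, and it uses a rough text match rather than a real JSON parser, so it assumes config.json is pretty-printed with one field per line.

```shell
# Hypothetical helper (not part of the example program): print the model_type
# declared in <dir>/config.json.
# Assumption: the field appears on its own line, e.g.  "model_type": "llama",
model_type_of() {
  local config="$1/config.json"
  if [ ! -f "$config" ]; then
    echo "no config.json in $1" >&2
    return 1
  fi
  # Extract the quoted value after "model_type"; keep only the first match.
  sed -n 's/.*"model_type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1
}
```

For example, `model_type_of /var/models/meta-llama/Llama-3.1-8B-Instruct` prints the value the loader would read from that directory's config.json. Configs that nest several model_type fields would need a proper JSON tool instead.
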
To load a model from HuggingFace directly:
# CPU
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# CUDA
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# ROCm
bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct

From a local directory:
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=/var/models/meta-llama/Llama-3.1-8B-Instruct/

For a single non-interactive prompt:
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct --prompt="Write a haiku about Zig"

- --model=<path>: Required. Model repository to load. This can be a local path or a HuggingFace/S3 URI such as hf://... or s3://....
- --prompt=<string>: Optional. Runs a single prompt instead of opening the interactive chat loop.
- --seqlen=<number>: Optional. Maximum sequence length. Defaults to 2048.
- --backend=<vanilla|cuda_fa2|cuda_fa3>: Optional. Attention backend. If omitted, the program auto-selects one for the current platform.
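
The platform switch and the program flags sit on opposite sides of the `--` separator, which is easy to get wrong when combining several options. The sketch below assembles the full command line from the pieces documented above. The helper name llm_cmd and its argument order are hypothetical, it only prints the command instead of running it, and it assumes the extra flags contain no spaces that would need quoting.

```shell
# Hypothetical helper: print the bazel invocation for a platform and model.
# Usage: llm_cmd <cpu|cuda|rocm> <model> [program flags...]
llm_cmd() {
  local platform="$1" model="$2"
  shift 2
  local cmd="bazel run //examples/llm"
  case "$platform" in
    cpu) ;;  # CPU needs no platform flag
    cuda|rocm) cmd="$cmd --@zml//platforms:${platform}=true" ;;
    *) echo "unknown platform: $platform" >&2; return 1 ;;
  esac
  # Everything after "--" is passed to the program, not to bazel.
  cmd="$cmd -- --model=$model"
  local flag
  for flag in "$@"; do
    cmd="$cmd $flag"
  done
  printf '%s\n' "$cmd"
}

# Example: a longer context window with an explicit attention backend.
llm_cmd cuda hf://meta-llama/Llama-3.1-8B-Instruct --seqlen=4096 --backend=cuda_fa2
```

Printing rather than executing keeps the sketch safe to try; the resulting line can be pasted into a shell once it looks right.
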