README.md

LLM

//examples/llm runs interactive or one-shot text generation from a model repository from HuggingFace.

We support the following models, automatically detected from the model_type in the config.json:

Llama 3.1
Qwen 3.5
LFM 2.5

Run

To load a model from HuggingFace directly:

# CPU
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# CUDA
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# ROCm
bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct

From a local directory:

bazel run //examples/llm --@zml//platforms:cuda=true -- --model=/var/models/meta-llama/Llama-3.1-8B-Instruct/

For a single non-interactive prompt:

bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct --prompt="Write a haiku about Zig"

Options

--model=<path>: Required. Model repository to load. This can be a local path or a huggingface/S3 URI such as hf://... or s3://....
--prompt=<string>: Optional. Runs a single prompt instead of opening the interactive chat loop.
--seqlen=<number>: Optional. Maximum sequence length. Defaults to 2048.
--backend=<vanilla|cuda_fa2|cuda_fa3>: Optional. Attention backend. If omitted, the program auto-selects one for the current platform.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM

Run

Options

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

LLM

Run

Options