Skip to content

Latest commit

 

History

History
41 lines (29 loc) · 1.38 KB

File metadata and controls

41 lines (29 loc) · 1.38 KB

LLM

//examples/llm runs interactive or one-shot text generation from a model repository from HuggingFace.

We support the following models, automatically detected from the model_type in the config.json:

  • Llama 3.1
  • Qwen 3.5
  • LFM 2.5

Run

To load a model from HuggingFace directly:

# CPU
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# CUDA
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
# ROCm
bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct

From a local directory:

bazel run //examples/llm --@zml//platforms:cuda=true -- --model=/var/models/meta-llama/Llama-3.1-8B-Instruct/

For a single non-interactive prompt:

bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct --prompt="Write a haiku about Zig"

Options

  • --model=<path>: Required. Model repository to load. This can be a local path or a huggingface/S3 URI such as hf://... or s3://....
  • --prompt=<string>: Optional. Runs a single prompt instead of opening the interactive chat loop.
  • --seqlen=<number>: Optional. Maximum sequence length. Defaults to 2048.
  • --backend=<vanilla|cuda_fa2|cuda_fa3>: Optional. Attention backend. If omitted, the program auto-selects one for the current platform.