This project lets you use Go for hardware accelerated local inference with llama.cpp directly integrated into your applications via the yzma module. Kronk provides a high-level API that feels similar to using an OpenAI compatible API.
This project also provides a model server for chat completions and embeddings. The server is compatible with the OpenWebUI project.
Here is the current catalog of models that have been verified to work with Kronk.
Copyright 2025 Ardan Labs
hello@ardanlabs.com
Name: Bill Kennedy
Company: Ardan Labs
Title: Managing Partner
Email: bill@ardanlabs.com
Twitter: goinggodotnet
To install the Kronk tool, run the following command:
$ go install github.com/ardanlabs/kronk/cmd/kronk@latest
Here is the help screen.
$ kronk help
Go for hardware accelerated local inference with llama.cpp directly integrated into your applications via the yzma module. Kronk provides a high-level API that feels similar to using an OpenAI compatible API.
Usage:
kronk [flags]
kronk [command]
Available Commands:
catalog Manage model catalog
help Help about any command
libs Install or upgrade llama.cpp libraries
logs Stream server logs
model Manage models
server Start Kronk model server
stop Stop the running Kronk model server
Flags:
-h, --help help for kronk
-v, --version version for kronk
Use "kronk [command] --help" for more information about a command.
Here is the current ROADMAP for the project, covering work in progress and items that would be nice to have. The roadmap is not in any particular order.
If you are interested in helping in any way, please send an email to Bill Kennedy.
The architecture of Kronk is designed to be simple and scalable. The Kronk API allows you to write applications that directly interact with local open source GGUF models (supported by llama.cpp), providing inference for text and media (vision and audio).
Check out the examples section below.
If you want an OpenAI compatible model server, the Kronk model server leverages the power of the Kronk API to give you a concurrent and scalable web API.
Run make kronk-server to check it out.
The diagram below shows how the Kronk model server supports access to multiple models. The model server manages kronk API instances that each provide access to a model. The Kronk API allows concurrent access to the models in a safe and reliable way.
-> Kronk API -> Yzma -> Llama.cpp -> Model 1 (1 instance)
Client -> Kronk MS -> Kronk API -> Yzma -> Llama.cpp -> Model 2 (1 instance)
-> Kronk API -> Yzma -> Llama.cpp -> Model 3 (1 instance)
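To make the concurrent access concrete, here is a minimal sketch that fans several questions out to a single Kronk value from separate goroutines. It reuses the API calls shown in the full example later in this document and assumes the llama.cpp libraries are already installed; the model file path and the instance count are illustrative only.

package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
	"time"

	"github.com/ardanlabs/kronk/sdk/kronk"
	"github.com/ardanlabs/kronk/sdk/kronk/defaults"
	"github.com/ardanlabs/kronk/sdk/kronk/model"
)

func main() {
	// Assumes the llama.cpp libraries were already installed, for example by
	// running one of the examples or the kronk CLI.
	if err := kronk.Init(defaults.LibsDir(""), kronk.LogSilent); err != nil {
		fmt.Println("init:", err)
		return
	}

	// Keep 2 instances of the model loaded so requests can be served
	// concurrently. The model file path below is a hypothetical example.
	krn, err := kronk.New(2, model.Config{
		ModelFile: "/path/to/Qwen3-8B-Q8_0.gguf",
	})
	if err != nil {
		fmt.Println("new:", err)
		return
	}
	defer krn.Unload(context.Background())

	questions := []string{"What is Go?", "What is GGUF?", "What is llama.cpp?"}

	var wg sync.WaitGroup
	wg.Add(len(questions))

	// Each goroutine shares the same Kronk value and streams its own answer.
	for _, question := range questions {
		go func(question string) {
			defer wg.Done()

			ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
			defer cancel()

			d := model.D{
				"messages": model.DocumentArray(
					model.TextMessage("user", question),
				),
			}

			ch, err := krn.ChatStreaming(ctx, d)
			if err != nil {
				fmt.Println("chat streaming:", err)
				return
			}

			var sb strings.Builder
			for resp := range ch {
				sb.WriteString(resp.Choice[0].Delta.Content)
			}

			fmt.Printf("\nQ: %s\nA: %s\n", question, sb.String())
		}(question)
	}

	wg.Wait()
}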
You can configure the number of models that stay loaded in memory (default: 3) and how long they stay loaded when not being used (default: 5m). This allows you to manage the resources on the hardware.
For more details on the settings, run the following command after installing Kronk:
$ kronk server --help
Kronk uses models in the GGUF format supported by llama.cpp. You can find many models in GGUF format on Hugging Face (over 147k at last count):
https://huggingface.co/models?library=gguf&sort=trending
Kronk currently has support for over 94% of llama.cpp functionality thanks to yzma. See the yzma ROADMAP.md for the complete list.
You can use multimodal models (image/audio) and text language models with full hardware acceleration on Linux, on macOS, and on Windows.
| OS | CPU | GPU |
|---|---|---|
| Linux | amd64, arm64 | CUDA, Vulkan, HIP, ROCm, SYCL |
| macOS | arm64 | Metal |
| Windows | amd64 | CUDA, Vulkan, HIP, SYCL, OpenCL |
Whenever there is a new release of llama.cpp, the tests for yzma are run automatically. Kronk runs tests once a day and will check for updates to llama.cpp. This helps us stay up to date with the latest code and models.
There are examples in the examples directory:
The first time you run these programs, the system will download and install the required model and libraries.
AUDIO - This example shows you how to execute a simple prompt against an audio model.
$ make example-audio
CHAT - This example shows you how to create a simple chat application against an inference model using Kronk. Thanks to Kronk and yzma, reasoning and tool calling are enabled.
$ make example-chat
EMBEDDING - This example shows you how to use an embedding model.
$ make example-embedding
QUESTION - This example shows you a basic program of using Kronk to ask a model a question.
$ make example-question
VISION - This example shows you how to execute a simple prompt against a vision model.
$ make example-vision
WEB - This example shows you a web service that provides a chat endpoint for asking questions to a model with a browser-based chat UI.
$ make example-web
You can find more examples in the ArdanLabs AI training repo at Example13.
The model server is OpenAI compatible and you can use OpenWebUI to interact with it. To start the Kronk model server, run:
$ kronk server
If this is the first time you're using the system, you will need to download a model. To find a starter model, run the catalog list command to see the current catalog of models:
$ kronk catalog list --local
CATALOG MODEL ID
Audio-Text-to-Text Qwen2-Audio-7B.Q8_0
Embedding embeddinggemma-300m-qat-Q8_0
Image-Text-to-Text Qwen2.5-VL-3B-Instruct-Q8_0
Text-Generation gpt-oss-20b-Q8_0
Text-Generation Qwen3-8B-Q8_0
Then download the Qwen3-8B-Q8_0 model using the catalog pull command:
$ kronk catalog pull Qwen3-8B-Q8_0 --local
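With the server running and a model pulled, any OpenAI compatible client can talk to it. Below is a minimal Go sketch that sends a chat completions request using the standard OpenAI request body; the server address and the /v1/chat/completions path are assumptions based on the OpenAI API convention, so check kronk server --help for the actual bind address.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical server address; check `kronk server --help` for the real
	// bind address and port.
	const url = "http://localhost:9000/v1/chat/completions"

	// Standard OpenAI chat completions request body.
	body, err := json.Marshal(map[string]any{
		"model": "Qwen3-8B-Q8_0",
		"messages": []map[string]string{
			{"role": "user", "content": "Hello model"},
		},
	})
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}

	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("post:", err)
		return
	}
	defer resp.Body.Close()

	out, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read:", err)
		return
	}

	fmt.Println(string(out))
}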
If you want to play with OpenWebUI, run the following commands:
$ make install-owu
$ make owu-up
Then open your browser to localhost:8080 or open another terminal window and run:
$ make owu-browse
package main
import (
"context"
"fmt"
"os"
"runtime"
"time"
"github.com/ardanlabs/kronk/sdk/kronk"
"github.com/ardanlabs/kronk/sdk/kronk/defaults"
"github.com/ardanlabs/kronk/sdk/kronk/model"
"github.com/ardanlabs/kronk/sdk/tools/libs"
"github.com/ardanlabs/kronk/sdk/tools/models"
"github.com/hybridgroup/yzma/pkg/download"
)
const (
modelURL = "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q8_0.gguf"
modelInstances = 1
)
var (
libPath = defaults.LibsDir("")
modelPath = defaults.ModelsDir("")
)
func main() {
if err := run(); err != nil {
fmt.Printf("\nERROR: %s\n", err)
os.Exit(1)
}
}
func run() error {
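// Install the llama.cpp libraries and download the model if they are not already present.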
info, err := installSystem()
if err != nil {
return fmt.Errorf("unable to installation system: %w", err)
}
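// Initialize Kronk against the installed llama.cpp libraries.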
if err := kronk.Init(libPath, kronk.LogSilent); err != nil {
return fmt.Errorf("unable to init kronk: %w", err)
}
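// Create a Kronk instance that loads the model for inference.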
krn, err := kronk.New(modelInstances, model.Config{
ModelFile: info.ModelFile,
})
if err != nil {
return fmt.Errorf("unable to create inference model: %w", err)
}
defer func() {
fmt.Println("\nUnloading Kronk")
if err := krn.Unload(context.Background()); err != nil {
fmt.Printf("failed to unload model: %v", err)
}
}()
// -------------------------------------------------------------------------
ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
defer cancel()
question := "Hello model"
fmt.Println()
fmt.Println("QUESTION:", question)
fmt.Println()
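// Build the chat request document with the question and sampling settings.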
d := model.D{
"messages": model.DocumentArray(
model.TextMessage("user", question),
),
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"max_tokens": 2048,
}
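// Start a streaming chat completion for the request document.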
ch, err := krn.ChatStreaming(ctx, d)
if err != nil {
return fmt.Errorf("chat streaming: %w", err)
}
// -------------------------------------------------------------------------
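// Print reasoning deltas in color as they stream, then print the final content.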
var reasoning bool
for resp := range ch {
switch resp.Choice[0].FinishReason {
case model.FinishReasonError:
return fmt.Errorf("error from model: %s", resp.Choice[0].Delta.Content)
case model.FinishReasonStop:
return nil
default:
if resp.Choice[0].Delta.Reasoning != "" {
reasoning = true
fmt.Printf("\u001b[91m%s\u001b[0m", resp.Choice[0].Delta.Reasoning)
continue
}
if reasoning {
reasoning = false
fmt.Println()
continue
}
fmt.Printf("%s", resp.Choice[0].Delta.Content)
}
}
return nil
}
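// installSystem downloads the llama.cpp libraries and the model files if they are not already installed.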
func installSystem() (models.Path, error) {
libCfg, err := libs.NewConfig(
libPath,
runtime.GOARCH,
runtime.GOOS,
download.CPU.String(),
kronk.LogSilent.Int(),
true,
)
if err != nil {
return models.Path{}, err
}
_, err = libs.Download(context.Background(), kronk.FmtLogger, libCfg)
if err != nil {
return models.Path{}, fmt.Errorf("unable to install llama.cpp: %w", err)
}
mp, err := models.Download(context.Background(), kronk.FmtLogger, modelURL, "", modelPath)
if err != nil {
return models.Path{}, fmt.Errorf("unable to install model: %w", err)
}
return mp, nil
}
This example can produce the following output:
$ make example-question
CGO_ENABLED=0 go run examples/question/main.go
download-libraries: status[check libraries version information] arch[arm64] os[darwin] processor[cpu]
download-libraries: status[check llama.cpp installation] arch[arm64] os[darwin] processor[cpu] latest[b7406] current[b7406]
download-libraries: status[already installed] latest[b7406] current[b7406]
download-model: model-url[https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q8_0.gguf] proj-url[] model-id[Qwen3-8B-Q8_0]:
download-model: waiting to check model status...:
download-model: status[already exists]:
QUESTION: Hello model
Okay, the user said "Hello model." I need to respond appropriately. First, I should acknowledge their greeting. Since they mentioned "model," maybe they're referring to me as a language model. I should clarify that I'm Qwen, a large language model developed by Alibaba Cloud. I should keep the response friendly and open-ended, inviting them to ask questions or share topics they're interested in. Let me make sure the tone is welcoming and helpful. Also, check for any possible misunderstandings. They might be testing if I recognize the term "model," so confirming my identity as Qwen is important. Alright, time to put it all together in a natural, conversational way.
! I'm Qwen, a large language model developed by Alibaba Cloud. How can I assist you today? Whether you have questions, need help with something, or just want to chat, feel free to let me know!
Unloading Kronk