Releases: ggml-org/llama.cpp

b7508

22 Dec 14:05
6ce863c

server: prevent data race from HTTP threads (#18263)

  • server: prevent data race from HTTP threads

  • fix params

  • fix default_generation_settings

  • nits: make handle_completions_impl look less strange

  • stricter const

  • fix GGML_ASSERT(idx < states.size())

  • move index to be managed by server_response_reader (see the sketch after this list)

  • http: make sure req & res lifecycle are tied together

  • fix compile

  • fix buggy index handling

  • fix data race for lora endpoint

  • nits: fix shadow variable

  • nits: revert redundant changes

  • nits: correct naming for json_webui_settings
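
The pattern behind these commits, as a minimal sketch: HTTP handler threads touch shared server state only under a lock, and the slot index is owned by the per-request reader rather than tracked ad hoc by each handler. Everything here except the server_response_reader name and the idx < states.size() assert is hypothetical; server_state, the states vector, and read_state are illustrative, not code from the PR.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative shared state; in the real server this is the context
// that HTTP threads previously raced on.
struct server_state {
    std::mutex       mutex;   // guards states
    std::vector<int> states;  // per-slot state (placeholder element type)
};

// The reader owns its index, tying the request and response lifecycles
// together: every access validates idx in one place, under the lock.
struct server_response_reader {
    server_state & srv;
    size_t         idx;

    server_response_reader(server_state & s, size_t i) : srv(s), idx(i) {}

    int read_state() const {
        std::lock_guard<std::mutex> lock(srv.mutex);
        assert(idx < srv.states.size()); // cf. GGML_ASSERT(idx < states.size())
        return srv.states[idx];
    }
};
```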

Assets: macOS/iOS, Linux, Windows, openEuler

b7507

22 Dec 13:10
3997c78

b7506

22 Dec 13:00
ee74642

release: update release workflow to store XCFramework as Zip file (#18284)

  • Update release workflow to store XCFramework as Zip file

  • Add comments to document Zip file requirement for XCFramework

  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Assets: macOS/iOS, Linux, Windows, openEuler

b7503

22 Dec 11:05
147a521

b7502

21 Dec 22:34
e1f15b4

vulkan: Implement set_tensor_async and the event interfaces (#18047)

The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host-visible vidmem I think the cost
shifts to allocating/mapping the vidmem and becomes more expensive, so I don't
see a benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however,
I do see a significant improvement in model loading time.
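
For context, a minimal sketch of the async upload path this enables, modeled on llama_model_loader::load_all_data. It assumes the ggml-backend API as declared in ggml-backend.h (ggml_backend_tensor_set_async, ggml_backend_event_record, ggml_backend_event_synchronize); upload_async itself is a hypothetical helper, not code from this PR.

```cpp
#include "ggml-backend.h"

// Queue a non-blocking copy into the tensor's backend buffer and record
// an event behind it, so reading the next tensor from disk can overlap
// with this upload.
static void upload_async(ggml_backend_t backend, ggml_backend_event_t event,
                         struct ggml_tensor * tensor, const void * data, size_t size) {
    ggml_backend_tensor_set_async(backend, tensor, data, /*offset =*/ 0, size);
    ggml_backend_event_record(event, backend);
}

// ... and before the model is used:
//     ggml_backend_event_synchronize(event);  // fence all pending uploads
```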

Assets: macOS/iOS, Linux, Windows, openEuler

b7501

21 Dec 19:45
0e1ccf1

b7499

21 Dec 11:29
fd05c51

b7498

21 Dec 11:21
b365c3f

vulkan/cuda: fix topk_moe with exp_probs_b (#18071)

I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both the CUDA and Vulkan backends, which assumed that the
input to argsort and the input to get_rows are the same tensor. I'd like to
optimize this graph in a follow-up change; for now this just makes it
functional.

CUDA also had a bug where it read n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
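
A minimal sketch of the routing pattern in question, using ggml's public graph API (ggml_soft_max, ggml_add, ggml_top_k, ggml_get_rows); the function and tensor names are illustrative, not lifted from build_moe_ffn.

```cpp
#include "ggml.h"

// Expert *selection* runs on biased probabilities, expert *weights* on
// unbiased ones, so the argsort/top_k input and the get_rows input are
// different tensors - the assumption the fused kernels broke on.
static struct ggml_tensor * moe_route(struct ggml_context * ctx,
                                      struct ggml_tensor  * logits,       // [n_expert, n_tokens]
                                      struct ggml_tensor  * exp_probs_b,  // [n_expert], may be NULL
                                      int n_expert, int n_tokens, int n_expert_used,
                                      struct ggml_tensor ** out_ids) {
    struct ggml_tensor * probs     = ggml_soft_max(ctx, logits);
    struct ggml_tensor * selection = probs;
    if (exp_probs_b) {
        selection = ggml_add(ctx, probs, exp_probs_b); // bias affects ranking only
    }
    *out_ids = ggml_top_k(ctx, selection, n_expert_used); // I32 expert ids
    // weights still come from the unbiased probabilities
    return ggml_get_rows(ctx,
            ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), *out_ids);
}
```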

Assets: macOS/iOS, Linux, Windows, openEuler

b7497

21 Dec 11:26
cb64222

b7496

21 Dec 11:24
6eb7081
