Releases: ggml-org/llama.cpp

b7508

22 Dec 14:05
6ce863c

server: prevent data race from HTTP threads (#18263)

  • server: prevent data race from HTTP threads

  • fix params

  • fix default_generation_settings

  • nits: make handle_completions_impl look less strange

  • stricter const

  • fix GGML_ASSERT(idx < states.size())

  • move index to be managed by server_response_reader (see the sketch after this list)

  • http: make sure req & res lifecycle are tied together

  • fix compile

  • fix buggy index handling

  • fix data race for lora endpoint

  • nits: fix shadow variable

  • nits: revert redundant changes

  • nits: correct naming for json_webui_settings
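
The pattern behind these commits, as a minimal sketch: HTTP handler threads touch shared server state only under a lock, and the slot index is owned by the per-request reader rather than tracked ad hoc by each handler. Everything here except the server_response_reader name and the idx < states.size() assert is hypothetical; server_state, the states vector, and read_state are illustrative, not code from the PR.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative shared state; in the real server this is the context
// that HTTP threads previously raced on.
struct server_state {
    std::mutex       mutex;   // guards states
    std::vector<int> states;  // per-slot state (placeholder element type)
};

// The reader owns its index, tying the request and response lifecycles
// together: every access validates idx in one place, under the lock.
struct server_response_reader {
    server_state & srv;
    size_t         idx;

    server_response_reader(server_state & s, size_t i) : srv(s), idx(i) {}

    int read_state() const {
        std::lock_guard<std::mutex> lock(srv.mutex);
        assert(idx < srv.states.size()); // cf. GGML_ASSERT(idx < states.size())
        return srv.states[idx];
    }
};
```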

Assets: macOS/iOS, Linux, Windows, openEuler

b7507

22 Dec 13:10
3997c78

b7506

22 Dec 13:00
ee74642

release: update release workflow to store XCFramework as Zip file (#18284)

  • Update release workflow to store XCFramework as Zip file

  • Add comments to document Zip file requirement for XCFramework

  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Assets: macOS/iOS, Linux, Windows, openEuler

b7503

22 Dec 11:05
147a521

b7502

21 Dec 22:34
e1f15b4

vulkan: Implement set_tensor_async and the event interfaces (#18047)

The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host-visible vidmem I think the cost
shifts to allocating/mapping the vidmem and becomes more expensive, so I don't
see a benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however,
I do see a significant improvement in model loading time.
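
For context, a minimal sketch of the async upload path this enables, modeled on llama_model_loader::load_all_data. It assumes the ggml-backend API as declared in ggml-backend.h (ggml_backend_tensor_set_async, ggml_backend_event_record, ggml_backend_event_synchronize); upload_async itself is a hypothetical helper, not code from this PR.

```cpp
#include "ggml-backend.h"

// Queue a non-blocking copy into the tensor's backend buffer and record
// an event behind it, so reading the next tensor from disk can overlap
// with this upload.
static void upload_async(ggml_backend_t backend, ggml_backend_event_t event,
                         struct ggml_tensor * tensor, const void * data, size_t size) {
    ggml_backend_tensor_set_async(backend, tensor, data, /*offset =*/ 0, size);
    ggml_backend_event_record(event, backend);
}

// ... and before the model is used:
//     ggml_backend_event_synchronize(event);  // fence all pending uploads
```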

Assets: macOS/iOS, Linux, Windows, openEuler

b7501

21 Dec 19:45
0e1ccf1

b7499

21 Dec 11:29
fd05c51

b7498

21 Dec 11:21
b365c3f

vulkan/cuda: fix topk_moe with exp_probs_b (#18071)

I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both the CUDA and Vulkan backends, which assumed that the
input to argsort and the input to get_rows are the same tensor. I'd like to
optimize this graph in a follow-up change; for now this just makes it
functional.

CUDA also had a bug where it read n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
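
A minimal sketch of the routing pattern in question, using ggml's public graph API (ggml_soft_max, ggml_add, ggml_top_k, ggml_get_rows); the function and tensor names are illustrative, not lifted from build_moe_ffn.

```cpp
#include "ggml.h"

// Expert *selection* runs on biased probabilities, expert *weights* on
// unbiased ones, so the argsort/top_k input and the get_rows input are
// different tensors - the assumption the fused kernels broke on.
static struct ggml_tensor * moe_route(struct ggml_context * ctx,
                                      struct ggml_tensor  * logits,       // [n_expert, n_tokens]
                                      struct ggml_tensor  * exp_probs_b,  // [n_expert], may be NULL
                                      int n_expert, int n_tokens, int n_expert_used,
                                      struct ggml_tensor ** out_ids) {
    struct ggml_tensor * probs     = ggml_soft_max(ctx, logits);
    struct ggml_tensor * selection = probs;
    if (exp_probs_b) {
        selection = ggml_add(ctx, probs, exp_probs_b); // bias affects ranking only
    }
    *out_ids = ggml_top_k(ctx, selection, n_expert_used); // I32 expert ids
    // weights still come from the unbiased probabilities
    return ggml_get_rows(ctx,
            ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), *out_ids);
}
```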

Assets: macOS/iOS, Linux, Windows, openEuler

b7497

21 Dec 11:26
cb64222

b7496

21 Dec 11:24
6eb7081
