Tags · albiol2004/llama.cpp

b8746

common: mark --split-mode tensor as experimental (ggml-org#21684)

Apr 10, 2026
0893f50

b8744

common : enable reasoning budget sampler for gemma4 (ggml-org#21697)

* fix: enable reasoning budget sampler for gemma4

Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.

Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).

Add test case for empty thinking block.

Fixes ggml-org#21487

* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
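
The effect of making the whitespace after "thought" optional can be sketched with regular expressions. This is only an illustration, not the PEG parser from the commit: the tag strings below are placeholders, and it assumes p.space() accepts zero or more whitespace characters, which the commit message does not state explicitly.

```python
import re

# Placeholder tags for illustration only; the real gemma4 tags are not
# given in the commit message.
START = "<think-start>thought"
END = "<think-end>"

# Strict form: a newline is required right after "thought".
strict = re.compile(re.escape(START) + r"\n(.*?)" + re.escape(END), re.DOTALL)

# Relaxed form: any amount of whitespace, including none, may follow "thought".
relaxed = re.compile(re.escape(START) + r"\s*(.*?)" + re.escape(END), re.DOTALL)

# Ordinary output: newline, then reasoning text, then the end tag.
normal = START + "\nsome reasoning" + END

# budget=0: the sampler forces the end tag before any newline is emitted.
empty = START + END

strict_normal = strict.search(normal)
strict_empty = strict.search(empty)
relaxed_normal = relaxed.search(normal)
relaxed_empty = relaxed.search(empty)
```

Under these assumptions the strict pattern fails on the empty thinking block, while the relaxed pattern accepts both inputs and yields an empty reasoning string for the budget=0 case.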

Apr 10, 2026
d7ff074

b8742

vulkan: Support Q1_0 (ggml-org#21539)

* vulkan: Support Q1_0

* use get_dm

Apr 10, 2026
7b69125

b8741

common : add fluidity to the progress bar (ggml-org#21671)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

Apr 10, 2026
e095a48

b8740

CUDA: fuse muls (ggml-org#21665)

Apr 10, 2026
e34f042

b8739

HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (ggml-org#21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:

- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here
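
The routing described above can be summarised as a small lookup table. This is a hypothetical sketch, not the dispatch code in mma.cuh: the instruction names are copied from the commit message, while the table layout, the select_mfma helper, and the assumption that CDNA3 uses the xf32 variant for f32 matmul are illustrative only.

```python
# Instruction names come from the commit message; the table itself is
# an illustrative stand-in for the compile-time dispatch in mma.cuh.
ROUTES = {
    ("CDNA3", "f32"): "mfma_f32_16x16x8_xf32",  # assumed CDNA3 path
    ("CDNA4", "f32"): "mfma_f32_16x16x4f32",    # xf32 unavailable on gfx950
    ("CDNA3", "bf16"): "mfma_f32_16x16x16bf16_1k",
    ("CDNA4", "bf16"): "mfma_f32_16x16x16bf16_1k",
    ("CDNA3", "int8"): "mfma_i32_16x16x32_i8",
    ("CDNA4", "int8"): "mfma_i32_16x16x32_i8",
}


def select_mfma(arch: str, dtype: str) -> str:
    """Return the MFMA instruction name for an architecture and data type."""
    try:
        return ROUTES[(arch, dtype)]
    except KeyError:
        raise ValueError(f"no MFMA route for {arch}/{dtype}") from None


cdna4_f32 = select_mfma("CDNA4", "f32")
cdna4_bf16 = select_mfma("CDNA4", "bf16")
```

Only the f32 entry differs between the two architectures in this sketch, which matches the statement that CDNA4 is largely compatible with CDNA3.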

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>

Apr 9, 2026
d132f22

b8738

ggml: backend-agnostic tensor parallelism (experimental) (ggml-org#19378)

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (ggml-org#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (ggml-org#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (ggml-org#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (ggml-org#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (ggml-org#12)

* delay AllReduce for MoE for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (ggml-org#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (ggml-org#17)

* meta : formatting, naming, indentation (ggml-org#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
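
The Qwen-30B-A3B Q4_0 crash fix above comes down to simple arithmetic, sketched here. The helper name split_dim and the choice of two GPUs are assumptions for illustration; the dimension 768 and the granularity 256 come from the commit message, and 32 is the Q4_0 block size.

```python
def split_dim(dim: int, granularity: int, n_devs: int) -> list[int]:
    """Split `dim` across `n_devs` devices in whole multiples of `granularity`."""
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices each receive one additional block.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


def is_even(parts: list[int]) -> bool:
    """True when every device receives the same share."""
    return len(set(parts)) == 1


# Granularity 256: 768 / 256 = 3 blocks, which cannot be halved.
coarse = split_dim(768, 256, 2)

# Granularity 32 (Q4_0 block size): 768 / 32 = 24 blocks, 12 per device.
fine = split_dim(768, 32, 2)
```

With a granularity of 256 the two devices would receive 512 and 256 elements, whereas a granularity taken from the quantization block size gives each device 384.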

Apr 9, 2026
d6f3030

b8737

ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) (ggml-org#21676)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

Apr 9, 2026
009a113

b8734

common : fix ambiguous grammar rule in gemma4 (ggml-org#21661)

* common : fix ambiguous grammar rule in gemma4

* cont : fix missing comma...

Apr 9, 2026
ddf03c6

b8733

common : simplify autoparser tagged parser rules (ggml-org#21216)

* common : simplify autoparser tagged parser rules

* cont : remove upper limit on optional args

* cont : revert changes to parsing at the end

* cont : undo arbitrary ordering of optional args

* cont : fix uninitialized required parameters

* revert to simplify merge

* re-apply patches

* restore flexible optional arg ordering tests

Apr 9, 2026
2622975
