Tags: albiol2004/llama.cpp

b8746

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
common: mark --split-mode tensor as experimental (ggml-org#21684)

b8744

common : enable reasoning budget sampler for gemma4 (ggml-org#21697)

* fix: enable reasoning budget sampler for gemma4

Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.

Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).

Add test case for empty thinking block.

Fixes ggml-org#21487

* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
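
The budget=0 case above can be illustrated with a small hand-written parser. This is only a sketch, not the PEG parser used in llama.cpp: the tag strings and the function name `parse_thought` are hypothetical placeholders, and it assumes that the whitespace rule accepts zero or more whitespace characters.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical placeholder tags; the real gemma4 tags are not shown in the commit message.
static const std::string START_TAG = "<start>thought";
static const std::string END_TAG   = "<end>";

// Extracts the reasoning text between the tags into `out`.
// Returns false if the input does not begin with a complete thinking block.
bool parse_thought(const std::string & input, std::string & out) {
    if (input.compare(0, START_TAG.size(), START_TAG) != 0) {
        return false;
    }
    std::size_t pos = START_TAG.size();
    // Skip zero or more whitespace characters instead of requiring exactly one "\n",
    // so an end tag that directly follows "thought" is still accepted.
    while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos]))) {
        pos++;
    }
    const std::size_t end = input.find(END_TAG, pos);
    if (end == std::string::npos) {
        return false;
    }
    out = input.substr(pos, end - pos);
    return true;
}
```

With this rule, both `<start>thought\nhello<end>` and the empty block `<start>thought<end>` parse; a parser that insisted on the newline would reject the second form.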

b8742

vulkan: Support Q1_0 (ggml-org#21539)

* vulkan: Support Q1_0

* use get_dm

b8741

common : add fluidity to the progress bar (ggml-org#21671)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

b8740

CUDA: fuse muls (ggml-org#21665)

b8739

HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (ggml-org#21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:

- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
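
The routing described in this commit message can be summarised as a small lookup. This is a sketch only: the enums and the function `pick_mfma` are hypothetical, the real code selects paths with preprocessor macros in mma.cuh, and the CDNA3 f32 entry is an assumption inferred from the note that the xf32 variant is what CDNA4 lacks.

```cpp
#include <string>

// Hypothetical enums for illustration; not part of ggml.
enum class Arch  { CDNA3, CDNA4 };
enum class DType { F32, BF16 };

// Returns the name of the MFMA instruction for a given architecture and data type,
// following the routing listed in the commit message.
std::string pick_mfma(Arch arch, DType type) {
    switch (type) {
        case DType::BF16:
            // Same instruction on CDNA3 and CDNA4.
            return "mfma_f32_16x16x16bf16_1k";
        case DType::F32:
            // CDNA4 has no xf32 MFMA, so it falls back to the plain f32 variant.
            // Assumption: CDNA3 uses the xf32 variant on this path.
            return arch == Arch::CDNA4 ? "mfma_f32_16x16x4f32"
                                       : "mfma_f32_16x16x8_xf32";
    }
    return "";
}
```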

b8738

ggml: backend-agnostic tensor parallelism (experimental) (ggml-org#19378)

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (ggml-org#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (ggml-org#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (ggml-org#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (ggml-org#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.

* Enable the previous allreduce implementation. It is better in both perf and stability (ggml-org#12)

* delay AllReduce for MoE for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (ggml-org#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVINO, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (ggml-org#17)

* meta : formatting, naming, indentation (ggml-org#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
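
The Qwen-30B-A3B Q4_0 crash described in this commit message comes down to simple arithmetic, which the sketch below works through. The function `splits_evenly` is hypothetical and not part of ggml, and the GPU counts are illustrative assumptions; the commit message does not say how many devices were used. The 32-element figure is the Q4_0 block size.

```cpp
#include <cstdint>

// Returns true if `dim` elements can be divided into `n_devs` equal slices,
// each made of a whole number of `granularity`-sized blocks.
bool splits_evenly(int64_t dim, int64_t granularity, int64_t n_devs) {
    if (dim <= 0 || granularity <= 0 || n_devs <= 0) {
        return false;
    }
    if (dim % granularity != 0) {
        return false; // the dimension is not a whole number of blocks
    }
    const int64_t n_blocks = dim / granularity;
    return n_blocks % n_devs == 0; // every device must get the same block count
}
```

With an intermediate dimension of 768 and a granularity of 256 there are only 3 blocks, so `splits_evenly(768, 256, 2)` is false. Choosing the granularity from the quantization type gives 768 / 32 = 24 blocks for Q4_0, which divide evenly across 2 or 4 devices.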

b8737

ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) (ggml-org#21676)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
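
The general pattern behind this change is to stop discarding a status code that a call returns. The sketch below shows that pattern with stand-ins so it compiles without the CUDA toolkit: `fake_error_t`, `fake_sort`, and `check` are all hypothetical names, and this is not the code from the commit.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for cudaError_t so the sketch builds without CUDA headers.
enum fake_error_t { FAKE_SUCCESS = 0, FAKE_ERROR_INVALID_VALUE = 1 };

// Hypothetical helper: throw if a call reported failure instead of silently dropping the status.
inline void check(fake_error_t err, const char * what) {
    if (err != FAKE_SUCCESS) {
        throw std::runtime_error(std::string(what) + " failed with code " +
                                 std::to_string(static_cast<int>(err)));
    }
}

// Stand-in for a CUB-style call that reports failure through its return value.
fake_error_t fake_sort(int n_items) {
    return n_items >= 0 ? FAKE_SUCCESS : FAKE_ERROR_INVALID_VALUE;
}
```

Wrapping each call as `check(fake_sort(n), "fake_sort")` surfaces a failure at the call site rather than letting later code run on invalid results.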

b8734

common : fix ambiguous grammar rule in gemma4 (ggml-org#21661)

* common : fix ambiguous grammar rule in gemma4

* cont : fix missing comma...

b8733

common : simplify autoparser tagged parser rules (ggml-org#21216)

* common : simplify autoparser tagged parser rules

* cont : remove upper limit on optional args

* cont : revert changes to parsing at the end

* cont : undo arbitrary ordering of optional args

* cont : fix uninitialized required parameters

* revert to simplify merge

* re-apply patches

* restore flexible optional arg ordering tests