Tags: tetherto/qvac

ai-sdk-provider-v0.2.1

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

chore[notask]: bump bare-fetch to ^3.0.1 in ocr-ggml and translation-nmtcpp (#2584)

Aligns both addons with the latest bare-fetch major already used by rag
(0.6.3) and ocr-onnx, removing the duplicate older bare-fetch major from
the dependency tree.

- @qvac/ocr-ggml: 0.2.0 -> 0.2.1
- @qvac/translation-nmtcpp: 6.0.0 -> 6.0.1

vla-v0.4.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)

* feat[api]: fold DocTR detection BatchNorm into conv weights

The DBNet detection graph applied BatchNorm as a runtime scale/shift after
every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where
bias_br is a zero tensor for the (bias=False) BN convs. That is three
full-tensor elementwise passes per conv on top of the conv itself.

Fold the per-channel BN scale into the F16 conv weights and combine the conv
bias and BN shift into a single bias at load time:

    out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift)

so the runtime graph collapses to `conv + add(combined) + act` (one pass).
This removes ~60 elementwise passes from the detection graph, which matters
most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of
the DocTR pipeline.
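
The identity above can be checked on a toy case. The following is a minimal sketch, not the DocTR code: it stands in a 1x1 convolution (a per-output-channel dot product) for the real conv, omits the activation, and uses made-up weights. All function names are hypothetical.

```python
def conv1x1(weights, x):
    # weights[oc][ic], x[ic] -> y[oc]; stand-in for the real convolution.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def unfolded(weights, bias, scale, shift, x):
    # conv + add(bias) + mul(scale) + add(shift): three elementwise passes.
    y = conv1x1(weights, x)
    y = [a + b for a, b in zip(y, bias)]
    y = [a * s for a, s in zip(y, scale)]
    return [a + t for a, t in zip(y, shift)]


def fold(weights, bias, scale, shift):
    # Done once at load time, per output channel.
    folded_w = [[s * w for w in row] for row, s in zip(weights, scale)]
    combined = [s * b + t for b, s, t in zip(bias, scale, shift)]
    return folded_w, combined


def folded(folded_w, combined, x):
    # conv + add(combined): a single elementwise pass.
    y = conv1x1(folded_w, x)
    return [a + c for a, c in zip(y, combined)]


weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.0, 0.0]  # zero, as for the bias=False BN convs
scale = [2.0, 0.5]
shift = [0.125, -1.0]
x = [1.0, 2.0, 3.0]

reference = unfolded(weights, bias, scale, shift, x)
folded_w, combined = fold(weights, bias, scale, shift)
result = folded(folded_w, combined, x)
```

With these inputs both paths produce the same per-channel outputs; the folded path simply moves the multiply from run time to load time.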

foldScaleIntoConv() folds scale into the preceding conv weight (per output
channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both
normal BN (running stats present) and the offline-folded identity path. The
sub-pixel transposed conv in the prob head is left as a runtime scale/shift
since its weight is reshaped at graph build.

Numerically exact: region counts unchanged (365/197/187/197) and all DocTR
integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms,
clinical 858->841ms.

* feat[api]: fold DocTR recognizer BatchNorm into conv weights

Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature
extractor: fold each BN's per-channel scale into the preceding conv's F16
weights at load time so applyBn drops the runtime multiply and keeps only the
shift add. The feature-extractor graph runs once per recognition batch (dozens
of times on a dense page), so removing a full-tensor multiply per conv is
amplified across the page.

Numerically exact: region counts unchanged and all DocTR integration quality
tests pass on Apple Metal (M4). Combined with the detection fold: ct_scan
1173->1134ms, clinical 858->829ms.

* feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU

The DBNet detector and CRNN recognizer feature extractors are dominated by
depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to
im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically
slow on Metal. A skip-test (replacing every depthwise conv with a cheap same-shape
op) showed that recognition time was almost entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4).

Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW),
which runs as a single fused kernel (one read, one write, no im2col buffer). This
requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork
(qvac-ext-ggml); CPU and Vulkan already implement it.
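
To illustrate the difference between the two lowerings, here is a pure-Python sketch (stride 1, no padding, hypothetical function names). It is not ggml's implementation; it only shows that the direct form computes each output in one pass, while the im2col form first materializes a patch buffer and then runs one small matmul per channel.

```python
def depthwise_direct(x, kernels):
    """Direct depthwise conv: x[c][h][w], kernels[c][kh][kw], stride 1, no padding."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = height - kh + 1, width - kw + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):
        for oy in range(out_h):
            for ox in range(out_w):
                acc = 0.0
                for ky in range(kh):
                    for kx in range(kw):
                        acc += x[c][oy + ky][ox + kx] * kernels[c][ky][kx]
                out[c][oy][ox] = acc
    return out


def depthwise_im2col(x, kernels):
    """Same result via an explicit patch buffer plus one small matmul per channel."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h, out_w = height - kh + 1, width - kw + 1
    out = []
    for c in range(channels):
        # im2col: one row of kh*kw values per output pixel (the extra buffer).
        patches = [
            [x[c][oy + ky][ox + kx] for ky in range(kh) for kx in range(kw)]
            for oy in range(out_h)
            for ox in range(out_w)
        ]
        flat_kernel = [kernels[c][ky][kx] for ky in range(kh) for kx in range(kw)]
        # Per-channel matmul: [out_h*out_w, kh*kw] x [kh*kw].
        flat_out = []
        for row in patches:
            acc = 0.0
            for a, b in zip(row, flat_kernel):
                acc += a * b
            flat_out.append(acc)
        out.append([flat_out[oy * out_w:(oy + 1) * out_w] for oy in range(out_h)])
    return out


x = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]],
]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]

direct = depthwise_direct(x, kernels)
via_im2col = depthwise_im2col(x, kernels)
```

Both functions return the same tensor; the im2col variant just pays for the intermediate `patches` buffer and a separate tiny matmul per channel.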

Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs
on every backend (CPU's conv_2d_dw_direct requires F32, while Metal and Vulkan
also accept F16). F32 is perf-neutral on the GPU: the per-channel K*K
weights are register-resident and the F32 activations dominate bandwidth either
way (measured identical). The load-time BN-scale fold now folds into F16 or F32
weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for
depthwise convs.

Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection
and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms,
lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests
pass on Metal AND forced-CPU (region counts identical, keyword asserts intact).
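
The quoted percentage range can be reproduced from the timings above. This small sketch only redoes the arithmetic on the numbers listed in this message; the variable names are illustrative.

```python
# (before_ms, after_ms) pairs as quoted in the commit message.
timings_ms = {
    "clinical": (858, 584),
    "ct_scan": (1173, 754),
    "lab": (838, 579),
    "liver": (841, 569),
}


def reduction_percent(before, after):
    # Relative latency reduction, in percent of the original time.
    return 100.0 * (before - after) / before


reductions = {
    name: round(reduction_percent(before, after), 1)
    for name, (before, after) in timings_ms.items()
}
```

The per-document reductions land between roughly 31% and 36%, matching the range stated above.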

* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers)

Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml,
embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that
builds fabric from the temp-8828 merge commit of the depthwise-conv kernel
(qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the
new fabric across all consumers — and gives ocr-ggml the kernel its
ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a
registry port is cut. Revert these overlays + bump to the tagged port before merge.

* style: clang-format DocTR depthwise + BatchNorm fold

* fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip

Address review feedback on the DocTR BatchNorm fold:
- The unsupported-dtype guard now names the actual ggml type and states
  that quantized conv weights are not supported by the fold (was a
  generic "unexpected conv weight dtype").
- The output-channel mismatch error now reports the conv output-channel
  count (oc) vs the BN scale size.
- Comment the F16 weight scale: decode->scale->re-encode is required
  because F16 has no arithmetic; it is not an f16->f16 copy.

No behavior change; messages/comments only.
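
The decode->scale->re-encode round trip mentioned above can be sketched as follows. This is an illustration in Python using the `struct` module's half-precision format, not the C++ code in the addon; it assumes F16 values are held as raw 16-bit patterns, and the helper names are hypothetical.

```python
import struct


def f16_bits_to_float(bits):
    # Reinterpret a raw 16-bit pattern as an IEEE 754 half-precision value.
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_f16_bits(value):
    # Round a float to half precision and return its raw 16-bit pattern.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def scale_f16_weights(weight_bits, scale):
    # The stored value is only a bit pattern, so decode, scale in float,
    # then re-encode. Multiplying the raw integers would be meaningless.
    return [float_to_f16_bits(f16_bits_to_float(b) * scale) for b in weight_bits]


weights = [0x3C00, 0x4000, 0xB800]  # 1.0, 2.0, -0.5 in half precision
scaled = scale_f16_weights(weights, 0.5)
```

Each output pattern encodes the scaled value (0.5, 1.0, -0.25), which differs from the input pattern, so the operation is not a plain copy.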

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports

Scope this PR to the all-consumer qvac-fabric overlay-port bump (which
validates that the new fabric does not regress any consumer). The DocTR
depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise
weight promotion + BN-fold dtype handling) will land separately.

Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/
to the base branch state.

* Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports"

This reverts commit 8c753e8.

* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry

8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg
(tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric
version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports
that pinned the unreleased temp-8828 commit.

The default-registry baseline is intentionally unchanged: vcpkg resolves
version>= against the registry HEAD, so a fixed baseline still picks up the
new tagged version.
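
As a rough mental model of the behavior described above, the selected version is the highest of the baseline version and any `version>=` floor, provided that version is published. The sketch below is a deliberately simplified illustration under that assumption, not vcpkg's actual resolver, and every name in it is hypothetical.

```python
def parse(version):
    # "8828.1.1" -> (8828, 1, 1) so versions compare numerically.
    return tuple(int(part) for part in version.split("."))


def resolve(baseline_version, minimums, published):
    # Minimum-version selection: take the lowest version that satisfies
    # the baseline and every version>= floor, i.e. the highest floor.
    wanted = max([baseline_version, *minimums], key=parse)
    if wanted not in published:
        raise LookupError(f"version {wanted} is not published in the registry")
    return wanted


# Assumed registry state after the 8828.1.1 port was published.
published = {"8828.1.0", "8828.1.1"}

before_bump = resolve("8828.1.0", ["8828.1.0"], published)
after_bump = resolve("8828.1.0", ["8828.1.1"], published)
```

In this model the baseline stays at 8828.1.0 in both calls; only the `version>=` floor changes, and that alone moves the resolved version to 8828.1.1.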

Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml,
translation-nmtcpp, vla-ggml.

* chore[notask]: bump addon versions for qvac-fabric 8828.1.1

Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for
the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel).
Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not
auto-picked.

- classification-ggml 0.3.1 -> 0.4.0
- embed-llamacpp      0.19.1 -> 0.20.0
- llm-llamacpp        0.24.0 -> 0.25.0
- ocr-ggml            0.1.1  -> 1.0.0   (major; cuts the prior Unreleased section)
- translation-nmtcpp  5.0.1  -> 5.1.0
- vla-ggml            0.3.2  -> 0.4.0
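
Whether a bump escapes a caret range depends on the leading version component. The sketch below implements npm-style caret matching for plain `major.minor.patch` versions, assuming the SDK pins each addon with a caret range anchored at the previous version (the actual ranges are not shown in this message, so that anchoring is an assumption).

```python
def parse(version):
    return tuple(int(part) for part in version.split("."))


def caret_upper_bound(base):
    # npm caret semantics: the leftmost non-zero component must not change.
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_caret(version, range_base):
    # True if `version` falls inside ^range_base.
    candidate, base = parse(version), parse(range_base)
    return base <= candidate < caret_upper_bound(base)


vla_picked = satisfies_caret("0.4.0", "0.3.1")          # 0.x minor bump
vla_patch_picked = satisfies_caret("0.3.2", "0.3.1")    # 0.x patch bump
nmtcpp_picked = satisfies_caret("5.1.0", "5.0.1")       # >=1.0 minor bump
```

Under these assumptions a minor bump on a 0.x package (for example 0.3.1 -> 0.4.0) falls outside `^0.3.1`, which is the effect described above. For a package at 1.0 or later, a minor bump such as 5.0.1 -> 5.1.0 would still satisfy `^5.0.1`, so whether it is auto-picked depends on how the SDK actually pins that addon.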

* style: clang-format DocTR recognizer BN-fold tensor get/set

Wrap the two over-long ggml_backend_tensor_get/set calls in the F16
BN-fold branch to satisfy the lint-cpp clang-format config
(AlignAfterOpenBracket: AlwaysBreak). No behavior change.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

sdk-v0.13.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)


ocr-ggml-v0.2.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)


llamacpp-llm-v0.25.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)


llamacpp-embed-v0.20.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)

* feat[api]: fold DocTR detection BatchNorm into conv weights

The DBNet detection graph applied BatchNorm as a runtime scale/shift after
every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where
bias_br is a zero tensor for the (bias=False) BN convs. That is three
full-tensor elementwise passes per conv on top of the conv itself.

Fold the per-channel BN scale into the F16 conv weights and combine the conv
bias and BN shift into a single bias at load time:

    out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift)

so the runtime graph collapses to `conv + add(combined) + act` (one pass).
This removes ~60 elementwise passes from the detection graph, which matters
most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of
the DocTR pipeline.

foldScaleIntoConv() folds scale into the preceding conv weight (per output
channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both
normal BN (running stats present) and the offline-folded identity path. The
sub-pixel transposed conv in the prob head is left as a runtime scale/shift
since its weight is reshaped at graph build.

Numerically exact: region counts unchanged (365/197/187/197) and all DocTR
integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms,
clinical 858->841ms.

* feat[api]: fold DocTR recognizer BatchNorm into conv weights

Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature
extractor: fold each BN's per-channel scale into the preceding conv's F16
weights at load time so applyBn drops the runtime multiply and keeps only the
shift add. The feature-extractor graph runs once per recognition batch (dozens
of times on a dense page), so removing a full-tensor multiply per conv is
amplified across the page.

Numerically exact: region counts unchanged and all DocTR integration quality
tests pass on Apple Metal (M4). Combined with the detection fold: ct_scan
1173->1134ms, clinical 858->829ms.

* feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU

The DBNet detector and CRNN recognizer feature extractors are dominated by
depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to
im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically
slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape
op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4).

Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW),
which runs as a single fused kernel (one read, one write, no im2col buffer). This
requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork
(qvac-ext-ggml); CPU and Vulkan already implement it.
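
As a rough illustration of what a direct depthwise convolution computes, here is a minimal Python sketch: each channel is filtered with its own K x K kernel in a single pass, with no im2col buffer and no per-channel matmul. It assumes stride 1, no padding and no dilation, and uses plain nested lists. The function name `depthwise_conv2d_direct` is hypothetical; this is not the Metal kernel or the ggml implementation.

```python
def depthwise_conv2d_direct(x, w):
    """Direct depthwise conv (cross-correlation, no kernel flip).

    x[c][h][w] is the input, w[c][kh][kw] holds one filter per channel.
    Stride 1, no padding. Returns out[c][oh][ow].
    """
    out = []
    for xc, wc in zip(x, w):              # one independent filter per channel
        kh, kw = len(wc), len(wc[0])
        oh, ow = len(xc) - kh + 1, len(xc[0]) - kw + 1
        plane = [
            [
                sum(xc[i + di][j + dj] * wc[di][dj]
                    for di in range(kh) for dj in range(kw))
                for j in range(ow)
            ]
            for i in range(oh)
        ]
        out.append(plane)
    return out

# Made-up 2-channel, 3x3 input with 2x2 filters.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],    # channel 0
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],    # channel 1
]
w = [
    [[1, 0], [0, 1]],                     # channel 0 filter
    [[1, 1], [1, 1]],                     # channel 1 filter
]
result = depthwise_conv2d_direct(x, w)
```

Each output element reads its K x K input window once and writes once, which is the property the fused kernel relies on.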

Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs
on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan also accept
F16). F32 is perf-neutral on the GPU: the per-channel K*K weights are
register-resident and the F32 activations dominate bandwidth either way
(measured identical). The load-time BN-scale fold now folds into F16 or F32
weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for
depthwise convs.

Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection
and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms,
lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests
pass on Metal AND forced-CPU (region counts identical, keyword asserts intact).

* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers)

Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml,
embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that
builds fabric from the temp-8828 merge commit of the depthwise-conv kernel
(qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the
new fabric across all consumers — and gives ocr-ggml the kernel its
ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a
registry port is cut. Revert these overlays + bump to the tagged port before merge.

* style: clang-format DocTR depthwise + BatchNorm fold

* fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip

Address review feedback on the DocTR BatchNorm fold:
- The unsupported-dtype guard now names the actual ggml type and states
  that quantized conv weights are not supported by the fold (was a
  generic "unexpected conv weight dtype").
- The output-channel mismatch error now reports conv oc vs BN scale size.
- Comment the F16 weight scale: decode->scale->re-encode is required
  because F16 has no arithmetic; it is not an f16->f16 copy.

No behavior change; messages/comments only.
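
The decode->scale->re-encode round-trip mentioned above can be sketched in Python using the standard `struct` module's half-precision format (`e`). The function names are hypothetical and the sample bit patterns are made up; the actual code operates on ggml F16 tensors.

```python
import struct

def f16_bits_to_float(bits):
    """Decode a raw 16-bit IEEE 754 half pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def float_to_f16_bits(value):
    """Encode a float as a raw 16-bit half pattern (rounds to nearest)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def scale_f16_weights(raw_bits, scale):
    """decode -> scale -> re-encode; multiplying raw bit patterns would be meaningless."""
    return [float_to_f16_bits(f16_bits_to_float(b) * scale) for b in raw_bits]

raw = [0x3C00, 0x4000, 0xB800]            # 1.0, 2.0, -0.5 as F16 bit patterns
scaled = scale_f16_weights(raw, 0.5)
decoded = [f16_bits_to_float(b) for b in scaled]
```

The stored bits are an encoding, not a number, so the scale has to be applied to the decoded value and the result re-encoded (with rounding back to half precision).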

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports

Scope this PR to the all-consumer qvac-fabric overlay-port bump (which
validates that the new fabric does not regress any consumer). The DocTR
depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise
weight promotion + BN-fold dtype handling) will land separately.

Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/
to the base branch state.

* Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports"

This reverts commit 8c753e8.

* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry

8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg
(tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric
version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports
that pinned the unreleased temp-8828 commit.

The default-registry baseline is intentionally unchanged: vcpkg resolves
version>= against the registry HEAD, so a fixed baseline still picks up the
new tagged version.

Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml,
translation-nmtcpp, vla-ggml.

* chore[notask]: bump addon versions for qvac-fabric 8828.1.1

Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for
the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel).
Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not
auto-picked.

- classification-ggml 0.3.1 -> 0.4.0
- embed-llamacpp      0.19.1 -> 0.20.0
- llm-llamacpp        0.24.0 -> 0.25.0
- ocr-ggml            0.1.1  -> 1.0.0   (major; cuts the prior Unreleased section)
- translation-nmtcpp  5.0.1  -> 5.1.0
- vla-ggml            0.3.2  -> 0.4.0
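
To see how a caret range treats these bumps, here is a minimal sketch of npm-style caret matching in Python. It assumes plain `major.minor.patch` versions with no prerelease tags, and assumes a caret range on the previous version of each package; the SDK's actual dependency ranges are not shown in this log. The function names are hypothetical.

```python
def parse(version):
    """'0.3.1' -> (0, 3, 1); no prerelease/build metadata handled."""
    return tuple(int(part) for part in version.split("."))

def caret_upper_bound(base):
    """Exclusive upper bound of ^base under npm caret semantics."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)      # ^1.2.3 -> <2.0.0
    if minor > 0:
        return (0, minor + 1, 0)      # ^0.3.1 -> <0.4.0
    return (0, 0, patch + 1)          # ^0.0.3 -> <0.0.4

def satisfies_caret(version, base):
    """True if version falls inside ^base."""
    v, b = parse(version), parse(base)
    return b <= v < caret_upper_bound(b)

zero_major_minor_bump = satisfies_caret("0.4.0", "0.3.1")   # classification-ggml
zero_major_patch_bump = satisfies_caret("0.3.2", "0.3.1")
nonzero_major_minor_bump = satisfies_caret("5.1.0", "5.0.1")
```

For the 0.x packages a minor bump falls outside `^0.y.z`, which matches the stated intent. For a package at major 1 or above, a caret range does admit minor bumps, so whether 5.1.0 is excluded depends on the range the SDK actually declares.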

* style: clang-format DocTR recognizer BN-fold tensor get/set

Wrap the two over-long ggml_backend_tensor_get/set calls in the F16
BN-fold branch to satisfy the lint-cpp clang-format config
(AlignAfterOpenBracket: AlwaysBreak). No behavior change.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

classification-ggml-v0.4.0

feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)

rag-v0.6.3

chore[notask]: release infer-base 0.6.1 (bump bare-events to ^2.9.1, pin→caret) (#2563)

(cherry picked from commit a909fc6)

infer-base-v0.6.1

chore[notask]: release infer-base 0.6.1 (bump bare-events to ^2.9.1, pin→caret) (#2563)

(cherry picked from commit a909fc6)

infer-base-v0.4.2

QVAC-19255 feat[api]: reintroduce Supertonic GPU support (desktop/iOS; Android CPU-only) (#2506)

* QVAC-19254 test[skiplog]: overlay tts-cpp@f7d4d6c to validate Android dlopen fix

Adds a package-local vcpkg overlay port that rebuilds tts-cpp from
qvac-ext-lib-whisper.cpp@f7d4d6c (the published 2026-06-05 pin 128dae42 +
one fix commit) instead of the registry port, so the tts-ggml Android
prebuild is built against the QVAC-19254 follow-up fix that is not yet
published to qvac-registry-vcpkg.

f7d4d6c reroutes Supertonic's direct CPU-backend calls that are unlinkable
under GGML_BACKEND_DL=ON (the Android per-arch dlopen CPU build):
  - ggml_get_type_traits_cpu(...)->from_float -> ggml_quantize_chunk()
  - ggml_backend_is_cpu() -> tts_cpp::detail::backend_is_cpu() registry shim

Files:
- packages/tts-ggml/ports/tts-cpp/{portfile.cmake,vcpkg.json}: overlay copy
  of the registry 2026-06-05 port with REF -> f7d4d6c and the recomputed
  GitHub-archive SHA512. Build options are otherwise byte-identical.
- packages/tts-ggml/vcpkg-configuration.json: add "overlay-ports": ["ports"]
  so vcpkg prefers the local port over the registry tts-cpp.

Temporary validation aid, NOT for merge/release (no version bump, [skiplog]).
Needs the `verified` label to trigger the Android prebuild + mobile job.

Verify the fix took -- assert on the BINARY, not the mobile test result (the
addon mobile suite is a false-green: it swallows the load error via
Bare.on('unhandledRejection') in test/mobile/integration-runtime.cjs):
  llvm-readelf --dyn-syms prebuilds/android-arm64/qvac__tts-ggml.bare \
    | grep -E 'UND.*(ggml_backend_is_cpu|ggml_get_type_traits_cpu)'
Expect: no UND CPU-backend symbols. Remove this overlay once the fix is
published to the registry and consumed via vcpkg.json.
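
The same binary check can be scripted. Below is a minimal Python sketch that applies the grep pattern above to captured `llvm-readelf --dyn-syms` output. The function name `undefined_cpu_symbols` and the sample symbol-table lines are hypothetical, written only to resemble readelf's column layout; they are not taken from a real prebuild.

```python
import re

# Same pattern as the grep above: an undefined (UND) entry naming one of the
# CPU-backend symbols that must not be referenced directly.
CPU_SYMBOLS = re.compile(
    r"UND.*(ggml_backend_is_cpu|ggml_get_type_traits_cpu)"
)

def undefined_cpu_symbols(readelf_output):
    """Return the offending symbol names found in --dyn-syms output."""
    found = []
    for line in readelf_output.splitlines():
        match = CPU_SYMBOLS.search(line)
        if match:
            found.append(match.group(1))
    return found

# Made-up sample lines in readelf's dynamic-symbol layout.
SAMPLE_BAD = """\
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT   UND ggml_backend_is_cpu
     2: 0000000000012340    64 FUNC    GLOBAL DEFAULT    12 ggml_quantize_chunk
"""
SAMPLE_GOOD = """\
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT   UND malloc
     2: 0000000000012340    64 FUNC    GLOBAL DEFAULT    12 ggml_quantize_chunk
"""

bad = undefined_cpu_symbols(SAMPLE_BAD)
good = undefined_cpu_symbols(SAMPLE_GOOD)
```

An empty list corresponds to the "no UND CPU-backend symbols" expectation; a non-empty list would fail the check regardless of what the mobile test suite reports.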

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: reintroduce Supertonic GPU support (desktop/iOS), keep Android CPU-only

Re-lands the QVAC-19255 Supertonic GPU feature that 0.2.2 reverted, built on
the QVAC-19254 follow-up fix (consumed via the f7d4d6c overlay from the
previous commit) so the addon no longer crashes at dlopen on Android.

- vcpkg.json: tts-cpp version>= 2026-06-03#1 -> 2026-06-05 (the overlay
  provides f7d4d6c = 2026-06-05 + the CPU-symbol fix).
- SupertonicModel.cpp / index.js: remove the "CPU only today" useGPU
  rejection so Supertonic GPU intent flows through to tts-cpp on GPU-capable
  hosts (Metal / Vulkan / CUDA). The cross-field conflict check is preserved.
- KEEP the SupertonicModel::loadLocked #ifdef __ANDROID__ force-off: Adreno
  Vulkan/OpenCL ggml graph compute still aborts (same family as the parakeet
  Adreno crash), so useGPU=true on Android transparently falls back to CPU.
- gpu-smoke.test.js: Supertonic GPU smoke must engage a GPU backend on
  GPU-capable platforms and SKIPS Android (mirrors Chatterbox) instead of the
  old "rejected at constructor" assertion.
- Flip the C++ config unit tests to acceptance + keep a conflict-rejection
  test; refresh inference-test assertion text and the README / index.d.ts /
  examples docs. SupertonicConfig.hpp useGpu docstring updated.
- Bump 0.2.2 -> 0.3.0.

The ports/tts-cpp overlay + overlay-ports entry are interim until f7d4d6c is
published to qvac-registry-vcpkg.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>