Tags: tetherto/qvac
chore[notask]: bump bare-fetch to ^3.0.1 in ocr-ggml and translation-nmtcpp (#2584)

Aligns both addons with the latest bare-fetch major already used by rag (0.6.3) and ocr-onnx, removing the duplicate older bare-fetch major from the dependency tree.

- @qvac/ocr-ggml: 0.2.0 -> 0.2.1
- @qvac/translation-nmtcpp: 6.0.0 -> 6.0.1
feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)

* feat[api]: fold DocTR detection BatchNorm into conv weights

  The DBNet detection graph applied BatchNorm as a runtime scale/shift after every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where bias_br is a zero tensor for the (bias=False) BN convs. That is three full-tensor elementwise passes per conv on top of the conv itself.

  Fold the per-channel BN scale into the F16 conv weights and combine the conv bias and BN shift into a single bias at load time:

      out = scale*(W*x + bias) + shift
          = (scale*W)*x + (scale*bias + shift)

  so the runtime graph collapses to `conv + add(combined) + act` (one pass). This removes ~60 elementwise passes from the detection graph, which matters most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of the DocTR pipeline.

  foldScaleIntoConv() folds scale into the preceding conv weight (per output channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both normal BN (running stats present) and the offline-folded identity path. The sub-pixel transposed conv in the prob head is left as a runtime scale/shift since its weight is reshaped at graph build.

  Numerically exact: region counts unchanged (365/197/187/197) and all DocTR integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms, clinical 858->841ms.

* feat[api]: fold DocTR recognizer BatchNorm into conv weights

  Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature extractor: fold each BN's per-channel scale into the preceding conv's F16 weights at load time so applyBn drops the runtime multiply and keeps only the shift add. The feature-extractor graph runs once per recognition batch (dozens of times on a dense page), so removing a full-tensor multiply per conv is amplified across the page.

  Numerically exact: region counts unchanged and all DocTR integration quality tests pass on Apple Metal (M4). Combined with the detection fold: ct_scan 1173->1134ms, clinical 858->829ms.

* feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU

  The DBNet detector and CRNN recognizer feature extractors are dominated by depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically slow on Metal: a skip-test (replacing every depthwise with a cheap same-shape op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4).

  Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW), which runs as a single fused kernel (one read, one write, no im2col buffer). This requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork (qvac-ext-ggml); CPU and Vulkan already implement it.

  Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16 too but CPU does not). F32 is perf-neutral on the GPU: the per-channel K*K weights are register-resident and the F32 activations dominate bandwidth either way (measured identical). The load-time BN-scale fold now folds into F16 or F32 weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for depthwise.

  Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms, lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests pass on Metal AND forced-CPU (region counts identical, keyword asserts intact).
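The fold identity above, out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift), can be checked on a toy depthwise conv. What follows is a minimal sketch in plain Python, not the addon's foldScaleIntoConv(): every function name and all sample data are hypothetical, values are ordinary Python floats rather than F16/F32 tensors, and the conv is assumed to be stride 1 with no padding. The conv loop reads the input directly (no im2col buffer), loosely mirroring the idea behind a direct depthwise kernel.

```python
def depthwise_conv2d(x, w):
    """Direct depthwise conv, stride 1, no padding.

    x: [C][H][W] input, w: [C][KH][KW] per-channel kernels.
    Returns [C][H-KH+1][W-KW+1]. Hypothetical helper, illustration only.
    """
    out = []
    for xc, wc in zip(x, w):
        kh, kw = len(wc), len(wc[0])
        oh, ow = len(xc) - kh + 1, len(xc[0]) - kw + 1
        out.append([
            [sum(wc[i][j] * xc[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)
        ])
    return out


def conv_bn_runtime(x, w, bias, scale, shift):
    """Unfolded path: conv, then add(bias), mul(scale), add(shift)."""
    y = depthwise_conv2d(x, w)
    return [[[scale[ch] * (v + bias[ch]) + shift[ch] for v in row]
             for row in plane]
            for ch, plane in enumerate(y)]


def fold_bn(w, bias, scale, shift):
    """Load-time fold: scale each channel's kernel, merge bias and shift."""
    folded_w = [[[scale[ch] * v for v in row] for row in kernel]
                for ch, kernel in enumerate(w)]
    folded_b = [scale[ch] * bias[ch] + shift[ch] for ch in range(len(w))]
    return folded_w, folded_b


def conv_folded(x, folded_w, folded_b):
    """Folded path: conv with pre-scaled weights, then one combined add."""
    y = depthwise_conv2d(x, folded_w)
    return [[[v + folded_b[ch] for v in row] for row in plane]
            for ch, plane in enumerate(y)]


def max_abs_diff(a, b):
    """Largest elementwise difference between two [C][H][W] nested lists."""
    return max(abs(p - q)
               for pa, pb in zip(a, b)
               for ra, rb in zip(pa, pb)
               for p, q in zip(ra, rb))


# Made-up sample data: 2 channels, 4x4 input, 3x3 kernels.
x = [[[float(r * 4 + c + ch) for c in range(4)] for r in range(4)]
     for ch in range(2)]
w = [[[0.5, -1.0, 0.25], [1.0, 0.0, -0.5], [0.125, 2.0, -0.25]],
     [[-0.5, 0.75, 1.0], [0.25, -1.5, 0.5], [1.0, 0.0, -0.125]]]
bias, scale, shift = [0.0, 0.5], [1.5, 0.25], [-0.75, 2.0]

reference = conv_bn_runtime(x, w, bias, scale, shift)
fw, fb = fold_bn(w, bias, scale, shift)
folded = conv_folded(x, fw, fb)
print("max |reference - folded| =", max_abs_diff(reference, folded))
```

The folded path does one elementwise add per conv instead of three elementwise passes, which is the saving the commit messages describe. In floating point the two paths agree only up to rounding, so the check uses a difference rather than equality.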
* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers)

  Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that builds fabric from the temp-8828 merge commit of the depthwise-conv kernel (qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the new fabric across all consumers, and gives ocr-ggml the kernel its ggml_conv_2d_dw_direct path needs, before fabric is tagged 8828.1.1 and a registry port is cut. Revert these overlays + bump to the tagged port before merge.

* style: clang-format DocTR depthwise + BatchNorm fold

* fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip

  Address review feedback on the DocTR BatchNorm fold:
  - The unsupported-dtype guard now names the actual ggml type and states that quantized conv weights are not supported by the fold (was a generic "unexpected conv weight dtype").
  - output-channel mismatch now reports conv oc vs BN scale size.
  - Comment the F16 weight scale: decode->scale->re-encode is required because F16 has no arithmetic; it is not an f16->f16 copy.

  No behavior change; messages/comments only.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports

  Scope this PR to the all-consumer qvac-fabric overlay-port bump (which validates that the new fabric does not regress any consumer). The DocTR depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise weight promotion + BN-fold dtype handling) will land separately. Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/ to the base branch state.

* Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports"

  This reverts commit 8c753e8.
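The decode->scale->re-encode round trip mentioned in the fix[api] entry above can be illustrated with Python's struct module, whose `e` format is IEEE 754 half precision. This is only a sketch under the assumption that F16 weights are held as raw 16-bit patterns on the host; the helper names are hypothetical and it is not the addon's code.

```python
import struct


def f16_bits_to_float(bits):
    """Decode a raw 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_f16_bits(value):
    """Re-encode a float as a half-precision bit pattern (rounds to nearest)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def scale_f16_weights(raw_bits, scale):
    """Scale stored F16 weights: decode, multiply in float, re-encode.

    Multiplying the raw integers directly would be meaningless, which is
    why the fold cannot be a plain f16 -> f16 copy.
    """
    return [float_to_f16_bits(f16_bits_to_float(b) * scale) for b in raw_bits]


# 0x3C00 encodes 1.0 and 0x4000 encodes 2.0 in half precision.
scaled = scale_f16_weights([0x3C00, 0x4000], 0.5)
print([hex(b) for b in scaled])
```

Each re-encode rounds to the nearest representable half-precision value, so a scaled F16 weight can differ slightly from the same product computed in F32.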
* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry

  8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg (tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports that pinned the unreleased temp-8828 commit. The default-registry baseline is intentionally unchanged: vcpkg resolves version>= against the registry HEAD, so a fixed baseline still picks up the new tagged version.

  Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml.

* chore[notask]: bump addon versions for qvac-fabric 8828.1.1

  Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel). Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not auto-picked.

  - classification-ggml 0.3.1 -> 0.4.0
  - embed-llamacpp 0.19.1 -> 0.20.0
  - llm-llamacpp 0.24.0 -> 0.25.0
  - ocr-ggml 0.1.1 -> 1.0.0 (major; cuts the prior Unreleased section)
  - translation-nmtcpp 5.0.1 -> 5.1.0
  - vla-ggml 0.3.2 -> 0.4.0

* style: clang-format DocTR recognizer BN-fold tensor get/set

  Wrap the two over-long ggml_backend_tensor_get/set calls in the F16 BN-fold branch to satisfy the lint-cpp clang-format config (AlignAfterOpenBracket: AlwaysBreak). No behavior change.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (… …+ all-consumer fabric overlay) (#2536) * feat[api]: fold DocTR detection BatchNorm into conv weights The DBNet detection graph applied BatchNorm as a runtime scale/shift after every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where bias_br is a zero tensor for the (bias=False) BN convs. That is three full-tensor elementwise passes per conv on top of the conv itself. Fold the per-channel BN scale into the F16 conv weights and combine the conv bias and BN shift into a single bias at load time: out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift) so the runtime graph collapses to `conv + add(combined) + act` (one pass). This removes ~60 elementwise passes from the detection graph, which matters most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of the DocTR pipeline. foldScaleIntoConv() folds scale into the preceding conv weight (per output channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both normal BN (running stats present) and the offline-folded identity path. The sub-pixel transposed conv in the prob head is left as a runtime scale/shift since its weight is reshaped at graph build. Numerically exact: region counts unchanged (365/197/187/197) and all DocTR integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms, clinical 858->841ms. * feat[api]: fold DocTR recognizer BatchNorm into conv weights Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature extractor: fold each BN's per-channel scale into the preceding conv's F16 weights at load time so applyBn drops the runtime multiply and keeps only the shift add. The feature-extractor graph runs once per recognition batch (dozens of times on a dense page), so removing a full-tensor multiply per conv is amplified across the page. Numerically exact: region counts unchanged and all DocTR integration quality tests pass on Apple Metal (M4). 
Combined with the detection fold: ct_scan 1173->1134ms, clinical 858->829ms. * feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU The DBNet detector and CRNN recognizer feature extractors are dominated by depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4). Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW), which runs as a single fused kernel (one read, one write, no im2col buffer). This requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork (qvac-ext-ggml); CPU and Vulkan already implement it. Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16 too but CPU does not). F32 is perf-neutral on the GPU — the per-channel K*K weights are register-resident and the F32 activations dominate bandwidth either way (measured identical). The load-time BN-scale fold now folds into F16 or F32 weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for depthwise. Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms, lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests pass on Metal AND forced-CPU (region counts identical, keyword asserts intact). 
* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers) Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that builds fabric from the temp-8828 merge commit of the depthwise-conv kernel (qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the new fabric across all consumers — and gives ocr-ggml the kernel its ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a registry port is cut. Revert these overlays + bump to the tagged port before merge. * style: clang-format DocTR depthwise + BatchNorm fold * fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip Address review feedback on the DocTR BatchNorm fold: - The unsupported-dtype guard now names the actual ggml type and states that quantized conv weights are not supported by the fold (was a generic "unexpected conv weight dtype"). - output-channel mismatch now reports conv oc vs BN scale size. - Comment the F16 weight scale: decode->scale->re-encode is required because F16 has no arithmetic; it is not an f16->f16 copy. No behavior change; messages/comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports Scope this PR to the all-consumer qvac-fabric overlay-port bump (which validates that the new fabric does not regress any consumer). The DocTR depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise weight promotion + BN-fold dtype handling) will land separately. Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/ to the base branch state. * Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports" This reverts commit 8c753e8. 
* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry 8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg (tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports that pinned the unreleased temp-8828 commit. The default-registry baseline is intentionally unchanged: vcpkg resolves version>= against the registry HEAD, so a fixed baseline still picks up the new tagged version. Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml. * chore[notask]: bump addon versions for qvac-fabric 8828.1.1 Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel). Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not auto-picked. - classification-ggml 0.3.1 -> 0.4.0 - embed-llamacpp 0.19.1 -> 0.20.0 - llm-llamacpp 0.24.0 -> 0.25.0 - ocr-ggml 0.1.1 -> 1.0.0 (major; cuts the prior Unreleased section) - translation-nmtcpp 5.0.1 -> 5.1.0 - vla-ggml 0.3.2 -> 0.4.0 * style: clang-format DocTR recognizer BN-fold tensor get/set Wrap the two over-long ggml_backend_tensor_get/set calls in the F16 BN-fold branch to satisfy the lint-cpp clang-format config (AlignAfterOpenBracket: AlwaysBreak). No behavior change. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (… …+ all-consumer fabric overlay) (#2536) * feat[api]: fold DocTR detection BatchNorm into conv weights The DBNet detection graph applied BatchNorm as a runtime scale/shift after every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where bias_br is a zero tensor for the (bias=False) BN convs. That is three full-tensor elementwise passes per conv on top of the conv itself. Fold the per-channel BN scale into the F16 conv weights and combine the conv bias and BN shift into a single bias at load time: out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift) so the runtime graph collapses to `conv + add(combined) + act` (one pass). This removes ~60 elementwise passes from the detection graph, which matters most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of the DocTR pipeline. foldScaleIntoConv() folds scale into the preceding conv weight (per output channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both normal BN (running stats present) and the offline-folded identity path. The sub-pixel transposed conv in the prob head is left as a runtime scale/shift since its weight is reshaped at graph build. Numerically exact: region counts unchanged (365/197/187/197) and all DocTR integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms, clinical 858->841ms. * feat[api]: fold DocTR recognizer BatchNorm into conv weights Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature extractor: fold each BN's per-channel scale into the preceding conv's F16 weights at load time so applyBn drops the runtime multiply and keeps only the shift add. The feature-extractor graph runs once per recognition batch (dozens of times on a dense page), so removing a full-tensor multiply per conv is amplified across the page. Numerically exact: region counts unchanged and all DocTR integration quality tests pass on Apple Metal (M4). 
Combined with the detection fold: ct_scan 1173->1134ms, clinical 858->829ms. * feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU The DBNet detector and CRNN recognizer feature extractors are dominated by depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4). Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW), which runs as a single fused kernel (one read, one write, no im2col buffer). This requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork (qvac-ext-ggml); CPU and Vulkan already implement it. Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16 too but CPU does not). F32 is perf-neutral on the GPU — the per-channel K*K weights are register-resident and the F32 activations dominate bandwidth either way (measured identical). The load-time BN-scale fold now folds into F16 or F32 weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for depthwise. Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms, lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests pass on Metal AND forced-CPU (region counts identical, keyword asserts intact). 
* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers) Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that builds fabric from the temp-8828 merge commit of the depthwise-conv kernel (qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the new fabric across all consumers — and gives ocr-ggml the kernel its ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a registry port is cut. Revert these overlays + bump to the tagged port before merge. * style: clang-format DocTR depthwise + BatchNorm fold * fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip Address review feedback on the DocTR BatchNorm fold: - The unsupported-dtype guard now names the actual ggml type and states that quantized conv weights are not supported by the fold (was a generic "unexpected conv weight dtype"). - output-channel mismatch now reports conv oc vs BN scale size. - Comment the F16 weight scale: decode->scale->re-encode is required because F16 has no arithmetic; it is not an f16->f16 copy. No behavior change; messages/comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports Scope this PR to the all-consumer qvac-fabric overlay-port bump (which validates that the new fabric does not regress any consumer). The DocTR depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise weight promotion + BN-fold dtype handling) will land separately. Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/ to the base branch state. * Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports" This reverts commit 8c753e8. 
* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry 8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg (tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports that pinned the unreleased temp-8828 commit. The default-registry baseline is intentionally unchanged: vcpkg resolves version>= against the registry HEAD, so a fixed baseline still picks up the new tagged version. Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml. * chore[notask]: bump addon versions for qvac-fabric 8828.1.1 Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel). Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not auto-picked. - classification-ggml 0.3.1 -> 0.4.0 - embed-llamacpp 0.19.1 -> 0.20.0 - llm-llamacpp 0.24.0 -> 0.25.0 - ocr-ggml 0.1.1 -> 1.0.0 (major; cuts the prior Unreleased section) - translation-nmtcpp 5.0.1 -> 5.1.0 - vla-ggml 0.3.2 -> 0.4.0 * style: clang-format DocTR recognizer BN-fold tensor get/set Wrap the two over-long ggml_backend_tensor_get/set calls in the F16 BN-fold branch to satisfy the lint-cpp clang-format config (AlignAfterOpenBracket: AlwaysBreak). No behavior change. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (… …+ all-consumer fabric overlay) (#2536) * feat[api]: fold DocTR detection BatchNorm into conv weights The DBNet detection graph applied BatchNorm as a runtime scale/shift after every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where bias_br is a zero tensor for the (bias=False) BN convs. That is three full-tensor elementwise passes per conv on top of the conv itself. Fold the per-channel BN scale into the F16 conv weights and combine the conv bias and BN shift into a single bias at load time: out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift) so the runtime graph collapses to `conv + add(combined) + act` (one pass). This removes ~60 elementwise passes from the detection graph, which matters most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of the DocTR pipeline. foldScaleIntoConv() folds scale into the preceding conv weight (per output channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both normal BN (running stats present) and the offline-folded identity path. The sub-pixel transposed conv in the prob head is left as a runtime scale/shift since its weight is reshaped at graph build. Numerically exact: region counts unchanged (365/197/187/197) and all DocTR integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms, clinical 858->841ms. * feat[api]: fold DocTR recognizer BatchNorm into conv weights Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature extractor: fold each BN's per-channel scale into the preceding conv's F16 weights at load time so applyBn drops the runtime multiply and keeps only the shift add. The feature-extractor graph runs once per recognition batch (dozens of times on a dense page), so removing a full-tensor multiply per conv is amplified across the page. Numerically exact: region counts unchanged and all DocTR integration quality tests pass on Apple Metal (M4). 
Combined with the detection fold: ct_scan 1173->1134ms, clinical 858->829ms. * feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU The DBNet detector and CRNN recognizer feature extractors are dominated by depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4). Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW), which runs as a single fused kernel (one read, one write, no im2col buffer). This requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork (qvac-ext-ggml); CPU and Vulkan already implement it. Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16 too but CPU does not). F32 is perf-neutral on the GPU — the per-channel K*K weights are register-resident and the F32 activations dominate bandwidth either way (measured identical). The load-time BN-scale fold now folds into F16 or F32 weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for depthwise. Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms, lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests pass on Metal AND forced-CPU (region counts identical, keyword asserts intact). 
* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers) Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that builds fabric from the temp-8828 merge commit of the depthwise-conv kernel (qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the new fabric across all consumers — and gives ocr-ggml the kernel its ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a registry port is cut. Revert these overlays + bump to the tagged port before merge. * style: clang-format DocTR depthwise + BatchNorm fold * fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip Address review feedback on the DocTR BatchNorm fold: - The unsupported-dtype guard now names the actual ggml type and states that quantized conv weights are not supported by the fold (was a generic "unexpected conv weight dtype"). - output-channel mismatch now reports conv oc vs BN scale size. - Comment the F16 weight scale: decode->scale->re-encode is required because F16 has no arithmetic; it is not an f16->f16 copy. No behavior change; messages/comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports Scope this PR to the all-consumer qvac-fabric overlay-port bump (which validates that the new fabric does not regress any consumer). The DocTR depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise weight promotion + BN-fold dtype handling) will land separately. Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/ to the base branch state. * Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports" This reverts commit 8c753e8. 
* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry 8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg (tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports that pinned the unreleased temp-8828 commit. The default-registry baseline is intentionally unchanged: vcpkg resolves version>= against the registry HEAD, so a fixed baseline still picks up the new tagged version. Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml. * chore[notask]: bump addon versions for qvac-fabric 8828.1.1 Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel). Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not auto-picked. - classification-ggml 0.3.1 -> 0.4.0 - embed-llamacpp 0.19.1 -> 0.20.0 - llm-llamacpp 0.24.0 -> 0.25.0 - ocr-ggml 0.1.1 -> 1.0.0 (major; cuts the prior Unreleased section) - translation-nmtcpp 5.0.1 -> 5.1.0 - vla-ggml 0.3.2 -> 0.4.0 * style: clang-format DocTR recognizer BN-fold tensor get/set Wrap the two over-long ggml_backend_tensor_get/set calls in the F16 BN-fold branch to satisfy the lint-cpp clang-format config (AlignAfterOpenBracket: AlwaysBreak). No behavior change. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (… …+ all-consumer fabric overlay) (#2536) * feat[api]: fold DocTR detection BatchNorm into conv weights The DBNet detection graph applied BatchNorm as a runtime scale/shift after every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where bias_br is a zero tensor for the (bias=False) BN convs. That is three full-tensor elementwise passes per conv on top of the conv itself. Fold the per-channel BN scale into the F16 conv weights and combine the conv bias and BN shift into a single bias at load time: out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift) so the runtime graph collapses to `conv + add(combined) + act` (one pass). This removes ~60 elementwise passes from the detection graph, which matters most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of the DocTR pipeline. foldScaleIntoConv() folds scale into the preceding conv weight (per output channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both normal BN (running stats present) and the offline-folded identity path. The sub-pixel transposed conv in the prob head is left as a runtime scale/shift since its weight is reshaped at graph build. Numerically exact: region counts unchanged (365/197/187/197) and all DocTR integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms, clinical 858->841ms. * feat[api]: fold DocTR recognizer BatchNorm into conv weights Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature extractor: fold each BN's per-channel scale into the preceding conv's F16 weights at load time so applyBn drops the runtime multiply and keeps only the shift add. The feature-extractor graph runs once per recognition batch (dozens of times on a dense page), so removing a full-tensor multiply per conv is amplified across the page. Numerically exact: region counts unchanged and all DocTR integration quality tests pass on Apple Metal (M4). 
Combined with the detection fold: ct_scan 1173->1134ms, clinical 858->829ms. * feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU The DBNet detector and CRNN recognizer feature extractors are dominated by depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4). Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW), which runs as a single fused kernel (one read, one write, no im2col buffer). This requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork (qvac-ext-ggml); CPU and Vulkan already implement it. Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16 too but CPU does not). F32 is perf-neutral on the GPU — the per-channel K*K weights are register-resident and the F32 activations dominate bandwidth either way (measured identical). The load-time BN-scale fold now folds into F16 or F32 weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for depthwise. Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms, lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests pass on Metal AND forced-CPU (region counts identical, keyword asserts intact). 
* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers) Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that builds fabric from the temp-8828 merge commit of the depthwise-conv kernel (qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the new fabric across all consumers — and gives ocr-ggml the kernel its ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a registry port is cut. Revert these overlays + bump to the tagged port before merge. * style: clang-format DocTR depthwise + BatchNorm fold * fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip Address review feedback on the DocTR BatchNorm fold: - The unsupported-dtype guard now names the actual ggml type and states that quantized conv weights are not supported by the fold (was a generic "unexpected conv weight dtype"). - output-channel mismatch now reports conv oc vs BN scale size. - Comment the F16 weight scale: decode->scale->re-encode is required because F16 has no arithmetic; it is not an f16->f16 copy. No behavior change; messages/comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports Scope this PR to the all-consumer qvac-fabric overlay-port bump (which validates that the new fabric does not regress any consumer). The DocTR depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise weight promotion + BN-fold dtype handling) will land separately. Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/ to the base branch state. * Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports" This reverts commit 8c753e8. 
* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry

8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg (tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports that pinned the unreleased temp-8828 commit. The default-registry baseline is intentionally unchanged: vcpkg resolves version>= against the registry HEAD, so a fixed baseline still picks up the new tagged version.

Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml.

* chore[notask]: bump addon versions for qvac-fabric 8828.1.1

Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel). Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not auto-picked.
- classification-ggml 0.3.1 -> 0.4.0
- embed-llamacpp 0.19.1 -> 0.20.0
- llm-llamacpp 0.24.0 -> 0.25.0
- ocr-ggml 0.1.1 -> 1.0.0 (major; cuts the prior Unreleased section)
- translation-nmtcpp 5.0.1 -> 5.1.0
- vla-ggml 0.3.2 -> 0.4.0

* style: clang-format DocTR recognizer BN-fold tensor get/set

Wrap the two over-long ggml_backend_tensor_get/set calls in the F16 BN-fold branch to satisfy the lint-cpp clang-format config (AlignAfterOpenBracket: AlwaysBreak). No behavior change.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
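The caret-range reasoning behind the version bumps above can be checked with a small sketch of npm-style caret semantics. The helper names are hypothetical, and it assumes plain `MAJOR.MINOR.PATCH` versions with no prerelease tags; how the SDK actually declares each dependency is not shown in this log.

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def caret_upper_bound(base):
    """Exclusive upper bound of ^base: the leftmost non-zero part is fixed."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_caret(version, base):
    """True if `version` falls inside the caret range ^base."""
    v, b = parse(version), parse(base)
    return b <= v < caret_upper_bound(b)
```

Under these rules `satisfies_caret("0.4.0", "0.3.1")` is false, so a minor bump on a 0.x package does leave a `^0.3.1` range. For a package at 1.x or later the same rules would still admit a minor bump (for example `^5.0.1` admits `5.1.0`), so that case depends on the exact range the SDK uses.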
QVAC-19255 feat[api]: reintroduce Supertonic GPU support (desktop/iOS; Android CPU-only) (#2506)

* QVAC-19254 test[skiplog]: overlay tts-cpp@f7d4d6c to validate Android dlopen fix

Adds a package-local vcpkg overlay port that rebuilds tts-cpp from qvac-ext-lib-whisper.cpp@f7d4d6c (the published 2026-06-05 pin 128dae42 + one fix commit) instead of the registry port, so the tts-ggml Android prebuild is built against the QVAC-19254 follow-up fix that is not yet published to qvac-registry-vcpkg.

f7d4d6c reroutes Supertonic's direct CPU-backend calls that are unlinkable under GGML_BACKEND_DL=ON (the Android per-arch dlopen CPU build):
- ggml_get_type_traits_cpu(...)->from_float -> ggml_quantize_chunk()
- ggml_backend_is_cpu() -> tts_cpp::detail::backend_is_cpu() registry shim

Files:
- packages/tts-ggml/ports/tts-cpp/{portfile.cmake,vcpkg.json}: overlay copy of the registry 2026-06-05 port with REF -> f7d4d6c and the recomputed GitHub-archive SHA512. Build options are otherwise byte-identical.
- packages/tts-ggml/vcpkg-configuration.json: add "overlay-ports": ["ports"] so vcpkg prefers the local port over the registry tts-cpp.

Temporary validation aid, NOT for merge/release (no version bump, [skiplog]). Needs the `verified` label to trigger the Android prebuild + mobile job.

Verify the fix took -- assert on the BINARY, not the mobile test result (the addon mobile suite is a false-green: it swallows the load error via Bare.on('unhandledRejection') in test/mobile/integration-runtime.cjs):

    llvm-readelf --dyn-syms prebuilds/android-arm64/qvac__tts-ggml.bare \
      | grep -E 'UND.*(ggml_backend_is_cpu|ggml_get_type_traits_cpu)'

Expect: no UND CPU-backend symbols.

Remove this overlay once the fix is published to the registry and consumed via vcpkg.json.
Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: reintroduce Supertonic GPU support (desktop/iOS), keep Android CPU-only

Re-lands the QVAC-19255 Supertonic GPU feature that 0.2.2 reverted, built on the QVAC-19254 follow-up fix (consumed via the f7d4d6c overlay from the previous commit) so the addon no longer crashes at dlopen on Android.
- vcpkg.json: tts-cpp version>= 2026-06-03#1 -> 2026-06-05 (the overlay provides f7d4d6c = 2026-06-05 + the CPU-symbol fix).
- SupertonicModel.cpp / index.js: remove the "CPU only today" useGPU rejection so Supertonic GPU intent flows through to tts-cpp on GPU-capable hosts (Metal / Vulkan / CUDA). The cross-field conflict check is preserved.
- KEEP the SupertonicModel::loadLocked #ifdef __ANDROID__ force-off: Adreno Vulkan/OpenCL ggml graph compute still aborts (same family as the parakeet Adreno crash), so useGPU=true on Android transparently falls back to CPU.
- gpu-smoke.test.js: Supertonic GPU smoke must engage a GPU backend on GPU-capable platforms and SKIPS Android (mirrors Chatterbox) instead of the old "rejected at constructor" assertion.
- Flip the C++ config unit tests to acceptance + keep a conflict-rejection test; refresh inference-test assertion text and the README / index.d.ts / examples docs. SupertonicConfig.hpp useGpu docstring updated.
- Bump 0.2.2 -> 0.3.0.

The ports/tts-cpp overlay + overlay-ports entry are interim until f7d4d6c is published to qvac-registry-vcpkg.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
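The binary check from the overlay commit (grep for undefined CPU-backend symbols in the `llvm-readelf --dyn-syms` output) can also be scripted, for instance to fail a CI step. The sketch below is an assumption-laden illustration: the function name is hypothetical, it takes the already-captured readelf text rather than invoking the tool, and it reuses the same regular expression as the grep command.

```python
import re

# Same pattern as the grep -E expression in the commit message.
_UNDEFINED_CPU_SYMBOL = re.compile(
    r"UND.*(ggml_backend_is_cpu|ggml_get_type_traits_cpu)"
)


def unresolved_cpu_symbols(dyn_syms_text):
    """Return the CPU-backend symbols that appear as UND in readelf output."""
    hits = []
    for line in dyn_syms_text.splitlines():
        match = _UNDEFINED_CPU_SYMBOL.search(line)
        if match:
            hits.append(match.group(1))
    return hits


def assert_fix_applied(dyn_syms_text):
    """Raise if the binary still imports a direct CPU-backend symbol."""
    leftover = unresolved_cpu_symbols(dyn_syms_text)
    if leftover:
        raise AssertionError(f"unresolved CPU-backend symbols: {leftover}")
```

An empty list corresponds to the "no UND CPU-backend symbols" expectation; the text would typically come from capturing the readelf command's standard output.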