
QVAC-20556 feat[api]: enable Android GPU for Parakeet (overlay; CI validation) [DO-NOT-MERGE] #2577

Open
pratiknarola-t wants to merge 3 commits into main from qvac-20556-parakeet-android-gpu


@pratiknarola-t

Contributor

⚠️ DO-NOT-MERGE — measurement vehicle

Overlay-only PR (ticket QVAC-20556) to get an empirical AWS Device Farm signal on whether the latest speech stack drives Parakeet on Android GPUs (Pixel 9 / Mali + S25 Ultra / Adreno 830). This is the inverse of the CPU-only workaround in #2525 — please don't merge over it.

Add the verified label to fire the device-farm leg.

What this changes

packages/transcription-parakeet/:

  • ParakeetModel::load — remove the #ifdef __ANDROID__ guard that forced useGPU=false (kept the n_gpu_layers logic + the GPU-init→CPU fallback warning).
  • CMakeLists.txt — widen the Android backend-staging glob from libqvac-speech-ggml-cpu-*.so to libqvac-speech-ggml-*.so so the vulkan/opencl MODULE libs ship in the prebuild (reverses the [0.7.2] CPU-only packaging); refresh the now-stale "intentionally CPU-only" comments.
  • gpu-smoke.test.js — drop the four Android early-pass skips so the strict assertGpuBackend (backendDevice=1, backendId Vulkan/OpenCL) runs on device.
  • In-package vcpkg overlay ports: ggml-speech@44fd4817 (speech HEAD) + parakeet-cpp@ed749556 (whisper.cpp master), wired via overlay-ports in vcpkg-configuration.json. Registry baseline and registry version>= pins are unchanged; the registry PR is deferred until the device-farm result is understood.
  • vcpkg.json — bump parakeet-cpp version>= to the overlay version-date.
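
For reviewers who have not opened gpu-smoke.test.js, the strict check that now runs on device can be pictured roughly as follows. This is a minimal sketch, not the code in the test file: the shape of `stats` and the `GPU_BACKEND_NAMES` table are assumptions, with the ids taken from the local-device results reported in this PR (CPU 0, Vulkan 3, OpenCL 4).

```javascript
// Hypothetical sketch of a strict GPU-backend assertion.
// Backend ids as observed in this PR: CPU = 0, Vulkan = 3, OpenCL = 4.
const GPU_BACKEND_NAMES = { 3: 'Vulkan', 4: 'OpenCL' }

function assertGpuBackend (stats) {
  // backendDevice=1 means the model actually ran on a GPU device.
  if (stats.backendDevice !== 1) {
    throw new Error(`expected backendDevice=1 (GPU), got ${stats.backendDevice}`)
  }
  // The backend must be one of the two GPU backends shipped in the prebuild.
  const name = GPU_BACKEND_NAMES[stats.backendId]
  if (name === undefined) {
    throw new Error(`expected backendId Vulkan/OpenCL, got ${stats.backendId}`)
  }
  return name
}
```

With the Android early-pass skips removed, a silent CPU fallback (backendDevice=0) fails this kind of check instead of passing unnoticed.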

Local device finding (Adreno 740 / iQOO 11, TDT q4_0)

Run directly against this branch's prebuild on a physically-attached Adreno 740:

| Path | Backend | Result |
|---|---|---|
| CPU (useGPU=false) | CPU (id 0) | ✅ correct transcript |
| GPU, engine default | OpenCL (id 4, auto-selected on Adreno>700) | SIGABRT: ggml_backend_opencl_graph_compute: op not supported joint.token_argmax (ARGMAX) → GGML_ASSERT |
| GPU, OpenCL withheld | Vulkan (id 3) | ⚠️ runs, but transcript degraded vs CPU (dropped words) and ~2× slower |

So on the Adreno the engine picks OpenCL, whose backend lacks ARGMAX and aborts in graph-compute instead of falling back to CPU. The Vulkan path (the one ggml-speech@8bf760f4 reported byte-identical on this exact device) is not what the engine selects, and even when forced it no longer reproduces the byte-identical result on the current 44fd4817/ed749556 stack.

Expectation for the device-farm run: the Adreno (S25) leg likely hits the same OpenCL ARGMAX abort (which can SIGABRT the Bare worklet and take down subsequent tests, cf. #2525); the Mali (Pixel 9) leg exercises the Vulkan path.
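
The selection behaviour observed here (OpenCL on Adreno>700, Vulkan once the OpenCL module is withheld) can be modelled with a small function. This is a sketch under assumptions, not the engine's code: the GPU name string format, the `available` list, and the Vulkan-then-CPU ordering for everything else are all hypothetical.

```javascript
// Hypothetical model of the backend selection policy described in this PR.
// gpuName: device-reported GPU name, e.g. "Adreno (TM) 740" (format assumed).
// available: GPU backend modules that were actually staged and loaded.
function pickGpuBackend (gpuName, available) {
  const match = /Adreno.*?(\d{3,})/i.exec(gpuName)
  const adrenoModel = match ? Number(match[1]) : null

  // Policy reported in the PR: Adreno > 700 prefers OpenCL.
  if (adrenoModel !== null && adrenoModel > 700 && available.includes('OpenCL')) {
    return 'OpenCL'
  }
  // Assumed: anything else uses Vulkan when present, otherwise CPU.
  if (available.includes('Vulkan')) return 'Vulkan'
  return 'CPU'
}

console.log(pickGpuBackend('Adreno (TM) 740', ['OpenCL', 'Vulkan']))
console.log(pickGpuBackend('Adreno (TM) 740', ['Vulkan']))
```

The second call mirrors the "OpenCL withheld" row: removing the module from the prebuild is what forces the Vulkan path on an Adreno device.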

Note (pre-existing, out of scope)

While bringing this up on a local device, I found that the addon's BACKENDS_SUBDIR compile definition is PRIVATE on the bare-module target, but ParakeetModel.cpp compiles into parakeet_model_core, so the subdir isn't appended to a host-provided default backendsDir. The device-farm APK passes an explicit flat nativeLibraryDir, so CI is unaffected, but a host relying on the __dirname/prebuilds default would not find the backend .so. Noted as a follow-up; not touched here.
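
To make the intended lookup concrete, here is a rough model of the directory resolution. It is a sketch only: `resolveBackendsDir`, its option names, and the `'android-arm64'` value are hypothetical stand-ins, and the real logic lives in C++ behind the BACKENDS_SUBDIR compile definition.

```javascript
const path = require('node:path')

// Stand-in for the BACKENDS_SUBDIR compile definition; the value is made up.
const BACKENDS_SUBDIR = 'android-arm64'

function resolveBackendsDir ({ backendsDir, defaultRoot }) {
  // An explicit directory (e.g. the APK's flat nativeLibraryDir) is used as-is.
  if (backendsDir) return backendsDir
  // The default root needs the subdir appended. This is the step that is lost
  // when the definition is not visible to the target compiling ParakeetModel.cpp.
  return path.posix.join(defaultRoot, BACKENDS_SUBDIR)
}

console.log(resolveBackendsDir({ defaultRoot: '/app/prebuilds' }))
```

If the subdir is dropped, the second branch returns the bare prebuilds root, which is where the backend .so lookup would miss.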

Refs

…lidation)

DO-NOT-MERGE — overlay-only PR to get an empirical AWS Device Farm signal on
whether the latest speech stack drives Parakeet on Android GPUs (Pixel 9/Mali +
S25/Adreno 830). This is the inverse of the CPU-only workaround in #2525.

Changes (packages/transcription-parakeet):
- ParakeetModel::load — remove the __ANDROID__ guard that forced useGPU=false.
- CMakeLists — widen the Android backend-staging glob from
  libqvac-speech-ggml-cpu-*.so to libqvac-speech-ggml-*.so so the Vulkan/OpenCL
  MODULE libs ship in the prebuild (reverses the [0.7.2] CPU-only packaging);
  refresh the now-stale "intentionally CPU-only" comments.
- gpu-smoke.test.js — drop the four Android early-pass skips so the strict
  assertGpuBackend (backendDevice=1, backendId Vulkan/OpenCL) runs on device.
- vcpkg overlay ports (in-package) — ggml-speech@44fd4817 (speech HEAD) +
  parakeet-cpp@ed749556 (whisper.cpp master), wired via the overlay-ports entry
  in vcpkg-configuration.json. Registry baseline and registry version>= pins are
  unchanged; the registry PR is deferred.
- vcpkg.json — bump parakeet-cpp version>= to the overlay version-date.

Local device finding (Adreno 740 / iQOO 11), TDT q4_0, recorded for reviewers:
- CPU: correct transcript, backendDevice=0.
- GPU OpenCL (engine auto-selects this on Adreno>700): aborts in graph-compute —
  "op not supported joint.token_argmax (ARGMAX)" -> GGML_ASSERT (SIGABRT).
- GPU Vulkan (forced by withholding the OpenCL module): runs (backendId=3) but
  output is degraded vs CPU (dropped words) and ~2x slower; NOT the byte-identical
  result ggml-speech 8bf760f4 reported. Expect the Device Farm Adreno (S25) leg to
  hit the OpenCL ARGMAX abort and the Mali leg to exercise the Vulkan path.
  Do not merge — this is a measurement vehicle.
@pratiknarola-t

Contributor Author

Local Adreno 740 (iQOO 11) matrix — refined

Ran each model type directly against this branch's prebuild on a physically-attached Adreno 740. On Adreno the engine auto-selects OpenCL (policy: Adreno>700 → OpenCL). Results:

Model CPU OpenCL (GPU, auto) Vulkan (GPU, OpenCL withheld)
TDT (q4_0) ✅ correct SIGABRT: ggml_backend_opencl_graph_compute: op not supported joint.token_argmax (ARGMAX) → GGML_ASSERT ⚠️ runs (backendId=3) but transcript degraded vs CPU + ~2× slower
EOU (q4_0) ✅ correct (95 tokens)
Sortformer (q8_0) ✅ correct (speaker labels)
CTC n/a on mobile n/a

Takeaway: the GPU blocker is narrow — TDT's joint.token_argmax (ARGMAX) is not implemented in the ggml OpenCL backend, and supports_op/graph-compute aborts instead of falling back to CPU. EOU and Sortformer run fine on OpenCL. The Vulkan path supports the op (no crash) but is degraded/slower on this device, and is not what the engine selects on Adreno anyway.

Implications for the Device Farm run:

  • Adreno (S25/830) leg: EOU + Sortformer GPU should pass; the TDT GPU smoke will likely SIGABRT (and a Bare-worklet abort can cascade to later tests).
  • Mali (Pixel 9) leg: exercises the Vulkan path (no OpenCL on non-Adreno) — separate unknown.

Fix directions (follow-up, not in this PR): implement ARGMAX in ggml-opencl, OR make ggml-opencl supports_op return false for ARGMAX so it routes to CPU, OR have parakeet-cpp keep the TDT joint argmax on CPU.
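
The second fix direction relies on a scheduler that places each op on the first backend that claims to support it. A toy model of that idea, assuming a much simpler interface than ggml's real backend API (the `supportsOp` callbacks, the `assignBackends` helper, and the `joint.logits` node name are all hypothetical):

```javascript
// Toy scheduler: assign each graph node to the first backend that supports its op.
function assignBackends (nodes, backends) {
  return nodes.map((node) => {
    const backend = backends.find((b) => b.supportsOp(node.op))
    if (!backend) throw new Error(`no backend supports ${node.op} (${node.name})`)
    return { name: node.name, op: node.op, backend: backend.name }
  })
}

const cpu = { name: 'CPU', supportsOp: () => true }
// An OpenCL backend that honestly reports ARGMAX as unsupported.
const opencl = { name: 'OpenCL', supportsOp: (op) => op !== 'ARGMAX' }

const plan = assignBackends(
  [
    { name: 'joint.logits', op: 'MUL_MAT' },
    { name: 'joint.token_argmax', op: 'ARGMAX' }
  ],
  [opencl, cpu] // preference order: GPU first, CPU as the fallback
)
console.log(plan)
```

In this model the abort seen on device corresponds to a backend answering "supported" for an op it cannot run, so the node never reaches the CPU entry in the list.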

Separately, a pre-existing latent bug surfaced during bring-up: the addon's BACKENDS_SUBDIR compile-def is PRIVATE on the bare-module target while ParakeetModel.cpp compiles into parakeet_model_core, so the subdir isn't appended to a host-provided default backendsDir (__dirname/prebuilds). The device-farm APK passes an explicit flat nativeLibraryDir, so CI is unaffected — but a host relying on the default would not find the backend .so.

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Mobile integration tests — @qvac/transcription-parakeet (Android)

Result: passed

metric value
Devices passed 2
Devices failed 0
Test cases total 6
Test cases passed 6
Test cases failed 0
Test cases skipped 0

View workflow run

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Mobile integration tests — @qvac/transcription-parakeet (iOS)

Result: failed

metric value
Devices passed 1
Devices failed 1
Test cases total 6
Test cases passed 4
Test cases failed 0
Test cases skipped 0

View workflow run

Bump the parakeet-cpp overlay to 06cef8e7 (off ed749556). The TDT
transducer's per-step GPU graphs do an in-place read-and-write of the LSTM
persistent state; Adreno OpenCL drops those aliased ggml_cpy writes, so the
prediction state never advances and the decode emits one constant token per
frame. The fix routes the TDT decode to the host scalar path on OpenCL while
the encoder still runs on the GPU (stats.backendId stays OpenCL). enc_proj is
write-only so it's fine; EOU/Sortformer don't use this persistent-state
pattern, so they already ran correctly on OpenCL.

Verified on-device (Adreno 740 / iQOO 11): TDT-OpenCL now matches the CPU
baseline byte-for-byte; TDT-Vulkan/CPU and EOU/Sortformer-OpenCL unchanged.

- vcpkg-overlay-ports/parakeet-cpp: REF/SHA512 -> 06cef8e7, version-date 2026-06-15
- vcpkg.json: parakeet-cpp version>= 2026-06-15
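
The failure mode in this commit message can be illustrated with a toy decode loop. This is not parakeet-cpp code: the arithmetic is an arbitrary stand-in for the joint network and LSTM step, and `dropAliasedWrites` is a hypothetical flag that imitates a backend losing the in-place state update.

```javascript
// Toy transducer decode loop with a persistent prediction state.
function decode (numFrames, { dropAliasedWrites }) {
  let state = 0
  const tokens = []
  for (let i = 0; i < numFrames; i++) {
    const token = state % 5            // stand-in for joint + argmax
    tokens.push(token)
    const next = state * 3 + token + 1 // stand-in for the LSTM state update
    // A backend that drops the aliased write leaves the old state in place.
    if (!dropAliasedWrites) state = next
  }
  return tokens
}

// Routing in the spirit of the fix: host decode on OpenCL, GPU decode elsewhere.
function decodePathFor (backendName) {
  return backendName === 'OpenCL' ? 'host' : 'gpu'
}

console.log(decode(4, { dropAliasedWrites: false })) // [ 0, 1, 0, 1 ]
console.log(decode(4, { dropAliasedWrites: true }))  // [ 0, 0, 0, 0 ]
```

When the write is dropped, every step sees the initial state and emits the same token, which matches the "one constant token per frame" symptom.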
@pratiknarola-t force-pushed the qvac-20556-parakeet-android-gpu branch from 11c94aa to f1fa6e3 on June 15, 2026 10:17
Bump the parakeet-cpp overlay to bb585eb1: ARM Mali (Valhall) Vulkan
mis-computes every parakeet model (its narrow subgroup width breaks the
ggml-vulkan shaders), so the engine guards Mali by name and routes it to CPU;
Adreno OpenCL and Samsung Xclipse Vulkan are correct and run on the GPU. TDT
host-decode on Adreno OpenCL is unchanged.

The addon surfaces engine gpu_unsupported() as stats.gpuUnsupported, and the
GPU smoke test treats a CPU backend with gpuUnsupported=1 on Android as the
expected, correct result instead of a GPU regression.
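
A rough picture of how the smoke test might classify a run after this change. It is a sketch under assumptions: `classifyGpuSmoke`, the `platform` argument, and the result labels are hypothetical, while `backendDevice` and `gpuUnsupported` are the stats fields named in this PR.

```javascript
// Hypothetical classification of a GPU smoke-test result.
function classifyGpuSmoke (stats, platform) {
  // Ran on a GPU device: the normal passing case.
  if (stats.backendDevice === 1) return 'gpu-ok'
  // CPU backend, but the engine declared the GPU unsupported (e.g. guarded Mali):
  // expected and correct on Android, not a regression.
  if (platform === 'android' && stats.gpuUnsupported === 1) return 'cpu-expected'
  // CPU backend with no such declaration: the GPU path silently fell back.
  return 'gpu-regression'
}

console.log(classifyGpuSmoke({ backendDevice: 0, gpuUnsupported: 1 }, 'android'))
```

Keeping the third branch strict preserves the original purpose of the test: an unexplained CPU fallback still fails.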

Labels

verified Authorize secrets / label-gate in PR workflows


1 participant