akeybl/voice-loop

Voice Loop for Codex

Voice Loop for Codex is a local, wake-word-driven voice interface for Codex. It listens through your microphone, transcribes speech locally with whisper.cpp, sends the text to a normal Codex app-server session, and reads Codex's replies aloud with Kokoro.

What makes it different from a simple speech-to-text wrapper:

  • Local STT and TTS: whisper.cpp for transcription, Kokoro for speech.
  • Wake word by default: say "Jarvis" to start, then speak naturally during the orange-border cooldown.
  • Wake-word voice matching: the wake-word audio seeds a local speaker profile, so follow-up speech during cooldown has to sound like the same speaker before it can interrupt or reach Codex.
  • Barge-in: interrupt Codex while it is speaking; non-empty transcripts cancel the old response.
  • Echo control: WebRTC AEC uses the actual playback stream as a speaker reference so the assistant is less likely to hear itself.
  • Spoken-output-aware prompting: Codex is told that messages are dictated and replies are spoken.
  • Interruption recovery: interrupted turns include the spoken cutoff so Codex does not assume unheard text was conveyed.
  • Prototype local web display: large chat bubbles, wake-state border, streaming text, tool-call waiting indicators, and interrupted text fading. Microphone recording and response playback still run from the CLI.
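As a rough illustration of the interruption-recovery idea above, the sketch below builds a note that tells the model how much of its previous reply was actually spoken. The function name, the character-count cutoff, and the wording of the note are all hypothetical; the project's real prompt format is not shown in this README.

```python
def describe_interruption(full_reply: str, spoken_chars: int) -> str:
    """Build a note describing how much of a reply was heard (hypothetical helper)."""
    # Clamp the cutoff so out-of-range values cannot slice incorrectly.
    spoken_chars = max(0, min(spoken_chars, len(full_reply)))
    heard = full_reply[:spoken_chars].rstrip()
    unheard = full_reply[spoken_chars:].strip()

    if not unheard:
        return "The previous reply was spoken in full."
    if not heard:
        return "The user interrupted before any of the previous reply was spoken."
    return (
        "The user interrupted the previous reply. "
        f'Only this part was spoken aloud: "{heard}". '
        "Do not assume the rest was heard."
    )
```

A note like this could be prepended to the next dictated message so the model does not treat unheard text as already conveyed.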

Quick Start

This project targets macOS on Apple Silicon first. It has only been tested by the author on an Apple MacBook Pro 14-inch (2026, M5 Max).

git clone <repo-url>
cd voice-loop
./scripts/bootstrap.sh
./run.sh

The default run is equivalent to:

./run.sh --codex-new --effort medium --wake-word-mode openwakeword --wake-word "jarvis" --wake-word-cooldown-seconds 5
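To make the defaults concrete, here is a minimal argparse sketch that mirrors the flags and default values listed above. It is an illustration only: the real CLI's parser may be structured differently, and limiting `--wake-word-mode` to the two modes named in this README is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser reproducing the documented default run."""
    parser = argparse.ArgumentParser(prog="run.sh")
    # BooleanOptionalAction also generates the matching --no-codex-new flag.
    parser.add_argument("--codex-new", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--effort", default="medium")
    parser.add_argument(
        "--wake-word-mode",
        choices=["openwakeword", "transcript"],  # assumption: only modes named in the README
        default="openwakeword",
    )
    parser.add_argument("--wake-word", default="jarvis")
    parser.add_argument("--wake-word-cooldown-seconds", type=float, default=5.0)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With a parser shaped like this, running with no arguments and running with the full flag list above would yield the same settings.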

The first run may download model files from Hugging Face and openWakeWord. An HF_TOKEN is optional, but it can raise your Hugging Face rate limits.

In the default openWakeWord mode, the detected wake-word audio also seeds SpeechBrain speaker matching. Follow-up speech during playback or the five-second cooldown can omit "Jarvis", but it must match that speaker profile before the client pauses playback, transcribes, or sends anything to Codex.
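The gate described above can be sketched as a similarity check between two speaker embeddings, applied only during playback or the cooldown window. This is a simplified stand-in, not the project's implementation: the function names, the plain-list embeddings, and the 0.5 threshold are assumptions, and the real client obtains its embeddings from SpeechBrain.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accept_follow_up(
    profile: Sequence[float],
    candidate: Sequence[float],
    seconds_since_wake: float,
    playing: bool,
    cooldown_seconds: float = 5.0,
    threshold: float = 0.5,  # hypothetical value, not taken from the project
) -> bool:
    """Accept wake-word-free speech only in the window and only from the same speaker."""
    in_window = playing or seconds_since_wake <= cooldown_seconds
    return in_window and cosine_similarity(profile, candidate) >= threshold
```

Speech that fails either check would be dropped before playback is paused or anything is transcribed, which matches the ordering described above.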

Prerequisites

Bootstrap installs or verifies the local developer dependencies it can manage:

  • Homebrew
  • Python 3.12
  • CMake, Git, Make, pkg-config
  • PortAudio and espeak-ng
  • whisper.cpp, built from source under third_party/
  • Whisper base.en and Silero VAD models

You must also have the Codex CLI installed and authenticated. The runtime talks to Codex with:

codex app-server --listen stdio://
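For orientation, the sketch below shows one way a Python process could launch that command with pipes attached to its standard input and output. The helper names are hypothetical, and the messages exchanged over the pipes are deliberately left out because this README does not describe them.

```python
import shutil
import subprocess

# Arguments taken from the command shown above.
APP_SERVER_ARGS = ["app-server", "--listen", "stdio://"]


def app_server_command(executable: str = "codex") -> list[str]:
    """Return the argv list for the app-server process."""
    return [executable, *APP_SERVER_ARGS]


def start_app_server(executable: str = "codex") -> subprocess.Popen:
    """Start the app server with text-mode pipes (hypothetical helper)."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    return subprocess.Popen(
        app_server_command(executable),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
```

Checking for the executable first gives a clearer error than a failed spawn when the Codex CLI is not installed.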

Check your setup at any time:

./scripts/doctor.sh

For local development and tests, bootstrap with ./scripts/bootstrap.sh --dev.

Common Commands

./run.sh --list-devices
./run.sh --self-test third_party/whisper.cpp/samples/jfk.wav
./run.sh --no-web-client
./run.sh --wake-word-mode transcript --wake-word "Codex"
./run.sh --wake-word-cooldown-seconds 0
./run.sh --no-speaker-match
./run.sh --no-codex-new
./run.sh --model gpt-5-codex --effort high
./run.sh --kokoro-speed 1.0
./run.sh --no-playback-alignment

Advanced flags and tuning notes live in docs/configuration.md.

Privacy and Model Behavior

Microphone audio is processed locally for VAD, wake-word detection, speaker matching, and transcription. Kokoro TTS also runs locally after model files are downloaded. The dictated transcript is sent to Codex because Codex is the assistant backend.

Generated state is intentionally ignored by Git:

  • .venv/
  • third_party/
  • .codex-voice/
  • .openwakeword/
  • .speechbrain/
  • generated .wav and transcript artifacts

Project Layout

src/codex_voice_loop/   Python package and CLI
models/                 Redistributable packaged wake-word models
scripts/                Bootstrap and doctor scripts
docs/                   Runtime configuration and troubleshooting
tests/                  Fast behavior tests
third_party/            Ignored upstream checkouts created by bootstrap

License

Voice Loop for Codex source code is released under the MIT License. The included models/jarvis.onnx wake-word model is a separate OpenWakeWord library asset licensed for personal and non-commercial use by default; commercial use requires an OpenWakeWord commercial wake-word license. See LICENSE and THIRD_PARTY_NOTICES.md.
