Voice Loop for Codex is a local, wake-word-driven voice interface for Codex. It listens through your microphone, transcribes speech locally with whisper.cpp, sends the text to a normal Codex app-server session, and reads Codex's replies aloud with Kokoro.
What makes it different from a simple speech-to-text wrapper:
- Local STT and TTS: whisper.cpp for transcription, Kokoro for speech.
- Wake word by default: say "Jarvis" to start, then speak naturally during the orange-border cooldown.
- Wake-word voice matching: the wake-word audio seeds a local speaker profile, so follow-up speech during cooldown has to sound like the same speaker before it can interrupt or reach Codex.
- Barge-in: interrupt Codex while it is speaking; non-empty transcripts cancel the old response.
- Echo control: WebRTC AEC uses the actual playback stream as a speaker reference so the assistant is less likely to hear itself.
- Spoken-output-aware prompting: Codex is told that messages are dictated and replies are spoken.
- Interruption recovery: interrupted turns include the spoken cutoff so Codex does not assume unheard text was conveyed.
- Prototype local web display: large chat bubbles, wake-state border, streaming text, tool-call waiting indicators, and interrupted text fading. Microphone recording and response playback still run from the CLI.
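The barge-in and interruption-recovery behavior listed above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names (`should_cancel`, `split_at_cutoff`, `build_interrupted_turn`), the character-offset cutoff, and the bracketed note wording are all assumptions made for the example.

```python
from typing import Tuple


def should_cancel(transcript: str) -> bool:
    """Barge-in rule: only a non-empty transcript cancels the old response."""
    return bool(transcript.strip())


def split_at_cutoff(response: str, spoken_chars: int) -> Tuple[str, str]:
    """Split a response into the part that was spoken and the part that was not.

    `spoken_chars` is a hypothetical character offset reported by playback.
    """
    cutoff = max(0, min(spoken_chars, len(response)))
    return response[:cutoff], response[cutoff:]


def build_interrupted_turn(response: str, spoken_chars: int, transcript: str) -> str:
    """Prefix the new user message with a note about what was actually heard."""
    heard, unheard = split_at_cutoff(response, spoken_chars)
    lines = [
        "[The user interrupted your previous spoken reply.]",
        f'[They heard only: "{heard.strip()}"]',
    ]
    if unheard.strip():
        # Hypothetical wording; the point is that unheard text is flagged.
        lines.append("[The rest of that reply was never spoken.]")
    lines.append(transcript.strip())
    return "\n".join(lines)


if __name__ == "__main__":
    reply = "The build passed. Next, I will run the tests."
    if should_cancel("wait, skip the tests"):
        print(build_interrupted_turn(reply, 17, "wait, skip the tests"))
```

Carrying the cutoff into the next turn lets the assistant repeat or drop the unheard portion instead of assuming it was conveyed.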
This project is macOS Apple Silicon-first. It has only been tested by the author on an Apple MacBook Pro 14-Inch (2026, M5 Max).
git clone <repo-url>
cd voice-loop
./scripts/bootstrap.sh
./run.sh
The default run is equivalent to:
./run.sh --codex-new --effort medium --wake-word-mode openwakeword --wake-word "jarvis" --wake-word-cooldown-seconds 5
The first run may download model files from Hugging Face and openWakeWord. An HF_TOKEN is optional, but it can improve Hugging Face rate limits.
In the default openWakeWord mode, the detected wake-word audio also seeds SpeechBrain speaker matching. Follow-up speech during playback or the five-second cooldown can omit "Jarvis", but it must match that speaker profile before the client pauses playback, transcribes, or sends anything to Codex.
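One way to picture the gate described above is a cosine-similarity check against the embedding captured at wake-word time. The sketch below is a simplified model under stated assumptions: the `FollowUpGate` class, the 0.5 threshold, and the plain list embeddings are hypothetical, and it does not call SpeechBrain or reflect how the project actually computes or compares speaker profiles.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class FollowUpGate:
    cooldown_seconds: float = 5.0   # matches the default cooldown flag
    threshold: float = 0.5          # hypothetical similarity threshold
    profile: Optional[List[float]] = None
    window_ends_at: float = 0.0

    def on_wake_word(self, embedding: Sequence[float], now: float) -> None:
        """Seed the speaker profile from the wake-word audio and open the window."""
        self.profile = list(embedding)
        self.window_ends_at = now + self.cooldown_seconds

    def allows(self, embedding: Sequence[float], now: float, playing: bool = False) -> bool:
        """True if follow-up speech may pause playback, be transcribed, or reach Codex."""
        if self.profile is None:
            return False
        in_window = playing or now <= self.window_ends_at
        return in_window and cosine_similarity(self.profile, embedding) >= self.threshold


if __name__ == "__main__":
    gate = FollowUpGate()
    gate.on_wake_word([1.0, 0.0], now=0.0)
    print(gate.allows([0.9, 0.1], now=2.0))  # similar voice inside the cooldown
    print(gate.allows([0.0, 1.0], now=2.0))  # different voice inside the cooldown
```

Checking the speaker before pausing playback means an unrelated voice in the room cannot interrupt a reply, even during the cooldown.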
Bootstrap installs or verifies the local developer dependencies it can manage:
- Homebrew
- Python 3.12
- CMake, Git, Make, pkg-config
- PortAudio and espeak-ng
- whisper.cpp, built from source under third_party/
- Whisper base.en and Silero VAD models
You must also have the Codex CLI installed and authenticated. The runtime talks to Codex with:
codex app-server --listen stdio://
Check your setup at any time:
./scripts/doctor.sh
For local development and tests, bootstrap with ./scripts/bootstrap.sh --dev.
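For readers wiring up something similar, the Codex requirement above amounts to having a `codex` executable on PATH and launching it with the app-server arguments quoted earlier. The sketch below shows one way to do that from Python. The helper names are hypothetical, it is not a substitute for ./scripts/doctor.sh, and it deliberately says nothing about the messages exchanged over stdio.

```python
import shutil
import subprocess
from typing import List

# Arguments taken from the command quoted in this README.
APP_SERVER_ARGS = ["app-server", "--listen", "stdio://"]


def app_server_command(executable: str = "codex") -> List[str]:
    """Build the argv used to start the Codex app-server over stdio."""
    return [executable, *APP_SERVER_ARGS]


def start_app_server(executable: str = "codex") -> subprocess.Popen:
    """Start the app-server with pipes attached, or fail early if the CLI is missing."""
    resolved = shutil.which(executable)
    if resolved is None:
        raise FileNotFoundError(
            f"{executable!r} was not found on PATH; install and authenticate the Codex CLI first"
        )
    return subprocess.Popen(
        app_server_command(resolved),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


if __name__ == "__main__":
    print(" ".join(app_server_command()))
```

Resolving the executable first turns a missing CLI into a clear error message rather than a failure partway through a voice session.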
./run.sh --list-devices
./run.sh --self-test third_party/whisper.cpp/samples/jfk.wav
./run.sh --no-web-client
./run.sh --wake-word-mode transcript --wake-word "Codex"
./run.sh --wake-word-cooldown-seconds 0
./run.sh --no-speaker-match
./run.sh --no-codex-new
./run.sh --model gpt-5-codex --effort high
./run.sh --kokoro-speed 1.0
./run.sh --no-playback-alignment
Advanced flags and tuning notes live in docs/configuration.md.
Microphone audio is processed locally for VAD, wake-word detection, speaker matching, and transcription. Kokoro TTS also runs locally after model files are downloaded. The dictated transcript is sent to Codex because Codex is the assistant backend.
Generated state is intentionally ignored by Git:
- .venv/
- third_party/
- .codex-voice/
- .openwakeword/
- .speechbrain/
- generated .wav and transcript artifacts
src/codex_voice_loop/   Python package and CLI
models/                 Redistributable packaged wake-word models
scripts/                Bootstrap and doctor scripts
docs/                   Runtime configuration and troubleshooting
tests/                  Fast behavior tests
third_party/            Ignored upstream checkouts created by bootstrap
Voice Loop for Codex source code is released under the MIT License. The included models/jarvis.onnx wake-word model is a separate OpenWakeWord library asset licensed for personal and non-commercial use by default; commercial use requires an OpenWakeWord commercial wake-word license. See LICENSE and THIRD_PARTY_NOTICES.md.