
Onju Voice v2 (OnjuClaw) 🍐🦞

Enable multiple "Google Home" speakers to connect to your Mac Mini for talking to your agent(s).

This repo consists of:

  • An async server pipeline handling the ASR -> LLM -> TTS loop for multiple devices, with support for any LLM or agent platform like OpenClaw 🦞
  • Hardware designs for a drop-in replacement PCB to the original Google Nest Mini (2nd gen), using the ESP32-S3 for audio processing and WiFi connectivity (order link)

This is an upgraded version of onju-voice, as demo'd here.

What's new in v2

  • Agentic backend 🦞 -- delegate conversation history, session management, and tool execution to an OpenClaw gateway for centralized, multi-device orchestration
  • Opus compression -- 14-16x downstream compression (server to speaker) for better audio quality over WiFi
  • Streaming-ready architecture -- designed for sentence-level TTS streaming and agentic tool-calling loops
  • Modular async pipeline -- replaced the monolithic server with a pluggable architecture for ASR, LLM, and TTS backends
  • Any LLM -- works with any OpenAI-compatible API (Ollama, mlx_lm, Gemini, OpenRouter, Claude, etc.)
  • Pluggable TTS -- ElevenLabs (recommended) or local via mlx-audio for fully offline operation
  • Silero VAD -- server-side voice activity detection with configurable thresholds, replacing webrtcvad
  • VAD-aware interruption -- tap to interrupt playback and start speaking immediately
  • M5 Echo support -- get started with a $13 dev kit instead of ordering a custom PCB (battery base)
  • One-command flashing -- ./flash.sh handles compilation, WiFi credential generation (from macOS Keychain), and upload. No Arduino IDE or manual configuration required

Supported devices

|                  | Onjuino (custom PCB)                              | M5Stack ATOM Echo            |
|------------------|---------------------------------------------------|------------------------------|
| Board            | ESP32-S3                                          | ESP32-PICO-D4                |
| Interaction      | Capacitive touch: tap to start (uses VAD to end)  | Physical button: hold to talk |
| Mic              | I2S (INMP441)                                     | PDM (SPM1423)                |
| Speaker          | MAX98357A, 6 NeoPixel LEDs                        | NS4168, 1 SK6812 LED         |
| PSRAM            | Yes (2MB playback buffer)                         | No (smaller buffers)         |
| Audio upstream   | mu-law 16kHz UDP (16 KB/s)                        | mu-law 16kHz UDP (16 KB/s)   |
| Audio downstream | Opus 16kHz TCP (~1.5 KB/s)                        | Opus 16kHz TCP (~1.5 KB/s)   |

Both targets use the same network protocol and connect to the same server. See the M5 Echo README for hardware-specific details.

Architecture

                ESP32 Device                              Server Pipeline
  ┌──────────────────────────────┐       ┌──────────────────────────────────────┐
  │  Mic > I2S RX > mu-law =======UDP 3000===> mu-law decode > VAD > ASR        │
  │                              │       │                                      │
  │  Speaker < I2S TX < Opus <===TCP 3001<=== Opus encode < TTS < LLM           │
  └──────────────────────────────┘       └──────────────────────────────────────┘

Why mu-law upstream: Stateless sample-by-sample encoding (~1% CPU), zero buffering latency. ASR models handle the quality fine.

Why Opus downstream: Human ears need better quality than ASR models do, and Opus decoding is light enough for the ESP32. Opus gives 14-16x compression vs mu-law's 2x, and TCP guarantees the reliable, ordered delivery the stateful codec requires.
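For reference, the 2x upstream figure comes straight from the codec: G.711 mu-law maps each signed 16-bit PCM sample to a single byte. A textbook encoder sketch in Python (for illustration only; the firmware has its own implementation):

```python
def mulaw_encode(sample: int) -> int:
    """G.711 mu-law encode one signed 16-bit sample into one byte.

    Reference implementation for illustration, not the firmware's code.
    Stateless per-sample work is why the ESP32 spends ~1% CPU on it.
    """
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: index of the highest set bit above bit 7
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

# One byte per 16-bit sample -> half the size of raw PCM
encoded = bytes(mulaw_encode(s) for s in (0, 1000, -1000, 32767))
```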

Device discovery

  1. ESP32 boots and joins WiFi
  2. Sends multicast announcement to 239.0.0.1:12345 with hostname, git hash, and PTT flag
  3. Server discovers device and connects to its TCP server on port 3001
  4. ESP32 learns server IP from the TCP connection and starts sending mic audio via UDP
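The server side of this flow can be sketched in a few lines of Python. The multicast group and port come from the steps above; the announcement field layout (newline-separated hostname, git hash, PTT flag) is an assumption for illustration, so check the firmware for the actual wire format:

```python
import socket
import struct

MCAST_GRP, MCAST_PORT = "239.0.0.1", 12345

def make_discovery_socket() -> socket.socket:
    """UDP socket joined to the device announcement multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(MCAST_GRP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def parse_announcement(payload: bytes) -> dict:
    """Split an announcement into its fields.

    The newline-separated layout here is an assumption for illustration;
    the real format is defined by the firmware.
    """
    hostname, git_hash, ptt = payload.decode().split("\n")
    return {"hostname": hostname, "git_hash": git_hash, "ptt": ptt == "1"}
```

After a packet arrives, the server would connect back to the sender's IP on TCP port 3001 as described in step 3.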

TCP command protocol

All commands use a 6-byte header. The server initiates TCP connections to the ESP32.

| Byte 0 | Command         | Payload                                                                              |
|--------|-----------------|--------------------------------------------------------------------------------------|
| 0xAA   | Audio playback  | mic_timeout (2B), volume, LED fade, compression type, then length-prefixed Opus frames |
| 0xBB   | Set LEDs        | LED bitmask, RGB color                                                               |
| 0xCC   | LED blink (VAD) | intensity, RGB color, fade rate                                                      |
| 0xDD   | Mic timeout     | timeout in seconds (2B)                                                              |

A zero-length Opus frame (0x00 0x00) signals end of speech.
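Putting the 0xAA framing together, here is a hedged sketch of how a server might serialize a playback command. Byte order and exact field packing are assumptions (the table only fixes the field list and the 6-byte header size); the real encoder lives in the pipeline source:

```python
import struct

def build_playback_command(frames: list[bytes], mic_timeout: int = 5,
                           volume: int = 50, led_fade: int = 10,
                           compression: int = 1) -> bytes:
    """Serialize an 0xAA audio playback command.

    Sketch only: little-endian packing is an assumption, and the
    compression-type value is a placeholder.
    """
    # 6-byte header: command, mic_timeout (2B), volume, LED fade, compression
    header = struct.pack("<BHBBB", 0xAA, mic_timeout, volume,
                         led_fade, compression)
    # Length-prefixed Opus frames, then a zero-length frame = end of speech
    body = b"".join(struct.pack("<H", len(f)) + f for f in frames)
    return header + body + struct.pack("<H", 0)
```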

FreeRTOS task layout

| Core   | Task           | Purpose                                                  |
|--------|----------------|----------------------------------------------------------|
| Core 0 | Arduino loop   | TCP server, touch/mute input, UART debug                 |
| Core 1 | micTask        | I2S read, mu-law encode, UDP send                        |
| Core 1 | opusDecodeTask | TCP read, Opus decode, I2S write (created per playback)  |
| Core 1 | updateLedTask  | 40Hz LED refresh with gamma-corrected fade               |

Conversation backends

The pipeline supports two conversation backends, selectable via config.yaml:

Conversational (conversation.backend: "conversational"): Plain chat-completions backend. Manages conversation history client-side with per-device JSON persistence and sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint (OpenRouter, Gemini, Ollama, mlx_lm.server, etc.). Good for simple voice chat with no tool use.

Agentic (conversation.backend: "agentic"): Delegates session management and tool execution to a remote agent gateway like OpenClaw. Only sends the latest user message — the gateway tracks history server-side using the device ID as the session key, and runs its own tool loop (web search, file access, multi-step work). Set OPENCLAW_GATEWAY_TOKEN in your environment and point base_url at your gateway.
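A minimal config.yaml sketch contrasting the two backends. Only conversation.backend, base_url, and the OPENCLAW_GATEWAY_TOKEN variable are documented above; the remaining keys and values here are illustrative, so see pipeline/config.yaml.example for the real schema:

```yaml
conversation:
  backend: "agentic"            # or "conversational"
  conversational:
    base_url: "http://localhost:8080/v1"   # any OpenAI-compatible endpoint
    model: "my-local-model"                # illustrative
  agentic:
    base_url: "http://localhost:18789/v1"  # your OpenClaw gateway (illustrative port)
    # auth is read from the OPENCLAW_GATEWAY_TOKEN environment variable
```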

Setting up OpenClaw

If you have OpenClaw installed, a setup script is included:

./setup_openclaw.sh

This will:

  1. Enable the chat completions HTTP endpoint on the gateway
  2. Append a voice mode prompt to ~/.openclaw/workspace/AGENTS.md (tells the agent to respond in concise, speech-friendly prose when the message channel is onju-voice)
  3. Restart the gateway

Then set conversation.backend: "agentic" in pipeline/config.yaml and ensure OPENCLAW_GATEWAY_TOKEN is set in your environment.

Streaming and stalls w/ OpenClaw

Agentic requests can take 5-60+ seconds while the gateway runs tools, so the pipeline is built to make that wait feel responsive.

Sentence-level streaming. The agent's SSE response is consumed delta by delta. A splitter (pipeline/conversation/__init__.py:sentence_chunks) buffers text until it hits a sentence boundary, then hands the sentence to TTS and pushes the audio to the device. Any narration the agent emits between tool calls gets spoken aloud as it arrives, while the next tool runs in the background. Intermediate sends use mic_timeout=0 so the mic only reopens after the final chunk.
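A simplified sketch of such a splitter (the real one is pipeline/conversation/__init__.py:sentence_chunks; this version only handles `.`, `!`, and `?` boundaries followed by whitespace):

```python
import re

# A sentence ends at ., !, or ? followed by whitespace (simplification)
_BOUNDARY = re.compile(r"(?<=[.!?])\s")

def sentence_chunks(deltas):
    """Yield complete sentences from a stream of SSE text deltas.

    Buffers partial text until a boundary appears, so TTS can start on
    the first full sentence while later deltas are still arriving.
    """
    buf = ""
    for delta in deltas:
        buf += delta
        while (m := _BOUNDARY.search(buf)):
            yield buf[:m.start() + 1].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever trails the last boundary
```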

Contextual stall phrases. Before the main agent call, the pipeline fires a fast classifier that decides whether the question needs a brief spoken acknowledgment. Conversational questions return NONE and get no stall. Tool-needing questions get a short personality-matched phrase that plays within about a second while the agent works. The stall text is then injected back into the agent's user message as a parenthetical continuity note so it doesn't repeat itself. Configure in conversation.stall.

The first-turn caveat with OpenClaw. OpenClaw's OpenAI-compatible endpoint buffers all content from the first agent turn until the first round of tool execution completes. If the model generates an opening sentence and then calls a tool, that sentence is held server-side until the tool finishes. Narration between subsequent tool rounds streams fine. This is why the stall classifier exists: it gives the user a fast spoken acknowledgment that bypasses the gateway's first-turn buffering. See pipeline/conversation/stall.py.

Installation

Server

# Clone and set up Python environment
git clone https://github.com/justLV/onju-v2.git
cd onju-v2
uv venv && source .venv/bin/activate
uv pip install -e .

# macOS: install system libraries for Opus encoding
brew install opus portaudio

# Configure
cp pipeline/config.yaml.example pipeline/config.yaml
# Edit config.yaml with your API keys and preferences

ASR -- an embedded parakeet-mlx server is included (Apple Silicon):

uv pip install -e ".[asr]"
python -m pipeline.services.asr_server  # runs on port 8100

Or point asr.url in config.yaml at any Whisper-compatible endpoint.

LLM -- any OpenAI-compatible server:

# Local (mlx_lm on Apple Silicon)
mlx_lm.server --model unsloth/gemma-4-E4B-it-UD-MLX-4bit --port 8080

# Local (Ollama)
ollama run gemma4:e4b

# Cloud -- just set base_url and api_key in config.yaml (default: Haiku via OpenRouter)

TTS -- ElevenLabs is the default (set your API key in config.yaml). For fully offline TTS, install mlx-audio with uv pip install -e ".[tts-local]" and set tts.backend: "local" in config.yaml. Its quality is below ElevenLabs; it's included as a reference option for local TTS.

Run:

source .venv/bin/activate
python -m pipeline.main

Firmware

Both targets can be compiled and flashed from the command line:

# Flash onjuino (default)
./flash.sh

# Flash M5 Echo
./flash.sh m5_echo

# Compile only (no device needed)
./flash.sh compile

# Regenerate WiFi credentials from macOS Keychain (falls back to manual entry)
./flash.sh --regen

Requires arduino-cli:

# macOS
brew install arduino-cli
arduino-cli core install esp32:esp32

The flash script auto-installs required libraries (Adafruit NeoPixel, esp32_opus).

WiFi credentials are generated from your macOS Keychain on first flash, or you can edit the credentials.h.template files manually.

For Arduino IDE users: select ESP32S3 Dev Module (onjuino) or ESP32 Dev Module (M5 Echo), enable USB CDC on Boot and OPI PSRAM (onjuino only), then build and upload.

Hardware

Preview schematics & PCB | Order from PCBWay | Altium source files and schematics in hardware/.

If you don't have a custom PCB, you can use the M5Stack ATOM Echo. I'd recommend adding a Battery (Biscuit) Base (link).

Configuration reference

See pipeline/config.yaml.example for all options. Key sections:

| Section                     | What it controls                                                              |
|-----------------------------|-------------------------------------------------------------------------------|
| asr                         | Speech-to-text service URL                                                    |
| conversation.backend        | "conversational" (plain chat) or "agentic" (OpenClaw, tools)                  |
| conversation.conversational | LLM endpoint, model, system prompt, message history                           |
| conversation.agentic        | OpenClaw gateway URL, auth token, message channel                             |
| conversation.stall          | Fast classifier that decides if the agentic backend needs a brief spoken stall |
| tts                         | TTS backend ("elevenlabs" or "local"), voice settings                         |
| vad                         | Voice activity detection thresholds and timing                                |
| network                     | UDP/TCP/multicast ports                                                       |
| device                      | Volume, mic timeout, LED settings, greeting audio                             |

Environment variables

| Variable               | Used by                                                  |
|------------------------|----------------------------------------------------------|
| OPENROUTER_API_KEY     | Conversational backend via OpenRouter                    |
| GEMINI_API_KEY         | Conversational backend via Gemini, and the stall classifier |
| ANTHROPIC_API_KEY      | Conversational backend via Anthropic API directly        |
| OPENCLAW_GATEWAY_TOKEN | Agentic (OpenClaw) backend                               |

Testing

All scripts are in tests/ and should be run from the project root.

# Emulate an ESP32 device (no hardware needed)
python tests/test_client.py                  # localhost
python tests/test_client.py 192.168.1.50     # remote server

# Test speaker output (send audio file to device w/ TCP and Opus encoding)
python tests/test_speaker.py <device-ip>

# Test mic input (receive and record UDP audio)
python tests/test_mic.py --duration 10

# Benchmark the stall classifier against the configured Gemini endpoint
python tests/test_stall.py

# Inspect raw SSE chunk shapes from the configured agentic gateway
python tests/test_stream.py
python tests/test_stream.py "your prompt here"

# Serial monitor (auto-detects USB port)
python serial_monitor.py

UART debug commands

Both firmware targets support serial commands at 115200 baud:

| Key | Action                                |
|-----|---------------------------------------|
| r   | Reboot                                |
| M   | Enable mic for 10 minutes             |
| m   | Disable mic                           |
| A   | Re-send multicast announcement        |
| c   | Enter config mode (WiFi, server, volume) |
| W/w | LED test fast/slow (onjuino)          |
| P   | Play 440Hz test tone (M5 Echo)        |

License

MIT
