A hands-free AI voice assistant that runs entirely in your browser. Speech recognition, LLM, and text-to-speech all run locally using WebGPU - no API keys, no server, no data leaves your device. Just talk naturally and the AI responds.
Try it now: HuggingFace Space
Everything runs in your browser:
- Speech-to-Text: Whisper model via WebGPU/WASM
- Voice Activity Detection: Silero VAD detects when you're speaking
- LLM: Qwen 1.5B via WebLLM (easily swappable - see below)
- Text-to-Speech: Supertonic TTS with 10 natural voices
No audio leaves your device. No API keys needed. Just open and talk.
The built-in LLM is just a demo. The real value is the voice pipeline - STT, VAD, and TTS all wired up and working. Rip out the tiny in-browser model and point it at any LLM you want:
- Claude, GPT-4, Gemini - via API routes
- Ollama, LM Studio - local inference servers
- Any OpenAI-compatible endpoint
It's ~10 lines of code to swap. See Using a Different LLM below.
```bash
# Install dependencies
pnpm install

# Run development server
pnpm dev
```

Open http://localhost:3000 in Chrome or Edge.
| Asset | Size | When | Cached |
|---|---|---|---|
| Voice embeddings | ~500KB | Included in repo | ✓ Already local |
| Whisper STT model | ~150MB | First use | ✓ IndexedDB |
| Silero VAD model | ~2MB | First use | ✓ IndexedDB |
| Qwen 1.5B LLM | ~900MB | First use | ✓ IndexedDB |
| Supertonic TTS | ~50MB | First use | ✓ IndexedDB |
First load downloads ~1GB of models from HuggingFace CDN. After that, everything runs offline.
- Browser: Chrome 113+ or Edge 113+ (WebGPU required)
- RAM: ~4GB available for models
- Microphone: Required for voice input
Falls back to WASM if WebGPU unavailable (slower but works everywhere).
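The backend choice can be made with a simple feature probe before any models load. A minimal sketch, assuming detection happens at startup (the app's actual logic may differ):

```ts
// Sketch: choose an inference backend based on WebGPU support.
// Assumes @webgpu/types is installed for the navigator.gpu typing.
async function pickBackend(): Promise<"webgpu" | "wasm"> {
  // navigator.gpu only exists in WebGPU-capable browsers (Chrome/Edge 113+).
  if (!("gpu" in navigator)) return "wasm";
  // An adapter can still be unavailable (e.g. a blocklisted GPU).
  const adapter = await navigator.gpu.requestAdapter();
  return adapter ? "webgpu" : "wasm";
}
```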
```
┌──────────────────────────────────────────────────────────────┐
│                           Browser                            │
│                                                              │
│  Microphone                                                  │
│      |                                                       │
│      v                                                       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │  Silero  │ > │ Whisper  │ > │  WebLLM  │ > │Supertonic│   │
│  │   VAD    │   │   STT    │   │  (Qwen)  │   │   TTS    │   │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   │
│       |              |              |              |         │
│    Detects      Transcribes     Generates       Speaks      │
│    speech         to text       response       response     │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
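In code, the four stages reduce to a chain of awaits. A simplified sketch of the hand-off; the helper names are hypothetical, since the real app wires these stages through a web worker (`public/stt-worker-esm.js`) and React hooks:

```ts
// Hypothetical stage helpers, named for illustration only.
declare function transcribe(audio: Float32Array): Promise<string>; // Whisper STT
declare function generateReply(text: string): Promise<string>;     // WebLLM (Qwen)
declare function speak(text: string): Promise<void>;               // Supertonic TTS

// Called once Silero VAD detects the end of an utterance.
async function onSpeechSegment(audio: Float32Array): Promise<void> {
  const userText = await transcribe(audio);    // speech -> text
  const reply = await generateReply(userText); // text -> response
  await speak(reply);                          // response -> audio out
}
```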
```
src/
├── app/
│   ├── page.tsx          # Main voice chat UI
│   ├── layout.tsx        # App layout
│   └── globals.css       # Styles
├── components/ui/        # UI components
├── hooks/
│   ├── use-webllm.ts     # WebLLM integration
│   └── use-tts.ts        # TTS integration
└── lib/
    ├── tts.ts            # TTS pipeline
    └── splitter.ts       # Text chunking
public/
├── stt-worker-esm.js     # Whisper + VAD worker
├── vad-processor.js      # Audio worklet
└── voices/               # TTS voice embeddings (bundled)
```
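`vad-processor.js` is an AudioWorklet processor that streams microphone frames to the VAD. Hooking it up looks roughly like this; the registered processor name (`"vad-processor"`) and the 16 kHz sample rate are assumptions:

```ts
// Sketch: route the microphone through the worklet in public/.
async function startMicPipeline(): Promise<AudioWorkletNode> {
  // Whisper and Silero VAD both operate on 16 kHz mono audio.
  const ctx = new AudioContext({ sampleRate: 16000 });
  await ctx.audioWorklet.addModule("/vad-processor.js");
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);
  // "vad-processor" must match the name passed to registerProcessor()
  // inside vad-processor.js - an assumption here.
  const vadNode = new AudioWorkletNode(ctx, "vad-processor");
  source.connect(vadNode); // mic frames now flow into the worklet
  return vadNode;
}
```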
This demo uses WebLLM for fully local operation. To use an external LLM instead:
- Create an API route (e.g., `src/app/api/chat/route.ts`)
- In `page.tsx`, find `handleLLMResponse()` and replace the WebLLM call:
```ts
// Instead of webllm.chat(), call your API:
const response = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: conversationHistory }),
});
const data = await response.json();
return data.response;
```
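The matching server side is a standard Next.js route handler. A sketch that proxies to any OpenAI-compatible endpoint; the upstream URL, model name, and `OPENAI_API_KEY` env var are placeholders to swap for Ollama, LM Studio, or a hosted API:

```ts
// src/app/api/chat/route.ts - sketch of an OpenAI-compatible proxy.
export async function POST(req: Request): Promise<Response> {
  const { messages } = await req.json();
  // Placeholder endpoint/model: point at Ollama, LM Studio, etc. instead.
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages }),
  });
  const data = await upstream.json();
  // Match the { response } shape the client snippet above expects.
  return Response.json({ response: data.choices[0].message.content });
}
```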
- Framework: Next.js 16, React 19
- STT: Whisper via @huggingface/transformers
- VAD: Silero VAD via ONNX Runtime
- LLM: Qwen 1.5B via @mlc-ai/web-llm
- TTS: Supertonic via @huggingface/transformers
- Styling: Tailwind CSS v4
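For a sense of how little glue the STT side needs, here is a sketch of loading Whisper through @huggingface/transformers; the model id and options are assumptions, not necessarily what this repo pins:

```ts
import { pipeline } from "@huggingface/transformers";

// Sketch: a WebGPU-backed Whisper transcriber (model id is an assumption).
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-base",
  { device: "webgpu" } // pass "wasm" to force the CPU fallback
);

// audioSamples: 16 kHz mono PCM, e.g. from the VAD worklet.
declare const audioSamples: Float32Array;
const result = (await transcriber(audioSamples)) as { text: string };
console.log(result.text);
```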
10 voices bundled (5 female, 5 male):
- F1: Calm, steady
- F2: Bright, cheerful
- F3: Professional
- F4: Confident
- F5: Gentle
- M1: Lively, upbeat
- M2: Deep, calm
- M3: Authoritative
- M4: Soft, friendly
- M5: Warm
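Switching voices amounts to loading a different speaker embedding from `public/voices/`. A sketch of the idea; the file naming scheme here is hypothetical (check `src/lib/tts.ts` for the real format):

```ts
// Hypothetical loader - the actual file names and binary layout
// are defined by the bundled embeddings and src/lib/tts.ts.
async function loadVoiceEmbedding(voiceId: string): Promise<Float32Array> {
  const res = await fetch(`/voices/${voiceId}.bin`); // e.g. "F3" for Professional
  if (!res.ok) throw new Error(`Voice ${voiceId} not found`);
  return new Float32Array(await res.arrayBuffer());
}
```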
MIT License - see LICENSE
- Whisper - OpenAI
- Silero VAD - Silero Team
- WebLLM - MLC AI
- Transformers.js - Hugging Face
- Supertonic TTS - Supertone