Nanovoice is a realtime multi-modal conversational voice AI.
```sh
git clone https://github.com/theseyan/nanovoice.git
cd nanovoice
```

Go to the API folder and install dependencies:

```sh
cd api
bun install
```

Create `api/.env`:
```
SONIOX_API_KEY=your_soniox_api_key
GROQ_API_KEY=your_groq_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key
PORT=3001
```

Run the backend:

```sh
bun run index.ts
```

The API server runs on http://localhost:3001, and the WebSocket endpoint is ws://localhost:3001/ws.
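When pointing a client at the backend, it can be handy to derive the WebSocket endpoint from the HTTP base URL. A small sketch (the helper name is mine, not part of Nanovoice; the `/ws` path matches the endpoint above):

```typescript
// Hypothetical helper: derive the WebSocket endpoint from the
// HTTP base URL the backend prints on startup.
function wsUrlFrom(httpBase: string): string {
  const u = new URL(httpBase);
  // http -> ws, https -> wss; append the "/ws" endpoint path.
  u.protocol = u.protocol === "https:" ? "wss:" : "ws:";
  u.pathname = "/ws";
  return u.toString();
}

console.log(wsUrlFrom("http://localhost:3001")); // ws://localhost:3001/ws
```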
In a new terminal:
```sh
cd app
bun install
```

Optional: create `app/.env` if your backend is not running at the default local URL:

```
VITE_WS_URL=ws://localhost:3001/ws
```

Run the frontend:
```sh
npm run dev
```

Open the local Vite URL shown in the terminal, usually http://localhost:5173.
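The optional `VITE_WS_URL` override could be resolved in the frontend roughly like this (a sketch, assuming the usual Vite convention; in the real app the env object would be `import.meta.env`):

```typescript
// Hypothetical sketch: use VITE_WS_URL when set, otherwise fall back
// to the default local backend endpoint.
function resolveWsUrl(env: Record<string, string | undefined>): string {
  return env.VITE_WS_URL ?? "ws://localhost:3001/ws";
}

console.log(resolveWsUrl({})); // ws://localhost:3001/ws
```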
- Start the backend.
- Start the frontend.
- Open the app in your browser.
- Click the power button to connect.
- Allow microphone access.
- Click the mic button to start talking.
- Optionally enable the camera button for vision context.
- Vision requires a browser with WebGPU support.
- The assistant works without vision in normal voice-only mode.
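Since vision needs WebGPU, a frontend would typically feature-detect it before enabling the camera button. A minimal sketch (the function name is mine; in the browser you would pass `navigator`):

```typescript
// Hypothetical sketch: gate the camera/vision button on WebGPU support.
// navigator.gpu is only defined in WebGPU-capable browsers.
function supportsWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu != null;
}
```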
Nanovoice is built to make voice AI feel more natural by improving response latency, speech expressiveness, interruption handling, memory, vision awareness, and time awareness.
Nanovoice is multi-modal because it can use more than one input modality:
- voice
- chronological "sense of time" in the conversation
- optional camera vision
It is designed for realtime conversation: it supports lifelike interruption while the assistant is speaking, can use visual context, and remembers useful long-term facts.
The user's local time is included in the context sent to the assistant for every message, so the assistant always has a sense of the natural flow of time within the conversation.
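One way to attach local time to each outgoing message is sketched below. The interface and function names are assumptions for illustration, not Nanovoice's actual wire format:

```typescript
// Hypothetical sketch: stamp every message with the user's local time
// so the model keeps a sense of conversational time.
interface TimedMessage {
  text: string;
  localTime: string; // human-readable local time, e.g. "Tuesday, 3:04 PM"
}

function withLocalTime(text: string, now: Date = new Date()): TimedMessage {
  const localTime = now.toLocaleString("en-US", {
    weekday: "long",
    hour: "numeric",
    minute: "2-digit",
  });
  return { text, localTime };
}
```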
If camera mode is enabled, the app captures a frame locally in the browser, generates a caption for it, and uses that caption as context for the assistant’s reply.
Camera images never leave the browser: the current design captions the frame locally and sends only the text caption to the backend.
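A payload shape illustrating that privacy property might look like the following. The type and field names are hypothetical, not Nanovoice's actual protocol:

```typescript
// Hypothetical payload: only a text caption crosses the wire, never pixels.
interface VisionContext {
  type: "vision_caption";
  caption: string;    // generated locally in the browser
  capturedAt: number; // ms since epoch when the frame was captured
}

function makeVisionContext(caption: string, capturedAt: number): VisionContext {
  return { type: "vision_caption", caption, capturedAt };
}
```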
Nanovoice stores durable facts, preferences, relationships, and project-related information in a local memory file so relevant details can be recalled in later conversations.
The goal is to keep latency low and the system simple. External memory services can add extra network delay, which is not ideal for realtime voice interaction.
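A file-backed memory store of this kind can be sketched in a few lines. The path, record shape, and function names below are assumptions, not Nanovoice's actual format:

```typescript
import * as fs from "node:fs";

// Hedged sketch: durable facts kept in a local JSON file, so recall is
// plain disk I/O with no external-service network round-trip.
interface MemoryFact {
  fact: string;
  savedAt: number; // ms since epoch
}

function saveFact(path: string, fact: string): void {
  const facts: MemoryFact[] = fs.existsSync(path)
    ? JSON.parse(fs.readFileSync(path, "utf8"))
    : [];
  facts.push({ fact, savedAt: Date.now() });
  fs.writeFileSync(path, JSON.stringify(facts, null, 2));
}

function recallFacts(path: string): string[] {
  if (!fs.existsSync(path)) return [];
  const facts: MemoryFact[] = JSON.parse(fs.readFileSync(path, "utf8"));
  return facts.map((f) => f.fact);
}
```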
Interruption is supported: if the user starts speaking while the assistant is replying, playback stops and the system switches back to listening.