Nanovoice

Nanovoice is a realtime multi-modal conversational voice AI.

Local Setup

1. Clone the repo

git clone https://github.com/theseyan/nanovoice.git
cd nanovoice

2. Set up the backend

Go to the api folder and install dependencies:

cd api
bun install

Create api/.env:

SONIOX_API_KEY=your_soniox_api_key
GROQ_API_KEY=your_groq_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key
PORT=3001

Run the backend:

bun run index.ts

The API server will run on http://localhost:3001 and the WebSocket endpoint will be ws://localhost:3001/ws.

3. Set up the frontend

In a new terminal:

cd app
bun install

Optional: create app/.env if your backend is not running on the default local URL:

VITE_WS_URL=ws://localhost:3001/ws
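
The fallback behavior can be sketched as follows. `resolveWsUrl` is a hypothetical helper, not code from the repo; in the Vite app the environment value would come from `import.meta.env.VITE_WS_URL`:

```typescript
// Hypothetical helper: use the WebSocket URL from the environment if
// one is set, otherwise fall back to the default local backend.
function resolveWsUrl(envUrl?: string): string {
  return envUrl && envUrl.trim().length > 0
    ? envUrl
    : "ws://localhost:3001/ws";
}
```

With no `VITE_WS_URL` set, the app connects to `ws://localhost:3001/ws`; setting the variable overrides the target, for example when the backend runs on another host.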

Run the frontend:

bun run dev

Open the local Vite URL shown in the terminal, usually:

http://localhost:5173

Using Nanovoice

  1. Start the backend.
  2. Start the frontend.
  3. Open the app in your browser.
  4. Click the power button to connect.
  5. Allow microphone access.
  6. Click the mic button to start talking.
  7. Optionally enable the camera button for vision context.

Notes

  • Vision requires a browser with WebGPU support.
  • The assistant works without vision in normal voice-only mode.
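
If you want to check WebGPU support programmatically, a feature check along these lines works; `navigator.gpu` is the standard WebGPU entry point, and the `globalThis` guard keeps the sketch safe outside the browser:

```typescript
// WebGPU feature detection: navigator.gpu is only defined in browsers
// that support WebGPU. Guarding through globalThis avoids a reference
// error in non-browser environments.
const nav = (globalThis as any).navigator;
const hasWebGPU: boolean = !!nav && "gpu" in nav;
```

An app could use a check like this to decide whether to show the camera button at all.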

FAQ

What problem does Nanovoice solve?

Nanovoice is built to make voice AI feel more natural by improving response latency, speech expressiveness, interruption handling, memory, vision awareness, and time awareness.

Why is it called multi-modal?

Because it can use more than one input modality:

  • voice
  • chronological "sense of time" in the conversation
  • optional camera vision

How is Nanovoice different from a normal voice bot?

It is designed for realtime conversation. It supports life-like interruption while the assistant is speaking, can use visual context, and remembers useful long-term facts.

What does “time awareness” mean here?

The user's local time is included in the context sent to the assistant for every message, so the assistant always has a sense of the natural flow of time within the conversation.
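
One simple way to do this is to prefix each user message with a timestamp before it reaches the model. This is an illustrative sketch; the exact format Nanovoice uses is not shown here:

```typescript
// Hypothetical sketch: prepend the user's local time to each message so
// the model can reason about when things were said.
function withLocalTime(userText: string, now: Date = new Date()): string {
  return `[local time: ${now.toISOString()}] ${userText}`;
}
```

A message like "good morning" would arrive at the model as `[local time: …] good morning`, letting it notice gaps and time of day without any extra tooling.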

What does the vision feature do?

If camera mode is enabled, the app captures a frame locally in the browser, generates a caption for it, and uses that caption as context for the assistant’s reply.

Does Nanovoice store raw camera images?

No. The current design captions the image locally in the browser and sends only the text caption to the backend.
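
The shape of what the backend receives can be illustrated with a sketch. The field names here are hypothetical, not the app's actual wire format; the point is that only text crosses the network:

```typescript
// Hypothetical vision-context message: only the text caption travels to
// the backend. The captured frame never leaves the browser.
interface VisionContext {
  type: "vision_context";
  caption: string;
  capturedAt: string; // ISO timestamp
}

function buildVisionContext(
  caption: string,
  now: Date = new Date()
): VisionContext {
  return { type: "vision_context", caption, capturedAt: now.toISOString() };
}
```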

How does memory work?

Nanovoice stores durable facts, preferences, relationships, and project-related information in a local memory file so relevant details can be recalled in later conversations.
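
A minimal sketch of such a memory layer, assuming a naive substring lookup (the repo's actual storage format and retrieval logic may differ, and the real app persists entries to a file rather than keeping them in memory):

```typescript
// Hypothetical in-memory sketch of a local memory layer.
type MemoryEntry = { fact: string; addedAt: string };

class LocalMemory {
  private entries: MemoryEntry[] = [];

  remember(fact: string, now: Date = new Date()): void {
    this.entries.push({ fact, addedAt: now.toISOString() });
  }

  // Naive keyword recall; a real implementation might rank results or
  // use embeddings instead of substring matching.
  recall(query: string): string[] {
    const q = query.toLowerCase();
    return this.entries
      .filter((e) => e.fact.toLowerCase().includes(q))
      .map((e) => e.fact);
  }
}
```

Keeping this lookup local means recall adds no network round-trip to the voice loop, which is the latency argument made below.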

Why use a local memory layer instead of an external memory provider?

The goal is to keep latency low and the system simple. External memory services can add extra network delay, which is not ideal for realtime voice interaction.

Can the user interrupt the assistant while it is talking?

Yes. If the user starts speaking while the assistant is replying, playback is stopped and the system switches back to listening.
