Nanovoice is a realtime multi-modal conversational voice AI.
```sh
git clone https://github.com/theseyan/nanovoice.git
cd nanovoice
```

Go to the API folder and install dependencies:

```sh
cd api
bun install
```

Create `api/.env`:
```
SONIOX_API_KEY=your_soniox_api_key
GROQ_API_KEY=your_groq_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key
PORT=3001
```

Run the backend:

```sh
bun run index.ts
```

The API server runs on http://localhost:3001, and the WebSocket endpoint is ws://localhost:3001/ws.
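When pointing a client at the backend, it can be handy to derive the WebSocket endpoint from the HTTP base URL. A small sketch (the helper name is mine, not part of Nanovoice; the `/ws` path matches the endpoint above):

```typescript
// Hypothetical helper: derive the WebSocket endpoint from the
// HTTP base URL the backend prints on startup.
function wsUrlFrom(httpBase: string): string {
  const u = new URL(httpBase);
  // http -> ws, https -> wss; append the "/ws" endpoint path.
  u.protocol = u.protocol === "https:" ? "wss:" : "ws:";
  u.pathname = "/ws";
  return u.toString();
}

console.log(wsUrlFrom("http://localhost:3001")); // ws://localhost:3001/ws
```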
In a new terminal:
```sh
cd app
bun install
```

Optional: create `app/.env` if your backend is not running at the default local URL:

```
VITE_WS_URL=ws://localhost:3001/ws
```

Run the frontend:
```sh
npm run dev
```

Open the local Vite URL shown in the terminal, usually http://localhost:5173.
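The optional `VITE_WS_URL` override could be resolved in the frontend roughly like this (a sketch, assuming the usual Vite convention; in the real app the env object would be `import.meta.env`):

```typescript
// Hypothetical sketch: use VITE_WS_URL when set, otherwise fall back
// to the default local backend endpoint.
function resolveWsUrl(env: Record<string, string | undefined>): string {
  return env.VITE_WS_URL ?? "ws://localhost:3001/ws";
}

console.log(resolveWsUrl({})); // ws://localhost:3001/ws
```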
- Start the backend.
- Start the frontend.
- Open the app in your browser.
- Click the power button to connect.
- Allow microphone access.
- Click the mic button to start talking.
- Optionally enable the camera button for vision context.
- Vision requires a browser with WebGPU support.
- The assistant works without vision in normal voice-only mode.
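Since vision needs WebGPU, a frontend would typically feature-detect it before enabling the camera button. A minimal sketch (the function name is mine; in the browser you would pass `navigator`):

```typescript
// Hypothetical sketch: gate the camera/vision button on WebGPU support.
// navigator.gpu is only defined in WebGPU-capable browsers.
function supportsWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu != null;
}
```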
Nanovoice is built to make voice AI feel more natural by improving response latency, speech expressiveness, interruption handling, memory, vision awareness, and time awareness.
Nanovoice is multi-modal because it can use more than one input modality:
- voice
- chronological "sense of time" in the conversation
- optional camera vision
It is designed for realtime conversation: it supports lifelike interruption while the assistant is speaking, can use visual context, and remembers useful long-term facts.
The user's local time is included in the context sent to the assistant for every message, so the assistant always has a sense of the natural flow of time within the conversation.
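One way to attach local time to each outgoing message is sketched below. The interface and function names are assumptions for illustration, not Nanovoice's actual wire format:

```typescript
// Hypothetical sketch: stamp every message with the user's local time
// so the model keeps a sense of conversational time.
interface TimedMessage {
  text: string;
  localTime: string; // human-readable local time, e.g. "Tuesday, 3:04 PM"
}

function withLocalTime(text: string, now: Date = new Date()): TimedMessage {
  const localTime = now.toLocaleString("en-US", {
    weekday: "long",
    hour: "numeric",
    minute: "2-digit",
  });
  return { text, localTime };
}
```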
If camera mode is enabled, the app captures a frame locally in the browser, generates a caption for it, and uses that caption as context for the assistant’s reply.
Camera images never leave the browser: the current design captions the frame locally and sends only the text caption to the backend.
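A payload shape illustrating that privacy property might look like the following. The type and field names are hypothetical, not Nanovoice's actual protocol:

```typescript
// Hypothetical payload: only a text caption crosses the wire, never pixels.
interface VisionContext {
  type: "vision_caption";
  caption: string;    // generated locally in the browser
  capturedAt: number; // ms since epoch when the frame was captured
}

function makeVisionContext(caption: string, capturedAt: number): VisionContext {
  return { type: "vision_caption", caption, capturedAt };
}
```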
Nanovoice stores durable facts, preferences, relationships, and project-related information in a local memory file so relevant details can be recalled in later conversations.
The goal is to keep latency low and the system simple. External memory services can add extra network delay, which is not ideal for realtime voice interaction.
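A file-backed memory store of this kind can be sketched in a few lines. The path, record shape, and function names below are assumptions, not Nanovoice's actual format:

```typescript
import * as fs from "node:fs";

// Hedged sketch: durable facts kept in a local JSON file, so recall is
// plain disk I/O with no external-service network round-trip.
interface MemoryFact {
  fact: string;
  savedAt: number; // ms since epoch
}

function saveFact(path: string, fact: string): void {
  const facts: MemoryFact[] = fs.existsSync(path)
    ? JSON.parse(fs.readFileSync(path, "utf8"))
    : [];
  facts.push({ fact, savedAt: Date.now() });
  fs.writeFileSync(path, JSON.stringify(facts, null, 2));
}

function recallFacts(path: string): string[] {
  if (!fs.existsSync(path)) return [];
  const facts: MemoryFact[] = JSON.parse(fs.readFileSync(path, "utf8"));
  return facts.map((f) => f.fact);
}
```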
Interruption is supported: if the user starts speaking while the assistant is replying, playback stops and the system switches back to listening.