Voi is a free and open-source backend for realtime voice agents. Check out the JS client.
- 02/10/2025 - Voi is open source 🎉
- 02/22/2025 - Added user images support
- 03/07/2025 - Added call mode 📞
- 9 GB+ of GPU memory. I recommend a GeForce RTX 3090 or better for a single worker.
- An 8-core CPU and 32 GB of RAM are enough.
- 10 GB of disk space.
- Ubuntu 22.04 or higher.
- Recent Nvidia drivers (tested on driver versions 545+).
- Docker with Nvidia runtime support.
- Caddy server.
- LiteLLM server.
Voi uses Docker Compose to run the server. Docker is used mostly as a runtime, while the source code, Python packages and model weights stay on the host file system. This is intentional and allows fast development.
There are two Docker environments for running the Voi server: production and development. They are basically the same, except that the production config starts the Voi server automatically and uses a different port.
Get the sources.
```bash
git clone https://github.com/alievk/voi-core.git
cd voi-core
```
Copy your `id_rsa.pub` into the `docker` folder to be able to ssh directly into the container.
```bash
cp ~/.ssh/id_rsa.pub docker/
```
Make a copy of `docker/docker-compose-dev.example.yml`.
```bash
cp docker/docker-compose-dev.example.yml docker/docker-compose-dev.yml
```
In `docker-compose-dev.yml`, edit the `environment`, `ports` and `volumes` sections as you need. If you need a Jupyter server, set your token in the `JUPYTER_TOKEN` variable; otherwise it won't run, for safety reasons.
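For orientation, the relevant sections might look roughly like this (a sketch only; the service name, ports and paths are placeholders, not the values from the actual example file):

```yaml
services:
  voi-dev:
    environment:
      - JUPYTER_TOKEN=your_jupyter_token   # leave unset to disable Jupyter
    ports:
      - "2222:22"       # ssh into the container
      - "8888:8888"     # Jupyter server
    volumes:
      - ..:/home/user/voi-core       # source code stays on the host
      - ./models:/home/user/models   # model weights stay on the host
```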
Build and run the development container.
```bash
cd docker
./up-dev.sh
```
When the container is created, you will see `voice-agent-core-container-dev` in the `docker ps` output; otherwise, check the `docker-compose` logs for errors. If there were no errors, the ssh daemon and the Jupyter server will be listening on the ports defined in `docker-compose-dev.yml`.
Connect to the container via ssh from, e.g., your laptop:
```bash
ssh user@<host> -p <port>
```
where `<host>` is the address of your host machine and `<port>` is the port specified in `docker-compose-dev.yml`. You will see a bash prompt like `user@8846788f5e9c:~$`.
My personal recommendation is to add an entry to your `~/.ssh/config` file to easily connect to the container:
```
Host voi_docker_dev
  Hostname your_host_address
  AddKeysToAgent yes
  UseKeychain yes
  User user
  Port port_from_the_above
```
Then you can get into the container with just:
```bash
ssh voi_docker_dev
```
In the container, install the Python dependencies:
```bash
cd voi-core
./install.sh
```
This step is intentionally not incorporated into the Dockerfile: during active development you often change the requirements and don't want to rebuild the container each time. You also won't need to repeat this step after the container restarts if the `.local` directory is mapped properly in `docker-compose-dev.yml`.
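For reference, such a mapping might look like this (a sketch; the host path and container home directory are placeholders, pick whatever matches your setup):

```yaml
volumes:
  - ~/voi-home/.local:/home/user/.local   # keeps pip-installed packages across container restarts
```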
The Voi server uses a secure websocket connection and relies on Caddy, which nicely manages SSL certificates for us. Follow the docs to install it.
On your host machine, make sure you have a proper config in the Caddyfile (usually `/etc/caddy/Caddyfile`):
```
your_domain.com:8774 {
    reverse_proxy localhost:8775
}
```
This will accept secure websocket connections on port 8774 and proxy them to the Voi server listening on port 8775.
LiteLLM allows calling all LLM APIs using the OpenAI format, which is neat.
If you run a Voi server in a country restricted by OpenAI (like Russia or China), you will need to run a remote LiteLLM server in the closest unrestricted country. You can do this for just $10/mo using AWS Lightsail. These are the minimal specs you need:
- 2 GB RAM, 2 vCPUs, 60 GB SSD
- Ubuntu
If you use AWS Lightsail, do not forget to add a custom TCP firewall rule for port 4000.
If you are not in the restricted region, you can run LiteLLM server locally on your host machine.
For the details of setting up LiteLLM, visit the repo; basically, you need to follow these steps.
Get the code.
```bash
git clone https://github.com/BerriAI/litellm
cd litellm
```
Add the master key (you can change it after setup).
```bash
echo 'LITELLM_MASTER_KEY="sk-1234"' > .env
source .env
```
Create the models configuration file.
```bash
vim litellm_config.yaml
```
Example configuration:
```yaml
model_list:
  - model_name: gemini-1.5-flash
    litellm_params:
      model: openai/gemini-1.5-flash
      api_key: your_googleapi_key
      api_base: https://generativelanguage.googleapis.com/v1beta/openai
  - model_name: meta-llama-3.1-70b-instruct-turbo
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
      api_key: your_deepinfra_key
      api_base: https://api.deepinfra.com/v1/openai
```
The model format is `{API format}/{model name}`, where `{API format}` is `openai`/`anthropic` and `{model name}` is the model name in the provider's format (`gpt-4o-mini` for OpenAI or `meta-llama/Meta-Llama-3.1-8B-Instruct` for DeepInfra). Look at `litellm_config.example.yaml` for more examples.
Start the LiteLLM server.
```bash
docker-compose up
```
Before running the Voi server, we need to set the environment variables and create the agents config.
My typical workflow is to run the development environment and ssh into the container using Cursor (Connect to Host -> voi_docker_dev). This way, I can edit the source code and run the scripts in one place.
Make a copy of `.env.example`.
```bash
# Assuming you are in the Voi root
cp .env.example .env
```
- `LITELLM_API_BASE` is the address of your LiteLLM server, like `http://111.1.1.1:4000` or `http://localhost:4000`.
- `LITELLM_API_KEY` is `LITELLM_MASTER_KEY` from the LiteLLM `.env` file.
- `TOKEN_SECRET_KEY` is a secret key for generating access tokens for the websocket endpoint. You should not reveal this key to a client.
- `API_KEY` is the HTTPS API access key. You need to share it with a client.
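A filled-in `.env` might look like this (all values are placeholders):

```bash
LITELLM_API_BASE=http://111.1.1.1:4000
LITELLM_API_KEY=sk-1234
TOKEN_SECRET_KEY=some-long-random-secret   # keep private
API_KEY=some-api-key                       # share with the client
```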
Voi relies on Whisper for speech transcription and adds realtime (transcribe-as-you-speak) processing on top of it. The model weights are downloaded automatically on the first launch.
Voi uses the xTTS-v2 model to generate speech. It gives the best tradeoff between quality and speed.
To test your agents, you can download the pre-trained multi-speaker model from HuggingFace. Download these files and put them in a directory of your choice (e.g., `models/xtts_v2`):
- `model.pth`
- `config.json`
- `vocab.json`
- `speakers_xtts.pth`
Then make a copy of `tts_models.example.json` and fix the paths in `multispeaker_original` so that they point to the model files above.
```bash
cp tts_models.example.json tts_models.json
```
Voi allows changing the voice tone of the agent dynamically during the conversation (like neutral or excited), but the pre-trained model that comes with xTTS doesn't support this. I have a custom pipeline for fine-tuning text-to-speech models on audio datasets and enabling dynamic tone changing, which I'm not open sourcing today. If you need a custom model, please DM me on X.
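After fixing the paths, the `multispeaker_original` entry in `tts_models.json` might look roughly like this (a sketch only; the actual key names come from `tts_models.example.json` and may differ):

```json
{
  "multispeaker_original": {
    "model_path": "models/xtts_v2/model.pth",
    "config_path": "models/xtts_v2/config.json",
    "vocab_path": "models/xtts_v2/vocab.json",
    "speakers_path": "models/xtts_v2/speakers_xtts.pth"
  }
}
```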
Agents are defined in JSON files in the `agents` directory. The control agents are defined in `agents/control_agents.json`. To add a new agent, simply create a JSON file with agent configurations in the `agents` directory and it will be loaded when the server starts. A client can also send an agent config when opening a new connection using the `agent_config` field.
Example agent configurations can be found in the voi-js-client repository.
Each agent configuration has the following structure:
- `llm_model`: The language model to use (must match models in `litellm_config.yaml`)
- `control_agent` (optional): Name of an agent that filters/controls the main agent's responses
- `voices`: Configuration for speech synthesis
  - `character`: Main voice settings
    - `model`: TTS model name from `tts_models.json`
    - `voice`: Voice identifier for the model
    - `speed` (optional): Speech speed multiplier
  - `narrator` (optional): Voice for narrative comments
    - Same settings as `character`, plus:
      - `leading_silence`: Silence before narration
      - `trailing_silence`: Silence after narration
- `system_prompt`: Array of strings defining the agent's personality and behavior. Can include special templates:
  - `{character_agent_message_format_voice_tone}`: Adds instructions for voice tone control (neutral/warm/excited/sad)
  - `{character_agent_message_format_narrator_comments}`: Adds instructions for the narrator comments format (actions in third person)
- `examples` (optional): List of conversation examples for few-shot learning
- `greetings`: Initial messages configuration
  - `choices`: List of greeting messages (can include pre-cached voice files)
  - `voice_tone`: Emotional tone for the greeting (must match tones in `tts_models.json`)
Special agents like `control_agent` can have additional fields:
- `model`: Processing type (e.g. `pattern_matching`)
- `denial_phrases`: Phrases to filter out
- `giveup_after`: Number of retries before giving up
- `giveup_response`: Fallback responses
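Putting the main fields together, a minimal agent entry might look like this (a sketch assembled from the description above; real, complete examples live in the voi-js-client repository, and the exact nesting and values may differ):

```json
{
  "my_agent": {
    "llm_model": "gemini-1.5-flash",
    "voices": {
      "character": {
        "model": "multispeaker_original",
        "voice": "some_voice_id",
        "speed": 1.0
      }
    },
    "system_prompt": [
      "You are a friendly assistant. Keep your answers short and conversational.",
      "{character_agent_message_format_voice_tone}"
    ],
    "greetings": {
      "voice_tone": "neutral",
      "choices": ["Hi! How can I help you today?"]
    }
  }
}
```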
Ssh into the container and run:
```bash
python3 ws_server.py
```
Note that the first time a client connects to an agent, it may take some time to load the text-to-speech models.
Clients and agents communicate via the websocket. A client must receive its personal token to access the websocket endpoint. This can be done in two ways:
- Through the API:
```bash
curl -I -X POST "https://your_host_address:port/integrations/your_app" \
  -H "API-Key: your_api_key"
```
where
- `your_host_address:port` is the address of the host running the Voi server and `port` is the port where the server is listening.
- `your_app` is the name of your app, like `relationships_coach`.
- `your_api_key` is `API_KEY` from `.env`.
Note that this will generate a token which will expire after 1 day.
- Manually:
```bash
python3 token_generator.py your_app --expire n_days
```
Here `n_days` is the number of days after which the token will expire.
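Conceptually, a signed, expiring access token works like the sketch below. This is only an illustration of the idea using PyJWT, not Voi's actual implementation; use `token_generator.py` or the API above to create real tokens.

```python
# Illustrative only: how a secret-key-signed, expiring token can be issued and verified.
# NOT Voi's implementation; token_generator.py is the supported way to create tokens.
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

TOKEN_SECRET_KEY = "some-long-random-secret"  # placeholder; never share with clients

def issue_token(app_name: str, expire_days: int) -> str:
    payload = {
        "app": app_name,
        "exp": datetime.now(timezone.utc) + timedelta(days=expire_days),
    }
    return jwt.encode(payload, TOKEN_SECRET_KEY, algorithm="HS256")

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on bad tokens.
    return jwt.decode(token, TOKEN_SECRET_KEY, algorithms=["HS256"])

if __name__ == "__main__":
    token = issue_token("your_app", expire_days=1)
    print(verify_token(token))
```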
- Make it open source
- Support for user images
- Incoming calls
- Context gathering: understand the user's problem
- Function calling: add external actuators like DB queries
- Turn detection: detect the moment when the agent can start to speak
- Add a call-center-like voice
- WebRTC support
- VoIP support
- Outgoing calls
Realtime conversation with a human is a really complex task, as it requires empathy, competence and speed from the agent. If you lack a single one of these, your agent is useless. That's why making a good voice agent is not just stacking a bunch of APIs together. You have to develop it very carefully: make a small step, then test, make a small step, then test...
Two main factors enabled me to run this project. First, the emergence of smart, fast and cheap LLMs, which are necessary for agent intelligence. Second, the advancement of code copilots. Though I have a deep learning background, building a good voice agent requires lots of topics beyond my competence.
While open sourcing Voi, I realized many people could use it to learn software engineering. Yes, this is still relevant, because this project is basically many pieces of AI-generated code carefully stitched together by a human engineer.
You are welcome to open PRs with bug fixes, new features and documentation improvements.
Or you can just buy me a coffee and I will convert it to code!
Voi uses the MIT license, which basically means you can do anything with it, free of charge. However, the dependencies may have different licenses. Check them if you care.