Open-source Python framework for building production-ready, real-time voice and multimodal AI agents.
The VideoSDK AI Agents framework is a Python SDK for building AI agents that join VideoSDK rooms as real-time participants. It connects your agent worker, AI models, and user devices into a single low-latency pipeline, handling audio streaming, turn detection, interruptions, and media routing automatically so you can focus on agent logic.
The framework manages the full agent lifecycle: joining a room and processing live audio, running STT → LLM → TTS pipelines or connecting to unified realtime models, handling turn detection, VAD, and interruptions, and tearing down cleanly.
v1.0.0 introduces a unified Pipeline class that replaces the previous CascadingPipeline and RealtimePipeline. Pass in any combination of components (STT, LLM, TTS, VAD, turn detector, avatar) and the framework wires them together, selecting the optimal execution mode automatically. A decorator-based hooks system (`@pipeline.on(...)`) lets you intercept and transform data at any stage without subclassing.
- **Agent with Cascade Mode**: Build an AI Voice Agent using Cascade Mode (STT → LLM → TTS).
- **Agent with Realtime Mode**: Build an AI Voice Agent using a unified Realtime model (e.g. Gemini Live).
- **Agent Documentation**: The official VideoSDK Agents documentation.
- **SDK Reference**: Reference docs for the Agents framework.
| # | Feature | Description |
|---|---|---|
| 1 | Real-time Communication (Audio/Video) | Agents can listen, speak, and interact live in meetings. |
| 2 | SIP & Telephony Integration | Seamlessly connect agents to phone systems via SIP for call handling, routing, and PSTN access. |
| 3 | Virtual Avatars | Build or plug in any avatar provider; the framework handles audio routing, sync, and teardown automatically. |
| 4 | Multi-Model Support | Integrate with OpenAI, Gemini, AWS Nova Sonic, Anthropic, and more. |
| 5 | Cascade Mode | Compose any STT → LLM → TTS chain across providers for full control and flexibility. |
| 6 | Realtime Mode | Use unified realtime models (OpenAI Realtime, AWS Nova Sonic, Gemini Live) for the lowest latency. |
| 7 | Hybrid Mode | Mix cascade and realtime components: custom STT with a realtime model, or a realtime model with custom TTS. |
| 8 | Pipeline Hooks | Intercept and transform data at any stage (STT, LLM, TTS, turns) using `@pipeline.on(...)`. |
| 9 | Function Tools | Extend agent capabilities with any external tool or API call. |
| 10 | MCP Integration | Connect agents to external data sources and tools using the Model Context Protocol. |
| 11 | A2A Protocol | Reliable agent-to-agent routing with correlation-based request tracking. |
| 12 | LangChain & LangGraph | Plug in any LangChain BaseChatModel or LangGraph StateGraph as the agent's LLM. |
| 13 | Observability | Built-in metrics, OpenTelemetry tracing, and structured logging per component. |
> [!IMPORTANT]
> **Star VideoSDK Repositories** ⭐
> Get instant notifications for new releases and updates. Your support helps us grow and improve VideoSDK!
All agents are built around a single Pipeline class. Pass in your components; the SDK picks the right execution mode automatically.
Mix and match any provider for each stage. Best when you need custom STT, specific LLM behaviour, or a particular TTS voice.
```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=CartesiaTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
    )
    session = AgentSession(agent=MyAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

Use a single realtime model for the entire voice pipeline. Best for sub-500ms response latency.
```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        llm=GeminiRealtime(
            model="gemini-3.1-flash-live-preview",
            config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"]),
        )
    )
    session = AgentSession(agent=MyAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

Use an external STT with a realtime LLM, or a realtime model with a custom TTS:
```python
# External STT → Realtime LLM
pipeline = Pipeline(stt=DeepgramSTT(), llm=OpenAIRealtime(...))

# Realtime LLM → External TTS
pipeline = Pipeline(llm=OpenAIRealtime(...), tts=ElevenLabsTTS(...))
```

```python
@pipeline.on("stt")
async def clean_transcript(text: str) -> str:
    return text.strip()

@pipeline.on("llm")
async def route_llm(messages):
    if "transfer" in messages[-1].content:
        yield "Transferring you now."  # bypass the LLM entirely

@pipeline.on("tts")
async def fix_pronunciation(text: str) -> str:
    return text.replace("VideoSDK", "Video S D K")

@pipeline.on("user_turn_start")
async def on_user_starts():
    print("User is speaking...")
```

Available hook points: `stt` · `tts` · `llm` · `vision_frame` · `user_turn_start` · `user_turn_end` · `agent_turn_start` · `agent_turn_end`
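Because hook bodies are plain async functions, they can be developed and unit-tested in isolation before being attached to a pipeline. A minimal sketch of a reusable TTS-style normalizer; the pronunciation table is illustrative and not part of the SDK:

```python
import asyncio

# Illustrative pronunciation table -- extend it for your own vocabulary.
PRONUNCIATIONS = {
    "VideoSDK": "Video S D K",
    "SQL": "sequel",
}

async def normalize_for_tts(text: str) -> str:
    """Apply simple replacements before text reaches the TTS engine."""
    for written, spoken in PRONUNCIATIONS.items():
        text = text.replace(written, spoken)
    return text.strip()

print(asyncio.run(normalize_for_tts("  VideoSDK uses SQL.  ")))
# -> Video S D K uses sequel.
```

Once it behaves as expected, attach it with `@pipeline.on("tts")`.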
Before you begin, ensure you have:
- A VideoSDK authentication token (generate from app.videosdk.live)
- A VideoSDK meeting ID (you can generate one using the Create Room API or through the VideoSDK dashboard)
- Python 3.12 or higher
- Third-Party API Keys:
  - API keys for the services you intend to use (e.g., OpenAI for LLM/STT/TTS, ElevenLabs for TTS, Google for Gemini, etc.).
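A common way to provide these is a `.env` file in the project root. The variable names below are conventional examples: only `VIDEOSDK_AUTH_TOKEN` is referenced elsewhere in this guide, and the provider key names depend on the plugins you install:

```shell
# VideoSDK credentials
VIDEOSDK_AUTH_TOKEN=your_videosdk_token

# Provider keys -- only needed for the plugins you actually use
OPENAI_API_KEY=sk-...
DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
```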
UV is a fast Python package manager that handles virtual environments and dependency management automatically.
If you don't have UV installed, see the UV installation guide.
- Install the core VideoSDK AI Agents package:

  ```bash
  uv add videosdk-agents
  ```

- Install optional plugins:

  ```bash
  uv add videosdk-plugins-openai
  uv add videosdk-plugins-deepgram
  ```

- Run your agent:

  ```bash
  uv run python main.py
  ```
- Create and activate a virtual environment with Python 3.12 or higher.

  macOS / Linux:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

  Windows:

  ```bash
  python -m venv venv
  venv\Scripts\activate
  ```

- Install the core VideoSDK AI Agents package:

  ```bash
  pip install videosdk-agents
  ```

- Install optional plugins. Plugins integrate different providers for Realtime, STT, LLM, TTS, and more; install what your use case needs:

  ```bash
  # Example: install the Turn Detector plugin
  pip install videosdk-plugins-turn-detector
  ```

  Supported plugins (Realtime, LLM, STT, TTS, VAD, Avatar, SIP) are listed in the Supported Libraries section below.
To set up the project locally, clone the repo and install all packages (core + all plugins) as editable installs:

Using UV (recommended):

```bash
git clone https://github.com/videosdk-live/agents.git
cd agents
uv sync
uv run python examples/cascade_basic.py
```

Using pip:

```bash
git clone https://github.com/videosdk-live/agents.git
cd agents
bash setup.sh
source venv/bin/activate
python examples/cascade_basic.py
```

Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:
```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

For more details on the Create Room API, refer to the VideoSDK documentation.
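The same call can be made from Python. Below is a small sketch using only the standard library; the `roomId` response field is an assumption about the API's response shape, so verify it against the Create Room API docs:

```python
import json
import urllib.request

CREATE_ROOM_URL = "https://api.videosdk.live/v2/rooms"

def build_room_request(auth_token: str) -> urllib.request.Request:
    """Build the POST request for the Create Room endpoint."""
    return urllib.request.Request(
        CREATE_ROOM_URL,
        method="POST",
        headers={"Authorization": auth_token, "Content-Type": "application/json"},
    )

def parse_room_id(response_body: str) -> str:
    """Pull the meeting ID out of the JSON response (assumed `roomId` field)."""
    return json.loads(response_body)["roomId"]

# To actually create a room:
#   with urllib.request.urlopen(build_room_request("YOUR_JWT_TOKEN_HERE")) as resp:
#       meeting_id = parse_room_id(resp.read().decode())
```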
Now that you've installed the necessary packages, you're ready to build!
First, let's create a custom voice agent by inheriting from the base Agent class:
```python
from videosdk.agents import Agent, function_tool

# External tool (defined outside the class; see the Function Tools section):
# async def get_weather(latitude: str, longitude: str): ...

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
            tools=[get_weather],  # register any external tool defined outside this class
        )

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting."""
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        """Called when the agent exits the meeting."""
        await self.session.say("Goodbye!")
```

This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- An entry message when joining a meeting
- An exit message, spoken when the agent leaves the meeting
Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:
- External tools: defined as standalone functions outside the agent class and registered via the `tools` argument in the agent's constructor.
- Internal tools: defined as methods inside the agent class and decorated with `@function_tool`.

Below is an example of both:
```python
import aiohttp

# External function tool
@function_tool
async def get_weather(latitude: str, longitude: str):
    """Fetch the current temperature for the given coordinates."""
    print(f"Getting weather for {latitude}, {longitude}")
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                return {
                    "temperature": data["current"]["temperature_2m"],
                    "temperature_unit": "Celsius",
                }
            else:
                raise Exception(
                    f"Failed to get weather data, status code: {response.status}"
                )

class VoiceAgent(Agent):
    # ... previous code ...

    # Internal function tool
    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }
```

- Use external tools for reusable, standalone functions (registered via `tools=[...]`).
- Use internal tools for agent-specific logic as class methods.
- Both must be decorated with `@function_tool` for the agent to recognize and use them.
Connect your agent to an AI model using the unified Pipeline class. Pass in whichever components you need; the SDK handles the rest.
Realtime mode (single model, lowest latency):
```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        llm=GeminiRealtime(
            model="gemini-3.1-flash-live-preview",
            config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"]),
        )
    )
    session = AgentSession(agent=VoiceAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

Cascade mode (STT → LLM → TTS, full provider control):

```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=CartesiaTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
    )
    session = AgentSession(agent=VoiceAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

Wire everything into a worker job that joins the room:

```python
from videosdk.agents import AgentSession, WorkerJob, RoomOptions, JobContext

async def start_session(context: JobContext):
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<meeting_id>",
        name="Test Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your .env file is properly configured and all dependencies are installed.

```bash
python main.py
```

> [!TIP]
> **Console Mode**: test your agent locally without a meeting room.
> Set `playground=True` in `RoomOptions` and run `python main.py` to interact via your mic and speakers directly from the terminal.
For deployment options and guides, check out the official documentation: Deployment
VideoSDK Inference provides a unified gateway to access STT, LLM, TTS, Denoise, and Realtime models without managing individual provider API keys. Authentication is handled via your VIDEOSDK_AUTH_TOKEN, and usage is billed from your VideoSDK account balance.
```python
from videosdk.agents.inference import STT, LLM, TTS, Denoise, Realtime
```

Cascade Mode with VideoSDK Inference:

```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model_id="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        denoise=Denoise.sanas(),
        vad=SileroVAD(),
    )
    session = AgentSession(agent=MyAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

Realtime Mode with VideoSDK Inference:

```python
async def start_session(context: JobContext):
    pipeline = Pipeline(
        llm=Realtime.gemini(
            model_id="gemini-3.1-flash-live-preview",
            voice="Puck",
            language_code="en-US",
            response_modalities=["AUDIO"],
        )
    )
    session = AgentSession(agent=MyAgent(), pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

See Inference Pricing for provider-wise billing details.
The framework supports integration with various AI models and tools, across multiple categories:
| Category | Services |
|---|---|
| Real-time Models | OpenAI | Gemini | AWS Nova Sonic | Azure Voice Live |
| Speech-to-Text (STT) | OpenAI | Google | Azure AI Speech | Azure OpenAI | Sarvam AI | Deepgram | Cartesia | AssemblyAI | Navana |
| Language Models (LLM) | OpenAI | Azure OpenAI | Google | Sarvam AI | Anthropic | Cerebras |
| Text-to-Speech (TTS) | OpenAI | Google | AWS Polly | Azure AI Speech | Azure OpenAI | Deepgram | Sarvam AI | ElevenLabs | Cartesia | Resemble AI | Smallest AI | Speechify | InWorld | Neuphonic | Rime AI | Hume AI | Groq | LMNT AI | Papla Media |
| Voice Activity Detection (VAD) | SileroVAD |
| Turn Detection Model | Namo Turn Detector |
| Virtual Avatar | Simli | Anam | Custom (implement connect / aclose protocol) |
| LLM Orchestration | LangChain | LangGraph |
| Denoise | RNNoise |
> [!TIP]
> **Installation Examples**
>
> ```bash
> # Install with specific plugins (quoted so the brackets survive your shell)
> pip install "videosdk-agents[openai,elevenlabs,silero]"
> # Install individual plugins
> pip install videosdk-plugins-anthropic
> pip install videosdk-plugins-deepgram
> ```

Explore the following examples to see the framework in action:
- **Cascade Mode (Basic)**: Simple STT → LLM → TTS voice agent using Google LLM + Deepgram STT + Cartesia TTS.
- **Cascade Mode (Advanced)**: Advanced cascade agent with VAD, turn detection, and interruption handling.
- **Realtime Mode**: Minimal realtime agent using Gemini Live for lowest-latency voice interactions.
- **Hybrid Mode**: Mix cascade and realtime: custom STT with a realtime model, or realtime with custom TTS.
- **Composable Pipelines**: Flexible Pipeline configs: transcription-only, LLM-only, voice + chat, full voice agent.
- **Pipeline Hooks**: Intercept and transform STT, LLM, and TTS data at any stage using `@pipeline.on(...)`.
- **MCP Integration**: Stock Market Analyst Agent with real-time market data access via the Model Context Protocol.
- **Agent-to-Agent (A2A)**: Multi-agent workflow: a customer agent that transfers loan queries to a Loan Specialist Agent.
- **LangChain Integration**: Use LangChain tools and agents within the VideoSDK agent framework.
- **LangGraph Integration**: Orchestrate multi-step agent workflows using LangGraph state machines.
- **Memory (Mem0)**: Persistent memory across sessions using Mem0 for long-term context retention.
- **Vision Agent**: Multimodal agent that processes video frames alongside voice using cascading or realtime pipelines.
- **n8n Workflows**: Trigger n8n automation workflows from within your agent using webhooks.
- **Human in the Loop**: Escalate to a human agent mid-conversation via Discord or other channels.
- **AI Telephony Agent**: Hospital appointment booking via a voice-enabled telephony agent.
- **Documentation Q&A**: Agent that answers questions based on documentation knowledge.
- **Virtual Avatar Agent**: A Virtual Avatar Agent that presents a weather forecast.
- **Appointment Booking**: Healthcare front-desk receptionist for scheduling clinic appointments.
- **Announcement Agent**: Proactive outbound agent for broadcasting announcements.
- **Customer Support**: AI-powered customer support agent with escalation and knowledge base.
- **More Use Cases**: Call center, IVR, medical triage, language tutor, meeting notes, and more.
For comprehensive guides and API references:
- **Documentation**: Complete framework documentation
- **API Reference**: Detailed API documentation
- **Examples Directory**: Additional code examples
We welcome contributions! Here's how you can help:
- **Report Issues**: Open an issue for bugs or feature requests
- **Submit PRs**: Create a pull request with improvements
- **Build Plugins**: Follow our plugin development guide
- **Join Community**: Connect with us on Discord
The framework is under active development, so contributions in the form of new plugins, features, bug fixes, or documentation improvements are highly appreciated.
Want to integrate a new AI provider? Check out BUILD YOUR OWN PLUGIN for:
- Step-by-step plugin creation guide
- Directory structure and file requirements
- Implementation examples for STT, LLM, and TTS
- Testing and submission guidelines
Stay connected with VideoSDK:
- **Discord**: Join our community
- **Twitter**: @video_sdk
- **LinkedIn**: VideoSDK Company
> [!TIP]
> **Support the Project!** ⭐
> Star the repository, join the community, and help us improve VideoSDK by providing feedback, reporting bugs, or contributing plugins.
Made with ❤️ by The VideoSDK Team