- Project Introduction
- Key Features
- System Architecture
- Code Description
- Hardware Requirements
- Quick Start
- Example Projects
- Community
DAZI-AI is a serverless AI voice assistant developed entirely on the ESP32 platform using the Arduino environment. It allows you to run AI voice interactions directly on ESP32 devices without the need for additional server support. The system provides complete voice interaction capabilities including speech recognition, AI processing, and text-to-speech output.
✅ Serverless Design:
- More flexible secondary development
- Higher degree of freedom (customize prompts or models)
- Simpler deployment (no additional server required)
✅ Complete Voice Interaction:
- Voice input via INMP441 microphone
- Real-time speech recognition using ByteDance ASR API
- AI processing through OpenAI API
- Voice output via MAX98357A I2S audio amplifier
✅ Continuous Conversation Mode:
- Automatic speech recognition with VAD (Voice Activity Detection)
- Seamless ASR → LLM → TTS conversation loop
- Configurable conversation memory to maintain context
- One-button control to start/stop continuous mode
The system uses a modular design with the following key components:
- Voice Input: INMP441 microphone with I2S interface
- Speech Recognition: ByteDance ASR API for real-time transcription
- AI Processing: OpenAI ChatGPT API for conversation with memory support
- Voice Output: MAX98357A I2S audio amplifier for TTS playback
- Connectivity: WiFi for API communication
- Push-to-Talk Mode (examples/chat): Hold button to record, release to process
- Continuous Conversation Mode (examples/chat_asr): Automatic ASR with VAD, seamless conversation loop
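At runtime these components form a simple capture → transcribe → reason → speak loop. The sketch below is a purely illustrative outline of one such turn built from stub placeholders; the library's real classes and method names live in `src/ArduinoGPTChat.h` and `src/ArduinoASRChat.h`.

```cpp
// Illustrative outline of a single conversation turn; every function here is a
// stub placeholder standing in for the library's STT/ASR, ChatGPT, and TTS calls.
#include <Arduino.h>

String listenAndTranscribe()          { return "Hello!"; }  // INMP441 capture + ASR/STT (placeholder)
String askChatGPT(const String& text) { return text; }      // OpenAI request, optionally with memory (placeholder)
void   speak(const String& text)      { (void)text; }       // TTS synthesis + I2S playback (placeholder)

void conversationTurn() {
  String userText = listenAndTranscribe();  // 1. record speech and convert it to text
  String reply    = askChatGPT(userText);   // 2. send the transcription to the LLM
  speak(reply);                             // 3. play the reply through the MAX98357A
}

void setup() {}
void loop() { conversationTurn(); }
```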
A unified Arduino library that integrates all necessary components for AI voice assistant development.
| Feature | Description |
|---|---|
| ChatGPT Communication | Communicates with OpenAI API, handles requests and responses |
| Conversation Memory | Maintains conversation history for context-aware responses |
| TTS | Text-to-Speech functionality, converts AI replies to voice |
| STT | Speech-to-Text functionality, converts user input to text |
| Real-time ASR | ByteDance ASR integration with WebSocket protocol for streaming recognition |
| VAD | Voice Activity Detection for automatic speech detection and silence handling |
| Audio Processing | Processes and converts audio data formats (modified ESP32-audioI2S) |
| Audio Playback | I2S audio output with support for multiple codecs (MP3, AAC, FLAC, Opus, Vorbis) |
```
DAZI-AI/
├── library.properties         # Arduino library configuration
├── keywords.txt               # Syntax highlighting keywords
├── README.md                  # Documentation
├── src/                       # All source code
│   ├── ArduinoGPTChat.cpp     # ChatGPT & TTS implementation
│   ├── ArduinoGPTChat.h       # ChatGPT & TTS header
│   ├── ArduinoASRChat.cpp     # Real-time ASR implementation
│   ├── ArduinoASRChat.h       # Real-time ASR header
│   ├── Audio.cpp              # Modified ESP32-audioI2S library
│   ├── Audio.h                # Audio library header
│   ├── aac_decoder/           # AAC audio decoder
│   ├── flac_decoder/          # FLAC audio decoder
│   ├── mp3_decoder/           # MP3 audio decoder
│   ├── opus_decoder/          # Opus audio decoder
│   └── vorbis_decoder/        # Vorbis audio decoder
└── examples/                  # Example projects
    ├── chat/                  # Push-to-talk voice chat example
    │   └── chat.ino           # Push-to-talk mode with INMP441
    └── chat_asr/              # Continuous conversation example
        └── chat_asr.ino       # ASR-based continuous mode with memory
```
- Controller: ESP32 development board (ESP32-S3 recommended)
- Audio Amplifier: MAX98357A or similar I2S amplifier
- Microphone: INMP441 I2S MEMS microphone
- Speaker: 4Ω 3W speaker or headphones
| INMP441 Pin | ESP32 Pin | Description |
|---|---|---|
| VDD | 3.3V | Power (DO NOT use 5V!) |
| GND | GND | Ground |
| L/R | GND | Left channel select |
| WS | GPIO 4 | Left/Right clock |
| SCK | GPIO 5 | Serial clock |
| SD | GPIO 6 | Serial data |
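For reference, the wiring above maps onto an I2S receive configuration roughly like the following minimal sketch (using the ESP32 legacy I2S driver; the bundled examples perform this setup internally, and the pin constants simply mirror the table):

```cpp
// Minimal I2S receive setup matching the INMP441 wiring above.
#include <Arduino.h>
#include <driver/i2s.h>

#define I2S_MIC_WS   4   // WS  -> GPIO 4
#define I2S_MIC_SCK  5   // SCK -> GPIO 5
#define I2S_MIC_SD   6   // SD  -> GPIO 6

void setupMicI2S() {
  i2s_config_t cfg = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 16000,                          // 16 kHz is typical for speech recognition
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,  // INMP441 delivers 24-bit samples in 32-bit frames
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,   // L/R tied to GND selects the left channel
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = 0,
    .dma_buf_count = 8,
    .dma_buf_len = 256,
  };
  i2s_pin_config_t pins = {
    .mck_io_num = I2S_PIN_NO_CHANGE,
    .bck_io_num = I2S_MIC_SCK,
    .ws_io_num = I2S_MIC_WS,
    .data_out_num = I2S_PIN_NO_CHANGE,
    .data_in_num = I2S_MIC_SD,
  };
  i2s_driver_install(I2S_NUM_1, &cfg, 0, NULL);    // use port 1 so port 0 stays free for playback
  i2s_set_pin(I2S_NUM_1, &pins);
}

void setup() { setupMicI2S(); }
void loop() {}
```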
| Function | ESP32 Pin | Description |
|---|---|---|
| I2S_DOUT | GPIO 47 | Audio data output |
| I2S_BCLK | GPIO 48 | Bit clock |
| I2S_LRC | GPIO 45 | Left/Right clock |
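On the playback side, these three pins are handed to the bundled (modified) ESP32-audioI2S `Audio` class before any TTS output, roughly as in this illustrative fragment (the examples already contain the equivalent code):

```cpp
// Speaker-side I2S setup for the MAX98357A using the bundled Audio class
// (modified ESP32-audioI2S). Pin numbers mirror the table above.
#include "Audio.h"

#define I2S_DOUT 47   // audio data to the amplifier
#define I2S_BCLK 48   // bit clock
#define I2S_LRC  45   // left/right (word) clock

Audio audio;

void setup() {
  audio.setPinout(I2S_BCLK, I2S_LRC, I2S_DOUT);  // order: BCLK, LRC, DOUT
  audio.setVolume(15);                           // ESP32-audioI2S volume range is 0-21
}

void loop() {
  audio.loop();  // must be called continuously while audio is playing
}
```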
1. **Environment Setup**
   - Install Arduino IDE (version 2.0+ recommended)
   - Install ESP32 board support in the Arduino IDE:
     - Go to `File` → `Preferences`
     - Add the ESP32 board manager URL: `https://espressif.github.io/arduino-esp32/package_esp32_index.json`
     - Go to `Tools` → `Board` → `Boards Manager`
     - Search for "ESP32" and install "esp32 by Espressif Systems"
2. **Library Installation via ZIP**

   **Method 1: Direct ZIP Installation (Recommended)**
   - Download or create a ZIP file of the entire `DAZI-AI` folder
   - Ensure the ZIP file structure has `library.properties` at the root level
   - Open the Arduino IDE
   - Go to `Sketch` → `Include Library` → `Add .ZIP Library...`
   - Select the `DAZI-AI.zip` file
   - Wait for installation to complete

   **Method 2: Manual Installation**
   - Copy the entire `DAZI-AI` folder to your Arduino libraries directory:
     - Windows: `Documents\Arduino\libraries\`
     - macOS: `~/Documents/Arduino/libraries/`
     - Linux: `~/Arduino/libraries/`
   - Restart the Arduino IDE
3. **Install Required Dependencies**
   - Open the Arduino IDE Library Manager (`Tools` → `Manage Libraries...`)
   - Search for and install the following libraries:
     - ArduinoWebsockets (v0.5.4 or later)
     - ArduinoJson (v7.4.1 or later)
     - Seeed_Arduino_mbedtls (v3.0.2 or later)
4. **API Key Configuration**

   For Push-to-Talk Mode (`examples/chat/chat.ino`):
   - Replace `"your-api-key"` with your actual OpenAI API key
   - Replace `"your-wifi-ssid"` and `"your-wifi-password"` with your WiFi credentials
   - Optionally modify the system prompt to customize the AI's behavior

   For Continuous Conversation Mode (`examples/chat_asr/chat_asr.ino`):
   - Replace `"your-bytedance-asr-api-key"` with your ByteDance ASR API key (line 37)
   - Replace `"your-openai-api-key"` with your OpenAI API key (line 41)
   - Replace the WiFi credentials (lines 33-34)
   - Set `ENABLE_CONVERSATION_MEMORY` to 1 to enable memory or 0 to disable it (line 7)
   - Optionally modify the system prompt to customize the AI's personality (lines 81-104)
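   Filled in, the placeholders described above end up looking something like this (the variable names are illustrative stand-ins, the values are dummies, and the real identifiers are the ones already defined in the example sketches):

   ```cpp
   // Illustrative stand-ins for the placeholders in chat_asr.ino; the actual
   // identifiers are defined by the example sketch itself.
   #define ENABLE_CONVERSATION_MEMORY 1                   // 1 = keep context between turns, 0 = stateless

   const char* ssid         = "MyHomeWiFi";               // replaces "your-wifi-ssid"
   const char* password     = "MyWiFiPassword";           // replaces "your-wifi-password"
   const char* asrApiKey    = "your-bytedance-asr-key";   // replaces "your-bytedance-asr-api-key"
   const char* openaiApiKey = "sk-...";                   // replaces "your-openai-api-key"
   ```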
5. **Hardware Wiring**
   - Connect the INMP441 microphone according to the pin table above
   - Connect the MAX98357A I2S audio amplifier for speaker output
6. **Open Example Projects**
   - After installing the library, the examples are available in the Arduino IDE
   - Go to `File` → `Examples` → `DAZI-AI`
   - Choose either:
     - **chat**: push-to-talk mode example
     - **chat_asr**: continuous conversation mode example
7. **Compile and Upload**
   - Select the appropriate ESP32 development board
     - This project has been tested on the ESP32S3 Dev Module and the XIAO ESP32S3
     - Requirements: at least 8 MB of flash and at least 4 MB of PSRAM
   - In the Arduino IDE, configure the board settings:
     - Partition Scheme: select "8M with spiffs"
     - PSRAM: select "OPI PSRAM"
   - Compile and upload the code to your device
8. **Testing**
   - Open the serial monitor (115200 baud)
   - Wait for the WiFi connection
   - Hold the BOOT button on your ESP32 to start recording
   - Speak your question or command while holding the button
   - Release the button to send the recording to ChatGPT
   - Listen to the AI response through the connected speaker
The chat example implements a traditional push-to-talk voice interaction system with ChatGPT.
Features:
- Push-to-talk voice recording with INMP441 microphone using BOOT button
- Speech-to-text conversion using OpenAI Whisper API
- ChatGPT conversation processing with customizable system prompts
- Text-to-speech output with natural voice playback
- Real-time audio processing and I2S audio output
- Configurable API endpoints for different OpenAI-compatible services
Usage:
- Hold the BOOT button to start voice recording
- Speak while holding the button
- Release the button to stop recording and send to ChatGPT
- The system will transcribe your speech and send it to ChatGPT
- ChatGPT's response will be played back as speech through the speaker
Control:
- The system uses the ESP32's built-in BOOT button (GPIO 0) for voice control
- Press and hold to record, release to process
- No need to type commands in serial monitor - just use the button!
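Under the hood this amounts to watching GPIO 0 for press and release, roughly as in the illustrative sketch below (the chat example ships its own version; `startRecording()` and `stopRecordingAndProcess()` are placeholders):

```cpp
// Illustrative push-to-talk handling on the BOOT button (GPIO 0, active LOW).
#include <Arduino.h>

const int BOOT_BUTTON = 0;
bool recording = false;

void startRecording()          { /* placeholder: begin I2S capture */ }
void stopRecordingAndProcess() { /* placeholder: STT -> ChatGPT -> TTS */ }

void setup() {
  pinMode(BOOT_BUTTON, INPUT_PULLUP);  // BOOT reads LOW while pressed
}

void loop() {
  bool pressed = (digitalRead(BOOT_BUTTON) == LOW);
  if (pressed && !recording) {         // button went down: start recording
    recording = true;
    startRecording();
  } else if (!pressed && recording) {  // button released: process the utterance
    recording = false;
    stopRecordingAndProcess();
  }
  delay(20);                           // crude debounce
}
```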
The chat_asr example provides advanced continuous voice conversation with real-time ASR and conversation memory.
Features:
- Real-time ASR: ByteDance ASR API for streaming speech recognition
- VAD (Voice Activity Detection): Automatic detection of speech start/end
- Seamless Conversation Loop: Automatic ASR → LLM → TTS → ASR cycle
- Conversation Memory: Maintains context across multiple conversation turns
- One-Button Control: Single button press to start/stop continuous mode
- Intelligent Timeouts: Auto-exit continuous mode if no speech detected
- State Machine Design: Robust state management for smooth transitions
How It Works:
- Press BOOT button → Enters continuous conversation mode
- ASR Listening → System starts listening for speech automatically
- Speech Detection → VAD detects when you start and stop speaking
- Auto Processing → Transcription sent to ChatGPT automatically
- TTS Playback → AI response plays through speaker
- Auto Loop → System automatically returns to listening state
- Press BOOT again → Exit continuous mode
Configuration Options:
- `ENABLE_CONVERSATION_MEMORY`: toggle conversation history on/off (line 7)
- `systemPrompt`: customize the AI's personality and behavior (lines 81-104)
- `setSilenceDuration()`: adjust the silence detection threshold (line 194)
- `setMaxRecordingSeconds()`: set the maximum recording duration (line 195)
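In code, these options boil down to one compile-time switch plus two setter calls, roughly as sketched below (the `AsrPlaceholder` type and the numeric values are illustrative only; the real class and defaults are in `src/ArduinoASRChat.h` and `chat_asr.ino`):

```cpp
// Illustrative only: AsrPlaceholder stands in for the library's real ASR class;
// the method names mirror the options listed above, the values are examples.
#define ENABLE_CONVERSATION_MEMORY 1     // line 7: keep conversation history between turns

struct AsrPlaceholder {
  void setSilenceDuration(int ms) {}     // silence window that ends an utterance (unit assumed: ms)
  void setMaxRecordingSeconds(int s) {}  // hard cap on a single recording
};

AsrPlaceholder asr;

void configureAsr() {
  asr.setSilenceDuration(1200);          // line 194 in chat_asr.ino
  asr.setMaxRecordingSeconds(15);        // line 195 in chat_asr.ino
}
```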
Usage:
- Press BOOT button once to start continuous conversation mode
- Speak naturally - system will detect when you start and stop talking
- AI responses play automatically
- System loops back to listening after each response
- Press BOOT button again to exit continuous mode
State Machine:
`IDLE → LISTENING → PROCESSING_LLM → PLAYING_TTS → WAIT_TTS_COMPLETE → LISTENING` (loop)
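A compact way to picture this is an enum plus a switch, as in the illustrative skeleton below (the hook functions are placeholders; the actual implementation is in `chat_asr.ino`):

```cpp
// Illustrative skeleton of the continuous-conversation state machine described above.
enum class ConvState { IDLE, LISTENING, PROCESSING_LLM, PLAYING_TTS, WAIT_TTS_COMPLETE };

ConvState state = ConvState::IDLE;

// Placeholder hooks; chat_asr.ino provides the real logic behind each of these.
bool buttonPressed()     { return false; }
bool utteranceFinished() { return false; }
bool llmReplyReady()     { return false; }
bool ttsFinished()       { return false; }

void stepStateMachine() {
  if (buttonPressed()) {               // BOOT toggles continuous mode on and off
    state = (state == ConvState::IDLE) ? ConvState::LISTENING : ConvState::IDLE;
    return;
  }
  switch (state) {
    case ConvState::IDLE:              // waiting for the BOOT button
      break;
    case ConvState::LISTENING:         // streaming audio to ASR; VAD watches for silence
      if (utteranceFinished()) state = ConvState::PROCESSING_LLM;
      break;
    case ConvState::PROCESSING_LLM:    // transcription sent to ChatGPT
      if (llmReplyReady()) state = ConvState::PLAYING_TTS;
      break;
    case ConvState::PLAYING_TTS:       // reply handed to TTS / I2S output
      state = ConvState::WAIT_TTS_COMPLETE;
      break;
    case ConvState::WAIT_TTS_COMPLETE: // loop back to listening once playback ends
      if (ttsFinished()) state = ConvState::LISTENING;
      break;
  }
}
```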
Benefits:
- No need to hold button while speaking
- Natural conversation flow like talking to a person
- Context-aware responses with conversation memory
- Automatic voice detection eliminates manual control
Join our Discord community to share development experiences, ask questions, and collaborate with other developers:
Discord Server: https://discord.gg/RFPwfhTM
If you find this project helpful, please give it a ⭐️