Live speech translation powered by OpenAI, Google Gemini, Palabra.ai, and Kizuna AI
Sokuji is a cross-platform desktop application that provides live speech translation using the OpenAI, Google Gemini, Palabra.ai, and Kizuna AI APIs. Available for Windows, macOS, and Linux, it bridges language barriers in live conversations by capturing audio input, processing it through AI models, and delivering translated output in real time. It also supports OpenAI-compatible API endpoints for flexibility.
Prefer not to install a desktop application? Try our browser extension for Chrome, Edge, and other Chromium-based browsers. It offers the same powerful live speech translation features directly in your browser, with special integration for Google Meet and Microsoft Teams.
If you want to install the latest version of the browser extension:
- Download the latest sokuji-extension.zip from the releases page
- Extract the zip file to a folder
- Open Chrome/Chromium and go to chrome://extensions/
- Enable "Developer mode" in the top right corner
- Click "Load unpacked" and select the extracted folder
- The Sokuji extension will be installed and ready to use
Sokuji goes beyond basic translation by offering a complete audio routing solution with virtual device management (Linux only), allowing for seamless integration with other applications. It provides a modern, intuitive interface with real-time audio visualization and comprehensive logging.
- Real-time speech translation using OpenAI, Google Gemini, Palabra.ai, and Kizuna AI APIs
- Simple Mode Interface: Streamlined 6-section configuration for non-technical users:
  - Interface language selection
  - Translation language pairs (source/target)
  - API key management with validation
  - Microphone selection with "Off" option
  - Speaker selection with "Off" option
  - Real-time session duration display
- Multi-Provider Support: Seamlessly switch between OpenAI, Google Gemini, Palabra.ai, and Kizuna AI.
- Supported Models:
  - OpenAI: gpt-4o-realtime-preview, gpt-4o-mini-realtime-preview, gpt-realtime, gpt-realtime-2025-08-28
  - Google Gemini: gemini-2.0-flash-live-001, gemini-2.5-flash-preview-native-audio-dialog
  - Palabra.ai: Real-time speech-to-speech translation via WebRTC
  - Kizuna AI: OpenAI-compatible models with backend-managed authentication
  - OpenAI Compatible: Support for custom OpenAI-compatible API endpoints (Electron only)
- Automatic turn detection with multiple modes (Normal, Semantic, Disabled) for OpenAI
- Audio visualization with waveform display
- Advanced Virtual Microphone (Linux only) with dual-queue audio mixing system:
  - Regular audio tracks: Queued and played sequentially
  - Immediate audio tracks: Separate queue for real-time audio mixing
  - Simultaneous playback: Mix both track types for enhanced audio experience
  - Chunked audio support: Efficient handling of large audio streams
- Real-time Voice Passthrough: Live audio monitoring during recording sessions
- Virtual audio device creation and management on Linux (using PulseAudio/PipeWire)
- Automatic audio routing between virtual devices (Linux only)
- Automatic device switching and configuration persistence
- Audio input and output device selection
- Comprehensive logs for tracking API interactions
- Customizable model settings (temperature, max tokens)
- User transcript model selection (for OpenAI: gpt-4o-mini-transcribe, gpt-4o-transcribe, whisper-1)
- Noise reduction options (for OpenAI: None, Near field, Far field)
- API key validation with real-time feedback
- Configuration persistence in user's home directory
- Optimized AI Client Performance: Enhanced conversation management with consistent ID generation
- Enhanced Tooltips: Interactive help tooltips powered by @floating-ui for better user guidance
- Multi-language Support: Complete internationalization with 35+ languages and English fallback
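The dual-queue mixing idea from the Advanced Virtual Microphone feature can be illustrated with a small sketch. This is not Sokuji's actual implementation: the class name `DualQueueMixer`, its methods, and the use of `Float32Array` sample frames are all assumptions made for illustration. Regular tracks are consumed one after another, while every pending immediate track is summed into the same output frame.

```typescript
// Hypothetical sketch of a dual-queue mixer; names are illustrative, not Sokuji's API.
type Chunk = Float32Array;

class DualQueueMixer {
  private regular: Chunk[] = [];   // played strictly one after another
  private immediate: Chunk[] = []; // mixed in as soon as they are queued

  enqueueRegular(chunk: Chunk): void {
    this.regular.push(chunk);
  }

  enqueueImmediate(chunk: Chunk): void {
    this.immediate.push(chunk);
  }

  /** Produce the next `frameSize` samples of mixed output. */
  nextFrame(frameSize: number): Float32Array {
    const out = new Float32Array(frameSize);
    // Regular tracks: fill the frame sequentially from the head of the queue.
    this.regular = DualQueueMixer.drainSequential(this.regular, out);
    // Immediate tracks: every pending track contributes to this frame at once.
    this.immediate = this.immediate
      .map((track) => DualQueueMixer.addInto(track, out))
      .filter((rest) => rest.length > 0);
    // Clamp to [-1, 1] so summed samples do not exceed the valid range.
    for (let i = 0; i < out.length; i++) {
      out[i] = Math.max(-1, Math.min(1, out[i]));
    }
    return out;
  }

  // Copies queued chunks back to back into `out`; returns what is left over.
  private static drainSequential(queue: Chunk[], out: Float32Array): Chunk[] {
    const rest = [...queue];
    let offset = 0;
    while (offset < out.length && rest.length > 0) {
      const head = rest[0];
      const n = Math.min(head.length, out.length - offset);
      for (let i = 0; i < n; i++) out[offset + i] += head[i];
      offset += n;
      if (n === head.length) rest.shift();
      else rest[0] = head.subarray(n);
    }
    return rest;
  }

  // Adds one track onto the start of `out`; returns the unplayed remainder.
  private static addInto(track: Chunk, out: Float32Array): Chunk {
    const n = Math.min(track.length, out.length);
    for (let i = 0; i < n; i++) out[i] += track[i];
    return track.subarray(n);
  }
}
```

Keeping the two queues separate means a long regular track never delays an immediate one; the trade-off is that summed output has to be clamped (or otherwise limited) before it reaches the device.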
Sokuji uses a modern audio processing pipeline built on Web Audio API, with additional virtual device capabilities on Linux:
- ModernAudioRecorder: Captures input with advanced echo cancellation
- ModernAudioPlayer: Handles playback with queue-based audio management
- Real-time Processing: Low-latency audio streaming with chunked playback
- Virtual Device Support: On Linux, creates virtual audio devices for application integration
The audio flow in Sokuji:
- Input Capture: Microphone audio is captured with echo cancellation enabled
- AI Processing: Audio is sent to the selected AI provider for translation
- Playback: Translated audio is played through the selected monitor device
- Virtual Device Output (Linux only): Audio is also routed to virtual microphone for other applications
- Optional Passthrough: Original voice can be monitored in real-time
This architecture provides:
- Better echo cancellation using modern browser APIs
- Lower latency through optimized audio pipelines
- Virtual device integration on Linux for seamless app-to-app audio routing
- Cross-platform compatibility with graceful degradation
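As a rough illustration of "graceful degradation", the sketch below decides which outputs a session can use on a given platform. The function and type names are hypothetical; only the rules themselves (playback and passthrough everywhere, virtual microphone on Linux with PulseAudio or PipeWire in the desktop app) come from the description above.

```typescript
// Hypothetical capability check; names are illustrative, not Sokuji's API.
type Platform = "linux" | "darwin" | "win32" | "extension";

interface AudioCapabilities {
  translationPlayback: boolean; // translated audio on the monitor device
  voicePassthrough: boolean;    // optional monitoring of the original voice
  virtualMicrophone: boolean;   // routing into other applications
}

function resolveCapabilities(
  platform: Platform,
  hasPulseOrPipeWire: boolean,
): AudioCapabilities {
  return {
    translationPlayback: true,
    voicePassthrough: true,
    // Virtual devices need the Linux desktop app plus PulseAudio or PipeWire.
    virtualMicrophone: platform === "linux" && hasPulseOrPipeWire,
  };
}

// Lists where translated audio is sent for the resolved capabilities.
function outputTargets(caps: AudioCapabilities): string[] {
  const targets = ["monitor"];
  if (caps.virtualMicrophone) targets.push("Sokuji_Virtual_Mic");
  return targets;
}
```

On Windows or macOS the same call simply returns the monitor device alone, so the rest of the pipeline does not need platform-specific branches.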
Modern Audio Service Architecture:
- ModernAudioRecorder: Web Audio API-based recording with echo cancellation
- ModernAudioPlayer: Queue-based playback with event-driven processing
- Unified audio service for both Electron and browser extension platforms
Optimized Client Management:
- GeminiClient: Improved conversation item management with consistent instance IDs
- Reduced method calls and improved performance
- Better memory management for long-running sessions
Audio Processing Implementation:
- Queue-based audio chunk management for smooth playback
- Real-time passthrough with configurable volume control
- Event-driven playback to reduce CPU usage
- Automatic device switching and reconnection
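Queue-based, event-driven playback can be sketched as follows. This is a minimal illustration under assumptions, not the ModernAudioPlayer source: `QueuedPlayer` and the `PlayChunk` callback type are hypothetical, and the actual audio output is injected so the queueing logic stays independent of any audio API. The next chunk starts only when the previous one reports that it has ended, so no polling timer is needed.

```typescript
// Hypothetical sketch; the real player is not shown in this document.
// `PlayChunk` starts playback of one chunk and calls `onEnded` when it finishes.
type PlayChunk = (chunk: Float32Array, onEnded: () => void) => void;

class QueuedPlayer {
  private queue: Float32Array[] = [];
  private playing = false;
  private readonly play: PlayChunk;

  constructor(play: PlayChunk) {
    this.play = play;
  }

  /** Number of chunks waiting behind the one currently playing. */
  get pending(): number {
    return this.queue.length;
  }

  /** Add a chunk; playback starts immediately if the player is idle. */
  enqueue(chunk: Float32Array): void {
    this.queue.push(chunk);
    if (!this.playing) this.playNext();
  }

  // Driven by the "ended" event of the previous chunk rather than a timer.
  private playNext(): void {
    const next = this.queue.shift();
    if (next === undefined) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.play(next, () => this.playNext());
  }
}
```

In a browser, the injected callback could wrap an `AudioBufferSourceNode` and pass `onEnded` to its `onended` handler; that wiring is left out here to keep the sketch runnable anywhere.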
- (required) An OpenAI, Google Gemini, or Palabra.ai API key, OR a Kizuna AI account. For Palabra.ai, you will need a Client ID and Client Secret. For Kizuna AI, sign in to your account to automatically access backend-managed API keys. For OpenAI-compatible endpoints, configure your custom API endpoint URL in the settings (Electron only).
- (optional) Linux with PulseAudio or PipeWire for virtual audio device features (desktop app only)
- Node.js (latest LTS version recommended)
- npm
- Audio support works on all platforms (Windows, macOS, Linux)
- Virtual audio devices require Linux with PulseAudio or PipeWire (desktop app only)
- Clone the repository
    git clone https://github.com/kizuna-ai-lab/sokuji.git
    cd sokuji
- Install dependencies
    npm install
- Launch the application in development mode
    npm run electron:dev
- Build the application for production
    npm run electron:build
Download the appropriate package for your platform from the releases page:
Download and run the .exe installer:
Sokuji Setup 0.9.18.exe
Download and install the .dmg package:
Sokuji-0.9.18.dmg
Download and install the .deb package:
sudo dpkg -i sokuji_0.9.18_amd64.deb
For other Linux distributions, you can also download the portable .zip package and extract it to your preferred location.
- Set up your API key:
  - Click the Settings button in the top-right corner
  - Select your desired provider (OpenAI, Gemini, Palabra, or Kizuna AI).
  - For user-managed providers: Enter your API key and click "Validate". For Palabra, you will need to enter a Client ID and Client Secret. For OpenAI Compatible endpoints (Electron only), configure both the API key and custom endpoint URL.
  - For Kizuna AI: Sign in to your account to automatically access backend-managed API keys.
  - Click "Save" to store your configuration securely.
- Configure audio devices:
  - Click the Audio button to open the Audio panel
  - Select your input device (microphone)
  - Select your output device (speakers/headphones)
- Start a session:
  - Click "Start Session" to begin
  - Speak into your microphone
  - View real-time transcription and translation
- Monitor and control audio:
  - Toggle monitor device to hear translated output
  - Enable real voice passthrough for live monitoring
  - Adjust passthrough volume as needed
- Use with other applications (Linux only):
  - Select "Sokuji_Virtual_Mic" as the microphone input in your target application
  - Translated audio will be sent to that application with advanced mixing support
Redesigned user interface for improved accessibility:
- Streamlined Configuration: 6-section unified layout replacing complex tabbed interface
- Enhanced Tooltips: Interactive help using @floating-ui library for better user guidance
- Session Duration Display: Real-time tracking of conversation length
- Unified Styling: Consistent UI design with improved visual hierarchy
- Multi-language Support: Complete i18n with 35+ languages and English fallback
The audio system now features improved echo cancellation and processing:
- Echo Cancellation: Advanced echo suppression using modern Web Audio APIs
- Queue-Based Playback: Smooth audio streaming with intelligent buffering
- Real-time Passthrough: Monitor your voice with adjustable volume control
- Event-Driven Architecture: Reduced CPU usage through efficient event handling
- Cross-Platform Support: Unified audio handling across all platforms
Enhanced Google Gemini client performance:
- Consistent ID Generation: Optimized conversation item management with fixed instance IDs
- Improved Memory Usage: Reduced redundant ID generation calls
- Better Performance: Streamlined conversation handling for faster response times
Live audio monitoring capabilities:
- Real-time Feedback: Hear your voice while recording for better user experience
- Volume Control: Adjustable passthrough volume for optimal monitoring
- Low Latency: Immediate audio feedback using optimized audio processing
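Adjustable passthrough volume amounts to applying a gain to the captured samples before they reach the monitor output. The sketch below assumes a volume setting in the range 0 to 1 and a hypothetical helper name, `applyPassthroughVolume`; the actual range and function used by Sokuji are not stated in this document.

```typescript
// Hypothetical helper; the 0..1 volume range is an assumption for illustration.
function applyPassthroughVolume(
  input: Float32Array,
  volume: number,
): Float32Array {
  // Clamp the setting so out-of-range values cannot amplify or invert the signal.
  const gain = Math.max(0, Math.min(1, volume));
  // Float32Array.prototype.map returns a new Float32Array of scaled samples.
  return input.map((sample) => sample * gain);
}
```

In a Web Audio graph the same effect is usually achieved with a `GainNode` between the microphone source and the output, which avoids copying sample buffers by hand.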
Sokuji features a simplified architecture focused on core functionality:
- Simplified User System: Only users and usage_logs tables
- Real-time Usage Tracking: Relay server directly writes usage data to database
- Better Auth: Handles all user authentication and session management
- Streamlined API: Only essential endpoints maintained (/quota, /check, /reset)
- Service Factory Pattern: Platform-specific implementations (Electron/Browser Extension)
- Modern Audio Processing: AudioWorklet with ScriptProcessor fallback
- Unified Components: SimpleConfigPanel and SimpleMainPanel for streamlined UX
- Context-Based State: React Context API without external state management
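The Service Factory Pattern mentioned above can be sketched in a few lines. All names here (`AudioService`, `ElectronAudioService`, `ExtensionAudioService`, `createAudioService`, `RuntimeInfo`) are hypothetical and chosen for illustration; the point is only that callers ask the factory for a service and never branch on the platform themselves.

```typescript
// Hypothetical sketch of a platform service factory; not Sokuji's actual classes.
interface AudioService {
  readonly platform: "electron" | "extension";
  supportsVirtualDevices(): boolean;
}

class ElectronAudioService implements AudioService {
  readonly platform = "electron" as const;
  private readonly os: string;

  constructor(os: string) {
    this.os = os;
  }

  // Virtual devices are described as Linux-only in the desktop app.
  supportsVirtualDevices(): boolean {
    return this.os === "linux";
  }
}

class ExtensionAudioService implements AudioService {
  readonly platform = "extension" as const;

  // The browser extension has no virtual device support.
  supportsVirtualDevices(): boolean {
    return false;
  }
}

// Assumed shape of the runtime probe; how it is detected is out of scope here.
interface RuntimeInfo {
  isElectron: boolean;
  os: string;
}

function createAudioService(runtime: RuntimeInfo): AudioService {
  return runtime.isElectron
    ? new ElectronAudioService(runtime.os)
    : new ExtensionAudioService();
}
```

UI components can then depend on the `AudioService` interface alone, which is what lets the same panels run in both the Electron app and the extension.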
-- Core user table
users (id, email, name, subscription, token_quota)
-- Simplified usage tracking (written by relay)
usage_logs (id, user_id, session_id, model, total_tokens, input_tokens, output_tokens, created_at)
- Runtime: Electron 34+ (Windows, macOS, Linux) / Chrome Extension Manifest V3
- Frontend: React 18 + TypeScript
- Backend: Cloudflare Workers + Hono + D1 Database
- Authentication: Better Auth
- AI Providers: OpenAI, Google Gemini, Palabra.ai, Kizuna AI, and OpenAI-compatible endpoints
- Advanced Audio Processing:
- Web Audio API for real-time audio processing
- MediaRecorder API for reliable audio capture
- ScriptProcessor for real-time audio analysis
- Queue-based playback system for smooth streaming
- UI Libraries:
- @floating-ui/react for advanced tooltip positioning
- SASS for styling
- Lucide React for icons
- Internationalization:
- i18next for multi-language support
- 35+ language translations