A live caption app for macOS.
Privacy-first, lightweight, and friendly for macOS users. What happens on your device stays on your device.
- Privacy First - No cloud, no analytics, no ads, no internet required, and no screen capture access.
- Lightweight & Fast - Runs efficiently, with up to a 1.7× faster word-level lead rate and 10% lower latency than the default macOS Live Caption.
- Minimalist Design - One-click on/off, no distractions. Less is more.
- Open Source - Free and transparent.
v1.0 Now Available on the App Store!
Download Livcap from the Mac App Store
Livcap outperforms macOS's native Live Caption with significant improvements:
- 1.7x faster word-level lead rate
- 10% lower latency
- More efficient processing with better resource utilization
See detailed comparison benchmarks in livcapComparision.md.
Our performance gains come from three key optimizations:
- Single-pass inference - Uses one SFSpeechRecognizer call instead of the multiple inferences observed in native Live Caption
- Smart downsampling - Converts audio from 48 kHz to 16 kHz before processing, maintaining quality while reducing computational overhead
- VAD-based silence skipping - Voice Activity Detection prevents unnecessary processing during silent periods, saving resources and improving responsiveness
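The downsampling step can be illustrated with a minimal sketch. It assumes mono Float samples and a plain 3:1 decimation with averaging; the function name `downsample` is hypothetical, and the app itself may well use a system converter rather than hand-written code like this.

```swift
/// Downsamples mono samples by an integer factor, averaging each group of
/// `factor` samples (a crude low-pass filter) to limit aliasing.
/// 48 kHz -> 16 kHz corresponds to factor 3.
func downsample(_ samples: [Float], factor: Int = 3) -> [Float] {
    precondition(factor > 0, "factor must be positive")
    let outputCount = samples.count / factor  // a trailing partial group is dropped
    var output = [Float]()
    output.reserveCapacity(outputCount)
    for i in 0..<outputCount {
        let start = i * factor
        var sum: Float = 0
        for j in start..<(start + factor) {
            sum += samples[j]
        }
        output.append(sum / Float(factor))
    }
    return output
}

// One second of audio at 48 kHz becomes one second at 16 kHz.
let oneSecondAt48k = [Float](repeating: 0.5, count: 48_000)
let oneSecondAt16k = downsample(oneSecondAt48k)
print(oneSecondAt16k.count)
```

Feeding the recognizer a third of the samples is where the reduced overhead would come from; a production resampler would normally use a proper anti-aliasing filter instead of a block average.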
Complete local processing with zero external dependencies:
- No cloud services - Built entirely on Apple's native SFSpeechRecognizer framework, ensuring all speech processing happens locally on your device
- Direct audio access - Uses CoreAudio Tap to capture system audio directly from the buffer, eliminating the need for ScreenCaptureKit or screen recording permissions
- Zero data transmission - Your conversations never leave your Mac: no servers, no analytics, no tracking
Development History
- Compared whisper.cpp with the built-in SFSpeechRecognizer.
- Tried three audio architecture approaches:
  - VAD-Based Silence Detection
  - 5-Second Fixed Sliding Windows
  - 30-Second WhisperLive-Inspired Buffer
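To reset the app's privacy permissions during development (replace com.xxx.xx with the app's bundle identifier):
tccutil reset All com.xxx.xx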
Based on SFSpeechRecognizer from Apple's built-in Speech framework.
Approach 1: VAD-Based Silence Detection - **Most Reliable**
Files: BufferManager.swift, VADProcessor.swift, EnhancedVAD.swift
How it works:
- Accumulates speech until 3 consecutive silence frames
- Triggers inference on speech end or 15s maximum
- RMS threshold (0.01) with asymmetric hysteresis
Characteristics: Event-driven, variable buffer, speech-only segments
Status: Best balance of quality and usability
Limitations: Variable latency, potential word cutoff, VAD tuning needed
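The segmentation logic of Approach 1 can be sketched roughly as follows. Only the 0.01 RMS threshold, the 3-frame silence rule, and the 15 s cap come from the description above. Everything else is an assumption for illustration: the type name `VADSegmenter`, the 100 ms frame length, the lower exit threshold of 0.005, and the reading of "asymmetric hysteresis" as separate enter and exit thresholds. It is not the code in BufferManager.swift or VADProcessor.swift.

```swift
/// Root-mean-square energy of one audio frame.
func rms(_ frame: [Float]) -> Float {
    guard !frame.isEmpty else { return 0 }
    let sumOfSquares = frame.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(frame.count)).squareRoot()
}

struct VADSegmenter {
    var startThreshold: Float = 0.01   // from the text: RMS threshold
    var endThreshold: Float = 0.005    // assumed lower exit threshold (hysteresis)
    var silenceFramesToEnd = 3         // from the text: 3 consecutive silence frames
    var maxSegmentFrames = 150         // 15 s cap at an assumed 100 ms per frame

    private var isSpeaking = false
    private var silenceRun = 0
    private var segment: [Float] = []
    private var segmentFrames = 0

    /// Feeds one frame; returns a finished segment when inference should run.
    mutating func process(_ frame: [Float]) -> [Float]? {
        let energy = rms(frame)
        if !isSpeaking {
            // Silence before speech is skipped entirely.
            guard energy >= startThreshold else { return nil }
            isSpeaking = true
        }
        segment.append(contentsOf: frame)
        segmentFrames += 1
        silenceRun = energy < endThreshold ? silenceRun + 1 : 0

        if silenceRun >= silenceFramesToEnd || segmentFrames >= maxSegmentFrames {
            let finished = segment
            isSpeaking = false
            silenceRun = 0
            segment = []
            segmentFrames = 0
            return finished
        }
        return nil
    }
}

// Two speech frames followed by silence yield exactly one segment.
var vad = VADSegmenter()
let speechFrame = [Float](repeating: 0.1, count: 1_600)
let silentFrame = [Float](repeating: 0, count: 1_600)
var segments: [[Float]] = []
for frame in [silentFrame, speechFrame, speechFrame,
              silentFrame, silentFrame, silentFrame, silentFrame] {
    if let finished = vad.process(frame) {
        segments.append(finished)
    }
}
print(segments.count, segments.first?.count ?? 0)
```

The sketch also shows where the listed limitations come from: a segment is only emitted after the trailing silence has been observed, so latency depends on how long the speaker pauses.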
Approach 2: 5-Second Sliding Windows - **Word-Level Chaos**
Files: ContinuousStreamManager.swift, TranscriptionStabilizationManager.swift
How it works:
- 5s sliding window with 1s stride (4s overlap)
- LocalAgreement algorithm for word-level stabilization
- Temporal overlap analysis for conflicts
Characteristics: Fixed 1s intervals, 5s buffer, word-level matching
Status: Overlap analysis creates transcription instability
Limitations: Complex word matching, frequent text changes, poor readability
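The core idea behind LocalAgreement-style stabilization is to commit only the words on which two consecutive hypotheses agree. The sketch below is a simplified illustration with hypothetical names (`agreedPrefix`, `LocalAgreementStabilizer`). It assumes every hypothesis starts at the same point in the audio, which is not true for a sliding window; aligning shifted windows is the hard part that the temporal overlap analysis has to solve.

```swift
/// Returns the longest common word prefix of two hypotheses (case-insensitive).
func agreedPrefix(_ previous: [String], _ current: [String]) -> [String] {
    var prefix: [String] = []
    for (a, b) in zip(previous, current) {
        guard a.lowercased() == b.lowercased() else { break }
        prefix.append(b)
    }
    return prefix
}

struct LocalAgreementStabilizer {
    private(set) var committed: [String] = []
    private var previousHypothesis: [String] = []

    /// Takes the latest full hypothesis and returns the newly committed words.
    mutating func update(with hypothesis: String) -> [String] {
        let words = hypothesis.split(separator: " ").map(String.init)
        let agreed = agreedPrefix(previousHypothesis, words)
        previousHypothesis = words
        // Words are committed once and never retracted in this sketch.
        guard agreed.count > committed.count else { return [] }
        let newWords = Array(agreed[committed.count...])
        committed.append(contentsOf: newWords)
        return newWords
    }
}

var stabilizer = LocalAgreementStabilizer()
let firstPass = stabilizer.update(with: "the quick")
let secondPass = stabilizer.update(with: "the quick brown")
let thirdPass = stabilizer.update(with: "the quick brown fox")
print(stabilizer.committed)
```

Each word is displayed as stable only after it appears in two successive passes, so the last word of every hypothesis stays tentative until the next inference confirms it.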
Approach 3: 30-Second WhisperLive - **High Latency**
Files: WhisperLiveContinuousManager.swift, WhisperLiveAudioBuffer.swift
How it works:
- Continuous 30s audio buffer
- 1s inference intervals with smart trimming
- Pre-inference VAD for speech extraction
Characteristics: Fixed 1s intervals, 30s context, maximum Whisper context
Status: >2s latency unsuitable for real-time
Limitations: Excessive latency, high overhead, memory intensive
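A bounded rolling buffer of the kind Approach 3 relies on might look like the sketch below. The type name `RollingAudioBuffer` and the 16 kHz sample rate are assumptions, and trimming here simply drops the oldest samples, which is simpler than the "smart trimming" mentioned above.

```swift
struct RollingAudioBuffer {
    let sampleRate: Int
    let maxSeconds: Int
    private(set) var samples: [Float] = []

    init(sampleRate: Int = 16_000, maxSeconds: Int = 30) {
        self.sampleRate = sampleRate
        self.maxSeconds = maxSeconds
    }

    /// Maximum number of samples kept (30 s of audio by default).
    var capacity: Int { sampleRate * maxSeconds }

    /// Current buffered duration in seconds.
    var durationSeconds: Double { Double(samples.count) / Double(sampleRate) }

    /// Appends new audio and drops the oldest samples beyond the capacity.
    mutating func append(_ chunk: [Float]) {
        samples.append(contentsOf: chunk)
        let overflow = samples.count - capacity
        if overflow > 0 {
            samples.removeFirst(overflow)
        }
    }
}

// Appending 35 one-second chunks keeps only the most recent 30 seconds.
var rolling = RollingAudioBuffer()
for second in 0..<35 {
    rolling.append([Float](repeating: Float(second), count: 16_000))
}
print(rolling.durationSeconds)
```

Running inference over up to 30 s of audio every second is what makes this approach memory- and compute-heavy, as noted in the limitations.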
After extensive testing of all three approaches:
- Approach 1 (VAD-Based) is currently the most practical solution, providing the best balance of quality and usability despite variable latency.
- Approach 2 (5s Sliding) suffers from word-level chaos due to complex overlap analysis, making transcriptions unstable and hard to read.
- Approach 3 (30s WhisperLive) provides excellent context but has unacceptable latency (>2s) for real-time applications.
Comparison Chart
| Aspect | Approach 1: VAD-Based | Approach 2: 5s Sliding | Approach 3: 30s WhisperLive |
|---|---|---|---|
| Trigger | Silence detection | Fixed 1s intervals | Fixed 1s intervals |
| Buffer Size | Variable (up to 15s) | Fixed 5s sliding | Variable (0-30s) |
| Overlap | None | 4s temporal overlap | Continuous context |
| Latency | Variable (silence-dependent) | Predictable 1s | Predictable 1s |
| Context | Speech segments only | 5s windows | Maximum 30s context |
| Stabilization | None | LocalAgreement | Pre-inference VAD |
We welcome contributions! Please read our Contributing Guidelines before submitting PRs.
Key Requirements:
- Privacy first (no data collection/network features)
- Lightweight performance (maintain efficiency)
- Simple UI design (minimal interface)
- Follow the PR template with motivation, code summary, AI assistance docs, and a demo (optional)
Known issue: "invalid display identifier 37D8832A-2D66-02CA-B9F7-8F30A301B230" occurs when the monitor changes.
- Compare against the new SpeechAnalyzer API once macOS 26 is released (non-beta). Nov 2025.
- Implement MLX Whisper and compare performance. Oct 2025.
  - Add KV cache support.
  - Add tokenizer support.
  - Add quantization support for a speedup.
- Explore hybrid approaches combining the best aspects of each method
- Investigate adaptive buffer sizing based on speech patterns
- Optimize VAD parameters for different acoustic environments
MLX-Swift only supports safetensors files. Use Utilities/convert.py to convert .pt files to .safetensors format.
Required Files:
Livcap/CoreWhisperCpp/ggml-base.en.bin
Livcap/CoreWhisperCpp/ggml-tiny.en.bin
Livcap/CoreWhisperCpp/ggml-base.en-encoder.mlmodelc
Livcap/CoreWhisperCpp/ggml-tiny.en-encoder.mlmodelc
Livcap/CoreWhisperCpp/whisper.xcframework