Push-to-talk speech-to-text that types anywhere.
Hold a key → speak → release → your words appear wherever you're typing. Works in any application — terminals, browsers, editors, chat apps, anything with a text cursor.
TAK is free and open source (MIT). If it makes your workflow faster, consider supporting continued development:
| Method | Details |
|---|---|
| Credit card | Sponsor on GitHub — no fees, recurring or one-time |
| Solana (SOL) | GWhM8kiCcAKqsNn1WMfyVv1tgTCdz6mGCGWy3UqqHLQx |
All crypto goes directly to the maintainer's wallets. Full details: docs/donations.md
- Push-to-talk — microphone is only open while you hold the key (no always-on listening)
- System-wide — types into whatever window/field currently has focus
- Cross-platform — Linux (X11) and macOS (Apple Silicon)
- Bilingual — auto-detects English and Spanish
- Local & private — runs entirely on your machine via faster-whisper (Linux) or mlx-whisper (macOS) — no cloud APIs
- GPU-accelerated — uses CUDA on NVIDIA GPUs (Linux) or Metal on Apple Silicon (macOS)
- Auto-normalization — automatically boosts quiet microphone levels
- Voice activity detection — filters out silence and background noise
- Modular architecture — platform-agnostic core with pluggable backends
- Visual overlay — floating recording indicator on all screens (macOS)
- Native menu bar app — macOS status item with recording state, preferences, and uninstall (macOS)
- Preferences UI — graphical settings for trigger key, model, audio device, and clipboard mode (macOS)
- In-app model downloads — progress bar with speed/ETA when switching models or on first launch (macOS)
- Configurable — choose your trigger key, model size, and input method
- Linux with X11 (Wayland support planned)
- NVIDIA GPU with CUDA (or use --cpu for CPU-only)
- Conda (Miniconda or Anaconda)
- System packages: xdotool, xclip, libportaudio2
- macOS 13+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4) recommended — Metal GPU acceleration via MLX
- Intel Macs work but run CPU-only inference (significantly slower)
- Homebrew
- Conda (Miniconda or Anaconda)
Download the latest signed and notarized DMG:
Open the disk image, drag TAK into Applications, and launch. Grant Accessibility and Microphone permissions when prompted. The speech model (~1.5 GB) downloads automatically on first launch.
Requires macOS 13+ and Apple Silicon (M1/M2/M3/M4).
git clone https://github.com/lchonkan/tak.git
cd tak
./install.sh

The installer automatically detects your platform, installs system dependencies, creates a conda environment, and verifies the setup.
sudo apt install xdotool xclip libportaudio2
conda create -n tak python=3.11 -y
conda activate tak
pip install -r requirements-linux.txt

Or install manually:
pip install faster-whisper pynput sounddevice numpy

For GPU acceleration (recommended), also install the CUDA libraries:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12

pynput needs access to /dev/input to detect key presses. Add your user to the input group:
sudo usermod -aG input $USER
# Log out and back in for the change to take effect

# 1. Install system dependencies
brew install portaudio ffmpeg
# 2. Create Conda environment and install Python packages
conda create -n tak python=3.11 -y
conda activate tak
pip install -r requirements-macos.txt
# 3. Grant Accessibility permission (required for key detection)
# System Settings → Privacy & Security → Accessibility → add your terminal app
# 4. Run
./run.sh

TAK checks for Accessibility permission on startup and will show a clear error if it's missing. macOS will also prompt for Microphone access on the first recording; click "Allow".
./run.sh
On Linux, hold Right Ctrl → speak → release → text appears at cursor. Press Ctrl+C to quit.

./run.sh
On macOS, hold Right Option → speak → release → text appears at cursor. Press Ctrl+C to quit.
On macOS, a red REC pill appears at the bottom of every screen while recording, turning yellow during transcription.
First run downloads the Whisper model (~1.5 GB). Subsequent runs start much faster.
./run.sh --key caps_lock # Use a different trigger key
./run.sh --model large-v3 # More accurate (uses more VRAM)
./run.sh --model small # Faster, less accurate
./run.sh --model tiny # Fastest, least accurate
./run.sh --clipboard # Use clipboard paste (always on for macOS)
./run.sh --cpu # Run on CPU (Linux only, no GPU required)
./run.sh --device 2 # Use a specific audio input device
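The flags above map naturally onto an argparse parser. The following is a minimal sketch under assumptions, not TAK's actual parse_args: the flag names come from the examples above, while the `build_parser` name, the `None` defaults, and the help strings are illustrative.

```python
import argparse

# Model names as they appear in this README's model table.
MODELS = ["tiny", "base", "small", "medium", "large-v3", "turbo"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented ./run.sh flags."""
    parser = argparse.ArgumentParser(prog="tak")
    parser.add_argument("--key", default=None, help="trigger key, e.g. caps_lock")
    parser.add_argument("--model", choices=MODELS, default=None, help="Whisper model size")
    parser.add_argument("--clipboard", action="store_true", help="inject text via clipboard paste")
    parser.add_argument("--cpu", action="store_true", help="run inference on CPU (Linux only)")
    parser.add_argument("--device", type=int, default=None, help="audio input device index")
    return parser


# Equivalent to: ./run.sh --model small --device 2
args = build_parser().parse_args(["--model", "small", "--device", "2"])
print(args.model, args.device, args.clipboard)
```

Leaving the defaults as `None` lets a later step fill in platform-specific values only for options the user did not set.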
You can also run directly with Python (after activating the conda env):
conda activate tak
python -m tak --model medium

| | Linux | macOS |
|---|---|---|
| Trigger key | ctrl_r (Right Ctrl) | alt_r (Right Option) |
| Whisper model | medium | turbo |
| Text injection | Simulated keystrokes | Clipboard paste (Cmd+V) |
| GPU acceleration | CUDA (NVIDIA) | Metal (Apple Silicon) |
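One way to express the per-platform defaults in the table is a lookup keyed on `platform.system()`. This is a sketch, not TAK's code: the values are copied from the table, while the dictionary layout and the `defaults_for` helper are hypothetical.

```python
import platform
from typing import Optional

# Values copied from the defaults table; the structure itself is illustrative.
PLATFORM_DEFAULTS = {
    "Linux": {"trigger_key": "ctrl_r", "model": "medium", "injection": "keystrokes", "gpu": "cuda"},
    "Darwin": {"trigger_key": "alt_r", "model": "turbo", "injection": "clipboard", "gpu": "metal"},
}


def defaults_for(system: Optional[str] = None) -> dict:
    """Return a copy of the defaults for a platform.system() value."""
    system = system or platform.system()  # "Linux", "Darwin", "Windows", ...
    if system not in PLATFORM_DEFAULTS:
        raise RuntimeError(f"Unsupported platform: {system}")
    return dict(PLATFORM_DEFAULTS[system])


print(defaults_for("Darwin")["model"])
```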
alt_r (macOS default), ctrl_r (Linux default), ctrl_l, alt_l,
shift_r, shift_l, scroll_lock, pause, insert, f1–f12, caps_lock
Note: scroll_lock, pause, and insert are only available on Linux.
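The platform restriction in the note can be enforced before the listener starts. The sketch below assumes a hypothetical `validate_trigger_key` helper; only the key names come from the list above. In pynput, names like these usually correspond to attributes of `pynput.keyboard.Key`, but the sketch deliberately avoids importing pynput so it runs anywhere.

```python
# Key names copied from the list above.
COMMON_KEYS = {
    "alt_r", "ctrl_r", "ctrl_l", "alt_l", "shift_r", "shift_l", "caps_lock",
} | {f"f{n}" for n in range(1, 13)}
LINUX_ONLY_KEYS = {"scroll_lock", "pause", "insert"}


def validate_trigger_key(name: str, system: str) -> str:
    """Return the normalized key name, or raise ValueError if unsupported.

    `system` is a platform.system() value such as "Linux" or "Darwin".
    """
    key = name.strip().lower()
    allowed = COMMON_KEYS | (LINUX_ONLY_KEYS if system == "Linux" else set())
    if key not in allowed:
        raise ValueError(f"Trigger key {key!r} is not available on {system}")
    return key


print(validate_trigger_key("F5", "Darwin"))
```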
| Model | VRAM/RAM | Speed | Accuracy | Notes |
|---|---|---|---|---|
| tiny | ~1 GB | Fastest | Basic | |
| base | ~1 GB | Fast | Good | |
| small | ~2 GB | Moderate | Better | |
| medium | ~5 GB | Slower | Great | Linux default |
| large-v3 | ~6 GB | Slowest | Best | |
| turbo | ~2 GB | Fast | Great | macOS default |
Models are downloaded on first use and cached in ~/.cache/huggingface/hub/.
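The table can be read as a simple selection rule: pick the most accurate model that fits in available memory. The sketch below encodes that reading. The memory figures are the approximate values from the table, while the numeric accuracy ranks, the tie-break toward the smaller model, and the `pick_model` helper are assumptions for illustration, not TAK behavior.

```python
from pathlib import Path
from typing import Optional

# (name, approx. memory in GB, accuracy rank); ranks are an illustrative
# encoding of Basic < Good < Better < Great < Best from the table.
MODELS = [
    ("tiny", 1, 1),
    ("base", 1, 2),
    ("small", 2, 3),
    ("medium", 5, 4),
    ("turbo", 2, 4),
    ("large-v3", 6, 5),
]

# Cache location stated in this README.
CACHE_DIR = Path.home() / ".cache" / "huggingface" / "hub"


def pick_model(available_gb: float) -> Optional[str]:
    """Most accurate model that fits; ties go to the smaller model."""
    fitting = [m for m in MODELS if m[1] <= available_gb]
    if not fitting:
        return None
    name, _, _ = max(fitting, key=lambda m: (m[2], -m[1]))
    return name


print(pick_model(4), pick_model(8))
```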
TAK has three main stages that run in a loop:
- Key listener: pynput monitors for the trigger key. On press, recording starts; on release, recording stops.
- Audio recording: On Linux, captures audio via PipeWire (pw-record) or falls back to ALSA via sounddevice. On macOS, captures audio via Core Audio through sounddevice. Audio is resampled to 16 kHz mono (Whisper's native format). Quiet audio is auto-normalized so Whisper can hear it.
- Transcription & typing: On Linux, faster-whisper transcribes the audio using CUDA on your NVIDIA GPU. On macOS, mlx-whisper transcribes using Metal on Apple Silicon. The detected text is injected into the focused window using platform-specific methods (xdotool on Linux, clipboard paste via Cmd+V on macOS).
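Two parts of the recording stage can be sketched with the standard library alone: choosing a capture backend on Linux, and boosting quiet audio. Neither function is taken from TAK. `pick_capture_backend` assumes the fallback is decided by whether pw-record is on PATH, and `normalize_peak` uses plain peak normalization with a gain cap, which may differ from TAK's actual normalization; `target_peak` and `max_gain` are hypothetical parameters.

```python
import shutil
from typing import Callable, List, Optional, Sequence


def pick_capture_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Prefer PipeWire's pw-record when it is on PATH, else use sounddevice."""
    return "pw-record" if which("pw-record") else "sounddevice"


def normalize_peak(
    samples: Sequence[float], target_peak: float = 0.9, max_gain: float = 20.0
) -> List[float]:
    """Boost float samples in [-1, 1] so the loudest one reaches target_peak.

    The gain is capped so near-silent input is not amplified without limit,
    and audio that is already loud enough is returned unchanged.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = min(target_peak / peak, max_gain)
    if gain <= 1.0:
        return list(samples)
    return [s * gain for s in samples]


quiet = [0.0, 0.09, -0.045]
boosted = normalize_peak(quiet)
print(pick_capture_backend(), max(abs(s) for s in boosted))
```

Passing `which` as a parameter keeps the backend choice testable without touching the real PATH.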
Transcription runs in a background thread so the key listener stays responsive. If you start a new recording while the previous one is still being transcribed, it waits until the current transcription finishes.
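The behavior described above (a responsive listener, with transcriptions handled one at a time) can be sketched with a single worker thread and a queue. This is one possible design and not necessarily how TAK implements it, since a lock around the transcriber would give similar ordering. `fake_transcribe` is a stand-in for the Whisper call, and all names are illustrative.

```python
import queue
import threading
import time

results = []
jobs: "queue.Queue" = queue.Queue()


def fake_transcribe(audio: str) -> str:
    """Stand-in for a Whisper call; sleeps to simulate inference time."""
    time.sleep(0.05)
    return audio.upper()


def worker() -> None:
    """Process recordings strictly in arrival order, one at a time."""
    while True:
        audio = jobs.get()
        try:
            if audio is None:  # sentinel: shut the worker down
                return
            results.append(fake_transcribe(audio))
        finally:
            jobs.task_done()


def on_release(audio: str) -> None:
    """Called from the key listener; returns immediately."""
    jobs.put(audio)


thread = threading.Thread(target=worker, daemon=True)
thread.start()

on_release("hello")
on_release("hola")  # queued while "hello" is still being transcribed
jobs.join()         # wait for both transcriptions to finish
jobs.put(None)
thread.join()
print(results)
```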
On macOS, a floating pill overlay appears on all connected screens: red while recording, yellow while transcribing. The overlay uses PyObjC (NSPanel) and runs on the main thread via an NSApplication event loop, while pynput runs in a daemon thread.
TAK uses a modular architecture with dependency injection. The core application logic is platform-agnostic, while platform-specific backends (audio recording, transcription, text injection) are plugged in at startup.
graph TD
subgraph "tak/__main__.py (Entry Point)"
EP[Platform Detection] --> |Linux| LINUX[platforms.linux]
EP --> |macOS| MACOS[platforms.macos]
end
subgraph "tak/core/app.py (Shared Core)"
APP[TakApp]
BASE_REC[BaseAudioRecorder]
BASE_TR[BaseTranscriber]
PARSE[parse_args]
end
subgraph "tak/backend/linux.py"
LREC[LinuxAudioRecorder<br/>PipeWire / ALSA]
LTR[LinuxTranscriber<br/>faster-whisper + CUDA]
LTI[type_text<br/>xdotool / xclip]
end
subgraph "tak/backend/macos.py"
MREC[MacAudioRecorder<br/>Core Audio]
MTR[MacTranscriber<br/>mlx-whisper + Metal]
MTI[type_text<br/>AppleScript / pbcopy]
end
LINUX --> |injects backends| APP
MACOS --> |injects backends| APP
LREC --> |extends| BASE_REC
LTR --> |extends| BASE_TR
MREC --> |extends| BASE_REC
MTR --> |extends| BASE_TR
APP --> |uses| LREC
APP --> |uses| LTR
APP --> |uses| LTI
APP --> |uses| MREC
APP --> |uses| MTR
APP --> |uses| MTI
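The injection pattern in the diagram can be sketched in a few lines. The class names BaseAudioRecorder, BaseTranscriber, TakApp, and type_text come from the diagram; the method names (`start`, `stop`, `transcribe`, `on_press`, `on_release`) and the fake backends are assumptions made for illustration, not TAK's real interfaces.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class BaseAudioRecorder(ABC):
    """Recorder interface; the start/stop method names are assumptions."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> bytes: ...


class BaseTranscriber(ABC):
    """Transcriber interface; the transcribe method name is an assumption."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...


class TakApp:
    """Platform-agnostic core: it only sees the injected backends."""

    def __init__(
        self,
        recorder: BaseAudioRecorder,
        transcriber: BaseTranscriber,
        type_text: Callable[[str], None],
    ) -> None:
        self.recorder = recorder
        self.transcriber = transcriber
        self.type_text = type_text

    def on_press(self) -> None:
        self.recorder.start()

    def on_release(self) -> None:
        text = self.transcriber.transcribe(self.recorder.stop())
        if text:
            self.type_text(text)


class FakeRecorder(BaseAudioRecorder):
    def start(self) -> None:
        pass

    def stop(self) -> bytes:
        return b"raw-audio"


class FakeTranscriber(BaseTranscriber):
    def transcribe(self, audio: bytes) -> str:
        return f"{len(audio)} bytes heard"


typed: List[str] = []
app = TakApp(FakeRecorder(), FakeTranscriber(), typed.append)
app.on_press()
app.on_release()
print(typed)
```

Because the core never imports a platform module, the same TakApp can be exercised with fake backends, as above, or wired to real Linux or macOS backends at startup.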
For detailed architecture diagrams (class diagrams, sequence diagrams, state machines, threading model, audio pipeline, and more), see docs/architecture.md.
tak/                                  # Project root
├── run.sh                            # Cross-platform launcher
├── requirements-linux.txt            # Linux Python dependencies
├── requirements-macos.txt            # macOS Python dependencies
├── TAK.spec                          # PyInstaller spec for macOS .app bundle
├── setup_app.py                      # .app bundle build script
├── resources/
│   └── tak.icns                      # macOS app icon
├── README.md                         # This file
├── CONTRIBUTING.md                   # Git Flow and contributor guide
├── CLAUDE.md                         # Instructions for Claude Code
├── LICENSE
├── docs/
│   ├── architecture.md               # Detailed architecture diagrams
│   ├── platform-architecture.md      # Cross-platform stack comparison
│   ├── macos-implementation-plan.md  # macOS implementation spec (completed)
│   └── donations.md                  # Donation methods and wallet addresses
├── tak/                              # Python package
│   ├── __init__.py                   # Package marker
│   ├── __main__.py                   # CLI entry point (platform detection, backend wiring)
│   ├── core/
│   │   ├── app.py                    # Shared core (TakApp, base classes, CLI, constants)
│   │   ├── config.py                 # TakConfig dataclass (platform-agnostic settings)
│   │   └── models.py                 # Shared model metadata (MLX repo IDs)
│   ├── backend/
│   │   ├── linux.py                  # Linux backend (faster-whisper, PipeWire/ALSA, xdotool)
│   │   └── macos.py                  # macOS backend (mlx-whisper, Core Audio, AppleScript)
│   └── ui/
│       └── macos/
│           ├── design.py             # macOS design system (colors, fonts, card views)
│           ├── gui_main.py           # GUI entry point for macOS .app bundle
│           ├── overlay.py            # Floating recording/transcribing pill overlay
│           ├── menubar.py            # macOS menu bar status item and dropdown
│           ├── settings.py           # Preferences window (NSUserDefaults persistence)
│           └── splash.py             # Model download splash screen
└── .gitignore
Some applications don't accept simulated keystrokes from xdotool. Use clipboard mode instead:
./run.sh --clipboard

pynput needs access to /dev/input. Make sure your user is in the input group:
sudo usermod -aG input $USER
# Log out and back in

List available audio devices:
conda activate tak && python -m sounddevice
Then specify the device index:
./run.sh --device <index>

TAK needs Accessibility permission for pynput to detect global key events. Go to:
System Settings → Privacy & Security → Accessibility
Add your terminal app (Terminal.app, iTerm2, etc.) to the list.
macOS will prompt for microphone access on the first recording attempt. Click "Allow" when prompted. If you accidentally denied it, re-enable in:
System Settings → Privacy & Security → Microphone
If pw-record is not installed, TAK automatically falls back to direct ALSA recording via sounddevice. This works but may not see PipeWire virtual devices (e.g., Bluetooth headsets routed through PipeWire). To install PipeWire tools:
sudo apt install pipewire-pulse pipewire-audio-client-libraries

Make sure you have the NVIDIA CUDA pip packages installed:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
Or bypass GPU entirely:
./run.sh --cpu

Whisper models are downloaded from Hugging Face on first use. If downloads are slow, you can set a mirror:
export HF_ENDPOINT=https://hf-mirror.com
./run.sh

- Apple Developer ID code signing and notarization for Gatekeeper-ready distribution
- DMG installer packaging
- Auto-launch on login (Launch Agent / Login Item)
- Wayland support (Linux)
- Windows support
- Multiple language selection in preferences
- Test model switching behavior (what happens when user changes model in Settings)
- Evaluate which Whisper models to expose to end users
- Fix Spanish accent characters pasting incorrectly (keyboard input encoding)
Contributions are welcome! See CONTRIBUTING.md for the branching model, commit conventions, and PR guidelines.
MIT