An app and gadget that brings your realtime music creations to life, powered by your Mac, Magenta RealTime, MLX, and ESP32
JamBox by ELATO is a local-first macOS app for creating realtime generative music and streaming it to an ESP32 speaker gadget. Touch the ESP32 to interrupt playback, speak a request like "add drums" or "make it lo-fi", and the Mac uses local Whisper STT plus Qwen3.5 tool calls to update the musical bubbles before resuming playback.
Apple Silicon · Magenta RealTime 2 · MLX · React · ESP32-S3 · Whisper ASR · Qwen3.5 Tool Calls · Opus Audio
This repo is built on Magenta RealTime 2, an open-weights realtime music generation model. The main app in this fork is the Collider-based macOS app:
examples/collider/: native macOS app, React UI, websocket server, ESP32 audio streaming, voice-command agent.arduino/: ESP32-S3 firmware for WiFi discovery, touch interrupt, mic streaming, Opus decode, and speaker output.core/andmagenta_rt/: Magenta RealTime C++ and Python inference code.
- Realtime music generation: Drag prompt bubbles around a listener puck to steer the music.
- ESP32 speaker mode: Stream generated audio over websocket as Opus packets.
- Voice interrupt: Touch the ESP32, speak a command, and the Mac pauses music while it listens.
- Local STT and LLM: Whisper transcribes locally; Qwen3.5 chooses tool calls locally.
- Tool-call UI actions: The agent can call
addBubble(text, nearness)andremoveBubble(text). - No cloud required once your models are downloaded.
Realtime generation requires Apple Silicon.
mrt2_smallruns realtime on most Apple Silicon Macs.mrt2_basesounds better but wants a stronger Pro/Max-class machine for realtime use.
The firmware is currently set up for an ESP32-S3 dev board with:
- I2S microphone
- I2S speaker amp, such as MAX98357A
- touch input
- RGB/status LED
- WiFi on the same network as your Mac
See arduino/README.md for pin notes.
git clone --recurse-submodules https://github.com/akdeb/musical-ai-toys.git
cd musical-ai-toysIf you already cloned without submodules:
git submodule update --init --recursiveInstall Homebrew packages:
brew install cmake node opus pythonInstall Python tooling and MLX voice dependencies:
python3 -m pip install --upgrade pip
python3 -m pip install -e ".[mlx]" mlx-whisper mlx-lm huggingface_hubInstall the Magenta model resources:
mrt models init
mrt models downloadBy default, models/resources are stored under:
~/Documents/Magenta/magenta-rt-v2/
The app auto-loads mrt2_base from that folder when available. You can also pick a model from the app settings.
The app expects these Hugging Face models in your local cache:
mlx-community/whisper-base.en-mlxmlx-community/Qwen3.5-4B-MLX-4bit
Download them once:
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("mlx-community/whisper-base.en-mlx")
snapshot_download("mlx-community/Qwen3.5-4B-MLX-4bit")
PYcmake . -B build
cmake --build build --target deploy_mrt2_collider -j10
open ~/Applications/"JamBox by ELATO.app"The deploy target builds the React UI, signs the app locally, bundles voice_agent.py, and copies the app to:
~/Applications/JamBox by ELATO.app
If you use VS Code, install the PlatformIO extension. Or install the CLI:
python3 -m pip install platformioConnect the ESP32-S3 over USB, then run:
cd arduino
pio run -t upload -t monitorThe monitor runs at 115200.
On first boot, the device exposes a WiFi setup portal. Connect to the ELATO access point, enter your WiFi credentials, and put the ESP32 on the same network as the Mac.
The Mac app advertises itself over mDNS and runs the websocket server for the ESP32. Keep the app open while powering or reconnecting the device.
- Open
JamBox by ELATO.app. - Wait for the model to load.
- Click the large play button to stream music to the ESP32.
- Click the laptop icon to enable or disable local Mac audio output.
- Touch the ESP32 to interrupt playback.
- Speak a command, for example:
- "add drums to this"
- "make it lo-fi"
- "remove guitar"
- "add saxophone closer"
- The app shows your transcript, Qwen3.5 calls tools, bubbles update, and ESP32 playback resumes.
The ESP32 firmware uses LED state to show what is happening:
- Green / processing: connected, waiting, or agent is thinking.
- Yellow / listening: mic is streaming to the Mac.
- Blue / speaking: ESP32 speaker is playing generated audio.
- Large play/pause button: ESP32 playback.
- Laptop icon: Mac/laptop audio output.
- Upload icon: load an audio prompt.
- Right instrument rail: click an instrument to add it as a music bubble.
- Settings icon: model, timing, generation, and buffer controls.
musical-ai-toys/
├── arduino/ # ESP32-S3 firmware
├── core/ # C++ realtime inference engine
├── examples/
│ ├── collider/ # JamBox by ELATO macOS app
│ │ ├── ui/ # React UI
│ │ └── voice_agent.py # Whisper + Qwen3.5 tool-call loop
│ ├── common/ # Shared native/React app code
│ └── hello_mrt2/ # Minimal CLI generation example
├── magenta_rt/ # Python package
├── docs/ # Original Magenta RT docs
└── README.md
Run:
mrt models init
mrt models downloadThen reopen the app or choose the model folder from settings.
- Make sure the Mac and ESP32 are on the same WiFi network.
- Open the macOS app before powering/reconnecting the ESP32.
- Check that no firewall is blocking local network traffic.
- Watch
pio device monitorfor mDNS and websocket logs.
Make sure Qwen3.5 is cached:
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("mlx-community/Qwen3.5-4B-MLX-4bit", local_files_only=True)
print("Qwen3.5 is cached")
PYRebuild the latest app:
cmake --build build --target deploy_mrt2_collider -j10The app queues prompt updates while the model is loading, which avoids racing React startup messages with native MLX model initialization.
- Music model: Magenta RealTime 2
- Native inference: C++
magentart::corewith MLX/Metal - UI: React, TypeScript, Vite
- Speech-to-text:
mlx-community/whisper-base.en-mlx - LLM agent:
mlx-community/Qwen3.5-4B-MLX-4bit - Transport: websocket
- ESP32 audio: Opus over websocket, decoded on-device
- Firmware: Arduino + PlatformIO
This is an experimental local AI music toy platform. The local LLM can misunderstand speech or choose imperfect tool calls. Use it as a creative tool, supervise children, and avoid treating generated outputs as authoritative.
This project builds on Google Magenta's Magenta RealTime 2 and the open MLX ecosystem.
See LICENSE.