A complete Python toolkit for creating audio-reactive video art by cutting and reassembling video footage based on semantic audio-visual matching using ImageBind embeddings.
This project uses ImageBind multimodal embeddings to semantically match video clips to audio segments. It analyzes audio using onset strength detection, extracts embeddings from both audio and video, and reassembles the video based on semantic similarity.
- Onset strength analysis: Frame-accurate continuous onset strength values for precise cut points
- Source separation: Separate audio into stems using Demucs (drums, bass, vocals, other)
- ImageBind embeddings: Unified audio-visual embeddings for semantic matching
- Semantic video matching: Intelligently match video segments to audio based on content
- Flexible reuse policies: Control how video segments can be reused (none, allow, min_gap, limited, percentage)
- High-quality output: H.264 (CRF 18) and ProRes 422 outputs for editing
- Interactive visualization: Real-time threshold adjustment with playback
- Multiple pitch detection methods: CREPE, SwiftF0, Basic Pitch, or hybrid mixture-of-experts
- Intelligent pitch tracking: Handles vibrato, pitch drift, and silence detection
- Pitch smoothing: Median filtering to reduce false segmentation from natural vocal variations
- MIDI preview videos: Visual verification of pitch detection accuracy
- Configurable parameters: Silence threshold, pitch change sensitivity, segment duration filters
- Python 3.10 or 3.11 (recommended)
- FFmpeg (for video processing)
- macOS, Linux, or Windows
- GPU recommended (but not required)
# Create conda environment (recommended)
conda create -n vh python=3.11
conda activate vh
# Install dependencies
pip install -r requirements.txt
# Install FFmpeg (if not already installed)
# macOS:
brew install ffmpeg
# Linux:
sudo apt-get install ffmpeg
Important:
- Always activate your conda/virtual environment before running scripts
- This installs PyTorch 2.1.0 and NumPy 1.x for compatibility
- ImageBind will be installed automatically on first use
# Run the complete pipeline
./test_full_pipeline.sh path/to/guidance_audio.wav path/to/source_video.mp4
# This will:
# 1. Separate audio into stems (Demucs)
# 2. Analyze onset strength
# 3. Segment audio based on onset points
# 4. Extract ImageBind audio embeddings
# 5. Extract ImageBind video embeddings (sliding window)
Then run semantic matching and assembly:
# Create semantic matches between audio and video
./test_semantic_matching.sh allow
# Assemble final videos
./test_video_assembly.sh path/to/source_video.mp4 path/to/guidance_audio.wav
Onset Strength Analysis:
python src/onset_strength_analysis.py \
--audio data/separated/htdemucs/song/other.wav \
--output data/output/onset_strength.json \
--fps 24 \
--power 0.6 \
--threshold 0.2
Parameters:
- --fps: Frame rate (24, 30, or 60)
- --power: Power-law compression (0.5-1.0, lower = more sensitive)
- --window-size: Smoothing window (0-5 frames)
- --tolerance: Noise removal (0.0-1.0)
- --threshold: Cut point threshold (0.0-1.0)
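As a point of reference, here is a minimal sketch of the onset-strength idea using librosa. Paths and constants are illustrative; the actual onset_strength_analysis.py may differ in smoothing, normalization, and output format.
import json
import numpy as np
import librosa

AUDIO = "data/separated/htdemucs/song/other.wav"   # illustrative path
FPS, POWER, THRESHOLD = 24, 0.6, 0.2

y, sr = librosa.load(AUDIO, sr=None, mono=True)
hop = int(round(sr / FPS))                          # one envelope value per video frame
env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
env = env / (env.max() + 1e-9)                      # normalize to 0..1
env = env ** POWER                                  # power-law compression (lower = more sensitive)
cut_frames = np.flatnonzero(env >= THRESHOLD)       # candidate cut points above the threshold

with open("data/output/onset_strength_sketch.json", "w") as f:
    json.dump({"fps": FPS, "strength": env.tolist(), "cut_frames": cut_frames.tolist()}, f)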
Audio Segmentation:
python src/audio_segmenter.py \
--audio data/separated/htdemucs/song/other.wav \
--onset-strength data/output/onset_strength.json \
--output-dir data/segments/audio \
--threshold 0.2
ImageBind Audio Embeddings:
python src/imagebind_audio_embedder.py \
--segments-metadata data/segments/audio_segments.json \
--segments-dir data/segments/audio \
--output data/segments/audio_embeddings.json \
--batch-size 8 \
--device auto
Extracts 1024-dimensional ImageBind embeddings for each audio segment.
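Internally this follows the standard ImageBind usage pattern. A hedged sketch (the segment path is illustrative; the real embedder adds batching and metadata handling):
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

audio_paths = ["data/segments/audio/segment_000.wav"]   # illustrative segment path
inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device)}
with torch.no_grad():
    embeddings = model(inputs)
vec = embeddings[ModalityType.AUDIO][0]                 # 1024-dim embedding for this segment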
ImageBind Video Embeddings:
python src/imagebind_video_embedder.py \
--video path/to/video.mp4 \
--output data/segments/video_embeddings.json \
--fps 24 \
--window-size 5 \
--stride 6 \
--chunk-size 500
Parameters:
- --window-size: Frames per window (default 5 = ~0.2s at 24fps)
- --stride: Frame step for sliding window (default 6 = 0.25s at 24fps)
- --chunk-size: Max frames loaded in memory at once (default 500; reduce if you run out of memory)
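To see how these parameters turn into sliding windows, here is a small illustrative calculation (assumed bookkeeping, not the embedder's actual code):
fps, window_size, stride = 24, 5, 6
total_frames = 2400                      # e.g. a 100-second source video at 24 fps

windows = [
    {
        "start_frame": start,
        "end_frame": start + window_size,        # exclusive
        "start_time": start / fps,               # windows begin 0.25 s apart with stride=6
        "duration": window_size / fps,           # ~0.21 s per window with window_size=5
    }
    for start in range(0, total_frames - window_size + 1, stride)
]
print(len(windows), windows[0], windows[1])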
Semantic Matching:
python src/semantic_matcher.py \
--audio-embeddings data/segments/audio_embeddings.json \
--video-embeddings data/segments/video_embeddings.json \
--audio-segments data/segments/audio_segments.json \
--output data/segments/matches.json \
--reuse-policy allow
Reuse Policies:
- none: Each video segment is used only once (maximum variety)
- allow: Unlimited reuse (best semantic matches)
- min_gap: Minimum of 5 segments between reuses
- limited: Each video segment reused at most 3 times
- percentage: At most 30% of segments can be reused
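All policies reduce to a constraint on how often a video window may be picked again during greedy matching. A simplified sketch covering none, allow, and min_gap (illustrative, not semantic_matcher.py itself):
import numpy as np

def match_segments(audio_emb, video_emb, policy="min_gap", min_gap=5):
    # audio_emb: (A, 1024), video_emb: (V, 1024); returns one video index per audio segment
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = a @ v.T                                    # cosine similarity matrix (A, V)
    matches, last_used = [], {}
    for i in range(sim.shape[0]):
        for j in np.argsort(sim[i])[::-1]:           # best candidates first
            if policy == "none" and j in last_used:
                continue
            if policy == "min_gap" and j in last_used and i - last_used[j] < min_gap:
                continue
            break                                    # "allow" always takes the top candidate
        matches.append(int(j))
        last_used[j] = i
    return matches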
Video Assembly:
python src/video_assembler.py \
--video path/to/source_video.mp4 \
--audio path/to/guidance_audio.wav \
--matches data/segments/matches.json \
--output data/output/final_video.mp4
Outputs:
- final_video_original_audio.mp4 - H.264 with original video audio
- final_video.mp4 - H.264 with guidance audio
- final_video_original_audio_prores.mov - ProRes 422 (for editing)
Quality Settings:
- H.264: CRF 18 (visually lossless), slow preset, 320kbps AAC audio
- ProRes 422: 10-bit 4:2:2 color, uncompressed PCM audio
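These settings correspond roughly to the following ffmpeg flags, shown here as a Python sketch; the assembler's actual invocations may differ (e.g. in stream mapping and timing):
import subprocess

def encode_h264(video_in, audio_in, out_path):
    # H.264, CRF 18, slow preset, 320 kbps AAC
    subprocess.run([
        "ffmpeg", "-y", "-i", video_in, "-i", audio_in,
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "libx264", "-crf", "18", "-preset", "slow",
        "-c:a", "aac", "-b:a", "320k", "-shortest", out_path,
    ], check=True)

def encode_prores(video_in, audio_in, out_path):
    # ProRes 422 (profile 2), 10-bit 4:2:2, uncompressed PCM audio
    subprocess.run([
        "ffmpeg", "-y", "-i", video_in, "-i", audio_in,
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "prores_ks", "-profile:v", "2", "-pix_fmt", "yuv422p10le",
        "-c:a", "pcm_s16le", "-shortest", out_path,
    ], check=True)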
Create videos that match the pitch sequence of a guide vocal by recutting source singing footage.
# Analyze guide video to extract pitch sequence
./test_pitch_guide.sh data/input/guide_video.mp4
# This creates:
# - data/segments/guide_sequence.json (pitch data)
# - data/segments/guide_midi_preview.mp4 (verification video)
Choose from four pitch detection algorithms:
CREPE (default, most accurate):
./test_pitch_guide.sh video.mp4 --pitch-method crepe
- Deep learning-based pitch detection
- Most accurate for monophonic singing
- Requires TensorFlow, slower but reliable
SwiftF0 (fastest):
./test_pitch_guide.sh video.mp4 --pitch-method swift-f0
- CPU-optimized, very fast (132ms for 5s audio)
- Good accuracy, no GPU required
- May add spurious low bass notes
Hybrid (best of both):
./test_pitch_guide.sh video.mp4 --pitch-method hybrid
- Mixture-of-experts combining CREPE + SwiftF0
- Uses CREPE as primary, fills gaps with SwiftF0
- Filters SwiftF0 outliers (bass notes, pitch jumps)
- Recommended for best results
Basic Pitch (multipitch):
./test_pitch_guide.sh video.mp4 --pitch-method basic-pitch
- Spotify's multipitch detection
- Can detect harmonies
- 3x sub-semitone resolution
Pitch Change Threshold (cents):
# More sensitive (more segments)
./test_pitch_guide.sh video.mp4 --threshold 30
# Less sensitive (smoother, fewer segments)
./test_pitch_guide.sh video.mp4 --threshold 100
- Default: 50 cents
- Lower = splits on smaller pitch changes
- Higher = ignores vibrato/drift
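The threshold is measured in cents (1/100 of a semitone); the conversion between two frequencies is the standard formula:
import math

def cents_between(f1_hz: float, f2_hz: float) -> float:
    # 1200 * log2(f2/f1); 100 cents = one semitone
    return 1200.0 * math.log2(f2_hz / f1_hz)

print(cents_between(440.0, 452.9))   # ~50 cents, the default split threshold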
Silence Detection:
# More permissive (catches quiet singing)
./test_pitch_guide.sh video.mp4 --silence-threshold -60
# More strict (treats quiet sounds as silence)
./test_pitch_guide.sh video.mp4 --silence-threshold -45
- Default: -50 dB
- Lower (more negative) = more permissive
- Helps with quiet consonants, soft singing
Pitch Smoothing:
# Reduce vibrato/waver
./test_pitch_guide.sh video.mp4 --pitch-smoothing 5
# Aggressive smoothing
./test_pitch_guide.sh video.mp4 --pitch-smoothing 7
- Default: 0 (off)
- Median filter window size: 5-7 recommended
- Smooths pitch curve before segmentation
- Higher values may miss quick note changes
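Smoothing is a median filter over the frame-wise pitch track; a toy scipy example (assuming --pitch-smoothing maps to the filter's kernel size):
import numpy as np
from scipy.signal import medfilt

pitch_hz = np.array([220.0, 221.5, 219.0, 247.0, 220.5, 222.0, 220.0])  # one vibrato/waver spike
print(medfilt(pitch_hz, kernel_size=5))   # the 247 Hz outlier is suppressed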
Minimum Duration:
# Filter out very short notes
./test_pitch_guide.sh video.mp4 --min-duration 0.15
- Default: 0.1 seconds
- Filters brief pitch fluctuations
- Useful for cleaning up noisy detections
# Best settings for wavy vocals
./test_pitch_guide.sh guide.mp4 \
--pitch-method hybrid \
--pitch-smoothing 5 \
--silence-threshold -60 \
--threshold 75 \
--min-duration 0.12
# Review MIDI preview
open data/segments/guide_midi_preview.mp4
Pipeline steps:
- Extract audio from video
- Detect continuous pitch using selected method
- Apply smoothing (optional) to reduce vibrato
- Segment on changes: Split when pitch changes >threshold OR silence detected
- Filter segments: Remove very short segments
- Generate MIDI preview: Create video with synthesized tones for verification
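The segmentation rule in steps 4-5 amounts to: start a new note whenever the pitch moves more than the threshold (in cents) from the current note's pitch, or when silence is detected, then drop notes shorter than the minimum duration. An illustrative sketch, not the project's exact code:
import math

def segment_pitch_track(times, pitch_hz, silent, threshold_cents=50.0, min_duration=0.1):
    # times: frame timestamps (s); pitch_hz: detected pitch per frame; silent: per-frame silence flags
    segments, start, ref = [], None, None
    for t, f, quiet in zip(times, pitch_hz, silent):
        if quiet or f is None:                       # silence ends the current note
            if start is not None and t - start >= min_duration:
                segments.append((start, t, ref))
            start, ref = None, None
            continue
        if start is None:                            # first pitched frame of a new note
            start, ref = t, f
        elif abs(1200.0 * math.log2(f / ref)) > threshold_cents:
            if t - start >= min_duration:            # close the old note, open a new one
                segments.append((start, t, ref))
            start, ref = t, f
    if start is not None and times and times[-1] - start >= min_duration:
        segments.append((start, times[-1], ref))
    return segments                                  # list of (start_time, end_time, pitch_hz)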
Outputs:
- guide_sequence.json - Pitch sequence data (time, Hz, MIDI note, confidence)
- guide_midi_preview.mp4 - Video with MIDI playback for verification
- Use hybrid mode for most vocals - best accuracy with gap filling
- Watch the MIDI preview - tones should match singing closely
- Adjust silence threshold if too many/few gaps detected
- Use pitch smoothing (5-7) for vibrato-heavy vocals
- Increase threshold if you get too many micro-segments
python src/interactive_strength_visualizer.py \
--audio data/separated/htdemucs/song/other.wav \
--strength data/output/onset_strength.json \
--output data/output/visualizer.html \
--threshold 0.2
Features:
- Audio playback synchronized with onset strength curve
- Adjustable threshold slider
- Real-time segment statistics
- Click timeline to seek
gaudio provides superior quality stems:
- Upload audio to gaudiolab.com
- Download separated stems
- Place in data/separated/gaudiolab/song/
Or separate locally with Demucs:
demucs -n htdemucs data/input/song.mp3 -o data/separated
For best results: Use the "other" stem for irregular, artistic video cuts.
video-hacking/
├── src/
│ ├── onset_strength_analysis.py # Audio onset analysis
│ ├── audio_segmenter.py # Cut audio into segments
│ ├── imagebind_audio_embedder.py # Extract audio embeddings
│ ├── imagebind_video_embedder.py # Extract video embeddings
│ ├── semantic_matcher.py # Match audio to video
│ ├── video_assembler.py # Assemble final videos
│ └── interactive_strength_visualizer.py # HTML visualizer
├── data/
│ ├── input/ # Source files
│ ├── separated/ # Audio stems
│ ├── segments/ # Audio/video segments + embeddings
│ └── output/ # Final videos + analysis
├── test_full_pipeline.sh # Complete pipeline (Phases 1-3)
├── test_semantic_matching.sh # Phase 4 testing
├── test_video_assembly.sh # Phase 5 testing
├── test_onset_strength.sh # Audio analysis testing
├── install_imagebind.sh # ImageBind installer
├── fix_numpy.sh # NumPy version fixer
└── requirements.txt
# Step 1: Activate environment
conda activate vh
# Step 2: Run complete pipeline (Phases 1-3)
./test_full_pipeline.sh \
data/input/guidance_audio.wav \
data/input/source_video.mp4
# Step 3: Review onset strength visualizer
open data/output/onset_strength_visualizer.html
# Step 4: Create semantic matches (try different policies)
./test_semantic_matching.sh none # No reuse (max variety)
./test_semantic_matching.sh allow # Unlimited reuse (best matches)
./test_semantic_matching.sh limited # Limited reuse
# Step 5: Assemble final videos
./test_video_assembly.sh \
data/input/source_video.mp4 \
data/input/guidance_audio.wav
# Step 6: View results
open data/output/final_video_original_audio.mp4 # Original audio
open data/output/final_video.mp4                  # Guidance audio
Outputs:
- final_video_original_audio.mp4 - Cut video with original audio
- final_video.mp4 - Cut video with guidance audio
- final_video_original_audio_prores.mov - 10-bit 4:2:2, PCM audio
- Use the "other" stem for irregular, artistic cuts (not drums)
- Video length: Use a source video 10-20x longer than the guidance audio for variety
- Threshold tuning: Use visualizer to find optimal cut density
- Reuse policy:
  - Use none for maximum visual variety (requires a long source video)
  - Use allow for best semantic matches (may repeat clips)
  - Use limited or percentage for a balance
- Quality: ProRes output is perfect for further color grading/editing
- Audio Analysis: Onset strength analysis identifies musical changes
- Segmentation: Audio cut into segments at onset points
- Audio Embeddings: Each audio segment → 1024-dim ImageBind embedding
- Video Analysis: Sliding window extracts frames from entire video
- Video Embeddings: Each video window → 1024-dim ImageBind embedding
- Matching: Cosine similarity finds best video for each audio segment
- Assembly: Video segments concatenated and synced with audio
ImageBind creates a unified embedding space for multiple modalities (audio, video, text, etc.). This means:
- Audio and video embeddings are directly comparable
- Semantically similar content has similar embeddings
- No need for bridging between CLAP (audio) and CLIP (video)
If video embedding extraction crashes with "Killed: 9", reduce the chunk size:
# In test_full_pipeline.sh, add --chunk-size parameter (line ~111):
$PYTHON_CMD src/imagebind_video_embedder.py \
--video "$VIDEO_FILE" \
--output data/segments/video_embeddings.json \
--fps 24 \
--window-size 5 \
--stride 6 \
--batch-size 4 \
--chunk-size 200 \
--device auto   # chunk size reduced from the default 500
# Or run manually with smaller chunk:
python src/imagebind_video_embedder.py \
--video path/to/video.mp4 \
--output data/segments/video_embeddings.json \
--chunk-size 200
Chunk size guidelines:
- Default 500: Works for most videos (8-16GB RAM)
- 200-300: For long videos or limited RAM (4-8GB)
- 100-150: For very long videos or 4GB RAM
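Conceptually, --chunk-size bounds how many decoded frames sit in memory at once. A hedged OpenCV sketch of that idea (the real embedder layers the window/stride logic on top):
import cv2

def frame_chunks(video_path, chunk_size=200):
    # Yield lists of at most chunk_size BGR frames so memory stays bounded
    cap = cv2.VideoCapture(video_path)
    chunk = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    cap.release()
    if chunk:
        yield chunk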
NumPy version errors:
./fix_numpy.sh
# or manually:
pip uninstall -y numpy && pip install "numpy<2.0"
ImageBind not installed:
./install_imagebind.sh
# or manually:
pip install git+https://github.com/facebookresearch/ImageBind.git
OpenCV errors:
pip uninstall -y opencv-python
pip install opencv-python==4.8.1.78
GPU usage:
- Onset analysis: CPU is fine (fast enough)
- Demucs separation: GPU recommended (10-20x faster)
- ImageBind embeddings: GPU recommended (5-10x faster)
- Video assembly: CPU only (ffmpeg)
All scripts auto-detect GPU with --device auto
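A plausible sketch of what --device auto resolves to (an assumption; the scripts' exact logic may differ, e.g. around Apple MPS support):
import torch

def resolve_device(requested: str = "auto") -> str:
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return "mps"   # Apple Silicon GPU
    return "cpu"

print(resolve_device("auto"))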
- Audio embeddings: ~0.5s per segment on CPU, ~0.1s on GPU
- Video embeddings: ~2-5s per 100 windows on GPU
- Video assembly: Depends on segment count and video codec
- Full pipeline (30s audio, 10min video): ~5-10 minutes total
- torch/torchaudio (2.1.0): PyTorch for ImageBind
- numpy (<2.0): Numerical computing
- opencv-python (4.8.1.78): Video frame extraction
- ffmpeg: Video processing (external)
- librosa: Onset detection
- soundfile: Audio I/O
- demucs: Source separation
- crepe: Deep learning pitch detection (requires TensorFlow)
- tensorflow (<2.16.0): Required by CREPE
- swift-f0: Fast CPU-based pitch detection
- basic-pitch: Spotify's multipitch detection
- scipy: Signal processing (median filtering)
- imagebind: Multimodal embeddings (auto-installed)
- transformers (<4.36.0): Required by ImageBind and Basic Pitch
- Pillow: Image processing
- Onset strength analysis
- Audio segmentation
- ImageBind audio embeddings
- ImageBind video embeddings (sliding window)
- Semantic matching with reuse policies
- Video assembly with high quality output
- Real-time preview mode
- Batch processing for multiple videos
- Additional reuse strategies
- Color grading integration
- Multiple pitch detection methods (CREPE, SwiftF0, Basic Pitch)
- Hybrid mixture-of-experts (CREPE + SwiftF0)
- Pitch smoothing and silence detection
- MIDI preview video generation
- Configurable parameters (threshold, smoothing, silence)
- Source video pitch analysis
- Pitch matching between guide and source
- Final video assembly based on pitch matches
MIT License - See LICENSE file for details
- ImageBind by Meta AI Research
- Demucs by Alexandre Défossez
- librosa by Brian McFee
- CREPE by Jong Wook Kim et al.
- SwiftF0 by lars76
- Basic Pitch by Spotify Research