```bash
pip install opencv-python torch torchvision numpy tqdm
pip install sk-video
```
Automated setup script for installing a complete local AI video creation pipeline on macOS/Windows/Linux. No cloud services required.
The script:
- Validates Python 3.10+, Git, and FFmpeg
- Checks disk space (20-40 GB recommended)
- Creates the installation directory structure

It installs:
- ComfyUI (visual node editor for Stable Diffusion)
- Python virtual environment with all dependencies
- ComfyUI-Manager (plugin management)
- Launch scripts for easy startup
- AnimateDiff (text/image to animation)
- Video Helper Suite (video I/O nodes)
- Stable Video Diffusion support (image to video)
- Real-ESRGAN (AI upscaling to 1080p/4K)
- RIFE (frame interpolation for smooth 60 fps)
- Coqui XTTS v2 (voice synthesis + cloning)
- MusicGen/AudioCraft (AI music generation)
- Example scripts
Run the installer with:

```bash
# Full installation
python3 setup_ai_video.py --all

# Just prerequisites and ComfyUI
python3 setup_ai_video.py --phase 1 2

# Add video tools
python3 setup_ai_video.py --phase 3

# Add upscaling and audio
python3 setup_ai_video.py --phase 4 5

# Install to a custom directory
python3 setup_ai_video.py --all --dir ~/my-video-studio
```

Before running the script, ensure you have:
- Python 3.10 or newer: check with `python3 --version`
- Git: check with `git --version`
- FFmpeg
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt install ffmpeg`
  - Windows: download from ffmpeg.org
- 20-40 GB free disk space
- Optional but recommended:
  - Apple Silicon Mac, or
  - NVIDIA GPU with 8+ GB VRAM, or
  - Decent CPU (slower but works)
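If you want to confirm these prerequisites in one step before launching the installer, a minimal preflight check could look like this (a hypothetical helper, not part of the setup script):

```python
# preflight.py - hypothetical prerequisite check, mirroring what the installer validates
import shutil
import sys

assert sys.version_info >= (3, 10), "Python 3.10+ required"
for tool in ("git", "ffmpeg"):
    assert shutil.which(tool), f"{tool} not found on PATH"

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB ({'OK' if free_gb >= 20 else 'below the 20 GB minimum'})")
```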
Once installation finishes, launch ComfyUI:

```bash
cd ~/ai-video-studio/ComfyUI
./launch.sh   # or launch.bat on Windows
```

Then open: http://127.0.0.1:8188
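The server can take a while to start, especially on first launch. If you script against it, you can poll until it responds (a small convenience sketch; the URL assumes the default port):

```python
# wait_for_comfyui.py - poll until the ComfyUI web server answers
import time
import urllib.request

for _ in range(60):
    try:
        urllib.request.urlopen("http://127.0.0.1:8188", timeout=1)
        print("ComfyUI is up")
        break
    except OSError:
        time.sleep(1)
else:
    print("ComfyUI did not come up within 60 seconds")
```

Once the server responds, continue in the browser: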
- Click "Manager" button in ComfyUI web interface
- Go to "Install Models" tab
- Download:
- SDXL base model (or Flux)
- AnimateDiff motion modules
- (Optional) Stable Video Diffusion model
- In ComfyUI Manager → Workflows → Load AnimateDiff example
- Set your text prompt
- Configure: 16-24 frames, 12-24 fps, 768×432 resolution
- Click "Queue Prompt" to generate
Upscale a video with Real-ESRGAN:

```bash
cd ~/ai-video-studio/tools/Real-ESRGAN
python inference_realesrgan.py -n RealESRGAN_x4plus -i input.mp4 -o output.mp4
```

Interpolate to smooth 60 fps with RIFE:

```bash
cd ~/ai-video-studio/tools/RIFE
python inference_video.py --exp=2 --video input.mp4 --output smooth_60fps.mp4
```

Generate a voiceover with Coqui XTTS v2:

```bash
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Your narration text here" \
    --out_path voiceover.wav
```

Generate background music:

```bash
cd ~/ai-video-studio/tools/audio
python generate_music_example.py
```

Mux video, voiceover, and music into the final cut:

```bash
ffmpeg -i video.mp4 -i voice.wav -i music.wav \
  -filter_complex "[1:a]adelay=0|0[a1];[2:a]volume=0.35[a2];[a1][a2]amix=inputs=2[aout]" \
  -map 0:v -map "[aout]" -c:v libx264 -c:a aac -shortest final.mp4
```

The filtergraph delays the voice track by 0 ms on both channels (`adelay=0|0`, adjust to taste), lowers the music to 35% volume, and mixes the two into a single audio stream that is mapped alongside the original video.
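Phase 5 creates `generate_music_example.py`; its exact contents aren't reproduced here, but a minimal MusicGen script using the AudioCraft API would look roughly like this (model choice and prompt are illustrative):

```python
# Minimal MusicGen sketch (assumes the audiocraft package installed in phase 5)
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")  # small model, CPU-friendly
model.set_generation_params(duration=30)                    # seconds of audio
wav = model.generate(["calm ambient background music"])     # one clip per prompt
audio_write("music", wav[0].cpu(), model.sample_rate, strategy="loudness")
```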
The installer creates the following layout:

```
~/ai-video-studio/
├── ComfyUI/
│   ├── .venv/                # Python virtual environment
│   ├── custom_nodes/         # Plugins
│   │   ├── ComfyUI-Manager/
│   │   ├── ComfyUI-AnimateDiff-Evolved/
│   │   └── ComfyUI-VideoHelperSuite/
│   ├── models/               # AI models (downloaded via Manager)
│   └── launch.sh             # Startup script
│
└── tools/
    ├── Real-ESRGAN/          # Upscaling
    ├── RIFE/                 # Frame interpolation
    └── audio/                # TTS & music generation
        └── generate_music_example.py
```
Phase 1: Prerequisites
- Validates Python, Git, and FFmpeg versions
- Checks available disk space
- Creates the directory structure
- Runtime: < 1 minute

Phase 2: ComfyUI
- Clones the ComfyUI repository (~500 MB)
- Creates an isolated Python virtual environment
- Installs PyTorch and dependencies (~2-5 GB)
- Installs ComfyUI-Manager
- Creates launch scripts
- Runtime: 5-15 minutes (depending on network speed)

Phase 3: Video generation
- Installs AnimateDiff custom nodes
- Installs Video Helper Suite
- Sets up video generation capabilities
- Runtime: 2-5 minutes

Phase 4: Upscaling and interpolation
- Clones the Real-ESRGAN repository
- Installs upscaling dependencies
- Clones RIFE for frame interpolation
- Installs interpolation dependencies
- Runtime: 5-10 minutes

Phase 5: Audio
- Installs Coqui TTS (~1-2 GB)
- Installs MusicGen dependencies (~500 MB)
- Creates example scripts
- Runtime: 5-15 minutes

Total installation time: ~20-45 minutes (varies by network and system)
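The setup script itself isn't reproduced in this README, but the `--phase`/`--all` interface above maps naturally onto a dispatch table keyed by phase number. A hypothetical skeleton (function names and structure are illustrative, not the actual source):

```python
# Hypothetical skeleton of the phase dispatch in setup_ai_video.py
import argparse

def check_prereqs():       print("phase 1: validating prerequisites...")
def install_comfyui():     print("phase 2: installing ComfyUI...")
def install_video_nodes(): print("phase 3: installing video nodes...")
def install_enhancement(): print("phase 4: installing upscaling/interpolation...")
def install_audio():       print("phase 5: installing audio tools...")

PHASES = {1: check_prereqs, 2: install_comfyui, 3: install_video_nodes,
          4: install_enhancement, 5: install_audio}

parser = argparse.ArgumentParser()
parser.add_argument("--phase", type=int, nargs="+", choices=sorted(PHASES))
parser.add_argument("--all", action="store_true")
args = parser.parse_args()

for n in (sorted(PHASES) if args.all else args.phase or []):
    PHASES[n]()
```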
If you see missing dependencies:

```bash
cd ~/ai-video-studio/ComfyUI
.venv/bin/pip install [missing-package]
```

If your GPU is not being detected or used:
- NVIDIA: Ensure CUDA drivers are installed
- AMD: Check ROCm support
- Apple Silicon: Should work out of the box with the MPS backend
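To see which accelerator PyTorch actually detects, run a quick check with the interpreter inside the ComfyUI virtual environment (path assumes the default install directory):

```python
# gpu_check.py - report which accelerators PyTorch can see
import torch

print("CUDA available:", torch.cuda.is_available())
mps = getattr(torch.backends, "mps", None)
print("MPS available:", bool(mps and mps.is_available()))
```

Run it as `~/ai-video-studio/ComfyUI/.venv/bin/python gpu_check.py`.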
If you run out of memory during generation:
- Reduce resolution (e.g., 512×512 instead of 768×768)
- Reduce batch size/frame count
- Close other applications
- Use CPU mode (slower but works; ComfyUI can be launched with the `--cpu` flag)
If audio drifts out of sync with video, add the `-async 1` flag:

```bash
ffmpeg -i video.mp4 -i audio.wav -async 1 -c:v copy -c:a aac output.mp4
```

For the full list of installer options:

```bash
python3 setup_ai_video.py --help
```

To verify the installation:

```bash
# ComfyUI
ls ~/ai-video-studio/ComfyUI

# All tools
ls ~/ai-video-studio/tools

# Verify Python packages
~/ai-video-studio/ComfyUI/.venv/bin/pip list
```

To uninstall everything:

```bash
rm -rf ~/ai-video-studio
```

- First Run: Model downloads happen via ComfyUI-Manager (5-20 GB)
- Generation Speed:
  - CPU: 5-30 min per video
  - GPU: 30 sec - 5 min per video
- Start Small: Test with 512×512, 16 frames, then scale up
- Batch Processing: Use ComfyUI's queue system for multiple videos (see the sketch below)
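ComfyUI's queue can also be driven programmatically through its local HTTP API. A sketch, assuming a workflow exported in API format (the "Save (API Format)" option, which may need dev mode enabled in the settings) and saved as `workflow_api.json`:

```python
# queue_workflow.py - submit an API-format workflow to a running ComfyUI instance
import json
import urllib.request

with open("workflow_api.json") as f:   # exported via "Save (API Format)"
    graph = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": graph}).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```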
Minimum:
- CPU: 4+ cores
- RAM: 16 GB
- Storage: 40 GB free
- Generation time: 10-30 min/video

Recommended:
- GPU: NVIDIA RTX 3060+ (8 GB VRAM) or Apple M1/M2
- RAM: 32 GB
- Storage: 100 GB free (SSD)
- Generation time: 1-5 min/video

High-end:
- GPU: NVIDIA RTX 4090 (24 GB VRAM)
- RAM: 64 GB
- Storage: 500 GB SSD
- Generation time: 30 sec - 2 min/video
- Read CLAUDE.md for detailed workflow tutorials
- Explore ComfyUI workflows in the Manager's workflow gallery
- Join communities:
- r/StableDiffusion
- ComfyUI Discord
- Civitai.com (model sharing)
This setup script is provided as-is. Individual tools have their own licenses:
- ComfyUI: GPL-3.0
- Real-ESRGAN: BSD-3-Clause
- RIFE: MIT
- Coqui TTS: MPL-2.0
- MusicGen: CC-BY-NC-4.0
Based on the comprehensive guide in CLAUDE.md. Tools by their respective authors and communities.