A PyTorch and MediaPipe-based system for detecting gestures in videos, clustering them by similarity, and training classifiers for real-time gesture recognition.
- Quick Start
- Local Workflow
- Google Cloud Storage Workflow
- Google Colab Workflow
- Components
- Output Structure
- Configuration
- Requirements
## Quick Start

# 1. Install dependencies
pip install -r requirements.txt
# 2. Add videos to videos/ directory
# 3. Run gesture clustering
python run.py
# 4. Train classifier
python classification.py
# 5. Extract animations for web app
python classification.py --extract-animations

## Local Workflow

Process videos to detect gestures and cluster by similarity using DTW:
python run.py

What it does:
- Analyzes videos at reduced FPS (default: 12 FPS)
- Detects gestures using MediaPipe pose estimation + activity detection
- Clusters similar gestures using DTW distance
- Saves segments and clustering results
Output: `output_<N>v_<FPS>fps_<MIN>-<MAX>f/`

- `N` = number of videos processed
- `FPS` = analysis frame rate
- `MIN-MAX` = gesture length range in frames
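The folder name follows directly from these parameters; a minimal sketch of the convention (the exact formatting logic lives in `run.py` and may differ):

```python
# Sketch of the output-folder naming convention described above.
# The real logic lives in run.py and may differ in detail.
def output_dir_name(num_videos: int, analysis_fps: int,
                    min_frames: int, max_frames: int) -> str:
    return f"output_{num_videos}v_{analysis_fps}fps_{min_frames}-{max_frames}f"

# e.g. 50 videos at 12 FPS with 24-72 frame gestures:
print(output_dir_name(50, 12, 24, 72))  # output_50v_12fps_24-72f
```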
Extract individual video clips organized by cluster:
python run.py --extract-clusters

Train a PyTorch classifier and export to ONNX:
python classification.py

Outputs:
- `output_XXv_XXfps_XX-XXf/gesture_classifier.onnx` - Trained model
- `public/gesture_classifier.onnx` - Copy for web app
- `public.zip` - Downloadable package with model
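To sanity-check the exported model outside the browser, it can be loaded with `onnxruntime` (an optional extra, not listed in `requirements.txt`); input and output names depend on how `classification.py` builds the network, so this sketch only inspects them:

```python
# Optional: inspect the exported classifier with onnxruntime
# (pip install onnxruntime). Input/output names and shapes depend on
# how classification.py defines the model, so check them before running inference.
import onnxruntime as ort

session = ort.InferenceSession("public/gesture_classifier.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```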
Generate gesture sequences for web visualization:
python classification.py --extract-animations

Output: `public/cluster_animations.json`
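The JSON schema is defined by the extraction script and isn't documented here; a quick way to see what was written is to load the file and inspect its top level:

```python
# Peek at the generated animation data; the exact schema is defined by
# classification.py --extract-animations, so inspect it before relying on it.
import json

with open("public/cluster_animations.json") as f:
    animations = json.load(f)

if isinstance(animations, dict):
    print("top-level keys:", list(animations.keys())[:10])
else:
    print("entries:", len(animations))
```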
## Google Cloud Storage Workflow

Process videos and store results in Google Cloud Storage.
# Install GCS support
pip install "google-cloud-storage>=2.10.0"
# Authenticate
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
# OR
gcloud auth application-default login
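Before running the GCS scripts, it can help to confirm that the credentials actually grant access to your bucket. A minimal check with `google-cloud-storage` (the bucket name and prefix below are placeholders):

```python
# Quick credential check with google-cloud-storage.
# Uses GOOGLE_APPLICATION_CREDENTIALS or application-default credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder: replace with your bucket

# List a few objects to confirm read access.
for blob in client.list_blobs(bucket, prefix="videos/", max_results=5):
    print(blob.name, blob.size)
```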
# Process videos from GCS, save to GCS
python run_gcs.py \
--videos gs://my-bucket/videos/ \
--output gs://my-bucket/output/
# Train with GCS manifest, save model to GCS
python classification_gcs.py \
--manifest gs://my-bucket/output_50v_12fps_24-72f/clustering_manifest.json \
--output gs://my-bucket/models/gesture_classifier.onnx
# Extract animations to GCS
python classification_gcs.py \
--extract-animations \
--manifest gs://my-bucket/output_50v_12fps_24-72f/clustering_manifest.json \
  --animations-output gs://my-bucket/public/cluster_animations.json
# Local videos → GCS output
python run_gcs.py --videos videos/ --output gs://my-bucket/output/
# GCS videos → Local output
python run_gcs.py --videos gs://my-bucket/videos/ --output output/
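Mixing local and `gs://` paths ultimately comes down to copying files to and from the bucket. The actual helpers live in `gcs_utils.py` and are not shown here; the sketch below only illustrates the general pattern with `google-cloud-storage` (paths and bucket names are placeholders):

```python
# Generic upload/download pattern for gs:// paths; gcs_utils.py presumably
# wraps something similar, but its actual API is not documented here.
from google.cloud import storage

client = storage.Client()

def upload(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Copy a local file into a GCS bucket."""
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

def download(bucket_name: str, blob_name: str, local_path: str) -> None:
    """Copy a GCS object to a local file."""
    client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)

# Example: push a local video up, pull a result back down (placeholder names).
upload("videos/example.mp4", "my-bucket", "videos/example.mp4")
download("my-bucket", "output/clustering_manifest.json", "clustering_manifest.json")
```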
## Google Colab Workflow

For cloud-based processing without local setup, use the provided Colab notebook:

- Free GPU/TPU acceleration for faster processing
- No local installation required
- Google Drive → GCS integration
- Persistent storage via GCS buckets
- Team collaboration via notebook sharing
Typical processing times:

- Gesture Clustering: 30-60 min for ~50 videos
- Classifier Training: 5-10 min
- Animation Extraction: 2-5 min
The notebook walks through the following steps (the first two are sketched after the list):

- Mount Google Drive and authenticate GCS
- Copy training videos from Drive to GCS bucket
- Run gesture detection and clustering
- Train classifier on clustered gestures
- Extract animations for web app
- Download results or access from GCS
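Inside the notebook, the first two steps typically look like the sketch below, assuming the standard Colab helpers; the project, bucket, and Drive folder names are placeholders, not values from this repository:

```python
# Colab-only setup sketch: mount Drive, authenticate, and copy videos to GCS.
# Project/bucket/folder names below are placeholders, not values from this repo.
from google.colab import auth, drive
from google.cloud import storage
import pathlib

drive.mount("/content/drive")     # step 1: mount Google Drive
auth.authenticate_user()          # step 1: authenticate for GCS access

client = storage.Client(project="my-gcp-project")  # placeholder project
bucket = client.bucket("my-bucket")                # placeholder bucket

# Step 2: copy training videos from Drive into the bucket.
for video in pathlib.Path("/content/drive/MyDrive/videos").glob("*.mp4"):
    bucket.blob(f"videos/{video.name}").upload_from_filename(str(video))
```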
## Components

- `run.py` - Gesture detection and clustering pipeline
- `run_gcs.py` - GCS-enabled wrapper for `run.py`
- `classification.py` - Classifier training and ONNX export
- `classification_gcs.py` - GCS-enabled wrapper for `classification.py`
- `gcs_utils.py` - Google Cloud Storage helper functions
- `scripts/similarity_engine.py` - DTW-based similarity computation (conceptual sketch below)
- `scripts/config.py` - Centralized configuration
- `gesture_clustering_colab.ipynb` - Complete Colab workflow
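The real implementation is in `scripts/similarity_engine.py` and `run.py`; the sketch below only illustrates the underlying idea of pairwise DTW distances fed into HDBSCAN with a precomputed metric, using toy 1-D signals instead of full MediaPipe pose sequences:

```python
# Conceptual sketch of DTW-based clustering (not the repo's actual code):
# build a pairwise DTW distance matrix, then cluster it with HDBSCAN.
import numpy as np
from dtaidistance import dtw
import hdbscan

# Toy stand-ins for per-gesture motion signals (the real input would be
# MediaPipe pose keypoint trajectories of 24-72 frames).
gestures = [
    np.sin(np.linspace(0, 2 * np.pi, n)).astype(np.double)
    for n in (24, 30, 48, 60, 72)
]

n = len(gestures)
dist = np.zeros((n, n), dtype=np.double)
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw.distance(gestures[i], gestures[j])

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
labels = clusterer.fit_predict(dist)
print(labels)  # -1 marks noise; other values are cluster ids
```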
## Output Structure

output_50v_12fps_24-72f/              # Dynamic folder name
├── segments/                         # Individual gesture videos
│   ├── video_segment_000.mp4
│   ├── video_segment_000_metadata.json
│   └── segments_manifest.json
├── clusters/                         # Empty (deprecated)
├── cluster_0/                        # Gestures grouped by cluster
│   ├── video_0_gesture_000.mp4
│   └── ...
├── cluster_1/
│   └── ...
├── clustering_manifest.json          # Main clustering results
├── similarity_report.md              # Similarity statistics
├── gesture_classifier.onnx           # Trained model
└── public/                           # Web app assets
    ├── gesture_classifier.onnx       # Model copy for deployment
    └── cluster_animations.json       # Animation data

public.zip                            # Downloadable package
## Configuration

Edit parameters in `run.py`:
ANALYSIS_FPS = 12 # Analysis frame rate
GESTURE_MIN_FRAMES = 24 # Min gesture length (2 sec @ 12 FPS)
GESTURE_MAX_FRAMES = 72 # Max gesture length (6 sec @ 12 FPS)
USE_HDBSCAN = True # Auto-detect cluster count
N_CLUSTERS = None       # Or set a fixed number

## Requirements

pip install -r requirements.txt

- torch >= 2.0.0 - PyTorch for classifier training
- mediapipe >= 0.10.0 - Pose estimation
- opencv-python >= 4.8.0 - Video processing
- dtaidistance >= 2.3.10 - DTW similarity
- hdbscan >= 0.8.33 - Density-based clustering
- google-cloud-storage >= 2.10.0 - GCS support (optional)
- Python: 3.8+
- RAM: 8GB+ recommended
- GPU: Optional, speeds up MediaPipe processing
- Disk: 2-5GB per hour of video
The output directory is automatically named based on the configuration. To run multiple experiments:

- Adjust parameters in `run.py` (FPS, min/max frames)
- Run the pipeline - a new folder is created automatically
- Results are stored in `output_<config>/`
After training, the ONNX model is automatically:

- Copied to `public/gesture_classifier.onnx`
- Packaged in `public.zip` for easy download
- Copied to `output_XXv_XXfps_XX-XXf/public/` for archiving
Use the model in your web app:
// Load ONNX model in browser
const session = await ort.InferenceSession.create('public/gesture_classifier.onnx');

## Troubleshooting

No gestures detected:
- Adjust `GESTURE_MIN_FRAMES` and `GESTURE_MAX_FRAMES`
- Check video quality and lighting
- Verify MediaPipe can detect poses
Too many/few clusters:
- Set `USE_HDBSCAN = False` and specify `N_CLUSTERS`
- Adjust min/max gesture length to filter noise
Out of memory:
- Reduce `ANALYSIS_FPS`
- Process fewer videos at once
- Use Colab with GPU runtime
## License

MIT