An intelligent system that automatically detects and segments different room scenes in videos using advanced CLIP embeddings and OpenAI vision analysis.
- Advanced Room Detection: Uses CLIP embeddings for precise scene transition detection
- AI-Powered Classification: GPT-4o vision model for accurate room type identification
- Smart Caching: Reuses previous analysis results for faster processing
- Real-time Processing: Asynchronous processing pipeline with progress monitoring
- Result Export: Exports processing results in JSON format
- CLIP-based Analysis: Uses OpenAI's CLIP model for visual understanding
- Sharpest Frame Selection: Extracts the clearest frame per second
- Multi-modal AI: Combines computer vision with language understanding
- Intelligent Caching: Automatically caches and reuses analysis results
- Thumbnail Generation: Automatic generation of representative thumbnails
- Python 3.8+
- OpenCV 4.8+
- 4GB+ RAM
- Optional: OpenAI API key for enhanced accuracy
- Supported video formats: MP4, MOV, AVI
- Clone the project

```bash
git clone https://github.com/yourusername/video-room-segmentation.git
cd video-room-segmentation
```

- Create a virtual environment

```bash
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows
```

- Install dependencies

```bash
# Basic version (uses fallback algorithms)
pip install -e .

# Full AI version (includes CLIP and OpenAI)
pip install torch torchvision
pip install open-clip-torch
pip install openai
pip install pillow
```

- Configure the environment

```bash
cp .env.example .env
# Edit the .env file and add your OpenAI API key (optional)
export OPENAI_API_KEY="your-api-key-here"
```

- Start the service

```bash
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

- Open your browser and visit http://localhost:8000
```
video-room-segmentation/
├── app/
│   ├── main.py          # FastAPI main application
│   ├── types.py         # Type definitions
│   └── routes/
│       └── api.py       # API routes
├── services/
│   ├── pipeline.py      # Advanced video processing pipeline
│   └── storage.py       # Storage management
├── templates/
│   └── index.html       # Frontend page
├── static/
│   ├── style.css        # Stylesheet
│   └── app.js           # JavaScript logic
├── .cache/              # Analysis cache directory
├── .uploads/            # Upload directory
├── .runs/               # Processing results directory
├── config.yaml          # Configuration file
├── pyproject.toml       # Project configuration
├── .env.example         # Environment variables example
└── README.md            # Project documentation
```
- Upload Video
  - Drag video files to the upload area, or click the button to select files
  - Supports videos up to 500 MB
  - Smart Cache: automatically detects and loads previous analysis for identical videos
  - Instant Results: previously analyzed videos load in ~5 seconds instead of minutes

- Automatic Processing
  - Extracts the sharpest frame per second
  - Uses CLIP embeddings to select representative frames
  - Analyzes frames with OpenAI GPT-4o (if available)
  - Builds an intelligent timeline of room segments
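The "sharpest frame per second" step is usually implemented as a variance-of-Laplacian sharpness score (on real frames, OpenCV's `cv2.Laplacian(gray, cv2.CV_64F).var()` computes the same quantity). The actual logic lives in `services/pipeline.py`; the pure-Python helper below is only an illustrative sketch, and the function names are hypothetical:

```python
def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian response.

    `gray` is a 2-D list of pixel intensities (0-255); blurry frames
    have weak edges, so their Laplacian response has low variance.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def sharpest_frame(frames):
    """Pick the frame with the highest Laplacian variance."""
    return max(frames, key=laplacian_variance)
```

A perfectly flat frame scores 0, while any frame containing edges scores higher, so among the frames sampled within one second the one with the strongest edge response wins.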
- View Results
  - View segmentation results with thumbnails
  - Different rooms are marked with different colors
  - View segment details and confidence scores

- Export Results
  - Download results in JSON format
  - Includes room types, timestamps, and confidence scores
Scenario: You want to view results from a video you processed before.
Solution: Simply re-upload the same video file - the system will automatically detect and reload your previous analysis instantly!
Step-by-step:
- Upload the same video file (same name and content)
- System calculates MD5 hash and finds matching cache
- Results appear in ~5 seconds (instead of minutes of processing)
- All thumbnails and segments are automatically restored
Verification: check the processing info for `"loaded_from_cache": true`
```bash
# Upload a video (cache reuse enabled)
curl -X POST "http://localhost:8000/api/upload" \
  -F "file=@your_video.mp4" \
  -F "reuse_cache=true"

# Check processing status
curl "http://localhost:8000/api/status/{job_id}"

# Fetch the final result
curl "http://localhost:8000/api/result/{job_id}"

# List cached analyses
curl "http://localhost:8000/api/cache/list"

# Clear cache
curl -X DELETE "http://localhost:8000/api/cache/clear"
```
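Since processing is asynchronous, a client typically uploads, then polls the status endpoint until the job finishes. Here is a minimal sketch of that polling loop; the `fetch_status`/`fetch_result` callables stand in for HTTP calls to `GET /api/status/{job_id}` and `GET /api/result/{job_id}` (e.g. via `requests`), and the `"state"` field names are assumptions, not the API's documented schema:

```python
import time


def wait_for_result(fetch_status, fetch_result,
                    poll_interval=2.0, timeout=600.0):
    """Poll the status endpoint until the job finishes, then fetch the result.

    `fetch_status()` returns a status dict (assumed field: "state");
    `fetch_result()` returns the final analysis payload.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("state") == "done":
            return fetch_result()
        if status.get("state") == "failed":
            raise RuntimeError(f"job failed: {status.get('error')}")
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")
```

Cached videos finish almost immediately, so the first status poll often already reports completion.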
The system automatically reuses previous analysis results for identical videos:
🔄 Automatic Cache Loading
```bash
# Upload the same video file - automatically loads from cache
curl -X POST "http://localhost:8000/api/upload" \
  -F "file=@your_video.mp4" \
  -F "reuse_cache=true"
```
⚡ Cache Benefits
- Instant Results: 10-minute video analysis loads in ~5 seconds
- Cost Savings: No repeated OpenAI API calls for same video
- Consistency: Identical results for identical videos
🔍 How Cache Detection Works
- MD5 Hash: System calculates unique fingerprint for each video
- Cache Lookup: Checks the `.cache/` directory for a matching hash
- Smart Reuse: Automatically copies thumbnails and recreates the job structure
- Transparency: Results clearly indicate when loaded from cache
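The detection steps above can be sketched in a few lines: hash the file contents with MD5 and look for `.cache/<hash>.json`. The helper names below are hypothetical, not the ones used in `services/pipeline.py`:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")


def video_fingerprint(path, chunk_size=1 << 20):
    """MD5 of the file contents, read in chunks to keep memory flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def load_cached_analysis(video_path):
    """Return the cached analysis dict for this video, or None on a miss."""
    entry = CACHE_DIR / f"{video_fingerprint(video_path)}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    return None
```

Because the hash covers the file contents rather than the filename, renaming a video does not defeat the cache, while any re-encode produces a new fingerprint.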
📁 Cache File Structure
```
.cache/
└── ecb856097085b77de4dd9967d19713fb.json   # named after the video's MD5 hash; contains:
    ├── video_info: {...}                   # original video metadata
    ├── result: {...}                       # complete analysis results
    ├── cached_at: "2025-09-15T23:24:58"    # cache creation time
    └── algorithm_params: {...}             # processing parameters used
```
🔧 Manual Cache Control
```bash
# Force reprocessing (skip cache)
curl -X POST "http://localhost:8000/api/upload" \
  -F "file=@your_video.mp4" \
  -F "reuse_cache=false"

# View cache details
curl "http://localhost:8000/api/cache/list" | jq '.cached_analyses[]'
```
💡 Cache Indicators in Results
```json
{
  "processing_info": {
    "loaded_from_cache": true,
    "original_cache_date": "2025-09-15T23:24:58",
    "processing_time": 4.2
  }
}
```
```yaml
# Video processing parameters
video_processing:
  clip_analysis:
    change_threshold: 0.80     # CLIP embedding similarity threshold
    min_gap_sec: 1             # Minimum gap between representative frames
    max_gap_sec: 5             # Maximum gap between representative frames
  openai:
    model: "gpt-4o"            # OpenAI model for room analysis
    retry_attempts: 3          # Number of retry attempts for API calls
  post_processing:
    min_segment_duration: 2.0  # Minimum segment duration (seconds)

# Room type definitions
room_types:
  living_room: { label: "Living Room", color: "#28a745" }
  bedroom: { label: "Bedroom", color: "#6f42c1" }
  kitchen: { label: "Kitchen", color: "#fd7e14" }
  bathroom: { label: "Bathroom", color: "#17a2b8" }
  dining_room: { label: "Dining Room", color: "#ffc107" }
  hallway: { label: "Hallway", color: "#6c757d" }
  unknown: { label: "Unknown", color: "#6f42c1" }
```
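In code, the `room_types` table is naturally a dict keyed by room type, with `unknown` as the fallback for any unrecognized label. The snippet below mirrors the `config.yaml` values above; the `room_style` helper is hypothetical, illustrating how the frontend might map a segment to its display color:

```python
ROOM_TYPES = {
    "living_room": {"label": "Living Room", "color": "#28a745"},
    "bedroom": {"label": "Bedroom", "color": "#6f42c1"},
    "kitchen": {"label": "Kitchen", "color": "#fd7e14"},
    "bathroom": {"label": "Bathroom", "color": "#17a2b8"},
    "dining_room": {"label": "Dining Room", "color": "#ffc107"},
    "hallway": {"label": "Hallway", "color": "#6c757d"},
    "unknown": {"label": "Unknown", "color": "#6f42c1"},
}


def room_style(room_type):
    """Look up the display label/color, falling back to 'unknown'."""
    return ROOM_TYPES.get(room_type, ROOM_TYPES["unknown"])
```

Falling back to `unknown` means a new or misspelled room type degrades to a visible gray/purple segment rather than crashing the timeline rendering.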
# OpenAI configuration (recommended for best accuracy)
OPENAI_API_KEY=your_api_key_here
# Server configuration
HOST=0.0.0.0
PORT=8000
DEBUG=true
# Processing parameters
MAX_UPLOAD_SIZE=500MB
MAX_CONCURRENT_JOBS=3
1. Video input → extract frames at 1 fps → select the sharpest frame per second
2. All frames → CLIP embeddings → similarity analysis → select representative frames
3. Representative frames → GPT-4o analysis → room classification + same-room detection
4. AI decisions → merge same-room segments → final timeline segments
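Stage 2 can be sketched as a thresholded similarity walk over the per-second CLIP embeddings, governed by the `change_threshold`, `min_gap_sec`, and `max_gap_sec` parameters from `config.yaml`. Real embeddings come from an open-clip model; the toy vectors and function names here are illustrative assumptions, not the pipeline's actual code:

```python
def cosine_similarity(a, b):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)


def select_representatives(embeddings, change_threshold=0.80,
                           min_gap_sec=1, max_gap_sec=5):
    """Pick one representative frame index per visual scene.

    `embeddings[t]` is the CLIP vector of the sharpest frame at second t.
    A new representative is taken when similarity to the previous one drops
    below `change_threshold`, but never more often than every `min_gap_sec`
    seconds and at least every `max_gap_sec` seconds.
    """
    reps = [0]
    for t in range(1, len(embeddings)):
        gap = t - reps[-1]
        if gap < min_gap_sec:
            continue
        sim = cosine_similarity(embeddings[reps[-1]], embeddings[t])
        if sim < change_threshold or gap >= max_gap_sec:
            reps.append(t)
    return reps
```

The `max_gap_sec` cap guarantees GPT-4o still sees periodic frames during long static shots, while `min_gap_sec` keeps rapid camera shake from flooding the API with near-duplicate frames.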
When OpenAI cannot determine a room type, it returns `"unknown"`. Our system applies intelligent post-processing:

Strategy 1: High Confidence Inheritance
- If `same_room_confidence > 0.8` and `same_room = "same_room"`
- → Inherit the room type from the previous non-unknown segment
- Used when the AI is confident it's the same room but unsure of the type
Strategy 2: Spatial Context Analysis
- If unknown segment is surrounded by same room type
- → Infer as that room type (likely walking through same room)
- Example: `kitchen → unknown → kitchen` becomes `kitchen → kitchen → kitchen`
Strategy 3: Moderate Confidence Inheritance
- If `same_room_confidence > 0.6` and `same_room = "same_room"`
- → Inherit from the previous segment with a lower confidence threshold
Strategy 4: Duration-based Inference
- If unknown segment is very short (< 3 seconds)
- → Likely a brief transition, inherit from previous room
- Preservation: Original unknown classification is preserved in metadata
- Merging: Remaining unknowns merge into adjacent segments
- Fusion: Segments of same room type merge across unknown gaps
- Gap Filling: Small time gaps (≤2 seconds) are filled automatically
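The four strategies above amount to a priority-ordered pass over the segment list. The sketch below shows one way to express that ordering; the segment field names (`room_type`, `same_room`, `same_room_confidence`, `start_time`, `end_time`) are assumptions about the internal data model, not guaranteed to match `services/pipeline.py`:

```python
def resolve_unknowns(segments):
    """Resolve 'unknown' segments via the four strategies, in priority order.

    The original label is preserved in `original_room_type`, and the
    strategy that fired is recorded in `inference_reason`.
    """
    resolved = [dict(s) for s in segments]
    for i, seg in enumerate(resolved):
        if seg["room_type"] != "unknown":
            continue
        prev = resolved[i - 1]["room_type"] if i > 0 else None
        nxt = segments[i + 1]["room_type"] if i + 1 < len(segments) else None
        conf = seg.get("same_room_confidence", 0.0)
        same = seg.get("same_room") == "same_room"
        duration = seg["end_time"] - seg["start_time"]

        inferred = reason = None
        if same and conf > 0.8 and prev and prev != "unknown":
            inferred, reason = prev, "high_same_room_confidence"
        elif prev and prev == nxt and prev != "unknown":
            inferred, reason = prev, "spatial_context"
        elif same and conf > 0.6 and prev and prev != "unknown":
            inferred, reason = prev, "moderate_same_room_confidence"
        elif duration < 3.0 and prev and prev != "unknown":
            inferred, reason = prev, "short_transition"

        if inferred:
            seg["original_room_type"] = "unknown"
            seg["room_type"] = inferred
            seg["inference_reason"] = reason
    return resolved
```

Because earlier segments are resolved first, an inherited label can cascade forward through a run of consecutive unknowns.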
All processing is transparent and reversible:
```json
{
  "room_type": "kitchen",
  "original_room_type": "unknown",
  "inference_reason": "high_same_room_confidence",
  "unknown_regions": [
    {
      "start_time": 45.2,
      "end_time": 47.8,
      "original_room_type": "unknown",
      "evidence": ["ambiguous lighting", "transitional view"]
    }
  ]
}
```
This ensures maximum usable output while maintaining complete transparency about AI uncertainty.
- No CLIP: Uses color histogram similarity
- No OpenAI: Generates realistic dummy results for testing
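The source states that without CLIP the pipeline falls back to color-histogram similarity. The exact binning and metric are unspecified, so the sketch below is only one plausible variant: 8 bins per RGB channel compared with histogram intersection, where 1.0 means identical color distributions:

```python
def color_histogram(pixels, bins=8):
    """Normalised per-channel histogram of a list of (r, g, b) pixels."""
    hist = [0.0] * (bins * 3)
    step = 256 // bins
    for r, g, b in pixels:
        hist[r // step] += 1
        hist[bins + g // step] += 1
        hist[2 * bins + b // step] += 1
    total = len(pixels) * 3
    return [h / total for h in hist]


def histogram_similarity(h1, h2):
    """Histogram intersection: sums the overlap in each bin (max 1.0)."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

This is far cruder than CLIP (it ignores spatial layout entirely), which is consistent with the lower accuracy reported for fallback mode below.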
```bash
# Basic functionality test
python -c "from services.pipeline import VideoPipeline; print('Pipeline loaded successfully')"

# Test with a sample video
curl -X POST "http://localhost:8000/api/upload" \
  -F "file=@sample_video.mp4"
```
- With OpenAI: 10-minute video ≈ 2-3 minutes processing time
- Cached results: 10-minute video ≈ 5 seconds loading time
- Without OpenAI: 10-minute video ≈ 30 seconds processing time
- With CLIP + OpenAI: ~90% room classification accuracy
- Fallback mode: ~70% room classification accuracy
- Base memory: 1GB (with CLIP model loaded)
- Processing peak: 3GB
- Cache storage: ~50KB per video analysis
- CLIP Model Loading Failed

```bash
pip install torch torchvision
pip install open-clip-torch
```

- OpenAI API Errors
  - Verify the API key is set: `echo $OPENAI_API_KEY`
  - Check your API quota and rate limits
  - The system will fall back to dummy results

- Out of Memory
  - Reduce the video resolution before upload
  - Process shorter video segments
  - Clear the cache: `curl -X DELETE http://localhost:8000/api/cache/clear`
```bash
# View application logs
tail -f .logs/app.log

# Check for specific errors
grep -i "error\|failed" .logs/app.log
```
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
RUN pip install open-clip-torch openai pillow
RUN pip install -e .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```bash
OPENAI_API_KEY=your_production_key
DEBUG=false
MAX_CONCURRENT_JOBS=5
```
```bash
# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run code formatting
black .
isort .
```
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI CLIP - Vision understanding
- OpenAI GPT-4o - Multimodal AI analysis
- FastAPI - Modern web framework
- OpenCV - Computer vision library
If this project helps you, please give it a ⭐️ for support!