# REWIND

REWIND is an advanced video memory exploration platform that transforms traditional video playback into an immersive 3D experience. By leveraging state-of-the-art AI technologies, REWIND enables users to navigate through their video memories in a spatial environment while providing intelligent scene analysis and multilingual narration capabilities through VoiceBridge™.
## Table of Contents

- Overview
- Key Features
- Architecture
- Technology Stack
- Getting Started
- Development
- API Documentation
- VoiceBridge Integration
- Project Structure
- Deployment
- Testing
- Performance Optimization
- Contributing
- Team
- License
- Acknowledgments
## Overview

REWIND addresses the fundamental challenge of making video content more accessible, searchable, and emotionally resonant across language barriers. The platform combines cutting-edge computer vision, natural language processing, and 3D rendering technologies to create a video exploration experience that transcends the limits of traditional playback.
Traditional video content faces three primary limitations:
- Linear Navigation: Videos can only be experienced sequentially, so retrieving a specific moment is time-consuming
- Language Barriers: Content accessibility is limited to speakers of the source language
- Lack of Context: Understanding complex scenes requires repeated viewing and manual annotation
REWIND provides a comprehensive solution through:
- Spatial video exploration using depth-based 3D reconstruction
- AI-powered scene understanding with automatic object and action recognition
- Multilingual narration that preserves the emotional connection of the original speaker's voice
## Key Features

### 3D Memory Viewer

- Monocular Depth Estimation: Utilizes MiDaS and DPT (Dense Prediction Transformer) models to generate accurate depth maps from single video frames
- Point Cloud Generation: Converts depth information into navigable 3D point clouds using Open3D (a minimal sketch follows this feature list)
- Interactive Camera Controls: Provides orbital navigation, zoom, and fly-through capabilities
- Real-time Rendering: Achieves 60fps performance using Three.js WebGL optimization
- Temporal Morphing: Smooth transitions between video frames in 3D space
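To make the 3D reconstruction concrete, here is a minimal sketch of the depth-to-point-cloud step. It assumes the public MiDaS weights from `torch.hub`, a simple pinhole back-projection, and an arbitrary depth scale; the actual pipeline in `depth-processing/` is more involved.

```python
# Minimal sketch: single-frame depth estimation -> point cloud.
# Assumes MiDaS via torch.hub and a pinhole back-projection; the real
# pipeline in depth-processing/ may differ in detail.
import cv2
import numpy as np
import open3d as o3d
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

frame = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    inv_depth = midas(transform(frame)).squeeze().numpy()  # relative inverse depth

h, w = frame.shape[:2]
inv_depth = cv2.resize(inv_depth, (w, h))
inv_depth = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-6)
z = 1.0 / (inv_depth + 0.1)  # arbitrary scale: MiDaS depth is relative

# Back-project each pixel through an assumed pinhole camera with focal length f.
f = 0.8 * w
u, v = np.meshgrid(np.arange(w), np.arange(h))
x, y = (u - w / 2) * z / f, (v - h / 2) * z / f

cloud = o3d.geometry.PointCloud()
cloud.points = o3d.utility.Vector3dVector(np.stack([x, y, z], -1).reshape(-1, 3))
cloud.colors = o3d.utility.Vector3dVector(frame.reshape(-1, 3) / 255.0)
o3d.io.write_point_cloud("frame.ply", cloud)
```

Because MiDaS predicts relative inverse depth, the absolute scale of the cloud is arbitrary; the viewer only needs consistent geometry, not metric depth.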
### AI Scene Analysis

- Scene Segmentation: Automatic detection and classification of distinct scenes using the TwelveLabs API
- Object Detection: Real-time identification of objects, people, and animals with spatial coordinates
- Action Recognition: Classification of activities and events within video sequences
- Transcript Generation: Automatic speech-to-text conversion with timestamp alignment (a small alignment sketch follows this list)
- Contextual Understanding: Semantic analysis of scene relationships and narrative flow
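As a toy illustration of the timestamp-alignment step, the hypothetical helper below assigns transcript segments to the scene whose time range contains them. The dict shapes loosely mirror the analysis response documented later, but they are assumptions, not the platform's actual schema.

```python
# Hypothetical helper: attach each transcript segment to the scene whose
# time range contains the segment midpoint. Field names are assumptions.
def align_transcript(scenes: list[dict], segments: list[dict]) -> list[dict]:
    for scene in scenes:
        scene["transcript"] = [
            seg["text"]
            for seg in segments
            if scene["start_time"] <= (seg["start"] + seg["end"]) / 2 < scene["end_time"]
        ]
    return scenes

scenes = [{"scene_id": "scene-1", "start_time": 0.0, "end_time": 15.2}]
segments = [{"start": 1.0, "end": 3.5, "text": "Happy birthday to you..."}]
print(align_transcript(scenes, segments))
```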
### VoiceBridge™ Multilingual Narration

VoiceBridge™ represents a novel approach to multilingual content accessibility by combining voice cloning with real-time translation:
- Voice Cloning: One-time setup using 30-second audio samples with ElevenLabs voice synthesis
- Multilingual Support: Generate narration in 29+ languages while maintaining voice characteristics
- On-Demand Generation: Asynchronous audio synthesis triggered by user interaction
- Context-Aware Descriptions: AI-generated scene narrations using Google Gemini
- Emotional Preservation: Maintains prosody and intonation patterns across language translations
### Smart Search & Navigation

- Natural Language Queries: Search video content using conversational language
- Object-Based Navigation: Click on detected objects to jump to relevant scenes
- Temporal Filtering: Filter content by time ranges, people, or actions
- Semantic Similarity: Find related scenes based on content understanding (illustrated in the sketch below)
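In principle, semantic similarity search reduces to nearest-neighbor search over scene embeddings. The sketch below shows the idea with plain NumPy and random vectors; in REWIND itself, scene understanding and retrieval run through the TwelveLabs API.

```python
# Sketch: rank scenes by cosine similarity to a query embedding.
# Assumes embeddings already exist; this is not the production retrieval path.
import numpy as np

def top_k_scenes(query_vec: np.ndarray, scene_vecs: np.ndarray, k: int = 5):
    sims = scene_vecs @ query_vec / (
        np.linalg.norm(scene_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
indices, scores = top_k_scenes(rng.normal(size=64), rng.normal(size=(100, 64)))
print(indices, scores)
```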
## Architecture

REWIND follows a microservices-inspired architecture with clear separation between frontend presentation, backend processing, and depth computation pipelines.
```
┌─────────────────────────────────────────────────────────────┐
│                         Client Layer                        │
│   ┌─────────────────────────────────────────────────────┐   │
│   │               React Frontend (Vite)                 │   │
│   │   - Three.js 3D Rendering                           │   │
│   │   - Video Upload Interface                          │   │
│   │   - VoiceBridge™ Controls                           │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               │ HTTPS/WebSocket
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       API Gateway Layer                     │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                  FastAPI Backend                    │   │
│   │   - RESTful API Endpoints                           │   │
│   │   - Request Validation                              │   │
│   │   - Authentication & Authorization                  │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Processing Layer                      │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │    Video     │   │  AI Analysis │   │    Depth     │    │
│   │  Processor   │   │   Service    │   │  Estimator   │    │
│   │   (FFmpeg)   │   │ (TwelveLabs) │   │ (MiDaS/DPT)  │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │    Gemini    │   │  ElevenLabs  │   │ Point Cloud  │    │
│   │   Service    │   │   Service    │   │  Generator   │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                        │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │   Firebase   │   │  Firestore   │   │    Cloud     │    │
│   │   Storage    │   │   Database   │   │   Storage    │    │
│   │   (Videos)   │   │  (Metadata)  │   │ (Artifacts)  │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
└─────────────────────────────────────────────────────────────┘
```
### Data Flow

1. Video Upload: User uploads a video through the React frontend
2. Frame Extraction: FFmpeg extracts frames at 2 fps and pulls the audio track (steps 1-2 are sketched after this list)
3. Parallel Processing:
   - Depth maps generated using MiDaS/DPT
   - Video analyzed by TwelveLabs for scene understanding
   - Audio transcribed and aligned with timestamps
4. AI Enhancement: Gemini generates natural language descriptions
5. 3D Reconstruction: Point clouds created from depth maps
6. User Interaction: Clicking a detected object triggers VoiceBridge™ narration
7. Voice Synthesis: ElevenLabs generates audio in the user's cloned voice
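A stripped-down sketch of steps 1-2: the route shape matches the `POST /api/upload` endpoint documented below, and the FFmpeg invocation reflects the 2 fps setting from the backend `.env`, but the helper names and paths are illustrative rather than the actual backend code.

```python
# Simplified sketch of the upload + frame-extraction flow (steps 1-2).
# Helper names and paths are assumptions, not the real backend code.
import os
import subprocess
import uuid

from fastapi import BackgroundTasks, FastAPI, File, Form, UploadFile

app = FastAPI()

def extract_frames(video_path: str, out_dir: str, fps: int = 2) -> None:
    """Extract frames at a fixed rate (FRAME_EXTRACTION_FPS=2 in .env)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/%05d.jpg"],
        check=True,
    )

@app.post("/api/upload")
async def upload(
    background: BackgroundTasks,
    file: UploadFile = File(...),
    user_id: str = Form(...),
):
    video_id = str(uuid.uuid4())
    os.makedirs("/tmp/rewind", exist_ok=True)
    path = f"/tmp/rewind/{video_id}.mp4"
    with open(path, "wb") as out:
        out.write(await file.read())
    # Kick off processing without blocking the response.
    background.add_task(extract_frames, path, f"/tmp/rewind/{video_id}")
    return {"video_id": video_id, "status": "processing"}
```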
## Technology Stack

### Frontend

| Technology | Version | Purpose |
|---|---|---|
| React | 18.2+ | UI framework and component architecture |
| Vite | 5.0+ | Build tool and development server |
| Three.js | r160+ | WebGL 3D rendering engine |
| @react-three/fiber | 8.15+ | React renderer for Three.js |
| @react-three/drei | 9.92+ | Three.js helpers and controls |
| Tailwind CSS | 3.4+ | Utility-first CSS framework |
| Lucide React | 0.300+ | Icon library |
| Firebase SDK | 10.7+ | Client-side Firebase integration |
### Backend

| Technology | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Backend programming language |
| FastAPI | 0.104+ | High-performance API framework |
| Uvicorn | 0.25+ | ASGI server implementation |
| Pydantic | 2.5+ | Data validation and settings management |
| Firebase Admin | 6.3+ | Server-side Firebase integration |
| FFmpeg | 6.0+ | Video and audio processing |
| Python Multipart | 0.0.6+ | Multipart form data handling |
### AI & Machine Learning

| Technology | Version | Purpose |
|---|---|---|
| TwelveLabs API | Latest | Video understanding and scene analysis |
| Google Gemini | 1.5 Pro | Natural language generation and translation |
| ElevenLabs API | Latest | Voice cloning and text-to-speech synthesis |
| MiDaS | v3.1 | Monocular depth estimation |
| DPT | Latest | Dense prediction transformers for depth |
| PyTorch | 2.1+ | Deep learning framework |
| Open3D | 0.18+ | 3D data processing |
| OpenCV | 4.8+ | Computer vision operations |
### Infrastructure

| Technology | Purpose |
|---|---|
| Firebase Storage | Video and audio file storage with CDN |
| Firestore | NoSQL database for metadata and user data |
| Firebase Authentication | User identity and access management |
| Vercel | Frontend hosting and CDN |
| Railway/Render | Backend API hosting |
| Docker | Containerization for consistent deployment |
## Getting Started

### Prerequisites

Before installation, ensure you have the following installed:
- Node.js 18.0 or higher ([Download](https://nodejs.org/))
- Python 3.10 or higher ([Download](https://www.python.org/downloads/))
- FFmpeg 6.0 or higher ([Installation Guide](https://ffmpeg.org/download.html))
- Git ([Download](https://git-scm.com/downloads))
- CUDA Toolkit 11.8+ (Optional, for GPU acceleration)
### Installation

#### Clone the Repository

```bash
git clone https://github.com/Ohimoiza1205/Rewind.git
cd Rewind
```

#### Backend Setup

```bash
# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Return to root directory
cd ..
```

#### Depth Processing Setup

```bash
# Navigate to depth-processing directory
cd depth-processing

# Install dependencies
pip install -r requirements.txt

# Download MiDaS models
python scripts/setup_midas.py

# Return to root directory
cd ..
```

#### Frontend Setup

```bash
# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Return to root directory
cd ..
```

### Environment Configuration

Create a `.env` file in the backend directory:
```
# API Keys
TWELVELABS_API_KEY=your_twelvelabs_api_key
GEMINI_API_KEY=your_gemini_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key

# Firebase Configuration
FIREBASE_PROJECT_ID=your_project_id
FIREBASE_PRIVATE_KEY=your_private_key
FIREBASE_CLIENT_EMAIL=your_client_email
FIREBASE_STORAGE_BUCKET=your_storage_bucket

# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=True
CORS_ORIGINS=http://localhost:5173,http://localhost:3000

# Processing Configuration
MAX_VIDEO_SIZE_MB=500
FRAME_EXTRACTION_FPS=2
MAX_CONCURRENT_UPLOADS=5
TEMP_STORAGE_PATH=/tmp/rewind
```

Create a `.env` file in the frontend directory:
```
# API Configuration
VITE_API_BASE_URL=http://localhost:8000
VITE_WS_URL=ws://localhost:8000/ws

# Firebase Configuration
VITE_FIREBASE_API_KEY=your_api_key
VITE_FIREBASE_AUTH_DOMAIN=your_auth_domain
VITE_FIREBASE_PROJECT_ID=your_project_id
VITE_FIREBASE_STORAGE_BUCKET=your_storage_bucket
VITE_FIREBASE_MESSAGING_SENDER_ID=your_sender_id
VITE_FIREBASE_APP_ID=your_app_id

# Feature Flags
VITE_ENABLE_VOICE_CLONING=true
VITE_ENABLE_3D_VIEWER=true
VITE_ENABLE_ANALYTICS=false
```

### API Keys

- TwelveLabs API: Sign up at twelvelabs.io
- Google Gemini: Get API key from Google AI Studio
- ElevenLabs: Register at elevenlabs.io
- Firebase: Create project at Firebase Console
## Development

### Backend Development

Start the development server:

```bash
cd backend
source venv/bin/activate  # On Windows: venv\Scripts\activate
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000. Interactive API documentation can be accessed at http://localhost:8000/docs.
Run the backend test suite with coverage:

```bash
cd backend
pytest tests/ -v --cov=app --cov-report=html
```

Code quality checks:

```bash
# Format code with Black
black app/ tests/

# Sort imports with isort
isort app/ tests/

# Lint with flake8
flake8 app/ tests/

# Type checking with mypy
mypy app/
```

### Frontend Development

Start the development server:

```bash
cd frontend
npm run dev
```

The application will be available at http://localhost:5173.

Build for production:

```bash
cd frontend
npm run build
```

Run tests:

```bash
cd frontend
npm test
```

Code quality checks:

```bash
# Lint with ESLint
npm run lint

# Format with Prettier
npm run format
```

### Depth Processing

Generate depth maps for a single video:

```bash
cd depth-processing
python scripts/generate_depth_maps.py --input path/to/video.mp4 --output output/
```

Batch-process a directory of test videos:

```bash
cd depth-processing
python scripts/batch_process.py --input-dir test_videos/ --output-dir output/
```

## API Documentation

### Upload Video

```http
POST /api/upload
Content-Type: multipart/form-data
```
Parameters:
- file: Video file (max 500MB)
- user_id: User identifier
Response:
```json
{
  "video_id": "uuid-string",
  "status": "processing",
  "upload_url": "https://storage.url/video.mp4"
}
```

### Get Analysis Results

```http
GET /api/analysis/{video_id}
```
Response:
```json
{
  "video_id": "uuid-string",
  "status": "completed",
  "duration": 120.5,
  "scenes": [
    {
      "scene_id": "scene-1",
      "start_time": 0.0,
      "end_time": 15.2,
      "objects": ["person", "cake", "candles"],
      "description": "Birthday celebration scene",
      "confidence": 0.95
    }
  ],
  "transcript": "Full video transcript...",
  "metadata": {...}
}
```

### Generate Narration

```http
POST /api/narration/generate
```
Body:
```json
{
  "scene_id": "scene-1",
  "target_language": "es",
  "user_id": "user-123"
}
```

Response:

```json
{
  "audio_url": "https://storage.url/narration.mp3",
  "text": "Translated description",
  "language": "es",
  "duration": 5.2
}
```

### Clone Voice

```http
POST /api/voice-setup/clone
Content-Type: multipart/form-data
```
Parameters:
- audio_file: Audio sample (30 seconds minimum)
- user_id: User identifier
- voice_name: Display name for voice
Response:
```json
{
  "voice_id": "elevenlabs-voice-id",
  "voice_name": "User Voice",
  "status": "ready"
}
```

For complete API documentation, visit `/docs` when running the development server.
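As a usage example, the documented endpoints can be exercised from Python with `requests` (base URL, file names, and IDs are placeholders):

```python
# Sketch: calling the documented endpoints end to end.
# Base URL, file path, and user ID are placeholders.
import requests

BASE = "http://localhost:8000"

# 1. Upload a video.
with open("birthday.mp4", "rb") as f:
    video = requests.post(
        f"{BASE}/api/upload",
        files={"file": f},
        data={"user_id": "user-123"},
    ).json()

# 2. Fetch analysis results once processing completes.
analysis = requests.get(f"{BASE}/api/analysis/{video['video_id']}").json()

# 3. Request Spanish narration for the first scene.
narration = requests.post(
    f"{BASE}/api/narration/generate",
    json={
        "scene_id": analysis["scenes"][0]["scene_id"],
        "target_language": "es",
        "user_id": "user-123",
    },
).json()
print(narration["audio_url"])
```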
## VoiceBridge™ Integration

VoiceBridge™ is the multilingual narration system that enables users to hear scene descriptions in their own voice across 29+ languages.
```
┌──────────────────────────────────────────────────────┐
│                  User Interaction                    │
│          "Narrate this scene in Spanish"             │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│             Scene Description (Gemini)               │
│   "Here's Emma blowing out the candles on her        │
│    fifth birthday cake, surrounded by family"        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                Translation (Gemini)                  │
│   "Aquí está Emma soplando las velas de su pastel    │
│    de quinto cumpleaños, rodeada de familia"         │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│            Voice Synthesis (ElevenLabs)              │
│        Generates audio in user's cloned voice        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                   Audio Playback                     │
│        User hears their voice speaking Spanish       │
└──────────────────────────────────────────────────────┘
```
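In code, the pipeline condenses to roughly the sketch below. It assumes the `google-generativeai` SDK for translation and the public ElevenLabs text-to-speech REST endpoint; the model IDs, prompt, and lack of error handling are illustrative, not the production implementation.

```python
# Condensed VoiceBridge-style sketch: translate with Gemini, then
# synthesize with the ElevenLabs text-to-speech REST endpoint.
# Model names, prompt wording, and VOICE_ID handling are illustrative.
import os

import google.generativeai as genai
import requests

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

description = "Here's Emma blowing out the candles on her fifth birthday cake."
translated = model.generate_content(
    f"Translate into Spanish, keeping a warm narration tone:\n{description}"
).text

# VOICE_ID is the cloned voice created during one-time setup.
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{os.environ['VOICE_ID']}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": translated, "model_id": "eleven_multilingual_v2"},
)
with open("narration.mp3", "wb") as f:
    f.write(resp.content)
```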
### Supported Languages
Arabic, Bengali, Chinese (Mandarin), Czech, Danish, Dutch, English, Filipino, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba
---
## Project Structure
```
rewind/
├── backend/                 # FastAPI backend application
│   ├── app/
│   │   ├── api/             # API routes and endpoints
│   │   ├── services/        # Business logic and external integrations
│   │   ├── models/          # Data models and schemas
│   │   └── utils/           # Utility functions and helpers
│   ├── tests/               # Backend tests
│   └── requirements.txt     # Python dependencies
│
├── frontend/                # React frontend application
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── hooks/           # Custom React hooks
│   │   ├── services/        # API clients and external services
│   │   └── utils/           # Frontend utilities
│   ├── public/              # Static assets
│   └── package.json         # Node.js dependencies
│
├── depth-processing/        # Depth estimation pipeline
│   ├── scripts/             # Processing scripts
│   ├── src/                 # Core depth estimation logic
│   └── models/              # Pre-trained model weights
│
├── docs/                    # Documentation
├── deploy/                  # Deployment configurations
└── README.md                # This file
```
For detailed architecture documentation, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
---
## Deployment
### Production Deployment
#### Frontend (Vercel)
```bash
cd frontend
# Install Vercel CLI
npm i -g vercel
# Deploy
vercel --prod
```

#### Backend (Railway)

```bash
cd backend

# Install Railway CLI
npm i -g @railway/cli

# Login and initialize
railway login
railway init

# Deploy
railway up
```

Ensure all production environment variables are configured in your deployment platform's dashboard.
#### Docker Deployment

```bash
# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

## Testing

### Backend Tests

```bash
cd backend
# Run all tests
pytest
# Run with coverage
pytest --cov=app --cov-report=html
# Run specific test file
pytest tests/test_elevenlabs.py -v
```

### Frontend Tests

```bash
cd frontend
# Run unit tests
npm test
# Run tests in watch mode
npm test -- --watch
# Generate coverage report
npm test -- --coverage
```

### End-to-End Tests

```bash
# Run end-to-end tests
npm run test:e2e
```

## Performance Optimization

### Backend

- Async Processing: All I/O operations use async/await for non-blocking execution
- Request Batching: Multiple scene analyses batched into single API calls
- Caching: Redis caching for frequently accessed scene data and narrations (a minimal cache sketch follows this list)
- Database Indexing: Firestore indexes on user_id, video_id, and timestamp fields
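A minimal sketch of the narration cache, assuming a redis-py client and a hypothetical `synthesize` callable; the key scheme and TTL are illustrative:

```python
# Sketch of the narration cache: key on (scene, language), fall back to
# synthesis on a miss. Key scheme, TTL, and synthesize() are assumptions.
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_narration(scene_id: str, lang: str, synthesize) -> bytes:
    key = f"narration:{scene_id}:{lang}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: skip the expensive TTS call
    audio = synthesize(scene_id, lang)  # e.g. the ElevenLabs call
    cache.setex(key, 24 * 3600, audio)  # 24 h TTL
    return audio
```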
### Frontend

- Code Splitting: Dynamic imports for route-based code splitting
- Asset Optimization: Image compression and lazy loading
- Three.js Optimization: Level-of-detail (LOD) rendering for point clouds
- Memoization: React.memo and useMemo for expensive computations
### Depth Processing

- GPU Acceleration: CUDA support for MiDaS inference (10x speedup)
- Frame Sampling: Sample frames at 2 fps instead of the full frame rate to reduce computation (sketched after this list)
- Model Selection: DPT-Hybrid for accuracy vs. MiDaS-small for speed trade-off
- Batch Processing: Process multiple frames in parallel
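The frame-sampling idea in isolation, shown with OpenCV for clarity (the production pipeline does this with FFmpeg's `fps` filter):

```python
# Sketch of frame sampling: keep ~2 fps regardless of source frame rate.
# The real pipeline uses FFmpeg; this OpenCV version just shows the idea.
import cv2

def sample_frames(video_path: str, target_fps: float = 2.0):
    cap = cv2.VideoCapture(video_path)
    # Keep one frame every `stride` frames to approximate target_fps.
    stride = max(1, round(cap.get(cv2.CAP_PROP_FPS) / target_fps))
    index, frames = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```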
## Contributing

We welcome contributions to REWIND. Please follow these guidelines:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Write or update tests
5. Ensure all tests pass
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
### Python Code Style

- Follow the PEP 8 style guide
- Use type hints for all function signatures
- Maximum line length: 88 characters (Black default)
- Docstrings required for all public functions and classes
### JavaScript Code Style

- Follow the Airbnb JavaScript Style Guide
- Use functional components with hooks
- Prefer const over let, never use var
- Use meaningful variable and function names
### Commit Message Convention

```
type(scope): subject

body

footer
```

Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

Example:

```
feat(narration): add support for Yoruba language

- Implemented Yoruba translation in Gemini service
- Added Yoruba language option to frontend selector
- Updated language constants and documentation

Closes #123
```
## Team

Ohinoyi Moiza - Frontend & Voice Engineering Lead
Responsible for React frontend architecture, Three.js 3D rendering, and VoiceBridge™ user interface implementation.
- GitHub: @Ohimoiza1205
- LinkedIn: Ohinoyi Moiza
Peace Enesi - 3D & Depth Processing Lead
Responsible for monocular depth estimation pipeline, point cloud generation, and 3D scene reconstruction.
- GitHub: @AhuoyizaEnesi
- LinkedIn: Peace Enesi
Joanna Chimalilo - AI & Backend Engineering Lead
Responsible for FastAPI backend architecture, AI service integration (TwelveLabs, Gemini, ElevenLabs), and VoiceBridge™ narration system.
- GitHub: @Jouujo
- LinkedIn: Joanna Chimalilo
## License

This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 REWIND Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
## Acknowledgments

- TwelveLabs for providing advanced video understanding capabilities
- Google Gemini for natural language generation and translation
- ElevenLabs for state-of-the-art voice cloning and synthesis
- Three.js Community for the powerful 3D rendering framework
- FastAPI Team for the high-performance Python web framework
- React Team for the declarative UI framework
- Ranftl, R., et al. (2021). "Vision Transformers for Dense Prediction" - DPT Architecture
- Ranftl, R., et al. (2020). "Towards Robust Monocular Depth Estimation" - MiDaS
- Casper, J., et al. (2022). "ElevenLabs: High Quality Text to Speech"
- MiDaS - Intel Intelligent Systems Lab
- Open3D - Intel Labs and Stanford University
- FFmpeg - FFmpeg team
## Contact

For questions, issues, or collaboration opportunities:
- Project Repository: github.com/Ohimoiza1205/Rewind
- Issue Tracker: github.com/Ohimoiza1205/Rewind/issues
- Email: Contact any team member via their LinkedIn profiles
## Roadmap

- Real-time collaborative viewing
- Mobile application (iOS/Android)
- Advanced scene editing capabilities
- Integration with popular video platforms
- VR/AR support for immersive viewing
- AI-powered video summarization
- Multi-speaker voice cloning
- Enhanced privacy controls
- Live streaming support with real-time processing
- Professional video editing suite
- Team collaboration features
- Enterprise deployment options
---

Built with passion by the REWIND team. Transform how you experience video memories.