
# REWIND

REWIND is an advanced video memory exploration platform that transforms traditional video playback into an immersive 3D experience. By leveraging state-of-the-art AI technologies, REWIND enables users to navigate through their video memories in a spatial environment while providing intelligent scene analysis and multilingual narration capabilities through VoiceBridge™.


## Overview

REWIND addresses the fundamental challenge of making video content more accessible, searchable, and emotionally resonant across language barriers. The platform combines computer vision, natural language processing, and 3D rendering to create a video exploration experience that moves beyond the limits of traditional playback.

### Problem Statement

Traditional video content faces three primary limitations:

  1. Linear Navigation: Videos can only be experienced sequentially, making specific moment retrieval time-consuming
  2. Language Barriers: Content accessibility is limited to speakers of the source language
  3. Lack of Context: Understanding complex scenes requires repeated viewing and manual annotation

### Solution

REWIND provides a comprehensive solution through:

  • Spatial video exploration using depth-based 3D reconstruction
  • AI-powered scene understanding with automatic object and action recognition
  • Multilingual narration that preserves the emotional connection of the original speaker's voice

## Key Features

### 3D Spatial Video Rendering

  • Monocular Depth Estimation: Utilizes MiDaS and DPT (Dense Prediction Transformer) models to generate accurate depth maps from single video frames
  • Point Cloud Generation: Converts depth information into navigable 3D point clouds using Open3D (see the sketch after this list)
  • Interactive Camera Controls: Provides orbital navigation, zoom, and fly-through capabilities
  • Real-time Rendering: Achieves 60fps performance using Three.js WebGL optimization
  • Temporal Morphing: Smooth transitions between video frames in 3D space
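
A minimal sketch of the depth-to-point-cloud step, assuming MiDaS weights pulled from torch.hub and a hypothetical, uncalibrated pinhole back-projection; the repository's actual pipeline lives in `depth-processing/` and may differ:

```python
import cv2
import numpy as np
import open3d as o3d
import torch

# Pull the DPT-Hybrid MiDaS model and its matching preprocessing from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

frame = cv2.cvtColor(cv2.imread("frame_00001.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(frame))  # relative inverse depth, [1, H', W']
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=frame.shape[:2], mode="bicubic"
    ).squeeze().numpy()

# Back-project pixels with assumed intrinsics; MiDaS depth is only relative,
# so the resulting cloud is correct up to scale.
h, w = pred.shape
fx = fy = 0.8 * w  # hypothetical focal length, not calibrated
z = 1.0 / np.maximum(pred, 1e-6)
z = z / z.max()  # normalize for viewing
u, v = np.meshgrid(np.arange(w), np.arange(h))
xyz = np.stack([(u - w / 2) * z / fx, (v - h / 2) * z / fy, z], axis=-1)

cloud = o3d.geometry.PointCloud()
cloud.points = o3d.utility.Vector3dVector(xyz.reshape(-1, 3))
cloud.colors = o3d.utility.Vector3dVector(frame.reshape(-1, 3) / 255.0)
o3d.io.write_point_cloud("frame_00001.ply", cloud)
```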

### AI-Powered Video Analysis

  • Scene Segmentation: Automatic detection and classification of distinct scenes using TwelveLabs API
  • Object Detection: Real-time identification of objects, people, and animals with spatial coordinates
  • Action Recognition: Classification of activities and events within video sequences
  • Transcript Generation: Automatic speech-to-text conversion with timestamp alignment
  • Contextual Understanding: Semantic analysis of scene relationships and narrative flow

### VoiceBridge™ Narration System

VoiceBridge™ represents a novel approach to multilingual content accessibility by combining voice cloning with real-time translation:

  • Voice Cloning: One-time setup using 30-second audio samples with ElevenLabs voice synthesis (see the sketch after this list)
  • Multilingual Support: Generate narration in 29+ languages while maintaining voice characteristics
  • On-Demand Generation: Asynchronous audio synthesis triggered by user interaction
  • Context-Aware Descriptions: AI-generated scene narrations using Google Gemini
  • Emotional Preservation: Maintains prosody and intonation patterns across language translations
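
As a concrete illustration of the one-time cloning step, here is a hedged sketch against ElevenLabs' public voice-add REST endpoint; the endpoint shape is an assumption based on ElevenLabs' documented API, not code from this repository:

```python
import os
import requests

def clone_voice(sample_path: str, voice_name: str) -> str:
    """Register a ~30-second audio sample as a cloned voice; returns its voice_id."""
    with open(sample_path, "rb") as sample:
        response = requests.post(
            "https://api.elevenlabs.io/v1/voices/add",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            data={"name": voice_name},
            files={"files": sample},
        )
    response.raise_for_status()
    return response.json()["voice_id"]
```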

### Intelligent Search and Discovery

  • Natural Language Queries: Search video content using conversational language
  • Object-Based Navigation: Click on detected objects to jump to relevant scenes
  • Temporal Filtering: Filter content by time ranges, people, or actions
  • Semantic Similarity: Find related scenes based on content understanding (see the sketch below)
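
One way such similarity ranking can be implemented, sketched here with hypothetical precomputed scene-description embeddings (the project's actual search may rely on TwelveLabs directly), is cosine similarity:

```python
import numpy as np

def top_k_scenes(query_vec: np.ndarray, scene_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank scenes by cosine similarity between a query embedding and a
    [num_scenes, dim] matrix of scene-description embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    s = scene_vecs / np.linalg.norm(scene_vecs, axis=1, keepdims=True)
    return np.argsort(s @ q)[::-1][:k]  # indices of the k most similar scenes
```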

## Architecture

REWIND follows a microservices-inspired architecture with clear separation between frontend presentation, backend processing, and depth computation pipelines.

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                          │
│  ┌────────────────────────────────────────────────────┐    │
│  │  React Frontend (Vite)                              │    │
│  │  - Three.js 3D Rendering                            │    │
│  │  - Video Upload Interface                           │    │
│  │  - VoiceBridge™ Controls                            │    │
│  └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                           │
                           │ HTTPS/WebSocket
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                     API Gateway Layer                        │
│  ┌────────────────────────────────────────────────────┐    │
│  │  FastAPI Backend                                    │    │
│  │  - RESTful API Endpoints                            │    │
│  │  - Request Validation                               │    │
│  │  - Authentication & Authorization                   │    │
│  └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Processing Layer                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Video        │  │ AI Analysis  │  │ Depth        │     │
│  │ Processor    │  │ Service      │  │ Estimator    │     │
│  │ (FFmpeg)     │  │ (TwelveLabs) │  │ (MiDaS/DPT)  │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Gemini       │  │ ElevenLabs   │  │ Point Cloud  │     │
│  │ Service      │  │ Service      │  │ Generator    │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                      Storage Layer                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Firebase     │  │ Firestore    │  │ Cloud        │     │
│  │ Storage      │  │ Database     │  │ Storage      │     │
│  │ (Videos)     │  │ (Metadata)   │  │ (Artifacts)  │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

### Data Flow

  1. Video Upload: User uploads video through React frontend
  2. Frame Extraction: FFmpeg extracts frames at 2 fps and the audio track (see the sketch after this list)
  3. Parallel Processing:
    • Depth maps generated using MiDaS/DPT
    • Video analyzed by TwelveLabs for scene understanding
    • Audio transcribed and aligned with timestamps
  4. AI Enhancement: Gemini generates natural language descriptions
  5. 3D Reconstruction: Point clouds created from depth maps
  6. User Interaction: Click on objects triggers VoiceBridge™ narration
  7. Voice Synthesis: ElevenLabs generates audio in user's cloned voice
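
Step 2 can be reproduced with plain FFmpeg invocations; a minimal sketch (the repository's video processor may wrap this differently):

```python
import subprocess

def extract_frames_and_audio(video: str, out_dir: str, fps: int = 2) -> None:
    """Extract frames at the configured FRAME_EXTRACTION_FPS and the audio track."""
    # fps=2 keeps two frames per second of video, matching the pipeline default.
    subprocess.run(
        ["ffmpeg", "-i", video, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )
    # -vn drops the video stream; pcm_s16le yields a plain WAV for transcription.
    subprocess.run(
        ["ffmpeg", "-i", video, "-vn", "-acodec", "pcm_s16le", f"{out_dir}/audio.wav"],
        check=True,
    )
```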

## Technology Stack

### Frontend Technologies

| Technology | Version | Purpose |
|---|---|---|
| React | 18.2+ | UI framework and component architecture |
| Vite | 5.0+ | Build tool and development server |
| Three.js | r160+ | WebGL 3D rendering engine |
| @react-three/fiber | 8.15+ | React renderer for Three.js |
| @react-three/drei | 9.92+ | Three.js helpers and controls |
| Tailwind CSS | 3.4+ | Utility-first CSS framework |
| Lucide React | 0.300+ | Icon library |
| Firebase SDK | 10.7+ | Client-side Firebase integration |

### Backend Technologies

| Technology | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Backend programming language |
| FastAPI | 0.104+ | High-performance API framework |
| Uvicorn | 0.25+ | ASGI server implementation |
| Pydantic | 2.5+ | Data validation and settings management |
| Firebase Admin | 6.3+ | Server-side Firebase integration |
| FFmpeg | 6.0+ | Video and audio processing |
| Python Multipart | 0.0.6+ | Multipart form data handling |

### AI and Machine Learning

| Technology | Version | Purpose |
|---|---|---|
| TwelveLabs API | Latest | Video understanding and scene analysis |
| Google Gemini | 1.5 Pro | Natural language generation and translation |
| ElevenLabs API | Latest | Voice cloning and text-to-speech synthesis |
| MiDaS | v3.1 | Monocular depth estimation |
| DPT | Latest | Dense prediction transformers for depth |
| PyTorch | 2.1+ | Deep learning framework |
| Open3D | 0.18+ | 3D data processing |
| OpenCV | 4.8+ | Computer vision operations |

### Infrastructure

| Technology | Purpose |
|---|---|
| Firebase Storage | Video and audio file storage with CDN |
| Firestore | NoSQL database for metadata and user data |
| Firebase Authentication | User identity and access management |
| Vercel | Frontend hosting and CDN |
| Railway/Render | Backend API hosting |
| Docker | Containerization for consistent deployment |

## Getting Started

### Prerequisites

Before installation, ensure you have the following installed:

  • Python 3.10 or later
  • Node.js 18 or later (with npm)
  • FFmpeg 6.0 or later
  • Git

### Installation

#### 1. Clone the Repository

```bash
git clone https://github.com/Ohimoiza1205/Rewind.git
cd Rewind
```

#### 2. Backend Setup

```bash
# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Return to root directory
cd ..
```

#### 3. Depth Processing Setup

```bash
# Navigate to depth-processing directory
cd depth-processing

# Install dependencies
pip install -r requirements.txt

# Download MiDaS models
python scripts/setup_midas.py

# Return to root directory
cd ..
```

#### 4. Frontend Setup

```bash
# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Return to root directory
cd ..
```

### Configuration

#### Backend Configuration

Create a `.env` file in the `backend` directory:

```bash
# API Keys
TWELVELABS_API_KEY=your_twelvelabs_api_key
GEMINI_API_KEY=your_gemini_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key

# Firebase Configuration
FIREBASE_PROJECT_ID=your_project_id
FIREBASE_PRIVATE_KEY=your_private_key
FIREBASE_CLIENT_EMAIL=your_client_email
FIREBASE_STORAGE_BUCKET=your_storage_bucket

# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=True
CORS_ORIGINS=http://localhost:5173,http://localhost:3000

# Processing Configuration
MAX_VIDEO_SIZE_MB=500
FRAME_EXTRACTION_FPS=2
MAX_CONCURRENT_UPLOADS=5
TEMP_STORAGE_PATH=/tmp/rewind
```
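
Since Pydantic is in the stack, these variables could be loaded into a typed settings object. A sketch assuming pydantic-settings; the backend's actual configuration loader may differ:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Typed view of the backend .env above; field names match the variables
    case-insensitively, the pydantic-settings default."""
    model_config = SettingsConfigDict(env_file=".env")

    twelvelabs_api_key: str
    gemini_api_key: str
    elevenlabs_api_key: str
    host: str = "0.0.0.0"
    port: int = 8000
    debug: bool = False
    max_video_size_mb: int = 500
    frame_extraction_fps: int = 2

settings = Settings()  # raises a validation error if required keys are missing
```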

#### Frontend Configuration

Create a `.env` file in the `frontend` directory:

```bash
# API Configuration
VITE_API_BASE_URL=http://localhost:8000
VITE_WS_URL=ws://localhost:8000/ws

# Firebase Configuration
VITE_FIREBASE_API_KEY=your_api_key
VITE_FIREBASE_AUTH_DOMAIN=your_auth_domain
VITE_FIREBASE_PROJECT_ID=your_project_id
VITE_FIREBASE_STORAGE_BUCKET=your_storage_bucket
VITE_FIREBASE_MESSAGING_SENDER_ID=your_sender_id
VITE_FIREBASE_APP_ID=your_app_id

# Feature Flags
VITE_ENABLE_VOICE_CLONING=true
VITE_ENABLE_3D_VIEWER=true
VITE_ENABLE_ANALYTICS=false
```

#### Obtaining API Keys

  1. TwelveLabs API: Sign up at twelvelabs.io
  2. Google Gemini: Get API key from Google AI Studio
  3. ElevenLabs: Register at elevenlabs.io
  4. Firebase: Create project at Firebase Console

## Development

### Backend Development

#### Starting the Development Server

```bash
cd backend
source venv/bin/activate  # On Windows: venv\Scripts\activate
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000. Interactive API documentation can be accessed at http://localhost:8000/docs.

#### Running Tests

```bash
cd backend
pytest tests/ -v --cov=app --cov-report=html
```

#### Code Style and Linting

```bash
# Format code with Black
black app/ tests/

# Sort imports with isort
isort app/ tests/

# Lint with flake8
flake8 app/ tests/

# Type checking with mypy
mypy app/
```

### Frontend Development

#### Starting the Development Server

```bash
cd frontend
npm run dev
```

The application will be available at http://localhost:5173.

#### Building for Production

```bash
cd frontend
npm run build
```

#### Running Tests

```bash
cd frontend
npm test
```

#### Linting and Formatting

```bash
# Lint with ESLint
npm run lint

# Format with Prettier
npm run format
```

### Depth Processing

#### Processing a Single Video

```bash
cd depth-processing
python scripts/generate_depth_maps.py --input path/to/video.mp4 --output output/
```

#### Batch Processing

```bash
cd depth-processing
python scripts/batch_process.py --input-dir test_videos/ --output-dir output/
```

## API Documentation

### Core Endpoints

#### Upload Video

```
POST /api/upload
Content-Type: multipart/form-data
```

Parameters:

- file: Video file (max 500MB)
- user_id: User identifier

Response:

```json
{
  "video_id": "uuid-string",
  "status": "processing",
  "upload_url": "https://storage.url/video.mp4"
}
```
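
A hypothetical client call against this endpoint, assuming a local development server:

```python
import requests

# Upload a video as multipart form data, matching the parameters above.
with open("birthday.mp4", "rb") as video:
    resp = requests.post(
        "http://localhost:8000/api/upload",
        files={"file": video},
        data={"user_id": "user-123"},
    )
resp.raise_for_status()
video_id = resp.json()["video_id"]  # status starts as "processing"
```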

#### Get Analysis Results

```
GET /api/analysis/{video_id}
```

Response:

```json
{
  "video_id": "uuid-string",
  "status": "completed",
  "duration": 120.5,
  "scenes": [
    {
      "scene_id": "scene-1",
      "start_time": 0.0,
      "end_time": 15.2,
      "objects": ["person", "cake", "candles"],
      "description": "Birthday celebration scene",
      "confidence": 0.95
    }
  ],
  "transcript": "Full video transcript...",
  "metadata": {...}
}
```
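
Because processing is asynchronous, a client would typically poll this endpoint until the status changes; a minimal sketch:

```python
import time
import requests

def wait_for_analysis(video_id: str, base_url: str = "http://localhost:8000") -> dict:
    """Poll GET /api/analysis/{video_id} until the status leaves 'processing'."""
    while True:
        data = requests.get(f"{base_url}/api/analysis/{video_id}").json()
        if data["status"] != "processing":
            return data  # "completed" per the response shape above
        time.sleep(5)  # simple fixed back-off between polls
```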

#### Generate Narration

```
POST /api/narration/generate
```

Body:

```json
{
  "scene_id": "scene-1",
  "target_language": "es",
  "user_id": "user-123"
}
```

Response:

```json
{
  "audio_url": "https://storage.url/narration.mp3",
  "text": "Translated description",
  "language": "es",
  "duration": 5.2
}
```

#### Clone Voice

```
POST /api/voice-setup/clone
Content-Type: multipart/form-data
```

Parameters:

- audio_file: Audio sample (30 seconds minimum)
- user_id: User identifier
- voice_name: Display name for voice

Response:

```json
{
  "voice_id": "elevenlabs-voice-id",
  "voice_name": "User Voice",
  "status": "ready"
}
```

For complete API documentation, visit /docs when running the development server.


## VoiceBridge™ Integration

VoiceBridge™ is the multilingual narration system that enables users to hear scene descriptions in their own voice across 29+ languages.

Architecture

```
┌──────────────────────────────────────────────────────┐
│                  User Interaction                     │
│  "Narrate this scene in Spanish"                     │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│              Scene Description (Gemini)               │
│  "Here's Emma blowing out the candles on her         │
│   fifth birthday cake, surrounded by family"         │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│              Translation (Gemini)                     │
│  "Aquí está Emma soplando las velas de su pastel     │
│   de quinto cumpleaños, rodeada de familia"          │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│         Voice Synthesis (ElevenLabs)                  │
│  Generates audio in user's cloned voice              │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│              Audio Playback                           │
│  User hears their voice speaking Spanish             │
└──────────────────────────────────────────────────────┘
```
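
A hedged end-to-end sketch of this pipeline, assuming the public google-generativeai SDK and ElevenLabs' documented text-to-speech REST endpoint (model names and prompt are our assumptions, not the repository's code):

```python
import os
import requests
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def narrate(description: str, language: str, voice_id: str) -> bytes:
    """Translate a scene description with Gemini, then synthesize it in the
    user's cloned voice with ElevenLabs' multilingual model."""
    model = genai.GenerativeModel("gemini-1.5-pro")
    translated = model.generate_content(
        f"Translate the following into {language}, keeping its warm tone:\n{description}"
    ).text
    tts = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": translated, "model_id": "eleven_multilingual_v2"},
    )
    tts.raise_for_status()
    return tts.content  # MP3 bytes ready for playback
```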

### Supported Languages

Arabic, Bengali, Chinese (Mandarin), Czech, Danish, Dutch, English, Filipino, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba

---

## Project Structure

```
rewind/
├── backend/                 # FastAPI backend application
│   ├── app/
│   │   ├── api/             # API routes and endpoints
│   │   ├── services/        # Business logic and external integrations
│   │   ├── models/          # Data models and schemas
│   │   └── utils/           # Utility functions and helpers
│   ├── tests/               # Backend tests
│   └── requirements.txt     # Python dependencies
│
├── frontend/                # React frontend application
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── hooks/           # Custom React hooks
│   │   ├── services/        # API clients and external services
│   │   └── utils/           # Frontend utilities
│   ├── public/              # Static assets
│   └── package.json         # Node.js dependencies
│
├── depth-processing/        # Depth estimation pipeline
│   ├── scripts/             # Processing scripts
│   ├── src/                 # Core depth estimation logic
│   └── models/              # Pre-trained model weights
│
├── docs/                    # Documentation
├── deploy/                  # Deployment configurations
└── README.md                # This file
```


For detailed architecture documentation, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

---

## Deployment

### Production Deployment

#### Frontend (Vercel)

```bash
cd frontend

# Install Vercel CLI
npm i -g vercel

# Deploy
vercel --prod
```

#### Backend (Railway)

```bash
cd backend

# Install Railway CLI
npm i -g @railway/cli

# Login and initialize
railway login
railway init

# Deploy
railway up
```

#### Environment Variables

Ensure all production environment variables are configured in your deployment platform's dashboard.

### Docker Deployment

```bash
# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

## Testing

### Backend Tests

```bash
cd backend

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test file
pytest tests/test_elevenlabs.py -v
```

### Frontend Tests

```bash
cd frontend

# Run unit tests
npm test

# Run tests in watch mode
npm test -- --watch

# Generate coverage report
npm test -- --coverage
```

### Integration Tests

```bash
# Run end-to-end tests
npm run test:e2e
```

## Performance Optimization

### Backend Optimization

  • Async Processing: All I/O operations use async/await for non-blocking execution (see the sketch after this list)
  • Request Batching: Multiple scene analyses batched into single API calls
  • Caching: Redis caching for frequently accessed scene data and narrations
  • Database Indexing: Firestore indexes on user_id, video_id, and timestamp fields
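
The async-processing point maps naturally onto asyncio.gather; a toy sketch with placeholder stages (the function names are hypothetical stand-ins, not the repository's services):

```python
import asyncio

# Placeholder stages standing in for the real MiDaS, TwelveLabs, and
# transcription services; in production each would await an external API.
async def estimate_depth(video_id: str) -> str:
    await asyncio.sleep(0.1)
    return f"{video_id}:depth-maps"

async def analyze_scenes(video_id: str) -> str:
    await asyncio.sleep(0.1)
    return f"{video_id}:scene-analysis"

async def transcribe_audio(video_id: str) -> str:
    await asyncio.sleep(0.1)
    return f"{video_id}:transcript"

async def process_video(video_id: str) -> list[str]:
    # The three independent stages run concurrently instead of back-to-back.
    return await asyncio.gather(
        estimate_depth(video_id),
        analyze_scenes(video_id),
        transcribe_audio(video_id),
    )

print(asyncio.run(process_video("demo-video")))
```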

### Frontend Optimization

  • Code Splitting: Dynamic imports for route-based code splitting
  • Asset Optimization: Image compression and lazy loading
  • Three.js Optimization: Level-of-detail (LOD) rendering for point clouds
  • Memoization: React.memo and useMemo for expensive computations

### Depth Processing Optimization

  • GPU Acceleration: CUDA support for MiDaS inference (10x speedup)
  • Frame Sampling: Process only the frames extracted at 2 fps, not the full frame rate, to reduce computation
  • Model Selection: DPT-Hybrid for accuracy vs. MiDaS-small for speed trade-off
  • Batch Processing: Process multiple frames in parallel (see the sketch below)
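
A sketch of the GPU selection and frame batching described above, assuming pre-transformed frame tensors (the actual scripts live in `depth-processing/` and may differ):

```python
import torch

# Prefer CUDA when available; MiDaS-small trades accuracy for speed.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()

def depth_batch(frames: torch.Tensor, batch_size: int = 8) -> torch.Tensor:
    """Run depth inference over pre-transformed frames of shape [N, 3, H, W]."""
    outputs = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = frames[i : i + batch_size].to(device)
            outputs.append(model(batch).cpu())
    return torch.cat(outputs)
```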

## Contributing

We welcome contributions to REWIND. Please follow these guidelines:

### Development Workflow

  1. Fork the repository
  2. Create a feature branch (`git checkout -b feature/amazing-feature`)
  3. Make your changes
  4. Write or update tests
  5. Ensure all tests pass
  6. Commit your changes (`git commit -m 'Add amazing feature'`)
  7. Push to the branch (`git push origin feature/amazing-feature`)
  8. Open a Pull Request

### Code Style Guidelines

#### Python (Backend)

  • Follow PEP 8 style guide
  • Use type hints for all function signatures
  • Maximum line length: 88 characters (Black default)
  • Docstrings required for all public functions and classes

#### JavaScript/React (Frontend)

  • Follow Airbnb JavaScript Style Guide
  • Use functional components with hooks
  • Prefer `const` over `let`, never use `var`
  • Use meaningful variable and function names

### Commit Message Convention

```
type(scope): subject

body

footer
```

Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

Example:

```
feat(narration): add support for Yoruba language

- Implemented Yoruba translation in Gemini service
- Added Yoruba language option to frontend selector
- Updated language constants and documentation

Closes #123
```

## Team

### Core Development Team

Ohinoyi Moiza - Frontend & Voice Engineering Lead
Responsible for React frontend architecture, Three.js 3D rendering, and VoiceBridge™ user interface implementation.

Peace Enesi - 3D & Depth Processing Lead
Responsible for monocular depth estimation pipeline, point cloud generation, and 3D scene reconstruction.

Joanna Chimalilo - AI & Backend Engineering Lead
Responsible for FastAPI backend architecture, AI service integration (TwelveLabs, Gemini, ElevenLabs), and VoiceBridge™ narration system.


## License

This project is licensed under the MIT License - see the LICENSE file for details.

```
MIT License

Copyright (c) 2025 REWIND Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

## Acknowledgments

### Technologies and Frameworks

  • TwelveLabs for providing advanced video understanding capabilities
  • Google Gemini for natural language generation and translation
  • ElevenLabs for state-of-the-art voice cloning and synthesis
  • Three.js Community for the powerful 3D rendering framework
  • FastAPI Team for the high-performance Python web framework
  • React Team for the declarative UI framework

### Research Papers

  • Ranftl, R., et al. (2021). "Vision Transformers for Dense Prediction" - DPT architecture
  • Ranftl, R., et al. (2020). "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer" - MiDaS
  • Casper, J., et al. (2022). "ElevenLabs: High Quality Text to Speech"

### Open Source Projects

  • MiDaS - Intel Intelligent Systems Lab
  • Open3D - Intel Labs and Stanford University
  • FFmpeg - FFmpeg team

## Contact and Support

For questions, issues, or collaboration opportunities, please open an issue on the [GitHub repository](https://github.com/Ohimoiza1205/Rewind/issues).


## Roadmap

### Version 1.1 (Q4 2025)

  • Real-time collaborative viewing
  • Mobile application (iOS/Android)
  • Advanced scene editing capabilities
  • Integration with popular video platforms

### Version 1.2 (Q1 2026)

  • VR/AR support for immersive viewing
  • AI-powered video summarization
  • Multi-speaker voice cloning
  • Enhanced privacy controls

### Version 2.0 (Q2 2026)

  • Live streaming support with real-time processing
  • Professional video editing suite
  • Team collaboration features
  • Enterprise deployment options

Built with passion by the REWIND team. Transform how you experience video memories.
