An intelligent computer vision system that recognizes Rock-Paper-Scissors hand gestures in real time using deep learning.
Features • Installation • Usage • How It Works • Training • Results
RPSVision combines MediaPipe hand detection with a lightweight convolutional network to recognize Rock-Paper-Scissors gestures from a webcam feed in real time. Whether you're building an interactive game, a gesture-based interface, or exploring computer vision, RPSVision provides a solid foundation.
- MediaPipe Hands - Real-time hand landmark detection (21 keypoints)
- Custom CNN - Lightweight PyTorch model optimized for gesture classification
- ONNX Runtime - Efficient model deployment with cross-platform support
- OpenCV - Real-time video processing and visualization
- ✅ Real-Time Detection - Process webcam feed at 30+ FPS
- ✅ High Accuracy - 95.43% test accuracy on gesture classification
- ✅ Lightweight Model - Fast inference suitable for edge devices
- ✅ Easy Deployment - ONNX model for production environments
- ✅ Extensible - Simple architecture for adding new gestures
- ✅ GPU Accelerated - Automatic GPU utilization when available
- Comprehensive training pipeline with validation
- Real-time training metrics and visualization
- Clean, modular codebase
- Detailed logging and error handling
- Data augmentation support
RPSVision/
│
├── .gitignore # Ignore file
├── main.py # Application entry point
├── requirements.txt # Python dependencies
├── README.md # Project documentation
│
├── checkpoint/ # Saved models
│ ├── best_model.pth # PyTorch checkpoint
│ └── best_model.onnx # ONNX deployment model
│
├── data/ # Training dataset
│ ├── rock/ # Rock gesture images
│ ├── paper/ # Paper gesture images
│ └── scissors/ # Scissors gesture images
│
├── src/ # Source code
│ ├── constant/
│ │ └── constant.py # Configuration & constants
│ │
│ ├── model/
│ │ └── classifier.py # CNN architecture
│ │
│ ├── gestures/
│ │ └── recognizer.py # Gesture recognition engine
│ │
│ ├── data/
│ │ ├── custom_data.py # Custom dataset class
│ │ ├── dataloader.py # Data loading utilities
│ │ └── metadata.py # Dataset metadata
│ │
│ ├── training/
│ │ └── train.py # Training pipeline
│ │
│ ├── pipeline/
│ │ ├── main.py # Pipeline orchestration
│ │ └── pipeline.py # Inference pipeline
│ │
│ └── utils/
│ └── helper.py # Utility functions
- Python 3.8 or higher
- Webcam (built-in or external)
- pip package manager
git clone https://github.com/A-Ahmed-I/RPSVision.git
cd RPSVision
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -c "import torch; import cv2; import mediapipe; print('✓ All dependencies installed successfully')"

Launch the application with a single command:
python main.py

| Key | Action |
|---|---|
| Q | Quit the application |
| ESC | Exit fullscreen (if applicable) |

- Lighting - Ensure good, even lighting on your hand
- Distance - Keep your hand 30-50cm from the camera
- Positioning - Center your hand in the frame
- Background - Use a plain background for better detection
- Visibility - Keep all fingers visible to the camera
The application displays:
- Live webcam feed
- Hand detection bounding box
- Predicted gesture (Rock/Paper/Scissors)
- Confidence score (0-100%)
┌─────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Webcam │ ───▶ │ MediaPipe │ ───▶ │ Crop & │ ───▶ │ CNN │
│ Capture │ │ Hand Detect │ │ Preprocess │ │ Classifier │
└─────────────┘ └───────┬───────┘ └───────────────┘ └───────┬───────┘
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ 21 Landmark │ │ Prediction │
│ Points │ │ + Confidence │
└───────────────┘ └───────────────┘
MediaPipe Hands detects hands in the video frame and extracts 21 landmark points:
# Key landmarks detected:
- Wrist (0)
- Thumb (1-4)
- Index finger (5-8)
- Middle finger (9-12)
- Ring finger (13-16)
- Pinky (17-20)

The system:
- Calculates a bounding box around detected landmarks
- Adds padding for context
- Crops the region containing the hand
- Handles edge cases (frame boundaries)
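
The cropping steps above can be sketched in a few lines. This is a minimal illustration, not the project's recognizer code: it assumes landmarks arrive as normalized (x, y) pairs in the range [0, 1], which is how MediaPipe Hands reports them, and the function name `landmarks_to_bbox` and the 20-pixel default padding are hypothetical.

```python
def landmarks_to_bbox(landmarks, frame_width, frame_height, padding=20):
    """Return (x_min, y_min, x_max, y_max) in pixels around normalized landmarks.

    `landmarks` is a sequence of (x, y) pairs normalized to [0, 1].
    The function name and the default padding are illustrative assumptions.
    """
    if not landmarks:
        raise ValueError("at least one landmark is required")

    # Convert normalized coordinates to pixel coordinates.
    xs = [int(x * frame_width) for x, _ in landmarks]
    ys = [int(y * frame_height) for _, y in landmarks]

    # Pad the tight box for context, then clamp it to the frame boundaries.
    x_min = max(min(xs) - padding, 0)
    y_min = max(min(ys) - padding, 0)
    x_max = min(max(xs) + padding, frame_width)
    y_max = min(max(ys) + padding, frame_height)
    return x_min, y_min, x_max, y_max


# Toy landmarks near the top-left corner of a 640x480 frame.
box = landmarks_to_bbox([(0.01, 0.02), (0.25, 0.5)], 640, 480)
print(box)
```

Clamping matters when the hand sits at the edge of the frame: without it, the padded box would produce negative or out-of-range indices when slicing the image.
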
Images are standardized for the CNN:
Transform Pipeline:
- Resize: 224×224 pixels
- Normalize: Divide by 255.0
- Convert: HWC → CHW format
- Batch: Add batch dimension

The custom CNN processes the image:
Input: [1, 3, 224, 224]
↓
Conv Block 1: 16 filters
↓
Conv Block 2: 32 filters
↓
Conv Block 3: 64 filters
↓
Flatten
↓
Fully Connected: 3 classes
↓
Softmax: [Rock, Paper, Scissors]
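
To make the final step concrete, the sketch below turns three raw class scores into a label and a confidence value. It assumes the network emits logits in the order Rock, Paper, Scissors shown above. The `decode_prediction` helper is illustrative rather than taken from the project's code, and its 0.5 default mirrors the `CONFIDENCE_THRESHOLD` value in the configuration section.

```python
import math

CLASS_NAMES = ["Rock", "Paper", "Scissors"]  # assumed output order


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode_prediction(logits, threshold=0.5):
    """Return (label, confidence); label is None below the threshold."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    label = CLASS_NAMES[best] if confidence >= threshold else None
    return label, confidence


label, confidence = decode_prediction([0.2, 3.1, -1.0])
print(label, f"{confidence:.1%}")
```

Subtracting the largest logit before exponentiating leaves the result unchanged but avoids overflow for large scores.
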
Visual feedback includes:
- Green bounding box around hand
- Gesture label with confidence
- FPS counter (optional)
- Hand landmarks (optional)
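
The optional FPS counter can be implemented with a short rolling average. The class below is a hedged sketch (the name `FpsCounter` and the 30-frame window are assumptions). It only measures timing and leaves drawing to the application, for example via `cv2.putText`.

```python
import time
from collections import deque


class FpsCounter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window=30):
        self._timestamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record one frame and return the current FPS estimate."""
        self._timestamps.append(time.perf_counter() if now is None else now)
        if len(self._timestamps) < 2:
            return 0.0
        elapsed = self._timestamps[-1] - self._timestamps[0]
        if elapsed <= 0:
            return 0.0
        # N timestamps span N - 1 frame intervals.
        return (len(self._timestamps) - 1) / elapsed


counter = FpsCounter()
for frame_index in range(5):
    # Simulated timestamps spaced 1/30 s apart stand in for real frames.
    fps = counter.tick(now=frame_index / 30)
print(round(fps))
```

In a live loop, `counter.tick()` would be called once per processed frame with no argument, so the estimate reflects the whole pipeline and not just model inference.
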
Organize your data as follows:
data/
├── rock/
│ ├── img_001.jpg
│ ├── img_002.jpg
│ └── ...
├── paper/
│ ├── img_001.jpg
│ └── ...
└── scissors/
├── img_001.jpg
└── ...
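
Before training, it can help to confirm that each class folder is populated. The following sketch counts image files per class using only the standard library. The function name and the set of accepted extensions are assumptions, not part of the repository.

```python
from pathlib import Path

CLASSES = ("rock", "paper", "scissors")
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of formats


def count_images(data_dir):
    """Return {class_name: image_count} for the expected class folders."""
    root = Path(data_dir)
    counts = {}
    for class_name in CLASSES:
        class_dir = root / class_name
        if not class_dir.is_dir():
            counts[class_name] = 0  # a missing folder counts as empty
            continue
        counts[class_name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    for name, total in count_images("data").items():
        print(f"{name}: {total} images")
```

A large imbalance between the three counts is worth fixing before training, since it can skew both accuracy and per-class F1.
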
from src.training.train import ModelTrainer
from src.model.classifier import RPSClassifier
from src.data.dataloader import create_dataloaders
import torch
import torch.optim as optim
import torch.nn as nn

# Initialize model
model = RPSClassifier(
    in_channels=3,
    base_filters=32,
    num_classes=3
)

# Create data loaders
train_loader, val_loader, test_loader = create_dataloaders(
    data_dir='data/',
    batch_size=32,
    val_split=0.15,
    test_split=0.15
)

# Setup training (weight decay matches the WEIGHT_DECAY hyperparameter)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# Create trainer; fall back to CPU when no GPU is available
trainer = ModelTrainer(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=30,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Train
history = trainer.train()

# Export to ONNX
trainer.export_to_onnx(output_path='checkpoint/best_model.onnx')

# Hyperparameters
BATCH_SIZE = 32
LEARNING_RATE = 0.001
EPOCHS = 30
OPTIMIZER = 'Adam'
WEIGHT_DECAY = 1e-4

| Metric | Value |
|---|---|
| Test Accuracy | 95.43% |
| Test F1 Score | 95.42% |
| Best Val F1 | 97.71% |
| Training Time | ~15 minutes (GPU) |
| Model Size | 2.3 MB |
| Inference Speed | ~5ms per frame |
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 | Status |
|---|---|---|---|---|---|---|
| 1 | 1.0841 | 42.13% | 1.0338 | 61.19% | 60.40% | ⭐ Best |
| 2 | 0.8391 | 72.31% | 0.6529 | 78.08% | 78.09% | ⭐ Best |
| 5 | 0.4403 | 84.52% | 0.4277 | 85.39% | 85.40% | ⭐ Best |
| 9 | 0.2216 | 91.84% | 0.2033 | 92.69% | 92.67% | ⭐ Best |
| 12 | 0.1266 | 95.89% | 0.1317 | 95.43% | 95.41% | ⭐ Best |
| 15 | 0.0784 | 97.98% | 0.0905 | 97.72% | 97.71% | ⭐ Best |
| 18 | 0.0414 | 99.09% | 0.0720 | 97.72% | 97.71% | ⭐ Best |
| 30 | 0.0054 | 100.00% | 0.0799 | 96.80% | 96.80% | Final |
Key Observations:
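- Smooth convergence: training and validation loss fall together through epoch 18
- Validation accuracy stays within about one percentage point of training accuracy from epoch 5 through epoch 15
- Best validation F1 (97.71%) is reached at epochs 15-18; by epoch 30 training accuracy hits 100.00% while validation accuracy slips to 96.80%, a sign of mild overfitting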
- Stable performance after epoch 20
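
The "⭐ Best" markers in the table suggest that a checkpoint is kept whenever validation F1 matches or beats the previous best. A sketch of that bookkeeping follows. The `>=` comparison is an assumption inferred from epoch 18 being marked best at the same rounded 97.71% as epoch 15, and the function is illustrative, not the project's `ModelTrainer` logic.

```python
def best_epochs(val_f1_by_epoch):
    """Return the epochs at which validation F1 ties or beats the best so far."""
    best_f1 = float("-inf")
    improved = []
    for epoch, f1 in sorted(val_f1_by_epoch.items()):
        if f1 >= best_f1:  # assumption: ties also count as "best"
            best_f1 = f1
            improved.append(epoch)  # a real trainer would save a checkpoint here
    return improved


# Validation F1 (%) for the epochs listed in the training table.
val_f1 = {1: 60.40, 2: 78.09, 5: 85.40, 9: 92.67,
          12: 95.41, 15: 97.71, 18: 97.71, 30: 96.80}
print(best_epochs(val_f1))
```

Selecting the checkpoint by validation F1 instead of taking the final epoch is what keeps the slight late-training decline out of the deployed model.
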
Performance Breakdown:
- Rock: Highest precision (96.2%) and F1 (96.0%) of the three classes
- Paper: 94.9% precision and 95.1% recall, the lowest of the three by a small margin
- Scissors: 95.1% precision and 95.3% recall
- Overall: Balanced performance, with per-class F1 between 95.0% and 96.0%
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Rock | 96.2% | 95.8% | 96.0% | 340 |
| Paper | 94.9% | 95.1% | 95.0% | 326 |
| Scissors | 95.1% | 95.3% | 95.2% | 334 |
| Avg | 95.4% | 95.4% | 95.4% | 1000 |
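
Per-class precision, recall, and F1 are all derived from a confusion matrix. The sketch below shows the arithmetic on a small made-up matrix (the counts are illustrative and are not the project's results). Rows are true classes and columns are predicted classes.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall and F1 per class from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    metrics = {}
    for index, label in enumerate(labels):
        true_positive = confusion[index][index]
        predicted = sum(row[index] for row in confusion)  # column sum
        actual = sum(confusion[index])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy counts for illustration only.
toy_confusion = [
    [9, 1, 0],  # true rock
    [0, 8, 2],  # true paper
    [1, 1, 8],  # true scissors
]
toy_labels = ["Rock", "Paper", "Scissors"]
for label, values in per_class_metrics(toy_confusion, toy_labels).items():
    print(label, {name: round(value, 3) for name, value in values.items()})
```
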
Edit src/constant/constant.py to customize:
# Model settings
MODEL_PATH = 'checkpoint/best_model.onnx'
INPUT_SIZE = (224, 224)
NUM_CLASSES = 3
# Detection settings
CONFIDENCE_THRESHOLD = 0.5
HAND_DETECTION_CONFIDENCE = 0.7
HAND_TRACKING_CONFIDENCE = 0.5
# Display settings
SHOW_FPS = True
SHOW_LANDMARKS = False
BBOX_COLOR = (0, 255, 0) # Green
TEXT_COLOR = (255, 255, 255) # White

# Test camera access
python -c "import cv2; cap = cv2.VideoCapture(0); print('Camera:', cap.isOpened())"
# Try different camera indices
# In main.py, change: cv2.VideoCapture(0) to cv2.VideoCapture(1)

Solutions:
- Close other applications using the camera
- Reduce input resolution in the code
- Use ONNX model instead of PyTorch
- Enable GPU acceleration
Solutions:
- Improve lighting conditions
- Ensure hand is properly centered
- Retrain with more diverse data
- Adjust confidence threshold
# Reinstall dependencies
pip install --upgrade -r requirements.txt
# Verify installation
pip list | grep -E "torch|opencv|mediapipe"

- Interactive Gaming - Build gesture-controlled games
- Mobile Applications - Touchless interfaces
- Robotics - Hand gesture control systems
- Education - Teaching computer vision concepts
- Accessibility - Alternative input methods
- Creative Applications - Gesture-based art tools
- Support for multiple simultaneous hands
- Additional gestures (numbers, letters)
- Mobile deployment (iOS/Android)
- Web-based demo using ONNX.js
- Gesture customization interface
- Performance benchmarking suite
- Docker containerization
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Bug fixes
- New features
- Documentation improvements
- Test coverage
- UI/UX enhancements
- Performance optimizations
This project is licensed under the MIT License - see the LICENSE file for details.
- MediaPipe Team - Hand detection framework
- PyTorch Team - Deep learning framework
- OpenCV Community - Computer vision tools
- ONNX Runtime - Model deployment solution