Real-Time 3D Gesture and Traffic Sign Language Recognition
Using Multi-Layer Perceptron with MediaPipe
2nd International Conference on Machine Learning and Autonomous Systems
(ICMLAS 2025)
Paper ID: ICMLAS - 1274
BVRIT HYDERABAD College of Engineering for Women
Computer Science and Engineering
Ms. D Swapna
Ms. M Shanmuga Sundari
Ms. K Sreeja
Ms. NSML Keerthi
Ms. D Sai Shriya
Agenda
• Introduction
• Abstract
• Literature work
• Proposed work
• Methodology
• Dataset
• Algorithm
• Comparative Analysis
• Results
• Conclusion
• Future work
• References
Introduction
Real-time 3D gesture and traffic sign language recognition is an essential
advancement in human-computer interaction and traffic management.
Traditional recognition methods often face challenges in accuracy,
adaptability, and real-time performance, limiting their effectiveness in
dynamic environments. This project addresses these limitations by utilizing
a Multi-Layer Perceptron (MLP) with MediaPipe, a powerful framework for
tracking and interpreting hand and body gestures with high precision. By
leveraging deep learning and skeletal tracking techniques, the system can
efficiently recognize traffic signals and hand gestures, making it highly
applicable for smart traffic control, assistive technologies, and automated
surveillance systems.
Abstract
This project presents a real-time 3D gesture and traffic sign language
recognition system using a Multi-Layer Perceptron (MLP) with MediaPipe to
improve gesture-based communication and traffic safety. The system captures
and processes 3D skeletal movements from real-time video input, utilizing
deep learning techniques to classify hand gestures and traffic signs with high
accuracy. By integrating the MLP with MediaPipe’s pose estimation, the model
ensures low latency and adaptability in dynamic conditions. The proposed
system enhances automated traffic control, accessibility for individuals with
disabilities, and intelligent surveillance by providing an efficient, real-time
solution for recognizing and interpreting human gestures in diverse
environments.
Literature work
1. 3D Skeletal Gesture Recognition via Discriminative Coding on Time-Warping
Invariant Riemannian Trajectories
• Merits: Introduces a representation for gesture recognition that is
invariant to temporal dynamics by using a time-warping invariant metric,
allowing the system to handle gestures performed at different speeds.
• Demerits: The method sometimes struggles with gestures that have
significant intra-class variations or complex interactions (e.g., actions
involving both hands) due to noise or occlusion in the skeleton data.
2. Traffic Sign Detection and Recognition Using YOLO Object Detection Algorithm
• Merits: High-speed real-time detection; suitable for autonomous driving
and traffic safety.
• Demerits: Struggles with small object detection and adverse conditions;
computationally expensive.
3. Traffic Police Gesture Recognition Based on Gesture Skeleton Extractor and
Multichannel Dilated Graph Convolution Network
• Merits: Achieves high accuracy (98.95%) by extracting discriminative and
interpretable gesture skeleton features, which outperforms other methods.
• Demerits: The method may face overfitting and oversmoothing issues when
aggregating excessive multichannel features.
4. Vision-based Traffic Police Hand Signal Recognition in Surveillance Video
• Merits: Effective in traffic control and real-time monitoring with robust
hand gesture recognition.
• Demerits: Sensitive to illumination changes, occlusions, and viewing
angles.
5. An Efficient Feature Fusion of Graph Convolutional Networks and Its
Application for Real-Time Traffic Control Gestures Recognition
• Merits: The proposed method, using Attention-enhanced Adaptive Graph
Convolutional Networks (AAGCN), significantly improves the accuracy of
gesture recognition in real-time applications. It is robust against noisy
and incomplete skeletal data.
• Demerits: The paper mentions that while the method performs well, handling
extremely noisy or missing data in more challenging datasets remains a
limitation.
6. Authorized Traffic Controller Hand Gesture Recognition for Situation-Aware
Autonomous Driving
• Merits: Achieves high accuracy (96.70%) in recognizing traffic control hand
gestures, even in complex scenarios like overlapping pedestrians and gloves.
• Demerits: The system struggles with accuracy at the start or end of
gestures, leading to occasional frame-wise errors.
Proposed work
Preprocessing:
• Capture webcam frames.
• Convert frames to RGB format.
• Detect hand landmarks using MediaPipe.
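The three preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the `opencv-python` and `mediapipe` packages and MediaPipe's legacy `mp.solutions.hands` API. The helper names `landmarks_to_vector` and `capture_landmark_vectors`, the wrist-relative normalization, and the 0.5 confidence threshold are hypothetical choices made for the example.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) of one hand landmark


def landmarks_to_vector(points: Sequence[Point]) -> List[float]:
    """Flatten 21 (x, y, z) landmarks into a 63-value feature vector.

    Coordinates are shifted so the wrist (landmark 0) becomes the origin.
    This normalization is an assumption, not something the slides specify.
    """
    if len(points) != 21:
        raise ValueError(f"expected 21 landmarks, got {len(points)}")
    wx, wy, wz = points[0]
    vector: List[float] = []
    for x, y, z in points:
        vector.extend((x - wx, y - wy, z - wz))
    return vector


def capture_landmark_vectors(camera_index: int = 0) -> Iterator[List[float]]:
    """Yield one feature vector per webcam frame in which a hand is detected."""
    # Imported lazily so the pure helper above works without these packages.
    import cv2
    import mediapipe as mp

    cap = cv2.VideoCapture(camera_index)
    hands = mp.solutions.hands.Hands(
        static_image_mode=False,
        max_num_hands=1,
        min_detection_confidence=0.5,  # hypothetical threshold
    )
    try:
        while cap.isOpened():
            ok, frame_bgr = cap.read()  # 1. capture a webcam frame
            if not ok:
                break
            # 2. OpenCV delivers BGR; MediaPipe expects RGB
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            results = hands.process(frame_rgb)  # 3. detect hand landmarks
            if results.multi_hand_landmarks:
                hand = results.multi_hand_landmarks[0]
                yield landmarks_to_vector([(p.x, p.y, p.z) for p in hand.landmark])
    finally:
        hands.close()
        cap.release()
```

Iterating over `capture_landmark_vectors()` would produce the skeletal feature vectors that later stages consume; frames without a detected hand are simply skipped.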
Gesture Recognition:
• Define gestures based on hand landmark positions (e.g., Stop, Right Turn, Move Forward).
• Track wrist movement for dynamic gestures.
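A rule-based reading of these two bullets might look like the sketch below. The landmark indices follow the MediaPipe Hands numbering (wrist 0, index tip 8, and so on), but the rule table that maps finger patterns to gesture names, the 0.15 displacement threshold, and the function names are all hypothetical; the slides do not state the actual rules.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]

# MediaPipe Hands landmark indices: (fingertip, PIP joint) per finger.
FINGERS: Dict[str, Tuple[int, int]] = {
    "index": (8, 6),
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}


def extended_fingers(points: Sequence[Point]) -> List[str]:
    """Return the fingers whose tip lies above its PIP joint.

    Image y grows downward, so "above" means a smaller y value. This
    assumes an upright hand facing the camera.
    """
    return [name for name, (tip, pip) in FINGERS.items()
            if points[tip][1] < points[pip][1]]


def classify_static(points: Sequence[Point]) -> str:
    """Map a finger pattern to a gesture label (hypothetical rule table)."""
    up = set(extended_fingers(points))
    if up == {"index", "middle", "ring", "pinky"}:
        return "Stop"
    if up == {"index"}:
        return "Move Forward"
    if not up:
        return "Slow Down"
    return "Unknown"


def classify_wrist_motion(wrist_xs: Sequence[float], threshold: float = 0.15) -> str:
    """Label a dynamic gesture from the wrist's horizontal displacement.

    wrist_xs holds normalized x positions (0..1) of landmark 0 over
    consecutive frames. The threshold and the left/right mapping (an
    unmirrored image) are assumptions.
    """
    if len(wrist_xs) < 2:
        return "Unknown"
    shift = wrist_xs[-1] - wrist_xs[0]
    if shift > threshold:
        return "Right Turn"
    if shift < -threshold:
        return "Left Turn"
    return "Unknown"
```

Rules of this kind are cheap to evaluate per frame, but they depend on hand orientation, which is one reason to feed the same landmarks to a learned classifier instead.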
Evaluation:
• Calculate accuracy, precision, and F1-score using ground truth and predictions.
• Evaluate over multiple frames for consistency.
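The evaluation step can be illustrated with a small, dependency-free sketch. It computes accuracy plus macro-averaged precision and F1 from ground-truth and predicted labels, and adds a sliding-window majority vote as one possible way to check consistency over multiple frames. Macro averaging, the window size, and all function names are assumptions; the slides do not say which averaging or smoothing was used.

```python
from collections import Counter
from typing import Dict, List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of frames whose predicted label equals the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true: Sequence[str],
                     y_pred: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for every label seen in either sequence."""
    scores: Dict[str, Dict[str, float]] = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores: Dict[str, Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric across all classes."""
    return sum(s[key] for s in scores.values()) / len(scores)


def majority_vote(frame_predictions: Sequence[str], window: int = 5) -> List[str]:
    """Smooth per-frame labels with a sliding-window majority vote."""
    smoothed: List[str] = []
    for i in range(len(frame_predictions)):
        recent = frame_predictions[max(0, i - window + 1): i + 1]
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed
```

Smoothing before scoring suppresses single-frame flickers, at the cost of a short delay when the gesture genuinely changes.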
Error Handling:
• Handle incomplete landmarks and frame capture errors.
• Use exception handling for robustness.
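One way to structure this error handling is sketched below. The camera and detector are passed in as plain callables so the logic stays independent of any particular library; `FrameCaptureError`, `validate_landmarks`, and `process_frame` are hypothetical names introduced only for this example.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
EXPECTED_LANDMARKS = 21


class FrameCaptureError(RuntimeError):
    """Raised when the camera fails to deliver a frame."""


def validate_landmarks(points: Optional[Sequence[Point]]) -> bool:
    """Return True only for a complete set of 21 finite (x, y, z) landmarks."""
    if points is None or len(points) != EXPECTED_LANDMARKS:
        return False
    return all(len(p) == 3 and all(math.isfinite(c) for c in p) for p in points)


def process_frame(
    read_frame: Callable[[], object],
    detect: Callable[[object], Optional[Sequence[Point]]],
) -> Optional[List[Point]]:
    """Run one capture-and-detect step.

    Returns the landmarks, or None when the frame is unusable, so the
    calling loop can skip the frame instead of crashing.
    """
    try:
        frame = read_frame()
        if frame is None:
            raise FrameCaptureError("camera returned no frame")
        points = detect(frame)
    except FrameCaptureError:
        return None  # dropped frame: skip it and keep the loop alive
    except Exception as exc:  # any other capture or detector failure
        print(f"frame skipped: {exc}")
        return None
    return list(points) if validate_landmarks(points) else None
```

Returning `None` for every failure mode keeps the main loop to a single check per frame.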
Methodology
Dataset
The dataset used in this project is built on the pretrained MediaPipe Hands
model, which detects 21 hand landmarks in images and video frames. The main
features are:
• Hand gestures are recognized based on the positions of key landmarks.
• The gestures in this project are: Stop, Right Turn, Left Turn, Move
Forward, Move Backward, Slow Down, and Speed Up.
• The dataset includes webcam frames processed using MediaPipe's hand
tracking model.
• Ground truth labels are manually assigned for accuracy evaluation.
This dataset supports the accurate classification of hand gestures for real-time
recognition.
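As an illustration of how manually labeled landmark samples could be stored, the sketch below writes one CSV row per frame: an integer class id followed by 63 feature values (21 landmarks with x, y, z each). The seven gesture names come from the list above; the id assignment, the CSV layout, and the function names are assumptions, since the slides do not describe a storage format.

```python
import csv
import io
from typing import List, Sequence, Tuple

GESTURES = ["Stop", "Right Turn", "Left Turn", "Move Forward",
            "Move Backward", "Slow Down", "Speed Up"]
LABEL_TO_ID = {name: i for i, name in enumerate(GESTURES)}
NUM_FEATURES = 21 * 3  # 21 landmarks, (x, y, z) each


def write_samples(stream, samples: Sequence[Tuple[str, Sequence[float]]]) -> None:
    """Write (gesture name, feature vector) pairs as CSV rows."""
    writer = csv.writer(stream)
    for name, features in samples:
        if len(features) != NUM_FEATURES:
            raise ValueError(f"expected {NUM_FEATURES} features, got {len(features)}")
        writer.writerow([LABEL_TO_ID[name], *features])


def read_samples(stream) -> Tuple[List[int], List[List[float]]]:
    """Read rows back as (class ids, feature vectors)."""
    labels: List[int] = []
    vectors: List[List[float]] = []
    for row in csv.reader(stream):
        labels.append(int(row[0]))
        vectors.append([float(v) for v in row[1:]])
    return labels, vectors


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_samples(buffer, [("Stop", [0.0] * NUM_FEATURES)])
buffer.seek(0)
labels, vectors = read_samples(buffer)
```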
Algorithm
Multi-Layer Perceptron (MLP)
• Multi-Layer Perceptron (MLP) is a feedforward artificial neural network.
• It is used for pattern recognition and gesture classification in real-time
systems.
Architecture:
• Input Layer: Accepts skeletal feature vectors extracted using Mediapipe.
• Hidden Layers: Utilize weighted connections, activation functions, and
backpropagation for learning.
• Output Layer: Predicts gesture labels based on processed input.
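The three-part architecture above corresponds to a forward pass like the one sketched here in plain Python. The layer sizes (63 inputs, hidden layers of 32 and 16 units, 7 outputs), the ReLU hidden activation, the softmax output, and the random initialization are illustrative assumptions; the slides do not give the actual configuration.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def softmax(v: Vector) -> Vector:
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights: Matrix, bias: Vector, inputs: Vector) -> Vector:
    """One fully connected layer: weights times inputs, plus bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    limit = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MLP:
    """Feedforward network: input -> hidden layers (ReLU) -> softmax output."""

    def __init__(self, sizes: Sequence[int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.layers = [init_layer(n_in, n_out, rng)
                       for n_in, n_out in zip(sizes, sizes[1:])]

    def predict_proba(self, features: Sequence[float]) -> Vector:
        activation = list(features)
        for weights, bias in self.layers[:-1]:      # hidden layers
            activation = relu(dense(weights, bias, activation))
        weights, bias = self.layers[-1]             # output layer
        return softmax(dense(weights, bias, activation))

    def predict(self, features: Sequence[float]) -> int:
        """Index of the most probable gesture class."""
        probs = self.predict_proba(features)
        return max(range(len(probs)), key=probs.__getitem__)


# Untrained network with hypothetical layer sizes: 63 inputs, 7 gesture classes.
model = MLP([63, 32, 16, 7])
probabilities = model.predict_proba([0.0] * 63)
```

With random weights the prediction is meaningless; the point is only the data flow from a 63-value landmark vector to one probability per gesture.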
Key Features:
• Utilizes activation functions such as ReLU, Sigmoid, or Tanh for efficient
learning.
• Optimized using gradient descent and backpropagation.
• Designed for real-time gesture recognition with high accuracy and low latency.
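To make the activation functions and the gradient descent update concrete, here is a minimal single-sample training step for a network with one tanh hidden layer and a softmax output trained with cross-entropy loss. The network size, learning rate, initialization range, and every function name are assumptions chosen for illustration; this is not the project's training code.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Tuple[Matrix, Vector, Matrix, Vector]  # (w1, b1, w2, b2)


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(v: Vector) -> Vector:
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_in: int, n_hidden: int, n_out: int, seed: int = 0) -> Params:
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, [0.0] * n_hidden, w2, [0.0] * n_out


def forward(params: Params, x: Sequence[float]) -> Tuple[Vector, Vector]:
    """Return (hidden activations, class probabilities) using a tanh hidden layer."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    return hidden, softmax(logits)


def loss(params: Params, x: Sequence[float], target: int) -> float:
    """Cross-entropy loss for one sample."""
    return -math.log(forward(params, x)[1][target])


def sgd_step(params: Params, x: Sequence[float], target: int, lr: float = 0.05) -> None:
    """One backpropagation pass followed by a gradient descent update, in place."""
    w1, b1, w2, b2 = params
    hidden, probs = forward(params, x)
    # Softmax + cross-entropy: dL/dlogit_k = p_k - 1[k == target]
    d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Backpropagate through w2 and tanh, whose derivative is 1 - h^2
    d_hidden = [(1.0 - h * h) * sum(d_logits[k] * w2[k][j] for k in range(len(w2)))
                for j, h in enumerate(hidden)]
    for k, row in enumerate(w2):  # output layer update
        for j in range(len(row)):
            row[j] -= lr * d_logits[k] * hidden[j]
        b2[k] -= lr * d_logits[k]
    for j, row in enumerate(w1):  # hidden layer update
        for i in range(len(row)):
            row[i] -= lr * d_hidden[j] * x[i]
        b1[j] -= lr * d_hidden[j]
```

Repeating `sgd_step` over labeled landmark vectors lowers the cross-entropy loss step by step; swapping `math.tanh` for `relu` or `sigmoid` changes only the hidden activation and its derivative.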
Comparative Analysis
• Traditional Methods: CNNs/RNNs are used for gesture recognition but lack
real-time efficiency (Accuracy: 80-85%, Latency: 200-300ms, FPS: 5-10).
• Proposed Approach: MLP with Mediapipe enables faster, real-time recognition
(Accuracy: 92-95%, Latency: 50-100ms, FPS: 25-30).
• Static Image Processing: Lower accuracy due to limited adaptability (Accuracy:
75-80%).
• Our System: Uses real-time skeletal tracking for better precision (Tracking
Precision: 95-98%).
• Prior Challenges: Traditional methods struggle with complex gestures and
occlusions (Error Rate: 15-20%).
• Our Solution: Holistic tracking ensures robust recognition (Error Rate Reduced
to 5-7%).
• Existing Limitations: High processing delays hinder real-time use (Inference
Time: 200-300ms).
• Our Advantage: Low-latency execution optimized for real-world applications
(Inference Time: 50-100ms).
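Latency and FPS figures of this kind could be measured with a timing loop such as the one below. It is a sketch under the assumption that frames are processed strictly one after another; `benchmark` and its parameters are hypothetical names, and the callable stands in for a full capture, landmark detection, and classification step.

```python
import time
from typing import Callable, Dict


def benchmark(process_frame: Callable[[], object], n_frames: int = 50) -> Dict[str, float]:
    """Time n_frames sequential calls; report mean latency (ms) and throughput (FPS)."""
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "latency_ms": 1000.0 * elapsed / n_frames,
        "fps": n_frames / elapsed,
    }


# Stand-in workload; a real run would call the capture + detect + classify step.
stats = benchmark(lambda: sum(range(1000)), n_frames=20)
```

In a strictly sequential loop the two numbers are tied together by FPS = 1000 / latency in milliseconds, so reporting both is mainly informative when stages overlap or run in a pipeline.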
Results
Conclusion
• Implementing real-time 3D gesture and traffic sign recognition using
MLP with Mediapipe enhances accuracy, efficiency, and adaptability.
• The system provides valuable applications in traffic management,
assistive communication, and surveillance.
• Integration of deep learning and computer vision ensures real-time
processing and precise recognition.
• Future improvements include expanding datasets, refining models,
and enhancing real-time capabilities.
Future work
• Expanded Gesture Dataset: Incorporate more complex gestures for
broader recognition.
• Voice-Enabled Interaction: Integrate speech recognition for
multimodal communication.
• Real-World Deployment: Implement in traffic control, smart homes,
and assistive technologies.
• Improved Adaptability: Enhance system performance under
different environmental conditions.
References
[1] X. Liu and G. Zhao, "3D Skeletal Gesture Recognition via Discriminative Coding on
Time-Warping Invariant Riemannian Trajectories," in IEEE Transactions on Multimedia,
vol. 23, pp. 1841-1854, 2021, doi: 10.1109/TMM.2020.3003783.
[2] Flores-Calero, M.; Astudillo, C.A.; Guevara, D.; Maza, J.; Lita, B.S.; Defaz, B.; Ante,
J.S.; Zabala-Blanco, D.; Armingol Moreno, J.M. Traffic Sign Detection and Recognition
Using YOLO Object Detection Algorithm: A Systematic Review. Mathematics 2024, 12,
297. https://doi.org/10.3390/math12020297
[3] Xiong, X.; Wu, H.; Min, W.; Xu, J.; Fu, Q.; Peng, C. Traffic Police Gesture Recognition
Based on Gesture Skeleton Extractor and Multichannel Dilated Graph Convolution
Network. Electronics 2021, 10, 551. https://doi.org/10.3390/electronics10050551
[4] Sathya, R.; Geetha, M. Vision based Traffic Police Hand Signal Recognition
in Surveillance Video - A Survey. International Journal of Computer Applications 2013,
81, 1-10. https://doi.org/10.5120/14037-2192
[5] D. -T. Pham, Q. -T. Pham, T. -L. Le and H. Vu, "An Efficient Feature Fusion of
Graph Convolutional Networks and Its Application for Real-Time Traffic Control
Gestures Recognition," in IEEE Access, vol. 9, pp. 121930-121943, 2021, doi:
10.1109/ACCESS.2021.3109255.
[6] Mishra, A.; Kim, J.; Cha, J.; Kim, D.; Kim, S. Authorized Traffic Controller Hand
Gesture Recognition for Situation-Aware Autonomous Driving. Sensors 2021, 21, 7914.
https://doi.org/10.3390/s21237914
Thank you