📚 Educational Purpose Project for Beginners
EchoID started as an educational project for beginners who want to build a voice recognition model but can't or don't want to write lots of boilerplate code, and who don't have much data (which normally means writing even more code for augmentation). This educational project demonstrates binary classification for speaker recognition.
Warning
Educational Purpose Only: This project is designed for learning. It is not intended for production security systems.
- About the Project
- Features
- How It Works
- Installation
- Usage
- Project Structure
- Configuration
- Contributing
- Future Roadmap
- License
- Contact
EchoID is an educational deep learning project designed to help beginners learn voice speaker recognition. This project was created for those who want to build a voice recognition model without writing extensive boilerplate code or having large datasets (data augmentation is built-in to help with limited data). It leverages Convolutional Neural Networks (CNNs) trained on mel-spectrogram representations of audio signals to achieve accurate binary classification - distinguishing between a target speaker and others.
This project is purely for educational purposes.
- 🧠 Deep CNN Architecture: Custom-built CNN model optimized for speaker recognition
- 🎵 Mel-Spectrogram Features: Converts raw audio to frequency-based representations
- 🔊 Advanced Augmentation: Multi-level augmentation (waveform + spectrogram) for robust learning
- 📊 Config-Driven: Fully configurable via YAML for easy experimentation
- 🎤 Real-Time Inference: GUI-based live speaker recognition
- 📈 Metrics Tracking: Comprehensive evaluation with accuracy, precision, recall, F1-score, and ROC-AUC
- 🎼 Audio Chunking: Split long recordings into uniform 3-second segments
- 📚 Dataset Loading: Efficient batch-wise audio loading with automatic train-test split (80/20)
- 🔄 Waveform Augmentation (see the code sketch after this feature list):
- Gaussian noise injection
- Pitch shifting (-3 to +3 semitones)
- Amplitude scaling (0.6x to 1.2x)
- 🎶 Mel-Spectrogram Conversion: Transform waveforms to 64x188 mel-spectrograms
- 🎨 Mel-Level Augmentation: SpecAugment and VTLP for enhanced generalization
- 🏗️ Dynamic Model Builder: Config-driven CNN architecture construction
- ⚡ Smart Callbacks:
- Early stopping to prevent overfitting
- Learning rate reduction on plateau
- 📊 Comprehensive Metrics: Track accuracy, precision, recall, F1-score, and ROC-AUC
- 💾 Version Control: Automatic model versioning and checkpointing
- 🎯 Real-Time Recognition: Live audio recording and prediction via GUI
- 🔇 Voice Activity Detection (VAD): Energy-based VAD for robust inference
- ⚙️ Configurable Thresholds: Adjust confidence levels for predictions
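To make the waveform-level and mel-level steps above concrete, here is a minimal, self-contained sketch using librosa and NumPy. It is not the project's AudioAugmentor or WaveformToMel code, just an illustration; the hop_length of 256 is an assumption chosen so that a 3-second clip at 16 kHz yields 188 frames.
import numpy as np
import librosa

SR = 16000  # sample rate used throughout the project

def augment_waveform(y, rng=np.random.default_rng()):
    # Gaussian noise injection (0.001-0.01 factor)
    y = y + rng.uniform(0.001, 0.01) * rng.standard_normal(len(y))
    # Pitch shift by -3 to +3 semitones
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=rng.uniform(-3, 3))
    # Amplitude scaling (0.6x to 1.2x)
    return y * rng.uniform(0.6, 1.2)

def waveform_to_mel(y, n_mels=64, hop_length=256):
    # 3 s @ 16 kHz (48,000 samples) -> (64, 188) mel-spectrogram scaled to [0, 1]
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=n_mels, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)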
┌─────────────────────────────────────────────────────────────────┐
│ 1. DATA COLLECTION │
│ └─ Collect audio samples into data/speaker0 & data/speaker1 │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 2. AUDIO CHUNKING (AudioChunker) │
│ └─ Split long recordings into 3-second WAV chunks │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 3. DATASET LOADING (AudioDatasetLoader) │
│ └─ Load audio files and create train/test split (80/20) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 4. WAVEFORM AUGMENTATION (AudioAugmentor) │
│ ├─ Add Gaussian noise (0.001-0.01 factor) │
│ ├─ Pitch shift (±3 semitones) │
│ └─ Amplitude scaling (0.6x-1.2x) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 5. MEL-SPECTROGRAM CONVERSION (WaveformToMel) │
│ └─ Convert waveforms to 64x188 mel-spectrograms │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 6. MEL AUGMENTATION (MelAugmentor) │
│ ├─ SpecAugment (time & frequency masking) │
│ └─ VTLP (Vocal Tract Length Perturbation) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 7. CNN TRAINING (Trainer) │
│ ├─ Build CNN from config (3 Conv2D layers: 32→64→128) │
│ ├─ Train with Early Stopping & LR Reduction │
│ └─ Track metrics: Accuracy, Precision, Recall, F1, ROC-AUC │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 8. MODEL EVALUATION │
│ └─ Evaluate on test set and generate performance reports │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 9. REAL-TIME INFERENCE (InferenceApp) │
│ └─ GUI-based live speaker recognition with VAD │
└─────────────────────────────────────────────────────────────────┘
Input Shape: (64, 188, 1) - Mel-spectrograms normalized to [0, 1]
CNN Architecture: 3 convolutional blocks with progressive filters (32 → 64 → 128)
Training: Binary cross-entropy loss with Adam optimizer
Sample Rate: 16 kHz
Chunk Duration: 3 seconds (48,000 samples @ 16kHz)
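As a rough illustration of what this architecture looks like in code, here is a minimal sketch based on the config values (filters, kernel sizes, pooling, dropout, dense units). It is not the project's model_builder.py verbatim.
import keras
from keras import layers

def build_cnn(input_shape=(64, 188, 1), filters=(32, 64, 128),
              conv_dropouts=(0.25, 0.25, 0.3), dense_units=512, dense_dropout=0.5):
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for f, d in zip(filters, conv_dropouts):
        model.add(layers.Conv2D(f, kernel_size=3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
        model.add(layers.Dropout(d))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dropout(dense_dropout))
    model.add(layers.Dense(1, activation="sigmoid"))  # binary output: target vs. other speaker
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model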
- Python 3.12
- pip package manager
- FFmpeg (required for processing .m4a, .mp3, or other compressed audio formats)
# Clone the repository
git clone https://github.com/Muhd-Uwais/EchoID.git
cd EchoID
# Install required packages
pip install -r requirements.txt
Important
Windows Users: If you are using .m4a or other compressed audio files, you must install FFmpeg and add it to your system PATH. Without it, librosa will raise an error when trying to load your audio files.
Note for Linux/Mac users: If you encounter errors installing sounddevice, you may need system-level dependencies:
# For Debian/Ubuntu
sudo apt-get install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
# For MacOS
brew install portaudio ffmpeg
The main dependencies (from requirements.txt):
keras==3.11.3 # Deep learning framework
tensorflow==2.20.0 # Backend for Keras
librosa==0.11.0 # Audio processing
numpy==2.3.3 # Numerical computing
scikit_learn==1.7.2 # Machine learning utilities
sounddevice==0.5.3 # Audio I/O
soundfile==0.13.1 # Audio file I/O
vad==1.0.2 # Voice activity detection
ruamel.base==0.18.16 # YAML configuration
Collect your raw voice recordings (long files, e.g., 1 minute or more).
Important
Input Format Requirements:
- File Type: Currently, the system only supports .m4a files for the initial chunking process. (Support for .mp3/.wav is coming in v1.1.)
- Naming Convention: You MUST rename your source files sequentially as Voice (1).m4a, Voice (2).m4a, etc.
Organize them in the data/ directory:
data/
├── speaker0/ # Non-target speaker samples (negative class)
│ ├── Voice (1).m4a
│ ├── Voice (2).m4a
│ └── ...
└── speaker1/ # Target speaker samples (positive class)
├── Voice (1).m4a
├── Voice (2).m4a
└── ...
Note: If you have long recordings, proceed to step 2 for chunking.
Run the chunker to split your long .m4a recordings into uniform 3-second .wav segments suitable for training.
Command Syntax:
python chunker.py [speaker0_file_count] [speaker1_file_count]
Example:
If you have Voice (1).m4a to Voice (21).m4a in speaker0 (21 files) and Voice (1).m4a to Voice (23).m4a in speaker1 (23 files):
# Usage: python chunker.py [speaker0 file count] [speaker1 file count]
python chunker.py 21 23

Post-Chunking Step:
The script generates a chunks/ folder inside each speaker directory containing the processed audio. You must organize these files before training:
- Move Files: Transfer all generated .wav files from data/speaker0/chunks/ directly into data/speaker0/. Do the same for speaker1.
- Clean Up: Delete the original .m4a files and the now-empty chunks/ subdirectories.
- Note: The training pipeline requires only the .wav files to be present in the root speaker folders to function correctly.
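For reference, the core idea behind the chunking step looks roughly like this. This is a simplified sketch, not chunker.py itself; the example file path and output names are placeholders.
import os
import librosa
import soundfile as sf

SR, CHUNK_SEC = 16000, 3
CHUNK_LEN = SR * CHUNK_SEC                       # 48,000 samples per 3-second chunk

y, _ = librosa.load("data/speaker1/Voice (1).m4a", sr=SR, mono=True)  # needs FFmpeg for .m4a
os.makedirs("data/speaker1/chunks", exist_ok=True)
for i in range(len(y) // CHUNK_LEN):             # the trailing partial chunk is dropped
    chunk = y[i * CHUNK_LEN:(i + 1) * CHUNK_LEN]
    sf.write(f"data/speaker1/chunks/chunk_{i}.wav", chunk, SR)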
Run the complete training pipeline using main.py:
python main.py

What happens during training:
- Load Dataset: Audio files are loaded and split into train/test sets (80/20)
- Augment Training Data: Waveform augmentation is applied (noise, pitch, amplitude)
- Convert to Mel-Spectrograms: Audio is transformed into mel-spectrogram features
- Apply Mel Augmentation: Additional augmentation on spectrograms
- Train CNN Model: Model trains with early stopping and learning rate reduction (see the callback sketch after this list)
- Evaluate Performance: Test set evaluation with comprehensive metrics
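The early-stopping and learning-rate behaviour maps onto standard Keras callbacks. A sketch using the values from config.yaml (not the project's callbacks.py verbatim):
from keras import callbacks

cbs = [
    callbacks.EarlyStopping(monitor="val_loss", patience=7),
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=4, min_lr=1e-6),
]
# model.fit(x_train, y_train, validation_split=0.2, epochs=20, batch_size=32, callbacks=cbs)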
Training Output:
x_train_mel_aug shape: (40, 32, 64, 188, 1)
y_train_mel_aug shape: (40, 32)
Epoch 1/20
100/100 [==============================] - 45s 450ms/step
...
Accuracy: 0.95 | Precision: 0.94 | Recall: 0.96
Trained models are saved in: models/cnn_model_v1/model_v1.keras
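If you want to reuse a saved model outside the provided scripts, loading and predicting on a single mel-spectrogram could look like this. This is a sketch: the random array stands in for a real (64, 188) mel-spectrogram normalized to [0, 1], and 0.7 is the default confidence threshold from config.yaml.
import numpy as np
import keras

model = keras.models.load_model("models/cnn_model_v1/model_v1.keras")
mel = np.random.rand(64, 188).astype("float32")  # stand-in for a real mel-spectrogram
x = mel[np.newaxis, ..., np.newaxis]             # (64, 188) -> (1, 64, 188, 1)
prob = float(model.predict(x)[0][0])             # sigmoid output: P(target speaker)
print("Target speaker" if prob >= 0.7 else "Other speaker", f"({prob:.1%})")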
Launch the GUI application for live speaker recognition:
python inference.py

GUI Features:
- 🎤 Record Button: Capture 3-second audio clips
- 📊 Real-Time Prediction: Instant speaker identification
- 🎯 Confidence Score: Display prediction probability
- 🔇 VAD Integration: Filter out silence and noise
Example Output:
✅ Target Speaker Detected (92.3%)
❌ Other Speaker Detected (15.7%)
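Under the hood, the recording and energy-based VAD step can be approximated like this. This is a simplified sketch, not the GUI code in src/inference/listener.py; the RMS silence threshold of 0.01 is an assumed value.
import numpy as np
import sounddevice as sd

SR, DURATION = 16000, 3
audio = sd.rec(int(SR * DURATION), samplerate=SR, channels=1, dtype="float32")
sd.wait()                                        # block until the 3-second recording ends
audio = audio.squeeze()

rms = np.sqrt(np.mean(audio ** 2))               # simple energy measure
is_speech = rms >= 0.01                          # treat very low energy as silence
print("Speech detected" if is_speech else "No speech detected")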
EchoID/
├── config.yaml # Global configuration file
├── chunker.py # Audio chunking script
├── main.py # Main training script
├── inference.py # Real-time inference launcher
├── requirements.txt # Python dependencies
│
├── data/ # Audio dataset directory
│ ├── speaker0/ # Non-target speaker samples
│ └── speaker1/ # Target speaker samples
│
├── models/ # Trained model checkpoints
│ └── cnn_model_v1/
│ └── model_v1.keras
│
├── notebooks/ # Jupyter notebooks for experiments
│ ├── audio_preprocessing_experimental.ipynb
│ ├── model_training_experimental.ipynb
│ └── structure_experimental.ipynb
│
└── src/ # Source code modules
├── data/ # Data processing modules
│ ├── audio_chunker.py # Split long recordings
│ ├── dataset_loader.py # Load and batch audio
│ ├── audio_augmentor.py # Waveform augmentation
│ └── mel_processor.py # Mel-spectrogram conversion & augmentation
│
├── models/ # Model architecture & training
│ ├── model_builder.py # Dynamic CNN builder
│ ├── trainer.py # Training pipeline
│ ├── callbacks.py # Keras callbacks
│ └── evaluation.py # Model evaluation metrics
│
├── inference/ # Real-time inference
│ └── listener.py # GUI inference application
│
└── utils/ # Utility functions
├── config_utils.py # Config file parsing
└── metrics_utils.py # Custom metrics
The entire project is configurable via config.yaml. Key parameters:
model:
  input_shape: [64, 188, 1]
  filters: [32, 64, 128]
  dense_units: [512]
  dropout_rates: [0.25, 0.25, 0.3, 0.5]
  max_pool_sizes: [2, 2, 2]
  kernel_sizes: [3, 3, 3]

training:
  batch_size: 32
  epochs: 20
  validation_split: 0.2
  early_stopping:
    enable: true
    patience: 7
    monitor: val_loss
  reduce_lr:
    enable: true
    factor: 0.5
    patience: 4
    min_lr: 1e-6

inference:
  duration: 3           # Chunk duration (seconds)
  sample_rate: 16000    # Audio sample rate (Hz)
  threshold: 0.7        # Confidence threshold
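Reading these values in your own scripts is straightforward. A minimal sketch using the ruamel.yaml API (the project lists a ruamel package in requirements.txt, and its config_utils.py presumably does something similar):
from ruamel.yaml import YAML

yaml = YAML(typ="safe")
with open("config.yaml") as f:
    cfg = yaml.load(f)

print(cfg["model"]["filters"])        # [32, 64, 128]
print(cfg["training"]["epochs"])      # 20
print(cfg["inference"]["threshold"])  # 0.7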
Contributions are welcome and encouraged! Whether you're fixing bugs, improving documentation, or proposing new features, your input is valuable.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
We appreciate contributions in any of these areas:
- 🐛 Bug fixes and error handling improvements
- 📝 Documentation enhancements
- ✨ New feature implementations
- 🧪 Test coverage improvements
- 🎨 Code quality and refactoring
Have ideas but unsure how to implement them? Open an issue to discuss! We're here to help.
- Multi-Class Classification: Extend from binary to multi-speaker recognition (3+ speakers)
- Advanced Augmentation Methods:
- Room impulse response (RIR) simulation
- Background noise mixing
- Speed perturbation
- Codec simulation
- Transfer Learning: Leverage pre-trained models (VGGish, ResNet, wav2vec 2.0)
- Better Model Architecture:
- Attention mechanisms
- Residual connections
- Deeper networks with batch normalization
- Data Pipeline:
- TensorFlow Dataset API integration
- Data caching for faster training
- Cross-Validation: K-fold validation for robust performance estimates
- Confusion Matrix Visualization: Detailed error analysis
- Per-Speaker Performance: Individual speaker accuracy metrics
- Threshold Tuning: ROC curve analysis for optimal thresholds
- Comprehensive Tutorial Series: Step-by-step video tutorials
- Sample Datasets: Public dataset links and benchmarks
- Model Zoo: Pre-trained models for different use cases
- Paper/Blog Post: Detailed technical write-up of methodology
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2024 Muhd Uwais
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files...
Muhd Uwais - Project Author
For feedback or questions, please reach out via the link below:
- 📝 Contact Form: Send Me a Message
- Found a bug? Open an Issue
- Have a question? Start a Discussion
- Want to collaborate? Reach out via the Contact Form
- Suggestions? We'd love to hear them!
If EchoID helped you with your speaker recognition tasks or you found it useful for learning, please consider giving it a ⭐ on GitHub! Your support motivates continued development and helps others discover the project.
Built with ❤️ and 🧠 by an aspiring AI Developer
#DPMG
Discipline • Peace • Myself • Growth
Happy Coding! 🚀