📚 Educational Purpose Project for Beginners
EchoID started as an educational project for beginners who want to build a voice recognition model but can't or don't want to write lots of boilerplate code, and who don't have much data (which normally means writing even more code for augmentation). This educational project demonstrates binary classification for speaker recognition.
Warning
Educational Purpose Only: This project is designed for learning. It is not intended for production security systems.
- About the Project
- Features
- How It Works
- Installation
- Usage
- Project Structure
- Configuration
- Contributing
- Future Roadmap
- License
- Contact
EchoID is an educational deep learning project designed to help beginners learn voice speaker recognition. This project was created for those who want to build a voice recognition model without writing extensive boilerplate code or having large datasets (data augmentation is built-in to help with limited data). It leverages Convolutional Neural Networks (CNNs) trained on mel-spectrogram representations of audio signals to achieve accurate binary classification - distinguishing between a target speaker and others.
This project is purely for educational purposes.
- 🧠 Deep CNN Architecture: Custom-built CNN model optimized for speaker recognition
- 🎵 Mel-Spectrogram Features: Converts raw audio to frequency-based representations
- 🔊 Advanced Augmentation: Multi-level augmentation (waveform + spectrogram) for robust learning
- 📊 Config-Driven: Fully configurable via YAML for easy experimentation
- 🎤 Real-Time Inference: GUI-based live speaker recognition
- 📈 Metrics Tracking: Comprehensive evaluation with accuracy, precision, recall, F1-score, and ROC-AUC
- 🎼 Audio Chunking: Split long recordings into uniform 3-second segments
- 📚 Dataset Loading: Efficient batch-wise audio loading with automatic train-test split (80/20)
- 🔄 Waveform Augmentation (see the code sketch after this feature list):
- Gaussian noise injection
- Pitch shifting (-3 to +3 semitones)
- Amplitude scaling (0.6x to 1.2x)
- 🎶 Mel-Spectrogram Conversion: Transform waveforms to 64x188 mel-spectrograms
- 🎨 Mel-Level Augmentation: SpecAugment and VTLP for enhanced generalization
- 🏗️ Dynamic Model Builder: Config-driven CNN architecture construction
- ⚡ Smart Callbacks:
- Early stopping to prevent overfitting
- Learning rate reduction on plateau
- 📊 Comprehensive Metrics: Track accuracy, precision, recall, F1-score, and ROC-AUC
- 💾 Version Control: Automatic model versioning and checkpointing
- 🎯 Real-Time Recognition: Live audio recording and prediction via GUI
- 🔇 Voice Activity Detection (VAD): Energy-based VAD for robust inference
- ⚙️ Configurable Thresholds: Adjust confidence levels for predictions
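To make the waveform-level and mel-level steps above concrete, here is a minimal, self-contained sketch using librosa and NumPy. It is not the project's AudioAugmentor or WaveformToMel code, just an illustration; the hop_length of 256 is an assumption chosen so that a 3-second clip at 16 kHz yields 188 frames.
import numpy as np
import librosa

SR = 16000  # sample rate used throughout the project

def augment_waveform(y, rng=np.random.default_rng()):
    # Gaussian noise injection (0.001-0.01 factor)
    y = y + rng.uniform(0.001, 0.01) * rng.standard_normal(len(y))
    # Pitch shift by -3 to +3 semitones
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=rng.uniform(-3, 3))
    # Amplitude scaling (0.6x to 1.2x)
    return y * rng.uniform(0.6, 1.2)

def waveform_to_mel(y, n_mels=64, hop_length=256):
    # 3 s @ 16 kHz (48,000 samples) -> (64, 188) mel-spectrogram scaled to [0, 1]
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=n_mels, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)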
┌─────────────────────────────────────────────────────────────────┐
│ 1. DATA COLLECTION │
│ └─ Collect audio samples into data/speaker0 & data/speaker1 │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 2. AUDIO CHUNKING (AudioChunker) │
│ └─ Split long recordings into 3-second WAV chunks │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 3. DATASET LOADING (AudioDatasetLoader) │
│ └─ Load audio files and create train/test split (80/20) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 4. WAVEFORM AUGMENTATION (AudioAugmentor) │
│ ├─ Add Gaussian noise (0.001-0.01 factor) │
│ ├─ Pitch shift (±3 semitones) │
│ └─ Amplitude scaling (0.6x-1.2x) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 5. MEL-SPECTROGRAM CONVERSION (WaveformToMel) │
│ └─ Convert waveforms to 64x188 mel-spectrograms │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 6. MEL AUGMENTATION (MelAugmentor) │
│ ├─ SpecAugment (time & frequency masking) │
│ └─ VTLP (Vocal Tract Length Perturbation) │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 7. CNN TRAINING (Trainer) │
│ ├─ Build CNN from config (3 Conv2D layers: 32→64→128) │
│ ├─ Train with Early Stopping & LR Reduction │
│ └─ Track metrics: Accuracy, Precision, Recall, F1, ROC-AUC │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 8. MODEL EVALUATION │
│ └─ Evaluate on test set and generate performance reports │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ 9. REAL-TIME INFERENCE (InferenceApp) │
│ └─ GUI-based live speaker recognition with VAD │
└─────────────────────────────────────────────────────────────────┘
Input Shape: (64, 188, 1) - Mel-spectrograms normalized to [0, 1]
CNN Architecture: 3 convolutional blocks with progressive filters (32 → 64 → 128)
Training: Binary cross-entropy loss with Adam optimizer
Sample Rate: 16 kHz
Chunk Duration: 3 seconds (48,000 samples @ 16kHz)
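As a rough illustration of what this architecture looks like in code, here is a minimal sketch based on the config values (filters, kernel sizes, pooling, dropout, dense units). It is not the project's model_builder.py verbatim.
import keras
from keras import layers

def build_cnn(input_shape=(64, 188, 1), filters=(32, 64, 128),
              conv_dropouts=(0.25, 0.25, 0.3), dense_units=512, dense_dropout=0.5):
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for f, d in zip(filters, conv_dropouts):
        model.add(layers.Conv2D(f, kernel_size=3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
        model.add(layers.Dropout(d))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dropout(dense_dropout))
    model.add(layers.Dense(1, activation="sigmoid"))  # binary output: target vs. other speaker
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model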
- Python 3.12
- pip package manager
- FFmpeg (required for processing .m4a, .mp3, or other compressed audio formats)
# Clone the repository
git clone https://github.com/Muhd-Uwais/EchoID.git
cd EchoID
# Install required packages
pip install -r requirements.txt
Important
Windows Users: If you are using .m4a or other compressed audio files, you must install FFmpeg and add it to your system PATH. Without it, librosa will raise an error when trying to load your audio files.
Note for Linux/Mac users: If you encounter errors installing sounddevice, you may need system-level dependencies:
# For Debian/Ubuntu
sudo apt-get install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
# For MacOS
brew install portaudio ffmpeg
The main dependencies (from requirements.txt):
keras==3.11.3 # Deep learning framework
tensorflow==2.20.0 # Backend for Keras
librosa==0.11.0 # Audio processing
numpy==2.3.3 # Numerical computing
scikit_learn==1.7.2 # Machine learning utilities
sounddevice==0.5.3 # Audio I/O
soundfile==0.13.1 # Audio file I/O
vad==1.0.2 # Voice activity detection
ruamel.base==0.18.16 # YAML configuration
Collect your raw voice recordings (long files, e.g., 1 minute or more).
Important
Input Format Requirements:
- File Type: Currently, the system only supports .m4a files for the initial chunking process. (Support for .mp3/.wav is coming in v1.1.)
- Naming Convention: You MUST rename your source files sequentially as Voice (1).m4a, Voice (2).m4a, etc.
Organize them in the data/ directory:
data/
├── speaker0/ # Non-target speaker samples (negative class)
│ ├── Voice (1).m4a
│ ├── Voice (2).m4a
│ └── ...
└── speaker1/ # Target speaker samples (positive class)
├── Voice (1).m4a
├── Voice (2).m4a
└── ...
Note: If you have long recordings, proceed to step 2 for chunking.
Run the chunker to split your long .m4a recordings into uniform 3-second .wav segments suitable for training.
Command Syntax:
python chunker.py [speaker0_file_count] [speaker1_file_count]
Example:
If you have Voice (1).m4a to Voice (21).m4a in speaker0 (21 files) and Voice (1).m4a to Voice (23).m4a in speaker1 (23 files):
# Usage: python chunker.py [speaker0 file count] [speaker1 file count]
python chunker.py 21 23

Post-Chunking Step:
The script generates a chunks/ folder inside each speaker directory containing the processed audio. You must organize these files before training:
- Move Files: Transfer all generated .wav files from data/speaker0/chunks/ directly into data/speaker0/. Do the same for speaker1.
- Clean Up: Delete the original .m4a files and the now-empty chunks/ subdirectories.
- Note: The training pipeline requires only the .wav files to be present in the root speaker folders to function correctly.
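For reference, the core idea behind the chunking step looks roughly like this. This is a simplified sketch, not chunker.py itself; the example file path and output names are placeholders.
import os
import librosa
import soundfile as sf

SR, CHUNK_SEC = 16000, 3
CHUNK_LEN = SR * CHUNK_SEC                       # 48,000 samples per 3-second chunk

y, _ = librosa.load("data/speaker1/Voice (1).m4a", sr=SR, mono=True)  # needs FFmpeg for .m4a
os.makedirs("data/speaker1/chunks", exist_ok=True)
for i in range(len(y) // CHUNK_LEN):             # the trailing partial chunk is dropped
    chunk = y[i * CHUNK_LEN:(i + 1) * CHUNK_LEN]
    sf.write(f"data/speaker1/chunks/chunk_{i}.wav", chunk, SR)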
Run the complete training pipeline using main.py:
python main.py

What happens during training:
- Load Dataset: Audio files are loaded and split into train/test sets (80/20)
- Augment Training Data: Waveform augmentation is applied (noise, pitch, amplitude)
- Convert to Mel-Spectrograms: Audio is transformed into mel-spectrogram features
- Apply Mel Augmentation: Additional augmentation on spectrograms
- Train CNN Model: Model trains with early stopping and learning rate reduction (see the callback sketch after this list)
- Evaluate Performance: Test set evaluation with comprehensive metrics
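The early-stopping and learning-rate behaviour maps onto standard Keras callbacks. A sketch using the values from config.yaml (not the project's callbacks.py verbatim):
from keras import callbacks

cbs = [
    callbacks.EarlyStopping(monitor="val_loss", patience=7),
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=4, min_lr=1e-6),
]
# model.fit(x_train, y_train, validation_split=0.2, epochs=20, batch_size=32, callbacks=cbs)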
Training Output:
x_train_mel_aug shape: (40, 32, 64, 188, 1)
y_train_mel_aug shape: (40, 32)
Epoch 1/20
100/100 [==============================] - 45s 450ms/step
...
Accuracy: 0.95 | Precision: 0.94 | Recall: 0.96
Trained models are saved in: models/cnn_model_v1/model_v1.keras
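If you want to reuse a saved model outside the provided scripts, loading and predicting on a single mel-spectrogram could look like this. This is a sketch: the random array stands in for a real (64, 188) mel-spectrogram normalized to [0, 1], and 0.7 is the default confidence threshold from config.yaml.
import numpy as np
import keras

model = keras.models.load_model("models/cnn_model_v1/model_v1.keras")
mel = np.random.rand(64, 188).astype("float32")  # stand-in for a real mel-spectrogram
x = mel[np.newaxis, ..., np.newaxis]             # (64, 188) -> (1, 64, 188, 1)
prob = float(model.predict(x)[0][0])             # sigmoid output: P(target speaker)
print("Target speaker" if prob >= 0.7 else "Other speaker", f"({prob:.1%})")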
Launch the GUI application for live speaker recognition:
python inference.py

GUI Features:
- 🎤 Record Button: Capture 3-second audio clips
- 📊 Real-Time Prediction: Instant speaker identification
- 🎯 Confidence Score: Display prediction probability
- 🔇 VAD Integration: Filter out silence and noise
Example Output:
✅ Target Speaker Detected (92.3%)
❌ Other Speaker Detected (15.7%)
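Under the hood, the recording and energy-based VAD step can be approximated like this. This is a simplified sketch, not the GUI code in src/inference/listener.py; the RMS silence threshold of 0.01 is an assumed value.
import numpy as np
import sounddevice as sd

SR, DURATION = 16000, 3
audio = sd.rec(int(SR * DURATION), samplerate=SR, channels=1, dtype="float32")
sd.wait()                                        # block until the 3-second recording ends
audio = audio.squeeze()

rms = np.sqrt(np.mean(audio ** 2))               # simple energy measure
is_speech = rms >= 0.01                          # treat very low energy as silence
print("Speech detected" if is_speech else "No speech detected")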
EchoID/
├── config.yaml # Global configuration file
├── chunker.py # Audio chunking script
├── main.py # Main training script
├── inference.py # Real-time inference launcher
├── requirements.txt # Python dependencies
│
├── data/ # Audio dataset directory
│ ├── speaker0/ # Non-target speaker samples
│ └── speaker1/ # Target speaker samples
│
├── models/ # Trained model checkpoints
│ └── cnn_model_v1/
│ └── model_v1.keras
│
├── notebooks/ # Jupyter notebooks for experiments
│ ├── audio_preprocessing_experimental.ipynb
│ ├── model_training_experimental.ipynb
│ └── structure_experimental.ipynb
│
└── src/ # Source code modules
├── data/ # Data processing modules
│ ├── audio_chunker.py # Split long recordings
│ ├── dataset_loader.py # Load and batch audio
│ ├── audio_augmentor.py # Waveform augmentation
│ └── mel_processor.py # Mel-spectrogram conversion & augmentation
│
├── models/ # Model architecture & training
│ ├── model_builder.py # Dynamic CNN builder
│ ├── trainer.py # Training pipeline
│ ├── callbacks.py # Keras callbacks
│ └── evaluation.py # Model evaluation metrics
│
├── inference/ # Real-time inference
│ └── listener.py # GUI inference application
│
└── utils/ # Utility functions
├── config_utils.py # Config file parsing
└── metrics_utils.py # Custom metrics
The entire project is configurable via config.yaml. Key parameters:
model:
  input_shape: [64, 188, 1]
  filters: [32, 64, 128]
  dense_units: [512]
  dropout_rates: [0.25, 0.25, 0.3, 0.5]
  max_pool_sizes: [2, 2, 2]
  kernel_sizes: [3, 3, 3]

training:
  batch_size: 32
  epochs: 20
  validation_split: 0.2
  early_stopping:
    enable: true
    patience: 7
    monitor: val_loss
  reduce_lr:
    enable: true
    factor: 0.5
    patience: 4
    min_lr: 1e-6

inference:
  duration: 3           # Chunk duration (seconds)
  sample_rate: 16000    # Audio sample rate (Hz)
  threshold: 0.7        # Confidence threshold
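Reading these values in your own scripts is straightforward. A minimal sketch using the ruamel.yaml API (the project lists a ruamel package in requirements.txt, and its config_utils.py presumably does something similar):
from ruamel.yaml import YAML

yaml = YAML(typ="safe")
with open("config.yaml") as f:
    cfg = yaml.load(f)

print(cfg["model"]["filters"])        # [32, 64, 128]
print(cfg["training"]["epochs"])      # 20
print(cfg["inference"]["threshold"])  # 0.7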
Contributions are welcome and encouraged! Whether you're fixing bugs, improving documentation, or proposing new features, your input is valuable.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
We appreciate contributions in any of these areas:
- 🐛 Bug fixes and error handling improvements
- 📝 Documentation enhancements
- ✨ New feature implementations
- 🧪 Test coverage improvements
- 🎨 Code quality and refactoring
Have ideas but unsure how to implement them? Open an issue to discuss! We're here to help.
- Multi-Class Classification: Extend from binary to multi-speaker recognition (3+ speakers)
- Advanced Augmentation Methods:
- Room impulse response (RIR) simulation
- Background noise mixing
- Speed perturbation
- Codec simulation
- Transfer Learning: Leverage pre-trained models (VGGish, ResNet, wav2vec 2.0)
- Better Model Architecture:
- Attention mechanisms
- Residual connections
- Deeper networks with batch normalization
- Data Pipeline:
- TensorFlow Dataset API integration
- Data caching for faster training
- Cross-Validation: K-fold validation for robust performance estimates
- Confusion Matrix Visualization: Detailed error analysis
- Per-Speaker Performance: Individual speaker accuracy metrics
- Threshold Tuning: ROC curve analysis for optimal thresholds
- Comprehensive Tutorial Series: Step-by-step video tutorials
- Sample Datasets: Public dataset links and benchmarks
- Model Zoo: Pre-trained models for different use cases
- Paper/Blog Post: Detailed technical write-up of methodology
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2024 Muhd Uwais
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files...
Muhd Uwais - Project Author
For feedback or questions, please reach out via the link below:
- 📝 Contact Form: Send Me a Message
- Found a bug? Open an Issue
- Have a question? Start a Discussion
- Want to collaborate? Reach out via the Contact Form
- Suggestions? We'd love to hear them!
If EchoID helped you with your speaker recognition tasks or you found it useful for learning, please consider giving it a ⭐ on GitHub! Your support motivates continued development and helps others discover the project.
Built with ❤️ and 🧠 by an aspiring AI Developer
#DPMG
Discipline • Peace • Myself • Growth
Happy Coding! 🚀