
🎙️ EchoID - Deep Voice Speaker Recognition System


📚 Educational Purpose Project for Beginners
EchoID started as an educational project for beginners who want to build a voice recognition model but don't want to write large amounts of boilerplate code and don't have much data (augmentation is therefore built in, since writing it yourself adds even more code). The project demonstrates binary classification for speaker recognition.

Warning

Educational Purpose Only: This project is designed for learning. It is not intended for production security systems.


Demo Video Placeholder
🎥 A comprehensive walkthrough showing how to use EchoID will be available shortly.


📑 Table of Contents

  • 🎯 About the Project
  • ✨ Features
  • 🔍 How It Works
  • 📦 Installation
  • 🚀 Usage
  • 📂 Project Structure
  • ⚙️ Configuration
  • 🤝 Contributing
  • 🚧 Future Roadmap
  • 📄 License
  • 📬 Contact


🎯 About the Project

EchoID is an educational deep learning project designed to help beginners learn voice speaker recognition. It was created for those who want to build a voice recognition model without writing extensive boilerplate code or collecting large datasets (data augmentation is built in to help with limited data). It trains Convolutional Neural Networks (CNNs) on mel-spectrogram representations of audio signals to perform binary classification: distinguishing a target speaker from everyone else.

This project is intended purely for educational purposes.

Key Highlights:

  • 🧠 Deep CNN Architecture: Custom-built CNN model optimized for speaker recognition
  • 🎵 Mel-Spectrogram Features: Converts raw audio to frequency-based representations
  • 🔊 Advanced Augmentation: Multi-level augmentation (waveform + spectrogram) for robust learning
  • 📊 Config-Driven: Fully configurable via YAML for easy experimentation
  • 🎤 Real-Time Inference: GUI-based live speaker recognition
  • 📈 Metrics Tracking: Comprehensive evaluation with accuracy, precision, recall, F1-score, and ROC-AUC

✨ Features

Data Processing Pipeline

  • 🎼 Audio Chunking: Split long recordings into uniform 3-second segments
  • 📚 Dataset Loading: Efficient batch-wise audio loading with automatic train-test split (80/20)
  • 🔄 Waveform Augmentation:
    • Gaussian noise injection
    • Pitch shifting (-3 to +3 semitones)
    • Amplitude scaling (0.6x to 1.2x)
  • 🎶 Mel-Spectrogram Conversion: Transform waveforms to 64x188 mel-spectrograms
  • 🎨 Mel-Level Augmentation: SpecAugment and VTLP for enhanced generalization
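
As a rough illustration of the waveform and mel steps above, here is a minimal sketch using librosa and numpy. The function names and the hop length are illustrative assumptions, not the project's actual AudioAugmentor / WaveformToMel API:

import numpy as np
import librosa

SR = 16000      # project sample rate
N_MELS = 64     # mel bands, matching the 64x188 input shape

def augment_waveform(y, rng=np.random.default_rng()):
    """Illustrative waveform augmentation: noise, pitch shift, amplitude scaling."""
    y = y + rng.uniform(0.001, 0.01) * rng.standard_normal(len(y))         # Gaussian noise
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=rng.uniform(-3, 3))  # +/-3 semitones
    return y * rng.uniform(0.6, 1.2)                                       # amplitude scaling

def to_mel(y, hop_length=256):
    """Convert a 3-second clip to a (64, 188, 1) log-mel spectrogram scaled to [0, 1]."""
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    mel_db = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)  # normalize to [0, 1]
    return mel_db[..., np.newaxis]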

Model Training

  • 🏗️ Dynamic Model Builder: Config-driven CNN architecture construction
  • ⚡ Smart Callbacks:
    • Early stopping to prevent overfitting
    • Learning rate reduction on plateau
  • 📊 Comprehensive Metrics: Track accuracy, precision, recall, F1-score, and ROC-AUC
  • 💾 Version Control: Automatic model versioning and checkpointing
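
A minimal sketch of what a config-driven builder can look like in Keras, using the model parameters shown later in the Configuration section. The real model_builder.py may structure this differently; in particular, treating the last dropout rate as the dense-layer dropout is just one reading of the config:

import keras
from keras import layers

def build_cnn(cfg):
    """Illustrative config-driven CNN: Conv2D blocks followed by dense layers."""
    inputs = keras.Input(shape=tuple(cfg["input_shape"]))       # e.g. (64, 188, 1)
    x = inputs
    for filters, kernel, pool, drop in zip(cfg["filters"], cfg["kernel_sizes"],
                                           cfg["max_pool_sizes"], cfg["dropout_rates"]):
        x = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool)(x)
        x = layers.Dropout(drop)(x)
    x = layers.Flatten()(x)
    for units in cfg["dense_units"]:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(cfg["dropout_rates"][-1])(x)          # assumed: last rate after dense
    outputs = layers.Dense(1, activation="sigmoid")(x)           # binary classification head
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model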

Inference

  • 🎯 Real-Time Recognition: Live audio recording and prediction via GUI
  • 🔇 Voice Activity Detection (VAD): Energy-based VAD for robust inference
  • ⚙️ Configurable Thresholds: Adjust confidence levels for predictions
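
A minimal sketch of an energy-based VAD check on a recorded clip; the frame length, energy threshold, and activity ratio here are illustrative values, not the project's settings:

import numpy as np

def has_speech(y, frame_len=512, energy_threshold=0.0005, min_active_ratio=0.1):
    """Return True if enough frames exceed an RMS energy threshold (illustrative values)."""
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))     # per-frame RMS energy
    active = np.mean(rms > energy_threshold)        # fraction of "loud" frames
    return active >= min_active_ratio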

🔍 How It Works

Complete Workflow

┌─────────────────────────────────────────────────────────────────┐
│  1. DATA COLLECTION                                             │
│     └─ Collect audio samples into data/speaker0 & data/speaker1 │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  2. AUDIO CHUNKING (AudioChunker)                               │
│     └─ Split long recordings into 3-second WAV chunks           │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  3. DATASET LOADING (AudioDatasetLoader)                        │
│     └─ Load audio files and create train/test split (80/20)     │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  4. WAVEFORM AUGMENTATION (AudioAugmentor)                      │
│     ├─ Add Gaussian noise (0.001-0.01 factor)                   │
│     ├─ Pitch shift (±3 semitones)                               │
│     └─ Amplitude scaling (0.6x-1.2x)                            │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  5. MEL-SPECTROGRAM CONVERSION (WaveformToMel)                  │
│     └─ Convert waveforms to 64x188 mel-spectrograms             │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  6. MEL AUGMENTATION (MelAugmentor)                             │
│     ├─ SpecAugment (time & frequency masking)                   │
│     └─ VTLP (Vocal Tract Length Perturbation)                   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  7. CNN TRAINING (Trainer)                                      │
│     ├─ Build CNN from config (3 Conv2D layers: 32→64→128)       │
│     ├─ Train with Early Stopping & LR Reduction                 │
│     └─ Track metrics: Accuracy, Precision, Recall, F1, ROC-AUC  │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  8. MODEL EVALUATION                                            │
│     └─ Evaluate on test set and generate performance reports    │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│  9. REAL-TIME INFERENCE (InferenceApp)                          │
│     └─ GUI-based live speaker recognition with VAD              │
└─────────────────────────────────────────────────────────────────┘

Technical Details

Input Shape: (64, 188, 1) - Mel-spectrograms normalized to [0, 1]
CNN Architecture: 3 convolutional blocks with progressive filters (32 → 64 → 128)
Training: Binary cross-entropy loss with Adam optimizer
Sample Rate: 16 kHz
Chunk Duration: 3 seconds (48,000 samples @ 16kHz)
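
The 188 time frames follow from the STFT hop size. Assuming a hop length of 256 samples with librosa's default centered framing (the actual value lives in config.yaml):

sample_rate, duration, hop_length = 16000, 3, 256       # hop_length is an assumed value
n_frames = 1 + (sample_rate * duration) // hop_length   # 1 + 48000 // 256 = 188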


📦 Installation

Prerequisites

  • Python 3.12
  • pip package manager
  • FFmpeg (Required for processing .m4a, .mp3, or other compressed audio formats)

Install Dependencies

# Clone the repository
git clone https://github.com/Muhd-Uwais/EchoID.git
cd EchoID

# Install required packages
pip install -r requirements.txt

System-Level Dependencies (Important)

Important

Windows Users: If you are using .m4a or other compressed audio files, you must install FFmpeg and add it to your system PATH. Without it, librosa will raise an error when trying to load your audio files.

Note for Linux/Mac users: If you encounter errors installing sounddevice, you may need system-level dependencies:

# For Debian/Ubuntu
sudo apt-get install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg

# For MacOS
brew install portaudio ffmpeg

Dependencies Overview

keras==3.11.3          # Deep learning framework
tensorflow==2.20.0     # Backend for Keras
librosa==0.11.0        # Audio processing
numpy==2.3.3           # Numerical computing
scikit_learn==1.7.2    # Machine learning utilities
sounddevice==0.5.3     # Audio I/O
soundfile==0.13.1      # Audio file I/O
vad==1.0.2             # Voice activity detection
ruamel.yaml==0.18.16   # YAML configuration

🚀 Usage

1. Data Preparation

Collect your raw voice recordings (long files, e.g., 1 minute or more).

Important

Input Format Requirements:

  1. File Type: Currently, the system only supports .m4a files for the initial chunking process. (Support for .mp3/.wav coming in v1.1)
  2. Naming Convention: You MUST rename your source files sequentially as Voice (1).m4a, Voice (2).m4a, etc.

Organize them in the data/ directory:

data/
├── speaker0/          # Non-target speaker samples (negative class)
│   ├── Voice (1).m4a
│   ├── Voice (2).m4a
│   └── ...
└── speaker1/          # Target speaker samples (positive class)
    ├── Voice (1).m4a
    ├── Voice (2).m4a
    └── ...

Note: If you have long recordings, proceed to step 2 for chunking.


2. Audio Chunking

Run the chunker to split your long .m4a recordings into uniform 3-second .wav segments suitable for training.

Command Syntax: python chunker.py [speaker0_file_count] [speaker1_file_count]

Example: If you have Voice (1).m4a to Voice (21).m4a in speaker0 (21 files) and Voice (1).m4a to Voice (23).m4a in speaker1 (23 files):

# Usage: python chunker.py [speaker0 file count] [speaker1 file count]
python chunker.py 21 23

Post-Chunking Step: The script generates a chunks/ folder inside each speaker directory containing the processed audio. You must organize these files before training:

  1. Move Files: Transfer all generated .wav files from data/speaker0/chunks/ directly into data/speaker0/. Do the same for speaker1.
  2. Clean Up: Delete the original .m4a files and the now-empty chunks/ subdirectories.
    • Note: The training pipeline requires only the .wav files to be present in the root speaker folders to function correctly.
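
If you prefer to script the move and cleanup, here is a minimal sketch using only the Python standard library; it assumes the default data/speaker0 and data/speaker1 layout:

import shutil
from pathlib import Path

for speaker in ("speaker0", "speaker1"):
    speaker_dir = Path("data") / speaker
    chunks_dir = speaker_dir / "chunks"
    if not chunks_dir.is_dir():
        continue                                   # nothing to do for this speaker
    for wav in chunks_dir.glob("*.wav"):           # move generated chunks up one level
        shutil.move(str(wav), str(speaker_dir / wav.name))
    chunks_dir.rmdir()                             # remove the now-empty chunks/ folder
    for m4a in speaker_dir.glob("*.m4a"):          # delete the original recordings
        m4a.unlink()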

3. Training the Model

Run the complete training pipeline using main.py:

python main.py

What happens during training:

  1. Load Dataset: Audio files are loaded and split into train/test sets (80/20)
  2. Augment Training Data: Waveform augmentation is applied (noise, pitch, amplitude)
  3. Convert to Mel-Spectrograms: Audio is transformed into mel-spectrogram features
  4. Apply Mel Augmentation: Additional augmentation on spectrograms
  5. Train CNN Model: Model trains with early stopping and learning rate reduction
  6. Evaluate Performance: Test set evaluation with comprehensive metrics
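
A minimal sketch of how the early-stopping and learning-rate-reduction behaviour can be wired up with standard Keras callbacks, using the values from the Training Parameters section below; the project's callbacks.py may build these differently:

import keras

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=7,
                                  restore_best_weights=True),   # restore_best_weights is an assumption
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                      patience=4, min_lr=1e-6),
]

# model.fit(x_train, y_train, validation_split=0.2, epochs=20,
#           batch_size=32, callbacks=callbacks)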

Training Output:

x_train_mel_aug shape: (40, 32, 64, 188, 1)
y_train_mel_aug shape: (40, 32)

Epoch 1/20
100/100 [==============================] - 45s 450ms/step
...
Accuracy: 0.95 | Precision: 0.94 | Recall: 0.96

Trained models are saved in: models/cnn_model_v1/model_v1.keras


4. Real-Time Inference

Launch the GUI application for live speaker recognition:

python inference.py

GUI Features:

  • 🎤 Record Button: Capture 3-second audio clips
  • 📊 Real-Time Prediction: Instant speaker identification
  • 🎯 Confidence Score: Display prediction probability
  • 🔇 VAD Integration: Filter out silence and noise

Example Output:

✅ Target Speaker Detected (92.3%)
❌ Other Speaker Detected (15.7%)
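
For reference, a minimal sketch of the record-and-predict step behind the GUI, assuming sounddevice and a saved Keras model. The names, hop length, and preprocessing here are illustrative, not the actual listener.py implementation:

import numpy as np
import sounddevice as sd
import librosa
import keras

SR, DURATION, THRESHOLD = 16000, 3, 0.7

model = keras.saving.load_model("models/cnn_model_v1/model_v1.keras")

audio = sd.rec(int(SR * DURATION), samplerate=SR, channels=1, dtype="float32")
sd.wait()                                           # block until the 3-second clip is captured
y = audio.squeeze()

mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=64, hop_length=256)  # hop is assumed
mel_db = librosa.power_to_db(mel, ref=np.max)
mel_db = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)      # normalize to [0, 1]

prob = float(model.predict(mel_db[np.newaxis, ..., np.newaxis], verbose=0)[0, 0])
label = "Target Speaker" if prob >= THRESHOLD else "Other Speaker"
print(f"{label} Detected ({prob:.1%})")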

📂 Project Structure

EchoID/
├── config.yaml                    # Global configuration file
├── chunker.py                     # Audio chunking script
├── main.py                        # Main training script
├── inference.py                   # Real-time inference launcher
├── requirements.txt               # Python dependencies
│
├── data/                          # Audio dataset directory
│   ├── speaker0/                  # Non-target speaker samples
│   └── speaker1/                  # Target speaker samples
│
├── models/                        # Trained model checkpoints
│   └── cnn_model_v1/             
│       └── model_v1.keras
│
├── notebooks/                     # Jupyter notebooks for experiments
│   ├── audio_preprocessing_experimental.ipynb
│   ├── model_training_experimental.ipynb
│   └── structure_experimental.ipynb
│
└── src/                           # Source code modules
    ├── data/                      # Data processing modules
    │   ├── audio_chunker.py       # Split long recordings
    │   ├── dataset_loader.py      # Load and batch audio
    │   ├── audio_augmentor.py     # Waveform augmentation
    │   └── mel_processor.py       # Mel-spectrogram conversion & augmentation
    │
    ├── models/                    # Model architecture & training
    │   ├── model_builder.py       # Dynamic CNN builder
    │   ├── trainer.py             # Training pipeline
    │   ├── callbacks.py           # Keras callbacks
    │   └── evaluation.py          # Model evaluation metrics
    │
    ├── inference/                 # Real-time inference
    │   └── listener.py            # GUI inference application
    │
    └── utils/                     # Utility functions
        ├── config_utils.py        # Config file parsing
        └── metrics_utils.py       # Custom metrics

⚙️ Configuration

The entire project is configurable via config.yaml. Key parameters:

Model Architecture

model:
  input_shape: [64, 188, 1]
  filters: [32, 64, 128]
  dense_units: [512]
  dropout_rates: [0.25, 0.25, 0.3, 0.5]
  max_pool_sizes: [2, 2, 2]
  kernel_sizes: [3, 3, 3]

Training Parameters

training:
  batch_size: 32
  epochs: 20
  validation_split: 0.2
  
  early_stopping:
    enable: true
    patience: 7
    monitor: val_loss
  
  reduce_lr:
    enable: true
    factor: 0.5
    patience: 4
    min_lr: 1e-6

Inference Settings

inference:
  duration: 3              # Chunk duration (seconds)
  sample_rate: 16000       # Audio sample rate (Hz)
  threshold: 0.7           # Confidence threshold
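
A minimal sketch of reading this file, assuming ruamel.yaml; the project's config_utils.py may expose a different helper:

from ruamel.yaml import YAML

def load_config(path="config.yaml"):
    """Parse the YAML config into plain Python dicts and lists."""
    yaml = YAML(typ="safe")
    with open(path, "r", encoding="utf-8") as f:
        return yaml.load(f)

cfg = load_config()
print(cfg["inference"]["threshold"])   # 0.7 with the example config above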

🤝 Contributing

Contributions are welcome and encouraged! Whether you're fixing bugs, improving documentation, or proposing new features, your input is valuable.

How to Contribute

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contribution Areas

We appreciate contributions in any of these areas:

  • 🐛 Bug fixes and error handling improvements
  • 📝 Documentation enhancements
  • ✨ New feature implementations
  • 🧪 Test coverage improvements
  • 🎨 Code quality and refactoring

Have ideas but unsure how to implement them? Open an issue to discuss! We're here to help.


🚧 Future Roadmap

Planned Enhancements

🎯 Core Features

  • Multi-Class Classification: Extend from binary to multi-speaker recognition (3+ speakers)
  • Advanced Augmentation Methods:
    • Room impulse response (RIR) simulation
    • Background noise mixing
    • Speed perturbation
    • Codec simulation
  • Transfer Learning: Leverage pre-trained models (VGGish, ResNet, wav2vec 2.0)

🏗️ Architecture Improvements

  • Better Model Architecture:
    • Attention mechanisms
    • Residual connections
    • Deeper networks with batch normalization
  • Data Pipeline:
    • TensorFlow Dataset API integration
    • Data caching for faster training

📊 Model Evaluation

  • Cross-Validation: K-fold validation for robust performance estimates
  • Confusion Matrix Visualization: Detailed error analysis
  • Per-Speaker Performance: Individual speaker accuracy metrics
  • Threshold Tuning: ROC curve analysis for optimal thresholds

Community Goals

  • Comprehensive Tutorial Series: Step-by-step video tutorials
  • Sample Datasets: Public dataset links and benchmarks
  • Model Zoo: Pre-trained models for different use cases
  • Paper/Blog Post: Detailed technical write-up of methodology

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Muhd Uwais

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files...

📬 Contact

Muhd Uwais - Project Author

For feedback or questions, please reach out via the link below:

Feedback & Support

  • Found a bug? Open an Issue
  • Have a question? Start a Discussion
  • Want to collaborate? Reach out via the Contact Form
  • Suggestions? We'd love to hear them!

🏆 Star this Project!

If EchoID helped you with your speaker recognition tasks or you found it useful for learning, please consider giving it a ⭐ on GitHub! Your support motivates continued development and helps others discover the project.


Built with ❤️ and 🧠 by an aspiring AI Developer

#DPMG
Discipline • Peace • Myself • Growth

Happy Coding! 🚀
