This repository contains an implementation of the paper Deep Speech 2: End-to-End Speech Recognition and the newly proposed parallel minGRU architecture from Were RNNs All We Needed?, using PyTorch 🔥 and Lightning AI ⚡.
- Gated Recurrent Neural Networks
- Deep Speech 2: End-to-End Speech Recognition
- Were RNNs All We Needed?
- KenLM
- Boosting Sequence Generation Performance with Beam Search Language Model Decoding
Deep Speech 2, published in 2015, was a state-of-the-art ASR model that transcribes speech into text and is trained end-to-end with deep learning.
On the other hand, Were RNNs All We Needed? introduces a new RNN-based architecture with a parallelized version of the minGRU (Minimum Gated Recurrent Unit), aiming to enhance the efficiency of RNNs by reducing the dependency on sequential data processing. This architecture enables faster training and inference, making it potentially more suitable for ASR tasks and other real-time applications.
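The minGRU update is simple enough to sketch. Below is an illustrative NumPy version (not this repository's code) of the recurrence `h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t`, together with an equivalent closed-form scan that removes the step-by-step dependency. The paper performs this scan in log space for numerical stability; the division-based form here is only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, w_z, w_h, h0):
    """minGRU, step by step:
    z_t = sigmoid(x_t @ w_z),  h~_t = x_t @ w_h
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
    """
    h, outs = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ w_z)
        h_tilde = x[t] @ w_h
        h = (1.0 - z) * h + z * h_tilde
        outs.append(h)
    return np.stack(outs)

def min_gru_parallel(x, w_z, w_h, h0):
    """Same recurrence without the sequential loop.
    With a_t = 1 - z_t and b_t = z_t * h~_t, the recurrence
    h_t = a_t * h_{t-1} + b_t unrolls to
    h_t = (prod_{k<=t} a_k) * (h0 + sum_{j<=t} b_j / prod_{k<=j} a_k),
    so all timesteps follow from one cumprod and one cumsum.
    """
    z = sigmoid(x @ w_z)            # (T, d) gates, all computed at once
    h_tilde = x @ w_h
    a, b = 1.0 - z, z * h_tilde
    a_cum = np.cumprod(a, axis=0)   # prod_{k<=t} a_k
    return a_cum * (h0 + np.cumsum(b / a_cum, axis=0))
```

Because the gates and candidate states depend only on `x_t` (not on `h_{t-1}`), every timestep can be computed in parallel, which is what makes minGRU training fast on GPUs.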
- Clone the repository:

  ```bash
  git clone --recursive https://github.com/LuluW8071/Deep-Speech-2.git
  cd Deep-Speech-2
  ```
- Install PyTorch and the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Ensure you have PyTorch and Lightning AI installed.
> [!IMPORTANT]
> Before training, make sure you have placed your Comet ML API key and project name in the environment variable file `.env`.
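The exact variable names depend on how `train.py` loads them; assuming the standard Comet ML environment variables, a `.env` file might look like:

```env
COMET_API_KEY=your_comet_api_key
COMET_PROJECT_NAME=your_project_name
```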
To train the Deep Speech 2 model with the default training configuration, run:

```bash
python3 train.py
```

Customize the training parameters by passing arguments to `train.py` to suit your needs. Refer to the table below to change hyperparameters and training configurations.
| Args | Description | Default Value |
|---|---|---|
| `-g, --gpus` | Number of GPUs per node | `1` |
| `-w, --num_workers` | Number of CPU workers | `8` |
| `-db, --dist_backend` | Distributed backend to use for training | `ddp_find_unused_parameters_true` |
| `--epochs` | Number of total epochs to run | `50` |
| `--batch_size` | Size of the batch | `32` |
| `-lr, --learning_rate` | Learning rate | `2e-4` (0.0002) |
| `--checkpoint_path` | Checkpoint path to resume training from | `None` |
| `--precision` | Precision of the training | `16-mixed` |
```bash
python3 train.py \
  -g 4 \
  -w 8 \
  --epochs 10 \
  --batch_size 64 \
  -lr 2e-5 \
  --precision 16-mixed \
  --checkpoint_path path_to_checkpoint.ckpt
```
The model was trained on the LibriSpeech train sets (100 + 360 + 500 hours) and validated on the LibriSpeech test set (~10.5 hours).
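Validation quality for ASR models is typically reported as word error rate (WER): the word-level Levenshtein distance between the reference transcript and the hypothesis, divided by the reference length. A minimal self-contained sketch (not the repository's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over word tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.33.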
```bibtex
@misc{amodei2015deepspeech2endtoend,
  title={Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
  author={Dario Amodei and Rishita Anubhai and Eric Battenberg and Carl Case and others},
  year={2015},
  url={https://arxiv.org/abs/1512.02595},
}

@inproceedings{Feng2024WereRA,
  title={Were RNNs All We Needed?},
  author={Leo Feng and Frederick Tung and Mohamed Osama Ahmed and Yoshua Bengio and Hossein Hajimirsadegh},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273025630}
}
```