Speaker-Invariant Speech Recognition through Fine-Tuning on Individual-Specific Data with Voice Conversion
Keywords: Automatic Speech Recognition, Speaker-Invariance, Self-Supervised Speech Representation
Abstract: In this paper, we propose a speaker-invariant speech recognition method that fine-tunes a pre-trained model (obtained by self-supervised learning) on a selected subset of data containing speech from one specific individual. This fine-tuning changes the network's behavior, allowing it to focus on information that is important for tasks such as ASR and phoneme recognition while reducing its sensitivity to speaker-specific vocal characteristics. At test time, we recommend employing voice conversion techniques to transform the voices of diverse individuals to match that of the individual used for training.
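The test-time procedure described in the abstract (convert the incoming voice to the training speaker, then recognize) can be illustrated with a minimal toy sketch. Everything here is a hypothetical stand-in and not the paper's implementation: a waveform is a plain list of floats, a speaker's vocal characteristics are modeled as a constant offset, `toy_convert_to_target` plays the role of a voice conversion system, and `toy_recognize` plays the role of a model fine-tuned on a single speaker.

```python
from typing import Callable, List

Waveform = List[float]

# ASSUMPTION: the training speaker's "voice" is modeled as a constant offset.
TARGET_OFFSET = 1.0


def toy_convert_to_target(wave: Waveform) -> Waveform:
    """Toy 'voice conversion': swap the speaker's mean offset for the target's."""
    mean = sum(wave) / len(wave)
    return [x - mean + TARGET_OFFSET for x in wave]


def toy_recognize(wave: Waveform) -> str:
    """Toy recognizer that is only reliable on the target speaker's offset.

    Emits '1' for samples above TARGET_OFFSET and '0' otherwise.
    """
    return "".join("1" if x > TARGET_OFFSET else "0" for x in wave)


def recognize_speaker_invariant(
    wave: Waveform,
    convert: Callable[[Waveform], Waveform],
    recognize: Callable[[Waveform], str],
) -> str:
    """Test-time pipeline: convert to the training speaker, then recognize."""
    return recognize(convert(wave))


# Same zero-mean "linguistic content" spoken by two different toy speakers.
content = [-0.5, 0.5, 0.5, -0.5]
target_speaker = [x + TARGET_OFFSET for x in content]
other_speaker = [x + 3.0 for x in content]

reference = toy_recognize(target_speaker)
without_conversion = toy_recognize(other_speaker)
with_conversion = recognize_speaker_invariant(
    other_speaker, toy_convert_to_target, toy_recognize
)
```

In this toy setting the recognizer's output for the unseen speaker matches the training speaker's output only after conversion. In a real system, `convert` and `recognize` would be replaced by an actual voice conversion model and the fine-tuned self-supervised model.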