An individual's voice can vary dramatically depending on word choice, affect, and other factors. Such intrinsic within-talker variability causes considerable difficulty in distinguishing talkers by their voices, for both humans and machines. For machines, phonetic content variability substantially degrades performance when utterances are short (e.g., < 10 sec). Humans, in contrast, are less influenced by content variability and perform better than machines in such conditions. Hence, understanding which acoustic features are related to human responses, and how, might provide insights for improving machine performance. Yet, little is known about human and machine voice discrimination ability under various kinds of intrinsic within-talker variability.
This dissertation presents studies of the voice discrimination abilities of humans and machines under text, affect, and speaking-style variability. The main focus is on developing a feature set, based on a psychoacoustic model of voice quality, that can be used to improve machine performance and to identify acoustic correlates of human responses. To systematically investigate the effects of within- and between-talker variability, a database was developed at UCLA: more than one hundred female and one hundred male talkers were recorded in various speech styles, including sustained vowels, read sentences, affective speech, and pet-directed speech.
Preliminary experiments indicated that the voice quality feature set (VQual1) was promising both for predicting human responses and for improving automatic speaker verification (ASV) performance, which degraded significantly under text, affect, and/or speaking-style variability. VQual1 was then modified into a second set (VQual2) to better differentiate talkers, leading to further improvements in short-utterance text-independent ASV tasks.
The voice discrimination abilities of humans and machines for very short utterances (~2 sec) under high text and style variability were analyzed using read sentences and pet-directed speech. Humans were more accurate than machines for read-sentence pairs, but the performance gap narrowed for style-mismatched pairs and for perceptually marked talkers. Humans' and machines' decision spaces were weakly correlated, indicating a weak or non-linear relationship between talker representations by humans and machines. For different-talker pairs, however, the responses of the VQual2-based system were highly correlated with human responses. Results also suggested that machines could supplement human decisions for perceptually marked talkers. Additionally, VQual2 was effective in perceived affect recognition, suggesting another application in which voice quality features can contribute to predicting human decisions.