21st Interspeech 2020: Shanghai, China
- Helen Meng, Bo Xu, Thomas Fang Zheng:
21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020. ISCA 2020
Keynote 1
- Janet B. Pierrehumbert:
The cognitive status of simple and complex models.
ASR Neural Network Architectures I
- Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu:
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition. 1-5
- Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin:
SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. 6-10
- Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf:
Contextual RNN-T for Open Domain ASR. 11-15
- Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu Jeong Han, Tao Lei, Tao Ma:
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition. 16-20
- Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo:
Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity. 21-25
- Timo Lohrenz, Tim Fingscheidt:
BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example. 26-30
- Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel:
Relative Positional Encoding for Speech Recognition and Direct Translation. 31-35
- Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka:
Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers. 36-40
- Takashi Fukuda, Samuel Thomas:
Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework. 41-45
- Jinhwan Park, Wonyong Sung:
Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition. 46-50
Multi-Channel Speech Enhancement
- Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao:
Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition. 51-55
- Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu:
Neural Spatio-Temporal Beamformer for Target Speech Separation. 56-60
- Li Li, Kazuhito Koishida, Shoji Makino:
Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis. 61-65
- Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu:
End-to-End Multi-Look Keyword Spotting. 66-70
- Weilong Huang, Jinwei Feng:
Differential Beamforming for Uniform Circular Array with Directional Microphones. 71-75
- Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee:
Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement. 76-80
- Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie:
An End-to-End Architecture of Online Multi-Channel Speech Separation. 81-85
- Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi:
Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation. 86-90
- Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki:
Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation. 91-95
- Yanhui Tu, Jun Du, Lei Sun, Feng Ma, Jia Pan, Chin-Hui Lee:
A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge. 96-100
Speech Processing in the Brain
- Youssef Hmamouche, Laurent Prévot, Magalie Ochs, Thierry Chaminade:
Identifying Causal Relationships Between Behavior and Local Brain Activity During Natural Conversation. 101-105
- Di Zhou, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Zhuo Zhang:
Neural Entrainment to Natural Speech Envelope Based on Subject Aligned EEG Signals. 106-110
- Chongyuan Lian, Tianqi Wang, Mingxiao Gu, Manwa L. Ng, Feiqi Zhu, Lan Wang, Nan Yan:
Does Lexical Retrieval Deteriorate in Patients with Mild Cognitive Impairment? Analysis of Brain Functional Network Will Tell. 111-115
- Zhen Fu, Jing Chen:
Congruent Audiovisual Speech Enhances Cortical Envelope Tracking During Auditory Selective Attention. 116-120
- Lei Wang, Ed X. Wu, Fei Chen:
Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions. 121-124
- Bin Zhao, Jianwu Dang, Gaoyan Zhang, Masashi Unoki:
Cortical Oscillatory Hierarchy for Natural Sentence Processing. 125-129
- Louis ten Bosch, Kimberley Mulder, Lou Boves:
Comparing EEG Analyses with Different Epoch Alignments in an Auditory Lexical Decision Experiment. 130-134
- Tanya Talkar, Sophia Yuditskaya, James R. Williamson, Adam C. Lammert, Hrishikesh Rao, Daniel J. Hannon, Anne T. O'Brien, Gloria Vergara-Diaz, Richard DeLaura, Douglas E. Sturim, Gregory A. Ciccarelli, Ross Zafonte, Jeff Palmer, Paolo Bonato, Thomas F. Quatieri:
Detection of Subclinical Mild Traumatic Brain Injury (mTBI) Through Speech and Gait. 135-139
Speech Signal Representation
- Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv:
Towards Learning a Universal Non-Semantic Representation of Speech. 140-144
- Rajeev Rajan, Aiswarya Vinod Kumar, Ben P. Babu:
Poetic Meter Classification Using i-Vector-MTF Fusion. 145-149
- Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie:
Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism. 150-154
- Na Hu, Berit Janssen, Judith Hanssen, Carlos Gussenhoven, Aoju Chen:
Automatic Analysis of Speech Prosody in Dutch. 155-159
- Adrien Gresse, Mathias Quillot, Richard Dufour, Jean-François Bonastre:
Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting. 160-164
- B. Yegnanarayana, Joseph M. Anand, Vishala Pannala:
Enhancing Formant Information in Spectrographic Display of Speech. 165-169
- Michael Gump, Wei-Ning Hsu, James R. Glass:
Unsupervised Methods for Evaluating Speech Representations. 170-174
- Dung N. Tran, Uros Batricevic, Kazuhito Koishida:
Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments. 175-179
- Amrith Setlur, Barnabás Póczos, Alan W. Black:
Nonlinear ISA with Auxiliary Variables for Learning Speech Representations. 180-184
- Hirotoshi Takeuchi, Kunio Kashino, Yasunori Ohishi, Hiroshi Saruwatari:
Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals. 185-189
Speech Synthesis: Neural Waveform Generation I
- Yang Ai, Zhen-Hua Ling:
Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders. 190-194
- Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu:
FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction. 195-199
- Jinhyeok Yang, Junmo Lee, Young-Ik Kim, Hoon-Young Cho, Injung Kim:
VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network. 200-204
- Hiroki Kanagawa, Yusuke Ijima:
Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition. 205-209
- Po-Chun Hsu, Hung-yi Lee:
WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU. 210-214
- Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber:
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS. 215-219
- Vadim Popov, Stanislav Kamenev, Mikhail A. Kudinov, Sergey Repyevsky, Tasnima Sadekova, Vitalii Bushaev, Vladimir Kryzhanovskiy, Denis Parkhomenko:
Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet. 220-224
- Wei Song, Guanghui Xu, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou:
Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed. 225-229
- Sébastien Le Maguer, Naomi Harte:
Can Auditory Nerve Models Tell us What's Different About WaveNet Vocoded Speech? 230-234
- Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou:
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions. 235-239
- Zhijun Liu, Kuan Chen, Kai Yu:
Neural Homomorphic Vocoder. 240-244
Automatic Speech Recognition for Non-Native Children's Speech
- Roberto Gretter, Marco Matassoni, Daniele Falavigna, Keelan Evanini, Chee Wee Leong:
Overview of the Interspeech TLT2020 Shared Task on ASR for Non-Native Children's Speech. 245-249
- Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen:
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge. 250-254
- Kate M. Knill, Linlin Wang, Yu Wang, Xixin Wu, Mark J. F. Gales:
Non-Native Children's Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems. 255-259
- Hemant Kumar Kathania, Mittul Singh, Tamás Grósz, Mikko Kurimo:
Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children's Speech. 260-264
- Mostafa Ali Shahin, Renée Lu, Julien Epps, Beena Ahmed:
UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children's Speech. 265-268
Speaker Diarization
- Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu:
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. 269-273
- Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Y. Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko:
Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario. 274-278
- Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory:
New Advances in Speaker Diarization. 279-283
- Qingjian Lin, Yu Hou, Ming Li:
Self-Attentive Similarity Measurement Strategies in Speaker Diarization. 284-288
- Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno:
Speaker Attribution with Voice Profiles by Graph-Based Semi-Supervised Learning. 289-293
- Prachi Singh, Sriram Ganapathy:
Deep Self-Supervised Hierarchical Clustering for Speaker Diarization. 294-298
- Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman:
Spot the Conversation: Speaker Diarisation in the Wild. 299-303
Noise Robust and Distant Speech Recognition
- Wangyou Zhang, Yanmin Qian:
Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition. 304-308
- Zhihao Du, Jiqing Han, Xueliang Zhang:
Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition. 309-313
- Antoine Bruguier, Ananya Misra, Arun Narayanan, Rohit Prabhavalkar:
Anti-Aliasing Regularization in Stacking Layers. 314-318
- Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov:
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription. 319-323
- Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe, Yanmin Qian:
End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming. 324-328
- Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas D. Lane, Mohamed Morchid:
Quaternion Neural Networks for Multi-Channel Distant Speech Recognition. 329-333
- Hangting Chen, Pengyuan Zhang, Qian Shi, Zuozhen Liu:
Improved Guided Source Separation Integrated with a Strong Back-End for the CHiME-6 Dinner Party Scenario. 334-338
- Dongmei Wang, Zhuo Chen, Takuya Yoshioka:
Neural Speech Separation Using Spatially Distributed Microphones. 339-343
- Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu:
Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones. 344-348
- Jack Deadman, Jon Barker:
Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset. 349-353
Speech in Multimodality
- Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso:
Toward Silent Paralinguistics: Speech-to-EMG - Retrieving Articulatory Muscle Activity from Speech. 354-358
- Jiaxuan Zhang, Sarah Ita Levitan, Julia Hirschberg:
Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features. 359-363
- Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li:
Multi-Modal Attention for Speech Emotion Recognition. 364-368
- Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, Hongtao Song:
WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition. 369-373
- Ming Chen, Xudong Zhao:
A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. 374-378
- Pengfei Liu, Kun Li, Helen Meng:
Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition. 379-383
- Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram:
Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition. 384-388
- Jeng-Lin Li, Chi-Chun Lee:
Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network. 389-393
- Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li:
Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition. 394-398
Speech, Language, and Multimodal Resources
- Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, Yi Lin:
ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment. 399-403
- Alexander Gutkin, Isin Demirsahin, Oddur Kjartansson, Clara Rivera, Kólá Túbosún:
Developing an Open-Source Corpus of Yoruba Speech. 404-408
- Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, Sunghun Kim:
ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers. 409-413
- Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin:
LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR. 414-418
- Vikram Ramanarayanan:
Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency. 419-423
- Si Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong:
CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment. 424-428
- Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo:
FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics. 429-433
- Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas:
DiPCo - Dinner Party Corpus. 434-436
- Bo Wang, Yue Wu, Niall Taylor, Terry J. Lyons, Maria Liakata, Alejo J. Nevado-Holgado, Kate E. A. Saunders:
Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews. 437-441
- Andreas Kirkedal, Marija Stepanovic, Barbara Plank:
FT Speech: Danish Parliament Speech Corpus. 442-446
Language Recognition
- Raphaël Duroselle, Denis Jouvet, Irina Illina:
Metric Learning Loss Functions to Reduce Domain Mismatch in the x-Vector Space for Language Recognition. 447-451
- Zheng Li, Miao Zhao, Jing Li, Yiming Zhi, Lin Li, Qingyang Hong:
The XMUSPEECH System for the AP19-OLR Challenge. 452-456
- Zheng Li, Miao Zhao, Jing Li, Lin Li, Qingyang Hong:
On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification. 457-461
- Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James R. Glass:
What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information? 462-466
- Matias Lindgren, Tommi Jauhiainen, Mikko Kurimo:
Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets. 467-471
- Aitor Arronte Alvarez, Elsayed Sabry Abdelaal Issa:
Learning Intonation Pattern Embeddings for Arabic Dialect Identification. 472-476
- Badr M. Abdullah, Tania Avgustinova, Bernd Möbius, Dietrich Klakow:
Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages. 477-481
Speech Processing and Analysis
- Noé Tits, Kevin El Haddad, Thierry Dutoit:
ICE-Talk: An Interface for a Controllable Expressive Talking Machine. 482-483
- Mathieu Hu, Laurent Pierron, Emmanuel Vincent, Denis Jouvet:
Kaldi-Web: An Installation-Free, On-Device Speech Recognition System. 484-485
- Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Agape Deng, Arnaud Letondor, Robert O'Regan, Qiru Zhou:
Soapbox Labs Verification Platform for Child Speech. 486-487
- Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Gloria Montoya Gomez, Agape Deng, Arnaud Letondor, Niall Mullally, Adrian Hempel, Robert O'Regan, Qiru Zhou:
SoapBox Labs Fluency Assessment Platform for Child Speech. 488-489
- Baybars Külebi, Alp Öktem, Alex Peiró Lilja, Santiago Pascual, Mireia Farrús:
CATOTRON - A Neural Text-to-Speech System in Catalan. 490-491
- Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, David Pautler, Doug Habberstad, Andrew Cornish, Hardik Kothare, Vignesh Murali, Jackson Liscombe, Dirk Schnelle-Walka, Patrick L. Lange, David Suendermann-Oeft:
Toward Remote Patient Monitoring of Speech, Video, Cognitive and Respiratory Biomarkers Using Multimodal Dialog Technology. 492-493
- Baihan Lin, Xinxin Zhang:
VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch. 494-495
Speech Emotion Recognition I
- Zhao Ren, Jing Han, Nicholas Cummins, Björn W. Schuller:
Enhancing Transferability of Black-Box Adversarial Attacks via Lifelong Learning for Speech Emotion Recognition Models. 496-500
- Han Feng, Sei Ueno, Tatsuya Kawahara:
End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model. 501-505
- Bo-Hao Su, Chun-Min Chang, Yun-Shao Lin, Chi-Chun Lee:
Improving Speech Emotion Recognition Using Graph Attentive Bi-Directional Gated Recurrent Unit Network. 506-510
- Adria Mallol-Ragolta, Nicholas Cummins, Björn W. Schuller:
An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition. 511-515
- Kusha Sridhar, Carlos Busso:
Ensemble of Students Taught by Probabilistic Teachers to Improve Speech Emotion Recognition. 516-520
- Siddique Latif, Muhammad Asim, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn W. Schuller:
Augmenting Generative Adversarial Networks for Speech Emotion Recognition. 521-525
- Vipula Dissanayake, Haimo Zhang, Mark Billinghurst, Suranga Nanayakkara:
Speech Emotion Recognition 'in the Wild' Using an Autoencoder. 526-530
- Shuiyang Mao, Pak-Chung Ching, Tan Lee:
Emotion Profile Refinery for Speech Emotion Classification. 531-535
- Sung-Lin Yeh, Yun-Shao Lin, Chi-Chun Lee:
Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation. 536-540
ASR Neural Network Architectures and Training I
- Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu:
Fast and Slow Acoustic Model. 541-545
- Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix:
Self-Distillation for Improving CTC-Transformer-Based ASR Systems. 546-550
- Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury:
Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard. 551-555
- Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno:
Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection. 556-560
- Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur:
PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR. 561-565
- Keyu An, Hongyu Xiang, Zhijian Ou:
CAT: A CTC-CRF Based ASR Toolkit Bridging the Hybrid and the End-to-End Approaches Towards Data Efficiency and Low Latency. 566-570
- Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara:
CTC-Synchronous Training for Monotonic Attention Model. 571-575
- Brady Houston, Katrin Kirchhoff:
Continual Learning for Multi-Dialect Acoustic Models. 576-580
- Xingcheng Song, Zhiyong Wu, Yiheng Huang, Dan Su, Helen Meng:
SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition. 581-585
Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation
- Adriana Stan:
RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications. 586-590
- Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Françoise Beaufays:
Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer. 591-595
- Zhe Liu, Fuchun Peng:
Statistical Testing on ASR Performance via Blockwise Bootstrap. 596-600
- Anil Ramakrishna, Shrikanth Narayanan:
Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations. 601-605
- Kai Fan, Bo Li, Jiayi Wang, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan:
Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System. 606-610
- Alejandro Woodward, Clara Bonnín, Issey Masuda, David Varas, Elisenda Bou-Balust, Juan Carlos Riveiro:
Confidence Measures in Encoder-Decoder Models for Speech Recognition. 611-615
- Ahmed Ali, Steve Renals:
Word Error Rate Estimation Without ASR Output: e-WER2. 616-620
- Bogdan Ludusan, Petra Wagner:
An Evaluation of Manual and Semi-Automatic Laughter Annotation. 621-625
- Joshua L. Martin, Kevin Tang:
Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual "be". 626-630
Phonetics and Phonology
- Georgia Zellou, Rebecca Scarborough, Renee Kemp:
Secondary Phonetic Cues in the Production of the Nasal Short-a System in California English. 631-635
- Louis-Marie Lorin, Lorenzo Maselli, Léo Varnet, Maria Giavazzi:
Acoustic Properties of Strident Fricatives at the Edges: Implications for Consonant Discrimination. 636-640
- Mingqiong Luo:
Processes and Consequences of Co-Articulation in Mandarin V1N.(C2)V2 Context: Phonology and Phonetics. 641-645
- Yang Yue, Fang Hu:
Voicing Distinction of Obstruents in the Hangzhou Wu Chinese Dialect. 646-650
- Lei Wang:
The Phonology and Phonetics of Kaifeng Mandarin Vowels. 651-655
- Margaret Zellers, Barbara Schuppler:
Microprosodic Variability in Plosives in German and Austrian German. 656-660
- Jing Huang, Feng-fan Hsieh, Yueh-Chin Chang:
Er-Suffixation in Southwestern Mandarin: An EMA and Ultrasound Study. 661-665
- Yinghao Li, Jinghua Zhang:
Electroglottographic-Phonetic Study on Korean Phonation Induced by Tripartite Plosives in Yanbian Korean. 666-670
- Nicholas Wilkins, Max Cordes Galbraith, Ifeoma Nwogu:
Modeling Global Body Configurations in American Sign Language. 671-675
Topics in ASR I
- Hang Li, Siyuan Chen, Julien Epps:
Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation. 676-680
- Weiyi Lu, Yi Xu, Peng Yang, Belinda Zeng:
CAM: Uninteresting Speech Detector. 681-685
- Diamantino Caseiro, Pat Rondon, Quoc-Nam Le The, Petar S. Aleksic:
Mixed Case Contextual ASR Using Capitalization Masks. 686-690
- Huanru Henry Mao, Shuyang Li, Julian J. McAuley, Garrison W. Cottrell:
Speech Recognition and Multi-Speaker Diarization of Long Conversations. 691-695
- Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng:
Investigation of Data Augmentation Techniques for Disordered Speech Recognition. 696-700
- Wenqi Wei, Jianzong Wang, Jiteng Ma, Ning Cheng, Jing Xiao:
A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection. 701-705
- David S. Barbera, Mark A. Huckvale, Victoria Fleming, Emily Upton, Henry Coley-Fisher, Ian Shaw, William H. Latham, Alexander P. Leff, Jenny Crinion:
An Utterance Verification System for Word Naming Therapy in Aphasia. 706-710
- Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng:
Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition. 711-715
- Binghuai Lin, Liyuan Wang:
Joint Prediction of Punctuation and Disfluency in Speech Transcripts. 716-720
- Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Ye Bai, Cunhang Fan:
Focal Loss for Punctuation Prediction. 721-725
Large-Scale Evaluation of Short-Duration Speaker Verification
- Zhuxin Chen, Yue Lin:
Improving X-Vector and PLDA for Text-Dependent Speaker Verification. 726-730
- Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukás Burget:
SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification. 731-735
- Tao Jiang, Miao Zhao, Lin Li, Qingyang Hong:
The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020. 736-740
- Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim:
Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020. 741-745
- Tanel Alumäe, Jörgen Valk:
The TalTech Systems for the Short-Duration Speaker Verification Challenge 2020. 746-750
- Peng Shen, Xugang Lu, Hisashi Kawai:
Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020. 751-755
- Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck:
Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization. 756-760
- Alicia Lozano-Diez, Anna Silnova, Bhargav Pulugundla, Johan Rohdin, Karel Veselý, Lukás Burget, Oldrich Plchot, Ondrej Glembek, Ondrej Novotný, Pavel Matejka:
BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020. 761-765
- Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, Abeer Alwan:
Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification. 766-770
Voice Conversion and Adaptation I
- Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai:
Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning. 771-775
- Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna:
Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition. 776-780
- Yanping Li, Dongxiang Xu, Yan Zhang, Yang Wang, Binbin Chen:
Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN. 781-785
- Adam Polyak, Lior Wolf, Yaniv Taigman:
TTS Skins: Speaker Conversion via ASR. 786-790
- Zining Zhang, Bingsheng He, Zhenjie Zhang:
GAZEV: GAN-Based Zero-Shot Voice Conversion Over Non-Parallel Speech Corpus. 791-795
- Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong:
Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation. 796-800
- Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman:
Unsupervised Cross-Domain Singing Voice Conversion. 801-805
- Tatsuma Ishihara, Daisuke Saito:
Attention-Based Speaker Embeddings for One-Shot Voice Conversion. 806-810
- Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan:
Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training. 811-815
Acoustic Event Detection
- Sixin Hong, Yuexian Zou, Wenwu Wang:
Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging. 816-820
- Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang:
Environmental Sound Classification with Parallel Temporal-Spectral Attention. 821-825
- Luyu Wang, Kazuya Kawakami, Aäron van den Oord:
Contrastive Predictive Coding of Audio with an Adversary. 826-830
- Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos:
Memory Controlled Sequential Self Attention for Sound Recognition. 831-835
- Donghyeon Kim, Jaihyun Park, David K. Han, Hanseok Ko:
Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification. 836-840
- Xu Zheng, Yan Song, Jie Yan, Li-Rong Dai, Ian McLoughlin, Lin Liu:
An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection. 841-845
- Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang:
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling. 846-850
- Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun, Chao Wang:
Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging. 851-855
- In Young Park, Hong Kook Kim:
Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification. 856-860
- Amit Jindal, Narayanan Elavathur Ranganatha, Aniket Didolkar, Arijit Ghosh Chowdhury, Di Jin, Ramit Sawhney, Rajiv Ratn Shah:
SpeechMix - Augmenting Deep Sound Recognition Using Hidden Space Interpolations. 861-865
Spoken Language Understanding I
- Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann:
End-to-End Neural Transformer Based Spoken Language Understanding. 866-870
- Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu:
Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding. 871-875
- Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow:
Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces. 876-880
- Pavel Denisov, Ngoc Thang Vu:
Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning. 881-885
- Srikanth Raj Chetupalli, Sriram Ganapathy:
Context Dependent RNNLM for Automatic Transcription of Conversations. 886-890
- Yusheng Tian, Philip John Gorinski:
Improving End-to-End Speech-to-Intent Classification with Reptile. 891-895
- Won-Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim:
Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation. 896-900
- Weitong Ruan, Yaroslav Nechaev, Luoxin Chen, Chengwei Su, Imre Kiss:
Towards an ASR Error Robust Spoken Language Understanding System. 901-905
- Hong-Kwang Jeff Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis A. Lastras:
End-to-End Spoken Language Understanding Without Full Transcripts. 906-910
- Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, Dilek Hakkani-Tür:
Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study. 911-915
DNN Architectures for Speaker Recognition
- Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang:
AutoSpeech: Neural Architecture Search for Speaker Recognition. 916-920
- Ya-Qi Yu, Wu-Jun Li:
Densely Connected Time Delay Neural Network for Speaker Verification. 921-925
- Siqi Zheng, Yun Lei, Hongbin Suo:
Phonetically-Aware Coupled Network For Short Duration Text-Independent Speaker Verification. 926-930
- Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim:
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention. 931-935
- Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu:
Vector-Based Attentive Pooling for Text-Independent Speaker Verification. 936-940
- Pooyan Safari, Miquel India, Javier Hernando:
Self-Attention Encoding and Pooling for Speaker Recognition. 941-945
- Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu:
ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification. 946-950
- Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei:
Adversarial Separation Network for Speaker Recognition. 951-955
- Jingyu Li, Tan Lee:
Text-Independent Speaker Verification with Dual Attention Network. 956-960
- Xiaoyang Qu, Jianzong Wang, Jing Xiao:
Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification. 961-965
ASR Model Training and Strategies
- Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu:
Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition. 966-970 - Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou:
Semantic Mask for Transformer Based End-to-End Speech Recognition. 971-975 - Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig:
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces. 976-980 - Dimitrios Dimitriadis, Ken'ichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez:
A Federated Approach in Training Acoustic Models. 981-985 - Imran A. Sheikh, Emmanuel Vincent, Irina Illina:
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data. 986-990 - Yixin Gao, Noah D. Stein, Chieh-Chi Kao, Yunliang Cai, Ming Sun, Tao Zhang, Shiv Naga Prasad Vitaladevuni:
On Front-End Gain Invariant Modeling for Wake Word Spotting. 991-995 - Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du:
Unsupervised Regularization-Based Adaptive Training for Speech Recognition. 996-1000 - Erfan Loweimi, Peter Bell, Steve Renals:
On the Robustness and Training Dynamics of Raw Waveform Models. 1001-1005 - Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Y. Hannun, Gabriel Synnaeve, Ronan Collobert:
Iterative Pseudo-Labeling for Speech Recognition. 1006-1010
Speech Annotation and Speech Assessment
- Naoko Kawamura, Tatsuya Kitamura, Kenta Hamada:
Smart Tube: A Biofeedback System for Vocal Training and Therapy Through Tube Phonation. 1011-1012 - Seong Choi, Seunghoon Jeong, Jeewoo Yoon, Migyeong Yang, Minsam Ko, Eunil Park, Jinyoung Han, Munyoung Lee, Seonghee Lee:
VCTUBE: A Library for Automatic Speech Data Annotation. 1013-1014 - Yanlu Xie, Xiaoli Feng, Boxue Li, Jinsong Zhang, Yujia Jin:
A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback. 1015-1016 - Tejas Udayakumar, Kinnera Saranu, Mayuresh Sanjay Oak, Ajit Ashok Saunshikhar, Sandip Shriram Bapat:
Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains. 1017-1018 - Ke Shi, Kye Min Tan, Richeng Duan, Siti Umairah Md. Salleh, Nur Farah Ain Suhaimi, Rajan Vellu, Ngoc Thuy Huong Helen Thai, Nancy F. Chen:
Computer-Assisted Language Learning System: Automatic Speech Evaluation for Children Learning Malay and Tamil. 1019-1020 - Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari:
Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU. 1021-1022 - Xiaoli Feng, Yanlu Xie, Yayue Deng, Boxue Li:
A Dynamic 3D Pronunciation Teaching Model Based on Pronunciation Attributes and Anatomy. 1023-1024 - Naoki Kimura, Zixiong Su, Takaaki Saeki:
End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge. 1025-1026
Cross/Multi-Lingual and Code-Switched Speech Recognition
- Jialu Li, Mark Hasegawa-Johnson:
Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous? 1027-1031 - Martha Yifiru Tachbelie, Solomon Teferra Abate, Tanja Schultz:
Development of Multilingual ASR Using GlobalPhone for Less-Resourced Languages: The Case of Ethiopian Languages. 1032-1036 - Wenxin Hou, Yue Dong, Bairong Zhuang, Longfei Yang, Jiatong Shi, Takahiro Shinozaki:
Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning. 1037-1041 - Xinyuan Zhou, Emre Yilmaz, Yanhua Long, Yijie Li, Haizhou Li:
Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition. 1042-1046 - Solomon Teferra Abate, Martha Yifiru Tachbelie, Tanja Schultz:
Multilingual Acoustic and Language Modeling for Ethio-Semitic Languages. 1047-1051 - Yushi Hu, Shane Settle, Karen Livescu:
Multilingual Jointly Trained Acoustic and Written Word Embeddings. 1052-1056 - Chia-Yu Li, Ngoc Thang Vu:
Improving Code-Switching Language Modeling with Artificially Generated Texts Using Cycle-Consistent Adversarial Networks. 1057-1061 - Xinhui Hu, Qi Zhang, Lei Yang, Binbin Gu, Xinkang Xu:
Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods. 1062-1066 - Xinxing Li, Edward Lin:
A 43 Language Multilingual Punctuation Prediction Neural Network Model. 1067-1071 - Jisung Wang, Jihwan Kim, Sangki Kim, Yeha Lee:
Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition. 1072-1075
Anti-Spoofing and Liveness Detection
- Patrick von Platen, Fei Tao, Gökhan Tür:
Multi-Task Siamese Neural Network for Improving Replay Attack Detection. 1076-1080 - Kosuke Akimoto, Seng Pei Liew, Sakiko Mishima, Ryo Mizushima, Kong Aik Lee:
POCO: A Voice Spoofing and Liveness Detection Corpus Based on Pop Noise. 1081-1085 - Hongji Wang, Heinrich Dinkel, Shuai Wang, Yanmin Qian, Kai Yu:
Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection. 1086-1090 - Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu:
Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection. 1091-1095 - Abhijith Girish, Adharsh Sabu, Akshay Prasannan Latha, Rajeev Rajan:
Competency Evaluation in Voice Mimicking Using Acoustic Cues. 1096-1100 - Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li:
Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks. 1101-1105 - Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas W. D. Evans, Massimiliano Todisco:
Spoofing Attack Detection Using the Non-Linear Fusion of Sub-Band Classifiers. 1106-1110 - Prasanth Parasu, Julien Epps, Kaavya Sriskandaraja, Gajan Suthokumar:
Investigating Light-ResNet Architecture for Spoofing Detection Under Mismatched Conditions. 1111-1115 - Zhenchun Lei, Yingen Yang, Changhong Liu, Jihua Ye:
Siamese Convolutional Neural Network Using Gaussian Probability Feature for Spoofing Speech Detection. 1116-1120
Noise Reduction and Intelligibility
- Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Pascal Zobel, Andreas Maier:
Lightweight Online Noise Reduction on Embedded Devices Using Hierarchical Recurrent Neural Networks. 1121-1125 - Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek:
SEANet: A Multi-Modal Speech Enhancement Network. 1126-1130 - Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang:
Lite Audio-Visual Speech Enhancement. 1131-1135 - Christian Bergler, Manuel Schmitt, Andreas Maier, Simeon Smeele, Volker Barth, Elmar Nöth:
ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication. 1136-1140 - Hao Zhang, DeLiang Wang:
A Deep Learning Approach to Active Noise Control. 1141-1145 - Tuan Dinh, Alexander Kain, Kris Tjaden:
Improving Speech Intelligibility Through Speaker Dependent and Independent Spectral Style Conversion. 1146-1150 - Mathias Bach Pedersen, Morten Kolbæk, Asger Heidemann Andersen, Søren Holdt Jensen, Jesper Jensen:
End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks. 1151-1155 - Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino:
Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System. 1156-1160 - Ali Abavisani, Mark Hasegawa-Johnson:
Automatic Estimation of Intelligibility Measure for Consonants in Speech. 1161-1165 - Viet Anh Trinh, Michael I. Mandel:
Large Scale Evaluation of Importance Maps in Automatic Speech Recognition. 1166-1170
Acoustic Scene Classification
- Jixiang Li, Chuming Liang, Bo Zhang, Zhao Wang, Fei Xiang, Xiangxiang Chu:
Neural Architecture Search on Acoustic Scene Classification. 1171-1175 - Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu:
Acoustic Scene Classification Using Audio Tagging. 1176-1180 - Liwen Zhang, Jiqing Han, Ziqiang Shi:
ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification. 1181-1185 - Jivitesh Sharma, Ole-Christoffer Granmo, Morten Goodwin:
Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network. 1186-1190 - Weimin Wang, Weiran Wang, Ming Sun, Chao Wang:
Acoustic Scene Analysis with Multi-Head Attention Networks. 1191-1195 - Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee:
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification. 1196-1200 - Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee:
An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances. 1201-1205 - Dhanunjaya Varma Devalraju, H. Muralikrishna, Padmanabhan Rajan, Dileep Aroor Dinesh:
Attention-Driven Projections for Soundscape Classification. 1206-1210 - Panagiotis Tzirakis, Alexander Shiarella, Robert M. Ewers, Björn W. Schuller:
Computer Audition for Continuous Rainforest Occupancy Monitoring: The Case of Bornean Gibbons' Call Detection. 1211-1215 - Zuzanna Kwiatkowska, Beniamin Kalinowski, Michal Kosmider, Krzysztof Rykaczewski:
Deep Learning Based Open Set Acoustic Scene Classification. 1216-1220
Singing Voice Computing and Processing in Music
- Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman:
Singing Synthesis: With a Little Help from my Attention. 1221-1225 - Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu:
Peking Opera Synthesis via Duration Informed Attention Network. 1226-1230 - Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu:
DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System. 1231-1235 - Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li:
Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music. 1236-1240 - Haohe Liu, Lei Xie, Jian Wu, Geng Yang:
Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music. 1241-1245
Acoustic Model Adaptation for ASR
- Samik Sadhu, Hynek Hermansky:
Continual Learning in Automatic Speech Recognition. 1246-1250 - Genshun Wan, Jia Pan, Qingran Wang, Jianqing Gao, Zhongfu Ye:
Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism. 1251-1255 - Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, Yifan Gong:
Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator. 1256-1260 - Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq R. Joty, Eng Siong Chng, Bin Ma:
Speech Transformer with Speaker Aware Persistent Memory. 1261-1265 - Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du:
Adaptive Speaker Normalization for CTC-Based Speech Recognition. 1266-1270 - Akhil Mathur, Nadia Berthouze, Nicholas D. Lane:
Unsupervised Domain Adaptation Under Label Space Mismatch for Speech Classification. 1271-1275 - Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, Pascale Fung:
Learning Fast Adaptation on Cross-Accented Speech Recognition. 1276-1280 - Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi:
Black-Box Adaptation of ASR for Accented Speech. 1281-1285 - M. A. Tugtekin Turan, Emmanuel Vincent, Denis Jouvet:
Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation. 1286-1290 - Ryu Takeda, Kazunori Komatani:
Frame-Wise Online Unsupervised Adaptation of DNN-HMM Acoustic Model from Perspective of Robust Adaptive Filtering. 1291-1295
Singing and Multimodal Synthesis
- Jie Wu, Jian Luan:
Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. 1296-1300 - JinHong Lu, Hiroshi Shimodaira:
Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder. 1301-1305 - Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou:
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. 1306-1310 - Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde:
Stochastic Talking Face Generation Using Latent Distribution Matching. 1311-1315 - Da-Yi Wu, Yi-Hsuan Yang:
Speech-to-Singing Conversion Based on Boundary Equilibrium GAN. 1316-1320 - Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, Koichiro Mori:
Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image. 1321-1325 - Wentao Wang, Yan Wang, Jianqing Sun, Qingsong Liu, Jiaen Liang, Teng Li:
Speech Driven Talking Head Generation via Attentional Landmarks Based Representation. 1326-1330
Intelligibility-Enhancing Speech Modification
- Marc René Schädler:
Optimization and Evaluation of an Intelligibility-Improving Signal Processing Approach (IISPA) for the Hurricane Challenge 2.0 with FADE. 1331-1335 - Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi:
iMetricGAN: Intelligibility Enhancement for Speech-in-Noise Using Generative Adversarial Network-Based Metric Learning. 1336-1340 - Jan Rennies, Henning F. Schepker, Cassia Valentini-Botinhao, Martin Cooke:
Intelligibility-Enhancing Speech Modifications - The Hurricane Challenge 2.0. 1341-1345 - Olympia Simantiraki, Martin Cooke:
Exploring Listeners' Speech Rate Preferences. 1346-1350 - Felicitas Bederna, Henning F. Schepker, Christian Rollwage, Simon Doclo, Arne Pusch, Jörg Bitzer, Jan Rennies:
Adaptive Compressive Onset-Enhancement for Improved Speech Intelligibility in Noise and Reverberation. 1351-1355 - Carol Chermaz, Simon King:
A Sound Engineering Approach to Near End Listening Enhancement. 1356-1360 - Dipjyoti Paul, P. V. Muhammed Shifas, Yannis Pantazis, Yannis Stylianou:
Enhancing Speech Intelligibility in Text-To-Speech Synthesis Using Speaking Style Conversion. 1361-1365
Human Speech Production I
- Takayuki Arai:
Two Different Mechanisms of Movable Mandible for Vocal-Tract Model with Flexible Tongue. 1366-1370 - Qiang Fang:
Improving the Performance of Acoustic-to-Articulatory Inversion by Removing the Training Loss of Noncritical Portions of Articulatory Channels Dynamically. 1371-1375 - Aravind Illa, Prasanta Kumar Ghosh:
Speaker Conditioned Acoustic-to-Articulatory Inversion Using x-Vectors. 1376-1380 - Zirui Liu, Yi Xu, Feng-fan Hsieh:
Coarticulation as Synchronised Sequential Target Approximation: An EMA Study. 1381-1385 - Jônatas Santos, Jugurta Montalvão, Israel Santos:
Improved Model for Vocal Folds with a Polyp with Potential Application. 1386-1390 - Lin Zhang, Kiyoshi Honda, Jianguo Wei, Seiji Adachi:
Regional Resonance of the Lower Vocal Tract and its Contribution to Speaker Characteristics. 1391-1395 - Renuka Mannem, Navaneetha Gaddam, Prasanta Kumar Ghosh:
Air-Tissue Boundary Segmentation in Real Time Magnetic Resonance Imaging Video Using 3-D Convolutional Neural Network. 1396-1400 - Tilak Purohit, Prasanta Kumar Ghosh:
An Investigation of the Virtual Lip Trajectories During the Production of Bilabial Stops and Nasal at Different Speaking Rates. 1401-1405
Targeted Source Separation
- Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li:
SpEx+: A Complete Time Domain Speaker Extraction Network. 1406-1410 - Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li:
Atss-Net: Target Speaker Separation via Attention-Based Neural Network. 1411-1415 - Leyuan Qu, Cornelius Weber, Stefan Wermter:
Multimodal Target Speech Separation with Voice and Face References. 1416-1420 - Zining Zhang, Bingsheng He, Zhenjie Zhang:
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network. 1421-1425 - Chenda Li, Yanmin Qian:
Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation. 1426-1430 - Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu:
A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments. 1431-1435 - Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki:
Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding. 1436-1440 - Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki:
Listen to What You Want: Neural Network-Based Universal Sound Selector. 1441-1445 - Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada:
Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. 1446-1450 - Jiahao Xu, Kun Hu, Chang Xu, Tran Duc Chung, Zhiyong Wang:
Speaker-Aware Monaural Speech Separation. 1451-1455
Keynote 2
- Barbara G. Shinn-Cunningham:
Brain networks enabling speech perception in everyday settings.
Speech Translation and Multilingual/Multimodal Learning
- Liming Wang, Mark Hasegawa-Johnson:
A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions. 1456-1460 - Maha Elbayad, Laurent Besacier, Jakob Verbeek:
Efficient Wait-k Models for Simultaneous Machine Translation. 1461-1465 - Ha Nguyen, Fethi Bougares, Natalia A. Tomashenko, Yannick Estève, Laurent Besacier:
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation. 1466-1470 - Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi:
Contextualized Translation of Automatically Segmented Speech. 1471-1475 - Juan Miguel Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang:
Self-Training for End-to-End Speech Translation. 1476-1480 - Marcello Federico, Yogesh Virkar, Robert Enyedi, Roberto Barra-Chicote:
Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. 1481-1485 - Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James R. Glass:
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets. 1486-1490 - Anne Wu, Changhan Wang, Juan Miguel Pino, Jiatao Gu:
Self-Supervised Representations Improve End-to-End Speech Translation. 1491-1495
Speaker Recognition I
- Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu:
Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms. 1496-1500 - Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim:
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances. 1501-1505 - Bin Gu, Wu Guo, Fenglin Ding, Zhen-Hua Ling, Jun Du:
An Adaptive X-Vector Model for Text-Independent Speaker Verification. 1506-1510 - Santi Prieto, Alfonso Ortega Giménez, Iván López-Espejo, Eduardo Lleida:
Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions. 1511-1515 - Aaron Nicolson, Kuldip K. Paliwal:
Sum-Product Networks for Robust Automatic Speaker Identification. 1516-1520 - Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu:
Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms. 1521-1525 - Shai Rozenberg, Hagai Aronowitz, Ron Hoory:
Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition. 1526-1529 - Yanpei Shi, Qiang Huang, Thomas Hain:
Speaker Re-Identification with Speaker Dependent Speech Enhancement. 1530-1534 - Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov:
Blind Speech Signal Quality Estimation for Speaker Verification Systems. 1535-1539 - Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng:
Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification. 1540-1544
Spoken Language Understanding II
- Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, Laurent Besacier:
Modeling ASR Ambiguity for Neural Dialogue State Tracking. 1545-1549 - Haoyu Wang, Shuyan Dong, Yue Liu, James Logan, Ashish Kumar Agrawal, Yang Liu:
ASR Error Correction with Augmented Transformer for Entity Retrieval. 1550-1554 - Xueli Jia, Jianzong Wang, Zhiyong Zhang, Ning Cheng, Jing Xiao:
Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding. 1555-1559 - Judith Gaspers, Quynh Ngoc Thi Do, Fabian Triefenbach:
Data Balancing for Boosting Performance of Low-Frequency Classes in Spoken Language Understanding. 1560-1564 - Yu Wang, Yilin Shen, Hongxia Jin:
An Interactive Adversarial Reward Learning-Based Spoken Language Understanding System. 1565-1569 - Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li:
Style Attuned Pre-Training and Parameter Efficient Fine-Tuning for Spoken Language Understanding. 1570-1574 - Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura:
Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training. 1575-1579 - Leda Sari, Mark Hasegawa-Johnson:
Deep F-Measure Maximization for End-to-End Speech Understanding. 1580-1584 - Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, Heuiseok Lim:
An Effective Domain Adaptive Post-Training Method for BERT in Response Selection. 1585-1589 - Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin:
Confidence Measure for Speech-to-Concept End-to-End Spoken Language Understanding. 1590-1594
Human Speech Processing
- Grant L. McGuire, Molly Babel:
Attention to Indexical Information Improves Voice Recall. 1595-1599 - Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier:
Categorization of Whistled Consonants by French Speakers. 1600-1604 - Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier:
Whistled Vowel Identification by French Listeners. 1605-1609 - Maria del Mar Cordero, Fanny Meunier, Nicolas Grimault, Stéphane Pota, Elsa Spinelli:
F0 Slope and Mean: Cues to Speech Segmentation in French. 1610-1614 - Amandine Michelas, Sophie Dufour:
Does French Listeners' Ability to Use Accentual Information at the Word Level Depend on the Ear of Presentation? 1615-1619 - Wen Liu:
A Perceptual Study of the Five Level Tones in Hmu (Xinzhai Variety). 1620-1623 - Zhen Zeng, Karen Mattock, Liquan Liu, Varghese Peter, Alba Tuninetti, Feng-Ming Tsao:
Mandarin and English Adults' Cue-Weighting of Lexical Stress. 1624-1628 - Yan Feng, Gang Peng, William Shi-Yuan Wang:
Age-Related Differences of Tone Perception in Mandarin-Speaking Seniors. 1629-1633 - Georgia Zellou, Michelle Cohn:
Social and Functional Pressures in Vocal Alignment: Differences for Human and Voice-AI Interlocutors. 1634-1638 - Hassan Salami Kavaki, Michael I. Mandel:
Identifying Important Time-Frequency Locations in Continuous Speech Utterances. 1639-1643
Feature Extraction and Distant ASR
- Erfan Loweimi, Peter Bell, Steve Renals:
Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling. 1644-1648 - Purvi Agrawal, Sriram Ganapathy:
Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations. 1649-1653 - Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals:
A Deep 2D Convolutional Network for Waveform-Based Speech Recognition. 1654-1658 - Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll:
Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions. 1659-1663 - Pegah Ghahramani, Hossein Hadian, Daniel Povey, Hynek Hermansky, Sanjeev Khudanpur:
An Alternative to MFCCs for ASR. 1664-1667 - Anirban Dutta, Ashishkumar Prabhakar Gudmalwar, Ch. V. Rama Rao:
Phase Based Spectro-Temporal Features for Building a Robust ASR System. 1668-1672 - Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals:
Deep Scattering Power Spectrum Features for Robust Speech Recognition. 1673-1677 - Titouan Parcollet, Xinchi Qiu, Nicholas D. Lane:
FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition. 1678-1682 - Kshitiz Kumar, Bo Ren, Yifan Gong, Jian Wu:
Bandpass Noise Generation and Augmentation for Unified ASR. 1683-1687 - Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy:
Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition. 1688-1692
Voice Privacy Challenge
- Natalia A. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas W. D. Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco:
Introducing the VoicePrivacy Initiative. 1693-1697 - Andreas Nautsch, Jose Patino, Natalia A. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, Nicholas W. D. Evans:
The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment. 1698-1702 - Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Masashi Unoki:
X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System. 1703-1707 - Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent:
A Comparative Study of Speech Anonymization Metrics. 1708-1712 - Brij Mohan Lal Srivastava, Natalia A. Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi:
Design Choices for X-Vector Based Speaker Anonymization. 1713-1717 - Paul-Gauthier Noé, Jean-François Bonastre, Driss Matrouf, Natalia A. Tomashenko, Andreas Nautsch, Nicholas W. D. Evans:
Speech Pseudonymisation Assessment Using Voice Similarity Matrices. 1718-1722
Speech Synthesis: Text Processing, Data and Evaluation
- Kyubyong Park, Seanie Lee:
g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset. 1723-1727 - Haiteng Zhang, Huashan Pan, Xiulin Li:
A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation. 1728-1732 - Michelle Cohn, Georgia Zellou:
Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes. 1733-1737 - Jason Taylor, Korin Richmond:
Enhancing Sequence-to-Sequence Text-to-Speech with Morphology. 1738-1742 - Yeunju Choi, Youngmoon Jung, Hoirin Kim:
Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling. 1743-1747 - Gabriel Mittag, Sebastian Möller:
Deep Learning Based Assessment of Synthetic Speech Naturalness. 1748-1752 - Jiawen Zhang, Yuanyuan Zhao, Jiaqi Zhu, Jinba Xiao:
Distant Supervision for Polyphone Disambiguation in Mandarin Chinese. 1753-1757 - Pilar Oplustil Gallegos, Jennifer Williams, Joanna Rownicka, Simon King:
An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets. 1758-1762 - Anurag Das, Guanlong Zhao, John Levis, Evgeny Chukharev-Hudilainen, Ricardo Gutierrez-Osuna:
Understanding the Effect of Voice Quality and Accent on Talker Similarity. 1763-1767
Search for Speech Recognition
- Wei Zhou, Ralf Schlüter, Hermann Ney:
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition Without Length Bias. 1768-1772 - Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin:
Transformer with Bidirectional Decoder for Speech Recognition. 1773-1777 - Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, Richard Socher:
An Investigation of Phone-Based Subword Units for End-to-End Speech Recognition. 1778-1782 - Jeremy Heng Meng Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, Jinyu Li, Yifan Gong:
Combination of End-to-End and Hybrid Models for Speech Recognition. 1783-1787 - Jihwan Kim, Jisung Wang, Sangki Kim, Yeha Lee:
Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition. 1788-1792 - Abhinav Garg, Ashutosh Gupta, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim:
Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition. 1793-1797 - Eugen Beck, Ralf Schlüter, Hermann Ney:
LVCSR with Transformer Language Models. 1798-1802 - Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, Hung-yi Lee:
DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation. 1803-1807
Computational Paralinguistics I
- Lukas Stappen, Georgios Rizos, Madina Hasan, Thomas Hain, Björn W. Schuller:
Uncertainty-Aware Machine Support for Paper Reviewing on the Interspeech 2019 Submission Corpus. 1808-1812 - Michelle Cohn, Melina Sarian, Kristin Predeck, Georgia Zellou:
Individual Variation in Language Attitudes Toward Voice-AI: The Role of Listeners' Autistic-Like Traits. 1813-1817 - Michelle Cohn, Eran Raveh, Kristin Predeck, Iona Gessinger, Bernd Möbius, Georgia Zellou:
Differences in Gradient Emotion Perception: Human vs. Alexa Voices. 1818-1822 - Luz Martinez-Lucas, Mohammed Abdelwahab, Carlos Busso:
The MSP-Conversation Corpus. 1823-1827 - Fuxiang Tao, Anna Esposito, Alessandro Vinciarelli:
Spotting the Traces of Depression in Read Speech: An Approach Based on Computational Paralinguistics and Social Signal Processing. 1828-1832 - Yelin Kim, Joshua Levy, Yang Liu:
Speech Sentiment and Customer Satisfaction Estimation in Socialbot Conversations. 1833-1837 - Haley Lepp, Gina-Anne Levow:
Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments. 1838-1842 - Jana Neitsch, Oliver Niebuhr:
Are Germans Better Haters Than Danes? Language-Specific Implicit Prosodies of Types of Hate Speech and How They Relate to Perceived Severity and Societal Rules. 1843-1847 - Fuling Chen, Roberto Togneri, Murray Maybery, Diana Tan:
An Objective Voice Gender Scoring System and Identification of the Salient Acoustic Measures. 1848-1852 - Sadari Jayawardena, Julien Epps, Zhaocheng Huang:
How Ordinal Are Your Data? 1853-1857
Acoustic Phonetics and Prosody
- Vincent Hughes, Frantz Clermont, Philip Harrison:
Correlating Cepstra with Formant Frequencies: Implications for Phonetically-Informed Forensic Voice Comparison. 1858-1862 - Jana Neitsch, Plínio A. Barbosa, Oliver Niebuhr:
Prosody and Breathing: A Comparison Between Rhetorical and Information-Seeking Questions in German and Brazilian Portuguese. 1863-1867 - Rebecca Defina, Catalina Torres, Hywel Stoakes:
Scaling Processes of Clause Chains in Pitjantjatjara. 1868-1872 - Ai Mizoguchi, Ayako Hashimoto, Sanae Matsui, Setsuko Imatomi, Ryunosuke Kobayashi, Mafuyu Kitahara:
Neutralization of Voicing Distinction of Stops in Tohoku Dialects of Japanese: Field Work and Acoustic Measurements. 1873-1877 - Lou Lee, Denis Jouvet, Katarina Bartkova, Yvon Keromnes, Mathilde Dargnat:
Correlation Between Prosody and Pragmatics: Case Study of Discourse Markers in French and English. 1878-1882 - Dina El Zarka, Anneliese Kelterer, Barbara Schuppler:
An Analysis of Prosodic Prominence Cues to Information Structure in Egyptian Arabic. 1883-1887 - Benazir Mumtaz, Tina Bögel, Miriam Butt:
Lexical Stress in Urdu. 1888-1892 - Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan-Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi:
Vocal Markers from Sustained Phonation in Huntington's Disease. 1893-1897 - Laure Dentel, Julien Meyer:
How Rhythm and Timbre Encode Mooré Language in Bendré Drummed Speech. 1898-1902
Keynote 3
- Lin-Shan Lee:
Doing Something we Never could with Spoken Language Technologies-from early days to the era of deep learning.
Tonal Aspects of Acoustic Phonetics and Prosody
- Wendy Lalhminghlui, Priyankoo Sarmah:
Interaction of Tone and Voicing in Mizo. 1903-1907 - Yaru Wu, Martine Adda-Decker, Lori Lamel:
Mandarin Lexical Tones: A Corpus-Based Study of Word Length, Syllable Position and Prosodic Position on Duration. 1908-1912 - Yingming Gao, Xinyu Zhang, Yi Xu, Jinsong Zhang, Peter Birkholz:
An Investigation of the Target Approximation Model for Tone Modeling and Recognition in Continuous Mandarin Speech. 1913-1917 - Wei Lai, Aini Li:
Integrating the Application and Realization of Mandarin 3rd Tone Sandhi in the Resolution of Sentence Ambiguity. 1918-1922 - Zhenrui Zhang, Fang Hu:
Neutral Tone in Changde Mandarin. 1923-1927 - Ping Cui, Jianjing Kuang:
Pitch Declination and Final Lowering in Northeastern Mandarin. 1928-1932 - Phil Rose:
Variation in Spectral Slope and Interharmonic Noise in Cantonese Tones. 1933-1937 - Ping Tang, Shanpeng Li:
The Acoustic Realization of Mandarin Tones in Fast Speech. 1938-1941
Speech Classification
- Anastassia Loukina, Keelan Evanini, Matthew Mulholland, Ian Blood, Klaus Zechner:
Do Face Masks Introduce Bias in Speech Technologies? The Case of Automated Scoring of Speaking Proficiency. 1942-1946 - Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar:
A Low Latency ASR-Free End to End Spoken Language Understanding System. 1947-1951 - Joe Wang, Rajath Kumar, Mike Rodehorst, Brian Kulis, Shiv Naga Prasad Vitaladevuni:
An Audio-Based Wakeword-Independent Verification System. 1952-1956 - Tyler Vuong, Yangyang Xia, Richard M. Stern:
Learnable Spectro-Temporal Receptive Fields for Robust Voice Type Discrimination. 1957-1961 - Shuo-Yiin Chang, Bo Li, David Rybach, Yanzhang He, Wei Li, Tara N. Sainath, Trevor Strohman:
Low Latency Speech Recognition Using End-to-End Prefetching. 1962-1966 - Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie:
AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification. 1967-1971 - Rajath Kumar, Mike Rodehorst, Joe Wang, Jiacheng Gu, Brian Kulis:
Building a Robust Word-Level Wakeword Verification Network. 1972-1976 - Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito:
A Transformer-Based Audio Captioning Model with Keyword Estimation. 1977-1981 - Tong Mo, Yakun Yu, Mohammad Salameh, Di Niu, Shangling Jui:
Neural Architecture Search for Keyword Spotting. 1982-1986 - Ximin Li, Xiaodong Wei, Xiaowei Qin:
Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution. 1987-1991
Speech Synthesis Paradigms and Methods I
- Xin Wang, Junichi Yamagishi:
Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model. 1992-1996 - Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang:
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization. 1997-2001 - Toru Nakashika:
Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra. 2002-2006 - Seungwoo Choi, Seungju Han, Dongyoung Kim, Sungjoo Ha:
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. 2007-2011 - Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim:
Reformer-TTS: Neural Speech Synthesis with Reformer Network. 2012-2016 - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo:
CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion. 2017-2021 - Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis:
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. 2022-2026 - Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu:
DurIAN: Duration Informed Attention Network for Speech Synthesis. 2027-2031 - Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari:
Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes. 2032-2036 - Mano Ranjith Kumar M., Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy:
A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages. 2037-2041
The INTERSPEECH 2020 Computational Paralinguistics ChallengE (ComParE)
- Björn W. Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia F. de C. Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, Harald Baumeister, Alexis Deighton MacIntyre, Simone Hantke:
The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks. 2042-2046 - Tomoya Koike, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto:
Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the COMPARE 2020 Challenge Mask Task. 2047-2051 - Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien:
Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms. 2052-2056 - Philipp Klumpp, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave:
Surgical Mask Detection with Deep Recurrent Phonetic Models. 2057-2061 - Claude Montacié, Marie-José Caraty:
Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge. 2062-2066 - Mariana Julião, Alberto Abad, Helena Moniz:
Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition. 2067-2071 - Maxim Markitantov, Denis Dresvyanskiy, Danila Mamontov, Heysem Kaya, Wolfgang Minker, Alexey Karpov:
Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges. 2072-2076 - John Mendonça, Francisco Teixeira, Isabel Trancoso, Alberto Abad:
Analyzing Breath Signals for the Interspeech 2020 ComParE Challenge. 2077-2081 - Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia F. de C. Hamilton, Björn W. Schuller:
Deep Attentive End-to-End Continuous Breath Sensing from Speech. 2082-2086 - Jeno Szep, Salim Hariri:
Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. 2087-2091 - Ziqing Yang, Zifan An, Zehao Fan, Chengye Jing, Houwei Cao:
Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge. 2092-2096 - Gizem Sogancioglu, Oxana Verkholyak, Heysem Kaya, Dmitrii Fedotov, Tobias Cadèe, Albert Ali Salah, Alexey Karpov:
Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition. 2097-2101 - Nicolae-Catalin Ristea, Radu Tudor Ionescu:
Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. 2102-2106
Streaming ASR
- Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu:
1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM. 2107-2111 - Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li, Guoli Ye, Ming Zhou:
Low Latency End-to-End Streaming Speech Recognition with a Scout Network. 2112-2116 - Gakuto Kurata, George Saon:
Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition. 2117-2121 - Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He:
Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition. 2122-2126 - Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Joan Albert Silvestre-Cerdà, Javier Iranzo-Sánchez, Albert Sanchís, Jorge Civera, Alfons Juan:
Improved Hybrid Streaming ASR with Transformer Language Models. 2127-2131 - Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang:
Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory. 2132-2136 - Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara:
Enhancing Monotonic Multihead Attention for Streaming ASR. 2137-2141 - Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie:
Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition. 2142-2146 - Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stüker, Alex Waibel:
High Performance Sequence-to-Sequence Model for Streaming Speech Recognition. 2147-2151 - Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li:
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System. 2152-2156
Alzheimer's Dementia Recognition Through Spontaneous Speech
- Matej Martinc, Senja Pollak:
Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer's Dementia. 2157-2161 - Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, Kenneth Church:
Disfluencies and Fine-Tuning Pre-Trained Language Models for Detection of Alzheimer's Disease. 2162-2166 - Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova:
To BERT or not to BERT: Comparing Speech and Language-Based Approaches for Alzheimer's Disease Detection. 2167-2171 - Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney:
Alzheimer's Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge. 2172-2176 - Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, Najim Dehak:
Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer's Disease and Assess its Severity. 2177-2181 - Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä:
A Comparison of Acoustic and Linguistics Methodologies for Alzheimer's Dementia Recognition. 2182-2186 - Morteza Rohanian, Julian Hough, Matthew Purver:
Multi-Modal Fusion with Gating Using Audio, Lexical and Disfluency Features for Alzheimer's Dementia Recognition from Spontaneous Speech. 2187-2191 - Thomas Searle, Zina M. Ibrahim, Richard J. B. Dobson:
Comparing Natural Language Processing Techniques for Alzheimer's Dementia Prediction in Spontaneous Speech. 2192-2196 - Erik Edwards, Charles Dognin, Bajibabu Bollepalli, Maneesh Kumar Singh:
Multiscale System for Alzheimer's Dementia Recognition Through Spontaneous Speech. 2197-2201 - Anna Pompili, Thomas Rolland, Alberto Abad:
The INESC-ID Multi-Modal System for the ADReSS 2020 Challenge. 2202-2206 - Shahla Farzana, Natalie Parde:
Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues. 2207-2211 - Utkarsh Sarawgi, Wazeer Zulfikar, Nouran Soliman, Pattie Maes:
Multimodal Inductive Transfer Learning for Detection of Alzheimer's Dementia and its Severity. 2212-2216 - Junghyun Koo, Jie Hwan Lee, Jaewoo Pyo, Yujin Jo, Kyogu Lee:
Exploiting Multi-Modal Features from Pre-Trained Networks for Alzheimer's Dementia Recognition. 2217-2221 - Muhammad Shehram Shah Syed, Zafi Sherhan Syed, Margaret Lech, Elena Pirogova:
Automated Screening for Alzheimer's Dementia Through Spontaneous Speech. 2222-2226
Speaker Recognition Challenges and Applications
- Kong Aik Lee, Koji Okabe, Hitoshi Yamamoto, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Keisuke Ishikawa, Koichi Shinoda:
NEC-TT Speaker Verification System for SRE'19 CTS Challenge. 2227-2231 - Ruyun Li, Tianyu Liang, Dandan Song, Yi Liu, Yangcheng Wu, Can Xu, Peng Ouyang, Xianwei Zhang, Xianhong Chen, Weiqiang Zhang, Shouyi Yin, Liang He:
THUEE System for NIST SRE19 CTS Challenge. 2232-2236 - Grigory Antipov, Nicolas Gengembre, Olivier Le Blouch, Gaël Le Lan:
Automatic Quality Assessment for Audio-Visual Verification Systems. The LOVe Submission to NIST SRE Challenge 2019. 2237-2241 - Ruijie Tao, Rohan Kumar Das, Haizhou Li:
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network. 2242-2246 - Suwon Shon, James R. Glass:
Multimodal Association for Speaker Verification. 2247-2251 - Zhengyang Chen, Shuai Wang, Yanmin Qian:
Multi-Modality Matters: A Performance Leap on VoxCeleb. 2252-2256 - Zhenyu Wang, Wei Xia, John H. L. Hansen:
Cross-Domain Adaptation with Discrepancy Minimization for Text-Independent Forensic Speaker Verification. 2257-2261 - Mufan Sang, Wei Xia, John H. L. Hansen:
Open-Set Short Utterance Forensic Speaker Verification Using Teacher-Student Network with Explicit Inductive Bias. 2262-2266 - Anurag Chowdhury, Austin Cozzo, Arun Ross:
JukeBox: A Multilingual Singer Recognition Dataset. 2267-2271 - Ruirui Li, Jyun-Yu Jiang, Xian Wu, Chu-Cheng Hsieh, Andreas Stolcke:
Speaker Identification for Household Scenarios with Self-Attention and Adversarial Training. 2272-2276
Applications of ASR
- Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, Stella Laurenzo:
Streaming Keyword Spotting on Mobile Devices. 2277-2281 - Hongyi Liu, Apurva Abhyankar, Yuriy Mishchenko, Thibaud Sénéchal, Gengshen Fu, Brian Kulis, Noah D. Stein, Anish Shah, Shiv Naga Prasad Vitaladevuni:
Metadata-Aware End-to-End Keyword Spotting. 2282-2286 - Yehao Kong, Jiliang Zhang:
Adversarial Audio: A New Information Hiding Method. 2287-2291 - Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg:
S2IGAN: Speech-to-Image Generation via Adversarial Learning. 2292-2296 - Juan Zuluaga-Gomez, Petr Motlícek, Qingran Zhan, Karel Veselý, Rudolf A. Braun:
Automatic Speech Recognition Benchmark for Air-Traffic Communications. 2297-2301 - Prithvi R. R. Gudepu, Gowtham P. Vadisetti, Abhishek Niranjan, Kinnera Saranu, Raghava Sarma, M. Ali Basha Shaik, Periyasamy Paramasivam:
Whisper Augmented End-to-End/Hybrid Speech Recognition System - CycleGAN Approach. 2302-2306 - Ramit Sawhney, Arshiya Aggarwal, Piyush Khanna, Puneet Mathur, Taru Jain, Rajiv Ratn Shah:
Risk Forecasting from Earnings Calls Acoustics and Network Correlations. 2307-2311 - Huili Chen, Bita Darvish Rouhani, Farinaz Koushanfar:
SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems. 2312-2316 - Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg:
Evaluating Automatically Generated Phoneme Captions for Images. 2317-2321
Speech Emotion Recognition II
- Wei-Cheng Lin, Carlos Busso:
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks. 2322-2326 - Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn W. Schuller:
Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-Corpus Setting for Speech Emotion Recognition. 2327-2331 - Takuya Fujioka, Takeshi Homma, Kenji Nagamatsu:
Meta-Learning for Speech Emotion Recognition Considering Ambiguity of Emotion Labels. 2332-2336 - Jiaxing Liu, Zhilei Liu, Longbiao Wang, Yuan Gao, Lili Guo, Jianwu Dang:
Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation. 2337-2341 - Zhi Zhu, Yoshinao Sato:
Reconciliation of Multiple Corpora for Speech Emotion Recognition by Multiple Classifiers with an Adversarial Corpus Discriminator. 2342-2346 - Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li:
Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks. 2347-2351 - Shuiyang Mao, P. C. Ching, Tan Lee:
EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification. 2352-2356 - Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo, Tan Lee:
Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition. 2357-2361
Bi- and Multilinguality
- Rubén Pérez Ramón, María Luisa García Lecumberri, Martin Cooke:
The Effect of Language Proficiency on the Perception of Segmental Foreign Accent. 2362-2366 - Yi Liu, Jinghong Ning:
The Effect of Language Dominance on the Selective Attention of Segments and Tones in Urdu-Cantonese Speakers. 2367-2371 - Mengrou Li, Ying Chen, Jie Cui:
The Effect of Input on the Production of English Tense and Lax Vowels by Chinese Learners: Evidence from an Elementary School in China. 2372-2376 - Laura Spinu, Jiwon Hwang, Nadya Pincus, Mariana Vasilita:
Exploring the Use of an Artificial Accent of English to Assess Phonetic Learning in Monolingual and Bilingual Speakers. 2377-2381 - Shammur A. Chowdhury, Younes Samih, Mohamed Eldesouki, Ahmed Ali:
Effects of Dialectal Code-Switching on Speech Modules: A Study Using Egyptian Arabic Broadcast Speech. 2382-2386 - Khia A. Johnson, Molly Babel, Robert A. Fuhrman:
Bilingual Acoustic Voice Variation is Similarly Structured Across Languages. 2387-2391 - Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng:
Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition. 2392-2396 - Dan Du, Xianjin Zhu, Zhu Li, Jinsong Zhang:
Perception and Production of Mandarin Initial Stops by Native Urdu Speakers. 2397-2401 - Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman:
Now You're Speaking My Language: Visual Language Identification. 2402-2406 - Nari Rhee, Jianjing Kuang:
The Different Enhancement Roles of Covarying Cues in Thai and Mandarin Tones. 2407-2411
Single-Channel Speech Enhancement I
- Hao Shi, Longbiao Wang, Sheng Li, Chenchen Ding, Meng Ge, Nan Li, Jianwu Dang, Hiroshi Seki:
Singing Voice Extraction with Attention-Based Spectrograms Fusion. 2412-2416 - Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao:
Incorporating Broad Phonetic Information for Speech Enhancement. 2417-2421 - Andong Li, Chengshi Zheng, Cunhang Fan, Renhua Peng, Xiaodong Li:
A Recursive Network with Dynamic Attention for Monaural Speech Enhancement. 2422-2426 - Hongjiang Yu, Wei-Ping Zhu, Yuhong Yang:
Constrained Ratio Mask for Speech Enhancement Using DNN. 2427-2431 - Chi-Chang Lee, Yu-Chen Lin, Hsuan-Tien Lin, Hsin-Min Wang, Yu Tsao:
SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning. 2432-2436 - Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii:
Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder. 2437-2441 - Ahmet Emin Bulut, Kazuhito Koishida:
Low-Latency Single Channel Speech Dereverberation Using U-Net Convolutional Neural Networks. 2442-2446 - Dung N. Tran, Kazuhito Koishida:
Single-Channel Speech Enhancement by Subspace Affinity Minimization. 2447-2451 - Haoyu Li, Junichi Yamagishi:
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement. 2452-2456 - Feng Deng, Tao Jiang, Xiaorui Wang, Chen Zhang, Yan Li:
NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement. 2457-2461
Deep Noise Suppression Challenge
- Xiaofei Li, Radu Horaud:
Online Monaural Speech Enhancement Using Delayed Subband LSTM. 2462-2466 - Maximilian Strake, Bruno Defraene, Kristoff Fluyt, Wouter Tirry, Tim Fingscheidt:
INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising. 2467-2471 - Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie:
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. 2472-2476 - Nils L. Westhausen, Bernd T. Meyer:
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression. 2477-2481 - Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, Arvindh Krishnaswamy:
A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech. 2482-2486 - Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy:
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss. 2487-2491 - Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke:
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results. 2492-2496
Voice and Hearing Disorders
- Sara Akbarzadeh, Sungmin Lee, Chin-Tuan Tan:
The Implication of Sound Level on Spatial Selective Auditory Attention for Cochlear Implant Users: Behavioral and Electrophysiological Measurement. 2497-2501 - Yangyang Wan, Huali Zhou, Qinglin Meng, Nengheng Zheng:
Enhancing the Interaural Time Difference of Bilateral Cochlear Implants with the Temporal Limits Encoder. 2502-2506 - Toshio Irino, Soichi Higashiyama, Hanako Yoshigi:
Speech Clarity Improvement by Vocal Self-Training Using a Hearing Impairment Simulator and its Correlation with an Auditory Modulation Index. 2507-2511 - Zhuohuang Zhang, Donald S. Williamson, Yi Shen:
Investigation of Phase Distortion on Perceived Speech Quality for Hearing-Impaired Listeners. 2512-2516 - Zhuo Zhang, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Di Zhou, Longbiao Wang:
EEG-Based Short-Time Auditory Attention Detection Using Multi-Task Deep Learning. 2517-2521 - Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard:
Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders - Step 1: CNN Model-Based Phone Classification. 2522-2526 - Bahman Mirheidari, Daniel Blackburn, Ronan O'Malley, Annalena Venneri, Traci Walker, Markus Reuber, Heidi Christensen:
Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation. 2527-2531 - Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan:
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech. 2532-2536 - Purva Barche, Krishna Gurugubelli, Anil Kumar Vuppala:
Towards Automatic Assessment of Voice Disorders: A Clinical Approach. 2537-2541 - Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed:
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages. 2542-2546
Spoken Term Detection
- Menglong Xu, Xiao-Lei Zhang:
Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-Footprint Keyword Spotting. 2547-2551 - Théodore Bluche, Thibault Gisselbrecht:
Predicting Detection Filters for Small Footprint Open-Vocabulary Keyword Spotting. 2552-2556 - Emre Yilmaz, Özgür Bora Gevrek, Jibin Wu, Yuxiang Chen, Xuanbo Meng, Haizhou Li:
Deep Convolutional Spiking Neural Networks for Keyword Spotting. 2557-2561 - Haiwei Wu, Yan Jia, Yuanfei Nie, Ming Li:
Domain Aware Training for Far-Field Small-Footprint Keyword Spotting. 2562-2566 - Kun Zhang, Zhiyong Wu, Daode Yuan, Jian Luan, Jia Jia, Helen Meng, Binheng Song:
Re-Weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting. 2567-2571 - Peng Zhang, Xueliang Zhang:
Deep Template Matching for Small-Footprint and Configurable Keyword Spotting. 2572-2576 - Chen Yang, Xue Wen, Liming Song:
Multi-Scale Convolution for Robust Keyword Spotting. 2577-2581 - Yangbin Chen, Tom Ko, Lifeng Shang, Xiao Chen, Xin Jiang, Qing Li:
An Investigation of Few-Shot Learning in Spoken Term Classification. 2582-2586 - Zeyu Zhao, Weiqiang Zhang:
End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages. 2587-2591 - Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir:
Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection. 2592-2596
The Fearless Steps Challenge Phase-02
- Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach:
Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments. 2597-2601 - Xueshuai Zhang, Wenchao Wang, Pengyuan Zhang:
Speaker Diarization System Based on DPCA Algorithm for Fearless Steps Challenge Phase-2. 2602-2606 - Qingjian Lin, Tingle Li, Ming Li:
The DKU Speech Activity Detection and Speaker Identification Systems for Fearless Steps Challenge Phase-02. 2607-2611 - Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman:
"This is Houston. Say again, please". The Behavox System for the Apollo-11 Fearless Steps Challenge (Phase II). 2612-2616 - Aditya Joglekar, John H. L. Hansen, Meena Chandra Shekhar, Abhijeet Sangwan:
FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data. 2617-2621
Monaural Source Separation
- Yi Luo, Nima Mesgarani:
Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss. 2622-2626 - Jingjing Chen, Qirong Mao, Dong Liu:
On Synthesis for Supervised Monaural Speech Separation in Time Domain. 2627-2631 - Jun Wang:
Learning Better Speech Representations by Worsening Interference. 2632-2636 - Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent:
Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers. 2637-2641 - Jingjing Chen, Qirong Mao, Dong Liu:
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. 2642-2646 - Chengyun Deng, Yi Zhang, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li:
Conv-TasSAN: Separative Adversarial Network Based on Conv-TasNet. 2647-2651 - Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach:
Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. 2652-2656 - Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, Andreas Spanias:
Unsupervised Audio Source Separation Using Generative Priors. 2657-2661
Single-Channel Speech Enhancement II
- Yuanhang Qiu, Ruili Wang:
Adversarial Latent Representation Learning for Speech Enhancement. 2662-2666 - Yang Xiang, Liming Shi, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen:
An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence. 2667-2671 - Lu Zhang, Mingjiang Wang:
Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement. 2672-2676 - Quan Wang, Ignacio López-Moreno, Mert Saglam, Kevin W. Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein:
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition. 2677-2681 - Ziqiang Shi, Rujie Liu, Jiqing Han:
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss. 2682-2686 - Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li:
Sub-Band Knowledge Distillation Framework for Speech Enhancement. 2687-2691 - Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal:
A Deep Learning-Based Kalman Filter for Speech Enhancement. 2692-2696 - Hongjiang Yu, Wei-Ping Zhu, Benoît Champagne:
Subband Kalman Filtering with DNN Estimated Parameters for Speech Enhancement. 2697-2701 - Xiaoqi Li, Yaxing Li, Yuanjie Dong, Shan Xu, Zhihui Zhang, Dan Wang, Shengwu Xiong:
Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement. 2702-2706 - Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu:
Speaker-Conditional Chain Model for Speech Separation and Extraction. 2707-2711
Topics in ASR II
- Leanne Nortje, Herman Kamper:
Unsupervised vs. Transfer Learning for Multimodal One-Shot Matching of Speech and Images. 2712-2716 - Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung:
Multimodal Speech Emotion Recognition Using Cross Attention with Aligned Audio and Text. 2717-2721 - Tamás Gábor Csapó:
Speaker Dependent Articulatory-to-Acoustic Mapping Using Real-Time MRI of the Vocal Tract. 2722-2726 - Tamás Gábor Csapó, Csaba Zainkó, László Tóth, Gábor Gosztolya, Alexandra Markó:
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis. 2727-2731 - Siyuan Feng, Odette Scharenborg:
Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling. 2732-2736 - Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:
Generative Adversarial Training Data Adaptation for Very Low-Resource Automatic Speech Recognition. 2737-2741 - Kazuki Tsunematsu, Johanes Effendi, Sakriani Sakti, Satoshi Nakamura:
Neural Speech Completion. 2742-2746 - Benjamin Milde, Chris Biemann:
Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization. 2747-2751 - Katerina Papadimitriou, Gerasimos Potamianos:
Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning. 2752-2756 - Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert:
MLS: A Large-Scale Multilingual Dataset for Speech Research. 2757-2761
Neural Signals for Spoken Communication
- Ivan Halim Parmonangan, Hiroki Tanaka, Sakriani Sakti, Satoshi Nakamura:
Combining Audio and Brain Activity for Predicting Speech Quality. 2762-2766 - Rini A. Sharon, Hema A. Murthy:
The "Sound of Silence" in EEG - Cognitive Voice Activity Detection. 2767-2771 - Siqi Cai, Enze Su, Yonghao Song, Longhan Xie, Haizhou Li:
Low Latency Auditory Attention Detection with Common Spatial Pattern Analysis of EEG Signals. 2772-2776 - Miguel Angrick, Christian Herff, Garett D. Johnson, Jerry J. Shih, Dean J. Krusienski, Tanja Schultz:
Speech Spectrogram Estimation from Intracranial Brain Activity Using a Quantization Approach. 2777-2781 - Debadatta Dash, Paul Ferrari, Angel W. Hernandez-Mulero, Daragh Heitzman, Sara G. Austin, Jun Wang:
Neural Speech Decoding for Amyotrophic Lateral Sclerosis. 2782-2786
Training Strategies for ASR
- Yang Chen, Weiran Wang, Chao Wang:
Semi-Supervised ASR by End-to-End Self-Training. 2787-2791 - Hitesh Tulsiani, Ashtosh Sapru, Harish Arsikere, Surabhi Punjabi, Sri Garimella:
Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants. 2792-2796 - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka:
Serialized Output Training for End-to-End Overlapped Speech Recognition. 2797-2801 - Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan:
Semi-Supervised Learning with Data Augmentation for End-to-End ASR. 2802-2806 - Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas:
Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition. 2807-2811 - Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney:
A New Training Pipeline for an Improved Neural Transducer. 2812-2816 - Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le:
Improved Noisy Student Training for Automatic Speech Recognition. 2817-2821 - Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi:
Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition. 2822-2826 - Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim:
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition. 2827-2831 - Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno:
SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR. 2832-2836
Speech Transmission & Coding
- Sneha Das, Tom Bäckström, Guillaume Fuchs:
Fundamental Frequency Model for Postfiltering at Low Bitrates in a Transform-Domain Speech and Audio Codec. 2837-2841 - Arthur Van Den Broucke, Deepak Baby, Sarah Verhulst:
Hearing-Impaired Bio-Inspired Cochlear Models for Real-Time Auditory Applications. 2842-2846 - Jan Skoglund, Jean-Marc Valin:
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis. 2847-2851 - Pranay Manocha, Adam Finkelstein, Richard Zhang, Nicholas J. Bryan, Gautham J. Mysore, Zeyu Jin:
A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences. 2852-2856 - Piotr Masztalski, Mateusz Matuszewski, Karol Piaskowski, Michal Romaniuk:
StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation. 2857-2861 - Babak Naderi, Ross Cutler:
An Open Source Implementation of ITU-T Recommendation P.808 with Validation. 2862-2866 - Gabriel Mittag, Ross Cutler, Yasaman Hosseinkashi, Michael Revow, Sriram Srinivasan, Naglakshmi Chande, Robert Aichner:
DNN No-Reference PSTN Speech Quality Prediction. 2867-2871 - Sebastian Möller, Tobias Hübschen, Thilo Michael, Gabriel Mittag, Gerhard Schmidt:
Non-Intrusive Diagnostic Monitoring of Fullband Speech Quality. 2872-2876
Bioacoustics and Articulation
- Abdolreza Sabzi Shahrebabaki, Negar Olfati, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen:
Transfer Learning of Articulatory Information Through Phone Information. 2877-2881 - Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen:
Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals. 2882-2886 - Bernardo B. Gatto, Eulanda Miranda dos Santos, Juan Gabriel Colonna, Naoya Sogi, Lincon Sales de Souza, Kazuhiro Fukui:
Discriminative Singular Spectrum Analysis for Bioacoustic Classification. 2887-2891 - Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh:
Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data. 2892-2896 - Abner Hernandez, Eun Jung Yeo, Sunhee Kim, Minhwa Chung:
Dysarthria Detection and Severity Assessment Using Rhythm-Based Metrics. 2897-2901 - Yi Ma, Xinzi Xu, Yongfu Li:
LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-Local Block ResNet Neural Network with Mixup Data Augmentation. 2902-2906 - Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Attention and Encoder-Decoder Based Models for Transforming Articulatory Movements at Different Speaking Rates. 2907-2911 - Zijiang Yang, Shuo Liu, Meishu Song, Emilia Parada-Cabaleiro, Björn W. Schuller:
Adventitious Respiratory Classification Using Attentive Residual Neural Networks. 2912-2916 - Raphael Lenain, Jack Weston, Abhishek Shivkumar, Emil Fristed:
Surfboard: Audio Feature Extraction for Modern Machine Learning. 2917-2921 - Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh:
Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task. 2922-2926
Speech Synthesis: Multilingual and Cross-Lingual Approaches
- Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma:
Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion. 2927-2931 - Zhaoyu Liu, Brian Mak:
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment. 2932-2936 - Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang:
Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis. 2937-2941 - Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S. Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao:
Phonological Features for 0-Shot Multilingual Speech Synthesis. 2942-2946 - Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari:
Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space. 2947-2951 - Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen:
Tone Learning in Low-Resource Bilingual TTS. 2952-2956 - Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupesh Kumar Mehta:
On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model. 2957-2961 - Anusha Prakash, Hema A. Murthy:
Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework. 2962-2966 - Marcel de Korte, Jaebok Kim, Esther Klabbers:
Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling. 2967-2971 - Tomás Nekvinda, Ondrej Dusek:
One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech. 2972-2976
Learning Techniques for Speaker Recognition I
- Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee-Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han:
In Defence of Metric Learning for Speaker Recognition. 2977-2981 - Seong Min Kye, Youngmoon Jung, Haebeom Lee, Sung Ju Hwang, Hoirin Kim:
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs. 2982-2986 - Kai Li, Masato Akagi, Yibo Wu, Jianwu Dang:
Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification. 2987-2991 - Yanpei Shi, Qiang Huang, Thomas Hain:
Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification. 2992-2996 - Ana Montalvo, José R. Calvo, Jean-François Bonastre:
Multi-Task Learning for Voice Related Recognition Tasks. 2997-3001 - Umair Khan, Javier Hernando:
Unsupervised Training of Siamese Networks for Speaker Verification. 3002-3006 - Ying Liu, Yan Song, Yiheng Jiang, Ian McLoughlin, Lin Liu, Li-Rong Dai:
An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions. 3007-3011 - Naijun Zheng, Xixin Wu, Jinghua Zhong, Xunying Liu, Helen Meng:
Speaker-Aware Linear Discriminant Analysis in Speaker Verification. 3012-3016 - Zhengyang Chen, Shuai Wang, Yanmin Qian:
Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network. 3017-3021
Pronunciation
- Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang:
Automatic Scoring at Multi-Granularity for L2 Pronunciation. 3022-3026 - Tien-Hong Lo, Shi-Yan Weng, Hsiu-Jui Chang, Berlin Chen:
An Effective End-to-End Modeling Approach for Mispronunciation Detection. 3027-3031 - Bi-Cheng Yan, Meng-Che Wu, Hsiao-Tsung Hung, Berlin Chen:
An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling. 3032-3036 - Richeng Duan, Nancy F. Chen:
Unsupervised Feature Adaptation Using Adversarial Multi-Task Training for Automatic Evaluation of Children's Speech. 3037-3041 - Longfei Yang, Kaiqi Fu, Jinsong Zhang, Takahiro Shinozaki:
Pronunciation Erroneous Tendency Detection with Language Adversarial Represent Learning. 3042-3046 - Sitong Cheng, Zhixin Liu, Lantian Li, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng:
ASR-Free Pronunciation Assessment. 3047-3051 - Konstantinos Kyriakopoulos, Kate M. Knill, Mark J. F. Gales:
Automatic Detection of Accent and Lexical Pronunciation Errors in Spontaneous Non-Native English Speech. 3052-3056 - Jiatong Shi, Nan Huo, Qin Jin:
Context-Aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training. 3057-3061 - Wei Chu, Yang Liu, Jianwei Zhou:
Recognize Mispronunciations to Improve Non-Native Acoustic Modeling Through a Phone Decoder Built from One Edit Distance Finite State Automaton. 3062-3066
Diarization
- Pablo Gimeno, Victoria Mingote, Alfonso Ortega Giménez, Antonio Miguel, Eduardo Lleida:
Partial AUC Optimisation Using Recurrent Neural Networks for Music Detection with Limited Training Data. 3067-3071 - Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristià:
An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings. 3072-3076 - Chao Peng, Xihong Wu, Tianshu Qu:
Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space. 3077-3081 - Shoufeng Lin, Xinyuan Qian:
Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework. 3082-3086 - Shuo Liu, Andreas Triantafyllopoulos, Zhao Ren, Björn W. Schuller:
Towards Speech Robustness for Acoustic Scene Classification. 3087-3091 - Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari:
Identify Speakers in Cocktail Parties with End-to-End Attention. 3092-3096 - Thilo von Neumann, Christoph Böddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach:
Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR. 3097-3101 - Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee:
Attentive Convolutional Recurrent Neural Network Using Phoneme-Level Acoustic Representation for Rare Sound Event Detection. 3102-3106 - Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent:
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios. 3107-3111 - Niko Moritz, Gordon Wichern, Takaaki Hori, Jonathan Le Roux:
All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection. 3112-3116
Computational Paralinguistics II
- Lorenz Diener, Shahin Amiriparian, Catarina Botelho, Kevin Scheck, Dennis Küster, Isabel Trancoso, Björn W. Schuller, Tanja Schultz:
Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals. 3117-3121 - Shun-Chang Zhong, Bo-Hao Su, Wei Huang, Yi-Ching Liu, Chi-Chun Lee:
Predicting Collaborative Task Performance Using Graph Interlocutor Acoustic Network in Small Group Interaction. 3122-3126 - Gábor Gosztolya:
Very Short-Term Conflict Intensity Estimation Using Fisher Vectors. 3127-3131 - Hiroki Mori, Yuki Kikuchi:
Gaming Corpus for Studying Social Screams. 3132-3135 - Amber Afshan, Jody Kreiman, Abeer Alwan:
Speaker Discrimination in Humans and Machines: Effects of Speaking Style Variability. 3136-3140 - Kamini Sabu, Preeti Rao:
Automatic Prediction of Confidence Level from Children's Oral Reading Recordings. 3141-3145 - Wei Xue, Viviana Mendoza Ramos, Wieke Harmsen, Catia Cucchiarini, R. W. N. M. van Hout, Helmer Strik:
Towards a Comprehensive Assessment of Speech Intelligibility for Pathological Speech. 3146-3150 - Yi Lin, Hongwei Ding:
Effects of Communication Channels and Actor's Gender on Emotion Identification by Native Mandarin Speakers. 3151-3155 - Ivo Anjos, Maxine Eskénazi, Nuno Marques, Margarida Grilo, Isabel Guimarães, João Magalhães, Sofia Cavaco:
Detection of Voicing and Place of Articulation of Fricatives with Deep Learning in a Virtual Speech and Language Therapy Tutor. 3156-3160
Speech Synthesis Paradigms and Methods II
- Haitong Zhang, Yue Lin:
Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages. 3161-3165 - Kasperi Palkama, Lauri Juvela, Alexander Ilin:
Conditional Spoken Digit Generation with StyleGAN. 3166-3170 - Jingzhou Yang, Lei He:
Towards Universal Text-to-Speech. 3171-3175 - Kouichi Katsurada, Korin Richmond:
Speaker-Independent Mel-Cepstrum Estimation from Articulator Movements Using D-Vector Input. 3176-3180 - Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao, Helen Meng:
Enhancing Monotonicity for Robust Autoregressive Transformer TTS. 3181-3185 - Devang S. Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao:
Incremental Text to Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning. 3186-3190 - Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-yi Lee:
Semi-Supervised Learning for Multi-Speaker Text-to-Speech Synthesis Using Discrete Speech Representation. 3191-3195 - Pramit Saha, Sidney S. Fels:
Learning Joint Articulatory-Acoustic Representations with Normalizing Flows. 3196-3200 - Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari:
Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis. 3201-3205 - Jacob J. Webber, Olivier Perrotin, Simon King:
Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification. 3206-3210
Speaker Embedding
- Wei-Wei Lin, Man-Wai Mak:
Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms. 3211-3215 - Minh Pham, Zeqian Li, Jacob Whitehill:
How Does Label Noise Affect the Quality of Speaker Embeddings? 3216-3220 - Xuechen Liu, Md. Sahidullah, Tomi Kinnunen:
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings. 3221-3225 - Wei Xia, John H. L. Hansen:
Speaker Representation Learning Using Global Context Guided Channel and Time-Frequency Transformations. 3226-3230 - Yoohwan Kwon, Soo-Whan Chung, Hong-Goo Kang:
Intra-Class Variation Reduction of Speaker Representation in Disentanglement Framework. 3231-3235 - Munir Georges, Jonathan Huang, Tobias Bocklet:
Compact Speaker Embedding: lrx-Vector. 3236-3240 - Florian L. Kreyssig, Philip C. Woodland:
Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings. 3241-3245 - Junyi Peng, Rongzhi Gu, Yuexian Zou:
Deep Speaker Embedding with Long Short Term Centroid Learning for Text-Independent Speaker Verification. 3246-3250 - Lantian Li, Dong Wang, Thomas Fang Zheng:
Neural Discriminant Analysis for Deep Speaker Embedding. 3251-3255 - Jaejin Cho, Piotr Zelasko, Jesús Villalba, Shinji Watanabe, Najim Dehak:
Learning Speaker Embedding from Text-to-Speech. 3256-3260
Single-Channel Speech Enhancement III
- Yan Zhao, DeLiang Wang:
Noisy-Reverberant Speech Enhancement Using DenseUNet with Time-Frequency Attention. 3261-3265 - Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li:
On Loss Functions and Recurrency Training for GAN-Based Speech Enhancement Systems. 3266-3270 - Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang:
Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement. 3271-3275 - Mikolaj Kegler, Pierre Beckmann, Milos Cernak:
Deep Speech Inpainting of Time-Frequency Masks. 3276-3280 - Nikhil Shankar, Gautam Shreedhar Bhat, Issa M. S. Panahi:
Real-Time Single-Channel Deep Neural Network-Based Speech Enhancement on Edge Devices. 3281-3285 - Ju Lin, Sufeng Niu, Adriaan J. de Lind van Wijngaarden, Jerome L. McClendon, Melissa C. Smith, Kuang-Ching Wang:
Improved Speech Enhancement Using a Time-Domain GAN with Mask Learning. 3286-3290 - Alexandre Défossez, Gabriel Synnaeve, Yossi Adi:
Real Time Speech Enhancement in the Waveform Domain. 3291-3295 - Michal Romaniuk, Piotr Masztalski, Karol Piaskowski, Mateusz Matuszewski:
Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks. 3296-3300
Multi-Channel Audio and Emotion Recognition
- Yuya Chiba, Takashi Nose, Akinori Ito:
Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition. 3301-3305 - Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao:
Microphone Array Post-Filter for Target Speech Enhancement Without a Prior Information of Point Interferers. 3306-3310 - Atsuo Hiroe:
Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction Using Magnitude Spectrogram as Reference. 3311-3315 - Oleg Golokolenko, Gerald Schuller:
The Method of Random Directions Optimization for Stereo Audio Source Separation. 3316-3320 - Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen:
Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations. 3321-3325 - Robin Scheibler:
Generalized Minimal Distortion Principle for Blind Source Separation. 3326-3330 - Ying Zhong, Ying Hu, Hao Huang, Wushour Silamu:
A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition. 3331-3335 - Ruichu Cai, Kaibin Guo, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang:
Meta Multi-Task Learning for Speech Emotion Recognition. 3336-3340 - François Grondin, Jean-Samuel Lauzon, Jonathan Vincent, François Michaud:
GEV Beamforming Supported by DOA-Based Masks Generated on Pairs of Microphones. 3341-3345
Computational Resource Constrained Speech Recognition
- Christin Jose, Yuriy Mishchenko, Thibaud Sénéchal, Anish Shah, Alex Escott, Shiv Naga Prasad Vitaladevuni:
Accurate Detection of Wake Word Start and End Using a CNN. 3346-3350 - Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir:
Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering. 3351-3355 - Somshubra Majumdar, Boris Ginsburg:
MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 3356-3360 - Abhinav Mehrotra, Lukasz Dudziak, Jinsu Yeo, Young-Yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C. P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane:
Iterative Compression of End-to-End ASR Model Using AutoML. 3361-3365 - Hieu Duy Nguyen, Anastasios Alexandridis, Athanasios Mouchtaris:
Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition. 3366-3370 - Abhinav Garg, Gowtham P. Vadisetti, Dhananjaya Gowda, Sichen Jin, Aditya Jayasimha, Youngho Han, Jiyeon Kim, Junmo Park, Kwangyoun Kim, Sooyeon Kim, Young-Yoon Lee, Kyungbo Min, Chanwoo Kim:
Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing. 3371-3375 - Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Y. Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert:
Scaling Up Online Speech Recognition Using ConvNets. 3376-3380 - Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang:
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition. 3381-3385 - Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Piérard, Jon Webb:
Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models. 3386-3390
Speech Synthesis: Prosody and Emotion
- Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman:
Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network. 3391-3395 - Ravi Shankar, Jacob Sager, Archana Venkataraman:
Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator. 3396-3400 - Noé Tits, Kevin El Haddad, Thierry Dutoit:
Laughter Synthesis: Combining Seq2seq Modeling with Transfer Learning. 3401-3405 - Yuexin Cao, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
Nonparallel Emotional Speech Conversion Using VAE-GAN. 3406-3410 - Alexander Sorin, Slava Shechtman, Ron Hoory:
Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS. 3411-3415 - Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li:
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion. 3416-3420 - Kento Matsumoto, Sunao Hara, Masanobu Abe:
Controlling the Strength of Emotions in Speech-Like Emotional Sound Generated by WaveNet. 3421-3425 - Guangyan Zhang, Ying Qin, Tan Lee:
Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation. 3426-3430 - Takuya Kishida, Shin Tsukamoto, Toru Nakashika:
Simultaneous Conversion of Speaker Identity and Emotion Based on Multiple-Domain Adaptive RBM. 3431-3435 - Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie:
Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis. 3436-3440 - Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda:
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis. 3441-3445 - Sefik Emre Eskimez, Dimitrios Dimitriadis, Robert Gmyr, Kenichi Kumanati:
GAN-Based Data Generation for Speech Emotion Recognition. 3446-3450 - Hira Dhamyal, Shahan Ali Memon, Bhiksha Raj, Rita Singh:
The Phonetic Bases of Vocal Expressed Emotion: Natural versus Acted. 3451-3455
The Interspeech 2020 Far Field Speaker Verification Challenge
- Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li:
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge. 3456-3460 - Peng Zhang, Peng Hu, Xueliang Zhang:
Deep Embedding Learning for Text-Dependent Speaker Verification. 3461-3465 - Aleksei Gusev, Vladimir Volokhov, Alisa Vinogradova, Tseren Andzhukaev, Andrey Shulipa, Sergey Novoselov, Timur Pekhovsky, Alexander Kozlov:
STC-Innovation Speaker Recognition Systems for Far-Field Speaker Verification Challenge 2020. 3466-3470 - Li Zhang, Jian Wu, Lei Xie:
NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge. 3471-3475 - Ying Tong, Wei Xue, Shanluo Huang, Lu Fan, Chao Zhang, Guohong Ding, Xiaodong He:
The JD AI Speaker Verification System for the FFSVC 2020 Challenge. 3476-3480
Multimodal Speech Processing
- Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang:
FaceFilter: Audio-Visual Speech Separation Using Still Images. 3481-3485 - Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung:
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision. 3486-3490 - Michael Wand, Jürgen Schmidhuber:
Fusion Architectures for Word-Based Audiovisual Speech Recognition. 3491-3495 - Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng:
Audio-Visual Multi-Channel Recognition of Overlapped Speech. 3496-3500 - Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li:
TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog. 3501-3505 - George Sterpu, Christian Saam, Naomi Harte:
Should we Hard-Code the Recurrence Concept or Learn it Instead? Exploring the Transformer Architecture for Audio-Visual Speech Recognition. 3506-3509 - Alexandros Koumparoulis, Gerasimos Potamianos, Samuel Thomas, Edmilson da Silva Morais:
Resource-Adaptive Deep Learning for Visual Speech Recognition. 3510-3514 - Masood S. Mortazavi:
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks. 3515-3519 - Hong Liu, Zhan Chen, Bing Yang:
Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion. 3520-3524 - Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh:
Caption Alignment for Low Resource Audio-Visual Data. 3525-3529
Keynote 4
- Shehzad Mevawalla:
Successes, Challenges and Opportunities for Speech Technology in Conversational Agents.
Speech Synthesis: Neural Waveform Generation II
- Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen:
Vocoder-Based Speech Synthesis from Silent Videos. 3530-3534 - Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda:
Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation. 3535-3539 - Yi-Chiao Wu, Patrick Lumban Tobing, Kazuki Yasuhara, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda:
A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems. 3540-3544 - Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee:
Audio Dequantization for High Fidelity Audio Generation in Flow-Based Neural Vocoder. 3545-3549 - Manish Sharma, Tom Kenter, Rob Clark:
StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes. 3550-3554 - Yang Cui, Xi Wang, Lei He, Frank K. Soong:
An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis. 3555-3559 - Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling:
Reverberation Modeling for Source-Filter-Based Neural Vocoder. 3560-3564 - Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C. P. Ramos, Nicholas D. Lane:
Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech Systems. 3565-3569 - Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim:
Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder. 3570-3574 - Jan Vainer, Ondrej Dusek:
SpeedySpeech: Efficient Neural Speech Synthesis. 3575-3579
ASR Neural Network Architectures and Training II
- Zi-qiang Zhang, Yan Song, Jian-Shu Zhang, Ian McLoughlin, Li-Rong Dai:
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution. 3580-3584 - Ashtosh Sapru, Sri Garimella:
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models. 3585-3589 - Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong:
Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability. 3590-3594 - Xuankai Chang, Aswin Shanmugam Subramanian, Pengcheng Guo, Shinji Watanabe, Yuya Fujita, Motoi Omachi:
End-to-End ASR with Adaptive Span Self-Attention. 3595-3599 - Egor Lakomkin, Jahn Heymann, Ilya Sklyar, Simon Wiesler:
Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition. 3600-3604 - Wilfried Michel, Ralf Schlüter, Hermann Ney:
Early Stage LM Integration Using Local and Global Log-Linear Combination. 3605-3609 - Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu:
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. 3610-3614 - Tara N. Sainath, Ruoming Pang, David Rybach, Basi García, Trevor Strohman:
Emitting Word Timings with End-to-End Models. 3615-3619 - Danni Liu, Gerasimos Spanakis, Jan Niehues:
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. 3620-3624
Neural Networks for Language Modeling
- Ke Li, Daniel Povey, Sanjeev Khudanpur:
Neural Language Modeling with Implicit Cache Pointers. 3625-3629 - Abhilash Jain, Aku Rouhe, Stig-Arne Grönroos, Mikko Kurimo:
Finnish ASR with Deep Transformer Models. 3630-3634 - Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:
Distilling the Knowledge of BERT for Sequence-to-Sequence ASR. 3635-3639 - Jen-Tzung Chien, Yu-Min Huang:
Stochastic Convolutional Recurrent Networks for Language Modeling. 3640-3644 - Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney:
Investigation of Large-Margin Softmax in Neural Language Modeling. 3645-3649 - Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig:
Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model. 3650-3654 - Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi:
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict. 3655-3659 - Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang:
Insertion-Based Modeling for End-to-End Automatic Speech Recognition. 3660-3664
Phonetic Event Detection and Segmentation
- Yefei Chen, Heinrich Dinkel, Mengyue Wu, Kai Yu:
Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection. 3665-3669 - Joohyung Lee, Youngmoon Jung, Hoirin Kim:
Dual Attention in Time and Frequency Domain for Voice Activity Detection. 3670-3674 - Tianjiao Xu, Hui Zhang, Xueliang Zhang:
Polishing the Classical Likelihood Ratio Test by Supervised Learning for Voice Activity Detection. 3675-3679 - Avinash Kumar, S. Shahnawazuddin, Waquar Ahmad:
A Noise Robust Technique for Detecting Vowels in Speech Signals. 3680-3684 - Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, Hervé Bredin, Leibny Paola García-Perera:
End-to-End Domain-Adversarial Voice Activity Detection. 3685-3689 - Ayush Agarwal, Jagabandhu Mishra, S. R. Mahadeva Prasanna:
VOP Detection in Variable Speech Rate Condition. 3690-3694 - Zhenpeng Zheng, Jianzong Wang, Ning Cheng, Jian Luo, Jing Xiao:
MLNET: An Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection. 3695-3699 - Felix Kreuk, Joseph Keshet, Yossi Adi:
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation. 3700-3704 - Piotr Zelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak:
That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages. 3705-3709 - S. Limonard, Catia Cucchiarini, R. W. N. M. van Hout, Helmer Strik:
Analyzing Read Aloud Speech by Primary School Pupils: Insights for Research and Development. 3710-3714
Human Speech Production II
- Heikki Rasilo, Yannick Jadoul:
Discovering Articulatory Speech Targets from Synthesized Random Babble. 3715-3719 - Tamás Gábor Csapó:
Speaker Dependent Acoustic-to-Articulatory Inversion Using Real-Time MRI of the Vocal Tract. 3720-3724 - Narjes Bozorg, Michael T. Johnson:
Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet. 3725-3729 - Ioannis K. Douros, Ajinkya Kulkarni, Chrysanthi Dourou, Yu Xie, Jacques Felblinger, Karyna Isaieva, Pierre-André Vuissoz, Yves Laprie:
Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV. 3730-3734 - Tamás Gábor Csapó, Kele Xu:
Quantification of Transducer Misalignment in Ultrasound Tongue Imaging. 3735-3739 - Maud Parrot, Juliette Millet, Ewan Dunbar:
Independent and Automatic Evaluation of Speaker-Independent Acoustic-to-Articulatory Reconstruction. 3740-3744 - Lorenz Diener, Mehrdad Roustay Vishkasougheh, Tanja Schultz:
CSL-EMG_Array: An Open Access Corpus for EMG-to-Speech Conversion. 3745-3749 - Joshua Penney, Felicity Cox, Anita Szakay:
Links Between Production and Perception of Glottalisation in Individual Australian English Speaker/Listeners. 3750-3754
New Trends in Self-Supervised Speech Processing
- Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara:
Jointly Fine-Tuning "BERT-Like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition. 3755-3759 - Yu-An Chung, Hao Tang, James R. Glass:
Vector-Quantized Autoregressive Predictive Coding. 3760-3764 - Xingchen Song, Guangsen Wang, Yiheng Huang, Zhiyong Wu, Dan Su, Helen Meng:
Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks. 3765-3769 - Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross B. Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed:
Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR. 3770-3774 - Ken'ichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng:
Sequence-Level Self-Learning with Multiple Hypotheses. 3775-3779 - Haibin Wu, Andy T. Liu, Hung-yi Lee:
Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning. 3780-3784 - Shu-Wen Yang, Andy T. Liu, Hung-yi Lee:
Understanding Self-Attention of Self-Supervised Audio Transformers. 3785-3789 - Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James R. Glass:
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning. 3790-3794 - Ayimunishagu Abulimiti, Jochen Weiner, Tanja Schultz:
Automatic Speech Recognition for ILSE-Interviews: Longitudinal Conversational Speech Recordings Covering Aging and Cognitive Decline. 3795-3799
Learning Techniques for Speaker Recognition II
- Dao Zhou, Longbiao Wang, Kong Aik Lee, Yibo Wu, Meng Liu, Jianwu Dang, Jianguo Wei:
Dynamic Margin Softmax Loss for Speaker Verification. 3800-3804 - Magdalena Rybicka, Konrad Kowalczyk:
On Parameter Adaptation in Softmax-Based Cross-Entropy Loss for Improved Convergence Speed and Accuracy in DNN-Based Speaker Recognition. 3805-3809 - Victoria Mingote, Antonio Miguel, Alfonso Ortega Giménez, Eduardo Lleida:
Training Speaker Enrollment Models by Network Optimization. 3810-3814 - Seyyed Saeed Sarfjoo, Srikanth R. Madikeri, Petr Motlícek, Sébastien Marcel:
Supervised Domain Adaptation for Text-Independent Speaker Verification Using Limited Data. 3815-3819 - Yuheng Wei, Junzhao Du, Hui Liu:
Angular Margin Centroid Loss for Text-Independent Speaker Recognition. 3820-3824 - Jiawen Kang, Ruiqi Liu, Lantian Li, Yunqi Cai, Dong Wang, Thomas Fang Zheng:
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning. 3825-3829 - Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck:
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. 3830-3834 - Wenda Chen, Jonathan Huang, Tobias Bocklet:
Length- and Noise-Aware Training Techniques for Short-Utterance Speaker Recognition. 3835-3839
Spoken Language Evaluation
- Yiting Lu, Mark J. F. Gales, Yu Wang:
Spoken Language 'Grammatical Error Correction'. 3840-3844 - Sara Papi, Edmondo Trentin, Roberto Gretter, Marco Matassoni, Daniele Falavigna:
Mixtures of Deep Neural Experts for Automated Speech Scoring. 3845-3849 - Xinhao Wang, Klaus Zechner, Christopher Hamill:
Targeted Content Feedback in Spoken Language Learning and Assessment. 3850-3854 - Vyas Raina, Mark J. F. Gales, Kate M. Knill:
Universal Adversarial Attacks on Spoken Language Assessment Systems. 3855-3859 - Xixin Wu, Kate M. Knill, Mark J. F. Gales, Andrey Malinin:
Ensemble Approaches for Uncertainty in Spoken Language Assessment. 3860-3864 - Zhenchao Lin, Ryo Takashima, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi:
Shadowability Annotation with Fine Granularity on L2 Utterances and its Improvement with Native Listeners' Script-Shadowing. 3865-3869 - Yu Bai, Ferdy Hubers, Catia Cucchiarini, Helmer Strik:
ASR-Based Evaluation and Feedback for Individualized Reading Practice. 3870-3874 - Dominika Woszczyk, Stavros Petridis, David E. Millard:
Domain Adversarial Neural Networks for Dysarthric Speech Recognition. 3875-3879 - Shunsuke Hidaka, Yogaku Lee, Kohei Wakamiya, Takashi Nakagawa, Tokihiko Kaburagi:
Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram. 3880-3884
Spoken Dialogue System
- Jen-Tzung Chien, Po-Chien Hsu:
Stochastic Curiosity Exploration for Dialogue Systems. 3885-3889 - Myeongho Jeong, Seungtaek Choi, Hojae Han, Kyungho Kim, Seung-won Hwang:
Conditional Response Augmentation for Dialogue Using Knowledge Distillation. 3890-3894 - Hongyin Luo, Shang-Wen Li, James R. Glass:
Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption. 3895-3899 - Teakgyu Hong, Oh-Woog Kwon, Young-Kil Kim:
End-to-End Task-Oriented Dialog System Through Template Slot Value Generation. 3900-3904 - Zhenhao He, Jiachun Wang, Jian Chen:
Task-Oriented Dialog Generation with Enhanced Entity Representation. 3905-3909 - Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara:
End-to-End Speech-to-Dialog-Act Recognition. 3910-3914 - Yao Qian, Yu Shi, Michael Zeng:
Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-Oriented Spoken Dialog. 3915-3919 - Xinnuo Xu, Yizhe Zhang, Lars Liden, Sungjin Lee:
Datasets and Benchmarks for Task-Oriented Log Dialogue Ranking Task. 3920-3924
Dereverberation and Echo Cancellation
- Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian, Qiang Fu:
A Semi-Blind Source Separation Approach for Speech Dereverberation. 3925-3929 - Joon-Young Yang, Joon-Hyuk Chang:
Virtual Acoustic Channel Expansion Based on Neural Networks for Weighted Prediction Error-Based Speech Dereverberation. 3930-3934 - Vinay Kothapally, Wei Xia, Shahram Ghorbani, John H. L. Hansen, Wei Xue, Jing Huang:
SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation Using Optimally Smoothed Spectral Mapping. 3935-3939 - Chenggang Zhang, Xueliang Zhang:
A Robust and Cascaded Acoustic Echo Cancellation Based on Deep Learning. 3940-3944 - Yi Zhang, Chengyun Deng, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li:
Generative Adversarial Network Based Acoustic Echo Cancellation. 3945-3949 - Lukas Pfeifenberger, Franz Pernkopf:
Nonlinear Residual Echo Suppression Using a Recurrent Neural Network. 3950-3954 - Yi Gao, Ian Liu, J. Zheng, Cheng Luo, Bin Li:
Independent Echo Path Modeling for Stereophonic Acoustic Echo Cancellation. 3955-3958 - Hongsheng Chen, Teng Xiang, Kai Chen, Jing Lu:
Nonlinear Residual Echo Suppression Based on Multi-Stream Conv-TasNet. 3959-3963 - Wenzhi Fan, Jing Lu:
Improving Partition-Block-Based Acoustic Echo Canceler in Under-Modeling Scenarios. 3964-3968 - Jung-Hee Kim, Joon-Hyuk Chang:
Attention Wave-U-Net for Acoustic Echo Cancellation. 3969-3973
Speech Synthesis: Toward End-to-End Synthesis
- Zexin Cai, Chuxiong Zhang, Ming Li:
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint. 3974-3978 - Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi:
Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS? 3979-3983 - Tao Wang, Xuefei Liu, Jianhua Tao, Jiangyan Yi, Ruibo Fu, Zhengqi Wen:
Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding. 3984-3988 - Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang:
Bi-Level Speaker Supervision for One-Shot Speech Synthesis. 3989-3993 - Alex Peiró Lilja, Mireia Farrús:
Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding. 3994-3998 - Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou:
MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search. 3999-4003 - Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon:
JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment. 4004-4008 - Masashi Aso, Shinnosuke Takamichi, Hiroshi Saruwatari:
End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention. 4009-4013 - Qingyun Dou, Joshua Efiong, Mark J. F. Gales:
Attention Forcing for Speech Synthesis. 4014-4018 - Jason Fong, Jason Taylor, Simon King:
Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis. 4019-4023 - Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin:
MultiSpeech: Multi-Speaker Text to Speech with Transformer. 4024-4028
Speech Enhancement, Bandwidth Extension and Hearing Aids
- Pavlos Papadopoulos, Shrikanth Narayanan:
Exploiting Conic Affinity Measures to Design Speech Enhancement Systems Operating in Unseen Noise Conditions. 4029-4033 - Yunyun Ji, Longting Xu, Wei-Ping Zhu:
Adversarial Dictionary Learning for Monaural Speech Enhancement. 4034-4038 - Shogo Seki, Moe Takada, Tomoki Toda:
Semi-Supervised Self-Produced Speech Enhancement and Suppression Based on Joint Source Modeling of Air- and Body-Conducted Signals Using Variational Autoencoder. 4039-4043 - Ran Weisman, Vladimir Tourbabin, Paul Calamia, Boaz Rafaely:
Spatial Covariance Matrix Estimation for Reverberant Speech with Application to Speech Enhancement. 4044-4048 - Minh Tri Ho, Jinyoung Lee, Bong-Ki Lee, Dong Hoon Yi, Hong-Goo Kang:
A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement. 4049-4053 - Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough:
TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids. 4054-4058 - Shu Hikosaka, Shogo Seki, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Hideki Banno, Tomoki Toda:
Intelligibility Enhancement Based on Speech Waveform Modification Using Hearing Impairment. 4059-4063 - Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li:
Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network. 4064-4068 - Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li:
Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension. 4069-4073 - Shichao Hu, Bin Zhang, Beici Liang, Ethan Zhao, Simon Lui:
Phase-Aware Music Super-Resolution Using Generative Adversarial Networks. 4074-4078
Speech Emotion Recognition III
- Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian:
Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition. 4079-4083 - Md Asif Jalal, Rosanna Milner, Thomas Hain, Roger K. Moore:
Removing Bias with Residual Mixture of Multi-View Attention for Speech Emotion Recognition. 4084-4088 - Weiquan Fan, Xiangmin Xu, Xiaofen Xing, Dongyan Huang:
Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition. 4089-4093 - Huan Zhou, Kai Liu:
Speech Emotion Recognition with Discriminative Feature Learning. 4094-4097 - Hengshun Zhou, Jun Du, Yanhui Tu, Chin-Hui Lee:
Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions. 4098-4102 - Yongwei Li, Jianhua Tao, Bin Liu, Donna Erickson, Masato Akagi:
Comparison of Glottal Source Parameter Values in Emotional Vowels. 4103-4107 - Huang-Cheng Chou, Chi-Chun Lee:
Learning to Recognize Per-Rater's Emotion Perception Using Co-Rater Training Strategy with Soft and Hard Labels. 4108-4112 - Md. Asif Jalal, Rosanna Milner, Thomas Hain:
Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition. 4113-4117
Acoustic Phonetics of L1-L2 and Other Interactions
- Iona Gessinger, Bernd Möbius, Bistra Andreeva, Eran Raveh, Ingmar Steiner:
Phonetic Accommodation of L2 German Speakers to the Virtual Language Learning Tutor Mirabella. 4118-4122 - Yuling Gu, Nancy F. Chen:
Characterization of Singaporean Children's English: Comparisons to American and British Counterparts Using Archetypal Analysis. 4123-4127 - Svetlana Kaminskaïa:
Rhythmic Convergence in Canadian French Varieties? 4128-4132 - Sreeja Manghat, Sreeram Manghat, Tanja Schultz:
Malayalam-English Code-Switched: Grapheme to Phoneme System. 4133-4137 - Mathilde Hutin, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker:
Ongoing Phonologization of Word-Final Voicing Alternations in Two Romance Languages: Romanian and French. 4138-4142 - Maxwell Hope, Jason Lilley:
Cues for Perception of Gender in Synthetic Voices and the Role of Identity. 4143-4147 - Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia:
Phonetic Entrainment in Cooperative Dialogues: A Case of Russian. 4148-4152 - Chengwei Xu, Wentao Gu:
Prosodic Characteristics of Genuine and Mock (Im)polite Mandarin Utterances. 4153-4157 - Yanping Li, Catherine T. Best, Michael D. Tyler, Denis Burnham:
Tone Variations in Regionally Accented Mandarin. 4158-4162 - Yike Yang, Si Chen, Xi Chen:
F0 Patterns in Mandarin Statements of Mandarin and Cantonese Speakers. 4163-4167
Conversational Systems
- Yung-Sung Chuang, Chi-Liang Liu, Hung-yi Lee, Lin-Shan Lee:
SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering. 4168-4172 - Chia-Chih Kuo, Shang-Bao Luo, Kuan-Yu Chen:
An Audio-Enriched BERT-Based Framework for Spoken Multiple-Choice Question Answering. 4173-4177 - Binxuan Huang, Han Wang, Tong Wang, Yue Liu, Yang Liu:
Entity Linking for Short Text Using Structured Knowledge Graph via Multi-Grained Text Matching. 4178-4182 - Mingxin Zhang, Tomohiro Tanaka, Wenxin Hou, Shengzhou Gao, Takahiro Shinozaki:
Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition. 4183-4187 - Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara:
Semi-Supervised Learning for Character Expression of Spoken Dialogue Systems. 4188-4192 - Xiaohan Shi, Sixia Li, Jianwu Dang:
Dimensional Emotion Prediction Based on Interactive Context in Conversation. 4193-4197 - Asma Atamna, Chloé Clavel:
HRI-RNN: A User-Robot Dynamics-Oriented RNN for Engagement Decrease Detection. 4198-4202 - Simone Fuscone, Benoît Favre, Laurent Prévot:
Neural Representations of Dialogical History for Improving Upcoming Turn Acoustic Parameters Prediction. 4203-4207 - Shengli Hu:
Detecting Domain-Specific Credibility and Expertise in Text and Speech. 4208-4212
The Attacker's Perspective on Automatic Speaker Verification
- Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li:
The Attacker's Perspective on Automatic Speaker Verification: An Overview. 4213-4217 - Alexey Sholokhov, Tomi Kinnunen, Ville Vestman, Kong Aik Lee:
Extrapolating False Alarm Rates in Automatic Speaker Verification. 4218-4222 - Ziyue Jiang, Hongcheng Zhu, Li Peng, Wenbing Ding, Yanzhen Ren:
Self-Supervised Spoofing Audio Detection Scheme. 4223-4227 - Qing Wang, Pengcheng Guo, Lei Xie:
Inaudible Adversarial Perturbations for Targeted Attack in Speaker Recognition. 4228-4232 - Jesús Villalba, Yuekai Zhang, Najim Dehak:
x-Vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification. 4233-4237 - Yuekai Zhang, Ziyan Jiang, Jesús Villalba, Najim Dehak:
Black-Box Attacks on Spoofing Countermeasures Using Transferability of Adversarial Examples. 4238-4242
Summarization, Semantic Analysis and Classification
- Krishna D. N, Ankita Patil:
Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. 4243-4247 - Potsawee Manakul, Mark J. F. Gales, Linlin Wang:
Abstractive Spoken Document Summarization Using Hierarchical Model with Multi-Stage Attention Diversity Optimization. 4248-4252 - Yichi Zhang, Yinpei Dai, Zhijian Ou, Huixin Wang, Junlan Feng:
Improved Learning of Word Embeddings with Word Definitions and Semantic Injection. 4253-4257 - Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur:
Wake Word Detection with Alignment-Free Lattice-Free MMI. 4258-4262 - Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Luong Chi Mai:
Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models. 4263-4267 - Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah:
End-to-End Named Entity Recognition from English Speech. 4268-4272 - Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris:
Semantic Complexity in End-to-End Spoken Language Understanding. 4273-4277 - Trang Tran, Morgan Tinkler, Gary Yeung, Abeer Alwan, Mari Ostendorf:
Analysis of Disfluency in Children's Speech. 4278-4282 - Ashish R. Mittal, Samarth Bharadwaj, Shreya Khare, Saneem A. Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury:
Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition. 4283-4287 - Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik:
Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation. 4288-4292
Speaker Recognition II
- Tianchi Liu, Rohan Kumar Das, Maulik C. Madhavi, Shengmei Shen, Haizhou Li:
Speaker-Utterance Dual Attention for Speaker and Utterance Verification. 4293-4297 - Lu Yi, Man-Wai Mak:
Adversarial Separation and Adaptation Network for Far-Field Speaker Verification. 4298-4302 - Hyewon Han, Soo-Whan Chung, Hong-Goo Kang:
MIRNet: Learning Multiple Identities Representations in Overlapped Speech. 4303-4307 - Weiwei Lin, Man-Wai Mak, Jen-Tzung Chien:
Strategies for End-to-End Text-Independent Speaker Verification. 4308-4312 - Rosa González Hautamäki, Tomi Kinnunen:
Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data. 4313-4317 - Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, Abeer Alwan:
Variable Frame Rate-Based Data Augmentation to Handle Speaking-Style Variability for Automatic Speaker Verification. 4318-4322 - Mathieu Seurin, Florian Strub, Philippe Preux, Olivier Pietquin:
A Machine of Few Words: Interactive Speaker Recognition with Reinforcement Learning. 4323-4327 - Filip Granqvist, Matt Seigel, Rogier C. van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik:
Improving On-Device Speaker Verification Using Federated Learning with Privacy. 4328-4332 - Shreyas Ramoji, Prashant Krishnan V, Sriram Ganapathy:
Neural PLDA Modeling for End-to-End Speaker Verification. 4333-4337
General Topics in Speech Recognition
- Kuba Lopatka, Tobias Bocklet:
State Sequence Pooling Training of Acoustic Models for Keyword Spotting. 4338-4342 - Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio López-Moreno, Rajiv Mathews:
Training Keyword Spotting Models on Non-IID Data with Federated Learning. 4343-4347 - Rongqing Huang, Ossama Abdel-Hamid, Xinwei Li, Gunnar Evermann:
Class LM and Word Mapping for Contextual Biasing in End-to-End ASR. 4348-4351 - Lasse Borgholt, Jakob D. Havtorn, Zeljko Agic, Anders Søgaard, Lars Maaløe, Christian Igel:
Do End-to-End Speech Recognition Models Care About Context? 4352-4356 - Ankur Kumar, Sachin Singh, Dhananjaya Gowda, Abhinav Garg, Shatrughan Singh, Chanwoo Kim:
Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios. 4357-4361 - Huaxin Wu, Genshun Wan, Jia Pan:
Speaker Code Based Speaker Adaptive Training Using Model Agnostic Meta-Learning. 4362-4366 - Han Zhu, Jiangjiang Zhao, Yuling Ren, Li Wang, Pengyuan Zhang:
Domain Adaptation Using Class Similarity for Robust Speech Recognition. 4367-4371 - Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura:
Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time. 4372-4376 - Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney:
Context-Dependent Acoustic Modeling Without Explicit Phone Clustering. 4377-4381 - S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, Waquar Ahmad:
Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario. 4382-4386
Speech Synthesis: Prosody Modeling
- Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman:
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech. 4387-4391 - Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang:
Joint Detection of Sentence Stress and Phrase Boundary for Prosody. 4392-4396 - Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet:
Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-Speech Synthesis. 4397-4401 - Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho:
Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning. 4402-4406 - Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba:
Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection. 4407-4411 - Tom Kenter, Manish Sharma, Rob Clark:
Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model. 4412-4416 - Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi:
Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction. 4417-4421 - Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao:
Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit. 4422-4426 - Yuma Shirahata, Daisuke Saito, Nobuaki Minematsu:
Discriminative Method to Extract Coarse Prosodic Structure and its Application for Statistical Phrase/Accent Command Estimation. 4427-4431 - Tuomo Raitio, Ramya Rasipuram, Dan Castellani:
Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features. 4432-4436 - Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore:
Controllable Neural Prosody Synthesis. 4437-4441 - Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song:
Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency. 4442-4446 - Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Köhler, Christian Fuegen, Qing He:
Interactive Text-to-Speech System via Joint Style Analysis. 4447-4451
Language Learning
- Kevin Hirschi, Okim Kang, Catia Cucchiarini, John H. L. Hansen, Keelan Evanini, Helmer Strik:
Mobile-Assisted Prosody Training for Limited English Proficiency: Learner Background and Speech Learning Pattern. 4452-4456 - Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu:
Finding Intelligible Consonant-Vowel Sounds Using High-Quality Articulatory Synthesis. 4457-4461 - Venkat Krishnamohan, Akshara Soman, Anshul Gupta, Sriram Ganapathy:
Audiovisual Correspondence Learning in Humans and Machines. 4462-4466 - Yizhou Lan:
Perception of English Fricatives and Affricates by Advanced Chinese Learners of English. 4467-4470 - Kimiko Tsukada, Joo-Yeon Kim, Jeong-Im Han:
Perception of Japanese Consonant Length by Native Speakers of Korean Differing in Japanese Learning Experience. 4471-4475 - Si Ioi Ng, Tan Lee:
Automatic Detection of Phonological Errors in Child Speech Using Siamese Recurrent Autoencoder. 4476-4480 - Hongwei Ding, Binghuai Lin, Liyuan Wang, Hui Wang, Ruomei Fang:
A Comparison of English Rhythm Produced by Native American Speakers and Mandarin ESL Primary School Learners. 4481-4485 - Chao Zhou, Silke Hamann:
Cross-Linguistic Interaction Between Phonological Categorization and Orthography Predicts Prosodic Effects in the Acquisition of Portuguese Liquids by L1-Mandarin Learners. 4486-4490 - Wenqian Li, Jung-Yueh Tu:
Cross-Linguistic Perception of Utterances with Willingness and Reluctance in Mandarin by Korean L2 Learners. 4491-4495
Speech Enhancement
- Rui Cheng, Changchun Bao:
Speech Enhancement Based on Beamforming and Post-Filtering by Combining Phase Information. 4496-4500 - Yu-Xuan Wang, Jun Du, Li Chai, Chin-Hui Lee, Jia Pan:
A Noise-Aware Memory-Attention Network Architecture for Regression-Based Speech Enhancement. 4501-4505 - Jiaqi Su, Zeyu Jin, Adam Finkelstein:
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks. 4506-4510 - Ashutosh Pandey, DeLiang Wang:
Learning Complex Spectral Mapping for Speech Enhancement with Improved Cross-Corpus Generalization. 4511-4515 - Julius Richter, Guillaume Carbajal, Timo Gerkmann:
Speech Enhancement with Stochastic Temporal Convolutional Networks. 4516-4520 - Mandar Gogate, Kia Dashtipour, Amir Hussain:
Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System. 4521-4525 - Aswin Sivaraman, Minje Kim:
Sparse Mixture of Local Experts for Efficient Speech Enhancement. 4526-4530 - Vinith Kishore, Nitya Tiwari, Periyasamy Paramasivam:
Improved Speech Enhancement Using TCN with Multiple Encoder-Decoder Layers. 4531-4535 - Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen:
Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations. 4536-4540 - Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii:
Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization. 4541-4545
Speech in Health II
- Merlin Albes, Zhao Ren, Björn W. Schuller, Nicholas Cummins:
Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition. 4546-4550 - Nadee Seneviratne, James R. Williamson, Adam C. Lammert, Thomas F. Quatieri, Carol Y. Espy-Wilson:
Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression. 4551-4555 - Danai Xezonaki, Georgios Paraskevopoulos, Alexandros Potamianos, Shrikanth Narayanan:
Affective Conditioning on Hierarchical Attention Networks Applied to Depression Detection from Transcribed Clinical Interviews. 4556-4560 - Zhaocheng Huang, Julien Epps, Dale Joachim, Brian Stasak, James R. Williamson, Thomas F. Quatieri:
Domain Adaptation for Enhancing Speech-Based Depression Detection in Natural Environmental Conditions Using Dilated CNNs. 4561-4565 - Gábor Gosztolya, Anita Bagi, Szilvia Szalóki, István Szendi, Ildikó Hoffmann:
Making a Distinction Between Schizophrenia and Bipolar Disorder Based on Temporal Parameters in Spontaneous Speech. 4566-4570 - Mark A. Huckvale, András Beke, Mirei Ikushima:
Prediction of Sleepiness Ratings from Voice by Man and Machine. 4571-4575 - Kristin J. Teplansky, Alan Wisler, Beiming Cao, Wendy Liang, Chad W. Whited, Ted Mau, Jun Wang:
Tongue and Lip Motion Patterns in Alaryngeal Speech. 4576-4580 - Zhengjun Yue, Heidi Christensen, Jon Barker:
Autoencoder Bottleneck Features with Multi-Task Optimisation for Improved Continuous Dysarthric Speech Recognition. 4581-4585 - Jhansi Mallela, Aravind Illa, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh:
Raw Speech Waveform Based Classification of Patients with ALS, Parkinson's Disease and Healthy Controls Using CNN-BLSTM. 4586-4590 - Anna Pompili, Rubén Solera-Ureña, Alberto Abad, Rita Cardoso, Isabel Guimarães, Margherita Fabbri, Isabel P. Martins, Joaquim J. Ferreira:
Assessment of Parkinson's Disease Medication State Through Automatic Speech Analysis. 4591-4595
Speech and Audio Quality Assessment
- Chao Zhang, Junjie Cheng, Yanmei Gu, Huacan Wang, Jun Ma, Shaojun Wang, Jing Xiao:
Improving Replay Detection System with Channel Consistency DenseNeXt for the ASVspoof 2019 Challenge. 4596-4600 - Przemyslaw Falkowski-Gilski, Grzegorz Debita, Marcin Habrych, Bogdan Miedzinski, Przemyslaw Jedlikowski, Bartosz Polnik, Jan Wandzio, Xin Wang:
Subjective Quality Evaluation of Speech Signals Transmitted via BPL-PLC Wired System. 4601-4605 - Waito Chiu, Yan Xu, Andrew Abel, Chun Lin, Zhengzheng Tu:
Investigating the Visual Lombard Effect with Gabor Based Features. 4606-4610 - Qiang Huang, Thomas Hain:
Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models. 4611-4615 - Alessandro Ragano, Emmanouil Benetos, Andrew Hines:
Development of a Speech Quality Database Under Uncontrolled Conditions. 4616-4620 - Robin Algayres, Mohamed Salah Zaïem, Benoît Sagot, Emmanuel Dupoux:
Evaluating the Reliability of Acoustic Speech Embeddings. 4621-4625 - Hao Li, DeLiang Wang, Xueliang Zhang, Guanglai Gao:
Frame-Level Signal-to-Noise Ratio Estimation Using Deep Learning. 4626-4630 - Xuan Dong, Donald S. Williamson:
A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals. 4631-4635 - Avamarie Brueggeman, John H. L. Hansen:
Effect of Spectral Complexity Reduction and Number of Instruments on Musical Enjoyment with Cochlear Implants. 4636-4640 - Michal Kosmider:
Spectrum Correction: Acoustic Scene Classification with Mismatched Recording Devices. 4641-4645
Privacy and Security in Speech Communication
- Matthew O'Connor, W. Bastiaan Kleijn:
Distributed Summation Privacy for Speech Enhancement. 4646-4650 - Anna Leschanowsky, Sneha Das, Tom Bäckström, Pablo Pérez Zarazaga:
Perception of Privacy Measured in the Crowd - Paired Comparison on the Effect of Background Noises. 4651-4655 - Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet:
Hide and Speak: Towards Deep Neural Networks for Speech Steganography. 4656-4660 - Sina Däubener, Lea Schönherr, Asja Fischer, Dorothea Kolossa:
Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification. 4661-4665 - David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, Dietrich Klakow:
Privacy Guarantees for De-Identifying Text Transformations. 4666-4670 - Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin:
Detecting Audio Attacks on ASR Systems with Dropout Uncertainty. 4671-4675
Voice Conversion and Adaptation II
- Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda:
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining. 4676-4680 - Hitoshi Suda, Gaku Kotani, Daisuke Saito:
Nonparallel Training of Exemplar-Based Voice Conversion System Using INCA-Based Alignment Technique. 4681-4685 - Chen-Yu Chen, Wei-Zhong Zheng, Syu-Siang Wang, Yu Tsao, Pei-Chun Li, Ying-Hui Lai:
Enhancing Intelligibility of Dysarthric Speech Using Gated Convolutional-Based Voice Conversion System. 4686-4690 - Da-Yi Wu, Yen-Hao Chen, Hung-yi Lee:
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture. 4691-4695 - Seung Won Park, Doo-young Kim, Myun-chul Joe:
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion Without Parallel Data. 4696-4700 - Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang:
Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis. 4701-4705 - Zheng Lian, Zhengqi Wen, Xinyong Zhou, Songbai Pu, Shengkai Zhang, Jianhua Tao:
ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data. 4706-4710 - Shahan Nercessian:
Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals. 4711-4715 - Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao:
Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks. 4716-4720 - Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng:
Transferring Source Style in Non-Parallel Voice Conversion. 4721-4725 - Ehab A. AlBadawy, Siwei Lyu:
Voice Conversion Using Speech-to-Speech Neuro-Style Transfer. 4726-4730
Multilingual and Code-Switched ASR
- Changhan Wang, Juan Miguel Pino, Jiatao Gu:
Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation. 4731-4735 - Samuel Thomas, Kartik Audhkhasi, Brian Kingsbury:
Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings. 4736-4740 - Yun Zhu, Parisa Haghani, Anshuman Tripathi, Bhuvana Ramabhadran, Brian Farris, Hainan Xu, Han Lu, Hasim Sak, Isabel Leal, Neeraj Gaur, Pedro J. Moreno, Qian Zhang:
Multilingual Speech Recognition with Self-Attention Structured Parameterization. 4741-4745 - Srikanth R. Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlícek, Hervé Bourlard, Daniel Povey:
Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems. 4746-4750 - Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Y. Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert:
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters. 4751-4755 - Hardik B. Sailor, Thomas Hain:
Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages. 4756-4760 - Khyathi Raghavi Chandu, Alan W. Black:
Style Variation as a Vantage Point for Code-Switching. 4761-4765 - Yizhou Lu, Mingkun Huang, Hao Li, Jiaqi Guo, Yanmin Qian:
Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts. 4766-4770 - Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi:
Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS. 4771-4775 - Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell:
Towards Context-Aware End-to-End Code-Switching Speech Recognition. 4776-4780
Speech and Voice Disorders
- Tuan Dinh, Alexander Kain, Robin Samlan, Beiming Cao, Jun Wang:
Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency. 4781-4785 - Han Tong, Hamid R. Sharifzadeh, Ian McLoughlin:
Automatic Assessment of Dysarthric Severity Level Using Audio-Video Cross-Modal Approach in Deep Learning. 4786-4790 - Yuqin Lin, Longbiao Wang, Sheng Li, Jianwu Dang, Chenchen Ding:
Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription. 4791-4795 - Yuki Takashima, Ryoichi Takashima, Tetsuya Takiguchi, Yasuo Ariki:
Dysarthric Speech Recognition Based on Deep Metric Learning. 4796-4800 - Divya Degala, M. V. Achuth Rao, Rahul Krishnamurthy, Pebbili Gopikishore, Veeramani Priyadharshini, Prakash T. K., Prasanta Kumar Ghosh:
Automatic Glottis Detection and Segmentation in Stroboscopic Videos Using Convolutional Networks. 4801-4805 - Yilin Pan, Bahman Mirheidari, Zehai Tu, Ronan O'Malley, Traci Walker, Annalena Venneri, Markus Reuber, Daniel Blackburn, Heidi Christensen:
Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification. 4806-4810 - Neeraj Kumar Sharma, Prashant Krishnan V, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy:
Coswara - A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. 4811-4815 - Hannah P. Rowe, Sarah E. Gutz, Marc F. Maffei, Jordan R. Green:
Acoustic-Based Articulatory Phenotypes of Amyotrophic Lateral Sclerosis and Parkinson's Disease: Towards an Interpretable, Hypothesis-Driven Framework of Motor Control. 4816-4820 - Lubna Alhinti, Stuart P. Cunningham, Heidi Christensen:
Recognising Emotions in Dysarthric Speech Using Typical Speech Data. 4821-4825 - Bence Mark Halpern, Rob van Son, Michiel W. M. van den Brekel, Odette Scharenborg:
Detecting and Analysing Spontaneous Oral Cancer Speech in the Wild. 4826-4830
The Zero Resource Speech Challenge 2020
- Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux:
The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units. 4831-4835 - Benjamin van Niekerk, Leanne Nortje, Herman Kamper:
Vector-Quantized Neural Networks for Acoustic Unit Discovery in the ZeroSpeech 2020 Challenge. 4836-4840 - Karthik Pandia D. S, Anusha Prakash, Mano Ranjith Kumar M., Hema A. Murthy:
Exploration of End-to-End Synthesisers for Zero Resource Speech Challenge 2020. 4841-4845 - Batuhan Gündogdu, Bolaji Yusuf, Mansur Yesilbursa, Murat Saraclar:
Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery. 4846-4850 - Andros Tjandra, Sakriani Sakti, Satoshi Nakamura:
Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge. 4851-4855 - Takashi Morita, Hiroki Koda:
Exploring TTS Without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020). 4856-4860 - Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda:
Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling. 4861-4865 - Mingjie Chen, Thomas Hain:
Unsupervised Acoustic Unit Representation Learning for Voice Conversion Using WaveNet Auto-Encoders. 4866-4870 - Okko Räsänen, María Andrea Cruz Blandón:
Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics. 4871-4875 - Saurabhchand Bhati, Jesús Villalba, Piotr Zelasko, Najim Dehak:
Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery. 4876-4880 - Juliette Millet, Ewan Dunbar:
Perceptimatic: A Human Speech Perception Benchmark for Unsupervised Subword Modelling. 4881-4885 - Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts:
Decoding Imagined, Heard, and Spoken Speech: Classification and Regression of EEG Using a 14-Channel Dry-Contact Mobile Headset. 4886-4890 - Gurunath Reddy M., K. Sreenivasa Rao, Partha Pratim Das:
Glottal Closure Instants Detection from EGG Signal by Classification Approach. 4891-4895 - Hua Li, Fei Chen:
Classify Imaginary Mandarin Tones with Cortical EEG Signals. 4896-4900
LM Adaptation, Lexical Units and Punctuation
- Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura:
Augmenting Images for ASR and TTS Through Single-Loop and Dual-Loop Multimodal Chain Framework. 4901-4905 - Lukasz Augustyniak, Piotr Szymanski, Mikolaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak:
Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings? 4906-4910 - Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff:
Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech. 4911-4915 - Ruizhe Huang, Ke Li, Ashish Arora, Daniel Povey, Sanjeev Khudanpur:
Efficient MDI Adaptation for n-Gram Language Models. 4916-4920 - Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar:
Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus. 4921-4925 - Atsunori Ogawa, Naohiro Tawara, Marc Delcroix:
Language Model Data Augmentation Based on Text Domain Transfer. 4926-4930 - Krzysztof Wolk:
Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach. 4931-4935 - Prabhat Pandey, Volker Leutnant, Simon Wiesler, Jahn Heymann, Daniel Willett:
Improving Speech Recognition of Compound-Rich Languages. 4936-4940 - Simone Wills, Pieter Uys, Charl Johannes van Heerden, Etienne Barnard:
Language Modeling for Speech Analytics in Under-Resourced Languages. 4941-4945
Speech in Health I
- Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller:
An Early Study on Intelligent Analysis of Speech Under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety. 4946-4950 - Alice Baird, Nicholas Cummins, Sebastian Schnieder, Jarek Krajewski, Björn W. Schuller:
An Evaluation of the Effect of Anxiety on Speech - Computational Prediction of Anxiety from Sustained Vowels. 4951-4955 - Ziping Zhao, Qifei Li, Nicholas Cummins, Bin Liu, Haishuai Wang, Jianhua Tao, Björn W. Schuller:
Hybrid Network Feature Extraction for Depression Assessment from Speech. 4956-4960 - Yilin Pan, Bahman Mirheidari, Markus Reuber, Annalena Venneri, Daniel Blackburn, Heidi Christensen:
Improving Detection of Alzheimer's Disease Using Automatic Speech Recognition to Identify High-Quality Segments for More Robust Feature Extraction. 4961-4965 - Amrit Romana, John Bandon, Noelle Carlozzi, Angela Roberts, Emily Mower Provost:
Classification of Manifest Huntington Disease Using Vowel Distortion Measures. 4966-4970 - Sudarsana Reddy Kadiri, Rashmi Kethireddy, Paavo Alku:
Parkinson's Disease Detection from Speech Using Single Frequency Filtering Cepstral Coefficients. 4971-4975 - Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier:
Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer. 4976-4980 - Ajish K. Abraham, M. Pushpavathi, N. Sreedevi, A. Navya, Vikram C. Mathad, S. R. Mahadeva Prasanna:
Spectral Moment and Duration of Burst of Plosives in Speech of Children with Hearing Impairment and Typically Developing Children - A Comparative Study. 4981-4985 - Matthew Perez, Zakaria Aldeneh, Emily Mower Provost:
Aphasic Speech Recognition Using a Mixture of Speech Intelligibility Experts. 4986-4990 - Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard:
Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features. 4991-4995
ASR Neural Network Architectures II — Transformers
- Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer:
Weak-Attention Suppression for Transformer Based Speech Recognition. 4996-5000 - Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen:
Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition. 5001-5005 - Song Li, Lin Li, Qingyang Hong, Lingling Liu:
Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning. 5006-5010 - Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux:
Transformer-Based Long-Context End-to-End Speech Recognition. 5011-5015 - Xinyuan Zhou, Grandee Lee, Emre Yilmaz, Yanhua Long, Jiaen Liang, Haizhou Li:
Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR. 5016-5020 - Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq R. Joty, Eng Siong Chng, Bin Ma:
Universal Speech Transformer. 5021-5025 - Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen:
Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition. 5026-5030 - Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq R. Joty, Eng Siong Chng, Bin Ma:
Cross Attention with Monotonic Alignment for Speech Transformer. 5031-5035 - Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang:
Conformer: Convolution-augmented Transformer for Speech Recognition. 5036-5040 - Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong:
Exploring Transformers for Large-Scale Speech Recognition. 5041-5045
Spatial Audio
- Masahito Togami, Robin Scheibler:
Sparseness-Aware DOA Estimation with Majorization Minimization. 5046-5050 - Xiaoli Zhong, Hao Song, Xuejie Liu:
Spatial Resolution of Early Reflection for Speech and White Noise. 5051-5055 - Aditya Raikar, Karan Nathwani, Ashish Panda, Sunil Kumar Kopparapu:
Effect of Microphone Position Measurement Error on RIR and its Impact on Speech Intelligibility and Quality. 5056-5060 - Shuwen Deng, Wolfgang Mack, Emanuël A. P. Habets:
Online Blind Reverberation Time Estimation Using CRNNs. 5061-5065 - Wolfgang Mack, Shuwen Deng, Emanuël A. P. Habets:
Single-Channel Blind Direct-to-Reverberation Ratio Estimation Using Masking. 5066-5070 - Hanan Beit-On, Vladimir Tourbabin, Boaz Rafaely:
The Importance of Time-Frequency Averaging for Binaural Speaker Localization in Reverberant Environments. 5071-5075 - Yonggang Hu, Prasanga N. Samarasinghe, Thushara D. Abhayapala:
Acoustic Signal Enhancement Using Relative Harmonic Coefficients: Spherical Harmonics Domain Approach. 5076-5080 - B. H. V. S. Narayana Murthy, J. V. Satyanarayana, Nivedita Chennupati, B. Yegnanarayana:
Instantaneous Time Delay Estimation of Broadband Signals. 5081-5085 - Hao Wang, Kai Chen, Jing Lu:
U-Net Based Direct-Path Dominance Test for Robust Direction-of-Arrival Estimation. 5086-5090 - Wei Xue, Ying Tong, Chao Zhang, Guohong Ding, Xiaodong He, Bowen Zhou:
Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning. 5091-5095