liutaocode/TTS-arxiv-daily

Updated on 2025.12.25

Usage instructions: here

This page is modified from here

Table of Contents
  1. TTS

TTS

Publish Date Title Authors PDF Code
2025-12-23 TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation Ji-Hoon Kim et.al. 2512.20296 null
2025-12-23 Fun-Audio-Chat Technical Report Qian Chen et.al. 2512.20156 null
2025-12-22 JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis Fan Yu et.al. 2512.19090 null
2025-12-21 Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform Yichuan Zhang et.al. 2512.18791 null
2025-12-21 Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis Pengchao Feng et.al. 2512.18699 link
2025-12-19 Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability Tingxiao Zhou et.al. 2512.17356 null
2025-12-19 Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track June Young Yi et.al. 2512.17293 null
2025-12-18 Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs Sara Papi et.al. 2512.16378 null
2025-12-16 Adapting Speech Language Model to Singing Voice Synthesis Yiwen Zhao et.al. 2512.14657 null
2025-12-16 Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty Yiwen Zhao et.al. 2512.14653 null
2025-12-16 GLM-TTS Technical Report Jiayan Cui et.al. 2512.14291 null
2025-12-18 A stylometric analysis of speaker attribution from speech transcripts Cristina Aggazzotti et.al. 2512.13667 null
2025-12-15 Reproducing and Dissecting Denoising Language Models for Speech Recognition Dorian Koch et.al. 2512.13576 null
2025-12-18 DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec Tao Li et.al. 2512.13251 null
2025-12-11 CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences Yiyang Wang et.al. 2512.10918 null
2025-12-10 DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance Kang Yin et.al. 2512.09504 null
2025-12-09 LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge Jinyoung Park et.al. 2512.09000 null
2025-12-08 Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS Mahta Fetrat et.al. 2512.08006 null
2025-12-08 MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection Xueping Zhang et.al. 2512.07352 null
2025-12-06 Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction Kush Revankar et.al. 2512.06485 null
2025-12-05 SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures Panuthep Tasawong et.al. 2512.05501 null
2025-12-05 Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice Rachel Poonsiriwong et.al. 2512.05397 null
2025-12-04 HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages Bi-Cheng Yan et.al. 2512.04964 link
2025-12-04 TripleC Learning and Lightweight Speech Enhancement for Multi-Condition Target Speech Extraction Ziling Huang et.al. 2512.04945 null
2025-12-04 YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance Junjie Zheng et.al. 2512.04779 null
2025-12-04 Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild Yigui Feng et.al. 2512.04728 null
2025-12-04 M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis Xiaopeng Wang et.al. 2512.04720 null
2025-12-04 Large Speech Model Enabled Semantic Communication Yun Tian et.al. 2512.04711 null
2025-12-04 Limit cycles for speech Adamantios I. Gafos et.al. 2512.04642 null
2025-12-04 RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS Cong Wang et.al. 2512.04552 null
2025-12-04 Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention Cong Wang et.al. 2512.04551 null
2025-12-03 Head, posture, and full-body gestures in interactive communication Ľuboš Hládek et.al. 2512.03636 null
2025-12-03 A Convolutional Framework for Mapping Imagined Auditory MEG into Listened Brain Responses Maryam Maghsoudi et.al. 2512.03458 null
2025-12-02 Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR Mohan Shi et.al. 2512.03301 null
2025-12-02 How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy Natalia Ponomareva et.al. 2512.03238 null
2025-12-02 MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation Youxin Pang et.al. 2512.03034 null
2025-12-02 Perceptual evaluation of Acoustic Level of Detail in Virtual Acoustic Environments Stefan Fichna et.al. 2512.02891 null
2025-12-02 BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion Sai Koneru et.al. 2512.02817 null
2025-12-02 Reasoning-Aware Multimodal Fusion for Hateful Video Detection Shuonan Yang et.al. 2512.02743 null
2025-12-02 Hear What Matters! Text-conditioned Selective Video-to-Audio Generation Junwon Lee et.al. 2512.02650 null
2025-12-02 Spoken Conversational Agents with Large Language Models Chao-Han Huck Yang et.al. 2512.02593 null
2025-12-02 Co-speech Gesture Video Generation via Motion-Based Graph Retrieval Yafei Song et.al. 2512.02576 null
2025-12-02 Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation Xueyan Li et.al. 2512.02523 null
2025-12-02 VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables Lixing He et.al. 2512.02515 null
2025-12-01 Swivuriso: The South African Next Voices Multilingual Speech Dataset Vukosi Marivate et.al. 2512.02201 null
2025-12-01 Cross-Lingual Interleaving for Speech Language Models Adel Moumen et.al. 2512.01865 null
2025-12-01 MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark Yuezhang Peng et.al. 2512.01603 link
2025-12-01 MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages Yexing Du et.al. 2512.01512 null
2025-12-01 Model-Based Clustering of Functional Data Via Random Projection Ensembles Matteo Mori et.al. 2512.01450 null
2025-12-01 EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans Yingjie Zhou et.al. 2512.01340 null
2025-12-01 fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment Chunzheng Zhu et.al. 2512.01189 null
2025-11-30 Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept Himanshi Lalwani et.al. 2512.01105 null
2025-11-30 Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis Lars Nippert et.al. 2512.00937 null
2025-11-29 STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition Siyu Wang et.al. 2512.00451 null
2025-11-28 OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion Sai Koneru et.al. 2512.00234 null
2025-11-28 CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation Fengyi Fang et.al. 2511.22863 null
2025-11-27 Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration Kanchon Gharami et.al. 2511.22769 null
2025-11-27 PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning Jiatong Shi et.al. 2511.22687 null
2025-11-27 Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking Katia Vendrame et.al. 2511.22503 null
2025-11-27 Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition Maheswar Bora et.al. 2511.22443 null
2025-11-27 GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis Teysir Baoueb et.al. 2511.22293 null
2025-11-27 VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task Yuyue Wang et.al. 2511.22229 null
2025-11-27 Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation Joel Alberto Santos et.al. 2511.22025 null
2025-11-26 Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection Bruno Padovese et.al. 2511.21872 null
2025-11-26 Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation Lina Conti et.al. 2511.21517 null
2025-11-26 TSGM: Regular and Irregular Time-series Generation using Score-based Generative Models Haksoo Lim et.al. 2511.21335 null
2025-11-26 Acoustic neural networks: Identifying design principles and exploring physical feasibility Ivan Kalthoff et.al. 2511.21313 null
2025-11-26 Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale Yicheng Zhong et.al. 2511.21270 null
2025-11-26 CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation Jionghao Han et.al. 2511.21045 null
2025-11-26 RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data Zhisheng Zheng et.al. 2511.20974 null
2025-11-26 Towards Audio Token Compression in Large Audio Language Models Saurabhchand Bhati et.al. 2511.20973 null
2025-11-26 SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications Jionghao Han et.al. 2511.20972 null
2025-11-25 Continual Audio Deepfake Detection via Universal Adversarial Perturbation Wangjie Li et.al. 2511.19974 null
2025-11-25 Towards Edge General Intelligence: Knowledge Distillation for Mobile Agentic AI Yuxuan Wu et.al. 2511.19947 null
2025-11-25 It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models Xiangyu Zhao et.al. 2511.19877 null
2025-11-24 Evaluating Objective Speech Quality Metrics for Neural Audio Codecs Luca A. Lanzendörfer et.al. 2511.19734 null
2025-11-24 A Layered Protocol Architecture for the Internet of Agents Charles Fleming et.al. 2511.19699 null
2025-11-24 Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization Ellie L. Zhang et.al. 2511.19275 null
2025-11-25 PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation Huadai Liu et.al. 2511.18833 null
2025-11-24 Context-Aware Whisper for Arabic ASR Under Linguistic Varieties Bashar Talafha et.al. 2511.18774 null
2025-11-24 AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation Omar Garib et.al. 2511.18718 null
2025-11-23 The Locally Deployable Virtual Doctor: LLM Based Human Interface for Automated Anamnesis and Database Conversion Jan Benedikt Ruhland et.al. 2511.18632 null
2025-11-23 InstructAudio: Unified speech and music generation with natural language instruction Chunyu Qiang et.al. 2511.18487 null
2025-11-23 A Multimodal Conversational Agent for Tabular Data Analysis Mohammad Nour Al Awad et.al. 2511.18405 null
2025-11-23 Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection Syed Mohaiminul Hoque et.al. 2511.18324 null
2025-11-23 MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding Mengchun Zhang et.al. 2511.18294 null
2025-11-22 A superpersuasive autonomous policy debating system Allen Roush et.al. 2511.17854 null
2025-11-21 Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition Ayhan Kucukmanisa et.al. 2511.17477 null
2025-11-21 AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice Guilherme Coelho et.al. 2511.17425 null
2025-11-21 Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM Chiori Hori et.al. 2511.17335 null
2025-11-21 Investigating self-supervised representations for audio-visual deepfake detection Dragos-Alexandru Boldisor et.al. 2511.17181 null
2025-11-20 Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation Wei-Cheng Tseng et.al. 2511.16757 null
2025-11-20 Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs Wei-Cheng Tseng et.al. 2511.16639 null
2025-11-21 WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue Zachary Ellis et.al. 2511.16544 null
2025-11-20 SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise Rui Sang et.al. 2511.16114 null
2025-11-19 Step-Audio-R1 Technical Report Fei Tian et.al. 2511.15848 null
2025-11-19 A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification Mohit Sharma et.al. 2511.15766 null
2025-11-19 PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback Sirui Chen et.al. 2511.15253 null
2025-11-19 Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding Mingyue Huo et.al. 2511.15145 null
2025-11-19 Aligning Generative Music AI with Human Preferences: Methods and Challenges Dorien Herremans et.al. 2511.15038 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants Mingkun Yu et.al. 2511.14852 null
2025-11-18 Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim et.al. 2511.14824 null
2025-11-18 Ground Truth Generation for Multilingual Historical NLP using LLMs Clovis Gladstone et.al. 2511.14688 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-18 AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR Gabrial Zencha Ashungafac et.al. 2511.14255 null
2025-11-18 Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning Rui Liu et.al. 2511.14249 null
2025-11-18 StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Yifan Yang et.al. 2511.14223 null
2025-11-18 FxSearcher: gradient-free text-driven audio transformation Hojoon Ki et.al. 2511.14138 null
2025-11-17 Human-centric Maintenance Process Through Integration of AI, Speech, and AR Parul Khanna et.al. 2511.13918 null
2025-11-17 Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video Filippo Cenacchi, Longbing Cao et.al. 2511.13802 null
2025-11-17 PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement Xiaobin Rong et.al. 2511.13300 null
2025-11-17 Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms Patrick Parschan et.al. 2511.13238 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis Zaara Zabeen Arpa et.al. 2511.13159 link
2025-11-17 A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning Liuyi Jin et.al. 2511.13078 null
2025-11-17 CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models Mehrab Mustafy Rahman et.al. 2511.12964 null
2025-11-16 Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data Sina Rashidi et.al. 2511.12690 null
2025-11-16 Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans Hongbin Huang et.al. 2511.12662 null
2025-11-16 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 null
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-14 Proactive Hearing Assistants that Isolate Egocentric Conversations Guilin Hu et.al. 2511.11473 link
2025-11-14 Language-Aided State Estimation Yuki Miyoshi et.al. 2511.11285 null
2025-11-14 Analysing Personal Attacks in U.S. Presidential Debates Ruban Goyal et.al. 2511.11108 null
2025-11-14 CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation Crystal Min Hui Poon et.al. 2511.11104 null
2025-11-14 CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding Yifan Zhuang et.al. 2511.10935 null
2025-11-14 Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio Guangke Chen et.al. 2511.10913 null
2025-11-13 Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces Farhan Sheth et.al. 2511.10793 null
2025-11-13 Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning Girish et.al. 2511.10790 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-13 VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction Yuhao Wang et.al. 2511.10232 null
2025-11-13 Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard Yudong Yang et.al. 2511.10222 null
2025-11-13 Towards Leveraging Sequential Structure in Animal Vocalizations Eklavya Sarkar et.al. 2511.10190 link
2025-11-13 FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features Wenyu Wang et.al. 2511.10112 null
2025-11-13 Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints Xiangyue Zhang et.al. 2511.10076 null
2025-11-13 Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS Haoyu Li et.al. 2511.09995 null
2025-11-13 MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection Pritish Sahu et.al. 2511.09918 null
2025-11-12 Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages Omnilingual ASR team et.al. 2511.09690 null
2025-11-12 End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering Jiliang Hu et.al. 2511.09282 null
2025-11-10 Generating Novel and Realistic Speakers for Voice Conversion Meiying Melissa Chen et.al. 2511.07135 null
2025-11-10 E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Zhisheng Zhang et.al. 2511.07099 link
2025-11-09 IDMap: A Pseudo-Speaker Generator Framework Based on Speaker Identity Index to Vector Mapping Zeyan Liu et.al. 2511.06246 null
2025-11-07 Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice Frederik Rautenberg et.al. 2511.05143 null
2025-11-05 Step-Audio-EditX Technical Report Chao Yan et.al. 2511.03601 null
2025-11-05 PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech Michel Wong et.al. 2511.03080 null
2025-11-04 Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision Kaimeng Jia et.al. 2511.02270 null
2025-11-03 Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach Cedric Chan et.al. 2511.02104 null
2025-10-31 Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication Deok-Seon Kim et.al. 2510.27247 null
2025-10-27 SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution Dharma Teja Donepudi et.al. 2510.25178 null
2025-10-28 Levée d'ambiguïtés par grammaires locales Eric G. C. Laporte et.al. 2510.24530 null
2025-10-28 Bayesian Speech synthesizers Can Learn from Multiple Teachers Ziyang Zhang et.al. 2510.24372 null
2025-10-28 emg2speech: synthesizing speech from electromyography using self-supervised speech models Harshavardhana T. Gowda et.al. 2510.23969 null
2025-10-28 SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity Hanke Xie et.al. 2510.23541 null
2025-10-26 UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models Wenming Tu et.al. 2510.22588 null
2025-10-24 StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks Jingyue Huang et.al. 2510.21685 null
2025-10-23 Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator Hualei Wang et.al. 2510.20210 null
2025-10-23 SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance Haowei Lou et.al. 2510.20113 null
2025-10-22 Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent Yangshijie Zhang et.al. 2510.19641 null
2025-10-22 Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment Maureen de Seyssel et.al. 2510.19509 null
2025-10-22 EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection Tong Zhang et.al. 2510.19414 null
2025-10-21 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction Qianheng Xu et.al. 2510.18938 null
2025-10-21 KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers Mohd Ruhul Ameen et.al. 2510.18355 null
2025-10-21 ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation Haowei Lou et.al. 2510.18308 null
2025-10-19 U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation Xusheng Yang et.al. 2510.16718 null
2025-10-18 Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages Pacome Simon Mbonimpa et.al. 2510.16497 null
2025-10-22 VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition Kye Shimizu et.al. 2510.16192 null
2025-10-16 RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF Qing Yang et.al. 2510.14628 null
2025-10-15 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue Wenwen Tong et.al. 2510.13747 null
2025-10-15 Closing the Gap Between Text and Speech Understanding in LLMs Santiago Cuervo et.al. 2510.13632 null
2025-10-15 Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models Yizhou Peng et.al. 2510.13293 null
2025-10-14 Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs Xinlu He et.al. 2510.12995 null
2025-10-14 Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation Greta Damo et.al. 2510.12316 null
2025-10-15 DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation Yakun Song et.al. 2510.12210 null
2025-10-13 BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis Jingyuan Xing et.al. 2510.11646 null
2025-10-13 Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker Cheng Gong et.al. 2510.11124 null
2025-10-14 ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis Mohammad Javad Ranjbar Kalahroodi et.al. 2510.10774 null
2025-10-14 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 null
2025-10-10 Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models Donghang Wu et.al. 2510.09592 null
2025-10-10 O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion Huu Tuong Tu et.al. 2510.09061 null
2025-10-10 DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment Zongcai Du et.al. 2510.09016 null
2025-10-09 DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching Hanke Xie et.al. 2510.08373 null
2025-10-09 IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation Wei Wang et.al. 2510.07979 null
2025-10-08 Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis Zhu Li et.al. 2510.07096 null
2025-10-08 Towards Responsible Evaluation for Text-to-Speech Yifan Yang et.al. 2510.06927 null
2025-10-08 XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection Phuong Tuan Dat et.al. 2510.06706 null
2025-10-07 ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning Tao Zhu et.al. 2510.05984 null
2025-10-07 Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech Rikuto Kotoge et.al. 2510.05799 null
2025-10-07 Investigation of perception inconsistency in speaker embedding for asynchronous voice anonymization Rui Wang et.al. 2510.05718 null
2025-10-07 Sparse deepfake detection promotes better disentanglement Antoine Teissier et.al. 2510.05696 null
2025-10-07 Teaching Machines to Speak Using Articulatory Control Akshay Anand et.al. 2510.05619 null
2025-10-06 Paper2Video: Automatic Video Generation from Scientific Papers Zeyu Zhu et.al. 2510.05096 null
2025-10-06 Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba Baher Mohammad et.al. 2510.04738 null
2025-10-06 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models Wenhao Guan et.al. 2510.04593 link
2025-10-05 GDiffuSE: Diffusion-based speech enhancement with noise model guidance Efrayim Yanir et.al. 2510.04157 null
2025-10-05 A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation Ananya Raghu et.al. 2510.03986 null
2025-10-07 Synthetic Audio Forensics Evaluation (SAFE) Challenge Kirill Trapeznikov et.al. 2510.03387 null
2025-10-03 Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech Hieu-Nghia Huynh-Nguyen et.al. 2510.02848 null
2025-10-02 Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement Jianing Yang et.al. 2510.01722 link
2025-10-01 From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling Yifei Cao et.al. 2510.00743 null
2025-10-02 MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance Xingjian Zhao et.al. 2510.00499 null
2025-09-30 BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs Yue Wang et.al. 2509.26514 null
2025-09-30 HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis Ziyu Zhang et.al. 2509.25842 null
2025-09-30 LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning Kang Yang et.al. 2509.25670 null
2025-09-29 Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization Jiacheng Shi et.al. 2509.25416 null
2025-09-29 MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech Chengyao Wang et.al. 2509.25131 null
2025-09-30 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning Xin Cheng et.al. 2509.24773 null
2025-09-29 VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning Yixuan Zhou et.al. 2509.24650 null
2025-09-29 Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis Tianrui Wang et.al. 2509.24629 null
2025-09-29 ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark Yun Chen et.al. 2509.24570 null
2025-09-29 UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities Xuenan Xu et.al. 2509.24391 null
2025-09-28 Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment Pu Huang et.al. 2509.23618 null
2025-09-27 BFA: Real-time Multilingual Text-to-speech Forced Alignment Abdul Rehman et.al. 2509.23147 null
2025-09-26 ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection Mohamed Maged et.al. 2509.22808 null
2025-09-26 Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis Zhikang Niu et.al. 2509.22167 null
2025-09-26 Speaker Anonymisation for Speech-based Suicide Risk Detection Ziyun Cui et.al. 2509.22148 null
2025-09-26 Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling Junjie Cao et.al. 2509.22062 null
2025-09-26 Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization Shehzeen Hussain et.al. 2509.21718 null
2025-09-25 UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice Sitong Cheng et.al. 2509.21144 link
2025-09-27 i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents Anupam Purwar et.al. 2509.20971 null
2025-09-26 SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS Tan Dat Nguyen et.al. 2509.20802 null
2025-09-24 Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens Ismail Rasim Ulgen et.al. 2509.20485 null
2025-09-24 OLaPh: Optimal Language Phonemizer Johannes Wirth et.al. 2509.20086 null
2025-09-25 Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration Yifan Yang et.al. 2509.19928 null
2025-09-24 CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance Junchuan Zhao et.al. 2509.19883 null
2025-09-24 Eliminating stability hallucinations in llm-based tts models via attention guidance ShiMing Wang et.al. 2509.19852 null
2025-09-24 Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation Yang Cui et.al. 2509.19812 null
2025-09-24 PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs Pei Zhang et.al. 2509.19745 null
2025-09-24 Selective Classifier-free Guidance for Zero-shot Text-to-speech John Zheng et.al. 2509.19668 null
2025-09-23 Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation Roy Fejgin et.al. 2509.19592 null
2025-09-23 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS Sihang Nie et.al. 2509.19001 null
2025-09-23 Direct Preference Optimization for Speech Autoregressive Diffusion Models Zhijun Liu et.al. 2509.18928 null
2025-09-23 Group Relative Policy Optimization for Text-to-Speech with Large Language Models Chang Liu et.al. 2509.18798 null
2025-09-23 Explore the Reinforcement Learning for the LLM based ASR and TTS system Changfeng Gao et.al. 2509.18569 null
2025-09-23 No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS Seungyoun Shin et.al. 2509.18531 null
2025-09-22 Discrete-time diffusion-like models for speech synthesis Xiaozhou Tan et.al. 2509.18470 null
2025-09-22 TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation Yutong Liu et.al. 2509.18060 null
2025-09-22 Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech Zirui Li et.al. 2509.17988 null
2025-09-22 Qwen3-Omni Technical Report Jin Xu et.al. 2509.17765 null
2025-09-22 Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook Min Liu et.al. 2509.17516 null
2025-09-21 Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing Wataru Nakata et.al. 2509.17052 null
2025-09-21 Bridging the gap between training and inference in LM-based TTS models Ruonan Zhang et.al. 2509.17021 null
2025-09-21 MBCodec: Thorough disentangle for high-fidelity audio compression Ruonan Zhang et.al. 2509.17006 null
2025-09-19 Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation Qi Wang et.al. 2509.16010 null
2025-09-19 VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency Nikita Torgashov et.al. 2509.15969 null
2025-09-19 Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS Ziqi Dai et.al. 2509.15845 null
2025-09-19 LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control Junki Ohmura et.al. 2509.15626 null
2025-09-19 Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech Xinlei Niu et.al. 2509.15492 null
2025-09-18 A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication Ryan Collette et.al. 2509.15462 null
2025-09-18 Frustratingly Easy Data Augmentation for Low-Resource ASR Katsumi Ibaraki et.al. 2509.15373 null
2025-09-18 Real-Time Streaming Mel Vocoding with Generative Flow Matching Simon Welker et.al. 2509.15085 null
2025-09-20 SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding Bingsong Bai et.al. 2509.14946 link
2025-09-18 MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis Keyu An et.al. 2509.14784 null
2025-09-18 DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis Ye-Xin Lu et.al. 2509.14684 null
2025-09-18 Stochastic Clock Attention for Aligning Continuous and Ordered Sequences Hyungjoon Soh et.al. 2509.14678 null
2025-09-18 SpeechMLC: Speech Multi-label Classification Miseul Kim et.al. 2509.14677 null
2025-09-18 Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation Miseul Kim et.al. 2509.14632 null
2025-09-18 Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis Qingyu Liu et.al. 2509.14579 null
2025-09-17 CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset Brian Yan et.al. 2509.14161 null
2025-09-18 Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems Yi-Cheng Lin et.al. 2509.13989 null
2025-09-16 MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement Jingyu Li et.al. 2509.13068 null
2025-09-16 A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis Javeria Amir et.al. 2509.12831 null
2025-09-15 Preservation of Language Understanding Capabilities in Speech-aware Large Language Models Marek Kubis et.al. 2509.12171 null
2025-09-14 FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs Md Mubtasim Ahasan et.al. 2509.11425 null
2025-09-14 Length-Aware Rotary Position Embedding for Text-Speech Alignment Hyeongju Kim et.al. 2509.11084 null
2025-09-12 WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers Akshat Pandey et.al. 2509.10452 null
2025-09-12 Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps Xin Wang et.al. 2509.10086 null
2025-09-11 DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration Yanru Huo et.al. 2509.09748 null
2025-09-09 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions Jun Zhan et.al. 2509.09716 null
2025-09-12 DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech Ngoc-Son Nguyen et.al. 2509.09631 null
2025-09-11 HISPASpoof: A New Dataset For Spanish Speech Forensics Maria Risques et.al. 2509.09155 null
2025-09-10 Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities Jarvis Haupt et.al. 2509.08950 null
2025-09-10 Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling Neil Zeghidour et.al. 2509.08753 null
2025-09-10 Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching Siratish Sakpiboonchit et.al. 2509.08696 null
2025-09-10 Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition Jing-Tong Tzeng et.al. 2509.08470 null
2025-09-09 Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis Yejin Jeon et.al. 2509.07376 null
2025-09-09 When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection Bin Hu et.al. 2509.07323 null
2025-09-08 Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence Yerin Ryu et.al. 2509.07038 null
2025-09-08 ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data Vladislav Stankov et.al. 2509.06675 null
2025-09-09 Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake Liping Chen et.al. 2509.06361 null
2025-09-07 UniVerse-1: Unified Audio-Video Generation via Stitching of Experts Duomin Wang et.al. 2509.06155 null
2025-09-07 Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis Zhenqi Jia et.al. 2509.06074 null
2025-09-06 LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization Luis Felipe Chary et.al. 2509.05863 null
2025-09-05 Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales Krittanon Kaewtawee et.al. 2509.04871 null
2025-09-04 Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding Rui-Chen Zheng et.al. 2509.04685 null
2025-09-04 DarkStream: real-time speech anonymization with low latency Waris Quamer et.al. 2509.04667 null
2025-09-04 AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds Qizhou Wang et.al. 2509.04345 null
2025-09-04 Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis Zhitong Zhou et.al. 2509.04093 null
2025-09-04 LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis Gaspard Michel et.al. 2509.04072 null
2025-09-03 Multi-level SSL Feature Gating for Audio Deepfake Detection Hoan My Tran et.al. 2509.03409 null
2025-09-03 LatPhon: Lightweight Multilingual G2P for Romance Languages and English Luis Felipe Chary et.al. 2509.03300 null
2025-09-03 Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings Dyah A. M. G. Wisnu et.al. 2509.03292 null
2025-09-03 AIVA: An AI-based Virtual Companion for Emotion-aware Interaction Chenxi Li et.al. 2509.03212 null
2025-09-04 FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot Kun Xie et.al. 2509.02020 null
2025-09-01 MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model Joonyong Park et.al. 2509.01391 null
2025-09-01 The AudioMOS Challenge 2025 Wen-Chin Huang et.al. 2509.01336 null
2025-09-01 An AI-Based Shopping Assistant System to Support the Visually Impaired Larissa R. de S. Shibata et.al. 2509.01246 null
2025-09-01 SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation Chenyang Le et.al. 2509.01200 null
2025-08-31 MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech Kangxiang Xia et.al. 2509.00685 null
2025-08-29 Towards Improved Speech Recognition through Optimized Synthetic Data Generation Yanis Perrin et.al. 2508.21631 null
2025-08-28 MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening Yongqi Shao et.al. 2508.20513 null
2025-08-26 Interpolating Speaker Identities in Embedding Space for Data Expansion Tianchi Liu et.al. 2508.19210 null
2025-08-26 CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis Chun Yat Wu et.al. 2508.19098 null
2025-08-25 Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters Alessio Falai et.al. 2508.18006 null
2025-08-27 Vocoder-Projected Feature Discriminator Takuhiro Kaneko et.al. 2508.17874 null
2025-09-02 Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation Changsong Liu et.al. 2508.17796 null
2025-08-25 ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks Yuanda Wang et.al. 2508.17660 null
2025-08-26 EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems Jingwen Liu et.al. 2508.17623 null
2025-08-24 Improving French Synthetic Speech Quality via SSML Prosody Control Nassima Ould Ouali et.al. 2508.17494 null
2025-08-23 RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer Neeraj Matiyali et.al. 2508.17031 null
2025-08-23 WildSpoof Challenge Evaluation Plan Yihan Wu et.al. 2508.16858 null
2025-08-22 TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling Yuancheng Wang et.al. 2508.16790 link
2025-08-22 Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation Weiting Tan et.al. 2508.16188 null
2025-08-21 QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection Zhiyu Wu et.al. 2508.15931 null
2025-08-21 Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization Liping Chen et.al. 2508.15565 null
2025-08-24 Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets Chenlin Liu et.al. 2508.15442 null
2025-08-21 UniCoM: A Universal Code-Switching Speech Generator Sangmin Lee et.al. 2508.15244 link
2025-08-20 Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization Rui Wang et.al. 2508.14947 null
2025-08-20 Long-Context Speech Synthesis with Context-Aware Memory Zhipeng Li et.al. 2508.14713 null
2025-08-20 Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement Heitor R. Guimarães et.al. 2508.14709 null
2025-08-20 DiffIER: Optimizing Diffusion Models with Iterative Error Reduction Ao Chen et.al. 2508.13628 null
2025-08-19 Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM Dariia Puhach et.al. 2508.13603 null
2025-08-18 A Surveillance Based Interactive Robot Kshitij Kavimandan et.al. 2508.13319 null
2025-08-18 Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis Zhu Li et.al. 2508.13028 null
2025-08-18 Real-Time Sign Language Gestures to Speech Transcription using Deep Learning Brandone Fonya et.al. 2508.12713 null
2025-08-19 FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts Qingliang Meng et.al. 2508.12001 null
2025-08-16 SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System Truong Thanh Hung Nguyen et.al. 2508.11873 null
2025-08-15 MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts Heyang Xue et.al. 2508.11326 null
2025-08-15 EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens Joonyong Park et.al. 2508.11273 null
2025-08-14 Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform Yuankun Xie et.al. 2508.10559 link
2025-08-14 Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning Yejin Jeon et.al. 2508.10412 null
2025-08-14 Towards Frame-level Quality Predictions of Synthetic Speech Michael Kuhlmann et.al. 2508.10374 null
2025-08-08 LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data Ali Zolnour et.al. 2508.10027 null
2025-08-15 Training-Free Multimodal Large Language Model Orchestration Tianyu Xie et.al. 2508.10016 null
2025-08-13 Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions Tina Raissi et.al. 2508.09868 null
2025-08-13 UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech Shuhei Kato et.al. 2508.09767 null
2025-08-13 $\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation Boyu Zhu et.al. 2508.09702 null
2025-08-12 Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative Xi Xuan et.al. 2508.09294 null
2025-08-13 DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models Yuanyuan Wang et.al. 2508.08961 null
2025-08-12 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems Chien-Chun Wang et.al. 2508.08957 null
2025-08-15 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs Xiaoxue Gao et.al. 2508.08715 null
2025-08-12 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization Chaoqun Cui et.al. 2508.08550 null
2025-08-11 Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder? Hui-Peng Du et.al. 2508.07711 null
2025-08-10 Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance Wenqian Cui et.al. 2508.07375 link
2025-08-10 KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features Ivan Kukanov et.al. 2508.07337 null
2025-08-12 XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation Tianlun Zuo et.al. 2508.07302 null
2025-08-09 Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody Jinsung Yoon et.al. 2508.06890 null
2025-08-09 Text to Speech System for Meitei Mayek Script Gangular Singh Irengbam et.al. 2508.06870 null
2025-08-08 ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls Sanket Badhe et.al. 2508.06457 null
2025-08-08 Improved Dysarthric Speech to Text Conversion via TTS Personalization Péter Mihajlik et.al. 2508.06391 null
2025-08-08 Large Language Model Data Generation for Enhanced Intent Recognition in German Speech Theresa Pekarek Rosin et.al. 2508.06277 null
2025-08-08 Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis Wenjie Tian et.al. 2508.06262 null
2025-08-07 A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding Runchuan Ye et.al. 2508.05385 null
2025-08-15 Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS M Anuprabha et.al. 2508.05102 null
2025-08-06 Root Cause Analysis Training for Healthcare Professionals With AI-Powered Virtual Simulation: A Proof-of-Concept Yuqi Hu et.al. 2508.04904 null
2025-08-05 Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS Vignesh Ethiraj et.al. 2508.04721 null
2025-08-07 UniTalker: Conversational Speech-Visual Synthesis Yifan Hu et.al. 2508.04585 null
2025-08-06 NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations Huan Liao et.al. 2508.04195 null
2025-08-06 Multilingual Source Tracing of Speech Deepfakes: A First Benchmark Xi Xuan et.al. 2508.04143 null
2025-08-06 Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech Jingyuan Xing et.al. 2508.04141 null
2025-08-06 EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering Tianxin Xie et.al. 2508.03543 null
2025-08-05 MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction Mohammed Salah Al-Radhi et.al. 2508.03166 link
2025-08-05 Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback Jingyi Chen et.al. 2508.03123 null
2025-08-14 Marco-Voice Technical Report Fengping Tian et.al. 2508.02038 null
2025-08-03 Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder Runxuan Yang et.al. 2508.01796 null
2025-08-03 Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe Tiantian Feng et.al. 2508.01691 null
2025-08-01 Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities Wen-Chin Huang et.al. 2508.00317 null
2025-08-01 Next Tokens Denoising for Speech Synthesis Yanqing Liu et.al. 2507.22746 null
2025-07-30 Adaptive Duration Model for Text Speech Alignment Junjie Cao et.al. 2507.22612 null
2025-07-29 SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods Wen Huang et.al. 2507.21463 null
2025-07-23 WaveVerify: A Novel Audio Watermarking Framework for Media Authentication and Combatting Deepfakes Aditya Pujari et.al. 2507.21150 null
2025-07-22 TTS-1 Technical Report Oleg Atamanenko et.al. 2507.21138 null
2025-07-29 JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 Xinhan Di et.al. 2507.20987 null
2025-07-28 AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations Zhixi Cai et.al. 2507.20579 null
2025-07-27 Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech Taesoo Kim et.al. 2507.20140 null
2025-07-26 Defining ethically sourced code generation Zhuolin Xu et.al. 2507.19743 null
2025-07-25 GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness Hongjie Chen et.al. 2507.18119 null
2025-07-24 Synthetic Data Generation for Phrase Break Prediction with Large Language Model Hoyeon Lee et.al. 2507.18044 null
2025-07-23 AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer Danny D. Leybzon et.al. 2507.17718 null
2025-07-23 Synthetic Voice Data for Automatic Speech Recognition in African Languages Brian DeRenzi et.al. 2507.17578 null
2025-07-23 BoSS: Beyond-Semantic Speech Qing Wang et.al. 2507.17563 null
2025-07-27 Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice Shanbo Cheng et.al. 2507.17527 null
2025-07-22 SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling Yi Guo et.al. 2507.16884 null
2025-07-22 Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages Isha Pandey et.al. 2507.16875 null
2025-07-15 Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems Nima Yazdani et.al. 2507.16835 null
2025-07-21 A2TTS: TTS for Low Resource Indian Languages Ayush Singh Bhadoriya et.al. 2507.15272 null
2025-07-21 EchoVoices: Preserving Generational Voices and Memories for Seniors and Children Haiying Xu et.al. 2507.15221 null
2025-07-22 Hear Your Code Fail, Voice-Assisted Debugging for Python Sayed Mahbub Hasan Amiri et.al. 2507.15007 null
2025-07-20 DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis Yinghao Aaron Li et.al. 2507.14988 null
2025-07-20 FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing Shoutao Guo et.al. 2507.14815 null
2025-07-17 A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Kirill Borodin et.al. 2507.13563 null
2025-07-17 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech Maksim Borisov et.al. 2507.13155 null
2025-07-17 Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication Tianyu Song et.al. 2507.13052 null
2025-07-17 Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes Zhou Feng et.al. 2507.12932 null
2025-07-16 Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations Yichen Han et.al. 2507.12197 null
2025-07-16 EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis Haoxun Li et.al. 2507.12015 null
2025-07-15 Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection Ivan Viakhirev et.al. 2507.11777 null
2025-07-25 P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge Marvin Sach et.al. 2507.11306 null
2025-07-20 Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Mengzhe Geng et.al. 2507.10827 null
2025-07-14 An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments Mikko Korkiakoski et.al. 2507.10469 null
2025-07-14 DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis Wenjie Tian et.al. 2507.10109 null
2025-07-12 ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching Han Zhu et.al. 2507.09318 null
2025-07-12 Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning Dominika Woszczyk et.al. 2507.09310 null
2025-07-12 ClaritySpeech: Dementia Obfuscation in Speech Dominika Woszczyk et.al. 2507.09282 link
2025-07-19 Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition Bingshen Mu et.al. 2507.09116 null
2025-07-11 SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment Shivam Mehta et.al. 2507.09070 null
2025-07-11 Exploiting Leaderboards for Large-Scale Distribution of Malicious Models Anshuman Suri et.al. 2507.08983 null
2025-07-06 A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting Niranjan Mallikarjun Sindhur et.al. 2507.08832 null
2025-07-11 Unlocking Speech Instruction Data Potential with Query Rewriting Yonghua Hei et.al. 2507.08603 null
2025-07-11 MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling Jingjing Tang et.al. 2507.08530 null
2025-07-11 Active Learning for Text-to-Speech Synthesis with Informative Sample Collection Kentaro Seki et.al. 2507.08319 null
2025-07-05 RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning Atli Sigurgeirsson et.al. 2507.08012 null
2025-07-10 SecureSpeech: Prompt-based Speaker and Content Protection Belinda Soh Hui Hui et.al. 2507.07799 null
2025-07-09 STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation Wenxiang Guo et.al. 2507.06670 null
2025-07-09 Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents Zackary Rackauckas et.al. 2507.06483 null
2025-07-08 Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis Xintong Hu et.al. 2507.06116 null
2025-07-08 Differentiable Reward Optimization for LLM based TTS system Changfeng Gao et.al. 2507.05911 null
2025-07-08 OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model Chen Wang et.al. 2507.05177 null
2025-07-07 LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning Sandipan Dhar et.al. 2507.04966 null
2025-07-07 Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis Sho Inoue et.al. 2507.04598 null
2025-07-06 TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet Jaeseok Jeong et.al. 2507.04349 null
2025-07-05 PresentAgent: Multimodal Agent for Presentation Video Generation Jingwei Shi et.al. 2507.04036 null
2025-07-05 Prosody Labeling with Phoneme-BERT and Speech Foundation Models Tomoki Koriyama et.al. 2507.03912 null
2025-07-05 Traceable TTS: Toward Watermark-Free TTS with Strong Traceability Yuxiang Zhao et.al. 2507.03887 null
2025-07-14 DeepGesture: A conversational gesture synthesis system based on emotions and semantics Thanh Hoang-Minh et.al. 2507.03147 null
2025-07-03 De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks Wei Fan et.al. 2507.02606 null
2025-07-03 Open-Source System for Multilingual Translation and Cloned Speech Synthesis Mateo Cámara et.al. 2507.02530 null
2025-07-03 JoyTTS: LLM-based Spoken Chatbot With Voice Cloning Fangru Zhou et.al. 2507.02380 null
2025-07-02 Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis Marc-André Carbonneau et.al. 2507.02176 null
2025-07-04 Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams Zirui Li et.al. 2507.02115 null
2025-07-02 A Dataset for Automatic Assessment of TTS Quality in Spanish Alejandro Sosa Welford et.al. 2507.01805 link
2025-07-02 Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora Hitoshi Suda et.al. 2507.01356 null
2025-07-08 SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech Zhuangfei Cheng et.al. 2507.01348 null
2025-07-02 Multi-interaction TTS toward professional recording reproduction Hiroki Kanagawa et.al. 2507.00808 null
2025-07-01 MuteSwap: Silent Face-based Voice Conversion Yifan Liu et.al. 2507.00498 null
2025-06-30 Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges Hashim Ali et.al. 2507.00324 null
2025-06-30 Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis Paul Mayer et.al. 2507.00227 null
2025-07-01 StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding Dake Guo et.al. 2506.23986 null
2025-06-30 Efficient Interleaved Speech Modeling through Knowledge Distillation Mohammadmahdi Nouriborji et.al. 2506.23670 null
2025-06-30 JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching Mingi Kwon et.al. 2506.23552 null
2025-06-29 You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties Paige Tuttösí et.al. 2506.23367 null
2025-06-27 DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding Yang Yang et.al. 2506.22362 null
2025-06-27 Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration Noora Sassali et.al. 2506.22116 null
2025-06-27 Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy Bohan Li et.al. 2506.22023 null
2025-06-23 IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech Siyi Zhou et.al. 2506.21619 null
2025-06-26 SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture Kehan Sui et.al. 2506.21478 null
2025-06-26 A Multi-Stage Framework for Multimodal Controllable Speech Synthesis Rui Niu et.al. 2506.20945 null
2025-06-25 An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS Marie Kunešová et.al. 2506.20190 null
2025-06-24 TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems Christoph Minixhofer et.al. 2506.19441 null
2025-06-23 Selecting N-lowest scores for training MOS prediction models Yuto Kondo et.al. 2506.18326 null
2025-06-23 Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting Yuto Kondo et.al. 2506.18307 null
2025-06-23 JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles Yuto Kondo et.al. 2506.18296 null
2025-06-21 OpusLM: A Family of Open Unified Speech Language Models Jinchuan Tian et.al. 2506.17611 null
2025-06-20 RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching Hyun Joon Park et.al. 2506.16741 null
2025-06-20 LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Daejin Jo et.al. 2506.16738 null
2025-06-20 V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos Qixin Wang et.al. 2506.16716 null
2025-06-19 Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement Tuan-Nam Nguyen et.al. 2506.16580 null
2025-06-19 InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems Kexin Huang et.al. 2506.16381 link
2025-06-19 Optimizing Multilingual Text-To-Speech with Accents & Emotions Pranav Pawar et.al. 2506.16310 null
2025-06-19 Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching Shoutrik Das et.al. 2506.16127 null
2025-06-19 VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge Zijing Zhao et.al. 2506.16020 null
2025-06-18 TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data Kentaro Seki et.al. 2506.15614 null
2025-06-18 PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction Shufan Li et.al. 2506.15556 null
2025-06-18 Factorized RVQ-GAN For Disentangled Speech Tokenization Sameer Khurana et.al. 2506.15456 null
2025-06-18 EmojiVoice: Towards long-term controllable expressivity in robot speech Paige Tuttösí et.al. 2506.15085 null
2025-06-18 An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW Prateek Mehta et.al. 2506.15029 null
2025-06-25 SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling Tawsif Ahmed et.al. 2506.14293 null
2025-06-17 Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification Yiyang Zhao et.al. 2506.14226 null
2025-06-17 Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models Tuan Dat Phuong et.al. 2506.14153 link
2025-06-16 EmoNews: A Spoken Dialogue System for Expressive News Conversations Ryuki Matsuura et.al. 2506.13894 link
2025-06-16 From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars Pegah Salehi et.al. 2506.13477 null
2025-06-20 ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching Han Zhu et.al. 2506.13053 link
2025-06-14 StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling Hui Wang et.al. 2506.12570 null
2025-06-14 Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction Xiaoran Fan et.al. 2506.12537 null
2025-06-14 Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech Yakov Kolani et.al. 2506.12311 null
2025-06-11 S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder Yu Pan et.al. 2506.11160 null
2025-06-16 A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Cheng-Kang Chou et.al. 2506.11130 null
2025-06-10 GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions Wenkang Han et.al. 2506.11127 null
2025-06-10 ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams Freddie Grabovski et.al. 2506.11125 null
2025-06-05 Intelligibility of Text-to-Speech Systems for Mathematical Expressions Sujoy Roychowdhury et.al. 2506.11086 null
2025-06-12 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Hayato Futami et.al. 2506.10299 null
2025-06-06 A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations Tian Lan et.al. 2506.10019 null
2025-06-11 UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching Neta Glazer et.al. 2506.09874 null
2025-06-15 EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection Christoph Schuhmann et.al. 2506.09827 null
2025-06-11 OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment Chao-Hong Tan et.al. 2506.09349 link
2025-06-11 Ming-Omni: A Unified Multimodal Model for Perception and Generation Inclusion AI et.al. 2506.09344 link
2025-06-13 Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Ailin Huang et.al. 2506.08967 null
2025-06-10 A Review on Score-based Generative Models for Audio Applications Ge Zhu et.al. 2506.08457 null
2025-06-09 Seeing Voices: Generating A-Roll Video from Audio with Mirage Aditi Sundararaman et.al. 2506.08279 null
2025-06-09 Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Rui Hu et.al. 2506.07646 null
2025-06-10 Towards Generalized Source Tracing for Codec-Based Deepfake Speech Xuanjun Chen et.al. 2506.07294 null
2025-06-07 SynHate: Detecting Hate Speech in Synthetic Deepfake Audio Rishabh Ranjan et.al. 2506.06772 null
2025-06-06 Audio-Aware Large Language Models as Judges for Speaking Styles Cheng-Han Chiang et.al. 2506.05984 null
2025-06-09 Voice Impression Control in Zero-Shot TTS Keinichi Fujita et.al. 2506.05688 null
2025-06-05 Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning Hien Ohnaka et.al. 2506.04527 null
2025-06-04 Can we reconstruct a dysarthric voice with the large speech model Parler TTS? Ariadna Sanchez et.al. 2506.04397 null
2025-06-04 HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset Ryan Langman et.al. 2506.04152 null
2025-06-04 UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation Jinting Wang et.al. 2506.04134 null
2025-06-04 A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions Chung-Chun Wang et.al. 2506.04077 null
2025-06-04 Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages Utkarsh Pathak et.al. 2506.03884 null
2025-06-04 Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts Sidharth Pulipaka et.al. 2506.03793 null
2025-06-04 Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments Reo Yoneyama et.al. 2506.03554 null
2025-06-04 BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing Masaya Kawamura et.al. 2506.03515 null
2025-06-03 Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation Yongqi Wang et.al. 2506.02997 null
2025-06-03 Towards a Japanese Full-duplex Spoken Dialogue System Atsumoto Ohashi et.al. 2506.02979 null
2025-06-03 PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing You Zhang et.al. 2506.02958 null
2025-06-03 CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech Helin Wang et.al. 2506.02863 link
2025-06-03 Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions Xiaoxue Gao et.al. 2506.02742 null
2025-06-03 StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion Fengjin Li et.al. 2506.02414 null
2025-06-03 SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning Zhengyuan Liu et.al. 2506.02412 null
2025-06-03 Trusted Fake Audio Detection Based on Dirichlet Distribution Chi Ding et.al. 2506.02401 null
2025-06-02 Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi Arnav Rustagi et.al. 2506.02166 null
2025-06-02 SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction Saurabh Agrawal et.al. 2506.02082 null
2025-06-02 Universal Preference-Score-based Pairwise Speech Quality Assessment Yu-Fei Shi et.al. 2506.01455 null
2025-06-02 Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages Andrei Popescu-Belis et.al. 2506.01406 null
2025-06-02 Zero-Shot Text-to-Speech for Vietnamese Thi Vu et.al. 2506.01322 null
2025-06-02 CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction Yudong Lu et.al. 2506.01268 null
2025-06-02 WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Yu Nakagome et.al. 2506.01263 null
2025-06-01 Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations Girish et.al. 2506.01157 null
2025-06-01 DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation Ming Meng et.al. 2506.01020 null
2025-06-01 Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching Jialong Zuo et.al. 2506.01014 null
2025-06-01 CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching Leying Zhang et.al. 2506.00885 null
2025-06-01 Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models Kyowoon Lee et.al. 2506.00832 null
2025-05-30 ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation Jiatong Shi et.al. 2505.24518 null
2025-05-30 Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation Wenrui Liu et.al. 2505.24496 null
2025-05-30 DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec Peijie Chen et.al. 2505.24314 null
2025-05-29 Can Emotion Fool Anti-spoofing? Aurosweta Mahapatra et.al. 2505.23962 null
2025-05-29 Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes Neta Glazer et.al. 2505.23619 link
2025-05-29 EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge Ruskin Raj Manku et.al. 2505.23009 link
2025-05-29 LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting Pai Zhu et.al. 2505.22995 null
2025-05-28 BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models Susan Liang et.al. 2505.22865 null
2025-05-28 Tell me Habibi, is it Real or Fake? Kartik Kuckreja et.al. 2505.22581 null
2025-05-28 A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity Charlotte Pouw et.al. 2505.22236 null
2025-05-27 Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim et.al. 2505.20868 null
2025-05-26 ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis Hawau Olamide Toyin et.al. 2505.20506 null
2025-05-26 Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling Qixi Zheng et.al. 2505.19931 null
2025-05-26 DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech Deok-Hyeon Cho et.al. 2505.19687 link
2025-05-26 KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Zhaolin Li et.al. 2505.19679 null
2025-06-02 Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling Haiyang Sun et.al. 2505.19669 null
2025-05-30 Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment Jeongsoo Choi et.al. 2505.19595 link
2025-05-26 GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor Seokgi Lee et.al. 2505.19384 null
2025-05-25 SpeakStream: Streaming Text-to-Speech with Interleaved Data Richard He Bai et.al. 2505.19206 null
2025-05-25 CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning Renyuan Li et.al. 2505.19119 null
2025-05-25 Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis Minsu Kim et.al. 2505.18972 null
2025-05-27 RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations Ashwin Sankar et.al. 2505.18609 null
2025-05-24 MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt Zhichao Wu et.al. 2505.18453 null
2025-05-27 CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training Zhihao Du et.al. 2505.17589 null
2025-05-23 What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection Binh Nguyen et.al. 2505.17513 null
2025-05-23 UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information Rui Wang et.al. 2505.17426 link
2025-05-23 Speechless: Speech Instruction Training Without Speech for Low Resource Languages Alan Dao et.al. 2505.17417 link
2025-05-22 Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 Zackary Rackauckas et.al. 2505.17320 null
2025-05-21 Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech Yejin Lee et.al. 2505.17093 null
2025-05-20 Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Haoyang Zhang et.al. 2505.17076 null
2025-05-22 From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Tianduo Wang et.al. 2505.16972 link
2025-05-22 MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing Junjie Zheng et.al. 2505.16279 null
2025-05-21 MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling Yifan Cheng et.al. 2505.15772 null
2025-05-21 Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information Nicholas Sanders et.al. 2505.15667 null
2025-05-21 Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models Zirui Song et.al. 2505.15406 link
2025-05-21 Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning Junchuan Zhao et.al. 2505.15402 null
2025-05-21 Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding Zijian Lin et.al. 2505.15380 null
2025-05-20 TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis Yu Zhang et.al. 2505.14910 link
2025-05-20 Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits Tiantian Feng et.al. 2505.14648 link
2025-05-20 Pairwise Evaluation of Accent Similarity in Speech Synthesis Jinzuomu Zhong et.al. 2505.14410 null
2025-05-20 FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation Yutong Liu et.al. 2505.14351 null
2025-05-21 AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models Guangke Chen et.al. 2505.14103 null
2025-05-20 SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement Kuan-Yu Chen et.al. 2505.14066 null
2025-05-23 U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding Ziqian Wang et.al. 2505.13880 link
2025-05-22 Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising Ye-Xin Lu et.al. 2505.13830 null
2025-05-20 Articulatory Feature Prediction from Surface EMG during Speech Production Jihwan Lee et.al. 2505.13814 null
2025-05-19 Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space Zhengrui Ma et.al. 2505.13181 link
2025-05-19 DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation Jiaqi Li et.al. 2505.13000 link
2025-05-19 Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy Xuanjun Chen et.al. 2505.12994 link
2025-05-19 OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching Hieu-Nghia Huynh-Nguyen et.al. 2505.12800 null
2025-05-19 RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations Seungmin Kim et.al. 2505.12686 null
2025-05-19 Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis Yifan Hu et.al. 2505.12597 link
2025-05-18 Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis Dong Yang et.al. 2505.12226 null
2025-05-16 LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models Danilo de Oliveira et.al. 2505.11391 null
2025-05-16 Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese Xihuai Wang et.al. 2505.11200 null
2025-05-16 BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset Istiaq Ahmed Fahad et.al. 2505.10885 link
2025-05-15 UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech Jiaxuan Liu et.al. 2505.10599 null
2025-05-14 SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset Yicheng Gu et.al. 2505.09325 null
2025-05-14 DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis Zeeshan Ahmad et.al. 2505.09091 null
2025-05-13 Investigating self-supervised features for expressive, multilingual voice conversion Álvaro Martín-Cortinas et.al. 2505.08278 null
2025-05-12 MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Bowen Zhang et.al. 2505.07916 null
2025-05-12 Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications Biel Tura Vecino et.al. 2505.07701 null
2025-05-10 VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback Eason Chen et.al. 2505.06676 null
2025-05-10 Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation Abbas Bertina et.al. 2505.06599 null
2025-05-15 FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech Linhan Ma et.al. 2505.05159 null
2025-05-08 Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Linrong Pan et.al. 2505.05056 null
2025-05-08 A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration Shaja Arul Selvamani et.al. 2505.04885 null
2025-05-07 Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment Xueyao Zhang et.al. 2505.04113 null
2025-05-06 VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Zuwei Long et.al. 2505.03739 link
2025-05-13 SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation Yu-Ren Guo et.al. 2505.03244 null
2025-05-05 Generating Narrated Lecture Videos from Slides with Synchronized Highlights Alexander Holmberg et.al. 2505.02966 null
2025-05-05 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Yemin Shi et.al. 2505.02707 link
2025-05-05 LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Qingkai Fang et.al. 2505.02625 link
2025-04-30 Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks Chaoyi Wang et.al. 2505.01450 null
2025-04-30 Sadeed: Advancing Arabic Diacritization Through Small Language Model Zeina Aldallal et.al. 2504.21635 null
2025-04-29 AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation Jeongsoo Choi et.al. 2504.20629 null
2025-04-29 ClonEval: An Open Voice Cloning Benchmark Iwona Christop et.al. 2504.20581 link
2025-05-02 Towards Flow-Matching-based TTS without Classifier-Free Guidance Yuzhe Liang et.al. 2504.20334 null
2025-04-27 Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements Sandipan Dhar et.al. 2504.19197 null
2025-04-27 Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget Xin Li et.al. 2504.19146 link
2025-04-22 FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning Ju Yeon Kang et.al. 2504.15663 null
2025-04-22 A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models Gengxian Cao et.al. 2504.15552 null
2025-04-21 SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation Yue Li et.al. 2504.15035 null
2025-04-20 DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue Xiang Li et.al. 2504.14482 link
2025-04-18 ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents Takuya Sera et.al. 2504.13793 null
2025-04-18 Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion Sandipan Dhar et.al. 2504.13791 null
2025-04-22 EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting Guanrou Yang et.al. 2504.12867 null
2025-04-15 GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture Yaodong Song et.al. 2504.12339 null
2025-04-15 Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation Yan Rong et.al. 2504.11002 null
2025-04-15 Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy Botao Zhao et.al. 2504.10819 null
2025-04-14 Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis Yifan Yang et.al. 2504.10352 null
2025-04-14 AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis Dan Luo et.al. 2504.10309 link
2025-04-14 SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis Zhisheng Zhang et.al. 2504.09839 link
2025-04-12 "It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services Shira Michel et.al. 2504.09346 null
2025-04-12 AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis Yubing Cao et.al. 2504.09225 null
2025-04-17 SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning Prabhat Pandey et.al. 2504.09081 null
2025-04-11 Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation Haowei Lou et.al. 2504.08274 null
2025-04-10 Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis Yizhong Geng et.al. 2504.07858 null
2025-04-10 SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow Kaidi Wang et.al. 2504.07776 null
2025-04-08 AVENet: Disentangling Features by Approximating Average Features for Voice Conversion Wenyu Wang et.al. 2504.05833 null
2025-04-07 P2Mark: Plug-and-play Parameter-intrinsic Watermarking for Neural Speech Generation Yong Ren et.al. 2504.05197 null
2025-04-07 SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation Stephen Brade et.al. 2504.05106 null
2025-04-04 RWKVTTS: Yet another TTS based on RWKV-7 Lin yueyu et.al. 2504.03289 link
2025-04-09 F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization Xiaohui Sun et.al. 2504.02407 link
2025-04-03 VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models Kim Sung-Bin et.al. 2504.02386 null
2025-03-31 SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation Ngoc Dung Huynh et.al. 2503.24164 null
2025-04-02 TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection Zhiming Ma et.al. 2503.24115 link
2025-03-31 SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development Minghan Wang et.al. 2503.23848 link
2025-03-31 DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance Junjie Zheng et.al. 2503.23660 null
2025-03-30 Speculative End-Turn Detector for Efficient Speech Chatbot Assistant Hyunjong Ok et.al. 2503.23439 null
2025-03-29 SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System Hyeongju Kim et.al. 2503.23108 null
2025-03-26 Dual Audio-Centric Modality Coupling for Talking Head Generation Ao Fu et.al. 2503.22728 null
2025-03-28 Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators Andrew Ustinov et.al. 2503.22503 link
2025-03-28 DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation Haomin Zhang et.al. 2503.22265 null
2025-03-26 Text-Driven Voice Conversion via Latent State-Space Modeling Wen Li et.al. 2503.20999 null
2025-03-28 FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System Hao-Han Guo et.al. 2503.20499 null
2025-03-26 Qwen2.5-Omni Technical Report Jin Xu et.al. 2503.20215 null
2025-03-21 Measuring the Robustness of Audio Deepfake Detectors Xiang Li et.al. 2503.17577 link
2025-03-21 Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication Yiwen Xu et.al. 2503.17479 null
2025-03-21 From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech Ji-Hoon Kim et.al. 2503.16956 null
2025-03-20 WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching Tianze Luo et.al. 2503.16689 link
2025-03-10 VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection Kunal Chavan et.al. 2503.16488 null
2025-03-19 Shushing! Let's Imagine an Authentic Speech from the Silent Video Jiaxin Ye et.al. 2503.14928 null
2025-03-19 MoonCast: High-Quality Zero-Shot Podcast Generation Zeqian Ju et.al. 2503.14345 link
2025-03-26 InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being Guang Dai et.al. 2503.14257 null
2025-03-15 Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations Xue Jiang et.al. 2503.12115 null
2025-03-14 MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation Sungwoo Cho et.al. 2503.11026 null
2025-03-14 Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models Sebastian Möller et.al. 2503.10298 null
2025-03-11 An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR Sewade Ogun et.al. 2503.08954 null
2025-03-09 ProSE: Diffusion Priors for Speech Enhancement Sonal Kumar et.al. 2503.06375 null
2025-03-07 DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility Yifan Liu et.al. 2503.05223 link
2025-03-03 Direct Speech to Speech Translation: A Review Mohammad Sarim et.al. 2503.04799 null
2025-03-06 LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Sambal Shikhar et.al. 2503.04724 null
2025-03-06 Scaling Rich Style-Prompted Text-to-Speech Datasets Anuj Diwan et.al. 2503.04713 link
2025-03-05 Good practices for evaluation of synthesized speech Erica Cooper et.al. 2503.03250 null
2025-03-04 InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training Dingdong Wang et.al. 2503.02769 null
2025-03-03 Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens Xinsheng Wang et.al. 2503.01710 link
2025-03-03 Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology Birger Moell et.al. 2503.01266 null
2025-03-02 Language-agnostic, automated assessment of listeners' speech recall using large language models Björn Herrmann et.al. 2503.01045 null
2025-03-02 UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation Alexander H. Liu et.al. 2503.00733 null
2025-03-01 PodAgent: A Comprehensive Framework for Podcast Generation Yujia Xiao et.al. 2503.00455 link
2025-03-12 Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale Max M. Lang et.al. 2502.20140 null
2025-02-27 DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models Weihao Wu et.al. 2502.19924 null
2025-03-04 Speculative Decoding and Beyond: An In-Depth Survey of Techniques Yunhai Hu et.al. 2502.19732 null
2025-02-26 Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis Ziyue Jiang et.al. 2502.18924 null
2025-03-08 Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding Tianyun Liu et.al. 2502.18889 null
2025-02-24 Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction Tianpeng Li et.al. 2502.17239 link
2025-02-24 Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM Jiatong Shi et.al. 2502.16897 null
2025-02-18 AV-Flow: Transforming Text to Audio-Visual Human-like Interactions Aggelina Chatziagapi et.al. 2502.13133 null
2025-02-18 High-Fidelity Music Vocoder using Neural Audio Codecs Luca A. Lanzendörfer et.al. 2502.12759 null
2025-02-18 TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching Wenxiang Guo et.al. 2502.12572 link
2025-02-18 A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond Shreya Shukla et.al. 2502.12048 null
2025-02-17 NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing Yifan Liang et.al. 2502.12002 null
2025-02-16 FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching Hui Wang et.al. 2502.11128 null
2025-02-16 SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer Zhengyan Sheng et.al. 2502.11094 null
2025-02-14 VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect Qingyuan Fei et.al. 2502.10329 null
2025-02-13 TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument Kyungsu Kim et.al. 2502.08939 link
2025-03-02 ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech Xin Wang et.al. 2502.08857 null
2025-02-11 LoRP-TTS: Low-Rank Personalized Text-To-Speech Łukasz Bondaruk et.al. 2502.07562 null
2025-02-11 Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction Leying Zhang et.al. 2502.07345 null
2025-02-11 Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement Xueyao Zhang et.al. 2502.07243 null
2025-02-10 Synthetic Audio Helps for Cognitive State Tasks Adil Soubki et.al. 2502.06922 link
2025-02-16 Recent Advances in Discrete Speech Tokens: A Review Yiwei Guo et.al. 2502.06490 null
2025-02-19 Speech to Speech Translation with Translatotron: A State of the Art Review Jules R. Kala et.al. 2502.05980 null
2025-02-09 Non-invasive electromyographic speech neuroprosthesis: a geometric perspective Harshavardhana T. Gowda et.al. 2502.05762 null
2025-02-09 BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting Mohammad Jahid Ibna Basher et.al. 2502.05729 null
2025-02-08 Gender Bias in Instruction-Guided Speech Synthesis Models Chun-Yi Kuan et.al. 2502.05649 null
2025-02-08 IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System Wei Deng et.al. 2502.05512 link
2025-02-07 Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance Shehzeen Hussain et.al. 2502.05236 null
2025-02-12 Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Zuyan Liu et.al. 2502.04328 link
2025-02-06 Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis Zhen Ye et.al. 2502.04128 link
2025-02-14 DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation Dongya Jia et.al. 2502.03930 null
2025-02-05 Metis: A Foundation Speech Generation Model with Masked Generative Pre-training Yuancheng Wang et.al. 2502.03128 link
2025-02-05 Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech Jixun Yao et.al. 2502.02950 null
2025-02-04 Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet Shenran Wang et.al. 2502.02703 link
2025-02-04 Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation Peidong Wang et.al. 2502.02683 null
2025-02-03 Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis Weiwei Lin et.al. 2502.01084 null
2025-02-02 EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis Junuk Cha et.al. 2502.00654 null
2025-01-31 VisualSpeech: Enhance Prosody with Visual Context in TTS Shumin Que et.al. 2501.19258 null
2025-01-29 BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights Chan-Jan Hsu et.al. 2501.17790 null
2025-02-09 CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs Amey Hengle et.al. 2501.17581 null
2025-01-28 Compact Neural TTS Voices for Accessibility Kunal Jain et.al. 2501.17332 null
2025-01-27 Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Haorui He et.al. 2501.15907 link
2025-01-26 Overview of the Amphion Toolkit (v0.2) Jiaqi Li et.al. 2501.15442 link
2025-01-24 Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models Tianrui Wang et.al. 2501.14273 null
2025-01-24 Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation Wen Huang et.al. 2501.14240 null
2025-01-24 LoCoML: A Framework for Real-World ML Inference Pipelines Kritin Maddireddy et.al. 2501.14165 null
2025-01-23 Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference Shuqi Dai et.al. 2501.13870 null
2025-01-23 Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement Jae-Sung Bae et.al. 2501.13372 null
2025-01-21 A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data Minh Tran et.al. 2501.12501 null
2025-01-20 A Non-autoregressive Model for Joint STT and TTS Vishal Sunder et.al. 2501.09104 null
2025-01-15 Speech Synthesis along Perceptual Voice Quality Dimensions Frederik Rautenberg et.al. 2501.08791 null
2025-01-15 Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification Li Zhang et.al. 2501.08691 null
2025-01-15 Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement Qianniu Chen et.al. 2501.08566 null
2025-01-14 CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset Jiawei Du et.al. 2501.08238 null
2025-01-13 Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech Bruno Ferenc Šegedin et.al. 2501.07726 null
2025-01-19 MathReader: Text-to-Speech for Mathematical Documents Sieun Hyeon et.al. 2501.07088 link
2025-01-11 The 1st SpeechWellness Challenge: Detecting Suicidal Risk Among Adolescents Wen Wu et.al. 2501.06474 null
2025-01-11 Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis Rui Liu et.al. 2501.06467 link
2025-01-11 Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation Zhengyan Sheng et.al. 2501.06394 null
2025-01-10 TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer Vladimir Bataev et.al. 2501.06320 null
2025-01-10 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Qian Chen et.al. 2501.06282 null
2025-01-10 PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control Shaozuo Zhang et.al. 2501.06276 null
2025-01-10 Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron Kishor Kayyar Lakshminarayana et.al. 2501.05976 null
2025-01-10 MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model Matthew Baas et.al. 2501.05787 null
2025-01-09 Probing Speaker-specific Features in Speaker Representations Aemon Yat Fei Chiu et.al. 2501.05310 null
2025-01-09 JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis Jun-Hyeok Cha et.al. 2501.04904 null
2025-01-08 Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model Sanjana Sankar et.al. 2501.04799 null
2025-01-08 FleSpeech: Flexibly Controllable Speech Generation with Various Prompts Hanzhao Li et.al. 2501.04644 null
2025-01-09 OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis Run Luo et.al. 2501.04561 link
2025-01-08 DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions Weidong Chen et.al. 2501.04256 null
2025-01-02 FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles Tian-Hao Zhang et.al. 2501.03181 null
2025-01-02 RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer Seongho Hong et.al. 2501.01182 link
2025-01-02 Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT Dongyang Dai et.al. 2501.01102 null
2025-01-06 Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study Mykola Maslych et.al. 2501.00168 null
2024-12-28 Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting Wooseok Han et.al. 2412.20155 null
2024-12-26 "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities Jiawei Yu et.al. 2412.19102 null
2024-12-26 Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID Ahmad Alfani Handoyo et.al. 2412.19043 null
2024-12-25 Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset Neil Shah et.al. 2412.18839 null
2024-12-24 GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing Wen Ku et.al. 2412.18300 null
2024-12-22 Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective Hankun Wang et.al. 2412.17048 null
2024-12-22 Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis Ye-Xin Lu et.al. 2412.16977 link
2024-12-22 Autoregressive Speech Synthesis with Next-Distribution Prediction Xinfa Zhu et.al. 2412.16846 null
2024-12-23 Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers Yifan Yang et.al. 2412.16102 null
2024-12-19 Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling Leying Zhang et.al. 2412.14890 null
2024-12-17 Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge Mahieyin Rahmun et.al. 2412.13279 link
2024-12-17 Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion Syed Zohaib Hassan et.al. 2412.12710 null
2024-12-17 Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes Kuiyuan Zhang et.al. 2412.12619 link
2024-12-17 Hierarchical Control of Emotion Rendering in Speech Synthesis Sho Inoue et.al. 2412.12498 link
2024-12-19 ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis Xiangheng He et.al. 2412.11795 null
2024-12-17 Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech Rui Liu et.al. 2412.11409 link
2024-12-16 Efficient Generative Modeling with Residual Vector Quantization-Based Tokens Jaehyeon Kim et.al. 2412.10208 null
2024-12-13 AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation Xiyuan Gao et.al. 2412.10103 null
2024-12-13 CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder Jianwei Cui et.al. 2412.08918 null
2024-12-11 Multimodal Latent Language Modeling with Next-Token Diffusion Yutao Sun et.al. 2412.08635 link
2024-12-11 A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction Sowmya Cheripally et.al. 2412.08312 null
2024-12-11 A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings Anindita Mondal et.al. 2412.08283 null
2024-12-11 LatentSpeech: Latent Diffusion for Text-To-Speech Generation Haowei Lou et.al. 2412.08117 null
2024-12-11 Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration Haowei Lou et.al. 2412.08112 null
2024-12-09 Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey Tianxin Xie et.al. 2412.06602 link
2024-12-12 EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations Weizhen Bian et.al. 2412.06581 null
2024-12-01 Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor Ashwin Baluja et.al. 2412.05315 null
2024-12-04 DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles Jiaxuan Liu et.al. 2412.03388 null
2024-12-03 GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot Aohan Zeng et.al. 2412.02612 link
2024-11-19 A Context-Based Numerical Format Prediction for a Text-To-Speech System Yaser Darwesh et.al. 2412.00028 null
2024-11-27 Continual Learning in Machine Speech Chain Using Gradient Episodic Memory Geoffrey Tyndall et.al. 2411.18320 null
2024-11-27 SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation Wenyi Yu et.al. 2411.18138 null
2024-11-26 WavChat: A Survey of Spoken Dialogue Models Shengpeng Ji et.al. 2411.13577 link
2024-12-02 I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception Jiawei Zhang et.al. 2411.13314 null
2024-11-20 Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM Jiawei Yu et.al. 2411.13159 null
2024-11-19 Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation Praveen Srinivasa Varadhan et.al. 2411.12719 null
2024-11-19 Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D Adithya TG et.al. 2411.12619 null
2024-11-18 ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram Xiao-Hang Jiang et.al. 2411.11258 null
2024-11-12 Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models Dongrui Han et.al. 2411.07563 null
2024-11-11 Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities Snehasish Paul Shivali Chauhan et.al. 2411.06970 null
2024-11-10 Debatts: Zero-Shot Debating Text-to-Speech Synthesis Yiqiao Huang et.al. 2411.06540 null
2024-11-07 CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR Kadir Burak Buldu et.al. 2411.04671 null
2024-11-04 EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector Deok-Hyeon Cho et.al. 2411.02625 link
2024-11-09 Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis Shijia Liao et.al. 2411.01156 link
2024-10-31 Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Ioannis Tsiamas et.al. 2410.24019 null
2024-10-30 Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis Théodor Lemerle et.al. 2410.23320 link
2024-10-29 Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech Eric Battenberg et.al. 2410.22179 link
2024-10-29 Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding Bohan Li et.al. 2410.21951 null
2024-10-29 RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis Kehan Sui et.al. 2410.21641 null
2024-10-28 Asynchronous Tool Usage for Real-Time Agents Antonio A. Ginart et.al. 2410.21620 null
2024-10-28 Enhancing TTS Stability in Hebrew using Discrete Semantic Units Ella Zeldes et.al. 2410.21502 null
2024-10-28 Mitigating Unauthorized Speech Synthesis for Voice Protection Zhisheng Zhang et.al. 2410.20742 link
2024-10-27 Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation Maohao Shen et.al. 2410.20336 null
2024-10-24 Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis Suparna De et.al. 2410.19199 null
2024-10-24 STTATTS: Unified Speech-To-Text And Text-To-Speech Model Hawau Olamide Toyin et.al. 2410.18607 link
2024-10-24 Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts ChaeHun Park et.al. 2410.18444 null
2024-10-23 ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams Srija Anand et.al. 2410.17901 null
2024-10-22 Continuous Speech Tokenizer in Text To Speech Yixing Li et.al. 2410.17081 link
2024-10-22 Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap Guanrou Yang et.al. 2410.16726 null
2024-10-21 Continuous Speech Synthesis using per-token Latent Diffusion Arnon Turetzky et.al. 2410.16048 null
2024-10-18 A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages Sujitha Sathiyamoorthy et.al. 2410.14197 null
2024-10-18 Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech Shuwei He et.al. 2410.14101 link
2024-10-17 Enhancing Crowdsourced Audio for Text-to-Speech Models José Giraldo et.al. 2410.13357 null
2024-10-17 DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech Jan Melechovsky et.al. 2410.13342 null
2024-10-17 DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis Yu Gu et.al. 2410.13288 null
2024-10-17 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation Sreyan Ghosh et.al. 2410.13198 null
2024-10-16 ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs Rui-Chen Zheng et.al. 2410.12359 null
2024-10-14 IsoChronoMeter: A simple and effective isochronic translation evaluation metric Nikolai Rozanov et.al. 2410.11127 null
2024-10-14 DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization Yinghao Aaron Li et.al. 2410.11097 null
2024-10-12 Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling Rui Liu et.al. 2410.09524 null
2024-10-10 Unsupervised Data Validation Methods for Efficient Model Training Yurii Paniv et.al. 2410.07880 null
2024-10-15 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching Yushen Chen et.al. 2410.06885 link
2024-10-09 Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch Teodora Răgman et.al. 2410.06787 null
2024-10-09 Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS Onkar Kishor Susladkar et.al. 2410.06608 null
2024-10-09 Can DeepFake Speech be Reliably Detected? Hongbin Liu et.al. 2410.06572 null
2024-10-07 SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech Minchan Kim et.al. 2410.04690 null
2024-10-06 HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis Yuto Nishimura et.al. 2410.04380 null
2024-10-10 SONAR: A Synthetic AI-Audio Detection Framework and Benchmark Xiang Li et.al. 2410.04324 link
2024-10-05 Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System Ze Li et.al. 2410.04017 null
2024-10-01 Recent Advances in Speech Language Models: A Survey Wenqian Cui et.al. 2410.03751 link
2024-10-04 Generative Semantic Communication for Text-to-Speech Synthesis Jiahao Zheng et.al. 2410.03459 null
2024-10-04 Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens Jinzheng Zhao et.al. 2410.03298 null
2024-10-04 Narrative Player: Reviving Data Narratives with Visuals Zekai Shao et.al. 2410.03268 null
2024-10-04 MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech Taejun Bak et.al. 2410.03192 null
2024-10-01 Augmentation through Laundering Attacks for Audio Spoof Detection Hashim Ali et.al. 2410.01108 null
2024-10-01 Zero-Shot Text-to-Speech from Continuous Text Streams Trung Dang et.al. 2410.00767 null
2024-10-01 EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control Haozhe Chen et.al. 2410.00316 link
2024-09-30 Word-wise intonation model for cross-language TTS systems Tomilov A. A. et.al. 2409.20374 null
2024-09-27 Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech Youngjae Kim et.al. 2409.18622 null
2024-09-26 Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control Ryuichi Yamamoto et.al. 2409.17452 null
2024-09-25 Exploring synthetic data for cross-speaker style transfer in style representation based TTS Lucas H. Ueda et.al. 2409.17364 null
2024-09-25 Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions Kun Zhou et.al. 2409.16681 null
2024-09-25 Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation Siyin Wang et.al. 2409.16644 link
2024-09-24 FastTalker: Jointly Generating Speech and Conversational Gestures from Text Zixin Guo et.al. 2409.16404 null
2024-09-24 Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling Ville Heilala et.al. 2409.16376 null
2024-09-24 Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech Yunji Chu et.al. 2409.16203 null
2024-09-24 NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers Nohil Park et.al. 2409.15760 null
2024-09-24 VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance Jiheum Yeom et.al. 2409.15759 null
2024-09-24 StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis Zhiyong Chen et.al. 2409.15741 null
2024-09-23 A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection Lam Pham et.al. 2409.15180 null
2024-09-23 LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation Hieu-Thi Luong et.al. 2409.14743 link
2024-09-20 Zero-shot Cross-lingual Voice Transfer for TTS Fadi Biadsy et.al. 2409.13910 null
2024-09-20 On the Feasibility of Fully AI-automated Vishing Attacks João Figueiredo et.al. 2409.13793 null
2024-09-19 Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Sebastião Quintas et.al. 2409.12745 null
2024-09-19 Preference Alignment Improves Language Model-Based TTS Jinchuan Tian et.al. 2409.12403 null
2024-09-18 Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference Edresson Casanova et.al. 2409.12117 null
2024-09-18 Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems Anusha Prakash et.al. 2409.11915 null
2024-09-18 DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech Xin Qi et.al. 2409.11835 null
2024-09-18 Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation Haohan Guo et.al. 2409.11630 null
2024-09-17 SpMis: An Investigation of Synthetic Spoken Misinformation Detection Peizhuo Liu et.al. 2409.11308 null
2024-09-19 The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Samee Arif et.al. 2409.11261 link
2024-09-17 Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora Francesco Nespoli et.al. 2409.11107 null
2024-09-16 Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization Xiaoxue Gao et.al. 2409.10157 null
2024-09-16 StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion Yinghao Aaron Li et.al. 2409.10058 null
2024-09-15 Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning Siqi Sun et.al. 2409.09891 null
2024-09-14 E1 TTS: Simple and Fast Non-Autoregressive TTS Zhijun Liu et.al. 2409.09351 null
2024-09-14 Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation Changjin Han et.al. 2409.09311 link
2024-09-14 SafeEar: Content Privacy-Preserving Audio Deepfake Detection Xinfeng Li et.al. 2409.09272 link
2024-09-13 AccentBox: Towards High-Fidelity Zero-Shot Accent Generation Jinzuomu Zhong et.al. 2409.09098 null
2024-09-17 HLTCOE JHU Submission to the Voice Privacy Challenge 2024 Henry Li Xinyuan et.al. 2409.08913 null
2024-09-13 Text-To-Speech Synthesis In The Wild Jee-weon Jung et.al. 2409.08711 null
2024-09-14 Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions Amila Indika et.al. 2409.07945 null
2024-09-12 Full-text Error Correction for Chinese Speech Recognition with Large Language Model Zhiyuan Tang et.al. 2409.07790 null
2024-09-11 SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis Helin Wang et.al. 2409.07556 link
2024-09-11 D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack Hong-Hanh Nguyen-Le et.al. 2409.07390 null
2024-09-11 Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT Kazuki Yamauchi et.al. 2409.07265 null
2024-09-11 Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment Tien-Hong Lo et.al. 2409.07151 null
2024-09-10 Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models Xin Jing et.al. 2409.06451 null
2024-09-10 What happens to diffusion model likelihood when your model is conditional? Mattias Cross et.al. 2409.06364 null
2024-09-10 VoiceWukong: Benchmarking Deepfake Voice Detection Ziwei Yan et.al. 2409.06348 null
2024-09-09 AS-Speech: Adaptive Style For Speech Synthesis Zhipeng Li et.al. 2409.05730 null
2024-09-09 IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS Ashwin Sankar et.al. 2409.05356 link
2024-09-10 Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion Zhengyang Chen et.al. 2409.05004 null
2024-09-01 Sample-Efficient Diffusion for Text-To-Speech Synthesis Justin Lovelace et.al. 2409.03717 link
2024-09-10 LAST: Language Model Aware Speech Tokenization Arnon Turetzky et.al. 2409.03701 null
2024-09-05 FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications Hao-Han Guo et.al. 2409.03283 null
2024-09-04 Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems Jeongmin Liu et.al. 2409.02517 null
2024-09-03 VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka Li-Wei Chen et.al. 2409.01548 null
2024-09-02 A multilingual training strategy for low resource Text to Speech Asma Amalas et.al. 2409.01217 null
2024-09-02 A Framework for Synthetic Audio Conversations Generation using Large Language Models Kaung Myat Kyaw et.al. 2409.00946 null
2024-09-02 SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis Haohan Guo et.al. 2409.00933 link
2024-09-01 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer Yuancheng Wang et.al. 2409.00750 link
2024-08-30 SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection Ismail Rasim Ulgen et.al. 2408.17432 null
2024-08-30 AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge Kirill Borodin et.al. 2408.17352 null
2024-08-30 Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model Zhen Ye et.al. 2408.17175 link
2024-08-30 Utilizing Speaker Profiles for Impersonation Audio Detection Hao Gu et.al. 2408.17009 null
2024-08-29 Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis Zehai Tu et.al. 2408.16373 null
2024-08-28 Multi-modal Adversarial Training for Zero-Shot Voice Cloning John Janiczek et.al. 2408.15916 null
2024-08-29 Easy, Interpretable, Effective: openSMILE for voice deepfake detection Octavian Pascu et.al. 2408.15775 null
2024-08-28 VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling Yixuan Zhou et.al. 2408.15676 link
2024-08-28 VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech Heeseung Kim et.al. 2408.14739 null
2024-08-27 StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech Haowei Lou et.al. 2408.14713 link
2024-08-27 DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance Jinhyeok Yang et.al. 2408.14423 null
2024-08-26 Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard Wonjune Kang et.al. 2408.13970 null
2024-08-28 SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models Dongchao Yang et.al. 2408.13893 null
2024-08-22 Positional Description for Numerical Normalization Deepanshu Gupta et.al. 2408.12430 null
2024-08-22 VoiceX: A Text-To-Speech Framework for Custom Voices Silvan Mertes et.al. 2408.12170 null
2024-08-13 Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation Yinghao Aaron Li et.al. 2408.11849 null
2024-08-20 EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech Xin Qi et.al. 2408.10852 null
2024-08-20 SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS Karl El Hajal et.al. 2408.10771 null
2024-08-20 Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting Hyun Jin Park et.al. 2408.10463 null
2024-08-17 Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition Samuele Cornell et.al. 2408.09215 link
2024-08-14 PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation Sang-Hoon Lee et.al. 2408.07547 link
2024-08-13 SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis Osamu Take et.al. 2408.06858 link
2024-08-13 PRESENT: Zero-Shot Text-to-Prosody Control Perry Lam et.al. 2408.06827 link
2024-08-12 FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks Min Ma et.al. 2408.06227 null
2024-08-11 VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing Chunyu Qiang et.al. 2408.05758 null
2024-08-06 Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training Hawraz A. Ahmad et.al. 2408.03887 null
2024-08-03 ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features Peng Cheng et.al. 2408.01808 link
2024-08-01 Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation Xinhan Di et.al. 2408.00284 null
2024-07-18 Handling Numeric Expressions in Automatic Speech Recognition Christian Huber et.al. 2408.00004 null
2024-07-31 On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Nick Rossenbach et.al. 2407.21476 null
2024-07-29 Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks Mahmoud Salhab et.al. 2407.18571 null
2024-07-25 On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Nick Rossenbach et.al. 2407.17997 null
2024-07-24 Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model Jan Lehečka et.al. 2407.17167 null
2024-07-23 Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments Pai Zhu et.al. 2407.16840 null
2024-07-19 Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 Chun Xu et.al. 2407.14212 null
2024-07-18 Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models Weiqin Li et.al. 2407.13509 null
2024-07-22 TTSDS -- Text-to-Speech Distribution Score Christoph Minixhofer et.al. 2407.12707 link
2024-07-17 Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech Haibin Wu et.al. 2407.12229 link
2024-07-16 A Language Modeling Approach to Diacritic-Free Hebrew TTS Amit Roth et.al. 2407.12206 null
2024-07-17 Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding Chuanhao Sun et.al. 2407.09370 link
2024-07-11 Autoregressive Speech Synthesis without Vector Quantization Lingwei Meng et.al. 2407.08551 link
2024-07-10 Source Tracing of Audio Deepfake Systems Nicholas Klein et.al. 2407.08016 null
2024-07-07 ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation Ruibo Fu et.al. 2407.05421 null
2024-07-09 CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens Zhihao Du et.al. 2407.05407 null
2024-07-04 Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis Cong-Thanh Do et.al. 2407.04047 null
2024-07-04 Optimizing a-DCF for Spoofing-Robust Speaker Verification Oğuzhan Kurnaz et.al. 2407.04034 null
2024-07-04 On the Effectiveness of Acoustic BPE in Decoder-Only TTS Bohan Li et.al. 2407.03892 null
2024-07-14 CATT: Character-based Arabic Tashkeel Transformer Faris Alasmary et.al. 2407.03236 link
2024-07-02 Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization Yuchen Hu et.al. 2407.02243 null
2024-07-02 TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations Xiaoxue Gao et.al. 2407.01927 null
2024-07-01 Lightweight Zero-shot Text-to-Speech with Mixture of Adapters Kenichi Fujita et.al. 2407.01291 null
2024-06-30 NAIST Simultaneous Speech Translation System for IWSLT 2024 Yuka Ko et.al. 2407.00826 null
2024-06-30 An Attribute Interpolation Method in Speech Synthesis by Model Merging Masato Murata et.al. 2407.00766 null
2024-06-30 FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis Yinlin Guo et.al. 2407.00753 null
2024-07-02 Open-Source Conversational AI with SpeechBrain 1.0 Mirco Ravanelli et.al. 2407.00463 null
2024-06-27 Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models Borodin Kirill Nikolayevich et.al. 2406.19243 null
2024-06-27 DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability Hyun Joon Park et.al. 2406.19135 link
2024-06-26 Automatic Speech Recognition for Hindi Anish Saha et.al. 2406.18135 null
2024-06-26 A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons Tzu-Yun Hung et.al. 2406.18089 null
2024-06-29 LLM-Driven Multimodal Opinion Expression Identification Bonian Jia et.al. 2406.18088 null
2024-06-26 E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS Sefik Emre Eskimez et.al. 2406.18009 link
2024-06-25 Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment Paarth Neekhara et.al. 2406.17957 null
2024-06-22 A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge Xiaopeng Wang et.al. 2406.17801 null
2024-06-25 High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model Joun Yeop Lee et.al. 2406.17310 null
2024-06-25 Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation Yingting Li et.al. 2406.17257 null
2024-06-24 Exploring the Capability of Mamba in Speech Applications Koichi Miyazaki et.al. 2406.16808 null
2024-06-25 Towards Zero-Shot Text-To-Speech for Arabic Dialects Khai Duy Doan et.al. 2406.16751 null
2024-06-22 TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers Yakun Song et.al. 2406.15752 link
2024-06-21 InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions Yu Nakagome et.al. 2406.14890 null
2024-06-21 GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech Wenbin Wang et.al. 2406.14875 null
2024-06-21 DASB - Discrete Audio and Speech Benchmark Pooneh Mousavi et.al. 2406.14294 null
2024-06-18 Instruction Data Generation and Unsupervised Adaptation for Speech Language Models Vahid Noroozi et.al. 2406.12946 null
2024-06-17 DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer Keon Lee et.al. 2406.11427 null
2024-06-16 NAST: Noise Aware Speech Tokenization for Speech Language Models Shoval Messica et.al. 2406.11037 link
2024-06-16 Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis Xuehao Zhou et.al. 2406.10844 null
2024-06-14 Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice Shubham Gupta et.al. 2406.10422 null
2024-06-14 UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner Dongchao Yang et.al. 2406.10056 link
2024-06-14 MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model Jiatong Shi et.al. 2406.09869 null
2024-06-13 DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage Kyra Wang et.al. 2406.08820 null
2024-06-13 Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems Zhengyang Chen et.al. 2406.08812 null
2024-06-13 DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing Neha Sahipjohn et.al. 2406.08802 null
2024-06-12 Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis Wing-Zin Leung et.al. 2406.08568 link
2024-06-12 Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data Yuma Shirahata et.al. 2406.08111 null
2024-06-12 VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech Ashishkumar Gudmalwar et.al. 2406.08076 null
2024-06-12 LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning Masaya Kawamura et.al. 2406.07969 link
2024-06-12 VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment Bing Han et.al. 2406.07855 null
2024-06-12 EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech Deok-Hyeon Cho et.al. 2406.07803 link
2024-06-11 The Interspeech 2024 Challenge on Speech Processing Using Discrete Units Xuankai Chang et.al. 2406.07725 null
2024-06-11 Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? Qingkai Fang et.al. 2406.07289 null
2024-06-11 AudioMarkBench: Benchmarking Robustness of Audio Watermarking Hongbin Liu et.al. 2406.06979 link
2024-06-11 Controlling Emotion in Text-to-Speech with Natural Language Prompts Thomas Bott et.al. 2406.06406 link
2024-06-10 Meta Learning Text-to-Speech Synthesis in over 7000 Languages Florian Lux et.al. 2406.06403 link
2024-06-10 MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance Semin Kim et.al. 2406.05965 null
2024-06-11 WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark Linhan Ma et.al. 2406.05763 link
2024-06-09 An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS Xiaofei Wang et.al. 2406.05699 null
2024-06-11 Text-aware and Context-aware Expressive Audiobook Speech Synthesis Dake Guo et.al. 2406.05672 link
2024-06-08 Autoregressive Diffusion Transformer for Text-to-Speech Synthesis Zhijun Liu et.al. 2406.05551 null
2024-06-08 VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers Sanyuan Chen et.al. 2406.05370 null
2024-06-07 Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis Ryan Langman et.al. 2406.05298 null
2024-06-07 XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model Edresson Casanova et.al. 2406.04904 link
2024-06-07 TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking Junzuo Zhou et.al. 2406.04840 link
2024-06-07 Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study Chong Zhang et.al. 2406.04633 null
2024-06-06 Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis Théodor Lemerle et.al. 2406.04467 link
2024-06-06 Total-Duration-Aware Duration Modeling for Text-to-Speech Systems Sefik Emre Eskimez et.al. 2406.04281 null
2024-06-06 Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining Jinlong Xue et.al. 2406.03714 null
2024-06-06 Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model Jinlong Xue et.al. 2406.03706 null
2024-06-05 Style Mixture of Experts for Expressive Text-To-Speech Synthesis Ahad Jawaid et.al. 2406.03637 null
2024-06-07 Harder or Different? Understanding Generalization of Audio Deepfake Detection Nicolas M. Müller et.al. 2406.03512 null
2024-06-05 LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes Trung Dang et.al. 2406.02897 null
2024-06-04 Seed-TTS: A Family of High-Quality Versatile Speech Generation Models Philip Anastassiou et.al. 2406.02430 link
2024-06-05 SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models Dongchao Yang et.al. 2406.02328 null
2024-06-04 BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation Hui-Peng Du et.al. 2406.02162 null
2024-06-04 Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis Kun Zhou et.al. 2406.02009 null
2024-06-03 ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec Shengpeng Ji et.al. 2406.01205 link
2024-06-03 Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training Jan Melechovsky et.al. 2406.01018 null
2024-06-02 Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback Chen Chen et.al. 2406.00654 null
2024-05-31 Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Vicky Zayats et.al. 2405.18669 null
2024-05-28 TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation Chenyang Le et.al. 2405.17809 link
2024-05-27 RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis Haoxiang Shi et.al. 2405.17028 null
2024-05-24 Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition Zijin Gu et.al. 2405.15216 null
2024-05-23 Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models Jingyi Chen et.al. 2405.14632 null
2024-05-22 A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Yue Li et.al. 2405.13477 null
2024-05-20 Multi-speaker Text-to-speech Training with Speaker Anonymized Data Wen-Chin Huang et.al. 2405.11767 null
2024-05-19 VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications Mikhail Konenkov et.al. 2405.11537 null
2024-05-18 Exploring speech style spaces with language models: Emotional TTS without emotion labels Shreeram Suresh Chandra et.al. 2405.11413 null
2024-05-16 Faces that Speak: Jointly Synthesising Talking Face and Speech from Text Youngjoon Jang et.al. 2405.10272 null
2024-05-16 Building a Luganda Text-to-Speech Model From Crowdsourced Data Sulaiman Kagumire et.al. 2405.10211 null
2024-05-16 Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model Siyang Wang et.al. 2405.09768 null
2024-05-15 Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer Weifei Jin et.al. 2405.09470 null
2024-05-15 Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis Sho Inoue et.al. 2405.09171 null
2024-05-14 PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset Yang Hou et.al. 2405.08838 link
2024-04-30 Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech Hankun Wang et.al. 2404.19723 null
2024-04-29 MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis Xiang Li et.al. 2404.18398 link
2024-04-28 USAT: A Universal Speaker-Adaptive Text-to-Speech Approach Wenbin Wang et.al. 2404.18094 link
2024-04-27 TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality Tiantian Feng et.al. 2404.17983 null
2024-04-26 An RFP dataset for Real, Fake, and Partially fake audio detection Abdulazeez AlAli et.al. 2404.17721 null
2024-04-23 StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations Sen Liu et.al. 2404.14946 link
2024-04-23 Retrieval-Augmented Audio Deepfake Detection Zuheng Kang et.al. 2404.13892 link
2024-04-14 Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling Quanxiu Wang et.al. 2404.09192 null
2024-04-11 Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network Mayura Manawadu et.al. 2404.07807 null
2024-04-18 Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness Xincan Feng et.al. 2404.06714 link
2024-04-10 CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations Leying Zhang et.al. 2404.06690 link
2024-04-10 The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge Yiwei Guo et.al. 2404.06079 null
2024-04-07 Cross-Domain Audio Deepfake Detection: Dataset and Analysis Yuang Li et.al. 2404.04904 null
2024-04-06 HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks Yingting Li et.al. 2404.04645 link
2024-04-18 Open vocabulary keyword spotting through transfer learning from speech synthesis Kesavaraj V et.al. 2404.03914 null
2024-04-06 RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis Detai Xin et.al. 2404.03204 null
2024-04-03 CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech Jaehyeon Kim et.al. 2404.02781 null
2024-04-13 PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders Yu Pan et.al. 2404.02702 null
2024-03-31 Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation Rohan Chaudhury et.al. 2404.01339 link
2024-03-28 A Review of Multi-Modal Large Language and Vision Models Kilian Carolan et.al. 2404.01322 null
2024-04-09 KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis Adal Abilbekov et.al. 2404.01033 link
2024-03-31 CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models Xiang Li et.al. 2404.00569 link
2024-03-25 VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild Puyuan Peng et.al. 2403.16973 link
2024-03-20 Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning Shivam Ratnakant Mhaskar et.al. 2403.15469 null
2024-03-20 UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge Wataru Nakata et.al. 2403.13720 null
2024-03-20 Building speech corpus with diverse voice characteristics for its prompt-based representation Aya Watanabe et.al. 2403.13353 null
2024-03-17 Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations Claudio Pinhanez et.al. 2403.11209 null
2024-03-17 EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech Ziqi Liang et.al. 2403.08164 null
2024-03-09 HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling Chunhui Wang et.al. 2403.05989 null
2024-03-05 AttentionStitch: How Attention Solves the Speech Editing Problem Antonios Alexos et.al. 2403.04804 null
2024-03-07 Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation Sai Akarsh et.al. 2403.04178 null
2024-03-27 NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models Zeqian Ju et.al. 2403.03100 null
2024-03-04 Brilla AI: AI Contestant for the National Science and Maths Quiz George Boateng et.al. 2403.01699 link
2024-03-02 Towards Accurate Lip-to-Speech Synthesis in-the-Wild Sindhu Hegde et.al. 2403.01087 link
2024-02-29 Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data Takaaki Saeki et.al. 2402.18932 null
2024-02-26 An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation Ahmet Gunduz et.al. 2402.16380 link
2024-02-22 Efficient data selection employing Semantic Similarity-based Graph Structures for model training Roxana Petcu et.al. 2402.14888 null
2024-02-22 Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition Rendi Chevi et.al. 2402.14523 link
2024-02-19 On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models Miri Varshavsky-Hassid et.al. 2402.12423 null
2024-02-19 Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting Haolin Chen et.al. 2402.12220 link
2024-02-18 Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru Zining Wang et.al. 2402.11571 null
2024-02-14 MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech Shengpeng Ji et.al. 2402.09378 null
2024-02-15 BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Mateusz Łajszczak et.al. 2402.08093 null
2024-03-04 Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like Naoyuki Kanda et.al. 2402.07383 null
2024-02-09 A New Approach to Voice Authenticity Nicolas M. Müller et.al. 2402.06304 null
2024-02-08 Unified Speech-Text Pretraining for Spoken Dialog Modeling Heeseung Kim et.al. 2402.05706 link
2024-02-05 Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations Álvaro Martín-Cortinas et.al. 2402.03407 null
2024-02-02 Natural language guidance of high-fidelity text-to-speech with synthetic annotations Dan Lyth et.al. 2402.01912 null
2024-01-23 Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization Wei-Ping Huang et.al. 2402.01692 null
2024-02-01 Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech Dong Yang et.al. 2402.00288 link
2024-02-01 PAM: Prompting Audio-Language Models for Audio Quality Assessment Soham Deshmukh et.al. 2402.00282 link
2024-01-31 Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2 Jiatong Shi et.al. 2401.17619 link
2024-01-28 MunTTS: A Text-to-Speech System for Mundari Varun Gumma et.al. 2401.15579 link
2024-01-30 VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech Chenpeng Du et.al. 2401.14321 null
2024-01-25 Text to speech synthesis Harini s et.al. 2401.13891 link
2024-01-25 SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation Dong Zhang et.al. 2401.13527 link
2024-01-22 Benchmarking Large Multimodal Models against Common Corruptions Jiawei Zhang et.al. 2401.11943 link
2024-01-22 Adversarial speech for voice privacy protection from Personalized Speech generation Shihao Chen et.al. 2401.11857 null
2024-02-16 Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis Vinotha R et.al. 2401.11771 null
2024-01-19 Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech Abhinav Garg et.al. 2401.10465 null
2024-02-28 MLAAD: The Multi-Language Audio Anti-Spoofing Dataset Nicolas M. Müller et.al. 2401.09512 null
2024-01-15 MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory Robert G. Kimelman et.al. 2401.07967 null
2024-01-14 ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering Yakun Song et.al. 2401.07333 null
2024-01-12 Multi-Task Learning for Front-End Text Processing in TTS Wonjune Kang et.al. 2401.06321 link
2024-01-11 End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 Aniket Tathe et.al. 2401.06183 null
2024-01-11 Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection Lian Huang et.al. 2401.05614 null
2024-01-10 Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters Kenichi Fujita et.al. 2401.05111 null
2024-01-07 Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments Zhonghao Shi et.al. 2401.03581 null
2024-01-07 Transfer the linguistic representations from TTS to accent conversion with non-parallel data Xi Chen et.al. 2401.03538 null
2024-01-03 Incremental FastPitch: Chunk-based High Quality Text to Speech Muyang Du et.al. 2401.01755 null
2024-01-03 Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction Minchan Kim et.al. 2401.01498 null
2023-12-18 Assisting Blind People Using Object Detection with Vocal Feedback Heba Najm et.al. 2401.01362 null
2023-12-30 Boosting Large Language Model for Speech Synthesis: An Empirical Study Hongkun Hao et.al. 2401.00246 null
2024-01-01 Normalization of Lithuanian Text Using Regular Expressions Pijus Kasparaitis et.al. 2312.17660 null
2023-12-27 AE-Flow: AutoEncoder Normalizing Flow Jakub Mosiński et.al. 2312.16552 null
2023-12-22 Creating New Voices using Normalizing Flows Piotr Bilinski et.al. 2312.14569 null
2023-12-22 ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations Cheng Gong et.al. 2312.14398 null
2023-12-19 External Knowledge Augmented Polyphone Disambiguation Using Large Language Model Chen Li et.al. 2312.11920 null
2023-12-17 A review-based study on different Text-to-Speech technologies Md. Jalal Uddin Chowdhury et.al. 2312.11563 null
2024-01-31 MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis Wenhao Guan et.al. 2312.10687 null
2024-02-22 Amphion: An Open-Source Audio, Music and Speech Generation Toolkit Xueyao Zhang et.al. 2312.09911 link
2023-12-11 Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism Georgios Milis et.al. 2312.06613 link
2023-12-08 An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis Via Nielson et.al. 2312.05415 null
2023-12-06 Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis Zehua Chen et.al. 2312.03491 null
2023-12-02 Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning Raviraj Joshi et.al. 2312.01107 null
2023-12-02 Code-Mixed Text to Speech Synthesis under Low-Resource Constraints Raviraj Joshi et.al. 2312.01103 null
2023-11-29 Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes Pavel Korshunov et.al. 2311.17655 null
2024-02-06 Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech Enting Zhou et.al. 2311.14816 link
2023-12-07 Guided Flows for Generative Modeling and Decision Making Qinqing Zheng et.al. 2311.13443 null
2023-11-27 HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis Sang-Hoon Lee et.al. 2311.12454 link
2023-11-18 Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots Farideh Majidi et.al. 2311.11116 null
2023-11-18 Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys Gabriel Cosache et.al. 2311.11030 null
2023-11-17 A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness Mathias Vogel et.al. 2311.10804 null
2023-11-16 Improving fairness for spoken language understanding in atypical speech with Text-to-Speech Helin Wang et.al. 2311.10149 link
2024-02-02 DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation Jianzong Wang et.al. 2311.07965 null
2023-11-12 ChatAnything: Facetime Chat with LLM-Enhanced Personas Yilin Zhao et.al. 2311.06772 null
2023-11-11 NewsGPT: ChatGPT Integration for Robot-Reporter Abdelhadi Hireche et.al. 2311.06640 link
2023-11-08 Synthetic Speaking Children -- Why We Need Them and How to Make Them Muhammad Ali Farooq et.al. 2311.06307 null
2023-09-25 Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image Minki Kang et.al. 2311.05844 null
2023-11-07 Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning Rishabh Jain et.al. 2311.04313 link
2023-11-07 Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment Jakir Hasan et.al. 2311.03792 null
2023-11-08 Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction Minchan Kim et.al. 2311.02898 null
2023-11-02 Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations Hanglei Zhang et.al. 2311.01260 null
2023-11-02 E3 TTS: Easy End-to-End Diffusion-based Text to Speech Yuan Gao et.al. 2311.00945 null
2023-10-31 An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation Yingjie Zhou et.al. 2310.20251 link
2023-10-27 Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN Neeraj Kumar et.al. 2310.18169 null
2023-10-25 ArTST: Arabic Text and Speech Transformer Hawau Olamide Toyin et.al. 2310.16621 link
2023-10-25 Generative Pre-training for Speech with Flow Matching Alexander H. Liu et.al. 2310.16338 null
2023-10-23 DPP-TTS: Diversifying prosodic features of speech via determinantal point processes Seongho Joo et.al. 2310.14663 null
2023-10-22 An overview of text-to-speech systems and media applications Mohammad Reza Hasanabadi et.al. 2310.14301 null
2023-10-14 Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling Tiberiu Boros et.al. 2310.09636 link
2023-10-14 Attentive Multi-Layer Perceptron for Non-autoregressive Generation Shuyang Jiang et.al. 2310.09512 link
2023-12-22 Crowdsourced and Automatic Speech Prominence Estimation Max Morrison et.al. 2310.08464 link
2023-10-12 On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition Nick Rossenbach et.al. 2310.08132 null
2023-10-12 Vec-Tok Speech: speech vectorization and tokenization for neural speech generation Xinfa Zhu et.al. 2310.07246 link
2023-10-10 Prosody Analysis of Audiobooks Charuta Pethe et.al. 2310.06930 link
2023-10-09 JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions Detai Xin et.al. 2310.06072 null
2024-01-09 Unified speech and gesture synthesis using flow matching Shivam Mehta et.al. 2310.05181 null
2023-10-08 Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset Ze Liu et.al. 2310.04982 null
2023-10-11 LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT Jiaming Wang et.al. 2310.04673 null
2024-01-22 Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis Jae-Sung Bae et.al. 2310.03538 null
2023-10-07 The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains Erica Cooper et.al. 2310.02640 null
2023-10-02 Towards human-like spoken dialogue generation between AI agents from written dialogue Kentaro Mitsui et.al. 2310.01088 null
2023-10-01 Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech Dareen Alharthi et.al. 2310.00706 null
2024-03-11 Fewer-token Neural Speech Codec with Time-invariant Codes Yong Ren et.al. 2310.00014 link
2024-01-31 ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech Wenhao Guan et.al. 2309.17056 null
2023-09-29 Low-Resource Self-Supervised Learning with SSL-Enhanced TTS Po-chun Hsu et.al. 2309.17020 null
2023-09-29 Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features Yuxiang Zhang et.al. 2309.16954 null
2023-12-18 High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models Chunyu Qiang et.al. 2309.15512 null
2024-01-09 BiSinger: Bilingual Singing Voice Synthesis Huali Zhou et.al. 2309.14089 link
2023-10-07 HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS Dake Guo et.al. 2309.13907 null
2023-09-24 VoiceLDM: Text-to-Speech with Environmental Context Yeonghyeon Lee et.al. 2309.13664 null
2023-09-24 Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control Aya Watanabe et.al. 2309.13509 null
2023-09-22 DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis Yu Gu et.al. 2309.12792 null
2023-09-22 Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts Shun Lei et.al. 2309.11977 null
2023-09-21 The Impact of Silence on Speech Anti-Spoofing Yuxiang Zhang et.al. 2309.11827 null
2023-09-21 Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech Rui Liu et.al. 2309.11724 link
2023-09-20 Speak While You Think: Streaming Speech Synthesis During Text Generation Avihu Dekel et.al. 2309.11210 null
2023-09-20 Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model Xinyu Zhou et.al. 2309.11000 link
2023-09-19 Exploring Speech Enhancement for Low-resource Speech Synthesis Zhaoheng Ni et.al. 2309.10795 null
2023-09-19 Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition Ziyang Ma et.al. 2309.10294 null
2023-09-17 Augmenting text for spoken language understanding with Large Language Models Roshan Sharma et.al. 2309.09390 null
2023-09-16 FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework Jianzong Wang et.al. 2309.08837 null
2023-09-15 Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech Dariusz Piotrowski et.al. 2309.08255 null
2023-09-15 HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods Hyun-seo Shin et.al. 2309.08208 link
2023-12-27 PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions Reo Shimizu et.al. 2309.08140 null
2023-09-15 Diversity-based core-set selection for text-to-speech with linguistic and acoustic features Kentaro Seki et.al. 2309.08127 null
2023-09-14 Direct Text to Speech Translation System using Acoustic Units Victoria Mingote et.al. 2309.07478 null
2023-10-07 FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec Zhihao Du et.al. 2309.07405 link
2023-09-13 DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation Zhichao Wu et.al. 2309.06787 null
2023-09-11 Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP Jinzuomu Zhong et.al. 2309.05423 link
2024-01-16 VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching Yiwei Guo et.al. 2309.05027 link
2023-09-08 Cross-Utterance Conditioned VAE for Speech Generation Yang Li et.al. 2309.04156 null
2023-09-07 Large-Scale Automatic Audiobook Creation Brendan Walsh et.al. 2309.03926 null
2023-09-11 GRASS: Unified Generation Model for Speech-to-Semantic Tasks Aobo Xia et.al. 2309.02780 null
2023-09-12 MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 Zhihang Xu et.al. 2309.02743 null
2023-10-12 PromptTTS 2: Describing and Generating Voices with Text Prompt Yichong Leng et.al. 2309.02285 null
2023-09-04 A Comparative Analysis of Pretrained Language Models for Text-to-Speech Marcel Granero-Moya et.al. 2309.01576 null
2023-09-02 DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin Tao Li et.al. 2309.00883 null
2023-12-18 Learning Speech Representation From Contrastive Token-Acoustic Pretraining Chunyu Qiang et.al. 2309.00424 null
2023-09-01 The FruitShell French synthesis system at the Blizzard 2023 Challenge Xin Qi et.al. 2309.00223 null
2023-08-31 QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning Haohan Guo et.al. 2309.00126 null
2024-01-23 SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models Xin Zhang et.al. 2308.16692 link
2023-08-31 Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis Weiqin Li et.al. 2308.16593 null
2023-08-31 Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information Jie Chen et.al. 2308.16577 null
2023-08-31 LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech Jie Chen et.al. 2308.16569 null
2023-08-30 CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis Yi Meng et.al. 2308.16021 null
2023-09-01 The DeepZen Speech Synthesis System for Blizzard Challenge 2023 Christophe Veaux et.al. 2308.15945 null
2023-08-28 Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech Hyungchan Yoon et.al. 2308.14909 null
2023-09-04 Rep2wav: Noise Robust text-to-speech Using self-supervised representations Qiushi Zhu et.al. 2308.14553 null
2023-08-28 TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models Shengpeng Ji et.al. 2308.14430 link
2023-09-02 Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder Xuyuan Li et.al. 2308.13365 null
2023-08-24 Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations Wenbin Wang et.al. 2308.13007 null
2023-09-22 Sparks of Large Audio Models: A Survey and Outlook Siddique Latif et.al. 2308.12792 null
2023-10-25 SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Seamless Communication et.al. 2308.11596 link
2023-08-31 Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models Heyang Xue et.al. 2308.10428 null
2023-08-16 AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis Hrishikesh Viswanath et.al. 2308.08577 null
2023-08-14 SpeechX: Neural Codec Language Model as a Versatile Speech Transformer Xiaofei Wang et.al. 2308.06873 null
2023-08-12 Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation Zhichao Wang et.al. 2308.06457 link
2023-09-09 AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining Haohe Liu et.al. 2308.05734 link
2023-08-09 Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay Leixian Shen et.al. 2308.04703 null
2023-08-08 Towards an AI to Win Ghana's National Science and Maths Quiz George Boateng et.al. 2308.04333 link
2023-08-08 WonderFlow: Narration-Centric Design of Animated Data Videos Yun Wang et.al. 2308.04040 null
2023-08-04 Let's Give a Voice to Conversational Agents in Virtual Reality Michele Yin et.al. 2308.02665 link
2023-08-03 Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation Minsu Kim et.al. 2308.01831 link
2023-08-02 SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis Ramanan Sivaguru et.al. 2308.01018 null
2023-07-07 Artificial Eye for the Blind Abhinav Benagi et.al. 2308.00801 null
2023-07-31 Multilingual context-based pronunciation learning for Text-to-Speech Giulia Comini et.al. 2307.16709 null
2023-07-31 Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech Guangyan Zhang et.al. 2307.16679 null
2023-07-31 Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings Manuel Sam Ribeiro et.al. 2307.16643 null
2023-07-31 DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training Hyung-Seok Oh et.al. 2307.16549 link
2023-07-31 VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design Jungil Kong et.al. 2307.16430 link
2023-07-30 Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation Yuanhao Chen et.al. 2307.16199 link
2023-07-29 METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer Xinfa Zhu et.al. 2307.15951 link
2023-12-18 Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding Chunyu Qiang et.al. 2307.15484 null
2023-07-20 SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer Daegyeom Kim et.al. 2307.10550 link
2023-07-18 SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs Yinghao Aaron Li et.al. 2307.09435 null
2023-09-28 Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts Ziyue Jiang et.al. 2307.07218 null
2023-07-13 Controllable Emphasis with zero data for text-to-speech Arnaud Joly et.al. 2307.07062 null
2023-07-11 On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis Siyang Wang et.al. 2307.05132 null
2023-07-10 The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task Kun Song et.al. 2307.04630 null
2023-10-07 ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading Yujia Xiao et.al. 2307.00782 null
2023-06-28 EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech Daria Diatlova et.al. 2307.00024 link
2023-06-29 High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units Junchen Lu et.al. 2306.17005 null
2023-06-28 UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data Heeseung Kim et.al. 2306.16083 link
2023-10-19 Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Matthew Le et.al. 2306.15687 null
2023-06-27 GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech Yahuan Cong et.al. 2306.15304 null
2023-06-25 DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech Sen Liu et.al. 2306.14145 null
2023-06-21 Visual-Aware Text-to-Speech Mohan Zhou et.al. 2306.12020 null
2023-06-21 Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer Jakub Swiatkowski et.al. 2306.11662 null
2023-06-16 Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation Kishor Kayyar Lakshminarayana et.al. 2306.10152 null
2023-06-16 CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages Frederico S. Oliveira et.al. 2306.10097 null
2023-06-14 Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation Zheng Liang et.al. 2306.08588 null
2023-06-14 Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects Xinghua Qu et.al. 2306.08219 link
2023-11-20 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models Yinghao Aaron Li et.al. 2306.07691 null
2024-01-18 UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding Chenpeng Du et.al. 2306.07547 null
2023-06-13 PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling Ji-Sang Hwang et.al. 2306.07489 null
2023-06-09 Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech Shijun Wang et.al. 2306.05709 null
2023-06-08 VIFS: An End-to-End Variational Inference for Foley Sound Synthesis Junhyeok Lee et.al. 2306.05004 link
2023-07-11 Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge Wenhao Guan et.al. 2306.04301 null
2023-06-06 Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias Ziyue Jiang et.al. 2306.03509 null
2023-08-02 Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis Zhenhui Ye et.al. 2306.03504 null
2023-06-05 Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis Dengfeng Ke et.al. 2306.02593 null
2023-06-05 Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model Hoyeon Lee et.al. 2306.02579 null
2023-06-05 Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming Xinlei Niu et.al. 2306.02568 link
2023-06-02 Towards Robust FastSpeech 2 by Modelling Residual Multimodality Fabian Kögel et.al. 2306.01442 link
2023-05-30 Towards Selection of Text-to-speech Data to Augment ASR Training Shuo Liu et.al. 2306.00998 null
2023-06-01 EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis Haobin Tang et.al. 2306.00648 null
2023-06-01 The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech Phat Do et.al. 2306.00535 null
2023-05-31 Text-to-Speech Pipeline for Swiss German -- A comparison Tobias Bollinger et.al. 2305.19750 null
2023-05-31 XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech Linh The Nguyen et.al. 2305.19709 link
2023-06-01 PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions Guanghou Liu et.al. 2305.19522 null
2023-05-30 Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages Phat Do et.al. 2305.19396 null
2023-05-30 Make-A-Voice: Unified Voice Synthesis With Discrete Representation Rongjie Huang et.al. 2305.19269 null
2023-05-30 STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions Michel Plüss et.al. 2305.18855 null
2023-05-30 LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus Yuma Koizumi et.al. 2305.18802 null
2023-10-09 An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization Fei Kong et.al. 2305.18355 link
2023-05-29 ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation Ambuj Mehrish et.al. 2305.18028 link
2023-05-29 Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis Erik Ekstedt et.al. 2305.17971 null
2023-07-25 StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation Kun Song et.al. 2305.17732 null
2023-05-28 Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS Sewade Ogun et.al. 2305.17724 link
2023-07-19 Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing Julia Kaiwen Lau et.al. 2305.17445 link
2023-05-26 DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction Vineet Bhat et.al. 2305.16957 null
2023-05-25 Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion Rui Liu et.al. 2305.16353 link
2023-05-22 Text Generation with Speech Synthesis for ASR Data Augmentation Zhuangqun Huang et.al. 2305.16333 null
2023-05-25 VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation Tianrui Wang et.al. 2305.16107 null
2023-05-25 Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration Rustem Yeshpanov et.al. 2305.15749 link
2024-02-05 LAraBench: Benchmarking Arabic AI with Large Language Models Ahmed Abdelali et.al. 2305.14982 null
2023-05-23 EfficientSpeech: An On-Device Text to Speech Model Rowel Atienza et.al. 2305.13905 link
2023-05-23 ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models Minki Kang et.al. 2305.13831 null
2023-05-22 U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech Xin Jing et.al. 2305.13195 null
2023-05-25 EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels Kari Ali Noriy et.al. 2305.13137 link
2023-05-22 ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer Huadai Liu et.al. 2305.12708 null
2023-05-21 VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages Shivam Mhaskar et.al. 2305.12518 null
2023-05-26 Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus Detai Xin et.al. 2305.12442 link
2023-05-20 ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios Yuyue Wang et.al. 2305.12200 null
2023-05-19 MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting Neil Shah et.al. 2305.11926 null
2024-02-20 Data Redaction from Conditional Generative Models Zhifeng Kong et.al. 2305.11351 null
2023-05-18 Parameter-Efficient Learning for Text-to-Speech Accent Adaptation Li-Jen Yang et.al. 2305.11320 link
2023-05-19 Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation Martijn Bartelds et.al. 2305.10951 link
2023-09-30 Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data Yusheng Tian et.al. 2305.10891 link
2023-05-18 FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs Won Jang et.al. 2305.10823 null
2023-05-18 CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training Zhenhui Ye et.al. 2305.10763 null
2023-08-29 A unified front-end framework for English text-to-speech synthesis Zelin Ying et.al. 2305.10666 null
2023-09-19 Controllable Speaking Styles Using a Large Language Model Atli Thor Sigurgeirsson et.al. 2305.10321 null
2023-05-23 Better speech synthesis through scaling James Betker et.al. 2305.07243 link
2023-10-29 CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model Zhen Ye et.al. 2305.06908 link
2023-05-08 Accented Text-to-Speech Synthesis with Limited Data Xuehao Zhou et.al. 2305.04816 null
2023-05-03 M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis Jinlong Xue et.al. 2305.02269 null
2023-05-30 A Review of Deep Learning Techniques for Speech Processing Ambuj Mehrish et.al. 2305.00359 null
2023-04-26 Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis Ye-Xin Lu et.al. 2304.13270 null
2023-04-25 Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge Chenpeng Du et.al. 2304.13121 null
2023-04-24 Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model Kenichi Fujita et.al. 2304.11976 null
2023-04-23 DiffVoice: Text-to-Speech with Latent Diffusion Zhijun Liu et.al. 2304.11750 null
2023-04-23 SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model Jianzong Wang et.al. 2304.11547 null
2023-05-31 NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers Kai Shen et.al. 2304.09116 null
2023-04-16 A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers Juan Zuluaga-Gomez et.al. 2304.07842 null
2023-04-13 Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis Shun Lei et.al. 2304.06359 null
2023-04-10 Enhancing Speech-to-Speech Translation with Multiple TTS Targets Jiatong Shi et.al. 2304.04618 null
2023-04-07 ArmanTTS single-speaker Persian dataset Mohammd Hasan Shamgholi et.al. 2304.03585 null
2023-04-03 Ensemble prosody prediction for expressive speech synthesis Tian Huey Teh et.al. 2304.00714 null
2023-03-29 AraSpot: Arabic Spoken Command Spotting Mahmoud Salhab et.al. 2303.16621 link
2023-03-28 Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages Seongyeon Park et.al. 2303.15669 link
2023-03-27 Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis Karren Yang et.al. 2303.14885 null
2023-03-24 Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis Takuhiro Kaneko et.al. 2303.13909 null
2023-04-02 A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI Chenshuang Zhang et.al. 2303.13336 null
2023-03-20 Code-Switching Text Generation and Injection in Mandarin-English ASR Haibin Yu et.al. 2303.10949 null
2023-03-14 Controlling High-Dimensional Data With Sparse Input Dan Andrei Iliescu et.al. 2303.09446 null
2023-03-09 Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports Hyunseung Chung et.al. 2303.09395 link
2023-03-15 Cross-speaker Emotion Transfer by Manipulating Speech Style Latents Suhee Jo et.al. 2303.08329 null
2023-03-14 QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis Haobin Tang et.al. 2303.07682 null
2023-03-10 An End-to-End Neural Network for Image-to-Audio Transformation Liu Chen et.al. 2303.06078 null
2023-03-09 Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation Qi Chen et.al. 2303.05322 link
2023-03-07 Do Prosody Transfer Models Transfer Prosody? Atli Thor Sigurgeirsson et.al. 2303.04289 null
2023-03-07 Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling Ziqiang Zhang et.al. 2303.03926 null
2023-03-02 Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding Yingting Li et.al. 2303.03267 link
2023-03-08 FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model Ruiqing Xue et.al. 2303.02939 null
2023-08-14 Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations Yuma Koizumi et.al. 2303.01664 null
2023-03-11 Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities Shijun Wang et.al. 2303.01508 null
2023-12-17 ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations Neil Shah et.al. 2303.01261 null
2023-03-02 LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion Chunfeng Wang et.al. 2303.01086 null
2023-03-02 Leveraging Large Text Corpora for End-to-End Speech Summarization Kohei Matsuura et.al. 2303.00978 null
2023-03-01 DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction Raviteja Anantha et.al. 2303.00171 null
2023-02-28 ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus Ajinkya Kulkarni et.al. 2303.00069 null
2023-02-28 Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners Jocelyn Huang et.al. 2302.14523 null
2023-06-12 CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis Ji-Hoon Kim et.al. 2302.14370 null
2023-05-19 UniFLG: Unified Facial Landmark Generator from Text or Speech Kentaro Mitsui et.al. 2302.14337 null
2023-02-27 Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech Jiyoung Lee et.al. 2302.13700 link
2023-02-27 Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech Dong Yang et.al. 2302.13652 null
2023-02-27 Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow Yoonhyung Lee et.al. 2302.13458 null
2023-06-06 PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS Junhyeok Lee et.al. 2302.12391 link
2023-02-21 Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition Leyuan Qu et.al. 2302.09723 null
2023-02-23 QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion Houjian Guo et.al. 2302.08296 link
2023-02-13 Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages Sudhanshu Srivastava et.al. 2302.06227 null
2023-02-08 A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech Li-Wei Chen et.al. 2302.04215 link
2023-02-07 Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision Eugene Kharitonov et.al. 2302.03540 null
2023-02-15 MAC: A unified framework boosting low resource automatic speech recognition Zeping Min et.al. 2302.03498 null
2023-06-25 InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt Dongchao Yang et.al. 2301.13662 link
2023-03-01 UzbekTagger: The rule-based POS tagger for Uzbek language Maksud Sharipov et.al. 2301.12711 null
2023-05-27 Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining Takaaki Saeki et.al. 2301.12596 link
2023-01-31 Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker Navjot Kaur et.al. 2301.12331 link
2023-01-26 On granularity of prosodic representations in expressive text-to-speech Mikolaj Babianski et.al. 2301.11446 null
2023-01-26 Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study Massa Baali et.al. 2301.09099 link
2023-01-20 Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions Yinghao Aaron Li et.al. 2301.08810 null
2023-01-11 Modelling low-resource accents without accent-specific TTS frontend Georgi Tinchev et.al. 2301.04606 null
2022-12-11 BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm Yu-Wen Chen et.al. 2301.04120 link
2023-01-10 UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion Haogeng Liu et.al. 2301.03801 null
2023-01-10 Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation Abdullah Shahid et.al. 2301.03751 null
2023-09-19 Applying Automated Machine Translation to Educational Video Courses Linden Wang et.al. 2301.03141 null
2023-01-06 Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition David M. Chan et.al. 2301.02736 null
2023-01-05 Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers Chengyi Wang et.al. 2301.02111 link
2022-12-11 MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset Kailin Liang et.al. 2301.00657 link
2022-12-30 ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech Zehua Chen et.al. 2212.14518 null
2022-12-29 StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models Yinghao Aaron Li et.al. 2212.14227 link
2022-12-22 HMM-based data augmentation for E2E systems for building conversational speech synthesis systems Ishika Gupta et.al. 2212.11982 null
2022-12-21 ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement Wei-Ning Hsu et.al. 2212.11377 null
2022-12-20 TTS-Guided Training for Accent Conversion Without Parallel Data Yi Zhou et.al. 2212.10204 null
2023-06-28 Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling Tuomo Raitio et.al. 2212.10075 null
2022-12-16 Speech Aware Dialog System Technology Challenge (DSTC11) Hagen Soltau et.al. 2212.08704 null
2022-12-16 Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder Yusuke Yasuda et.al. 2212.08329 null
2022-12-16 Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language Yusuke Yasuda et.al. 2212.08321 null
2022-12-15 RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis Shinhyeok Oh et.al. 2212.07939 link
2022-12-14 Probing Deep Speaker Embeddings for Speaker-related Tasks Zifeng Zhao et.al. 2212.07068 null
2022-12-08 SpeechLMScore: Evaluating speech generation using speech language model Soumi Maiti et.al. 2212.04559 link
2023-04-04 Learning to Dub Movies via Hierarchical Prosody Models Gaoxiang Cong et.al. 2212.04054 link
2022-12-07 Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning Ankur Debnath et.al. 2212.03558 null
2022-12-07 Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue Daxin Tan et.al. 2212.03398 null
2022-12-06 UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis Yi Lei et.al. 2212.01546 null
2022-11-30 SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech Byoung Jin Choi et.al. 2211.16866 null
2022-11-29 Controllable speech synthesis by learning discrete phoneme-level prosodic representations Nikolaos Ellinas et.al. 2211.16307 null
2023-05-25 Evaluating and reducing the distance between synthetic and real speech distributions Christoph Minixhofer et.al. 2211.16049 null
2022-11-26 Contextual Expressive Text-to-Speech Jianhong Tu et.al. 2211.14548 null
2022-12-05 Efficient Incremental Text-to-Speech on GPUs Muyang Du et.al. 2211.13939 null
2023-03-21 Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems? Xuan Shi et.al. 2211.13868 link
2022-11-23 IMaSC -- ICFOSS Malayalam Speech Corpus Deepa P Gopinath et.al. 2211.12796 null
2022-11-22 PromptTTS: Controllable Text-to-Speech with Text Descriptions Zhifang Guo et.al. 2211.12171 null
2022-11-04 Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech Xin Zhang et.al. 2211.09731 null
2023-02-17 Towards Building Text-To-Speech Systems for the Next Billion Users Gokul Karthik Kumar et.al. 2211.09536 link
2023-02-16 EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance Yiwei Guo et.al. 2211.09496 null
2022-11-17 Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation Chunyu Qiang et.al. 2211.09495 null
2022-11-17 NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis Hyeong-Seok Choi et.al. 2211.09407 null
2023-03-14 Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models Minki Kang et.al. 2211.09383 null
2023-01-04 Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation Xin Yuan et.al. 2211.09365 null
2022-11-14 SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech Perry Lam et.al. 2211.07283 null
2023-05-25 Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing Jacob J Webber et.al. 2211.06989 null
2023-05-29 OverFlow: Putting flows on top of neural transducers for better TTS Shivam Mehta et.al. 2211.06892 link
2023-05-29 Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations Yoori Oh et.al. 2211.06160 null
2022-12-04 ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech Xiaoran Fan et.al. 2211.03545 link
2022-11-07 Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder Jan Melechovsky et.al. 2211.03316 link
2022-11-06 Parallel Attention Forcing for Machine Translation Qingyun Dou et.al. 2211.03237 null
2022-11-06 An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space Jihwan Lee et.al. 2211.03078 null
2022-11-04 NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS Dongchao Yang et.al. 2211.02448 null
2022-11-04 Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts Detai Xin et.al. 2211.02336 null
2023-04-16 Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS Ziqi Liang et.al. 2211.01948 null
2022-11-01 Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages Anusha Prakash et.al. 2211.01338 null
2023-05-28 DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP Kun Song et.al. 2211.01087 null
2022-11-22 Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement Wei Song et.al. 2211.00967 null
2022-11-01 Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers Cheng-Ping Hsieh et.al. 2211.00585 link
2023-06-11 Generating Multilingual Gender-Ambiguous Text-to-Speech Voices Konstantinos Markopoulos et.al. 2211.00375 null
2023-05-07 Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features Alexandra Vioni et.al. 2211.00342 null
2022-11-02 Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS Kun Song et.al. 2210.17349 null
2024-02-27 Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation Nikolaos Ellinas et.al. 2210.17264 null
2022-10-31 Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection Luigi Attorresi et.al. 2210.17222 null
2022-10-31 Structured State Space Decoder for Speech Recognition and Synthesis Koichi Miyazaki et.al. 2210.17098 null
2022-10-28 Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders Jason Fong et.al. 2210.16045 null
2023-02-21 Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform Masaya Kawamura et.al. 2210.15975 link
2023-02-22 Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis Yuma Shirahata et.al. 2210.15964 null
2022-10-28 Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation Nobuyuki Morioka et.al. 2210.15868 null
2023-03-15 Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech Takaaki Saeki et.al. 2210.15447 null
2022-10-27 Explicit Intensity Control for Accented Text-to-speech Rui Liu et.al. 2210.15364 null
2022-10-27 FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis Yifan Hu et.al. 2210.15360 link
2022-10-26 Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection Kentaro Seki et.al. 2210.14850 null
2022-10-25 Semi-Supervised Learning Based on Reference Model for Low-resource TTS Xulong Zhang et.al. 2210.14723 null
2022-10-26 Cover Reproducible Steganography via Deep Generative Models Kejiang Chen et.al. 2210.14632 null
2022-10-26 Improving Speech-to-Speech Translation Through Unlabeled Text Xuan-Phi Nguyen et.al. 2210.14514 null
2022-10-26 The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Switching ASR Challenge Yuhao Liang et.al. 2210.14448 null
2022-10-25 Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data Xulong Zhang et.al. 2210.13803 null
2023-09-17 HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation Chunhui Wang et.al. 2210.12740 null
2022-10-21 Low-Resource Multilingual and Zero-Shot Multispeaker TTS Florian Lux et.al. 2210.12223 link
2022-10-21 Adaptive re-calibration of channel-wise features for Adversarial Audio Classification Vardhan Dongre et.al. 2210.11722 null
2022-10-20 Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS Chunyu Qiang et.al. 2210.11429 null
2022-10-17 Towards Relation Extraction From Speech Tongtong Wu et.al. 2210.08759 link
2023-02-08 Generating Synthetic Speech from SpokenVocab for Speech Translation Jinming Zhao et.al. 2210.08174 link
2022-10-17 LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge Yan Jia et.al. 2210.07749 null
2022-10-20 Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy Sarina Meyer et.al. 2210.07002 link
2022-10-13 Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar Aolan Sun et.al. 2210.06877 null
2022-10-12 Can we use Common Voice to train a Multi-Speaker TTS system? Sewade Ogun et.al. 2210.06370 null
2023-06-01 SQuId: Measuring Speech Naturalness in Many Languages Thibault Sellam et.al. 2210.06324 null
2022-11-22 Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech Byoung Jin Choi et.al. 2210.05979 null
2022-10-06 An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era Andreas Triantafyllopoulos et.al. 2210.03538 null
2022-09-29 Facial Landmark Predictions with Applications to Metaverse Qiao Han et.al. 2209.14698 link
2022-09-26 Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech Yusuke Nakai et.al. 2209.12549 null
2022-09-22 EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models Perry Lam et.al. 2209.10890 null
2022-09-22 MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline Yifan Hu et.al. 2209.10848 link
2022-09-22 Controllable Accented Text-to-Speech Synthesis Rui Liu et.al. 2209.10804 null
2022-09-16 TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection Davide Salvi et.al. 2209.08000 null
2022-09-14 Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset Michael Chinen et.al. 2209.06358 null
2022-09-08 SANIP: Shopping Assistant and Navigation for the visually impaired Shubham Deshmukh et.al. 2209.03570 null
2022-09-07 Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech Huu-Tien Dang et.al. 2209.02971 null
2022-09-02 Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model Jennifer Drexler Fox et.al. 2209.01250 null
2022-08-28 Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks Lev Finkelstein et.al. 2208.13183 null
2022-10-04 Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale Aditya Agarwal et.al. 2208.09796 null
2022-08-21 Visualising Model Training via Vowel Space for Text-To-Speech Systems Binu Abeysinghe et.al. 2208.09775 link
2022-08-15 Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0 Mohammed Salah Al-Radhi et.al. 2208.07122 null
2022-12-28 Speech Synthesis with Mixed Emotions Kun Zhou et.al. 2208.05890 null
2022-08-03 A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis Qibing Bai et.al. 2208.02189 null
2022-07-29 Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation Giulia Comini et.al. 2207.14607 null
2022-07-25 Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis Raul Fernandez et.al. 2207.12262 null
2022-07-01 A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese Song Zhang et.al. 2207.12089 null
2022-07-20 When Is TTS Augmentation Through a Pivot Language Useful? Nathaniel Robinson et.al. 2207.09889 link
2022-07-11 LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech Harshvardhan Anand et.al. 2207.07118 null
2022-07-13 ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech Rongjie Huang et.al. 2207.06389 link
2022-07-13 Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech Zhengxi Liu et.al. 2207.06088 null
2022-07-13 SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate Nabarun Goswami et.al. 2207.06011 null
2022-07-13 Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS Yookyung Shin et.al. 2207.06000 null
2022-07-13 A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System Yi-Chiao Wu et.al. 2207.05913 null
2022-07-12 Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition Rodolfo Zevallos et.al. 2207.05498 null
2022-07-12 End-to-end speech recognition modeling from de-identified data Martin Flechl et.al. 2207.05469 null
2022-07-11 Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data Naoki Makishima et.al. 2207.04659 null
2022-07-11 DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders Yanqing Liu et.al. 2207.04646 null
2023-01-02 Dreamento: an open-source dream engineering toolbox for sleep EEG wearables Mahdad Jafarzadeh Esfahani et.al. 2207.03977 link
2022-07-07 BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus Josh Meyer et.al. 2207.03546 link
2022-07-05 Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion Yi Lei et.al. 2207.01832 null
2022-07-04 BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model Brooke Stephenson et.al. 2207.01718 null
2022-07-04 Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) Ariadna Sanchez et.al. 2207.01547 null
2022-07-04 Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) Ziyao Zhang et.al. 2207.01507 null
2023-03-13 DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech Keon Lee et.al. 2207.01063 link
2022-07-02 Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need Daniel Korzekwa et.al. 2207.00774 null
2022-07-01 Building African Voices Perez Ogayo et.al. 2207.00688 link
2022-07-01 Automatic Evaluation of Speaker Similarity Deja Kamil et.al. 2207.00344 null
2022-08-03 Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding Wei-Ping Huang et.al. 2206.15427 null
2022-06-30 R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS Kyle Kastner et.al. 2206.15276 null
2022-07-01 Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems Hyun-Wook Yoon et.al. 2206.15067 null
2022-06-30 TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder Eunwoo Song et.al. 2206.14984 null
2022-06-29 Improving Deliberation by Text-Only and Semi-Supervised Training Ke Hu et.al. 2206.14716 null
2022-06-29 Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody Peter Makarov et.al. 2206.14643 null
2022-06-28 Expressive, Variable, and Controllable Duration Modelling in TTS Ammar Abbas et.al. 2206.14165 null
2022-06-28 Comparison of Speech Representations for the MOS Prediction System Aki Kunikoshi et.al. 2206.13817 null
2022-06-22 A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data Raviraj Joshi et.al. 2206.13240 null
2022-06-25 Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations Chin-Cheng Hsu et.al. 2206.12662 null
2022-10-21 Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech Florian Lux et.al. 2206.12229 link
2022-06-24 SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech Hyunjae Cho et.al. 2206.12132 null
2022-06-24 End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue Kentaro Mitsui et.al. 2206.12040 null
2022-05-29 Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning Sameea Naeem et.al. 2206.11860 null
2022-06-21 Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS Kenta Udagawa et.al. 2206.10256 null
2022-06-24 Towards Optimizing OCR for Accessibility Peya Mowar et.al. 2206.10254 null
2022-06-16 Automatic Prosody Annotation with Pre-Trained Text-Speech Model Ziqian Dai et.al. 2206.07956 link
2022-11-16 NatiQ: An End-to-end Text-to-Speech System for Arabic Ahmed Abdelali et.al. 2206.07373 null
2022-06-15 Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning Rui Liu et.al. 2206.07229 link
2022-12-12 A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation Junhui Zhang et.al. 2206.04922 null
2022-06-09 Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos Alexander Waibel et.al. 2206.04523 null
2022-06-07 FlexLip: A Controllable Text-to-Lip System Dan Oneata et.al. 2206.03206 null
2022-10-11 UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder Jiachen Lian et.al. 2206.02512 null
2023-10-19 Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech Ziyue Jiang et.al. 2206.02147 link
2022-11-02 AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation Kun Song et.al. 2206.00208 null
2022-05-31 Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish Alp Öktem et.al. 2205.15599 link
2023-11-20 StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis Yinghao Aaron Li et.al. 2205.15439 link
2022-05-30 Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data Sungwon Kim et.al. 2205.15370 null
2022-05-26 QSpeech: Low-Qubit Quantum Speech Application Toolkit Zhenhou Hong et.al. 2205.13221 link
2022-11-10 T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation Paul-Ambroise Duquenne et.al. 2205.12216 null
2022-05-20 PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit Hui Zhang et.al. 2205.12007 link
2022-05-24 TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS Xulong Zhang et.al. 2205.11824 null
2022-10-12 GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Rongjie Huang et.al. 2205.07211 link
2022-05-13 Talking Face Generation with Multilingual TTS Hyoung-Kyu Song et.al. 2205.06421 null
2022-05-10 NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality Xu Tan et.al. 2205.04421 link
2022-05-09 Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech Yang Li et.al. 2205.04120 link
2022-05-09 ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence Sangshin Oh et.al. 2205.04104 null
2022-07-14 Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss Efthymios Georgiou et.al. 2204.13437 null
2024-06-06 Parallel Synthesis for Autoregressive Speech Generation Po-chun Hsu et.al. 2204.11806 null
2022-04-25 SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech Zhenhui Ye et.al. 2204.11792 link
2022-04-22 LibriS2S: A German-English Speech-to-Speech Translation Corpus Pedro Jeuris et.al. 2204.10593 link
2022-07-05 Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation Ryo Terashima et.al. 2204.10020 null
2022-04-21 FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis Rongjie Huang et.al. 2204.09934 link
2022-04-20 Audio Deep Fake Detection System with Neural Stitching for ADD 2022 Rui Yan et.al. 2204.08720 null
2022-04-14 Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech Cong Zhang et.al. 2204.07228 null
2022-12-09 Study of Indian English Pronunciation Variabilities relative to Received Pronunciation Priyanshi Pal et.al. 2204.06502 null
2022-04-12 Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch Hanbin Bae et.al. 2204.05753 null
2023-01-30 The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance Lin Zhang et.al. 2204.05177 null
2022-10-27 Fine-grained Noise Control for Multispeaker Speech Synthesis Karolos Nikitaras et.al. 2204.05070 null
2022-08-31 Karaoker: Alignment-free singing voice synthesis with speech training data Panos Kakoulidis et.al. 2204.04127 null
2022-08-15 Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech Jae-Sung Bae et.al. 2204.04004 null
2022-04-07 Arabic Text-To-Speech (TTS) Data Preparation Hala Al Masri et.al. 2204.03255 null
2022-04-07 Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis Yutian Wang et.al. 2204.03238 null
2022-08-24 SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis Georgia Maniati et.al. 2204.03040 null
2022-09-13 Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation Sravya Popuri et.al. 2204.02967 null
2022-07-02 Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification Jin Woo Lee et.al. 2204.02639 null
2023-08-28 Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech Hyungchan Yoon et.al. 2204.02172 null
2022-09-07 Deliberation Model for On-Device Spoken Language Understanding Duc Le et.al. 2204.01893 null
2022-12-14 Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck Youngsik Eom et.al. 2204.01387 null
2022-11-11 Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis Yixuan Zhou et.al. 2204.00990 null
2022-06-30 VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature Chenpeng Du et.al. 2204.00768 null
2022-04-01 AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios Yihan Wu et.al. 2204.00436 null
2022-04-01 Text-To-Speech Data Augmentation for Low Resource Speech Recognition Rodolfo Zevallos et.al. 2204.00291 null
2022-07-19 Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech Guangyan Zhang et.al. 2203.17190 null
2022-03-31 An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer Wenlin Dai et.al. 2203.16954 link
2022-07-11 WavThruVec: Latent speech representation as intermediate features for neural speech synthesis Hubert Siuzdak et.al. 2203.16930 null
2022-03-31 A Character-level Span-based Model for Mandarin Prosodic Structure Prediction Xueyuan Chen et.al. 2203.16922 link
2022-07-01 JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech Dan Lim et.al. 2203.16852 link
2022-03-31 Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset Zehui Yang et.al. 2203.16844 null
2022-03-31 NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism Jingbei Li et.al. 2203.16838 link
2022-03-31 Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition Anirudh Gupta et.al. 2203.16823 null
2022-04-21 Does Audio Deepfake Detection Generalize? Nicolas M. Müller et.al. 2203.16263 null
2022-03-30 End to End Lip Synchronization with a Temporal AutoEncoder Yoav Shalev et.al. 2203.16224 link
2022-08-15 Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition Junrui Ni et.al. 2203.15796 link
2022-06-29 DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning Takaaki Saeki et.al. 2203.15683 null
2022-11-05 Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation Rendi Chevi et.al. 2203.15643 link
2022-10-06 Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus Minchan Kim et.al. 2203.15447 null
2022-07-11 VoiceMe: Personalized voice generation in TTS Pol van Rijn et.al. 2203.15379 link
2021-07-13 Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging Tamás Gábor Csapó et.al. 2107.05550 null
2021-07-07 Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm Elijah Gutierrez et.al. 2107.02527 null
2022-02-25 Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis Erica Cooper et.al. 2104.12292 null
2019-09-26 Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities Slava Shechtman et.al. 1909.10302 null
2019-08-28 Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis Xin Wang et.al. 1908.10256 null
2019-05-22 Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems Ohsung Kwon et.al. 1905.08486 null
2017-09-26 Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks Yuki Saito et.al. 1709.08041 null


About

Automatically updates text-to-speech (TTS) papers daily using GitHub Actions (updates every 12 hours)
