liutaocode/TTS-arxiv-daily: Automatically updates text-to-speech (TTS) papers daily using GitHub Actions (updated every 12 hours)

Updated on 2025.12.25

Usage instructions: here

This page is modified from here

Table of Contents

TTS

TTS

Publish Date	Title	Authors	PDF	Code
2025-12-23	TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation	Ji-Hoon Kim et.al.	2512.20296	null
2025-12-23	Fun-Audio-Chat Technical Report	Qian Chen et.al.	2512.20156	null
2025-12-22	JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis	Fan Yu et.al.	2512.19090	null
2025-12-21	Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform	Yichuan Zhang et.al.	2512.18791	null
2025-12-21	Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis	Pengchao Feng et.al.	2512.18699	link
2025-12-19	Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability	Tingxiao Zhou et.al.	2512.17356	null
2025-12-19	Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track	June Young Yi et.al.	2512.17293	null
2025-12-18	Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs	Sara Papi et.al.	2512.16378	null
2025-12-16	Adapting Speech Language Model to Singing Voice Synthesis	Yiwen Zhao et.al.	2512.14657	null
2025-12-16	Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty	Yiwen Zhao et.al.	2512.14653	null
2025-12-16	GLM-TTS Technical Report	Jiayan Cui et.al.	2512.14291	null
2025-12-18	A stylometric analysis of speaker attribution from speech transcripts	Cristina Aggazzotti et.al.	2512.13667	null
2025-12-15	Reproducing and Dissecting Denoising Language Models for Speech Recognition	Dorian Koch et.al.	2512.13576	null
2025-12-18	DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec	Tao Li et.al.	2512.13251	null
2025-12-11	CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences	Yiyang Wang et.al.	2512.10918	null
2025-12-10	DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance	Kang Yin et.al.	2512.09504	null
2025-12-09	LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge	Jinyoung Park et.al.	2512.09000	null
2025-12-08	Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS	Mahta Fetrat et.al.	2512.08006	null
2025-12-08	MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection	Xueping Zhang et.al.	2512.07352	null
2025-12-06	Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction	Kush Revankar et.al.	2512.06485	null
2025-12-05	SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures	Panuthep Tasawong et.al.	2512.05501	null
2025-12-05	Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice	Rachel Poonsiriwong et.al.	2512.05397	null
2025-12-04	HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages	Bi-Cheng Yan et.al.	2512.04964	link
2025-12-04	TripleC Learning and Lightweight Speech Enhancement for Multi-Condition Target Speech Extraction	Ziling Huang et.al.	2512.04945	null
2025-12-04	YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance	Junjie Zheng et.al.	2512.04779	null
2025-12-04	Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild	Yigui Feng et.al.	2512.04728	null
2025-12-04	M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis	Xiaopeng Wang et.al.	2512.04720	null
2025-12-04	Large Speech Model Enabled Semantic Communication	Yun Tian et.al.	2512.04711	null
2025-12-04	Limit cycles for speech	Adamantios I. Gafos et.al.	2512.04642	null
2025-12-04	RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS	Cong Wang et.al.	2512.04552	null
2025-12-04	Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention	Cong Wang et.al.	2512.04551	null
2025-12-03	Head, posture, and full-body gestures in interactive communication	Ľuboš Hládek et.al.	2512.03636	null
2025-12-03	A Convolutional Framework for Mapping Imagined Auditory MEG into Listened Brain Responses	Maryam Maghsoudi et.al.	2512.03458	null
2025-12-02	Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR	Mohan Shi et.al.	2512.03301	null
2025-12-02	How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy	Natalia Ponomareva et.al.	2512.03238	null
2025-12-02	MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation	Youxin Pang et.al.	2512.03034	null
2025-12-02	Perceptual evaluation of Acoustic Level of Detail in Virtual Acoustic Environments	Stefan Fichna et.al.	2512.02891	null
2025-12-02	BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion	Sai Koneru et.al.	2512.02817	null
2025-12-02	Reasoning-Aware Multimodal Fusion for Hateful Video Detection	Shuonan Yang et.al.	2512.02743	null
2025-12-02	Hear What Matters! Text-conditioned Selective Video-to-Audio Generation	Junwon Lee et.al.	2512.02650	null
2025-12-02	Spoken Conversational Agents with Large Language Models	Chao-Han Huck Yang et.al.	2512.02593	null
2025-12-02	Co-speech Gesture Video Generation via Motion-Based Graph Retrieval	Yafei Song et.al.	2512.02576	null
2025-12-02	Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation	Xueyan Li et.al.	2512.02523	null
2025-12-02	VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables	Lixing He et.al.	2512.02515	null
2025-12-01	Swivuriso: The South African Next Voices Multilingual Speech Dataset	Vukosi Marivate et.al.	2512.02201	null
2025-12-01	Cross-Lingual Interleaving for Speech Language Models	Adel Moumen et.al.	2512.01865	null
2025-12-01	MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark	Yuezhang Peng et.al.	2512.01603	link
2025-12-01	MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages	Yexing Du et.al.	2512.01512	null
2025-12-01	Model-Based Clustering of Functional Data Via Random Projection Ensembles	Matteo Mori et.al.	2512.01450	null
2025-12-01	EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans	Yingjie Zhou et.al.	2512.01340	null
2025-12-01	fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment	Chunzheng Zhu et.al.	2512.01189	null
2025-11-30	Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept	Himanshi Lalwani et.al.	2512.01105	null
2025-11-30	Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis	Lars Nippert et.al.	2512.00937	null
2025-11-29	STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition	Siyu Wang et.al.	2512.00451	null
2025-11-28	OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion	Sai Koneru et.al.	2512.00234	null
2025-11-28	CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation	Fengyi Fang et.al.	2511.22863	null
2025-11-27	Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration	Kanchon Gharami et.al.	2511.22769	null
2025-11-27	PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning	Jiatong Shi et.al.	2511.22687	null
2025-11-27	Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking	Katia Vendrame et.al.	2511.22503	null
2025-11-27	Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition	Maheswar Bora et.al.	2511.22443	null
2025-11-27	GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis	Teysir Baoueb et.al.	2511.22293	null
2025-11-27	VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task	Yuyue Wang et.al.	2511.22229	null
2025-11-27	Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation	Joel Alberto Santos et.al.	2511.22025	null
2025-11-26	Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection	Bruno Padovese et.al.	2511.21872	null
2025-11-26	Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation	Lina Conti et.al.	2511.21517	null
2025-11-26	TSGM: Regular and Irregular Time-series Generation using Score-based Generative Models	Haksoo Lim et.al.	2511.21335	null
2025-11-26	Acoustic neural networks: Identifying design principles and exploring physical feasibility	Ivan Kalthoff et.al.	2511.21313	null
2025-11-26	Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale	Yicheng Zhong et.al.	2511.21270	null
2025-11-26	CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation	Jionghao Han et.al.	2511.21045	null
2025-11-26	RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data	Zhisheng Zheng et.al.	2511.20974	null
2025-11-26	Towards Audio Token Compression in Large Audio Language Models	Saurabhchand Bhati et.al.	2511.20973	null
2025-11-26	SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications	Jionghao Han et.al.	2511.20972	null
2025-11-25	Continual Audio Deepfake Detection via Universal Adversarial Perturbation	Wangjie Li et.al.	2511.19974	null
2025-11-25	Towards Edge General Intelligence: Knowledge Distillation for Mobile Agentic AI	Yuxuan Wu et.al.	2511.19947	null
2025-11-25	It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models	Xiangyu Zhao et.al.	2511.19877	null
2025-11-24	Evaluating Objective Speech Quality Metrics for Neural Audio Codecs	Luca A. Lanzendörfer et.al.	2511.19734	null
2025-11-24	A Layered Protocol Architecture for the Internet of Agents	Charles Fleming et.al.	2511.19699	null
2025-11-24	Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization	Ellie L. Zhang et.al.	2511.19275	null
2025-11-25	PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation	Huadai Liu et.al.	2511.18833	null
2025-11-24	Context-Aware Whisper for Arabic ASR Under Linguistic Varieties	Bashar Talafha et.al.	2511.18774	null
2025-11-24	AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation	Omar Garib et.al.	2511.18718	null
2025-11-23	The Locally Deployable Virtual Doctor: LLM Based Human Interface for Automated Anamnesis and Database Conversion	Jan Benedikt Ruhland et.al.	2511.18632	null
2025-11-23	InstructAudio: Unified speech and music generation with natural language instruction	Chunyu Qiang et.al.	2511.18487	null
2025-11-23	A Multimodal Conversational Agent for Tabular Data Analysis	Mohammad Nour Al Awad et.al.	2511.18405	null
2025-11-23	Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection	Syed Mohaiminul Hoque et.al.	2511.18324	null
2025-11-23	MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding	Mengchun Zhang et.al.	2511.18294	null
2025-11-22	A superpersuasive autonomous policy debating system	Allen Roush et.al.	2511.17854	null
2025-11-21	Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition	Ayhan Kucukmanisa et.al.	2511.17477	null
2025-11-21	AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice	Guilherme Coelho et.al.	2511.17425	null
2025-11-21	Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM	Chiori Hori et.al.	2511.17335	null
2025-11-21	Investigating self-supervised representations for audio-visual deepfake detection	Dragos-Alexandru Boldisor et.al.	2511.17181	null
2025-11-20	Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation	Wei-Cheng Tseng et.al.	2511.16757	null
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	null
2025-11-21	WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue	Zachary Ellis et.al.	2511.16544	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	null
2025-11-19	Step-Audio-R1 Technical Report	Fei Tian et.al.	2511.15848	null
2025-11-19	A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification	Mohit Sharma et.al.	2511.15766	null
2025-11-19	PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback	Sirui Chen et.al.	2511.15253	null
2025-11-19	Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding	Mingyue Huo et.al.	2511.15145	null
2025-11-19	Aligning Generative Music AI with Human Preferences: Methods and Challenges	Dorien Herremans et.al.	2511.15038	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants	Mingkun Yu et.al.	2511.14852	null
2025-11-18	Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech	Nam-Gyu Kim et.al.	2511.14824	null
2025-11-18	Ground Truth Generation for Multilingual Historical NLP using LLMs	Clovis Gladstone et.al.	2511.14688	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-18	AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR	Gabrial Zencha Ashungafac et.al.	2511.14255	null
2025-11-18	Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning	Rui Liu et.al.	2511.14249	null
2025-11-18	StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model	Yifan Yang et.al.	2511.14223	null
2025-11-18	FxSearcher: gradient-free text-driven audio transformation	Hojoon Ki et.al.	2511.14138	null
2025-11-17	Human-centric Maintenance Process Through Integration of AI, Speech, and AR	Parul Khanna et.al.	2511.13918	null
2025-11-17	Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video	Filippo Cenacchi, Longbing Cao et.al.	2511.13802	null
2025-11-17	PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement	Xiaobin Rong et.al.	2511.13300	null
2025-11-17	Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms	Patrick Parschan et.al.	2511.13238	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis	Zaara Zabeen Arpa et.al.	2511.13159	link
2025-11-17	A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning	Liuyi Jin et.al.	2511.13078	null
2025-11-17	CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models	Mehrab Mustafy Rahman et.al.	2511.12964	null
2025-11-16	Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data	Sina Rashidi et.al.	2511.12690	null
2025-11-16	Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans	Hongbin Huang et.al.	2511.12662	null
2025-11-16	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	null
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-14	Proactive Hearing Assistants that Isolate Egocentric Conversations	Guilin Hu et.al.	2511.11473	link
2025-11-14	Language-Aided State Estimation	Yuki Miyoshi et.al.	2511.11285	null
2025-11-14	Analysing Personal Attacks in U.S. Presidential Debates	Ruban Goyal et.al.	2511.11108	null
2025-11-14	CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation	Crystal Min Hui Poon et.al.	2511.11104	null
2025-11-14	CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding	Yifan Zhuang et.al.	2511.10935	null
2025-11-14	Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio	Guangke Chen et.al.	2511.10913	null
2025-11-13	Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces	Farhan Sheth et.al.	2511.10793	null
2025-11-13	Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning	Girish et.al.	2511.10790	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-13	VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction	Yuhao Wang et.al.	2511.10232	null
2025-11-13	Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard	Yudong Yang et.al.	2511.10222	null
2025-11-13	Towards Leveraging Sequential Structure in Animal Vocalizations	Eklavya Sarkar et.al.	2511.10190	link
2025-11-13	FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features	Wenyu Wang et.al.	2511.10112	null
2025-11-13	Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints	Xiangyue Zhang et.al.	2511.10076	null
2025-11-13	Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS	Haoyu Li et.al.	2511.09995	null
2025-11-13	MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection	Pritish Sahu et.al.	2511.09918	null
2025-11-12	Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages	Omnilingual ASR team et.al.	2511.09690	null
2025-11-12	End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering	Jiliang Hu et.al.	2511.09282	null
2025-11-10	Generating Novel and Realistic Speakers for Voice Conversion	Meiying Melissa Chen et.al.	2511.07135	null
2025-11-10	E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis	Zhisheng Zhang et.al.	2511.07099	link
2025-11-09	IDMap: A Pseudo-Speaker Generator Framework Based on Speaker Identity Index to Vector Mapping	Zeyan Liu et.al.	2511.06246	null
2025-11-07	Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice	Frederik Rautenberg et.al.	2511.05143	null
2025-11-05	Step-Audio-EditX Technical Report	Chao Yan et.al.	2511.03601	null
2025-11-05	PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech	Michel Wong et.al.	2511.03080	null
2025-11-04	Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision	Kaimeng Jia et.al.	2511.02270	null
2025-11-03	Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach	Cedric Chan et.al.	2511.02104	null
2025-10-31	Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication	Deok-Seon Kim et.al.	2510.27247	null
2025-10-27	SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution	Dharma Teja Donepudi et.al.	2510.25178	null
2025-10-28	Levée d'ambiguïtés par grammaires locales	Eric G. C. Laporte et.al.	2510.24530	null
2025-10-28	Bayesian Speech synthesizers Can Learn from Multiple Teachers	Ziyang Zhang et.al.	2510.24372	null
2025-10-28	emg2speech: synthesizing speech from electromyography using self-supervised speech models	Harshavardhana T. Gowda et.al.	2510.23969	null
2025-10-28	SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity	Hanke Xie et.al.	2510.23541	null
2025-10-26	UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models	Wenming Tu et.al.	2510.22588	null
2025-10-24	StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks	Jingyue Huang et.al.	2510.21685	null
2025-10-23	Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator	Hualei Wang et.al.	2510.20210	null
2025-10-23	SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance	Haowei Lou et.al.	2510.20113	null
2025-10-22	Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent	Yangshijie Zhang et.al.	2510.19641	null
2025-10-22	Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment	Maureen de Seyssel et.al.	2510.19509	null
2025-10-22	EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection	Tong Zhang et.al.	2510.19414	null
2025-10-21	StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction	Qianheng Xu et.al.	2510.18938	null
2025-10-21	KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers	Mohd Ruhul Ameen et.al.	2510.18355	null
2025-10-21	ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation	Haowei Lou et.al.	2510.18308	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	null
2025-10-18	Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages	Pacome Simon Mbonimpa et.al.	2510.16497	null
2025-10-22	VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition	Kye Shimizu et.al.	2510.16192	null
2025-10-16	RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF	Qing Yang et.al.	2510.14628	null
2025-10-15	InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue	Wenwen Tong et.al.	2510.13747	null
2025-10-15	Closing the Gap Between Text and Speech Understanding in LLMs	Santiago Cuervo et.al.	2510.13632	null
2025-10-15	Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models	Yizhou Peng et.al.	2510.13293	null
2025-10-14	Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs	Xinlu He et.al.	2510.12995	null
2025-10-14	Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation	Greta Damo et.al.	2510.12316	null
2025-10-15	DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation	Yakun Song et.al.	2510.12210	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	null
2025-10-14	ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis	Mohammad Javad Ranjbar Kalahroodi et.al.	2510.10774	null
2025-10-14	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	null
2025-10-10	Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models	Donghang Wu et.al.	2510.09592	null
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	null
2025-10-10	DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment	Zongcai Du et.al.	2510.09016	null
2025-10-09	DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching	Hanke Xie et.al.	2510.08373	null
2025-10-09	IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation	Wei Wang et.al.	2510.07979	null
2025-10-08	Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis	Zhu Li et.al.	2510.07096	null
2025-10-08	Towards Responsible Evaluation for Text-to-Speech	Yifan Yang et.al.	2510.06927	null
2025-10-08	XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection	Phuong Tuan Dat et.al.	2510.06706	null
2025-10-07	ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning	Tao Zhu et.al.	2510.05984	null
2025-10-07	Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech	Rikuto Kotoge et.al.	2510.05799	null
2025-10-07	Investigation of perception inconsistency in speaker embedding for asynchronous voice anonymization	Rui Wang et.al.	2510.05718	null
2025-10-07	Sparse deepfake detection promotes better disentanglement	Antoine Teissier et.al.	2510.05696	null
2025-10-07	Teaching Machines to Speak Using Articulatory Control	Akshay Anand et.al.	2510.05619	null
2025-10-06	Paper2Video: Automatic Video Generation from Scientific Papers	Zeyu Zhu et.al.	2510.05096	null
2025-10-06	Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba	Baher Mohammad et.al.	2510.04738	null
2025-10-06	UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models	Wenhao Guan et.al.	2510.04593	link
2025-10-05	GDiffuSE: Diffusion-based speech enhancement with noise model guidance	Efrayim Yanir et.al.	2510.04157	null
2025-10-05	A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation	Ananya Raghu et.al.	2510.03986	null
2025-10-07	Synthetic Audio Forensics Evaluation (SAFE) Challenge	Kirill Trapeznikov et.al.	2510.03387	null
2025-10-03	Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech	Hieu-Nghia Huynh-Nguyen et.al.	2510.02848	null
2025-10-02	Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement	Jianing Yang et.al.	2510.01722	link
2025-10-01	From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling	Yifei Cao et.al.	2510.00743	null
2025-10-02	MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance	Xingjian Zhao et.al.	2510.00499	null
2025-09-30	BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs	Yue Wang et.al.	2509.26514	null
2025-09-30	HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis	Ziyu Zhang et.al.	2509.25842	null
2025-09-30	LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning	Kang Yang et.al.	2509.25670	null
2025-09-29	Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization	Jiacheng Shi et.al.	2509.25416	null
2025-09-29	MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech	Chengyao Wang et.al.	2509.25131	null
2025-09-30	VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning	Xin Cheng et.al.	2509.24773	null
2025-09-29	VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning	Yixuan Zhou et.al.	2509.24650	null
2025-09-29	Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis	Tianrui Wang et.al.	2509.24629	null
2025-09-29	ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark	Yun Chen et.al.	2509.24570	null
2025-09-29	UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities	Xuenan Xu et.al.	2509.24391	null
2025-09-28	Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment	Pu Huang et.al.	2509.23618	null
2025-09-27	BFA: Real-time Multilingual Text-to-speech Forced Alignment	Abdul Rehman et.al.	2509.23147	null
2025-09-26	ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection	Mohamed Maged et.al.	2509.22808	null
2025-09-26	Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis	Zhikang Niu et.al.	2509.22167	null
2025-09-26	Speaker Anonymisation for Speech-based Suicide Risk Detection	Ziyun Cui et.al.	2509.22148	null
2025-09-26	Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling	Junjie Cao et.al.	2509.22062	null
2025-09-26	Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization	Shehzeen Hussain et.al.	2509.21718	null
2025-09-25	UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice	Sitong Cheng et.al.	2509.21144	link
2025-09-27	i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents	Anupam Purwar et.al.	2509.20971	null
2025-09-26	SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS	Tan Dat Nguyen et.al.	2509.20802	null
2025-09-24	Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens	Ismail Rasim Ulgen et.al.	2509.20485	null
2025-09-24	OLaPh: Optimal Language Phonemizer	Johannes Wirth et.al.	2509.20086	null
2025-09-25	Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration	Yifan Yang et.al.	2509.19928	null
2025-09-24	CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance	Junchuan Zhao et.al.	2509.19883	null
2025-09-24	Eliminating stability hallucinations in llm-based tts models via attention guidance	ShiMing Wang et.al.	2509.19852	null
2025-09-24	Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation	Yang Cui et.al.	2509.19812	null
2025-09-24	PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs	Pei Zhang et.al.	2509.19745	null
2025-09-24	Selective Classifier-free Guidance for Zero-shot Text-to-speech	John Zheng et.al.	2509.19668	null
2025-09-23	Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation	Roy Fejgin et.al.	2509.19592	null
2025-09-23	HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS	Sihang Nie et.al.	2509.19001	null
2025-09-23	Direct Preference Optimization for Speech Autoregressive Diffusion Models	Zhijun Liu et.al.	2509.18928	null
2025-09-23	Group Relative Policy Optimization for Text-to-Speech with Large Language Models	Chang Liu et.al.	2509.18798	null
2025-09-23	Explore the Reinforcement Learning for the LLM based ASR and TTS system	Changfeng Gao et.al.	2509.18569	null
2025-09-23	No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS	Seungyoun Shin et.al.	2509.18531	null
2025-09-22	Discrete-time diffusion-like models for speech synthesis	Xiaozhou Tan et.al.	2509.18470	null
2025-09-22	TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	Yutong Liu et.al.	2509.18060	null
2025-09-22	Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech	Zirui Li et.al.	2509.17988	null
2025-09-22	Qwen3-Omni Technical Report	Jin Xu et.al.	2509.17765	null
2025-09-22	Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook	Min Liu et.al.	2509.17516	null
2025-09-21	Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing	Wataru Nakata et.al.	2509.17052	null
2025-09-21	Bridging the gap between training and inference in LM-based TTS models	Ruonan Zhang et.al.	2509.17021	null
2025-09-21	MBCodec: Thorough disentangle for high-fidelity audio compression	Ruonan Zhang et.al.	2509.17006	null
2025-09-19	Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation	Qi Wang et.al.	2509.16010	null
2025-09-19	VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency	Nikita Torgashov et.al.	2509.15969	null
2025-09-19	Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS	Ziqi Dai et.al.	2509.15845	null
2025-09-19	LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control	Junki Ohmura et.al.	2509.15626	null
2025-09-19	Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech	Xinlei Niu et.al.	2509.15492	null
2025-09-18	A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication	Ryan Collette et.al.	2509.15462	null
2025-09-18	Frustratingly Easy Data Augmentation for Low-Resource ASR	Katsumi Ibaraki et.al.	2509.15373	null
2025-09-18	Real-Time Streaming Mel Vocoding with Generative Flow Matching	Simon Welker et.al.	2509.15085	null
2025-09-20	SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding	Bingsong Bai et.al.	2509.14946	link
2025-09-18	MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis	Keyu An et.al.	2509.14784	null
2025-09-18	DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis	Ye-Xin Lu et.al.	2509.14684	null
2025-09-18	Stochastic Clock Attention for Aligning Continuous and Ordered Sequences	Hyungjoon Soh et.al.	2509.14678	null
2025-09-18	SpeechMLC: Speech Multi-label Classification	Miseul Kim et.al.	2509.14677	null
2025-09-18	Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation	Miseul Kim et.al.	2509.14632	null
2025-09-18	Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis	Qingyu Liu et.al.	2509.14579	null
2025-09-17	CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset	Brian Yan et.al.	2509.14161	null
2025-09-18	Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems	Yi-Cheng Lin et.al.	2509.13989	null
2025-09-16	MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement	Jingyu Li et.al.	2509.13068	null
2025-09-16	A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis	Javeria Amir et.al.	2509.12831	null
2025-09-15	Preservation of Language Understanding Capabilities in Speech-aware Large Language Models	Marek Kubis et.al.	2509.12171	null
2025-09-14	FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs	Md Mubtasim Ahasan et.al.	2509.11425	null
2025-09-14	Length-Aware Rotary Position Embedding for Text-Speech Alignment	Hyeongju Kim et.al.	2509.11084	null
2025-09-12	WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers	Akshat Pandey et.al.	2509.10452	null
2025-09-12	Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps	Xin Wang et.al.	2509.10086	null
2025-09-11	DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration	Yanru Huo et.al.	2509.09748	null
2025-09-09	VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions	Jun Zhan et.al.	2509.09716	null
2025-09-12	DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech	Ngoc-Son Nguyen et.al.	2509.09631	null
2025-09-11	HISPASpoof: A New Dataset For Spanish Speech Forensics	Maria Risques et.al.	2509.09155	null
2025-09-10	Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities	Jarvis Haupt et.al.	2509.08950	null
2025-09-10	Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling	Neil Zeghidour et.al.	2509.08753	null
2025-09-10	Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching	Siratish Sakpiboonchit et.al.	2509.08696	null
2025-09-10	Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition	Jing-Tong Tzeng et.al.	2509.08470	null
2025-09-09	Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis	Yejin Jeon et.al.	2509.07376	null
2025-09-09	When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection	Bin Hu et.al.	2509.07323	null
2025-09-08	Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence	Yerin Ryu et.al.	2509.07038	null
2025-09-08	ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data	Vladislav Stankov et.al.	2509.06675	null
2025-09-09	Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake	Liping Chen et.al.	2509.06361	null
2025-09-07	UniVerse-1: Unified Audio-Video Generation via Stitching of Experts	Duomin Wang et.al.	2509.06155	null
2025-09-07	Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis	Zhenqi Jia et.al.	2509.06074	null
2025-09-06	LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization	Luis Felipe Chary et.al.	2509.05863	null
2025-09-05	Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales	Krittanon Kaewtawee et.al.	2509.04871	null
2025-09-04	Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding	Rui-Chen Zheng et.al.	2509.04685	null
2025-09-04	DarkStream: real-time speech anonymization with low latency	Waris Quamer et.al.	2509.04667	null
2025-09-04	AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds	Qizhou Wang et.al.	2509.04345	null
2025-09-04	Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis	Zhitong Zhou et.al.	2509.04093	null
2025-09-04	LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis	Gaspard Michel et.al.	2509.04072	null
2025-09-03	Multi-level SSL Feature Gating for Audio Deepfake Detection	Hoan My Tran et.al.	2509.03409	null
2025-09-03	LatPhon: Lightweight Multilingual G2P for Romance Languages and English	Luis Felipe Chary et.al.	2509.03300	null
2025-09-03	Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings	Dyah A. M. G. Wisnu et.al.	2509.03292	null
2025-09-03	AIVA: An AI-based Virtual Companion for Emotion-aware Interaction	Chenxi Li et.al.	2509.03212	null
2025-09-04	FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot	Kun Xie et.al.	2509.02020	null
2025-09-01	MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model	Joonyong Park et.al.	2509.01391	null
2025-09-01	The AudioMOS Challenge 2025	Wen-Chin Huang et.al.	2509.01336	null
2025-09-01	An AI-Based Shopping Assistant System to Support the Visually Impaired	Larissa R. de S. Shibata et.al.	2509.01246	null
2025-09-01	SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation	Chenyang Le et.al.	2509.01200	null
2025-08-31	MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech	Kangxiang Xia et.al.	2509.00685	null
2025-08-29	Towards Improved Speech Recognition through Optimized Synthetic Data Generation	Yanis Perrin et.al.	2508.21631	null
2025-08-28	MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening	Yongqi Shao et.al.	2508.20513	null
2025-08-26	Interpolating Speaker Identities in Embedding Space for Data Expansion	Tianchi Liu et.al.	2508.19210	null
2025-08-26	CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis	Chun Yat Wu et.al.	2508.19098	null
2025-08-25	Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters	Alessio Falai et.al.	2508.18006	null
2025-08-27	Vocoder-Projected Feature Discriminator	Takuhiro Kaneko et.al.	2508.17874	null
2025-09-02	Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation	Changsong Liu et.al.	2508.17796	null
2025-08-25	ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks	Yuanda Wang et.al.	2508.17660	null
2025-08-26	EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems	Jingwen Liu et.al.	2508.17623	null
2025-08-24	Improving French Synthetic Speech Quality via SSML Prosody Control	Nassima Ould Ouali et.al.	2508.17494	null
2025-08-23	RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer	Neeraj Matiyali et.al.	2508.17031	null
2025-08-23	WildSpoof Challenge Evaluation Plan	Yihan Wu et.al.	2508.16858	null
2025-08-22	TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling	Yuancheng Wang et.al.	2508.16790	link
2025-08-22	Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation	Weiting Tan et.al.	2508.16188	null
2025-08-21	QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection	Zhiyu Wu et.al.	2508.15931	null
2025-08-21	Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization	Liping Chen et.al.	2508.15565	null
2025-08-24	Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets	Chenlin Liu et.al.	2508.15442	null
2025-08-21	UniCoM: A Universal Code-Switching Speech Generator	Sangmin Lee et.al.	2508.15244	link
2025-08-20	Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization	Rui Wang et.al.	2508.14947	null
2025-08-20	Long-Context Speech Synthesis with Context-Aware Memory	Zhipeng Li et.al.	2508.14713	null
2025-08-20	Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement	Heitor R. Guimarães et.al.	2508.14709	null
2025-08-20	DiffIER: Optimizing Diffusion Models with Iterative Error Reduction	Ao Chen et.al.	2508.13628	null
2025-08-19	Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM	Dariia Puhach et.al.	2508.13603	null
2025-08-18	A Surveillance Based Interactive Robot	Kshitij Kavimandan et.al.	2508.13319	null
2025-08-18	Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis	Zhu Li et.al.	2508.13028	null
2025-08-18	Real-Time Sign Language Gestures to Speech Transcription using Deep Learning	Brandone Fonya et.al.	2508.12713	null
2025-08-19	FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts	Qingliang Meng et.al.	2508.12001	null
2025-08-16	SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System	Truong Thanh Hung Nguyen et.al.	2508.11873	null
2025-08-15	MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts	Heyang Xue et.al.	2508.11326	null
2025-08-15	EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens	Joonyong Park et.al.	2508.11273	null
2025-08-14	Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform	Yuankun Xie et.al.	2508.10559	link
2025-08-14	Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning	Yejin Jeon et.al.	2508.10412	null
2025-08-14	Towards Frame-level Quality Predictions of Synthetic Speech	Michael Kuhlmann et.al.	2508.10374	null
2025-08-08	LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data	Ali Zolnour et.al.	2508.10027	null
2025-08-15	Training-Free Multimodal Large Language Model Orchestration	Tianyu Xie et.al.	2508.10016	null
2025-08-13	Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions	Tina Raissi et.al.	2508.09868	null
2025-08-13	UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech	Shuhei Kato et.al.	2508.09767	null
2025-08-13	$\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation	Boyu Zhu et.al.	2508.09702	null
2025-08-12	Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative	Xi Xuan et.al.	2508.09294	null
2025-08-13	DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models	Yuanyuan Wang et.al.	2508.08961	null
2025-08-12	QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems	Chien-Chun Wang et.al.	2508.08957	null
2025-08-15	MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs	Xiaoxue Gao et.al.	2508.08715	null
2025-08-12	Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization	Chaoqun Cui et.al.	2508.08550	null
2025-08-11	Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder?	Hui-Peng Du et.al.	2508.07711	null
2025-08-10	Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance	Wenqian Cui et.al.	2508.07375	link
2025-08-10	KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features	Ivan Kukanov et.al.	2508.07337	null
2025-08-12	XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation	Tianlun Zuo et.al.	2508.07302	null
2025-08-09	Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody	Jinsung Yoon et.al.	2508.06890	null
2025-08-09	Text to Speech System for Meitei Mayek Script	Gangular Singh Irengbam et.al.	2508.06870	null
2025-08-08	ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls	Sanket Badhe et.al.	2508.06457	null
2025-08-08	Improved Dysarthric Speech to Text Conversion via TTS Personalization	Péter Mihajlik et.al.	2508.06391	null
2025-08-08	Large Language Model Data Generation for Enhanced Intent Recognition in German Speech	Theresa Pekarek Rosin et.al.	2508.06277	null
2025-08-08	Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis	Wenjie Tian et.al.	2508.06262	null
2025-08-07	A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding	Runchuan Ye et.al.	2508.05385	null
2025-08-15	Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS	M Anuprabha et.al.	2508.05102	null
2025-08-06	Root Cause Analysis Training for Healthcare Professionals With AI-Powered Virtual Simulation: A Proof-of-Concept	Yuqi Hu et.al.	2508.04904	null
2025-08-05	Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS	Vignesh Ethiraj et.al.	2508.04721	null
2025-08-07	UniTalker: Conversational Speech-Visual Synthesis	Yifan Hu et.al.	2508.04585	null
2025-08-06	NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations	Huan Liao et.al.	2508.04195	null
2025-08-06	Multilingual Source Tracing of Speech Deepfakes: A First Benchmark	Xi Xuan et.al.	2508.04143	null
2025-08-06	Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech	Jingyuan Xing et.al.	2508.04141	null
2025-08-06	EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering	Tianxin Xie et.al.	2508.03543	null
2025-08-05	MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction	Mohammed Salah Al-Radhi et.al.	2508.03166	link
2025-08-05	Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback	Jingyi Chen et.al.	2508.03123	null
2025-08-14	Marco-Voice Technical Report	Fengping Tian et.al.	2508.02038	null
2025-08-03	Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder	Runxuan Yang et.al.	2508.01796	null
2025-08-03	Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe	Tiantian Feng et.al.	2508.01691	null
2025-08-01	Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities	Wen-Chin Huang et.al.	2508.00317	null
2025-08-01	Next Tokens Denoising for Speech Synthesis	Yanqing Liu et.al.	2507.22746	null
2025-07-30	Adaptive Duration Model for Text Speech Alignment	Junjie Cao et.al.	2507.22612	null
2025-07-29	SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods	Wen Huang et.al.	2507.21463	null
2025-07-23	WaveVerify: A Novel Audio Watermarking Framework for Media Authentication and Combatting Deepfakes	Aditya Pujari et.al.	2507.21150	null
2025-07-22	TTS-1 Technical Report	Oleg Atamanenko et.al.	2507.21138	null
2025-07-29	JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1	Xinhan Di et.al.	2507.20987	null
2025-07-28	AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations	Zhixi Cai et.al.	2507.20579	null
2025-07-27	Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech	Taesoo Kim et.al.	2507.20140	null
2025-07-26	Defining ethically sourced code generation	Zhuolin Xu et.al.	2507.19743	null
2025-07-25	GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness	Hongjie Chen et.al.	2507.18119	null
2025-07-24	Synthetic Data Generation for Phrase Break Prediction with Large Language Model	Hoyeon Lee et.al.	2507.18044	null
2025-07-23	AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer	Danny D. Leybzon et.al.	2507.17718	null
2025-07-23	Synthetic Voice Data for Automatic Speech Recognition in African Languages	Brian DeRenzi et.al.	2507.17578	null
2025-07-23	BoSS: Beyond-Semantic Speech	Qing Wang et.al.	2507.17563	null
2025-07-27	Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice	Shanbo Cheng et.al.	2507.17527	null
2025-07-22	SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling	Yi Guo et.al.	2507.16884	null
2025-07-22	Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages	Isha Pandey et.al.	2507.16875	null
2025-07-15	Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems	Nima Yazdani et.al.	2507.16835	null
2025-07-21	A2TTS: TTS for Low Resource Indian Languages	Ayush Singh Bhadoriya et.al.	2507.15272	null
2025-07-21	EchoVoices: Preserving Generational Voices and Memories for Seniors and Children	Haiying Xu et.al.	2507.15221	null
2025-07-22	Hear Your Code Fail, Voice-Assisted Debugging for Python	Sayed Mahbub Hasan Amiri et.al.	2507.15007	null
2025-07-20	DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis	Yinghao Aaron Li et.al.	2507.14988	null
2025-07-20	FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing	Shoutao Guo et.al.	2507.14815	null
2025-07-17	A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models	Kirill Borodin et.al.	2507.13563	null
2025-07-17	NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech	Maksim Borisov et.al.	2507.13155	null
2025-07-17	Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication	Tianyu Song et.al.	2507.13052	null
2025-07-17	Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes	Zhou Feng et.al.	2507.12932	null
2025-07-16	Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations	Yichen Han et.al.	2507.12197	null
2025-07-16	EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis	Haoxun Li et.al.	2507.12015	null
2025-07-15	Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection	Ivan Viakhirev et.al.	2507.11777	null
2025-07-25	P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge	Marvin Sach et.al.	2507.11306	null
2025-07-20	Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition	Mengzhe Geng et.al.	2507.10827	null
2025-07-14	An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments	Mikko Korkiakoski et.al.	2507.10469	null
2025-07-14	DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis	Wenjie Tian et.al.	2507.10109	null
2025-07-12	ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching	Han Zhu et.al.	2507.09318	null
2025-07-12	Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning	Dominika Woszczyk et.al.	2507.09310	null
2025-07-12	ClaritySpeech: Dementia Obfuscation in Speech	Dominika Woszczyk et.al.	2507.09282	link
2025-07-19	Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition	Bingshen Mu et.al.	2507.09116	null
2025-07-11	SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment	Shivam Mehta et.al.	2507.09070	null
2025-07-11	Exploiting Leaderboards for Large-Scale Distribution of Malicious Models	Anshuman Suri et.al.	2507.08983	null
2025-07-06	A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting	Niranjan Mallikarjun Sindhur et.al.	2507.08832	null
2025-07-11	Unlocking Speech Instruction Data Potential with Query Rewriting	Yonghua Hei et.al.	2507.08603	null
2025-07-11	MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling	Jingjing Tang et.al.	2507.08530	null
2025-07-11	Active Learning for Text-to-Speech Synthesis with Informative Sample Collection	Kentaro Seki et.al.	2507.08319	null
2025-07-05	RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning	Atli Sigurgeirsson et.al.	2507.08012	null
2025-07-10	SecureSpeech: Prompt-based Speaker and Content Protection	Belinda Soh Hui Hui et.al.	2507.07799	null
2025-07-09	STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation	Wenxiang Guo et.al.	2507.06670	null
2025-07-09	Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents	Zackary Rackauckas et.al.	2507.06483	null
2025-07-08	Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis	Xintong Hu et.al.	2507.06116	null
2025-07-08	Differentiable Reward Optimization for LLM based TTS system	Changfeng Gao et.al.	2507.05911	null
2025-07-08	OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model	Chen Wang et.al.	2507.05177	null
2025-07-07	LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning	Sandipan Dhar et.al.	2507.04966	null
2025-07-07	Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis	Sho Inoue et.al.	2507.04598	null
2025-07-06	TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet	Jaeseok Jeong et.al.	2507.04349	null
2025-07-05	PresentAgent: Multimodal Agent for Presentation Video Generation	Jingwei Shi et.al.	2507.04036	null
2025-07-05	Prosody Labeling with Phoneme-BERT and Speech Foundation Models	Tomoki Koriyama et.al.	2507.03912	null
2025-07-05	Traceable TTS: Toward Watermark-Free TTS with Strong Traceability	Yuxiang Zhao et.al.	2507.03887	null
2025-07-14	DeepGesture: A conversational gesture synthesis system based on emotions and semantics	Thanh Hoang-Minh et.al.	2507.03147	null
2025-07-03	De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks	Wei Fan et.al.	2507.02606	null
2025-07-03	Open-Source System for Multilingual Translation and Cloned Speech Synthesis	Mateo Cámara et.al.	2507.02530	null
2025-07-03	JoyTTS: LLM-based Spoken Chatbot With Voice Cloning	Fangru Zhou et.al.	2507.02380	null
2025-07-02	Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis	Marc-André Carbonneau et.al.	2507.02176	null
2025-07-04	Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams	Zirui Li et.al.	2507.02115	null
2025-07-02	A Dataset for Automatic Assessment of TTS Quality in Spanish	Alejandro Sosa Welford et.al.	2507.01805	link
2025-07-02	Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora	Hitoshi Suda et.al.	2507.01356	null
2025-07-08	SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech	Zhuangfei Cheng et.al.	2507.01348	null
2025-07-02	Multi-interaction TTS toward professional recording reproduction	Hiroki Kanagawa et.al.	2507.00808	null
2025-07-01	MuteSwap: Silent Face-based Voice Conversion	Yifan Liu et.al.	2507.00498	null
2025-06-30	Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges	Hashim Ali et.al.	2507.00324	null
2025-06-30	Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis	Paul Mayer et.al.	2507.00227	null
2025-07-01	StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding	Dake Guo et.al.	2506.23986	null
2025-06-30	Efficient Interleaved Speech Modeling through Knowledge Distillation	Mohammadmahdi Nouriborji et.al.	2506.23670	null
2025-06-30	JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching	Mingi Kwon et.al.	2506.23552	null
2025-06-29	You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties	Paige Tuttösí et.al.	2506.23367	null
2025-06-27	DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding	Yang Yang et.al.	2506.22362	null
2025-06-27	Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration	Noora Sassali et.al.	2506.22116	null
2025-06-27	Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy	Bohan Li et.al.	2506.22023	null
2025-06-23	IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech	Siyi Zhou et.al.	2506.21619	null
2025-06-26	SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture	Kehan Sui et.al.	2506.21478	null
2025-06-26	A Multi-Stage Framework for Multimodal Controllable Speech Synthesis	Rui Niu et.al.	2506.20945	null
2025-06-25	An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS	Marie Kunešová et.al.	2506.20190	null
2025-06-24	TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems	Christoph Minixhofer et.al.	2506.19441	null
2025-06-23	Selecting N-lowest scores for training MOS prediction models	Yuto Kondo et.al.	2506.18326	null
2025-06-23	Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting	Yuto Kondo et.al.	2506.18307	null
2025-06-23	JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles	Yuto Kondo et.al.	2506.18296	null
2025-06-21	OpusLM: A Family of Open Unified Speech Language Models	Jinchuan Tian et.al.	2506.17611	null
2025-06-20	RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching	Hyun Joon Park et.al.	2506.16741	null
2025-06-20	LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization	Daejin Jo et.al.	2506.16738	null
2025-06-20	V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos	Qixin Wang et.al.	2506.16716	null
2025-06-19	Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement	Tuan-Nam Nguyen et.al.	2506.16580	null
2025-06-19	InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems	Kexin Huang et.al.	2506.16381	link
2025-06-19	Optimizing Multilingual Text-To-Speech with Accents & Emotions	Pranav Pawar et.al.	2506.16310	null
2025-06-19	Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching	Shoutrik Das et.al.	2506.16127	null
2025-06-19	VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge	Zijing Zhao et.al.	2506.16020	null
2025-06-18	TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data	Kentaro Seki et.al.	2506.15614	null
2025-06-18	PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction	Shufan Li et.al.	2506.15556	null
2025-06-18	Factorized RVQ-GAN For Disentangled Speech Tokenization	Sameer Khurana et.al.	2506.15456	null
2025-06-18	EmojiVoice: Towards long-term controllable expressivity in robot speech	Paige Tuttösí et.al.	2506.15085	null
2025-06-18	An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW	Prateek Mehta et.al.	2506.15029	null
2025-06-25	SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling	Tawsif Ahmed et.al.	2506.14293	null
2025-06-17	Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification	Yiyang Zhao et.al.	2506.14226	null
2025-06-17	Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models	Tuan Dat Phuong et.al.	2506.14153	link
2025-06-16	EmoNews: A Spoken Dialogue System for Expressive News Conversations	Ryuki Matsuura et.al.	2506.13894	link
2025-06-16	From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars	Pegah Salehi et.al.	2506.13477	null
2025-06-20	ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Han Zhu et.al.	2506.13053	link
2025-06-14	StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling	Hui Wang et.al.	2506.12570	null
2025-06-14	Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction	Xiaoran Fan et.al.	2506.12537	null
2025-06-14	Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech	Yakov Kolani et.al.	2506.12311	null
2025-06-11	S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder	Yu Pan et.al.	2506.11160	null
2025-06-16	A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data	Cheng-Kang Chou et.al.	2506.11130	null
2025-06-10	GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions	Wenkang Han et.al.	2506.11127	null
2025-06-10	ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams	Freddie Grabovski et.al.	2506.11125	null
2025-06-05	Intelligibility of Text-to-Speech Systems for Mathematical Expressions	Sujoy Roychowdhury et.al.	2506.11086	null
2025-06-12	Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs	Hayato Futami et.al.	2506.10299	null
2025-06-06	A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations	Tian Lan et.al.	2506.10019	null
2025-06-11	UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching	Neta Glazer et.al.	2506.09874	null
2025-06-15	EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection	Christoph Schuhmann et.al.	2506.09827	null
2025-06-11	OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment	Chao-Hong Tan et.al.	2506.09349	link
2025-06-11	Ming-Omni: A Unified Multimodal Model for Perception and Generation	Inclusion AI et.al.	2506.09344	link
2025-06-13	Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model	Ailin Huang et.al.	2506.08967	null
2025-06-10	A Review on Score-based Generative Models for Audio Applications	Ge Zhu et.al.	2506.08457	null
2025-06-09	Seeing Voices: Generating A-Roll Video from Audio with Mirage	Aditi Sundararaman et.al.	2506.08279	null
2025-06-09	Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation	Rui Hu et.al.	2506.07646	null
2025-06-10	Towards Generalized Source Tracing for Codec-Based Deepfake Speech	Xuanjun Chen et.al.	2506.07294	null
2025-06-07	SynHate: Detecting Hate Speech in Synthetic Deepfake Audio	Rishabh Ranjan et.al.	2506.06772	null
2025-06-06	Audio-Aware Large Language Models as Judges for Speaking Styles	Cheng-Han Chiang et.al.	2506.05984	null
2025-06-09	Voice Impression Control in Zero-Shot TTS	Keinichi Fujita et.al.	2506.05688	null
2025-06-05	Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning	Hien Ohnaka et.al.	2506.04527	null
2025-06-04	Can we reconstruct a dysarthric voice with the large speech model Parler TTS?	Ariadna Sanchez et.al.	2506.04397	null
2025-06-04	HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset	Ryan Langman et.al.	2506.04152	null
2025-06-04	UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation	Jinting Wang et.al.	2506.04134	null
2025-06-04	A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Chung-Chun Wang et.al.	2506.04077	null
2025-06-04	Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages	Utkarsh Pathak et.al.	2506.03884	null
2025-06-04	Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts	Sidharth Pulipaka et.al.	2506.03793	null
2025-06-04	Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments	Reo Yoneyama et.al.	2506.03554	null
2025-06-04	BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing	Masaya Kawamura et.al.	2506.03515	null
2025-06-03	Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation	Yongqi Wang et.al.	2506.02997	null
2025-06-03	Towards a Japanese Full-duplex Spoken Dialogue System	Atsumoto Ohashi et.al.	2506.02979	null
2025-06-03	PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing	You Zhang et.al.	2506.02958	null
2025-06-03	CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech	Helin Wang et.al.	2506.02863	link
2025-06-03	Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions	Xiaoxue Gao et.al.	2506.02742	null
2025-06-03	StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion	Fengjin Li et.al.	2506.02414	null
2025-06-03	SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning	Zhengyuan Liu et.al.	2506.02412	null
2025-06-03	Trusted Fake Audio Detection Based on Dirichlet Distribution	Chi Ding et.al.	2506.02401	null
2025-06-02	Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi	Arnav Rustagi et.al.	2506.02166	null
2025-06-02	SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction	Saurabh Agrawal et.al.	2506.02082	null
2025-06-02	Universal Preference-Score-based Pairwise Speech Quality Assessment	Yu-Fei Shi et.al.	2506.01455	null
2025-06-02	Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages	Andrei Popescu-Belis et.al.	2506.01406	null
2025-06-02	Zero-Shot Text-to-Speech for Vietnamese	Thi Vu et.al.	2506.01322	null
2025-06-02	CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction	Yudong Lu et.al.	2506.01268	null
2025-06-02	WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing	Yu Nakagome et.al.	2506.01263	null
2025-06-01	Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations	Girish et.al.	2506.01157	null
2025-06-01	DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation	Ming Meng et.al.	2506.01020	null
2025-06-01	Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching	Jialong Zuo et.al.	2506.01014	null
2025-06-01	CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching	Leying Zhang et.al.	2506.00885	null
2025-06-01	Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models	Kyowoon Lee et.al.	2506.00832	null
2025-05-30	ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation	Jiatong Shi et.al.	2505.24518	null
2025-05-30	Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation	Wenrui Liu et.al.	2505.24496	null
2025-05-30	DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec	Peijie Chen et.al.	2505.24314	null
2025-05-29	Can Emotion Fool Anti-spoofing?	Aurosweta Mahapatra et.al.	2505.23962	null
2025-05-29	Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes	Neta Glazer et.al.	2505.23619	link
2025-05-29	EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge	Ruskin Raj Manku et.al.	2505.23009	link
2025-05-29	LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting	Pai Zhu et.al.	2505.22995	null
2025-05-28	BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models	Susan Liang et.al.	2505.22865	null
2025-05-28	Tell me Habibi, is it Real or Fake?	Kartik Kuckreja et.al.	2505.22581	null
2025-05-28	A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity	Charlotte Pouw et.al.	2505.22236	null
2025-05-27	Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech	Nam-Gyu Kim et.al.	2505.20868	null
2025-05-26	ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis	Hawau Olamide Toyin et.al.	2505.20506	null
2025-05-26	Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling	Qixi Zheng et.al.	2505.19931	null
2025-05-26	DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech	Deok-Hyeon Cho et.al.	2505.19687	link
2025-05-26	KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization	Zhaolin Li et.al.	2505.19679	null
2025-06-02	Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	Haiyang Sun et.al.	2505.19669	null
2025-05-30	Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment	Jeongsoo Choi et.al.	2505.19595	link
2025-05-26	GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor	Seokgi Lee et.al.	2505.19384	null
2025-05-25	SpeakStream: Streaming Text-to-Speech with Interleaved Data	Richard He Bai et.al.	2505.19206	null
2025-05-25	CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning	Renyuan Li et.al.	2505.19119	null
2025-05-25	Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	Minsu Kim et.al.	2505.18972	null
2025-05-27	RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations	Ashwin Sankar et.al.	2505.18609	null
2025-05-24	MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt	Zhichao Wu et.al.	2505.18453	null
2025-05-27	CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	Zhihao Du et.al.	2505.17589	null
2025-05-23	What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection	Binh Nguyen et.al.	2505.17513	null
2025-05-23	UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information	Rui Wang et.al.	2505.17426	link
2025-05-23	Speechless: Speech Instruction Training Without Speech for Low Resource Languages	Alan Dao et.al.	2505.17417	link
2025-05-22	Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2	Zackary Rackauckas et.al.	2505.17320	null
2025-05-21	Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech	Yejin Lee et.al.	2505.17093	null
2025-05-20	Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English	Haoyang Zhang et.al.	2505.17076	null
2025-05-22	From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition	Tianduo Wang et.al.	2505.16972	link
2025-05-22	MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing	Junjie Zheng et.al.	2505.16279	null
2025-05-21	MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling	Yifan Cheng et.al.	2505.15772	null
2025-05-21	Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information	Nicholas Sanders et.al.	2505.15667	null
2025-05-21	Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	Zirui Song et.al.	2505.15406	link
2025-05-21	Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning	Junchuan Zhao et.al.	2505.15402	null
2025-05-21	Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding	Zijian Lin et.al.	2505.15380	null
2025-05-20	TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis	Yu Zhang et.al.	2505.14910	link
2025-05-20	Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits	Tiantian Feng et.al.	2505.14648	link
2025-05-20	Pairwise Evaluation of Accent Similarity in Speech Synthesis	Jinzuomu Zhong et.al.	2505.14410	null
2025-05-20	FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	Yutong Liu et.al.	2505.14351	null
2025-05-21	AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models	Guangke Chen et.al.	2505.14103	null
2025-05-20	SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement	Kuan-Yu Chen et.al.	2505.14066	null
2025-05-23	U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding	Ziqian Wang et.al.	2505.13880	link
2025-05-22	Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising	Ye-Xin Lu et.al.	2505.13830	null
2025-05-20	Articulatory Feature Prediction from Surface EMG during Speech Production	Jihwan Lee et.al.	2505.13814	null
2025-05-19	Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space	Zhengrui Ma et.al.	2505.13181	link
2025-05-19	DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation	Jiaqi Li et.al.	2505.13000	link
2025-05-19	Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy	Xuanjun Chen et.al.	2505.12994	link
2025-05-19	OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching	Hieu-Nghia Huynh-Nguyen et.al.	2505.12800	null
2025-05-19	RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations	Seungmin Kim et.al.	2505.12686	null
2025-05-19	Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis	Yifan Hu et.al.	2505.12597	link
2025-05-18	Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis	Dong Yang et.al.	2505.12226	null
2025-05-16	LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models	Danilo de Oliveira et.al.	2505.11391	null
2025-05-16	Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese	Xihuai Wang et.al.	2505.11200	null
2025-05-16	BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset	Istiaq Ahmed Fahad et.al.	2505.10885	link
2025-05-15	UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech	Jiaxuan Liu et.al.	2505.10599	null
2025-05-14	SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset	Yicheng Gu et.al.	2505.09325	null
2025-05-14	DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis	Zeeshan Ahmad et.al.	2505.09091	null
2025-05-13	Investigating self-supervised features for expressive, multilingual voice conversion	Álvaro Martín-Cortinas et.al.	2505.08278	null
2025-05-12	MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder	Bowen Zhang et.al.	2505.07916	null
2025-05-12	Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	Biel Tura Vecino et.al.	2505.07701	null
2025-05-10	VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback	Eason Chen et.al.	2505.06676	null
2025-05-10	Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation	Abbas Bertina et.al.	2505.06599	null
2025-05-15	FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech	Linhan Ma et.al.	2505.05159	null
2025-05-08	Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations	Linrong Pan et.al.	2505.05056	null
2025-05-08	A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration	Shaja Arul Selvamani et.al.	2505.04885	null
2025-05-07	Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment	Xueyao Zhang et.al.	2505.04113	null
2025-05-06	VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model	Zuwei Long et.al.	2505.03739	link
2025-05-13	SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation	Yu-Ren Guo et.al.	2505.03244	null
2025-05-05	Generating Narrated Lecture Videos from Slides with Synchronized Highlights	Alexander Holmberg et.al.	2505.02966	null
2025-05-05	Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play	Yemin Shi et.al.	2505.02707	link
2025-05-05	LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis	Qingkai Fang et.al.	2505.02625	link
2025-04-30	Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks	Chaoyi Wang et.al.	2505.01450	null
2025-04-30	Sadeed: Advancing Arabic Diacritization Through Small Language Model	Zeina Aldallal et.al.	2504.21635	null
2025-04-29	AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation	Jeongsoo Choi et.al.	2504.20629	null
2025-04-29	ClonEval: An Open Voice Cloning Benchmark	Iwona Christop et.al.	2504.20581	link
2025-05-02	Towards Flow-Matching-based TTS without Classifier-Free Guidance	Yuzhe Liang et.al.	2504.20334	null
2025-04-27	Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements	Sandipan Dhar et.al.	2504.19197	null
2025-04-27	Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget	Xin Li et.al.	2504.19146	link
2025-04-22	FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning	Ju Yeon Kang et.al.	2504.15663	null
2025-04-22	A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Gengxian Cao et.al.	2504.15552	null
2025-04-21	SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation	Yue Li et.al.	2504.15035	null
2025-04-20	DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue	Xiang Li et.al.	2504.14482	link
2025-04-18	ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents	Takuya Sera et.al.	2504.13793	null
2025-04-18	Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion	Sandipan Dhar et.al.	2504.13791	null
2025-04-22	EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting	Guanrou Yang et.al.	2504.12867	null
2025-04-15	GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture	Yaodong Song et.al.	2504.12339	null
2025-04-15	Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation	Yan Rong et.al.	2504.11002	null
2025-04-15	Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy	Botao Zhao et.al.	2504.10819	null
2025-04-14	Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Yifan Yang et.al.	2504.10352	null
2025-04-14	AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis	Dan Luo et.al.	2504.10309	link
2025-04-14	SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis	Zhisheng Zhang et.al.	2504.09839	link
2025-04-12	"It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services	Shira Michel et.al.	2504.09346	null
2025-04-12	AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis	Yubing Cao et.al.	2504.09225	null
2025-04-17	SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning	Prabhat Pandey et.al.	2504.09081	null
2025-04-11	Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation	Haowei Lou et.al.	2504.08274	null
2025-04-10	Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis	Yizhong Geng et.al.	2504.07858	null
2025-04-10	SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow	Kaidi Wang et.al.	2504.07776	null
2025-04-08	AVENet: Disentangling Features by Approximating Average Features for Voice Conversion	Wenyu Wang et.al.	2504.05833	null
2025-04-07	P2Mark: Plug-and-play Parameter-intrinsic Watermarking for Neural Speech Generation	Yong Ren et.al.	2504.05197	null
2025-04-07	SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation	Stephen Brade et.al.	2504.05106	null
2025-04-04	RWKVTTS: Yet another TTS based on RWKV-7	Lin yueyu et.al.	2504.03289	link
2025-04-09	F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization	Xiaohui Sun et.al.	2504.02407	link
2025-04-03	VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models	Kim Sung-Bin et.al.	2504.02386	null
2025-03-31	SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation	Ngoc Dung Huynh et.al.	2503.24164	null
2025-04-02	TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection	Zhiming Ma et.al.	2503.24115	link
2025-03-31	SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development	Minghan Wang et.al.	2503.23848	link
2025-03-31	DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance	Junjie Zheng et.al.	2503.23660	null
2025-03-30	Speculative End-Turn Detector for Efficient Speech Chatbot Assistant	Hyunjong Ok et.al.	2503.23439	null
2025-03-29	SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System	Hyeongju Kim et.al.	2503.23108	null
2025-03-26	Dual Audio-Centric Modality Coupling for Talking Head Generation	Ao Fu et.al.	2503.22728	null
2025-03-28	Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators	Andrew Ustinov et.al.	2503.22503	link
2025-03-28	DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation	Haomin Zhang et.al.	2503.22265	null
2025-03-26	Text-Driven Voice Conversion via Latent State-Space Modeling	Wen Li et.al.	2503.20999	null
2025-03-28	FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System	Hao-Han Guo et.al.	2503.20499	null
2025-03-26	Qwen2.5-Omni Technical Report	Jin Xu et.al.	2503.20215	null
2025-03-21	Measuring the Robustness of Audio Deepfake Detectors	Xiang Li et.al.	2503.17577	link
2025-03-21	Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication	Yiwen Xu et.al.	2503.17479	null
2025-03-21	From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech	Ji-Hoon Kim et.al.	2503.16956	null
2025-03-20	WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching	Tianze Luo et.al.	2503.16689	link
2025-03-10	VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection	Kunal Chavan et.al.	2503.16488	null
2025-03-19	Shushing! Let's Imagine an Authentic Speech from the Silent Video	Jiaxin Ye et.al.	2503.14928	null
2025-03-19	MoonCast: High-Quality Zero-Shot Podcast Generation	Zeqian Ju et.al.	2503.14345	link
2025-03-26	InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being	Guang Dai et.al.	2503.14257	null
2025-03-15	Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations	Xue Jiang et.al.	2503.12115	null
2025-03-14	MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation	Sungwoo Cho et.al.	2503.11026	null
2025-03-14	Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models	Sebastian Möller et.al.	2503.10298	null
2025-03-11	An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR	Sewade Ogun et.al.	2503.08954	null
2025-03-09	ProSE: Diffusion Priors for Speech Enhancement	Sonal Kumar et.al.	2503.06375	null
2025-03-07	DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility	Yifan Liu et.al.	2503.05223	link
2025-03-03	Direct Speech to Speech Translation: A Review	Mohammad Sarim et.al.	2503.04799	null
2025-03-06	LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM	Sambal Shikhar et.al.	2503.04724	null
2025-03-06	Scaling Rich Style-Prompted Text-to-Speech Datasets	Anuj Diwan et.al.	2503.04713	link
2025-03-05	Good practices for evaluation of synthesized speech	Erica Cooper et.al.	2503.03250	null
2025-03-04	InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training	Dingdong Wang et.al.	2503.02769	null
2025-03-03	Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens	Xinsheng Wang et.al.	2503.01710	link
2025-03-03	Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology	Birger Moell et.al.	2503.01266	null
2025-03-02	Language-agnostic, automated assessment of listeners' speech recall using large language models	Björn Herrmann et.al.	2503.01045	null
2025-03-02	UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation	Alexander H. Liu et.al.	2503.00733	null
2025-03-01	PodAgent: A Comprehensive Framework for Podcast Generation	Yujia Xiao et.al.	2503.00455	link
2025-03-12	Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale	Max M. Lang et.al.	2502.20140	null
2025-02-27	DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models	Weihao wu et.al.	2502.19924	null
2025-03-04	Speculative Decoding and Beyond: An In-Depth Survey of Techniques	Yunhai Hu et.al.	2502.19732	null
2025-02-26	Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis	Ziyue Jiang et.al.	2502.18924	null
2025-03-08	Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding	Tianyun Liu et.al.	2502.18889	null
2025-02-24	Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction	Tianpeng Li et.al.	2502.17239	link
2025-02-24	Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Jiatong Shi et.al.	2502.16897	null
2025-02-18	AV-Flow: Transforming Text to Audio-Visual Human-like Interactions	Aggelina Chatziagapi et.al.	2502.13133	null
2025-02-18	High-Fidelity Music Vocoder using Neural Audio Codecs	Luca A. Lanzendörfer et.al.	2502.12759	null
2025-02-18	TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching	Wenxiang Guo et.al.	2502.12572	link
2025-02-18	A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond	Shreya Shukla et.al.	2502.12048	null
2025-02-17	NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing	Yifan Liang et.al.	2502.12002	null
2025-02-16	FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching	Hui Wang et.al.	2502.11128	null
2025-02-16	SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer	Zhengyan Sheng et.al.	2502.11094	null
2025-02-14	VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect	Qingyuan Fei et.al.	2502.10329	null
2025-02-13	TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument	Kyungsu Kim et.al.	2502.08939	link
2025-03-02	ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Xin Wang et.al.	2502.08857	null
2025-02-11	LoRP-TTS: Low-Rank Personalized Text-To-Speech	Łukasz Bondaruk et.al.	2502.07562	null
2025-02-11	Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction	Leying Zhang et.al.	2502.07345	null
2025-02-11	Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement	Xueyao Zhang et.al.	2502.07243	null
2025-02-10	Synthetic Audio Helps for Cognitive State Tasks	Adil Soubki et.al.	2502.06922	link
2025-02-16	Recent Advances in Discrete Speech Tokens: A Review	Yiwei Guo et.al.	2502.06490	null
2025-02-19	Speech to Speech Translation with Translatotron: A State of the Art Review	Jules R. Kala et.al.	2502.05980	null
2025-02-09	Non-invasive electromyographic speech neuroprosthesis: a geometric perspective	Harshavardhana T. Gowda et.al.	2502.05762	null
2025-02-09	BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting	Mohammad Jahid Ibna Basher et.al.	2502.05729	null
2025-02-08	Gender Bias in Instruction-Guided Speech Synthesis Models	Chun-Yi Kuan et.al.	2502.05649	null
2025-02-08	IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System	Wei Deng et.al.	2502.05512	link
2025-02-07	Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance	Shehzeen Hussain et.al.	2502.05236	null
2025-02-12	Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment	Zuyan Liu et.al.	2502.04328	link
2025-02-06	Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis	Zhen Ye et.al.	2502.04128	link
2025-02-14	DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation	Dongya Jia et.al.	2502.03930	null
2025-02-05	Metis: A Foundation Speech Generation Model with Masked Generative Pre-training	Yuancheng Wang et.al.	2502.03128	link
2025-02-05	Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech	Jixun Yao et.al.	2502.02950	null
2025-02-04	Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet	Shenran Wang et.al.	2502.02703	link
2025-02-04	Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation	Peidong Wang et.al.	2502.02683	null
2025-02-03	Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis	Weiwei Lin et.al.	2502.01084	null
2025-02-02	EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis	Junuk Cha et.al.	2502.00654	null
2025-01-31	VisualSpeech: Enhance Prosody with Visual Context in TTS	Shumin Que et.al.	2501.19258	null
2025-01-29	BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights	Chan-Jan Hsu et.al.	2501.17790	null
2025-02-09	CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs	Amey Hengle et.al.	2501.17581	null
2025-01-28	Compact Neural TTS Voices for Accessibility	Kunal Jain et.al.	2501.17332	null
2025-01-27	Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation	Haorui He et.al.	2501.15907	link
2025-01-26	Overview of the Amphion Toolkit (v0.2)	Jiaqi Li et.al.	2501.15442	link
2025-01-24	Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models	Tianrui Wang et.al.	2501.14273	null
2025-01-24	Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation	Wen Huang et.al.	2501.14240	null
2025-01-24	LoCoML: A Framework for Real-World ML Inference Pipelines	Kritin Maddireddy et.al.	2501.14165	null
2025-01-23	Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference	Shuqi Dai et.al.	2501.13870	null
2025-01-23	Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement	Jae-Sung Bae et.al.	2501.13372	null
2025-01-21	A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data	Minh Tran et.al.	2501.12501	null
2025-01-20	A Non-autoregressive Model for Joint STT and TTS	Vishal Sunder et.al.	2501.09104	null
2025-01-15	Speech Synthesis along Perceptual Voice Quality Dimensions	Frederik Rautenberg et.al.	2501.08791	null
2025-01-15	Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification	Li Zhang et.al.	2501.08691	null
2025-01-15	Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement	Qianniu Chen et.al.	2501.08566	null
2025-01-14	CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset	Jiawei Du et.al.	2501.08238	null
2025-01-13	Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech	Bruno Ferenc Šegedin et.al.	2501.07726	null
2025-01-19	MathReader: Text-to-Speech for Mathematical Documents	Sieun Hyeon et.al.	2501.07088	link
2025-01-11	The 1st SpeechWellness Challenge: Detecting Suicidal Risk Among Adolescents	Wen Wu et.al.	2501.06474	null
2025-01-11	Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis	Rui Liu et.al.	2501.06467	link
2025-01-11	Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation	Zhengyan Sheng et.al.	2501.06394	null
2025-01-10	TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer	Vladimir Bataev et.al.	2501.06320	null
2025-01-10	MinMo: A Multimodal Large Language Model for Seamless Voice Interaction	Qian Chen et.al.	2501.06282	null
2025-01-10	PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Shaozuo Zhang et.al.	2501.06276	null
2025-01-10	Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Kishor Kayyar Lakshminarayana et.al.	2501.05976	null
2025-01-10	MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model	Matthew Baas et.al.	2501.05787	null
2025-01-09	Probing Speaker-specific Features in Speaker Representations	Aemon Yat Fei Chiu et.al.	2501.05310	null
2025-01-09	JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis	Jun-Hyeok Cha et.al.	2501.04904	null
2025-01-08	Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model	Sanjana Sankar et.al.	2501.04799	null
2025-01-08	FleSpeech: Flexibly Controllable Speech Generation with Various Prompts	Hanzhao Li et.al.	2501.04644	null
2025-01-09	OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis	Run Luo et.al.	2501.04561	link
2025-01-08	DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions	Weidong Chen et.al.	2501.04256	null
2025-01-02	FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles	Tian-Hao Zhang et.al.	2501.03181	null
2025-01-02	RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer	Seongho Hong et.al.	2501.01182	link
2025-01-02	Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT	Dongyang Dai et.al.	2501.01102	null
2025-01-06	Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study	Mykola Maslych et.al.	2501.00168	null
2024-12-28	Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Wooseok Han et.al.	2412.20155	null
2024-12-26	"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities	Jiawei Yu et.al.	2412.19102	null
2024-12-26	Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID	Ahmad Alfani Handoyo et.al.	2412.19043	null
2024-12-25	Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset	Neil Shah et.al.	2412.18839	null
2024-12-24	GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing	Wen Ku et.al.	2412.18300	null
2024-12-22	Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective	Hankun Wang et.al.	2412.17048	null
2024-12-22	Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Ye-Xin Lu et.al.	2412.16977	link
2024-12-22	Autoregressive Speech Synthesis with Next-Distribution Prediction	Xinfa Zhu et.al.	2412.16846	null
2024-12-23	Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers	Yifan Yang et.al.	2412.16102	null
2024-12-19	Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling	Leying Zhang et.al.	2412.14890	null
2024-12-17	Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge	Mahieyin Rahmun et.al.	2412.13279	link
2024-12-17	Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion	Syed Zohaib Hassan et.al.	2412.12710	null
2024-12-17	Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes	Kuiyuan Zhang et.al.	2412.12619	link
2024-12-17	Hierarchical Control of Emotion Rendering in Speech Synthesis	Sho Inoue et.al.	2412.12498	link
2024-12-19	ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Xiangheng He et.al.	2412.11795	null
2024-12-17	Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech	Rui Liu et.al.	2412.11409	link
2024-12-16	Efficient Generative Modeling with Residual Vector Quantization-Based Tokens	Jaehyeon Kim et.al.	2412.10208	null
2024-12-13	AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation	Xiyuan Gao et.al.	2412.10103	null
2024-12-13	CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder	Jianwei Cui et.al.	2412.08918	null
2024-12-11	Multimodal Latent Language Modeling with Next-Token Diffusion	Yutao Sun et.al.	2412.08635	link
2024-12-11	A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction	Sowmya Cheripally et.al.	2412.08312	null
2024-12-11	A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings	Anindita Mondal et.al.	2412.08283	null
2024-12-11	LatentSpeech: Latent Diffusion for Text-To-Speech Generation	Haowei Lou et.al.	2412.08117	null
2024-12-11	Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration	Haowei Lou et.al.	2412.08112	null
2024-12-09	Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey	Tianxin Xie et.al.	2412.06602	link
2024-12-12	EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations	Weizhen Bian et.al.	2412.06581	null
2024-12-01	Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor	Ashwin Baluja et.al.	2412.05315	null
2024-12-04	DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles	Jiaxuan Liu et.al.	2412.03388	null
2024-12-03	GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot	Aohan Zeng et.al.	2412.02612	link
2024-11-19	A Context-Based Numerical Format Prediction for a Text-To-Speech System	Yaser Darwesh et.al.	2412.00028	null
2024-11-27	Continual Learning in Machine Speech Chain Using Gradient Episodic Memory	Geoffrey Tyndall et.al.	2411.18320	null
2024-11-27	SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation	Wenyi Yu et.al.	2411.18138	null
2024-11-26	WavChat: A Survey of Spoken Dialogue Models	Shengpeng Ji et.al.	2411.13577	link
2024-12-02	I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception	Jiawei Zhang et.al.	2411.13314	null
2024-11-20	Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM	Jiawei Yu et.al.	2411.13159	null
2024-11-19	Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation	Praveen Srinivasa Varadhan et.al.	2411.12719	null
2024-11-19	Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D	Adithya TG et.al.	2411.12619	null
2024-11-18	ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram	Xiao-Hang Jiang et.al.	2411.11258	null
2024-11-12	Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models	Dongrui Han et.al.	2411.07563	null
2024-11-11	Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities	Snehasish Paul et.al.	2411.06970	null
2024-11-10	Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Yiqiao Huang et.al.	2411.06540	null
2024-11-07	CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR	Kadir Burak Buldu et.al.	2411.04671	null
2024-11-04	EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector	Deok-Hyeon Cho et.al.	2411.02625	link
2024-11-09	Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis	Shijia Liao et.al.	2411.01156	link
2024-10-31	Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?	Ioannis Tsiamas et.al.	2410.24019	null
2024-10-30	Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Théodor Lemerle et.al.	2410.23320	link
2024-10-29	Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech	Eric Battenberg et.al.	2410.22179	link
2024-10-29	Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding	Bohan Li et.al.	2410.21951	null
2024-10-29	RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis	Kehan Sui et.al.	2410.21641	null
2024-10-28	Asynchronous Tool Usage for Real-Time Agents	Antonio A. Ginart et.al.	2410.21620	null
2024-10-28	Enhancing TTS Stability in Hebrew using Discrete Semantic Units	Ella Zeldes et.al.	2410.21502	null
2024-10-28	Mitigating Unauthorized Speech Synthesis for Voice Protection	Zhisheng Zhang et.al.	2410.20742	link
2024-10-27	Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation	Maohao Shen et.al.	2410.20336	null
2024-10-24	Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis	Suparna De et.al.	2410.19199	null
2024-10-24	STTATTS: Unified Speech-To-Text And Text-To-Speech Model	Hawau Olamide Toyin et.al.	2410.18607	link
2024-10-24	Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts	ChaeHun Park et.al.	2410.18444	null
2024-10-23	ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams	Srija Anand et.al.	2410.17901	null
2024-10-22	Continuous Speech Tokenizer in Text To Speech	Yixing Li et.al.	2410.17081	link
2024-10-22	Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap	Guanrou Yang et.al.	2410.16726	null
2024-10-21	Continuous Speech Synthesis using per-token Latent Diffusion	Arnon Turetzky et.al.	2410.16048	null
2024-10-18	A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages	Sujitha Sathiyamoorthy et.al.	2410.14197	null
2024-10-18	Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech	Shuwei He et.al.	2410.14101	link
2024-10-17	Enhancing Crowdsourced Audio for Text-to-Speech Models	José Giraldo et.al.	2410.13357	null
2024-10-17	DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech	Jan Melechovsky et.al.	2410.13342	null
2024-10-17	DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2410.13288	null
2024-10-17	Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation	Sreyan Ghosh et.al.	2410.13198	null
2024-10-16	ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs	Rui-Chen Zheng et.al.	2410.12359	null
2024-10-14	IsoChronoMeter: A simple and effective isochronic translation evaluation metric	Nikolai Rozanov et.al.	2410.11127	null
2024-10-14	DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization	Yinghao Aaron Li et.al.	2410.11097	null
2024-10-12	Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling	Rui Liu et.al.	2410.09524	null
2024-10-10	Unsupervised Data Validation Methods for Efficient Model Training	Yurii Paniv et.al.	2410.07880	null
2024-10-15	F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching	Yushen Chen et.al.	2410.06885	link
2024-10-09	Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Teodora Răgman et.al.	2410.06787	null
2024-10-09	Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Onkar Kishor Susladkar et.al.	2410.06608	null
2024-10-09	Can DeepFake Speech be Reliably Detected?	Hongbin Liu et.al.	2410.06572	null
2024-10-07	SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech	Minchan Kim et.al.	2410.04690	null
2024-10-06	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Yuto Nishimura et.al.	2410.04380	null
2024-10-10	SONAR: A Synthetic AI-Audio Detection Framework and Benchmark	Xiang Li et.al.	2410.04324	link
2024-10-05	Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System	Ze Li et.al.	2410.04017	null
2024-10-01	Recent Advances in Speech Language Models: A Survey	Wenqian Cui et.al.	2410.03751	link
2024-10-04	Generative Semantic Communication for Text-to-Speech Synthesis	Jiahao Zheng et.al.	2410.03459	null
2024-10-04	Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Jinzheng Zhao et.al.	2410.03298	null
2024-10-04	Narrative Player: Reviving Data Narratives with Visuals	Zekai Shao et.al.	2410.03268	null
2024-10-04	MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech	Taejun Bak et.al.	2410.03192	null
2024-10-01	Augmentation through Laundering Attacks for Audio Spoof Detection	Hashim Ali et.al.	2410.01108	null
2024-10-01	Zero-Shot Text-to-Speech from Continuous Text Streams	Trung Dang et.al.	2410.00767	null
2024-10-01	EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control	Haozhe Chen et.al.	2410.00316	link
2024-09-30	Word-wise intonation model for cross-language TTS systems	Tomilov A. A. et.al.	2409.20374	null
2024-09-27	Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech	Youngjae Kim et.al.	2409.18622	null
2024-09-26	Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control	Ryuichi Yamamoto et.al.	2409.17452	null
2024-09-25	Exploring synthetic data for cross-speaker style transfer in style representation based TTS	Lucas H. Ueda et.al.	2409.17364	null
2024-09-25	Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions	Kun Zhou et.al.	2409.16681	null
2024-09-25	Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation	Siyin Wang et.al.	2409.16644	link
2024-09-24	FastTalker: Jointly Generating Speech and Conversational Gestures from Text	Zixin Guo et.al.	2409.16404	null
2024-09-24	Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling	Ville Heilala et.al.	2409.16376	null
2024-09-24	Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech	Yunji Chu et.al.	2409.16203	null
2024-09-24	NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers	Nohil Park et.al.	2409.15760	null
2024-09-24	VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance	Jiheum Yeom et.al.	2409.15759	null
2024-09-24	StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Zhiyong Chen et.al.	2409.15741	null
2024-09-23	A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection	Lam Pham et.al.	2409.15180	null
2024-09-23	LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation	Hieu-Thi Luong et.al.	2409.14743	link
2024-09-20	Zero-shot Cross-lingual Voice Transfer for TTS	Fadi Biadsy et.al.	2409.13910	null
2024-09-20	On the Feasibility of Fully AI-automated Vishing Attacks	João Figueiredo et.al.	2409.13793	null
2024-09-19	Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space	Sebastião Quintas et.al.	2409.12745	null
2024-09-19	Preference Alignment Improves Language Model-Based TTS	Jinchuan Tian et.al.	2409.12403	null
2024-09-18	Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference	Edresson Casanova et.al.	2409.12117	null
2024-09-18	Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems	Anusha Prakash et.al.	2409.11915	null
2024-09-18	DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech	Xin Qi et.al.	2409.11835	null
2024-09-18	Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation	Haohan Guo et.al.	2409.11630	null
2024-09-17	SpMis: An Investigation of Synthetic Spoken Misinformation Detection	Peizhuo Liu et.al.	2409.11308	null
2024-09-19	The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives	Samee Arif et.al.	2409.11261	link
2024-09-17	Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora	Francesco Nespoli et.al.	2409.11107	null
2024-09-16	Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization	Xiaoxue Gao et.al.	2409.10157	null
2024-09-16	StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Yinghao Aaron Li et.al.	2409.10058	null
2024-09-15	Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning	Siqi Sun et.al.	2409.09891	null
2024-09-14	E1 TTS: Simple and Fast Non-Autoregressive TTS	Zhijun Liu et.al.	2409.09351	null
2024-09-14	Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation	Changjin Han et.al.	2409.09311	link
2024-09-14	SafeEar: Content Privacy-Preserving Audio Deepfake Detection	Xinfeng Li et.al.	2409.09272	link
2024-09-13	AccentBox: Towards High-Fidelity Zero-Shot Accent Generation	Jinzuomu Zhong et.al.	2409.09098	null
2024-09-17	HLTCOE JHU Submission to the Voice Privacy Challenge 2024	Henry Li Xinyuan et.al.	2409.08913	null
2024-09-13	Text-To-Speech Synthesis In The Wild	Jee-weon Jung et.al.	2409.08711	null
2024-09-14	Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions	Amila Indika et.al.	2409.07945	null
2024-09-12	Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Zhiyuan Tang et.al.	2409.07790	null
2024-09-11	SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Helin Wang et.al.	2409.07556	link
2024-09-11	D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack	Hong-Hanh Nguyen-Le et.al.	2409.07390	null
2024-09-11	Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT	Kazuki Yamauchi et.al.	2409.07265	null
2024-09-11	Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment	Tien-Hong Lo et.al.	2409.07151	null
2024-09-10	Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models	Xin Jing et.al.	2409.06451	null
2024-09-10	What happens to diffusion model likelihood when your model is conditional?	Mattias Cross et.al.	2409.06364	null
2024-09-10	VoiceWukong: Benchmarking Deepfake Voice Detection	Ziwei Yan et.al.	2409.06348	null
2024-09-09	AS-Speech: Adaptive Style For Speech Synthesis	Zhipeng Li et.al.	2409.05730	null
2024-09-09	IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS	Ashwin Sankar et.al.	2409.05356	link
2024-09-10	Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion	Zhengyang Chen et.al.	2409.05004	null
2024-09-01	Sample-Efficient Diffusion for Text-To-Speech Synthesis	Justin Lovelace et.al.	2409.03717	link
2024-09-10	LAST: Language Model Aware Speech Tokenization	Arnon Turetzky et.al.	2409.03701	null
2024-09-05	FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications	Hao-Han Guo et.al.	2409.03283	null
2024-09-04	Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems	Jeongmin Liu et.al.	2409.02517	null
2024-09-03	VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Li-Wei Chen et.al.	2409.01548	null
2024-09-02	A multilingual training strategy for low resource Text to Speech	Asma Amalas et.al.	2409.01217	null
2024-09-02	A Framework for Synthetic Audio Conversations Generation using Large Language Models	Kaung Myat Kyaw et.al.	2409.00946	null
2024-09-02	SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis	Haohan Guo et.al.	2409.00933	link
2024-09-01	MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer	Yuancheng Wang et.al.	2409.00750	link
2024-08-30	SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection	Ismail Rasim Ulgen et.al.	2408.17432	null
2024-08-30	AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge	Kirill Borodin et.al.	2408.17352	null
2024-08-30	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	Zhen Ye et.al.	2408.17175	link
2024-08-30	Utilizing Speaker Profiles for Impersonation Audio Detection	Hao Gu et.al.	2408.17009	null
2024-08-29	Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis	Zehai Tu et.al.	2408.16373	null
2024-08-28	Multi-modal Adversarial Training for Zero-Shot Voice Cloning	John Janiczek et.al.	2408.15916	null
2024-08-29	Easy, Interpretable, Effective: openSMILE for voice deepfake detection	Octavian Pascu et.al.	2408.15775	null
2024-08-28	VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling	Yixuan Zhou et.al.	2408.15676	link
2024-08-28	VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech	Heeseung Kim et.al.	2408.14739	null
2024-08-27	StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech	Haowei Lou et.al.	2408.14713	link
2024-08-27	DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance	Jinhyeok Yang et.al.	2408.14423	null
2024-08-26	Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard	Wonjune Kang et.al.	2408.13970	null
2024-08-28	SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models	Dongchao Yang et.al.	2408.13893	null
2024-08-22	Positional Description for Numerical Normalization	Deepanshu Gupta et.al.	2408.12430	null
2024-08-22	VoiceX: A Text-To-Speech Framework for Custom Voices	Silvan Mertes et.al.	2408.12170	null
2024-08-13	Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation	Yinghao Aaron Li et.al.	2408.11849	null
2024-08-20	EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech	Xin Qi et.al.	2408.10852	null
2024-08-20	SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS	Karl El Hajal et.al.	2408.10771	null
2024-08-20	Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting	Hyun Jin Park et.al.	2408.10463	null
2024-08-17	Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition	Samuele Cornell et.al.	2408.09215	link
2024-08-14	PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation	Sang-Hoon Lee et.al.	2408.07547	link
2024-08-13	SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis	Osamu Take et.al.	2408.06858	link
2024-08-13	PRESENT: Zero-Shot Text-to-Prosody Control	Perry Lam et.al.	2408.06827	link
2024-08-12	FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks	Min Ma et.al.	2408.06227	null
2024-08-11	VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing	Chunyu Qiang et.al.	2408.05758	null
2024-08-06	Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training	Hawraz A. Ahmad et.al.	2408.03887	null
2024-08-03	ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features	Peng Cheng et.al.	2408.01808	link
2024-08-01	Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation	Xinhan Di et.al.	2408.00284	null
2024-07-18	Handling Numeric Expressions in Automatic Speech Recognition	Christian Huber et.al.	2408.00004	null
2024-07-31	On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition	Nick Rossenbach et.al.	2407.21476	null
2024-07-29	Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Mahmoud Salhab et.al.	2407.18571	null
2024-07-25	On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures	Nick Rossenbach et.al.	2407.17997	null
2024-07-24	Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model	Jan Lehečka et.al.	2407.17167	null
2024-07-23	Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments	Pai Zhu et.al.	2407.16840	null
2024-07-19	Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2	Chun Xu et.al.	2407.14212	null
2024-07-18	Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Weiqin Li et.al.	2407.13509	null
2024-07-22	TTSDS -- Text-to-Speech Distribution Score	Christoph Minixhofer et.al.	2407.12707	link
2024-07-17	Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech	Haibin Wu et.al.	2407.12229	link
2024-07-16	A Language Modeling Approach to Diacritic-Free Hebrew TTS	Amit Roth et.al.	2407.12206	null
2024-07-17	Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding	Chuanhao Sun et.al.	2407.09370	link
2024-07-11	Autoregressive Speech Synthesis without Vector Quantization	Lingwei Meng et.al.	2407.08551	link
2024-07-10	Source Tracing of Audio Deepfake Systems	Nicholas Klein et.al.	2407.08016	null
2024-07-07	ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation	Ruibo Fu et.al.	2407.05421	null
2024-07-09	CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Zhihao Du et.al.	2407.05407	null
2024-07-04	Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Cong-Thanh Do et.al.	2407.04047	null
2024-07-04	Optimizing a-DCF for Spoofing-Robust Speaker Verification	Oğuzhan Kurnaz et.al.	2407.04034	null
2024-07-04	On the Effectiveness of Acoustic BPE in Decoder-Only TTS	Bohan Li et.al.	2407.03892	null
2024-07-14	CATT: Character-based Arabic Tashkeel Transformer	Faris Alasmary et.al.	2407.03236	link
2024-07-02	Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Yuchen Hu et.al.	2407.02243	null
2024-07-02	TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations	Xiaoxue Gao et.al.	2407.01927	null
2024-07-01	Lightweight Zero-shot Text-to-Speech with Mixture of Adapters	Kenichi Fujita et.al.	2407.01291	null
2024-06-30	NAIST Simultaneous Speech Translation System for IWSLT 2024	Yuka Ko et.al.	2407.00826	null
2024-06-30	An Attribute Interpolation Method in Speech Synthesis by Model Merging	Masato Murata et.al.	2407.00766	null
2024-06-30	FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis	Yinlin Guo et.al.	2407.00753	null
2024-07-02	Open-Source Conversational AI with SpeechBrain 1.0	Mirco Ravanelli et.al.	2407.00463	null
2024-06-27	Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models	Borodin Kirill Nikolayevich et.al.	2406.19243	null
2024-06-27	DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability	Hyun Joon Park et.al.	2406.19135	link
2024-06-26	Automatic Speech Recognition for Hindi	Anish Saha et.al.	2406.18135	null
2024-06-26	A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons	Tzu-Yun Hung et.al.	2406.18089	null
2024-06-29	LLM-Driven Multimodal Opinion Expression Identification	Bonian Jia et.al.	2406.18088	null
2024-06-26	E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS	Sefik Emre Eskimez et.al.	2406.18009	link
2024-06-25	Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment	Paarth Neekhara et.al.	2406.17957	null
2024-06-22	A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge	Xiaopeng Wang et.al.	2406.17801	null
2024-06-25	High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model	Joun Yeop Lee et.al.	2406.17310	null
2024-06-25	Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation	Yingting Li et.al.	2406.17257	null
2024-06-24	Exploring the Capability of Mamba in Speech Applications	Koichi Miyazaki et.al.	2406.16808	null
2024-06-25	Towards Zero-Shot Text-To-Speech for Arabic Dialects	Khai Duy Doan et.al.	2406.16751	null
2024-06-22	TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers	Yakun Song et.al.	2406.15752	link
2024-06-21	InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions	Yu Nakagome et.al.	2406.14890	null
2024-06-21	GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech	Wenbin Wang et.al.	2406.14875	null
2024-06-21	DASB - Discrete Audio and Speech Benchmark	Pooneh Mousavi et.al.	2406.14294	null
2024-06-18	Instruction Data Generation and Unsupervised Adaptation for Speech Language Models	Vahid Noroozi et.al.	2406.12946	null
2024-06-17	DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer	Keon Lee et.al.	2406.11427	null
2024-06-16	NAST: Noise Aware Speech Tokenization for Speech Language Models	Shoval Messica et.al.	2406.11037	link
2024-06-16	Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis	Xuehao Zhou et.al.	2406.10844	null
2024-06-14	Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice	Shubham Gupta et.al.	2406.10422	null
2024-06-14	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	Dongchao Yang et.al.	2406.10056	link
2024-06-14	MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model	Jiatong Shi et.al.	2406.09869	null
2024-06-13	DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage	Kyra Wang et.al.	2406.08820	null
2024-06-13	Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems	Zhengyang Chen et.al.	2406.08812	null
2024-06-13	DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing	Neha Sahipjohn et.al.	2406.08802	null
2024-06-12	Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis	Wing-Zin Leung et.al.	2406.08568	link
2024-06-12	Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data	Yuma Shirahata et.al.	2406.08111	null
2024-06-12	VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech	Ashishkumar Gudmalwar et.al.	2406.08076	null
2024-06-12	LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning	Masaya Kawamura et.al.	2406.07969	link
2024-06-12	VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Bing Han et.al.	2406.07855	null
2024-06-12	EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech	Deok-Hyeon Cho et.al.	2406.07803	link
2024-06-11	The Interspeech 2024 Challenge on Speech Processing Using Discrete Units	Xuankai Chang et.al.	2406.07725	null
2024-06-11	Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Qingkai Fang et.al.	2406.07289	null
2024-06-11	AudioMarkBench: Benchmarking Robustness of Audio Watermarking	Hongbin Liu et.al.	2406.06979	link
2024-06-11	Controlling Emotion in Text-to-Speech with Natural Language Prompts	Thomas Bott et.al.	2406.06406	link
2024-06-10	Meta Learning Text-to-Speech Synthesis in over 7000 Languages	Florian Lux et.al.	2406.06403	link
2024-06-10	MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance	Semin Kim et.al.	2406.05965	null
2024-06-11	WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark	Linhan Ma et.al.	2406.05763	link
2024-06-09	An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS	Xiaofei Wang et.al.	2406.05699	null
2024-06-11	Text-aware and Context-aware Expressive Audiobook Speech Synthesis	Dake Guo et.al.	2406.05672	link
2024-06-08	Autoregressive Diffusion Transformer for Text-to-Speech Synthesis	Zhijun Liu et.al.	2406.05551	null
2024-06-08	VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Sanyuan Chen et.al.	2406.05370	null
2024-06-07	Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis	Ryan Langman et.al.	2406.05298	null
2024-06-07	XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model	Edresson Casanova et.al.	2406.04904	link
2024-06-07	TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking	Junzuo Zhou et.al.	2406.04840	link
2024-06-07	Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study	Chong Zhang et.al.	2406.04633	null
2024-06-06	Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis	Théodor Lemerle et.al.	2406.04467	link
2024-06-06	Total-Duration-Aware Duration Modeling for Text-to-Speech Systems	Sefik Emre Eskimez et.al.	2406.04281	null
2024-06-06	Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining	Jinlong Xue et.al.	2406.03714	null
2024-06-06	Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model	Jinlong Xue et.al.	2406.03706	null
2024-06-05	Style Mixture of Experts for Expressive Text-To-Speech Synthesis	Ahad Jawaid et.al.	2406.03637	null
2024-06-07	Harder or Different? Understanding Generalization of Audio Deepfake Detection	Nicolas M. Müller et.al.	2406.03512	null
2024-06-05	LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes	Trung Dang et.al.	2406.02897	null
2024-06-04	Seed-TTS: A Family of High-Quality Versatile Speech Generation Models	Philip Anastassiou et.al.	2406.02430	link
2024-06-05	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	Dongchao Yang et.al.	2406.02328	null
2024-06-04	BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation	Hui-Peng Du et.al.	2406.02162	null
2024-06-04	Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis	Kun Zhou et.al.	2406.02009	null
2024-06-03	ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec	Shengpeng Ji et.al.	2406.01205	link
2024-06-03	Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training	Jan Melechovsky et.al.	2406.01018	null
2024-06-02	Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback	Chen Chen et.al.	2406.00654	null
2024-05-31	Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities	Vicky Zayats et.al.	2405.18669	null
2024-05-28	TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	Chenyang Le et.al.	2405.17809	link
2024-05-27	RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis	Haoxiang Shi et.al.	2405.17028	null
2024-05-24	Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition	Zijin Gu et.al.	2405.15216	null
2024-05-23	Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models	Jingyi Chen et.al.	2405.14632	null
2024-05-22	A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction	Yue Li et.al.	2405.13477	null
2024-05-20	Multi-speaker Text-to-speech Training with Speaker Anonymized Data	Wen-Chin Huang et.al.	2405.11767	null
2024-05-19	VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications	Mikhail Konenkov et.al.	2405.11537	null
2024-05-18	Exploring speech style spaces with language models: Emotional TTS without emotion labels	Shreeram Suresh Chandra et.al.	2405.11413	null
2024-05-16	Faces that Speak: Jointly Synthesising Talking Face and Speech from Text	Youngjoon Jang et.al.	2405.10272	null
2024-05-16	Building a Luganda Text-to-Speech Model From Crowdsourced Data	Sulaiman Kagumire et.al.	2405.10211	null
2024-05-16	Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model	Siyang Wang et.al.	2405.09768	null
2024-05-15	Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer	Weifei Jin et.al.	2405.09470	null
2024-05-15	Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis	Sho Inoue et.al.	2405.09171	null
2024-05-14	PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset	Yang Hou et.al.	2405.08838	link
2024-04-30	Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech	Hankun Wang et.al.	2404.19723	null
2024-04-29	MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis	Xiang Li et.al.	2404.18398	link
2024-04-28	USAT: A Universal Speaker-Adaptive Text-to-Speech Approach	Wenbin Wang et.al.	2404.18094	link
2024-04-27	TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality	Tiantian Feng et.al.	2404.17983	null
2024-04-26	An RFP dataset for Real, Fake, and Partially fake audio detection	Abdulazeez AlAli et.al.	2404.17721	null
2024-04-23	StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations	Sen Liu et.al.	2404.14946	link
2024-04-23	Retrieval-Augmented Audio Deepfake Detection	Zuheng Kang et.al.	2404.13892	link
2024-04-14	Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling	Quanxiu Wang et.al.	2404.09192	null
2024-04-11	Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network	Mayura Manawadu et.al.	2404.07807	null
2024-04-18	Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness	Xincan Feng et.al.	2404.06714	link
2024-04-10	CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations	Leying Zhang et.al.	2404.06690	link
2024-04-10	The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge	Yiwei Guo et.al.	2404.06079	null
2024-04-07	Cross-Domain Audio Deepfake Detection: Dataset and Analysis	Yuang Li et.al.	2404.04904	null
2024-04-06	HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks	Yingting Li et.al.	2404.04645	link
2024-04-18	Open vocabulary keyword spotting through transfer learning from speech synthesis	Kesavaraj V et.al.	2404.03914	null
2024-04-06	RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Detai Xin et.al.	2404.03204	null
2024-04-03	CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech	Jaehyeon Kim et.al.	2404.02781	null
2024-04-13	PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders	Yu Pan et.al.	2404.02702	null
2024-03-31	Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation	Rohan Chaudhury et.al.	2404.01339	link
2024-03-28	A Review of Multi-Modal Large Language and Vision Models	Kilian Carolan et.al.	2404.01322	null
2024-04-09	KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis	Adal Abilbekov et.al.	2404.01033	link
2024-03-31	CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Xiang Li et.al.	2404.00569	link
2024-03-25	VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild	Puyuan Peng et.al.	2403.16973	link
2024-03-20	Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning	Shivam Ratnakant Mhaskar et.al.	2403.15469	null
2024-03-20	UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge	Wataru Nakata et.al.	2403.13720	null
2024-03-20	Building speech corpus with diverse voice characteristics for its prompt-based representation	Aya Watanabe et.al.	2403.13353	null
2024-03-17	Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations	Claudio Pinhanez et.al.	2403.11209	null
2024-03-17	EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech	Ziqi Liang et.al.	2403.08164	null
2024-03-09	HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling	Chunhui Wang et.al.	2403.05989	null
2024-03-05	AttentionStitch: How Attention Solves the Speech Editing Problem	Antonios Alexos et.al.	2403.04804	null
2024-03-07	Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation	Sai Akarsh et.al.	2403.04178	null
2024-03-27	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Zeqian Ju et.al.	2403.03100	null
2024-03-04	Brilla AI: AI Contestant for the National Science and Maths Quiz	George Boateng et.al.	2403.01699	link
2024-03-02	Towards Accurate Lip-to-Speech Synthesis in-the-Wild	Sindhu Hegde et.al.	2403.01087	link
2024-02-29	Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data	Takaaki Saeki et.al.	2402.18932	null
2024-02-26	An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation	Ahmet Gunduz et.al.	2402.16380	link
2024-02-22	Efficient data selection employing Semantic Similarity-based Graph Structures for model training	Roxana Petcu et.al.	2402.14888	null
2024-02-22	Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition	Rendi Chevi et.al.	2402.14523	link
2024-02-19	On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models	Miri Varshavsky-Hassid et.al.	2402.12423	null
2024-02-19	Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting	Haolin Chen et.al.	2402.12220	link
2024-02-18	Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru	Zining Wang et.al.	2402.11571	null
2024-02-14	MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech	Shengpeng Ji et.al.	2402.09378	null
2024-02-15	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	Mateusz Łajszczak et.al.	2402.08093	null
2024-03-04	Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like	Naoyuki Kanda et.al.	2402.07383	null
2024-02-09	A New Approach to Voice Authenticity	Nicolas M. Müller et.al.	2402.06304	null
2024-02-08	Unified Speech-Text Pretraining for Spoken Dialog Modeling	Heeseung Kim et.al.	2402.05706	link
2024-02-05	Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations	Álvaro Martín-Cortinas et.al.	2402.03407	null
2024-02-02	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	Dan Lyth et.al.	2402.01912	null
2024-01-23	Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization	Wei-Ping Huang et.al.	2402.01692	null
2024-02-01	Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech	Dong Yang et.al.	2402.00288	link
2024-02-01	PAM: Prompting Audio-Language Models for Audio Quality Assessment	Soham Deshmukh et.al.	2402.00282	link
2024-01-31	Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2	Jiatong Shi et.al.	2401.17619	link
2024-01-28	MunTTS: A Text-to-Speech System for Mundari	Varun Gumma et.al.	2401.15579	link
2024-01-30	VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech	Chenpeng Du et.al.	2401.14321	null
2024-01-25	Text to speech synthesis	Harini s et.al.	2401.13891	link
2024-01-25	SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation	Dong Zhang et.al.	2401.13527	link
2024-01-22	Benchmarking Large Multimodal Models against Common Corruptions	Jiawei Zhang et.al.	2401.11943	link
2024-01-22	Adversarial speech for voice privacy protection from Personalized Speech generation	Shihao Chen et.al.	2401.11857	null
2024-02-16	Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis	Vinotha R et.al.	2401.11771	null
2024-01-19	Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech	Abhinav Garg et.al.	2401.10465	null
2024-02-28	MLAAD: The Multi-Language Audio Anti-Spoofing Dataset	Nicolas M. Müller et.al.	2401.09512	null
2024-01-15	MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory	Robert G. Kimelman et.al.	2401.07967	null
2024-01-14	ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering	Yakun Song et.al.	2401.07333	null
2024-01-12	Multi-Task Learning for Front-End Text Processing in TTS	Wonjune Kang et.al.	2401.06321	link
2024-01-11	End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2	Aniket Tathe et.al.	2401.06183	null
2024-01-11	Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection	Lian Huang et.al.	2401.05614	null
2024-01-10	Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters	Kenichi Fujita et.al.	2401.05111	null
2024-01-07	Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments	Zhonghao Shi et.al.	2401.03581	null
2024-01-07	Transfer the linguistic representations from TTS to accent conversion with non-parallel data	Xi Chen et.al.	2401.03538	null
2024-01-03	Incremental FastPitch: Chunk-based High Quality Text to Speech	Muyang Du et.al.	2401.01755	null
2024-01-03	Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction	Minchan Kim et.al.	2401.01498	null
2023-12-18	Assisting Blind People Using Object Detection with Vocal Feedback	Heba Najm et.al.	2401.01362	null
2023-12-30	Boosting Large Language Model for Speech Synthesis: An Empirical Study	Hongkun Hao et.al.	2401.00246	null
2024-01-01	Normalization of Lithuanian Text Using Regular Expressions	Pijus Kasparaitis et.al.	2312.17660	null
2023-12-27	AE-Flow: AutoEncoder Normalizing Flow	Jakub Mosiński et.al.	2312.16552	null
2023-12-22	Creating New Voices using Normalizing Flows	Piotr Bilinski et.al.	2312.14569	null
2023-12-22	ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations	Cheng Gong et.al.	2312.14398	null
2023-12-19	External Knowledge Augmented Polyphone Disambiguation Using Large Language Model	Chen Li et.al.	2312.11920	null
2023-12-17	A review-based study on different Text-to-Speech technologies	Md. Jalal Uddin Chowdhury et.al.	2312.11563	null
2024-01-31	MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis	Wenhao Guan et.al.	2312.10687	null
2024-02-22	Amphion: An Open-Source Audio, Music and Speech Generation Toolkit	Xueyao Zhang et.al.	2312.09911	link
2023-12-11	Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism	Georgios Milis et.al.	2312.06613	link
2023-12-08	An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis	Via Nielson et.al.	2312.05415	null
2023-12-06	Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis	Zehua Chen et.al.	2312.03491	null
2023-12-02	Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning	Raviraj Joshi et.al.	2312.01107	null
2023-12-02	Code-Mixed Text to Speech Synthesis under Low-Resource Constraints	Raviraj Joshi et.al.	2312.01103	null
2023-11-29	Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes	Pavel Korshunov et.al.	2311.17655	null
2024-02-06	Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech	Enting Zhou et.al.	2311.14816	link
2023-12-07	Guided Flows for Generative Modeling and Decision Making	Qinqing Zheng et.al.	2311.13443	null
2023-11-27	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	Sang-Hoon Lee et.al.	2311.12454	link
2023-11-18	Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots	Farideh Majidi et.al.	2311.11116	null
2023-11-18	Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys	Gabriel Cosache et.al.	2311.11030	null
2023-11-17	A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness	Mathias Vogel et.al.	2311.10804	null
2023-11-16	Improving fairness for spoken language understanding in atypical speech with Text-to-Speech	Helin Wang et.al.	2311.10149	link
2024-02-02	DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation	Jianzong Wang et.al.	2311.07965	null
2023-11-12	ChatAnything: Facetime Chat with LLM-Enhanced Personas	Yilin Zhao et.al.	2311.06772	null
2023-11-11	NewsGPT: ChatGPT Integration for Robot-Reporter	Abdelhadi Hireche et.al.	2311.06640	link
2023-11-08	Synthetic Speaking Children -- Why We Need Them and How to Make Them	Muhammad Ali Farooq et.al.	2311.06307	null
2023-09-25	Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image	Minki Kang et.al.	2311.05844	null
2023-11-07	Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning	Rishabh Jain et.al.	2311.04313	link
2023-11-07	Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment	Jakir Hasan et.al.	2311.03792	null
2023-11-08	Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction	Minchan Kim et.al.	2311.02898	null
2023-11-02	Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations	Hanglei Zhang et.al.	2311.01260	null
2023-11-02	E3 TTS: Easy End-to-End Diffusion-based Text to Speech	Yuan Gao et.al.	2311.00945	null
2023-10-31	An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation	Yingjie Zhou et.al.	2310.20251	link
2023-10-27	Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN	Neeraj Kumar et.al.	2310.18169	null
2023-10-25	ArTST: Arabic Text and Speech Transformer	Hawau Olamide Toyin et.al.	2310.16621	link
2023-10-25	Generative Pre-training for Speech with Flow Matching	Alexander H. Liu et.al.	2310.16338	null
2023-10-23	DPP-TTS: Diversifying prosodic features of speech via determinantal point processes	Seongho Joo et.al.	2310.14663	null
2023-10-22	An overview of text-to-speech systems and media applications	Mohammad Reza Hasanabadi et.al.	2310.14301	null
2023-10-14	Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling	Tiberiu Boros et.al.	2310.09636	link
2023-10-14	Attentive Multi-Layer Perceptron for Non-autoregressive Generation	Shuyang Jiang et.al.	2310.09512	link
2023-12-22	Crowdsourced and Automatic Speech Prominence Estimation	Max Morrison et.al.	2310.08464	link
2023-10-12	On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition	Nick Rossenbach et.al.	2310.08132	null
2023-10-12	Vec-Tok Speech: speech vectorization and tokenization for neural speech generation	Xinfa Zhu et.al.	2310.07246	link
2023-10-10	Prosody Analysis of Audiobooks	Charuta Pethe et.al.	2310.06930	link
2023-10-09	JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions	Detai Xin et.al.	2310.06072	null
2024-01-09	Unified speech and gesture synthesis using flow matching	Shivam Mehta et.al.	2310.05181	null
2023-10-08	Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset	Ze Liu et.al.	2310.04982	null
2023-10-11	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Jiaming Wang et.al.	2310.04673	null
2024-01-22	Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis	Jae-Sung Bae et.al.	2310.03538	null
2023-10-07	The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains	Erica Cooper et.al.	2310.02640	null
2023-10-02	Towards human-like spoken dialogue generation between AI agents from written dialogue	Kentaro Mitsui et.al.	2310.01088	null
2023-10-01	Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech	Dareen Alharthi et.al.	2310.00706	null
2024-03-11	Fewer-token Neural Speech Codec with Time-invariant Codes	Yong Ren et.al.	2310.00014	link
2024-01-31	ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech	Wenhao Guan et.al.	2309.17056	null
2023-09-29	Low-Resource Self-Supervised Learning with SSL-Enhanced TTS	Po-chun Hsu et.al.	2309.17020	null
2023-09-29	Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features	Yuxiang Zhang et.al.	2309.16954	null
2023-12-18	High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models	Chunyu Qiang et.al.	2309.15512	null
2024-01-09	BiSinger: Bilingual Singing Voice Synthesis	Huali Zhou et.al.	2309.14089	link
2023-10-07	HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS	Dake Guo et.al.	2309.13907	null
2023-09-24	VoiceLDM: Text-to-Speech with Environmental Context	Yeonghyeon Lee et.al.	2309.13664	null
2023-09-24	Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control	Aya Watanabe et.al.	2309.13509	null
2023-09-22	DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2309.12792	null
2023-09-22	Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts	Shun Lei et.al.	2309.11977	null
2023-09-21	The Impact of Silence on Speech Anti-Spoofing	Yuxiang Zhang et.al.	2309.11827	null
2023-09-21	Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech	Rui Liu et.al.	2309.11724	link
2023-09-20	Speak While You Think: Streaming Speech Synthesis During Text Generation	Avihu Dekel et.al.	2309.11210	null
2023-09-20	Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model	Xinyu Zhou et.al.	2309.11000	link
2023-09-19	Exploring Speech Enhancement for Low-resource Speech Synthesis	Zhaoheng Ni et.al.	2309.10795	null
2023-09-19	Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition	Ziyang Ma et.al.	2309.10294	null
2023-09-17	Augmenting text for spoken language understanding with Large Language Models	Roshan Sharma et.al.	2309.09390	null
2023-09-16	FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework	Jianzong Wang et.al.	2309.08837	null
2023-09-15	Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech	Dariusz Piotrowski et.al.	2309.08255	null
2023-09-15	HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods	Hyun-seo Shin et.al.	2309.08208	link
2023-12-27	PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions	Reo Shimizu et.al.	2309.08140	null
2023-09-15	Diversity-based core-set selection for text-to-speech with linguistic and acoustic features	Kentaro Seki et.al.	2309.08127	null
2023-09-14	Direct Text to Speech Translation System using Acoustic Units	Victoria Mingote et.al.	2309.07478	null
2023-10-07	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	Zhihao Du et.al.	2309.07405	link
2023-09-13	DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation	Zhichao Wu et.al.	2309.06787	null
2023-09-11	Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP	Jinzuomu Zhong et.al.	2309.05423	link
2024-01-16	VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching	Yiwei Guo et.al.	2309.05027	link
2023-09-08	Cross-Utterance Conditioned VAE for Speech Generation	Yang Li et.al.	2309.04156	null
2023-09-07	Large-Scale Automatic Audiobook Creation	Brendan Walsh et.al.	2309.03926	null
2023-09-11	GRASS: Unified Generation Model for Speech-to-Semantic Tasks	Aobo Xia et.al.	2309.02780	null
2023-09-12	MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023	Zhihang Xu et.al.	2309.02743	null
2023-10-12	PromptTTS 2: Describing and Generating Voices with Text Prompt	Yichong Leng et.al.	2309.02285	null
2023-09-04	A Comparative Analysis of Pretrained Language Models for Text-to-Speech	Marcel Granero-Moya et.al.	2309.01576	null
2023-09-02	DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin	Tao Li et.al.	2309.00883	null
2023-12-18	Learning Speech Representation From Contrastive Token-Acoustic Pretraining	Chunyu Qiang et.al.	2309.00424	null
2023-09-01	The FruitShell French synthesis system at the Blizzard 2023 Challenge	Xin Qi et.al.	2309.00223	null
2023-08-31	QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning	Haohan Guo et.al.	2309.00126	null
2024-01-23	SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models	Xin Zhang et.al.	2308.16692	link
2023-08-31	Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis	Weiqin Li et.al.	2308.16593	null
2023-08-31	Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information	Jie Chen et.al.	2308.16577	null
2023-08-31	LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech	Jie Chen et.al.	2308.16569	null
2023-08-30	CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis	Yi Meng et.al.	2308.16021	null
2023-09-01	The DeepZen Speech Synthesis System for Blizzard Challenge 2023	Christophe Veaux et.al.	2308.15945	null
2023-08-28	Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech	Hyungchan Yoon et.al.	2308.14909	null
2023-09-04	Rep2wav: Noise Robust text-to-speech Using self-supervised representations	Qiushi Zhu et.al.	2308.14553	null
2023-08-28	TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models	Shengpeng Ji et.al.	2308.14430	link
2023-09-02	Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder	Xuyuan Li et.al.	2308.13365	null
2023-08-24	Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations	Wenbin Wang et.al.	2308.13007	null
2023-09-22	Sparks of Large Audio Models: A Survey and Outlook	Siddique Latif et.al.	2308.12792	null
2023-10-25	SeamlessM4T: Massively Multilingual & Multimodal Machine Translation	Seamless Communication et.al.	2308.11596	link
2023-08-31	Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models	Heyang Xue et.al.	2308.10428	null
2023-08-16	AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis	Hrishikesh Viswanath et.al.	2308.08577	null
2023-08-14	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	Xiaofei Wang et.al.	2308.06873	null
2023-08-12	Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation	Zhichao Wang et.al.	2308.06457	link
2023-09-09	AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining	Haohe Liu et.al.	2308.05734	link
2023-08-09	Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay	Leixian Shen et.al.	2308.04703	null
2023-08-08	Towards an AI to Win Ghana's National Science and Maths Quiz	George Boateng et.al.	2308.04333	link
2023-08-08	WonderFlow: Narration-Centric Design of Animated Data Videos	Yun Wang et.al.	2308.04040	null
2023-08-04	Let's Give a Voice to Conversational Agents in Virtual Reality	Michele Yin et.al.	2308.02665	link
2023-08-03	Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation	Minsu Kim et.al.	2308.01831	link
2023-08-02	SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Ramanan Sivaguru et.al.	2308.01018	null
2023-07-07	Artificial Eye for the Blind	Abhinav Benagi et.al.	2308.00801	null
2023-07-31	Multilingual context-based pronunciation learning for Text-to-Speech	Giulia Comini et.al.	2307.16709	null
2023-07-31	Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Guangyan Zhang et.al.	2307.16679	null
2023-07-31	Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings	Manuel Sam Ribeiro et.al.	2307.16643	null
2023-07-31	DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training	Hyung-Seok Oh et.al.	2307.16549	link
2023-07-31	VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design	Jungil Kong et.al.	2307.16430	link
2023-07-30	Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation	Yuanhao Chen et.al.	2307.16199	link
2023-07-29	METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer	Xinfa Zhu et.al.	2307.15951	link
2023-12-18	Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding	Chunyu Qiang et.al.	2307.15484	null
2023-07-20	SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer	Daegyeom Kim et.al.	2307.10550	link
2023-07-18	SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Yinghao Aaron Li et.al.	2307.09435	null
2023-09-28	Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts	Ziyue Jiang et.al.	2307.07218	null
2023-07-13	Controllable Emphasis with zero data for text-to-speech	Arnaud Joly et.al.	2307.07062	null
2023-07-11	On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis	Siyang Wang et.al.	2307.05132	null
2023-07-10	The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task	Kun Song et.al.	2307.04630	null
2023-10-07	ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading	Yujia Xiao et.al.	2307.00782	null
2023-06-28	EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech	Daria Diatlova et.al.	2307.00024	link
2023-06-29	High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Junchen Lu et.al.	2306.17005	null
2023-06-28	UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data	Heeseung Kim et.al.	2306.16083	link
2023-10-19	Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale	Matthew Le et.al.	2306.15687	null
2023-06-27	GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech	Yahuan Cong et.al.	2306.15304	null
2023-06-25	DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech	Sen Liu et.al.	2306.14145	null
2023-06-21	Visual-Aware Text-to-Speech	Mohan Zhou et.al.	2306.12020	null
2023-06-21	Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer	Jakub Swiatkowski et.al.	2306.11662	null
2023-06-16	Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation	Kishor Kayyar Lakshminarayana et.al.	2306.10152	null
2023-06-16	CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages	Frederico S. Oliveira et.al.	2306.10097	null
2023-06-14	Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation	Zheng Liang et.al.	2306.08588	null
2023-06-14	Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects	Xinghua Qu et.al.	2306.08219	link
2023-11-20	StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models	Yinghao Aaron Li et.al.	2306.07691	null
2024-01-18	UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding	Chenpeng Du et.al.	2306.07547	null
2023-06-13	PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling	Ji-Sang Hwang et.al.	2306.07489	null
2023-06-09	Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech	Shijun Wang et.al.	2306.05709	null
2023-06-08	VIFS: An End-to-End Variational Inference for Foley Sound Synthesis	Junhyeok Lee et.al.	2306.05004	link
2023-07-11	Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge	Wenhao Guan et.al.	2306.04301	null
2023-06-06	Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias	Ziyue Jiang et.al.	2306.03509	null
2023-08-02	Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis	Zhenhui Ye et.al.	2306.03504	null
2023-06-05	Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis	Dengfeng Ke et.al.	2306.02593	null
2023-06-05	Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model	Hoyeon Lee et.al.	2306.02579	null
2023-06-05	Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming	Xinlei Niu et.al.	2306.02568	link
2023-06-02	Towards Robust FastSpeech 2 by Modelling Residual Multimodality	Fabian Kögel et.al.	2306.01442	link
2023-05-30	Towards Selection of Text-to-speech Data to Augment ASR Training	Shuo Liu et.al.	2306.00998	null
2023-06-01	EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis	Haobin Tang et.al.	2306.00648	null
2023-06-01	The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech	Phat Do et.al.	2306.00535	null
2023-05-31	Text-to-Speech Pipeline for Swiss German -- A comparison	Tobias Bollinger et.al.	2305.19750	null
2023-05-31	XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech	Linh The Nguyen et.al.	2305.19709	link
2023-06-01	PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions	Guanghou Liu et.al.	2305.19522	null
2023-05-30	Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages	Phat Do et.al.	2305.19396	null
2023-05-30	Make-A-Voice: Unified Voice Synthesis With Discrete Representation	Rongjie Huang et.al.	2305.19269	null
2023-05-30	STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions	Michel Plüss et.al.	2305.18855	null
2023-05-30	LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus	Yuma Koizumi et.al.	2305.18802	null
2023-10-09	An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization	Fei Kong et.al.	2305.18355	link
2023-05-29	ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation	Ambuj Mehrish et.al.	2305.18028	link
2023-05-29	Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis	Erik Ekstedt et.al.	2305.17971	null
2023-07-25	StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation	Kun Song et.al.	2305.17732	null
2023-05-28	Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS	Sewade Ogun et.al.	2305.17724	link
2023-07-19	Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing	Julia Kaiwen Lau et.al.	2305.17445	link
2023-05-26	DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction	Vineet Bhat et.al.	2305.16957	null
2023-05-25	Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion	Rui Liu et.al.	2305.16353	link
2023-05-22	Text Generation with Speech Synthesis for ASR Data Augmentation	Zhuangqun Huang et.al.	2305.16333	null
2023-05-25	VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	Tianrui Wang et.al.	2305.16107	null
2023-05-25	Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration	Rustem Yeshpanov et.al.	2305.15749	link
2024-02-05	LAraBench: Benchmarking Arabic AI with Large Language Models	Ahmed Abdelali et.al.	2305.14982	null
2023-05-23	EfficientSpeech: An On-Device Text to Speech Model	Rowel Atienza et.al.	2305.13905	link
2023-05-23	ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	Minki Kang et.al.	2305.13831	null
2023-05-22	U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech	Xin Jing et.al.	2305.13195	null
2023-05-25	EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels	Kari Ali Noriy et.al.	2305.13137	link
2023-05-22	ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer	Huadai Liu et.al.	2305.12708	null
2023-05-21	VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	Shivam Mhaskar et.al.	2305.12518	null
2023-05-26	Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus	Detai Xin et.al.	2305.12442	link
2023-05-20	ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios	Yuyue Wang et.al.	2305.12200	null
2023-05-19	MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting	Neil Shah et.al.	2305.11926	null
2024-02-20	Data Redaction from Conditional Generative Models	Zhifeng Kong et.al.	2305.11351	null
2023-05-18	Parameter-Efficient Learning for Text-to-Speech Accent Adaptation	Li-Jen Yang et.al.	2305.11320	link
2023-05-19	Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation	Martijn Bartelds et.al.	2305.10951	link
2023-09-30	Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data	Yusheng Tian et.al.	2305.10891	link
2023-05-18	FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs	Won Jang et.al.	2305.10823	null
2023-05-18	CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training	Zhenhui Ye et.al.	2305.10763	null
2023-08-29	A unified front-end framework for English text-to-speech synthesis	Zelin Ying et.al.	2305.10666	null
2023-09-19	Controllable Speaking Styles Using a Large Language Model	Atli Thor Sigurgeirsson et.al.	2305.10321	null
2023-05-23	Better speech synthesis through scaling	James Betker et.al.	2305.07243	link
2023-10-29	CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model	Zhen Ye et.al.	2305.06908	link
2023-05-08	Accented Text-to-Speech Synthesis with Limited Data	Xuehao Zhou et.al.	2305.04816	null
2023-05-03	M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis	Jinlong Xue et.al.	2305.02269	null
2023-05-30	A Review of Deep Learning Techniques for Speech Processing	Ambuj Mehrish et.al.	2305.00359	null
2023-04-26	Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis	Ye-Xin Lu et.al.	2304.13270	null
2023-04-25	Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge	Chenpeng Du et.al.	2304.13121	null
2023-04-24	Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Kenichi Fujita et.al.	2304.11976	null
2023-04-23	DiffVoice: Text-to-Speech with Latent Diffusion	Zhijun Liu et.al.	2304.11750	null
2023-04-23	SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model	Jianzong Wang et.al.	2304.11547	null
2023-05-31	NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	Kai Shen et.al.	2304.09116	null
2023-04-16	A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers	Juan Zuluaga-Gomez et.al.	2304.07842	null
2023-04-13	Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis	Shun Lei et.al.	2304.06359	null
2023-04-10	Enhancing Speech-to-Speech Translation with Multiple TTS Targets	Jiatong Shi et.al.	2304.04618	null
2023-04-07	ArmanTTS single-speaker Persian dataset	Mohammd Hasan Shamgholi et.al.	2304.03585	null
2023-04-03	Ensemble prosody prediction for expressive speech synthesis	Tian Huey Teh et.al.	2304.00714	null
2023-03-29	AraSpot: Arabic Spoken Command Spotting	Mahmoud Salhab et.al.	2303.16621	link
2023-03-28	Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages	Seongyeon Park et.al.	2303.15669	link
2023-03-27	Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis	Karren Yang et.al.	2303.14885	null
2023-03-24	Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis	Takuhiro Kaneko et.al.	2303.13909	null
2023-04-02	A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI	Chenshuang Zhang et.al.	2303.13336	null
2023-03-20	Code-Switching Text Generation and Injection in Mandarin-English ASR	Haibin Yu et.al.	2303.10949	null
2023-03-14	Controlling High-Dimensional Data With Sparse Input	Dan Andrei Iliescu et.al.	2303.09446	null
2023-03-09	Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports	Hyunseung Chung et.al.	2303.09395	link
2023-03-15	Cross-speaker Emotion Transfer by Manipulating Speech Style Latents	Suhee Jo et.al.	2303.08329	null
2023-03-14	QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis	Haobin Tang et.al.	2303.07682	null
2023-03-10	An End-to-End Neural Network for Image-to-Audio Transformation	Liu Chen et.al.	2303.06078	null
2023-03-09	Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation	Qi Chen et.al.	2303.05322	link
2023-03-07	Do Prosody Transfer Models Transfer Prosody?	Atli Thor Sigurgeirsson et.al.	2303.04289	null
2023-03-07	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Ziqiang Zhang et.al.	2303.03926	null
2023-03-02	Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding	Yingting Li et.al.	2303.03267	link
2023-03-08	FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model	Ruiqing Xue et.al.	2303.02939	null
2023-08-14	Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations	Yuma Koizumi et.al.	2303.01664	null
2023-03-11	Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities	Shijun Wang et.al.	2303.01508	null
2023-12-17	ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations	Neil Shah et.al.	2303.01261	null
2023-03-02	LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion	Chunfeng Wang et.al.	2303.01086	null
2023-03-02	Leveraging Large Text Corpora for End-to-End Speech Summarization	Kohei Matsuura et.al.	2303.00978	null
2023-03-01	DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction	Raviteja Anantha et.al.	2303.00171	null
2023-02-28	ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus	Ajinkya Kulkarni et.al.	2303.00069	null
2023-02-28	Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners	Jocelyn Huang et.al.	2302.14523	null
2023-06-12	CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis	Ji-Hoon Kim et.al.	2302.14370	null
2023-05-19	UniFLG: Unified Facial Landmark Generator from Text or Speech	Kentaro Mitsui et.al.	2302.14337	null
2023-02-27	Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech	Jiyoung Lee et.al.	2302.13700	link
2023-02-27	Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech	Dong Yang et.al.	2302.13652	null
2023-02-27	Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow	Yoonhyung Lee et.al.	2302.13458	null
2023-06-06	PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS	Junhyeok Lee et.al.	2302.12391	link
2023-02-21	Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition	Leyuan Qu et.al.	2302.09723	null
2023-02-23	QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion	Houjian Guo et.al.	2302.08296	link
2023-02-13	Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages	Sudhanshu Srivastava et.al.	2302.06227	null
2023-02-08	A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech	Li-Wei Chen et.al.	2302.04215	link
2023-02-07	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	Eugene Kharitonov et.al.	2302.03540	null
2023-02-15	MAC: A unified framework boosting low resource automatic speech recognition	Zeping Min et.al.	2302.03498	null
2023-06-25	InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt	Dongchao Yang et.al.	2301.13662	link
2023-03-01	UzbekTagger: The rule-based POS tagger for Uzbek language	Maksud Sharipov et.al.	2301.12711	null
2023-05-27	Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining	Takaaki Saeki et.al.	2301.12596	link
2023-01-31	Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker	Navjot Kaur et.al.	2301.12331	link
2023-01-26	On granularity of prosodic representations in expressive text-to-speech	Mikolaj Babianski et.al.	2301.11446	null
2023-01-26	Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study	Massa Baali et.al.	2301.09099	link
2023-01-20	Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions	Yinghao Aaron Li et.al.	2301.08810	null
2023-01-11	Modelling low-resource accents without accent-specific TTS frontend	Georgi Tinchev et.al.	2301.04606	null
2022-12-11	BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm	Yu-Wen Chen et.al.	2301.04120	link
2023-01-10	UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion	Haogeng Liu et.al.	2301.03801	null
2023-01-10	Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation	Abdullah Shahid et.al.	2301.03751	null
2023-09-19	Applying Automated Machine Translation to Educational Video Courses	Linden Wang et.al.	2301.03141	null
2023-01-06	Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition	David M. Chan et.al.	2301.02736	null
2023-01-05	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Chengyi Wang et.al.	2301.02111	link
2022-12-11	MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset	Kailin Liang et.al.	2301.00657	link
2022-12-30	ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech	Zehua Chen et.al.	2212.14518	null
2022-12-29	StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models	Yinghao Aaron Li et.al.	2212.14227	link
2022-12-22	HMM-based data augmentation for E2E systems for building conversational speech synthesis systems	Ishika Gupta et.al.	2212.11982	null
2022-12-21	ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Wei-Ning Hsu et.al.	2212.11377	null
2022-12-20	TTS-Guided Training for Accent Conversion Without Parallel Data	Yi Zhou et.al.	2212.10204	null
2023-06-28	Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling	Tuomo Raitio et.al.	2212.10075	null
2022-12-16	Speech Aware Dialog System Technology Challenge (DSTC11)	Hagen Soltau et.al.	2212.08704	null
2022-12-16	Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder	Yusuke Yasuda et.al.	2212.08329	null
2022-12-16	Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Yusuke Yasuda et.al.	2212.08321	null
2022-12-15	RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis	Shinhyeok Oh et.al.	2212.07939	link
2022-12-14	Probing Deep Speaker Embeddings for Speaker-related Tasks	Zifeng Zhao et.al.	2212.07068	null
2022-12-08	SpeechLMScore: Evaluating speech generation using speech language model	Soumi Maiti et.al.	2212.04559	link
2023-04-04	Learning to Dub Movies via Hierarchical Prosody Models	Gaoxiang Cong et.al.	2212.04054	link
2022-12-07	Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning	Ankur Debnath et.al.	2212.03558	null
2022-12-07	Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue	Daxin Tan et.al.	2212.03398	null
2022-12-06	UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis	Yi Lei et.al.	2212.01546	null
2022-11-30	SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech	Byoung Jin Choi et.al.	2211.16866	null
2022-11-29	Controllable speech synthesis by learning discrete phoneme-level prosodic representations	Nikolaos Ellinas et.al.	2211.16307	null
2023-05-25	Evaluating and reducing the distance between synthetic and real speech distributions	Christoph Minixhofer et.al.	2211.16049	null
2022-11-26	Contextual Expressive Text-to-Speech	Jianhong Tu et.al.	2211.14548	null
2022-12-05	Efficient Incremental Text-to-Speech on GPUs	Muyang Du et.al.	2211.13939	null
2023-03-21	Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?	Xuan Shi et.al.	2211.13868	link
2022-11-23	IMaSC -- ICFOSS Malayalam Speech Corpus	Deepa P Gopinath et.al.	2211.12796	null
2022-11-22	PromptTTS: Controllable Text-to-Speech with Text Descriptions	Zhifang Guo et.al.	2211.12171	null
2022-11-04	Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech	Xin Zhang et.al.	2211.09731	null
2023-02-17	Towards Building Text-To-Speech Systems for the Next Billion Users	Gokul Karthik Kumar et.al.	2211.09536	link
2023-02-16	EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance	Yiwei Guo et.al.	2211.09496	null
2022-11-17	Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation	Chunyu Qiang et.al.	2211.09495	null
2022-11-17	NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis	Hyeong-Seok Choi et.al.	2211.09407	null
2023-03-14	Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Minki Kang et.al.	2211.09383	null
2023-01-04	Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation	Xin Yuan et.al.	2211.09365	null
2022-11-14	SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech	Perry Lam et.al.	2211.07283	null
2023-05-25	Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing	Jacob J Webber et.al.	2211.06989	null
2023-05-29	OverFlow: Putting flows on top of neural transducers for better TTS	Shivam Mehta et.al.	2211.06892	link
2023-05-29	Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations	Yoori Oh et.al.	2211.06160	null
2022-12-04	ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Xiaoran Fan et.al.	2211.03545	link
2022-11-07	Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder	Jan Melechovsky et.al.	2211.03316	link
2022-11-06	Parallel Attention Forcing for Machine Translation	Qingyun Dou et.al.	2211.03237	null
2022-11-06	An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space	Jihwan Lee et.al.	2211.03078	null
2022-11-04	NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS	Dongchao Yang et.al.	2211.02448	null
2022-11-04	Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts	Detai Xin et.al.	2211.02336	null
2023-04-16	Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS	Ziqi Liang et.al.	2211.01948	null
2022-11-01	Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages	Anusha Prakash et.al.	2211.01338	null
2023-05-28	DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP	Kun Song et.al.	2211.01087	null
2022-11-22	Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement	Wei Song et.al.	2211.00967	null
2022-11-01	Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers	Cheng-Ping Hsieh et.al.	2211.00585	link
2023-06-11	Generating Multilingual Gender-Ambiguous Text-to-Speech Voices	Konstantinos Markopoulos et.al.	2211.00375	null
2023-05-07	Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features	Alexandra Vioni et.al.	2211.00342	null
2022-11-02	Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS	Kun Song et.al.	2210.17349	null
2024-02-27	Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation	Nikolaos Ellinas et.al.	2210.17264	null
2022-10-31	Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection	Luigi Attorresi et.al.	2210.17222	null
2022-10-31	Structured State Space Decoder for Speech Recognition and Synthesis	Koichi Miyazaki et.al.	2210.17098	null
2022-10-28	Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders	Jason Fong et.al.	2210.16045	null
2023-02-21	Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform	Masaya Kawamura et.al.	2210.15975	link
2023-02-22	Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis	Yuma Shirahata et.al.	2210.15964	null
2022-10-28	Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation	Nobuyuki Morioka et.al.	2210.15868	null
2023-03-15	Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Takaaki Saeki et.al.	2210.15447	null
2022-10-27	Explicit Intensity Control for Accented Text-to-speech	Rui Liu et.al.	2210.15364	null
2022-10-27	FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis	Yifan Hu et.al.	2210.15360	link
2022-10-26	Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection	Kentaro Seki et.al.	2210.14850	null
2022-10-25	Semi-Supervised Learning Based on Reference Model for Low-resource TTS	Xulong Zhang et.al.	2210.14723	null
2022-10-26	Cover Reproducible Steganography via Deep Generative Models	Kejiang Chen et.al.	2210.14632	null
2022-10-26	Improving Speech-to-Speech Translation Through Unlabeled Text	Xuan-Phi Nguyen et.al.	2210.14514	null
2022-10-26	The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge	Yuhao Liang et.al.	2210.14448	null
2022-10-25	Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data	Xulong Zhang et.al.	2210.13803	null
2023-09-17	HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation	Chunhui Wang et.al.	2210.12740	null
2022-10-21	Low-Resource Multilingual and Zero-Shot Multispeaker TTS	Florian Lux et.al.	2210.12223	link
2022-10-21	Adaptive re-calibration of channel-wise features for Adversarial Audio Classification	Vardhan Dongre et.al.	2210.11722	null
2022-10-20	Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS	Chunyu Qiang et.al.	2210.11429	null
2022-10-17	Towards Relation Extraction From Speech	Tongtong Wu et.al.	2210.08759	link
2023-02-08	Generating Synthetic Speech from SpokenVocab for Speech Translation	Jinming Zhao et.al.	2210.08174	link
2022-10-17	LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge	Yan Jia et.al.	2210.07749	null
2022-10-20	Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy	Sarina Meyer et.al.	2210.07002	link
2022-10-13	Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar	Aolan Sun et.al.	2210.06877	null
2022-10-12	Can we use Common Voice to train a Multi-Speaker TTS system?	Sewade Ogun et.al.	2210.06370	null
2023-06-01	SQuId: Measuring Speech Naturalness in Many Languages	Thibault Sellam et.al.	2210.06324	null
2022-11-22	Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech	Byoung Jin Choi et.al.	2210.05979	null
2022-10-06	An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era	Andreas Triantafyllopoulos et.al.	2210.03538	null
2022-09-29	Facial Landmark Predictions with Applications to Metaverse	Qiao Han et.al.	2209.14698	link
2022-09-26	Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech	Yusuke Nakai et.al.	2209.12549	null
2022-09-22	EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models	Perry Lam et.al.	2209.10890	null
2022-09-22	MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline	Yifan Hu et.al.	2209.10848	link
2022-09-22	Controllable Accented Text-to-Speech Synthesis	Rui Liu et.al.	2209.10804	null
2022-09-16	TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection	Davide Salvi et.al.	2209.08000	null
2022-09-14	Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset	Michael Chinen et.al.	2209.06358	null
2022-09-08	SANIP: Shopping Assistant and Navigation for the visually impaired	Shubham Deshmukh et.al.	2209.03570	null
2022-09-07	Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech	Huu-Tien Dang et.al.	2209.02971	null
2022-09-02	Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model	Jennifer Drexler Fox et.al.	2209.01250	null
2022-08-28	Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks	Lev Finkelstein et.al.	2208.13183	null
2022-10-04	Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale	Aditya Agarwal et.al.	2208.09796	null
2022-08-21	Visualising Model Training via Vowel Space for Text-To-Speech Systems	Binu Abeysinghe et.al.	2208.09775	link
2022-08-15	Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0	Mohammed Salah Al-Radhi et.al.	2208.07122	null
2022-12-28	Speech Synthesis with Mixed Emotions	Kun Zhou et.al.	2208.05890	null
2022-08-03	A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis	Qibing Bai et.al.	2208.02189	null
2022-07-29	Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation	Giulia Comini et.al.	2207.14607	null
2022-07-25	Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis	Raul Fernandez et.al.	2207.12262	null
2022-07-01	A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese	Song Zhang et.al.	2207.12089	null
2022-07-20	When Is TTS Augmentation Through a Pivot Language Useful?	Nathaniel Robinson et.al.	2207.09889	link
2022-07-11	LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech	Harshvardhan Anand et.al.	2207.07118	null
2022-07-13	ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech	Rongjie Huang et.al.	2207.06389	link
2022-07-13	Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech	Zhengxi Liu et.al.	2207.06088	null
2022-07-13	SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate	Nabarun Goswami et.al.	2207.06011	null
2022-07-13	Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS	Yookyung Shin et.al.	2207.06000	null
2022-07-13	A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System	Yi-Chiao Wu et.al.	2207.05913	null
2022-07-12	Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition	Rodolfo Zevallos et.al.	2207.05498	null
2022-07-12	End-to-end speech recognition modeling from de-identified data	Martin Flechl et.al.	2207.05469	null
2022-07-11	Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data	Naoki Makishima et.al.	2207.04659	null
2022-07-11	DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders	Yanqing Liu et.al.	2207.04646	null
2023-01-02	Dreamento: an open-source dream engineering toolbox for sleep EEG wearables	Mahdad Jafarzadeh Esfahani et.al.	2207.03977	link
2022-07-07	BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus	Josh Meyer et.al.	2207.03546	link
2022-07-05	Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion	Yi Lei et.al.	2207.01832	null
2022-07-04	BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model	Brooke Stephenson et.al.	2207.01718	null
2022-07-04	Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)	Ariadna Sanchez et.al.	2207.01547	null
2022-07-04	Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)	Ziyao Zhang et.al.	2207.01507	null
2023-03-13	DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech	Keon Lee et.al.	2207.01063	link
2022-07-02	Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need	Daniel Korzekwa et.al.	2207.00774	null
2022-07-01	Building African Voices	Perez Ogayo et.al.	2207.00688	link
2022-07-01	Automatic Evaluation of Speaker Similarity	Deja Kamil et.al.	2207.00344	null
2022-08-03	Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding	Wei-Ping Huang et.al.	2206.15427	null
2022-06-30	R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Kyle Kastner et.al.	2206.15276	null
2022-07-01	Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems	Hyun-Wook Yoon et.al.	2206.15067	null
2022-06-30	TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder	Eunwoo Song et.al.	2206.14984	null
2022-06-29	Improving Deliberation by Text-Only and Semi-Supervised Training	Ke Hu et.al.	2206.14716	null
2022-06-29	Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody	Peter Makarov et.al.	2206.14643	null
2022-06-28	Expressive, Variable, and Controllable Duration Modelling in TTS	Ammar Abbas et.al.	2206.14165	null
2022-06-28	Comparison of Speech Representations for the MOS Prediction System	Aki Kunikoshi et.al.	2206.13817	null
2022-06-22	A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data	Raviraj Joshi et.al.	2206.13240	null
2022-06-25	Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations	Chin-Cheng Hsu et.al.	2206.12662	null
2022-10-21	Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech	Florian Lux et.al.	2206.12229	link
2022-06-24	SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech	Hyunjae Cho et.al.	2206.12132	null
2022-06-24	End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue	Kentaro Mitsui et.al.	2206.12040	null
2022-05-29	Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning	Sameea Naeem et.al.	2206.11860	null
2022-06-21	Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS	Kenta Udagawa et.al.	2206.10256	null
2022-06-24	Towards Optimizing OCR for Accessibility	Peya Mowar et.al.	2206.10254	null
2022-06-16	Automatic Prosody Annotation with Pre-Trained Text-Speech Model	Ziqian Dai et.al.	2206.07956	link
2022-11-16	NatiQ: An End-to-end Text-to-Speech System for Arabic	Ahmed Abdelali et.al.	2206.07373	null
2022-06-15	Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning	Rui Liu et.al.	2206.07229	link
2022-12-12	A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation	Junhui Zhang et.al.	2206.04922	null
2022-06-09	Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos	Alexander Waibel et.al.	2206.04523	null
2022-06-07	FlexLip: A Controllable Text-to-Lip System	Dan Oneata et.al.	2206.03206	null
2022-10-11	UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder	Jiachen Lian et.al.	2206.02512	null
2023-10-19	Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech	Ziyue Jiang et.al.	2206.02147	link
2022-11-02	AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation	Kun Song et.al.	2206.00208	null
2022-05-31	Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish	Alp Öktem et.al.	2205.15599	link
2023-11-20	StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis	Yinghao Aaron Li et.al.	2205.15439	link
2022-05-30	Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data	Sungwon Kim et.al.	2205.15370	null
2022-05-26	QSpeech: Low-Qubit Quantum Speech Application Toolkit	Zhenhou Hong et.al.	2205.13221	link
2022-11-10	T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation	Paul-Ambroise Duquenne et.al.	2205.12216	null
2022-05-20	PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit	Hui Zhang et.al.	2205.12007	link
2022-05-24	TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS	Xulong Zhang et.al.	2205.11824	null
2022-10-12	GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech	Rongjie Huang et.al.	2205.07211	link
2022-05-13	Talking Face Generation with Multilingual TTS	Hyoung-Kyu Song et.al.	2205.06421	null
2022-05-10	NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	Xu Tan et.al.	2205.04421	link
2022-05-09	Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech	Yang Li et.al.	2205.04120	link
2022-05-09	ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	Sangshin Oh et.al.	2205.04104	null
2022-07-14	Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss	Efthymios Georgiou et.al.	2204.13437	null
2024-06-06	Parallel Synthesis for Autoregressive Speech Generation	Po-chun Hsu et.al.	2204.11806	null
2022-04-25	SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech	Zhenhui Ye et.al.	2204.11792	link
2022-04-22	LibriS2S: A German-English Speech-to-Speech Translation Corpus	Pedro Jeuris et.al.	2204.10593	link
2022-07-05	Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation	Ryo Terashima et.al.	2204.10020	null
2022-04-21	FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis	Rongjie Huang et.al.	2204.09934	link
2022-04-20	Audio Deep Fake Detection System with Neural Stitching for ADD 2022	Rui Yan et.al.	2204.08720	null
2022-04-14	Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech	Cong Zhang et.al.	2204.07228	null
2022-12-09	Study of Indian English Pronunciation Variabilities relative to Received Pronunciation	Priyanshi Pal et.al.	2204.06502	null
2022-04-12	Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch	Hanbin Bae et.al.	2204.05753	null
2023-01-30	The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance	Lin Zhang et.al.	2204.05177	null
2022-10-27	Fine-grained Noise Control for Multispeaker Speech Synthesis	Karolos Nikitaras et.al.	2204.05070	null
2022-08-31	Karaoker: Alignment-free singing voice synthesis with speech training data	Panos Kakoulidis et.al.	2204.04127	null
2022-08-15	Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech	Jae-Sung Bae et.al.	2204.04004	null
2022-04-07	Arabic Text-To-Speech (TTS) Data Preparation	Hala Al Masri et.al.	2204.03255	null
2022-04-07	Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis	Yutian Wang et.al.	2204.03238	null
2022-08-24	SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis	Georgia Maniati et.al.	2204.03040	null
2022-09-13	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	Sravya Popuri et.al.	2204.02967	null
2022-07-02	Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification	Jin Woo Lee et.al.	2204.02639	null
2023-08-28	Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech	Hyungchan Yoon et.al.	2204.02172	null
2022-09-07	Deliberation Model for On-Device Spoken Language Understanding	Duc Le et.al.	2204.01893	null
2022-12-14	Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck	Youngsik Eom et.al.	2204.01387	null
2022-11-11	Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis	Yixuan Zhou et.al.	2204.00990	null
2022-06-30	VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Chenpeng Du et.al.	2204.00768	null
2022-04-01	AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios	Yihan Wu et.al.	2204.00436	null
2022-04-01	Text-To-Speech Data Augmentation for Low Resource Speech Recognition	Rodolfo Zevallos et.al.	2204.00291	null
2022-07-19	Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech	Guangyan Zhang et.al.	2203.17190	null
2022-03-31	An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer	Wenlin Dai et.al.	2203.16954	link
2022-07-11	WavThruVec: Latent speech representation as intermediate features for neural speech synthesis	Hubert Siuzdak et.al.	2203.16930	null
2022-03-31	A Character-level Span-based Model for Mandarin Prosodic Structure Prediction	Xueyuan Chen et.al.	2203.16922	link
2022-07-01	JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech	Dan Lim et.al.	2203.16852	link
2022-03-31	Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset	Zehui Yang et.al.	2203.16844	null
2022-03-31	NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism	Jingbei Li et.al.	2203.16838	link
2022-03-31	Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition	Anirudh Gupta et.al.	2203.16823	null
2022-04-21	Does Audio Deepfake Detection Generalize?	Nicolas M. Müller et.al.	2203.16263	null
2022-03-30	End to End Lip Synchronization with a Temporal AutoEncoder	Yoav Shalev et.al.	2203.16224	link
2022-08-15	Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition	Junrui Ni et.al.	2203.15796	link
2022-06-29	DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning	Takaaki Saeki et.al.	2203.15683	null
2022-11-05	Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation	Rendi Chevi et.al.	2203.15643	link
2022-10-06	Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus	Minchan Kim et.al.	2203.15447	null
2022-07-11	VoiceMe: Personalized voice generation in TTS	Pol van Rijn et.al.	2203.15379	link
2021-07-13	Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging	Tamás Gábor Csapó et.al.	2107.05550	null
2021-07-07	Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm	Elijah Gutierrez et.al.	2107.02527	null
2022-02-25	Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis	Erica Cooper et.al.	2104.12292	null
2019-09-26	Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities	Slava Shechtman et.al.	1909.10302	null
2019-08-28	Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis	Xin Wang et.al.	1908.10256	null
2019-05-22	Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems	Ohsung Kwon et.al.	1905.08486	null
2017-09-26	Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks	Yuki Saito et.al.	1709.08041	null

(back to top)
