Usage instructions: here
This page is modified from here
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-23 | TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation | Ji-Hoon Kim et.al. | 2512.20296 | null |
| 2025-12-23 | Fun-Audio-Chat Technical Report | Qian Chen et.al. | 2512.20156 | null |
| 2025-12-22 | JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis | Fan Yu et.al. | 2512.19090 | null |
| 2025-12-21 | Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform | Yichuan Zhang et.al. | 2512.18791 | null |
| 2025-12-21 | Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis | Pengchao Feng et.al. | 2512.18699 | link |
| 2025-12-19 | Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability | Tingxiao Zhou et.al. | 2512.17356 | null |
| 2025-12-19 | Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track | June Young Yi et.al. | 2512.17293 | null |
| 2025-12-18 | Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs | Sara Papi et.al. | 2512.16378 | null |
| 2025-12-16 | Adapting Speech Language Model to Singing Voice Synthesis | Yiwen Zhao et.al. | 2512.14657 | null |
| 2025-12-16 | Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty | Yiwen Zhao et.al. | 2512.14653 | null |
| 2025-12-16 | GLM-TTS Technical Report | Jiayan Cui et.al. | 2512.14291 | null |
| 2025-12-18 | A stylometric analysis of speaker attribution from speech transcripts | Cristina Aggazzotti et.al. | 2512.13667 | null |
| 2025-12-15 | Reproducing and Dissecting Denoising Language Models for Speech Recognition | Dorian Koch et.al. | 2512.13576 | null |
| 2025-12-18 | DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec | Tao Li et.al. | 2512.13251 | null |
| 2025-12-11 | CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences | Yiyang Wang et.al. | 2512.10918 | null |
| 2025-12-10 | DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance | Kang Yin et.al. | 2512.09504 | null |
| 2025-12-09 | LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge | Jinyoung Park et.al. | 2512.09000 | null |
| 2025-12-08 | Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS | Mahta Fetrat et.al. | 2512.08006 | null |
| 2025-12-08 | MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection | Xueping Zhang et.al. | 2512.07352 | null |
| 2025-12-06 | Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction | Kush Revankar et.al. | 2512.06485 | null |
| 2025-12-05 | SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures | Panuthep Tasawong et.al. | 2512.05501 | null |
| 2025-12-05 | Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice | Rachel Poonsiriwong et.al. | 2512.05397 | null |
| 2025-12-04 | HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages | Bi-Cheng Yan et.al. | 2512.04964 | link |
| 2025-12-04 | TripleC Learning and Lightweight Speech Enhancement for Multi-Condition Target Speech Extraction | Ziling Huang et.al. | 2512.04945 | null |
| 2025-12-04 | YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance | Junjie Zheng et.al. | 2512.04779 | null |
| 2025-12-04 | Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild | Yigui Feng et.al. | 2512.04728 | null |
| 2025-12-04 | M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis | Xiaopeng Wang et.al. | 2512.04720 | null |
| 2025-12-04 | Large Speech Model Enabled Semantic Communication | Yun Tian et.al. | 2512.04711 | null |
| 2025-12-04 | Limit cycles for speech | Adamantios I. Gafos et.al. | 2512.04642 | null |
| 2025-12-04 | RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS | Cong Wang et.al. | 2512.04552 | null |
| 2025-12-04 | Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention | Cong Wang et.al. | 2512.04551 | null |
| 2025-12-03 | Head, posture, and full-body gestures in interactive communication | Ľuboš Hládek et.al. | 2512.03636 | null |
| 2025-12-03 | A Convolutional Framework for Mapping Imagined Auditory MEG into Listened Brain Responses | Maryam Maghsoudi et.al. | 2512.03458 | null |
| 2025-12-02 | Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR | Mohan Shi et.al. | 2512.03301 | null |
| 2025-12-02 | How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy | Natalia Ponomareva et.al. | 2512.03238 | null |
| 2025-12-02 | MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation | Youxin Pang et.al. | 2512.03034 | null |
| 2025-12-02 | Perceptual evaluation of Acoustic Level of Detail in Virtual Acoustic Environments | Stefan Fichna et.al. | 2512.02891 | null |
| 2025-12-02 | BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion | Sai Koneru et.al. | 2512.02817 | null |
| 2025-12-02 | Reasoning-Aware Multimodal Fusion for Hateful Video Detection | Shuonan Yang et.al. | 2512.02743 | null |
| 2025-12-02 | Hear What Matters! Text-conditioned Selective Video-to-Audio Generation | Junwon Lee et.al. | 2512.02650 | null |
| 2025-12-02 | Spoken Conversational Agents with Large Language Models | Chao-Han Huck Yang et.al. | 2512.02593 | null |
| 2025-12-02 | Co-speech Gesture Video Generation via Motion-Based Graph Retrieval | Yafei Song et.al. | 2512.02576 | null |
| 2025-12-02 | Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation | Xueyan Li et.al. | 2512.02523 | null |
| 2025-12-02 | VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables | Lixing He et.al. | 2512.02515 | null |
| 2025-12-01 | Swivuriso: The South African Next Voices Multilingual Speech Dataset | Vukosi Marivatee et.al. | 2512.02201 | null |
| 2025-12-01 | Cross-Lingual Interleaving for Speech Language Models | Adel Moumen et.al. | 2512.01865 | null |
| 2025-12-01 | MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark | Yuezhang Peng et.al. | 2512.01603 | link |
| 2025-12-01 | MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages | Yexing Du et.al. | 2512.01512 | null |
| 2025-12-01 | Model-Based Clustering of Functional Data Via Random Projection Ensembles | Matteo Mori et.al. | 2512.01450 | null |
| 2025-12-01 | EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans | Yingjie Zhou et.al. | 2512.01340 | null |
| 2025-12-01 | fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment | Chunzheng Zhu et.al. | 2512.01189 | null |
| 2025-11-30 | Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept | Himanshi Lalwani et.al. | 2512.01105 | null |
| 2025-11-30 | Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis | Lars Nippert et.al. | 2512.00937 | null |
| 2025-11-29 | STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition | Siyu Wang et.al. | 2512.00451 | null |
| 2025-11-28 | OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion | Sai Koneru et.al. | 2512.00234 | null |
| 2025-11-28 | CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation | Fengyi Fang et.al. | 2511.22863 | null |
| 2025-11-27 | Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration | Kanchon Gharami et.al. | 2511.22769 | null |
| 2025-11-27 | PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning | Jiatong Shi et.al. | 2511.22687 | null |
| 2025-11-27 | Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking | Katia Vendrame et.al. | 2511.22503 | null |
| 2025-11-27 | Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition | Maheswar Bora et.al. | 2511.22443 | null |
| 2025-11-27 | GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis | Teysir Baoueb et.al. | 2511.22293 | null |
| 2025-11-27 | VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task | Yuyue Wang et.al. | 2511.22229 | null |
| 2025-11-27 | Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation | Joel Alberto Santos et.al. | 2511.22025 | null |
| 2025-11-26 | Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection | Bruno Padovese et.al. | 2511.21872 | null |
| 2025-11-26 | Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation | Lina Conti et.al. | 2511.21517 | null |
| 2025-11-26 | TSGM: Regular and Irregular Time-series Generation using Score-based Generative Models | Haksoo Lim et.al. | 2511.21335 | null |
| 2025-11-26 | Acoustic neural networks: Identifying design principles and exploring physical feasibility | Ivan Kalthoff et.al. | 2511.21313 | null |
| 2025-11-26 | Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale | Yicheng Zhong et.al. | 2511.21270 | null |
| 2025-11-26 | CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation | Jionghao Han et.al. | 2511.21045 | null |
| 2025-11-26 | RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data | Zhisheng Zheng et.al. | 2511.20974 | null |
| 2025-11-26 | Towards Audio Token Compression in Large Audio Language Models | Saurabhchand Bhati et.al. | 2511.20973 | null |
| 2025-11-26 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications | Jionghao Han et.al. | 2511.20972 | null |
| 2025-11-25 | Continual Audio Deepfake Detection via Universal Adversarial Perturbation | Wangjie Li et.al. | 2511.19974 | null |
| 2025-11-25 | Towards Edge General Intelligence: Knowledge Distillation for Mobile Agentic AI | Yuxuan Wu et.al. | 2511.19947 | null |
| 2025-11-25 | It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models | Xiangyu Zhao et.al. | 2511.19877 | null |
| 2025-11-24 | Evaluating Objective Speech Quality Metrics for Neural Audio Codecs | Luca A. Lanzendörfer et.al. | 2511.19734 | null |
| 2025-11-24 | A Layered Protocol Architecture for the Internet of Agents | Charles Fleming et.al. | 2511.19699 | null |
| 2025-11-24 | Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization | Ellie L. Zhang et.al. | 2511.19275 | null |
| 2025-11-25 | PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation | Huadai Liu et.al. | 2511.18833 | null |
| 2025-11-24 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties | Bashar Talafha et.al. | 2511.18774 | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | null |
| 2025-11-23 | The Locally Deployable Virtual Doctor: LLM Based Human Interface for Automated Anamnesis and Database Conversion | Jan Benedikt Ruhland et.al. | 2511.18632 | null |
| 2025-11-23 | InstructAudio: Unified speech and music generation with natural language instruction | Chunyu Qiang et.al. | 2511.18487 | null |
| 2025-11-23 | A Multimodal Conversational Agent for Tabular Data Analysis | Mohammad Nour Al Awad et.al. | 2511.18405 | null |
| 2025-11-23 | Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection | Syed Mohaiminul Hoque et.al. | 2511.18324 | null |
| 2025-11-23 | MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding | Mengchun Zhang et.al. | 2511.18294 | null |
| 2025-11-22 | A superpersuasive autonomous policy debating system | Allen Roush et.al. | 2511.17854 | null |
| 2025-11-21 | Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition | Ayhan Kucukmanisa et.al. | 2511.17477 | null |
| 2025-11-21 | AI in Music and Sound: Pedagogical Reflections, Post-Structuralist Approaches and Creative Outcomes in Seminar Practice | Guilherme Coelho et.al. | 2511.17425 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | null |
| 2025-11-21 | Investigating self-supervised representations for audio-visual deepfake detection | Dragos-Alexandru Boldisor et.al. | 2511.17181 | null |
| 2025-11-20 | Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation | Wei-Cheng Tseng et.al. | 2511.16757 | null |
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | null |
| 2025-11-21 | WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue | Zachary Ellis et.al. | 2511.16544 | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | null |
| 2025-11-19 | Step-Audio-R1 Technical Report | Fei Tian et.al. | 2511.15848 | null |
| 2025-11-19 | A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification | Mohit Sharma et.al. | 2511.15766 | null |
| 2025-11-19 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-19 | Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding | Mingyue Huo et.al. | 2511.15145 | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et.al. | 2511.15038 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants | Mingkun Yu et.al. | 2511.14852 | null |
| 2025-11-18 | Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | Nam-Gyu Kim et.al. | 2511.14824 | null |
| 2025-11-18 | Ground Truth Generation for Multilingual Historical NLP using LLMs | Clovis Gladstone et.al. | 2511.14688 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et.al. | 2511.14255 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | null |
| 2025-11-18 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-18 | FxSearcher: gradient-free text-driven audio transformation | Hojoon Ki et.al. | 2511.14138 | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et.al. | 2511.13918 | null |
| 2025-11-17 | Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video | Filippo Cenacchi et.al. | 2511.13802 | null |
| 2025-11-17 | PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement | Xiaobin Rong et.al. | 2511.13300 | null |
| 2025-11-17 | Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms | Patrick Parschan et.al. | 2511.13238 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et.al. | 2511.13159 | link |
| 2025-11-17 | A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning | Liuyi Jin et.al. | 2511.13078 | null |
| 2025-11-17 | CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models | Mehrab Mustafy Rahman et.al. | 2511.12964 | null |
| 2025-11-16 | Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data | Sina Rashidi et.al. | 2511.12690 | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et.al. | 2511.12662 | null |
| 2025-11-16 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | null |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-14 | Proactive Hearing Assistants that Isolate Egocentric Conversations | Guilin Hu et.al. | 2511.11473 | link |
| 2025-11-14 | Language-Aided State Estimation | Yuki Miyoshi et.al. | 2511.11285 | null |
| 2025-11-14 | Analysing Personal Attacks in U.S. Presidential Debates | Ruban Goyal et.al. | 2511.11108 | null |
| 2025-11-14 | CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation | Crystal Min Hui Poon et.al. | 2511.11104 | null |
| 2025-11-14 | CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding | Yifan Zhuang et.al. | 2511.10935 | null |
| 2025-11-14 | Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio | Guangke Chen et.al. | 2511.10913 | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et.al. | 2511.10793 | null |
| 2025-11-13 | Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning | Girish et.al. | 2511.10790 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et.al. | 2511.10232 | null |
| 2025-11-13 | Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard | Yudong Yang et.al. | 2511.10222 | null |
| 2025-11-13 | Towards Leveraging Sequential Structure in Animal Vocalizations | Eklavya Sarkar et.al. | 2511.10190 | link |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et.al. | 2511.10112 | null |
| 2025-11-13 | Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints | Xiangyue Zhang et.al. | 2511.10076 | null |
| 2025-11-13 | Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS | Haoyu Li et.al. | 2511.09995 | null |
| 2025-11-13 | MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection | Pritish Sahu et.al. | 2511.09918 | null |
| 2025-11-12 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages | Omnilingual ASR team et.al. | 2511.09690 | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | null |
| 2025-11-10 | Generating Novel and Realistic Speakers for Voice Conversion | Meiying Melissa Chen et.al. | 2511.07135 | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et.al. | 2511.07099 | link |
| 2025-11-09 | IDMap: A Pseudo-Speaker Generator Framework Based on Speaker Identity Index to Vector Mapping | Zeyan Liu et.al. | 2511.06246 | null |
| 2025-11-07 | Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice | Frederik Rautenberg et.al. | 2511.05143 | null |
| 2025-11-05 | Step-Audio-EditX Technical Report | Chao Yan et.al. | 2511.03601 | null |
| 2025-11-05 | PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech | Michel Wong et.al. | 2511.03080 | null |
| 2025-11-04 | Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision | Kaimeng Jia et.al. | 2511.02270 | null |
| 2025-11-03 | Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach | Cedric Chan et.al. | 2511.02104 | null |
| 2025-10-31 | Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication | Deok-Seon Kim et.al. | 2510.27247 | null |
| 2025-10-27 | SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution | Dharma Teja Donepudi et.al. | 2510.25178 | null |
| 2025-10-28 | Levée d'ambiguïtés par grammaires locales | Eric G. C. Laporte et.al. | 2510.24530 | null |
| 2025-10-28 | Bayesian Speech synthesizers Can Learn from Multiple Teachers | Ziyang Zhang et.al. | 2510.24372 | null |
| 2025-10-28 | emg2speech: synthesizing speech from electromyography using self-supervised speech models | Harshavardhana T. Gowda et.al. | 2510.23969 | null |
| 2025-10-28 | SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity | Hanke Xie et.al. | 2510.23541 | null |
| 2025-10-26 | UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models | Wenming Tu et.al. | 2510.22588 | null |
| 2025-10-24 | StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks | Jingyue Huang et.al. | 2510.21685 | null |
| 2025-10-23 | Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator | Hualei Wang et.al. | 2510.20210 | null |
| 2025-10-23 | SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance | Haowei Lou et.al. | 2510.20113 | null |
| 2025-10-22 | Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent | Yangshijie Zhang et.al. | 2510.19641 | null |
| 2025-10-22 | Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment | Maureen de Seyssel et.al. | 2510.19509 | null |
| 2025-10-22 | EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection | Tong Zhang et.al. | 2510.19414 | null |
| 2025-10-21 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et.al. | 2510.18938 | null |
| 2025-10-21 | KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Mohd Ruhul Ameen et.al. | 2510.18355 | null |
| 2025-10-21 | ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation | Haowei Lou et.al. | 2510.18308 | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | null |
| 2025-10-18 | Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages | Pacome Simon Mbonimpa et.al. | 2510.16497 | null |
| 2025-10-22 | VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition | Kye Shimizu et.al. | 2510.16192 | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et.al. | 2510.14628 | null |
| 2025-10-15 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Wenwen Tong et.al. | 2510.13747 | null |
| 2025-10-15 | Closing the Gap Between Text and Speech Understanding in LLMs | Santiago Cuervo et.al. | 2510.13632 | null |
| 2025-10-15 | Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models | Yizhou Peng et.al. | 2510.13293 | null |
| 2025-10-14 | Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs | Xinlu He et.al. | 2510.12995 | null |
| 2025-10-14 | Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation | Greta Damo et.al. | 2510.12316 | null |
| 2025-10-15 | DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation | Yakun Song et.al. | 2510.12210 | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | null |
| 2025-10-14 | ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis | Mohammad Javad Ranjbar Kalahroodi et.al. | 2510.10774 | null |
| 2025-10-14 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | null |
| 2025-10-10 | Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models | Donghang Wu et.al. | 2510.09592 | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | null |
| 2025-10-10 | DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment | Zongcai Du et.al. | 2510.09016 | null |
| 2025-10-09 | DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching | Hanke Xie et.al. | 2510.08373 | null |
| 2025-10-09 | IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation | Wei Wang et.al. | 2510.07979 | null |
| 2025-10-08 | Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis | Zhu Li et.al. | 2510.07096 | null |
| 2025-10-08 | Towards Responsible Evaluation for Text-to-Speech | Yifan Yang et.al. | 2510.06927 | null |
| 2025-10-08 | XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection | Phuong Tuan Dat et.al. | 2510.06706 | null |
| 2025-10-07 | ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning | Tao Zhu et.al. | 2510.05984 | null |
| 2025-10-07 | Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech | Rikuto Kotoge et.al. | 2510.05799 | null |
| 2025-10-07 | Investigation of perception inconsistency in speaker embedding for asynchronous voice anonymization | Rui Wang et.al. | 2510.05718 | null |
| 2025-10-07 | Sparse deepfake detection promotes better disentanglement | Antoine Teissier et.al. | 2510.05696 | null |
| 2025-10-07 | Teaching Machines to Speak Using Articulatory Control | Akshay Anand et.al. | 2510.05619 | null |
| 2025-10-06 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu et.al. | 2510.05096 | null |
| 2025-10-06 | Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba | Baher Mohammad et.al. | 2510.04738 | null |
| 2025-10-06 | UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models | Wenhao Guan et.al. | 2510.04593 | link |
| 2025-10-05 | GDiffuSE: Diffusion-based speech enhancement with noise model guidance | Efrayim Yanir et.al. | 2510.04157 | null |
| 2025-10-05 | A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation | Ananya Raghu et.al. | 2510.03986 | null |
| 2025-10-07 | Synthetic Audio Forensics Evaluation (SAFE) Challenge | Kirill Trapeznikov et.al. | 2510.03387 | null |
| 2025-10-03 | Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech | Hieu-Nghia Huynh-Nguyen et.al. | 2510.02848 | null |
| 2025-10-02 | Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement | Jianing Yang et.al. | 2510.01722 | link |
| 2025-10-01 | From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling | Yifei Cao et.al. | 2510.00743 | null |
| 2025-10-02 | MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance | Xingjian Zhao et.al. | 2510.00499 | null |
| 2025-09-30 | BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs | Yue Wang et.al. | 2509.26514 | null |
| 2025-09-30 | HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis | Ziyu Zhang et.al. | 2509.25842 | null |
| 2025-09-30 | LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning | Kang Yang et.al. | 2509.25670 | null |
| 2025-09-29 | Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization | Jiacheng Shi et.al. | 2509.25416 | null |
| 2025-09-29 | MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech | Chengyao Wang et.al. | 2509.25131 | null |
| 2025-09-30 | VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning | Xin Cheng et.al. | 2509.24773 | null |
| 2025-09-29 | VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning | Yixuan Zhou et.al. | 2509.24650 | null |
| 2025-09-29 | Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis | Tianrui Wang et.al. | 2509.24629 | null |
| 2025-09-29 | ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark | Yun Chen et.al. | 2509.24570 | null |
| 2025-09-29 | UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities | Xuenan Xu et.al. | 2509.24391 | null |
| 2025-09-28 | Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment | Pu Huang et.al. | 2509.23618 | null |
| 2025-09-27 | BFA: Real-time Multilingual Text-to-speech Forced Alignment | Abdul Rehman et.al. | 2509.23147 | null |
| 2025-09-26 | ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection | Mohamed Maged et.al. | 2509.22808 | null |
| 2025-09-26 | Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis | Zhikang Niu et.al. | 2509.22167 | null |
| 2025-09-26 | Speaker Anonymisation for Speech-based Suicide Risk Detection | Ziyun Cui et.al. | 2509.22148 | null |
| 2025-09-26 | Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling | Junjie Cao et.al. | 2509.22062 | null |
| 2025-09-26 | Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization | Shehzeen Hussain et.al. | 2509.21718 | null |
| 2025-09-25 | UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice | Sitong Cheng et.al. | 2509.21144 | link |
| 2025-09-27 | i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents | Anupam Purwar et.al. | 2509.20971 | null |
| 2025-09-26 | SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS | Tan Dat Nguyen et.al. | 2509.20802 | null |
| 2025-09-24 | Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens | Ismail Rasim Ulgen et.al. | 2509.20485 | null |
| 2025-09-24 | OLaPh: Optimal Language Phonemizer | Johannes Wirth et.al. | 2509.20086 | null |
| 2025-09-25 | Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration | Yifan Yang et.al. | 2509.19928 | null |
| 2025-09-24 | CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance | Junchuan Zhao et.al. | 2509.19883 | null |
| 2025-09-24 | Eliminating stability hallucinations in llm-based tts models via attention guidance | ShiMing Wang et.al. | 2509.19852 | null |
| 2025-09-24 | Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation | Yang Cui et.al. | 2509.19812 | null |
| 2025-09-24 | PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs | Pei Zhang et.al. | 2509.19745 | null |
| 2025-09-24 | Selective Classifier-free Guidance for Zero-shot Text-to-speech | John Zheng et.al. | 2509.19668 | null |
| 2025-09-23 | Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation | Roy Fejgin et.al. | 2509.19592 | null |
| 2025-09-23 | HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS | Sihang Nie et.al. | 2509.19001 | null |
| 2025-09-23 | Direct Preference Optimization for Speech Autoregressive Diffusion Models | Zhijun Liu et.al. | 2509.18928 | null |
| 2025-09-23 | Group Relative Policy Optimization for Text-to-Speech with Large Language Models | Chang Liu et.al. | 2509.18798 | null |
| 2025-09-23 | Explore the Reinforcement Learning for the LLM based ASR and TTS system | Changfeng Gao et.al. | 2509.18569 | null |
| 2025-09-23 | No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS | Seungyoun Shin et.al. | 2509.18531 | null |
| 2025-09-22 | Discrete-time diffusion-like models for speech synthesis | Xiaozhou Tan et.al. | 2509.18470 | null |
| 2025-09-22 | TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et.al. | 2509.18060 | null |
| 2025-09-22 | Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech | Zirui Li et.al. | 2509.17988 | null |
| 2025-09-22 | Qwen3-Omni Technical Report | Jin Xu et.al. | 2509.17765 | null |
| 2025-09-22 | Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook | Min Liu et.al. | 2509.17516 | null |
| 2025-09-21 | Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing | Wataru Nakata et.al. | 2509.17052 | null |
| 2025-09-21 | Bridging the gap between training and inference in LM-based TTS models | Ruonan Zhang et.al. | 2509.17021 | null |
| 2025-09-21 | MBCodec: Thorough disentangle for high-fidelity audio compression | Ruonan Zhang et.al. | 2509.17006 | null |
| 2025-09-19 | Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation | Qi Wang et.al. | 2509.16010 | null |
| 2025-09-19 | VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency | Nikita Torgashov et.al. | 2509.15969 | null |
| 2025-09-19 | Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS | Ziqi Dai et.al. | 2509.15845 | null |
| 2025-09-19 | LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control | Junki Ohmura et.al. | 2509.15626 | null |
| 2025-09-19 | Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech | Xinlei Niu et.al. | 2509.15492 | null |
| 2025-09-18 | A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication | Ryan Collette et.al. | 2509.15462 | null |
| 2025-09-18 | Frustratingly Easy Data Augmentation for Low-Resource ASR | Katsumi Ibaraki et.al. | 2509.15373 | null |
| 2025-09-18 | Real-Time Streaming Mel Vocoding with Generative Flow Matching | Simon Welker et.al. | 2509.15085 | null |
| 2025-09-20 | SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding | Bingsong Bai et.al. | 2509.14946 | link |
| 2025-09-18 | MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis | Keyu An et.al. | 2509.14784 | null |
| 2025-09-18 | DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2509.14684 | null |
| 2025-09-18 | Stochastic Clock Attention for Aligning Continuous and Ordered Sequences | Hyungjoon Soh et.al. | 2509.14678 | null |
| 2025-09-18 | SpeechMLC: Speech Multi-label Classification | Miseul Kim et.al. | 2509.14677 | null |
| 2025-09-18 | Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation | Miseul Kim et.al. | 2509.14632 | null |
| 2025-09-18 | Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis | Qingyu Liu et.al. | 2509.14579 | null |
| 2025-09-17 | CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset | Brian Yan et.al. | 2509.14161 | null |
| 2025-09-18 | Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems | Yi-Cheng Lin et.al. | 2509.13989 | null |
| 2025-09-16 | MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement | Jingyu Li et.al. | 2509.13068 | null |
| 2025-09-16 | A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis | Javeria Amir et.al. | 2509.12831 | null |
| 2025-09-15 | Preservation of Language Understanding Capabilities in Speech-aware Large Language Models | Marek Kubis et.al. | 2509.12171 | null |
| 2025-09-14 | FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs | Md Mubtasim Ahasan et.al. | 2509.11425 | null |
| 2025-09-14 | Length-Aware Rotary Position Embedding for Text-Speech Alignment | Hyeongju Kim et.al. | 2509.11084 | null |
| 2025-09-12 | WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers | Akshat Pandey et.al. | 2509.10452 | null |
| 2025-09-12 | Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps | Xin Wang et.al. | 2509.10086 | null |
| 2025-09-11 | DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration | Yanru Huo et.al. | 2509.09748 | null |
| 2025-09-09 | VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions | Jun Zhan et.al. | 2509.09716 | null |
| 2025-09-12 | DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech | Ngoc-Son Nguyen et.al. | 2509.09631 | null |
| 2025-09-11 | HISPASpoof: A New Dataset For Spanish Speech Forensics | Maria Risques et.al. | 2509.09155 | null |
| 2025-09-10 | Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities | Jarvis Haupt et.al. | 2509.08950 | null |
| 2025-09-10 | Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling | Neil Zeghidour et.al. | 2509.08753 | null |
| 2025-09-10 | Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching | Siratish Sakpiboonchit et.al. | 2509.08696 | null |
| 2025-09-10 | Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition | Jing-Tong Tzeng et.al. | 2509.08470 | null |
| 2025-09-09 | Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis | Yejin Jeon et.al. | 2509.07376 | null |
| 2025-09-09 | When Fine-Tuning is Not Enough: Lessons from HSAD on Hybrid and Adversarial Audio Spoof Detection | Bin Hu et.al. | 2509.07323 | null |
| 2025-09-08 | Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence | Yerin Ryu et.al. | 2509.07038 | null |
| 2025-09-08 | ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data | Vladislav Stankov et.al. | 2509.06675 | null |
| 2025-09-09 | Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake | Liping Chen et.al. | 2509.06361 | null |
| 2025-09-07 | UniVerse-1: Unified Audio-Video Generation via Stitching of Experts | Duomin Wang et.al. | 2509.06155 | null |
| 2025-09-07 | Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis | Zhenqi Jia et.al. | 2509.06074 | null |
| 2025-09-06 | LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization | Luis Felipe Chary et.al. | 2509.05863 | null |
| 2025-09-05 | Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales | Krittanon Kaewtawee et.al. | 2509.04871 | null |
| 2025-09-04 | Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding | Rui-Chen Zheng et.al. | 2509.04685 | null |
| 2025-09-04 | DarkStream: real-time speech anonymization with low latency | Waris Quamer et.al. | 2509.04667 | null |
| 2025-09-04 | AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds | Qizhou Wang et.al. | 2509.04345 | null |
| 2025-09-04 | Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis | Zhitong Zhou et.al. | 2509.04093 | null |
| 2025-09-04 | LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis | Gaspard Michel et.al. | 2509.04072 | null |
| 2025-09-03 | Multi-level SSL Feature Gating for Audio Deepfake Detection | Hoan My Tran et.al. | 2509.03409 | null |
| 2025-09-03 | LatPhon: Lightweight Multilingual G2P for Romance Languages and English | Luis Felipe Chary et.al. | 2509.03300 | null |
| 2025-09-03 | Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings | Dyah A. M. G. Wisnu et.al. | 2509.03292 | null |
| 2025-09-03 | AIVA: An AI-based Virtual Companion for Emotion-aware Interaction | Chenxi Li et.al. | 2509.03212 | null |
| 2025-09-04 | FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot | Kun Xie et.al. | 2509.02020 | null |
| 2025-09-01 | MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model | Joonyong Park et.al. | 2509.01391 | null |
| 2025-09-01 | The AudioMOS Challenge 2025 | Wen-Chin Huang et.al. | 2509.01336 | null |
| 2025-09-01 | An AI-Based Shopping Assistant System to Support the Visually Impaired | Larissa R. de S. Shibata et.al. | 2509.01246 | null |
| 2025-09-01 | SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation | Chenyang Le et.al. | 2509.01200 | null |
| 2025-08-31 | MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech | Kangxiang Xia et.al. | 2509.00685 | null |
| 2025-08-29 | Towards Improved Speech Recognition through Optimized Synthetic Data Generation | Yanis Perrin et.al. | 2508.21631 | null |
| 2025-08-28 | MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening | Yongqi Shao et.al. | 2508.20513 | null |
| 2025-08-26 | Interpolating Speaker Identities in Embedding Space for Data Expansion | Tianchi Liu et.al. | 2508.19210 | null |
| 2025-08-26 | CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis | Chun Yat Wu et.al. | 2508.19098 | null |
| 2025-08-25 | Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters | Alessio Falai et.al. | 2508.18006 | null |
| 2025-08-27 | Vocoder-Projected Feature Discriminator | Takuhiro Kaneko et.al. | 2508.17874 | null |
| 2025-09-02 | Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation | Changsong Liu et.al. | 2508.17796 | null |
| 2025-08-25 | ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks | Yuanda Wang et.al. | 2508.17660 | null |
| 2025-08-26 | EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems | Jingwen Liu et.al. | 2508.17623 | null |
| 2025-08-24 | Improving French Synthetic Speech Quality via SSML Prosody Control | Nassima Ould Ouali et.al. | 2508.17494 | null |
| 2025-08-23 | RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer | Neeraj Matiyali et.al. | 2508.17031 | null |
| 2025-08-23 | WildSpoof Challenge Evaluation Plan | Yihan Wu et.al. | 2508.16858 | null |
| 2025-08-22 | TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling | Yuancheng Wang et.al. | 2508.16790 | link |
| 2025-08-22 | Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation | Weiting Tan et.al. | 2508.16188 | null |
| 2025-08-21 | QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection | Zhiyu Wu et.al. | 2508.15931 | null |
| 2025-08-21 | Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization | Liping Chen et.al. | 2508.15565 | null |
| 2025-08-24 | Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets | Chenlin Liu et.al. | 2508.15442 | null |
| 2025-08-21 | UniCoM: A Universal Code-Switching Speech Generator | Sangmin Lee et.al. | 2508.15244 | link |
| 2025-08-20 | Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization | Rui Wang et.al. | 2508.14947 | null |
| 2025-08-20 | Long-Context Speech Synthesis with Context-Aware Memory | Zhipeng Li et.al. | 2508.14713 | null |
| 2025-08-20 | Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement | Heitor R. Guimarães et.al. | 2508.14709 | null |
| 2025-08-20 | DiffIER: Optimizing Diffusion Models with Iterative Error Reduction | Ao Chen et.al. | 2508.13628 | null |
| 2025-08-19 | Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM | Dariia Puhach et.al. | 2508.13603 | null |
| 2025-08-18 | A Surveillance Based Interactive Robot | Kshitij Kavimandan et.al. | 2508.13319 | null |
| 2025-08-18 | Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis | Zhu Li et.al. | 2508.13028 | null |
| 2025-08-18 | Real-Time Sign Language Gestures to Speech Transcription using Deep Learning | Brandone Fonya et.al. | 2508.12713 | null |
| 2025-08-19 | FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts | Qingliang Meng et.al. | 2508.12001 | null |
| 2025-08-16 | SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System | Truong Thanh Hung Nguyen et.al. | 2508.11873 | null |
| 2025-08-15 | MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts | Heyang Xue et.al. | 2508.11326 | null |
| 2025-08-15 | EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens | Joonyong Park et.al. | 2508.11273 | null |
| 2025-08-14 | Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform | Yuankun Xie et.al. | 2508.10559 | link |
| 2025-08-14 | Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning | Yejin Jeon et.al. | 2508.10412 | null |
| 2025-08-14 | Towards Frame-level Quality Predictions of Synthetic Speech | Michael Kuhlmann et.al. | 2508.10374 | null |
| 2025-08-08 | LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data | Ali Zolnour et.al. | 2508.10027 | null |
| 2025-08-15 | Training-Free Multimodal Large Language Model Orchestration | Tianyu Xie et.al. | 2508.10016 | null |
| 2025-08-13 | Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions | Tina Raissi et.al. | 2508.09868 | null |
| 2025-08-13 | UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech | Shuhei Kato et.al. | 2508.09767 | null |
| 2025-08-13 | | Boyu Zhu et.al. | 2508.09702 | null |
| 2025-08-12 | Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative | Xi Xuan et.al. | 2508.09294 | null |
| 2025-08-13 | DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models | Yuanyuan Wang et.al. | 2508.08961 | null |
| 2025-08-12 | QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems | Chien-Chun Wang et.al. | 2508.08957 | null |
| 2025-08-15 | MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs | Xiaoxue Gao et.al. | 2508.08715 | null |
| 2025-08-12 | Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization | Chaoqun Cui et.al. | 2508.08550 | null |
| 2025-08-11 | Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder? | Hui-Peng Du et.al. | 2508.07711 | null |
| 2025-08-10 | Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance | Wenqian Cui et.al. | 2508.07375 | link |
| 2025-08-10 | KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features | Ivan Kukanov et.al. | 2508.07337 | null |
| 2025-08-12 | XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation | Tianlun Zuo et.al. | 2508.07302 | null |
| 2025-08-09 | Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody | Jinsung Yoon et.al. | 2508.06890 | null |
| 2025-08-09 | Text to Speech System for Meitei Mayek Script | Gangular Singh Irengbam et.al. | 2508.06870 | null |
| 2025-08-08 | ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls | Sanket Badhe et.al. | 2508.06457 | null |
| 2025-08-08 | Improved Dysarthric Speech to Text Conversion via TTS Personalization | Péter Mihajlik et.al. | 2508.06391 | null |
| 2025-08-08 | Large Language Model Data Generation for Enhanced Intent Recognition in German Speech | Theresa Pekarek Rosin et.al. | 2508.06277 | null |
| 2025-08-08 | Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis | Wenjie Tian et.al. | 2508.06262 | null |
| 2025-08-07 | A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding | Runchuan Ye et.al. | 2508.05385 | null |
| 2025-08-15 | Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS | M Anuprabha et.al. | 2508.05102 | null |
| 2025-08-06 | Root Cause Analysis Training for Healthcare Professionals With AI-Powered Virtual Simulation: A Proof-of-Concept | Yuqi Hu et.al. | 2508.04904 | null |
| 2025-08-05 | Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS | Vignesh Ethiraj et.al. | 2508.04721 | null |
| 2025-08-07 | UniTalker: Conversational Speech-Visual Synthesis | Yifan Hu et.al. | 2508.04585 | null |
| 2025-08-06 | NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations | Huan Liao et.al. | 2508.04195 | null |
| 2025-08-06 | Multilingual Source Tracing of Speech Deepfakes: A First Benchmark | Xi Xuan et.al. | 2508.04143 | null |
| 2025-08-06 | Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech | Jingyuan Xing et.al. | 2508.04141 | null |
| 2025-08-06 | EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering | Tianxin Xie et.al. | 2508.03543 | null |
| 2025-08-05 | MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction | Mohammed Salah Al-Radhi et.al. | 2508.03166 | link |
| 2025-08-05 | Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback | Jingyi Chen et.al. | 2508.03123 | null |
| 2025-08-14 | Marco-Voice Technical Report | Fengping Tian et.al. | 2508.02038 | null |
| 2025-08-03 | Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder | Runxuan Yang et.al. | 2508.01796 | null |
| 2025-08-03 | Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe | Tiantian Feng et.al. | 2508.01691 | null |
| 2025-08-01 | Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities | Wen-Chin Huang et.al. | 2508.00317 | null |
| 2025-08-01 | Next Tokens Denoising for Speech Synthesis | Yanqing Liu et.al. | 2507.22746 | null |
| 2025-07-30 | Adaptive Duration Model for Text Speech Alignment | Junjie Cao et.al. | 2507.22612 | null |
| 2025-07-29 | SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods | Wen Huang et.al. | 2507.21463 | null |
| 2025-07-23 | WaveVerify: A Novel Audio Watermarking Framework for Media Authentication and Combatting Deepfakes | Aditya Pujari et.al. | 2507.21150 | null |
| 2025-07-22 | TTS-1 Technical Report | Oleg Atamanenko et.al. | 2507.21138 | null |
| 2025-07-29 | JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 | Xinhan Di et.al. | 2507.20987 | null |
| 2025-07-28 | AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations | Zhixi Cai et.al. | 2507.20579 | null |
| 2025-07-27 | Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech | Taesoo Kim et.al. | 2507.20140 | null |
| 2025-07-26 | Defining ethically sourced code generation | Zhuolin Xu et.al. | 2507.19743 | null |
| 2025-07-25 | GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness | Hongjie Chen et.al. | 2507.18119 | null |
| 2025-07-24 | Synthetic Data Generation for Phrase Break Prediction with Large Language Model | Hoyeon Lee et.al. | 2507.18044 | null |
| 2025-07-23 | AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer | Danny D. Leybzon et.al. | 2507.17718 | null |
| 2025-07-23 | Synthetic Voice Data for Automatic Speech Recognition in African Languages | Brian DeRenzi et.al. | 2507.17578 | null |
| 2025-07-23 | BoSS: Beyond-Semantic Speech | Qing Wang et.al. | 2507.17563 | null |
| 2025-07-27 | Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice | Shanbo Cheng et.al. | 2507.17527 | null |
| 2025-07-22 | SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling | Yi Guo et.al. | 2507.16884 | null |
| 2025-07-22 | Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages | Isha Pandey et.al. | 2507.16875 | null |
| 2025-07-15 | Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems | Nima Yazdani et.al. | 2507.16835 | null |
| 2025-07-21 | A2TTS: TTS for Low Resource Indian Languages | Ayush Singh Bhadoriya et.al. | 2507.15272 | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et.al. | 2507.15221 | null |
| 2025-07-22 | Hear Your Code Fail, Voice-Assisted Debugging for Python | Sayed Mahbub Hasan Amiri et.al. | 2507.15007 | null |
| 2025-07-20 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis | Yinghao Aaron Li et.al. | 2507.14988 | null |
| 2025-07-20 | FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing | Shoutao Guo et.al. | 2507.14815 | null |
| 2025-07-17 | A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models | Kirill Borodin et.al. | 2507.13563 | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et.al. | 2507.13155 | null |
| 2025-07-17 | Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication | Tianyu Song et.al. | 2507.13052 | null |
| 2025-07-17 | Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes | Zhou Feng et.al. | 2507.12932 | null |
| 2025-07-16 | Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations | Yichen Han et.al. | 2507.12197 | null |
| 2025-07-16 | EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis | Haoxun Li et.al. | 2507.12015 | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et.al. | 2507.11777 | null |
| 2025-07-25 | P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge | Marvin Sach et.al. | 2507.11306 | null |
| 2025-07-20 | Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition | Mengzhe Geng et.al. | 2507.10827 | null |
| 2025-07-14 | An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments | Mikko Korkiakoski et.al. | 2507.10469 | null |
| 2025-07-14 | DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis | Wenjie Tian et.al. | 2507.10109 | null |
| 2025-07-12 | ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching | Han Zhu et.al. | 2507.09318 | null |
| 2025-07-12 | Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Dominika Woszczyk et.al. | 2507.09310 | null |
| 2025-07-12 | ClaritySpeech: Dementia Obfuscation in Speech | Dominika Woszczyk et.al. | 2507.09282 | link |
| 2025-07-19 | Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition | Bingshen Mu et.al. | 2507.09116 | null |
| 2025-07-11 | SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment | Shivam Mehta et.al. | 2507.09070 | null |
| 2025-07-11 | Exploiting Leaderboards for Large-Scale Distribution of Malicious Models | Anshuman Suri et.al. | 2507.08983 | null |
| 2025-07-06 | A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting | Niranjan Mallikarjun Sindhur et.al. | 2507.08832 | null |
| 2025-07-11 | Unlocking Speech Instruction Data Potential with Query Rewriting | Yonghua Hei et.al. | 2507.08603 | null |
| 2025-07-11 | MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling | Jingjing Tang et.al. | 2507.08530 | null |
| 2025-07-11 | Active Learning for Text-to-Speech Synthesis with Informative Sample Collection | Kentaro Seki et.al. | 2507.08319 | null |
| 2025-07-05 | RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning | Atli Sigurgeirsson et.al. | 2507.08012 | null |
| 2025-07-10 | SecureSpeech: Prompt-based Speaker and Content Protection | Belinda Soh Hui Hui et.al. | 2507.07799 | null |
| 2025-07-09 | STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation | Wenxiang Guo et.al. | 2507.06670 | null |
| 2025-07-09 | Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents | Zackary Rackauckas et.al. | 2507.06483 | null |
| 2025-07-08 | Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis | Xintong Hu et.al. | 2507.06116 | null |
| 2025-07-08 | Differentiable Reward Optimization for LLM based TTS system | Changfeng Gao et.al. | 2507.05911 | null |
| 2025-07-08 | OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model | Chen Wang et.al. | 2507.05177 | null |
| 2025-07-07 | LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning | Sandipan Dhar et.al. | 2507.04966 | null |
| 2025-07-07 | Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis | Sho Inoue et.al. | 2507.04598 | null |
| 2025-07-06 | TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet | Jaeseok Jeong et.al. | 2507.04349 | null |
| 2025-07-05 | PresentAgent: Multimodal Agent for Presentation Video Generation | Jingwei Shi et.al. | 2507.04036 | null |
| 2025-07-05 | Prosody Labeling with Phoneme-BERT and Speech Foundation Models | Tomoki Koriyama et.al. | 2507.03912 | null |
| 2025-07-05 | Traceable TTS: Toward Watermark-Free TTS with Strong Traceability | Yuxiang Zhao et.al. | 2507.03887 | null |
| 2025-07-14 | DeepGesture: A conversational gesture synthesis system based on emotions and semantics | Thanh Hoang-Minh et.al. | 2507.03147 | null |
| 2025-07-03 | De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks | Wei Fan et.al. | 2507.02606 | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et.al. | 2507.02530 | null |
| 2025-07-03 | JoyTTS: LLM-based Spoken Chatbot With Voice Cloning | Fangru Zhou et.al. | 2507.02380 | null |
| 2025-07-02 | Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Marc-André Carbonneau et.al. | 2507.02176 | null |
| 2025-07-04 | Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams | Zirui Li et.al. | 2507.02115 | null |
| 2025-07-02 | A Dataset for Automatic Assessment of TTS Quality in Spanish | Alejandro Sosa Welford et.al. | 2507.01805 | link |
| 2025-07-02 | Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora | Hitoshi Suda et.al. | 2507.01356 | null |
| 2025-07-08 | SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech | Zhuangfei Cheng et.al. | 2507.01348 | null |
| 2025-07-02 | Multi-interaction TTS toward professional recording reproduction | Hiroki Kanagawa et.al. | 2507.00808 | null |
| 2025-07-01 | MuteSwap: Silent Face-based Voice Conversion | Yifan Liu et.al. | 2507.00498 | null |
| 2025-06-30 | Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges | Hashim Ali et.al. | 2507.00324 | null |
| 2025-06-30 | Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis | Paul Mayer et.al. | 2507.00227 | null |
| 2025-07-01 | StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding | Dake Guo et.al. | 2506.23986 | null |
| 2025-06-30 | Efficient Interleaved Speech Modeling through Knowledge Distillation | Mohammadmahdi Nouriborji et.al. | 2506.23670 | null |
| 2025-06-30 | JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching | Mingi Kwon et.al. | 2506.23552 | null |
| 2025-06-29 | You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties | Paige Tuttösí et.al. | 2506.23367 | null |
| 2025-06-27 | DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding | Yang Yang et.al. | 2506.22362 | null |
| 2025-06-27 | Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration | Noora Sassali et.al. | 2506.22116 | null |
| 2025-06-27 | Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy | Bohan Li et.al. | 2506.22023 | null |
| 2025-06-23 | IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech | Siyi Zhou et.al. | 2506.21619 | null |
| 2025-06-26 | SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture | Kehan Sui et.al. | 2506.21478 | null |
| 2025-06-26 | A Multi-Stage Framework for Multimodal Controllable Speech Synthesis | Rui Niu et.al. | 2506.20945 | null |
| 2025-06-25 | An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS | Marie Kunešová et.al. | 2506.20190 | null |
| 2025-06-24 | TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems | Christoph Minixhofer et.al. | 2506.19441 | null |
| 2025-06-23 | Selecting N-lowest scores for training MOS prediction models | Yuto Kondo et.al. | 2506.18326 | null |
| 2025-06-23 | Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting | Yuto Kondo et.al. | 2506.18307 | null |
| 2025-06-23 | JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles | Yuto Kondo et.al. | 2506.18296 | null |
| 2025-06-21 | OpusLM: A Family of Open Unified Speech Language Models | Jinchuan Tian et.al. | 2506.17611 | null |
| 2025-06-20 | RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Hyun Joon Park et.al. | 2506.16741 | null |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et.al. | 2506.16738 | null |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Qixin Wang et.al. | 2506.16716 | null |
| 2025-06-19 | Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement | Tuan-Nam Nguyen et.al. | 2506.16580 | null |
| 2025-06-19 | InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Kexin Huang et.al. | 2506.16381 | link |
| 2025-06-19 | Optimizing Multilingual Text-To-Speech with Accents & Emotions | Pranav Pawar et.al. | 2506.16310 | null |
| 2025-06-19 | Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching | Shoutrik Das et.al. | 2506.16127 | null |
| 2025-06-19 | VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge | Zijing Zhao et.al. | 2506.16020 | null |
| 2025-06-18 | TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data | Kentaro Seki et.al. | 2506.15614 | null |
| 2025-06-18 | PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction | Shufan Li et.al. | 2506.15556 | null |
| 2025-06-18 | Factorized RVQ-GAN For Disentangled Speech Tokenization | Sameer Khurana et.al. | 2506.15456 | null |
| 2025-06-18 | EmojiVoice: Towards long-term controllable expressivity in robot speech | Paige Tuttösí et.al. | 2506.15085 | null |
| 2025-06-18 | An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW | Prateek Mehta et.al. | 2506.15029 | null |
| 2025-06-25 | SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling | Tawsif Ahmed et.al. | 2506.14293 | null |
| 2025-06-17 | Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification | Yiyang Zhao et.al. | 2506.14226 | null |
| 2025-06-17 | Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models | Tuan Dat Phuong et.al. | 2506.14153 | link |
| 2025-06-16 | EmoNews: A Spoken Dialogue System for Expressive News Conversations | Ryuki Matsuura et.al. | 2506.13894 | link |
| 2025-06-16 | From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars | Pegah Salehi et.al. | 2506.13477 | null |
| 2025-06-20 | ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching | Han Zhu et.al. | 2506.13053 | link |
| 2025-06-14 | StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling | Hui Wang et.al. | 2506.12570 | null |
| 2025-06-14 | Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | Xiaoran Fan et.al. | 2506.12537 | null |
| 2025-06-14 | Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech | Yakov Kolani et.al. | 2506.12311 | null |
| 2025-06-11 | S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder | Yu Pan et.al. | 2506.11160 | null |
| 2025-06-16 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data | Cheng-Kang Chou et.al. | 2506.11130 | null |
| 2025-06-10 | GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions | Wenkang Han et.al. | 2506.11127 | null |
| 2025-06-10 | ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams | Freddie Grabovski et.al. | 2506.11125 | null |
| 2025-06-05 | Intelligibility of Text-to-Speech Systems for Mathematical Expressions | Sujoy Roychowdhury et.al. | 2506.11086 | null |
| 2025-06-12 | Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs | Hayato Futami et.al. | 2506.10299 | null |
| 2025-06-06 | A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations | Tian Lan et.al. | 2506.10019 | null |
| 2025-06-11 | UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching | Neta Glazer et.al. | 2506.09874 | null |
| 2025-06-15 | EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection | Christoph Schuhmann et.al. | 2506.09827 | null |
| 2025-06-11 | OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Chao-Hong Tan et.al. | 2506.09349 | link |
| 2025-06-11 | Ming-Omni: A Unified Multimodal Model for Perception and Generation | Inclusion AI et.al. | 2506.09344 | link |
| 2025-06-13 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | Ailin Huang et.al. | 2506.08967 | null |
| 2025-06-10 | A Review on Score-based Generative Models for Audio Applications | Ge Zhu et.al. | 2506.08457 | null |
| 2025-06-09 | Seeing Voices: Generating A-Roll Video from Audio with Mirage | Aditi Sundararaman et.al. | 2506.08279 | null |
| 2025-06-09 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Rui Hu et.al. | 2506.07646 | null |
| 2025-06-10 | Towards Generalized Source Tracing for Codec-Based Deepfake Speech | Xuanjun Chen et.al. | 2506.07294 | null |
| 2025-06-07 | SynHate: Detecting Hate Speech in Synthetic Deepfake Audio | Rishabh Ranjan et.al. | 2506.06772 | null |
| 2025-06-06 | Audio-Aware Large Language Models as Judges for Speaking Styles | Cheng-Han Chiang et.al. | 2506.05984 | null |
| 2025-06-09 | Voice Impression Control in Zero-Shot TTS | Keinichi Fujita et.al. | 2506.05688 | null |
| 2025-06-05 | Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning | Hien Ohnaka et.al. | 2506.04527 | null |
| 2025-06-04 | Can we reconstruct a dysarthric voice with the large speech model Parler TTS? | Ariadna Sanchez et.al. | 2506.04397 | null |
| 2025-06-04 | HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset | Ryan Langman et.al. | 2506.04152 | null |
| 2025-06-04 | UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jinting Wang et.al. | 2506.04134 | null |
| 2025-06-04 | A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Chung-Chun Wang et.al. | 2506.04077 | null |
| 2025-06-04 | Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages | Utkarsh Pathak et.al. | 2506.03884 | null |
| 2025-06-04 | Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts | Sidharth Pulipaka et.al. | 2506.03793 | null |
| 2025-06-04 | Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments | Reo Yoneyama et.al. | 2506.03554 | null |
| 2025-06-04 | BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing | Masaya Kawamura et.al. | 2506.03515 | null |
| 2025-06-03 | Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation | Yongqi Wang et.al. | 2506.02997 | null |
| 2025-06-03 | Towards a Japanese Full-duplex Spoken Dialogue System | Atsumoto Ohashi et.al. | 2506.02979 | null |
| 2025-06-03 | PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing | You Zhang et.al. | 2506.02958 | null |
| 2025-06-03 | CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Helin Wang et.al. | 2506.02863 | link |
| 2025-06-03 | Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Xiaoxue Gao et.al. | 2506.02742 | null |
| 2025-06-03 | StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion | Fengjin Li et.al. | 2506.02414 | null |
| 2025-06-03 | SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning | Zhengyuan Liu et.al. | 2506.02412 | null |
| 2025-06-03 | Trusted Fake Audio Detection Based on Dirichlet Distribution | Chi Ding et.al. | 2506.02401 | null |
| 2025-06-02 | Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi | Arnav Rustagi et.al. | 2506.02166 | null |
| 2025-06-02 | SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | Saurabh Agrawal et.al. | 2506.02082 | null |
| 2025-06-02 | Universal Preference-Score-based Pairwise Speech Quality Assessment | Yu-Fei Shi et.al. | 2506.01455 | null |
| 2025-06-02 | Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages | Andrei Popescu-Belis et.al. | 2506.01406 | null |
| 2025-06-02 | Zero-Shot Text-to-Speech for Vietnamese | Thi Vu et.al. | 2506.01322 | null |
| 2025-06-02 | CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction | Yudong Lu et.al. | 2506.01268 | null |
| 2025-06-02 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Yu Nakagome et.al. | 2506.01263 | null |
| 2025-06-01 | Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations | Girish et.al. | 2506.01157 | null |
| 2025-06-01 | DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation | Ming Meng et.al. | 2506.01020 | null |
| 2025-06-01 | Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching | Jialong Zuo et.al. | 2506.01014 | null |
| 2025-06-01 | CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching | Leying Zhang et.al. | 2506.00885 | null |
| 2025-06-01 | Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models | Kyowoon Lee et.al. | 2506.00832 | null |
| 2025-05-30 | ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation | Jiatong Shi et.al. | 2505.24518 | null |
| 2025-05-30 | Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation | Wenrui Liu et.al. | 2505.24496 | null |
| 2025-05-30 | DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec | Peijie Chen et.al. | 2505.24314 | null |
| 2025-05-29 | Can Emotion Fool Anti-spoofing? | Aurosweta Mahapatra et.al. | 2505.23962 | null |
| 2025-05-29 | Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes | Neta Glazer et.al. | 2505.23619 | link |
| 2025-05-29 | EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge | Ruskin Raj Manku et.al. | 2505.23009 | link |
| 2025-05-29 | LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting | Pai Zhu et.al. | 2505.22995 | null |
| 2025-05-28 | BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | Susan Liang et.al. | 2505.22865 | null |
| 2025-05-28 | Tell me Habibi, is it Real or Fake? | Kartik Kuckreja et.al. | 2505.22581 | null |
| 2025-05-28 | A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity | Charlotte Pouw et.al. | 2505.22236 | null |
| 2025-05-27 | Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | Nam-Gyu Kim et.al. | 2505.20868 | null |
| 2025-05-26 | ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis | Hawau Olamide Toyin et.al. | 2505.20506 | null |
| 2025-05-26 | Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling | Qixi Zheng et.al. | 2505.19931 | null |
| 2025-05-26 | DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | Deok-Hyeon Cho et.al. | 2505.19687 | link |
| 2025-05-26 | KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Zhaolin Li et.al. | 2505.19679 | null |
| 2025-06-02 | Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | Haiyang Sun et.al. | 2505.19669 | null |
| 2025-05-30 | Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment | Jeongsoo Choi et.al. | 2505.19595 | link |
| 2025-05-26 | GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor | Seokgi Lee et.al. | 2505.19384 | null |
| 2025-05-25 | SpeakStream: Streaming Text-to-Speech with Interleaved Data | Richard He Bai et.al. | 2505.19206 | null |
| 2025-05-25 | CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning | Renyuan Li et.al. | 2505.19119 | null |
| 2025-05-25 | Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis | Minsu Kim et.al. | 2505.18972 | null |
| 2025-05-27 | RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations | Ashwin Sankar et.al. | 2505.18609 | null |
| 2025-05-24 | MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt | Zhichao Wu et.al. | 2505.18453 | null |
| 2025-05-27 | CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Zhihao Du et.al. | 2505.17589 | null |
| 2025-05-23 | What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection | Binh Nguyen et.al. | 2505.17513 | null |
| 2025-05-23 | UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information | Rui Wang et.al. | 2505.17426 | link |
| 2025-05-23 | Speechless: Speech Instruction Training Without Speech for Low Resource Languages | Alan Dao et.al. | 2505.17417 | link |
| 2025-05-22 | Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | Zackary Rackauckas et.al. | 2505.17320 | null |
| 2025-05-21 | Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech | Yejin Lee et.al. | 2505.17093 | null |
| 2025-05-20 | Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English | Haoyang Zhang et.al. | 2505.17076 | null |
| 2025-05-22 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | Tianduo Wang et.al. | 2505.16972 | link |
| 2025-05-22 | MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing | Junjie Zheng et.al. | 2505.16279 | null |
| 2025-05-21 | MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | Yifan Cheng et.al. | 2505.15772 | null |
| 2025-05-21 | Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information | Nicholas Sanders et.al. | 2505.15667 | null |
| 2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et.al. | 2505.15406 | link |
| 2025-05-21 | Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning | Junchuan Zhao et.al. | 2505.15402 | null |
| 2025-05-21 | Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding | Zijian Lin et.al. | 2505.15380 | null |
| 2025-05-20 | TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis | Yu Zhang et.al. | 2505.14910 | link |
| 2025-05-20 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits | Tiantian Feng et.al. | 2505.14648 | link |
| 2025-05-20 | Pairwise Evaluation of Accent Similarity in Speech Synthesis | Jinzuomu Zhong et.al. | 2505.14410 | null |
| 2025-05-20 | FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et.al. | 2505.14351 | null |
| 2025-05-21 | AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | Guangke Chen et.al. | 2505.14103 | null |
| 2025-05-20 | SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement | Kuan-Yu Chen et.al. | 2505.14066 | null |
| 2025-05-23 | U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding | Ziqian Wang et.al. | 2505.13880 | link |
| 2025-05-22 | Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising | Ye-Xin Lu et.al. | 2505.13830 | null |
| 2025-05-20 | Articulatory Feature Prediction from Surface EMG during Speech Production | Jihwan Lee et.al. | 2505.13814 | null |
| 2025-05-19 | Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space | Zhengrui Ma et.al. | 2505.13181 | link |
| 2025-05-19 | DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation | Jiaqi Li et.al. | 2505.13000 | link |
| 2025-05-19 | Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy | Xuanjun Chen et.al. | 2505.12994 | link |
| 2025-05-19 | OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | Hieu-Nghia Huynh-Nguyen et.al. | 2505.12800 | null |
| 2025-05-19 | RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations | Seungmin Kim et.al. | 2505.12686 | null |
| 2025-05-19 | Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis | Yifan Hu et.al. | 2505.12597 | link |
| 2025-05-18 | Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | Dong Yang et.al. | 2505.12226 | null |
| 2025-05-16 | LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models | Danilo de Oliveira et.al. | 2505.11391 | null |
| 2025-05-16 | Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | Xihuai Wang et.al. | 2505.11200 | null |
| 2025-05-16 | BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset | Istiaq Ahmed Fahad et.al. | 2505.10885 | link |
| 2025-05-15 | UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | Jiaxuan Liu et.al. | 2505.10599 | null |
| 2025-05-14 | SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset | Yicheng Gu et.al. | 2505.09325 | null |
| 2025-05-14 | DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis | Zeeshan Ahmad et.al. | 2505.09091 | null |
| 2025-05-13 | Investigating self-supervised features for expressive, multilingual voice conversion | Álvaro Martín-Cortinas et.al. | 2505.08278 | null |
| 2025-05-12 | MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder | Bowen Zhang et.al. | 2505.07916 | null |
| 2025-05-12 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino et.al. | 2505.07701 | null |
| 2025-05-10 | VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback | Eason Chen et.al. | 2505.06676 | null |
| 2025-05-10 | Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation | Abbas Bertina et.al. | 2505.06599 | null |
| 2025-05-15 | FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech | Linhan Ma et.al. | 2505.05159 | null |
| 2025-05-08 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | Linrong Pan et.al. | 2505.05056 | null |
| 2025-05-08 | A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration | Shaja Arul Selvamani et.al. | 2505.04885 | null |
| 2025-05-07 | Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment | Xueyao Zhang et.al. | 2505.04113 | null |
| 2025-05-06 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Zuwei Long et.al. | 2505.03739 | link |
| 2025-05-13 | SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation | Yu-Ren Guo et.al. | 2505.03244 | null |
| 2025-05-05 | Generating Narrated Lecture Videos from Slides with Synchronized Highlights | Alexander Holmberg et.al. | 2505.02966 | null |
| 2025-05-05 | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Yemin Shi et.al. | 2505.02707 | link |
| 2025-05-05 | LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis | Qingkai Fang et.al. | 2505.02625 | link |
| 2025-04-30 | Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks | Chaoyi Wang et.al. | 2505.01450 | null |
| 2025-04-30 | Sadeed: Advancing Arabic Diacritization Through Small Language Model | Zeina Aldallal et.al. | 2504.21635 | null |
| 2025-04-29 | AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation | Jeongsoo Choi et.al. | 2504.20629 | null |
| 2025-04-29 | ClonEval: An Open Voice Cloning Benchmark | Iwona Christop et.al. | 2504.20581 | link |
| 2025-05-02 | Towards Flow-Matching-based TTS without Classifier-Free Guidance | Yuzhe Liang et.al. | 2504.20334 | null |
| 2025-04-27 | Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements | Sandipan Dhar et.al. | 2504.19197 | null |
| 2025-04-27 | Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget | Xin Li et.al. | 2504.19146 | link |
| 2025-04-22 | FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning | Ju Yeon Kang et.al. | 2504.15663 | null |
| 2025-04-22 | A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Gengxian Cao et.al. | 2504.15552 | null |
| 2025-04-21 | SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation | Yue Li et.al. | 2504.15035 | null |
| 2025-04-20 | DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue | Xiang Li et.al. | 2504.14482 | link |
| 2025-04-18 | ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents | Takuya Sera et.al. | 2504.13793 | null |
| 2025-04-18 | Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion | Sandipan Dhar et.al. | 2504.13791 | null |
| 2025-04-22 | EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting | Guanrou Yang et.al. | 2504.12867 | null |
| 2025-04-15 | GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture | Yaodong Song et.al. | 2504.12339 | null |
| 2025-04-15 | Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation | Yan Rong et.al. | 2504.11002 | null |
| 2025-04-15 | Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy | Botao Zhao et.al. | 2504.10819 | null |
| 2025-04-14 | Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Yifan Yang et.al. | 2504.10352 | null |
| 2025-04-14 | AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Dan Luo et.al. | 2504.10309 | link |
| 2025-04-14 | SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis | Zhisheng Zhang et.al. | 2504.09839 | link |
| 2025-04-12 | "It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services | Shira Michel et.al. | 2504.09346 | null |
| 2025-04-12 | AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis | Yubing Cao et.al. | 2504.09225 | null |
| 2025-04-17 | SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning | Prabhat Pandey et.al. | 2504.09081 | null |
| 2025-04-11 | Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation | Haowei Lou et.al. | 2504.08274 | null |
| 2025-04-10 | Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Yizhong Geng et.al. | 2504.07858 | null |
| 2025-04-10 | SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow | Kaidi Wang et.al. | 2504.07776 | null |
| 2025-04-08 | AVENet: Disentangling Features by Approximating Average Features for Voice Conversion | Wenyu Wang et.al. | 2504.05833 | null |
| 2025-04-07 | P2Mark: Plug-and-play Parameter-intrinsic Watermarking for Neural Speech Generation | Yong Ren et.al. | 2504.05197 | null |
| 2025-04-07 | SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation | Stephen Brade et.al. | 2504.05106 | null |
| 2025-04-04 | RWKVTTS: Yet another TTS based on RWKV-7 | Lin yueyu et.al. | 2504.03289 | link |
| 2025-04-09 | F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization | Xiaohui Sun et.al. | 2504.02407 | link |
| 2025-04-03 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models | Kim Sung-Bin et.al. | 2504.02386 | null |
| 2025-03-31 | SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation | Ngoc Dung Huynh et.al. | 2503.24164 | null |
| 2025-04-02 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Zhiming Ma et.al. | 2503.24115 | link |
| 2025-03-31 | SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development | Minghan Wang et.al. | 2503.23848 | link |
| 2025-03-31 | DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance | Junjie Zheng et.al. | 2503.23660 | null |
| 2025-03-30 | Speculative End-Turn Detector for Efficient Speech Chatbot Assistant | Hyunjong Ok et.al. | 2503.23439 | null |
| 2025-03-29 | SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System | Hyeongju Kim et.al. | 2503.23108 | null |
| 2025-03-26 | Dual Audio-Centric Modality Coupling for Talking Head Generation | Ao Fu et.al. | 2503.22728 | null |
| 2025-03-28 | Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators | Andrew Ustinov et.al. | 2503.22503 | link |
| 2025-03-28 | DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation | Haomin Zhang et.al. | 2503.22265 | null |
| 2025-03-26 | Text-Driven Voice Conversion via Latent State-Space Modeling | Wen Li et.al. | 2503.20999 | null |
| 2025-03-28 | FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System | Hao-Han Guo et.al. | 2503.20499 | null |
| 2025-03-26 | Qwen2.5-Omni Technical Report | Jin Xu et.al. | 2503.20215 | null |
| 2025-03-21 | Measuring the Robustness of Audio Deepfake Detectors | Xiang Li et.al. | 2503.17577 | link |
| 2025-03-21 | Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Yiwen Xu et.al. | 2503.17479 | null |
| 2025-03-21 | From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Ji-Hoon Kim et.al. | 2503.16956 | null |
| 2025-03-20 | WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching | Tianze Luo et.al. | 2503.16689 | link |
| 2025-03-10 | VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection | Kunal Chavan et.al. | 2503.16488 | null |
| 2025-03-19 | Shushing! Let's Imagine an Authentic Speech from the Silent Video | Jiaxin Ye et.al. | 2503.14928 | null |
| 2025-03-19 | MoonCast: High-Quality Zero-Shot Podcast Generation | Zeqian Ju et.al. | 2503.14345 | link |
| 2025-03-26 | InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being | Guang Dai et.al. | 2503.14257 | null |
| 2025-03-15 | Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations | Xue Jiang et.al. | 2503.12115 | null |
| 2025-03-14 | MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation | Sungwoo Cho et.al. | 2503.11026 | null |
| 2025-03-14 | Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models | Sebastian Möller et.al. | 2503.10298 | null |
| 2025-03-11 | An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Sewade Ogun et.al. | 2503.08954 | null |
| 2025-03-09 | ProSE: Diffusion Priors for Speech Enhancement | Sonal Kumar et.al. | 2503.06375 | null |
| 2025-03-07 | DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility | Yifan Liu et.al. | 2503.05223 | link |
| 2025-03-03 | Direct Speech to Speech Translation: A Review | Mohammad Sarim et.al. | 2503.04799 | null |
| 2025-03-06 | LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM | Sambal Shikhar et.al. | 2503.04724 | null |
| 2025-03-06 | Scaling Rich Style-Prompted Text-to-Speech Datasets | Anuj Diwan et.al. | 2503.04713 | link |
| 2025-03-05 | Good practices for evaluation of synthesized speech | Erica Cooper et.al. | 2503.03250 | null |
| 2025-03-04 | InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training | Dingdong Wang et.al. | 2503.02769 | null |
| 2025-03-03 | Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens | Xinsheng Wang et.al. | 2503.01710 | link |
| 2025-03-03 | Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology | Birger Moell et.al. | 2503.01266 | null |
| 2025-03-02 | Language-agnostic, automated assessment of listeners' speech recall using large language models | Björn Herrmann et.al. | 2503.01045 | null |
| 2025-03-02 | UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Alexander H. Liu et.al. | 2503.00733 | null |
| 2025-03-01 | PodAgent: A Comprehensive Framework for Podcast Generation | Yujia Xiao et.al. | 2503.00455 | link |
| 2025-03-12 | Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale | Max M. Lang et.al. | 2502.20140 | null |
| 2025-02-27 | DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models | Weihao Wu et.al. | 2502.19924 | null |
| 2025-03-04 | Speculative Decoding and Beyond: An In-Depth Survey of Techniques | Yunhai Hu et.al. | 2502.19732 | null |
| 2025-02-26 | Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | Ziyue Jiang et.al. | 2502.18924 | null |
| 2025-03-08 | Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding | Tianyun Liu et.al. | 2502.18889 | null |
| 2025-02-24 | Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction | Tianpeng Li et.al. | 2502.17239 | link |
| 2025-02-24 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi et.al. | 2502.16897 | null |
| 2025-02-18 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions | Aggelina Chatziagapi et.al. | 2502.13133 | null |
| 2025-02-18 | High-Fidelity Music Vocoder using Neural Audio Codecs | Luca A. Lanzendörfer et.al. | 2502.12759 | null |
| 2025-02-18 | TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching | Wenxiang Guo et.al. | 2502.12572 | link |
| 2025-02-18 | A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond | Shreya Shukla et.al. | 2502.12048 | null |
| 2025-02-17 | NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Yifan Liang et.al. | 2502.12002 | null |
| 2025-02-16 | FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching | Hui Wang et.al. | 2502.11128 | null |
| 2025-02-16 | SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer | Zhengyan Sheng et.al. | 2502.11094 | null |
| 2025-02-14 | VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect | Qingyuan Fei et.al. | 2502.10329 | null |
| 2025-02-13 | TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Kyungsu Kim et.al. | 2502.08939 | link |
| 2025-03-02 | ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech | Xin Wang et.al. | 2502.08857 | null |
| 2025-02-11 | LoRP-TTS: Low-Rank Personalized Text-To-Speech | Łukasz Bondaruk et.al. | 2502.07562 | null |
| 2025-02-11 | Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction | Leying Zhang et.al. | 2502.07345 | null |
| 2025-02-11 | Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Xueyao Zhang et.al. | 2502.07243 | null |
| 2025-02-10 | Synthetic Audio Helps for Cognitive State Tasks | Adil Soubki et.al. | 2502.06922 | link |
| 2025-02-16 | Recent Advances in Discrete Speech Tokens: A Review | Yiwei Guo et.al. | 2502.06490 | null |
| 2025-02-19 | Speech to Speech Translation with Translatotron: A State of the Art Review | Jules R. Kala et.al. | 2502.05980 | null |
| 2025-02-09 | Non-invasive electromyographic speech neuroprosthesis: a geometric perspective | Harshavardhana T. Gowda et.al. | 2502.05762 | null |
| 2025-02-09 | BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting | Mohammad Jahid Ibna Basher et.al. | 2502.05729 | null |
| 2025-02-08 | Gender Bias in Instruction-Guided Speech Synthesis Models | Chun-Yi Kuan et.al. | 2502.05649 | null |
| 2025-02-08 | IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System | Wei Deng et.al. | 2502.05512 | link |
| 2025-02-07 | Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance | Shehzeen Hussain et.al. | 2502.05236 | null |
| 2025-02-12 | Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment | Zuyan Liu et.al. | 2502.04328 | link |
| 2025-02-06 | Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis | Zhen Ye et.al. | 2502.04128 | link |
| 2025-02-14 | DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation | Dongya Jia et.al. | 2502.03930 | null |
| 2025-02-05 | Metis: A Foundation Speech Generation Model with Masked Generative Pre-training | Yuancheng Wang et.al. | 2502.03128 | link |
| 2025-02-05 | Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech | Jixun Yao et.al. | 2502.02950 | null |
| 2025-02-04 | Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet | Shenran Wang et.al. | 2502.02703 | link |
| 2025-02-04 | Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation | Peidong Wang et.al. | 2502.02683 | null |
| 2025-02-03 | Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | Weiwei Lin et.al. | 2502.01084 | null |
| 2025-02-02 | EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis | Junuk Cha et.al. | 2502.00654 | null |
| 2025-01-31 | VisualSpeech: Enhance Prosody with Visual Context in TTS | Shumin Que et.al. | 2501.19258 | null |
| 2025-01-29 | BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights | Chan-Jan Hsu et.al. | 2501.17790 | null |
| 2025-02-09 | CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs | Amey Hengle et.al. | 2501.17581 | null |
| 2025-01-28 | Compact Neural TTS Voices for Accessibility | Kunal Jain et.al. | 2501.17332 | null |
| 2025-01-27 | Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation | Haorui He et.al. | 2501.15907 | link |
| 2025-01-26 | Overview of the Amphion Toolkit (v0.2) | Jiaqi Li et.al. | 2501.15442 | link |
| 2025-01-24 | Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models | Tianrui Wang et.al. | 2501.14273 | null |
| 2025-01-24 | Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation | Wen Huang et.al. | 2501.14240 | null |
| 2025-01-24 | LoCoML: A Framework for Real-World ML Inference Pipelines | Kritin Maddireddy et.al. | 2501.14165 | null |
| 2025-01-23 | Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference | Shuqi Dai et.al. | 2501.13870 | null |
| 2025-01-23 | Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement | Jae-Sung Bae et.al. | 2501.13372 | null |
| 2025-01-21 | A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Minh Tran et.al. | 2501.12501 | null |
| 2025-01-20 | A Non-autoregressive Model for Joint STT and TTS | Vishal Sunder et.al. | 2501.09104 | null |
| 2025-01-15 | Speech Synthesis along Perceptual Voice Quality Dimensions | Frederik Rautenberg et.al. | 2501.08791 | null |
| 2025-01-15 | Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification | Li Zhang et.al. | 2501.08691 | null |
| 2025-01-15 | Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement | Qianniu Chen et.al. | 2501.08566 | null |
| 2025-01-14 | CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset | Jiawei Du et.al. | 2501.08238 | null |
| 2025-01-13 | Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech | Bruno Ferenc Šegedin et.al. | 2501.07726 | null |
| 2025-01-19 | MathReader: Text-to-Speech for Mathematical Documents | Sieun Hyeon et.al. | 2501.07088 | link |
| 2025-01-11 | The 1st SpeechWellness Challenge: Detecting Suicidal Risk Among Adolescents | Wen Wu et.al. | 2501.06474 | null |
| 2025-01-11 | Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Rui Liu et.al. | 2501.06467 | link |
| 2025-01-11 | Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation | Zhengyan Sheng et.al. | 2501.06394 | null |
| 2025-01-10 | TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Vladimir Bataev et.al. | 2501.06320 | null |
| 2025-01-10 | MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | Qian Chen et.al. | 2501.06282 | null |
| 2025-01-10 | PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | Shaozuo Zhang et.al. | 2501.06276 | null |
| 2025-01-10 | Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Kishor Kayyar Lakshminarayana et.al. | 2501.05976 | null |
| 2025-01-10 | MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model | Matthew Baas et.al. | 2501.05787 | null |
| 2025-01-09 | Probing Speaker-specific Features in Speaker Representations | Aemon Yat Fei Chiu et.al. | 2501.05310 | null |
| 2025-01-09 | JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis | Jun-Hyeok Cha et.al. | 2501.04904 | null |
| 2025-01-08 | Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model | Sanjana Sankar et.al. | 2501.04799 | null |
| 2025-01-08 | FleSpeech: Flexibly Controllable Speech Generation with Various Prompts | Hanzhao Li et.al. | 2501.04644 | null |
| 2025-01-09 | OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis | Run Luo et.al. | 2501.04561 | link |
| 2025-01-08 | DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions | Weidong Chen et.al. | 2501.04256 | null |
| 2025-01-02 | FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Tian-Hao Zhang et.al. | 2501.03181 | null |
| 2025-01-02 | RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer | Seongho Hong et.al. | 2501.01182 | link |
| 2025-01-02 | Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT | Dongyang Dai et.al. | 2501.01102 | null |
| 2025-01-06 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | null |
| 2024-12-28 | Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Wooseok Han et.al. | 2412.20155 | null |
| 2024-12-26 | "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities | Jiawei Yu et.al. | 2412.19102 | null |
| 2024-12-26 | Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID | Ahmad Alfani Handoyo et.al. | 2412.19043 | null |
| 2024-12-25 | Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset | Neil Shah et.al. | 2412.18839 | null |
| 2024-12-24 | GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing | Wen Ku et.al. | 2412.18300 | null |
| 2024-12-22 | Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective | Hankun Wang et.al. | 2412.17048 | null |
| 2024-12-22 | Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2412.16977 | link |
| 2024-12-22 | Autoregressive Speech Synthesis with Next-Distribution Prediction | Xinfa Zhu et.al. | 2412.16846 | null |
| 2024-12-23 | Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers | Yifan Yang et.al. | 2412.16102 | null |
| 2024-12-19 | Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling | Leying Zhang et.al. | 2412.14890 | null |
| 2024-12-17 | Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge | Mahieyin Rahmun et.al. | 2412.13279 | link |
| 2024-12-17 | Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion | Syed Zohaib Hassan et.al. | 2412.12710 | null |
| 2024-12-17 | Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes | Kuiyuan Zhang et.al. | 2412.12619 | link |
| 2024-12-17 | Hierarchical Control of Emotion Rendering in Speech Synthesis | Sho Inoue et.al. | 2412.12498 | link |
| 2024-12-19 | ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Xiangheng He et.al. | 2412.11795 | null |
| 2024-12-17 | Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech | Rui Liu et.al. | 2412.11409 | link |
| 2024-12-16 | Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Jaehyeon Kim et.al. | 2412.10208 | null |
| 2024-12-13 | AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Xiyuan Gao et.al. | 2412.10103 | null |
| 2024-12-13 | CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | Jianwei Cui et.al. | 2412.08918 | null |
| 2024-12-11 | Multimodal Latent Language Modeling with Next-Token Diffusion | Yutao Sun et.al. | 2412.08635 | link |
| 2024-12-11 | A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Sowmya Cheripally et.al. | 2412.08312 | null |
| 2024-12-11 | A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings | Anindita Mondal et.al. | 2412.08283 | null |
| 2024-12-11 | LatentSpeech: Latent Diffusion for Text-To-Speech Generation | Haowei Lou et.al. | 2412.08117 | null |
| 2024-12-11 | Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration | Haowei Lou et.al. | 2412.08112 | null |
| 2024-12-09 | Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Tianxin Xie et.al. | 2412.06602 | link |
| 2024-12-12 | EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | Weizhen Bian et.al. | 2412.06581 | null |
| 2024-12-01 | Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor | Ashwin Baluja et.al. | 2412.05315 | null |
| 2024-12-04 | DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles | Jiaxuan Liu et.al. | 2412.03388 | null |
| 2024-12-03 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Aohan Zeng et.al. | 2412.02612 | link |
| 2024-11-19 | A Context-Based Numerical Format Prediction for a Text-To-Speech System | Yaser Darwesh et.al. | 2412.00028 | null |
| 2024-11-27 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Geoffrey Tyndall et.al. | 2411.18320 | null |
| 2024-11-27 | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Wenyi Yu et.al. | 2411.18138 | null |
| 2024-11-26 | WavChat: A Survey of Spoken Dialogue Models | Shengpeng Ji et.al. | 2411.13577 | link |
| 2024-12-02 | I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception | Jiawei Zhang et.al. | 2411.13314 | null |
| 2024-11-20 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Jiawei Yu et.al. | 2411.13159 | null |
| 2024-11-19 | Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation | Praveen Srinivasa Varadhan et.al. | 2411.12719 | null |
| 2024-11-19 | Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D | Adithya TG et.al. | 2411.12619 | null |
| 2024-11-18 | ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | Xiao-Hang Jiang et.al. | 2411.11258 | null |
| 2024-11-12 | Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models | Dongrui Han et.al. | 2411.07563 | null |
| 2024-11-11 | Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities | Snehasish Paul et.al. | 2411.06970 | null |
| 2024-11-10 | Debatts: Zero-Shot Debating Text-to-Speech Synthesis | Yiqiao Huang et.al. | 2411.06540 | null |
| 2024-11-07 | CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR | Kadir Burak Buldu et.al. | 2411.04671 | null |
| 2024-11-04 | EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector | Deok-Hyeon Cho et.al. | 2411.02625 | link |
| 2024-11-09 | Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis | Shijia Liao et.al. | 2411.01156 | link |
| 2024-10-31 | Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Ioannis Tsiamas et.al. | 2410.24019 | null |
| 2024-10-30 | Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Théodor Lemerle et.al. | 2410.23320 | link |
| 2024-10-29 | Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech | Eric Battenberg et.al. | 2410.22179 | link |
| 2024-10-29 | Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Bohan Li et.al. | 2410.21951 | null |
| 2024-10-29 | RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis | Kehan Sui et.al. | 2410.21641 | null |
| 2024-10-28 | Asynchronous Tool Usage for Real-Time Agents | Antonio A. Ginart et.al. | 2410.21620 | null |
| 2024-10-28 | Enhancing TTS Stability in Hebrew using Discrete Semantic Units | Ella Zeldes et.al. | 2410.21502 | null |
| 2024-10-28 | Mitigating Unauthorized Speech Synthesis for Voice Protection | Zhisheng Zhang et.al. | 2410.20742 | link |
| 2024-10-27 | Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Maohao Shen et.al. | 2410.20336 | null |
| 2024-10-24 | Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Suparna De et.al. | 2410.19199 | null |
| 2024-10-24 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin et.al. | 2410.18607 | link |
| 2024-10-24 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | ChaeHun Park et.al. | 2410.18444 | null |
| 2024-10-23 | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Srija Anand et.al. | 2410.17901 | null |
| 2024-10-22 | Continuous Speech Tokenizer in Text To Speech | Yixing Li et.al. | 2410.17081 | link |
| 2024-10-22 | Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Guanrou Yang et.al. | 2410.16726 | null |
| 2024-10-21 | Continuous Speech Synthesis using per-token Latent Diffusion | Arnon Turetzky et.al. | 2410.16048 | null |
| 2024-10-18 | A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Sujitha Sathiyamoorthy et.al. | 2410.14197 | null |
| 2024-10-18 | Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech | Shuwei He et.al. | 2410.14101 | link |
| 2024-10-17 | Enhancing Crowdsourced Audio for Text-to-Speech Models | José Giraldo et.al. | 2410.13357 | null |
| 2024-10-17 | DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Jan Melechovsky et.al. | 2410.13342 | null |
| 2024-10-17 | DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Yu Gu et.al. | 2410.13288 | null |
| 2024-10-17 | Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Sreyan Ghosh et.al. | 2410.13198 | null |
| 2024-10-16 | ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs | Rui-Chen Zheng et.al. | 2410.12359 | null |
| 2024-10-14 | IsoChronoMeter: A simple and effective isochronic translation evaluation metric | Nikolai Rozanov et.al. | 2410.11127 | null |
| 2024-10-14 | DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization | Yinghao Aaron Li et.al. | 2410.11097 | null |
| 2024-10-12 | Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling | Rui Liu et.al. | 2410.09524 | null |
| 2024-10-10 | Unsupervised Data Validation Methods for Efficient Model Training | Yurii Paniv et.al. | 2410.07880 | null |
| 2024-10-15 | F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | Yushen Chen et.al. | 2410.06885 | link |
| 2024-10-09 | Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Teodora Răgman et.al. | 2410.06787 | null |
| 2024-10-09 | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar et.al. | 2410.06608 | null |
| 2024-10-09 | Can DeepFake Speech be Reliably Detected? | Hongbin Liu et.al. | 2410.06572 | null |
| 2024-10-07 | SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech | Minchan Kim et.al. | 2410.04690 | null |
| 2024-10-06 | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Yuto Nishimura et.al. | 2410.04380 | null |
| 2024-10-10 | SONAR: A Synthetic AI-Audio Detection Framework and Benchmark | Xiang Li et.al. | 2410.04324 | link |
| 2024-10-05 | Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System | Ze Li et.al. | 2410.04017 | null |
| 2024-10-01 | Recent Advances in Speech Language Models: A Survey | Wenqian Cui et.al. | 2410.03751 | link |
| 2024-10-04 | Generative Semantic Communication for Text-to-Speech Synthesis | Jiahao Zheng et.al. | 2410.03459 | null |
| 2024-10-04 | Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens | Jinzheng Zhao et.al. | 2410.03298 | null |
| 2024-10-04 | Narrative Player: Reviving Data Narratives with Visuals | Zekai Shao et.al. | 2410.03268 | null |
| 2024-10-04 | MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak et.al. | 2410.03192 | null |
| 2024-10-01 | Augmentation through Laundering Attacks for Audio Spoof Detection | Hashim Ali et.al. | 2410.01108 | null |
| 2024-10-01 | Zero-Shot Text-to-Speech from Continuous Text Streams | Trung Dang et.al. | 2410.00767 | null |
| 2024-10-01 | EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Haozhe Chen et.al. | 2410.00316 | link |
| 2024-09-30 | Word-wise intonation model for cross-language TTS systems | Tomilov A. A. et.al. | 2409.20374 | null |
| 2024-09-27 | Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim et.al. | 2409.18622 | null |
| 2024-09-26 | Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control | Ryuichi Yamamoto et.al. | 2409.17452 | null |
| 2024-09-25 | Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Lucas H. Ueda et.al. | 2409.17364 | null |
| 2024-09-25 | Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions | Kun Zhou et.al. | 2409.16681 | null |
| 2024-09-25 | Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation | Siyin Wang et.al. | 2409.16644 | link |
| 2024-09-24 | FastTalker: Jointly Generating Speech and Conversational Gestures from Text | Zixin Guo et.al. | 2409.16404 | null |
| 2024-09-24 | Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling | Ville Heilala et.al. | 2409.16376 | null |
| 2024-09-24 | Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Yunji Chu et.al. | 2409.16203 | null |
| 2024-09-24 | NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers | Nohil Park et.al. | 2409.15760 | null |
| 2024-09-24 | VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance | Jiheum Yeom et.al. | 2409.15759 | null |
| 2024-09-24 | StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Zhiyong Chen et.al. | 2409.15741 | null |
| 2024-09-23 | A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection | Lam Pham et.al. | 2409.15180 | null |
| 2024-09-23 | LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation | Hieu-Thi Luong et.al. | 2409.14743 | link |
| 2024-09-20 | Zero-shot Cross-lingual Voice Transfer for TTS | Fadi Biadsy et.al. | 2409.13910 | null |
| 2024-09-20 | On the Feasibility of Fully AI-automated Vishing Attacks | João Figueiredo et.al. | 2409.13793 | null |
| 2024-09-19 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space | Sebastião Quintas et.al. | 2409.12745 | null |
| 2024-09-19 | Preference Alignment Improves Language Model-Based TTS | Jinchuan Tian et.al. | 2409.12403 | null |
| 2024-09-18 | Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference | Edresson Casanova et.al. | 2409.12117 | null |
| 2024-09-18 | Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems | Anusha Prakash et.al. | 2409.11915 | null |
| 2024-09-18 | DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech | Xin Qi et.al. | 2409.11835 | null |
| 2024-09-18 | Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation | Haohan Guo et.al. | 2409.11630 | null |
| 2024-09-17 | SpMis: An Investigation of Synthetic Spoken Misinformation Detection | Peizhuo Liu et.al. | 2409.11308 | null |
| 2024-09-19 | The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives | Samee Arif et.al. | 2409.11261 | link |
| 2024-09-17 | Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Francesco Nespoli et.al. | 2409.11107 | null |
| 2024-09-16 | Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Xiaoxue Gao et.al. | 2409.10157 | null |
| 2024-09-16 | StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li et.al. | 2409.10058 | null |
| 2024-09-15 | Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning | Siqi Sun et.al. | 2409.09891 | null |
| 2024-09-14 | E1 TTS: Simple and Fast Non-Autoregressive TTS | Zhijun Liu et.al. | 2409.09351 | null |
| 2024-09-14 | Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation | Changjin Han et.al. | 2409.09311 | link |
| 2024-09-14 | SafeEar: Content Privacy-Preserving Audio Deepfake Detection | Xinfeng Li et.al. | 2409.09272 | link |
| 2024-09-13 | AccentBox: Towards High-Fidelity Zero-Shot Accent Generation | Jinzuomu Zhong et.al. | 2409.09098 | null |
| 2024-09-17 | HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Henry Li Xinyuan et.al. | 2409.08913 | null |
| 2024-09-13 | Text-To-Speech Synthesis In The Wild | Jee-weon Jung et.al. | 2409.08711 | null |
| 2024-09-14 | Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions | Amila Indika et.al. | 2409.07945 | null |
| 2024-09-12 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Zhiyuan Tang et.al. | 2409.07790 | null |
| 2024-09-11 | SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | Helin Wang et.al. | 2409.07556 | link |
| 2024-09-11 | D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack | Hong-Hanh Nguyen-Le et.al. | 2409.07390 | null |
| 2024-09-11 | Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT | Kazuki Yamauchi et.al. | 2409.07265 | null |
| 2024-09-11 | Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment | Tien-Hong Lo et.al. | 2409.07151 | null |
| 2024-09-10 | Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models | Xin Jing et.al. | 2409.06451 | null |
| 2024-09-10 | What happens to diffusion model likelihood when your model is conditional? | Mattias Cross et.al. | 2409.06364 | null |
| 2024-09-10 | VoiceWukong: Benchmarking Deepfake Voice Detection | Ziwei Yan et.al. | 2409.06348 | null |
| 2024-09-09 | AS-Speech: Adaptive Style For Speech Synthesis | Zhipeng Li et.al. | 2409.05730 | null |
| 2024-09-09 | IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS | Ashwin Sankar et.al. | 2409.05356 | link |
| 2024-09-10 | Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion | Zhengyang Chen et.al. | 2409.05004 | null |
| 2024-09-01 | Sample-Efficient Diffusion for Text-To-Speech Synthesis | Justin Lovelace et.al. | 2409.03717 | link |
| 2024-09-10 | LAST: Language Model Aware Speech Tokenization | Arnon Turetzky et.al. | 2409.03701 | null |
| 2024-09-05 | FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications | Hao-Han Guo et.al. | 2409.03283 | null |
| 2024-09-04 | Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems | Jeongmin Liu et.al. | 2409.02517 | null |
| 2024-09-03 | VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka | Li-Wei Chen et.al. | 2409.01548 | null |
| 2024-09-02 | A multilingual training strategy for low resource Text to Speech | Asma Amalas et.al. | 2409.01217 | null |
| 2024-09-02 | A Framework for Synthetic Audio Conversations Generation using Large Language Models | Kaung Myat Kyaw et.al. | 2409.00946 | null |
| 2024-09-02 | SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis | Haohan Guo et.al. | 2409.00933 | link |
| 2024-09-01 | MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | Yuancheng Wang et.al. | 2409.00750 | link |
| 2024-08-30 | SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection | Ismail Rasim Ulgen et.al. | 2408.17432 | null |
| 2024-08-30 | AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge | Kirill Borodin et.al. | 2408.17352 | null |
| 2024-08-30 | Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | Zhen Ye et.al. | 2408.17175 | link |
| 2024-08-30 | Utilizing Speaker Profiles for Impersonation Audio Detection | Hao Gu et.al. | 2408.17009 | null |
| 2024-08-29 | Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis | Zehai Tu et.al. | 2408.16373 | null |
| 2024-08-28 | Multi-modal Adversarial Training for Zero-Shot Voice Cloning | John Janiczek et.al. | 2408.15916 | null |
| 2024-08-29 | Easy, Interpretable, Effective: openSMILE for voice deepfake detection | Octavian Pascu et.al. | 2408.15775 | null |
| 2024-08-28 | VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | Yixuan Zhou et.al. | 2408.15676 | link |
| 2024-08-28 | VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech | Heeseung Kim et.al. | 2408.14739 | null |
| 2024-08-27 | StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech | Haowei Lou et.al. | 2408.14713 | link |
| 2024-08-27 | DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance | Jinhyeok Yang et.al. | 2408.14423 | null |
| 2024-08-26 | Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard | Wonjune Kang et.al. | 2408.13970 | null |
| 2024-08-28 | SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models | Dongchao Yang et.al. | 2408.13893 | null |
| 2024-08-22 | Positional Description for Numerical Normalization | Deepanshu Gupta et.al. | 2408.12430 | null |
| 2024-08-22 | VoiceX: A Text-To-Speech Framework for Custom Voices | Silvan Mertes et.al. | 2408.12170 | null |
| 2024-08-13 | Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation | Yinghao Aaron Li et.al. | 2408.11849 | null |
| 2024-08-20 | EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech | Xin Qi et.al. | 2408.10852 | null |
| 2024-08-20 | SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS | Karl El Hajal et.al. | 2408.10771 | null |
| 2024-08-20 | Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting | Hyun Jin Park et.al. | 2408.10463 | null |
| 2024-08-17 | Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition | Samuele Cornell et.al. | 2408.09215 | link |
| 2024-08-14 | PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation | Sang-Hoon Lee et.al. | 2408.07547 | link |
| 2024-08-13 | SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis | Osamu Take et.al. | 2408.06858 | link |
| 2024-08-13 | PRESENT: Zero-Shot Text-to-Prosody Control | Perry Lam et.al. | 2408.06827 | link |
| 2024-08-12 | FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks | Min Ma et.al. | 2408.06227 | null |
| 2024-08-11 | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Chunyu Qiang et.al. | 2408.05758 | null |
| 2024-08-06 | Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training | Hawraz A. Ahmad et.al. | 2408.03887 | null |
| 2024-08-03 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features | Peng Cheng et.al. | 2408.01808 | link |
| 2024-08-01 | Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation | Xinhan Di et.al. | 2408.00284 | null |
| 2024-07-18 | Handling Numeric Expressions in Automatic Speech Recognition | Christian Huber et.al. | 2408.00004 | null |
| 2024-07-31 | On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition | Nick Rossenbach et.al. | 2407.21476 | null |
| 2024-07-29 | Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | Mahmoud Salhab et.al. | 2407.18571 | null |
| 2024-07-25 | On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures | Nick Rossenbach et.al. | 2407.17997 | null |
| 2024-07-24 | Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model | Jan Lehečka et.al. | 2407.17167 | null |
| 2024-07-23 | Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments | Pai Zhu et.al. | 2407.16840 | null |
| 2024-07-19 | Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 | Chun Xu et.al. | 2407.14212 | null |
| 2024-07-18 | Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | Weiqin Li et.al. | 2407.13509 | null |
| 2024-07-22 | TTSDS -- Text-to-Speech Distribution Score | Christoph Minixhofer et.al. | 2407.12707 | link |
| 2024-07-17 | Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech | Haibin Wu et.al. | 2407.12229 | link |
| 2024-07-16 | A Language Modeling Approach to Diacritic-Free Hebrew TTS | Amit Roth et.al. | 2407.12206 | null |
| 2024-07-17 | Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding | Chuanhao Sun et.al. | 2407.09370 | link |
| 2024-07-11 | Autoregressive Speech Synthesis without Vector Quantization | Lingwei Meng et.al. | 2407.08551 | link |
| 2024-07-10 | Source Tracing of Audio Deepfake Systems | Nicholas Klein et.al. | 2407.08016 | null |
| 2024-07-07 | ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation | Ruibo Fu et.al. | 2407.05421 | null |
| 2024-07-09 | CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Zhihao Du et.al. | 2407.05407 | null |
| 2024-07-04 | Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis | Cong-Thanh Do et.al. | 2407.04047 | null |
| 2024-07-04 | Optimizing a-DCF for Spoofing-Robust Speaker Verification | Oğuzhan Kurnaz et.al. | 2407.04034 | null |
| 2024-07-04 | On the Effectiveness of Acoustic BPE in Decoder-Only TTS | Bohan Li et.al. | 2407.03892 | null |
| 2024-07-14 | CATT: Character-based Arabic Tashkeel Transformer | Faris Alasmary et.al. | 2407.03236 | link |
| 2024-07-02 | Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization | Yuchen Hu et.al. | 2407.02243 | null |
| 2024-07-02 | TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations | Xiaoxue Gao et.al. | 2407.01927 | null |
| 2024-07-01 | Lightweight Zero-shot Text-to-Speech with Mixture of Adapters | Kenichi Fujita et.al. | 2407.01291 | null |
| 2024-06-30 | NAIST Simultaneous Speech Translation System for IWSLT 2024 | Yuka Ko et.al. | 2407.00826 | null |
| 2024-06-30 | An Attribute Interpolation Method in Speech Synthesis by Model Merging | Masato Murata et.al. | 2407.00766 | null |
| 2024-06-30 | FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis | Yinlin Guo et.al. | 2407.00753 | null |
| 2024-07-02 | Open-Source Conversational AI with SpeechBrain 1.0 | Mirco Ravanelli et.al. | 2407.00463 | null |
| 2024-06-27 | Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models | Borodin Kirill Nikolayevich et.al. | 2406.19243 | null |
| 2024-06-27 | DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability | Hyun Joon Park et.al. | 2406.19135 | link |
| 2024-06-26 | Automatic Speech Recognition for Hindi | Anish Saha et.al. | 2406.18135 | null |
| 2024-06-26 | A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons | Tzu-Yun Hung et.al. | 2406.18089 | null |
| 2024-06-29 | LLM-Driven Multimodal Opinion Expression Identification | Bonian Jia et.al. | 2406.18088 | null |
| 2024-06-26 | E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS | Sefik Emre Eskimez et.al. | 2406.18009 | link |
| 2024-06-25 | Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment | Paarth Neekhara et.al. | 2406.17957 | null |
| 2024-06-22 | A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge | Xiaopeng Wang et.al. | 2406.17801 | null |
| 2024-06-25 | High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model | Joun Yeop Lee et.al. | 2406.17310 | null |
| 2024-06-25 | Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation | Yingting Li et.al. | 2406.17257 | null |
| 2024-06-24 | Exploring the Capability of Mamba in Speech Applications | Koichi Miyazaki et.al. | 2406.16808 | null |
| 2024-06-25 | Towards Zero-Shot Text-To-Speech for Arabic Dialects | Khai Duy Doan et.al. | 2406.16751 | null |
| 2024-06-22 | TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers | Yakun Song et.al. | 2406.15752 | link |
| 2024-06-21 | InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | Yu Nakagome et.al. | 2406.14890 | null |
| 2024-06-21 | GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech | Wenbin Wang et.al. | 2406.14875 | null |
| 2024-06-21 | DASB - Discrete Audio and Speech Benchmark | Pooneh Mousavi et.al. | 2406.14294 | null |
| 2024-06-18 | Instruction Data Generation and Unsupervised Adaptation for Speech Language Models | Vahid Noroozi et.al. | 2406.12946 | null |
| 2024-06-17 | DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer | Keon Lee et.al. | 2406.11427 | null |
| 2024-06-16 | NAST: Noise Aware Speech Tokenization for Speech Language Models | Shoval Messica et.al. | 2406.11037 | link |
| 2024-06-16 | Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis | Xuehao Zhou et.al. | 2406.10844 | null |
| 2024-06-14 | Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice | Shubham Gupta et.al. | 2406.10422 | null |
| 2024-06-14 | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | Dongchao Yang et.al. | 2406.10056 | link |
| 2024-06-14 | MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model | Jiatong Shi et.al. | 2406.09869 | null |
| 2024-06-13 | DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage | Kyra Wang et.al. | 2406.08820 | null |
| 2024-06-13 | Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems | Zhengyang Chen et.al. | 2406.08812 | null |
| 2024-06-13 | DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing | Neha Sahipjohn et.al. | 2406.08802 | null |
| 2024-06-12 | Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis | Wing-Zin Leung et.al. | 2406.08568 | link |
| 2024-06-12 | Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data | Yuma Shirahata et.al. | 2406.08111 | null |
| 2024-06-12 | VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech | Ashishkumar Gudmalwar et.al. | 2406.08076 | null |
| 2024-06-12 | LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning | Masaya Kawamura et.al. | 2406.07969 | link |
| 2024-06-12 | VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | Bing Han et.al. | 2406.07855 | null |
| 2024-06-12 | EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech | Deok-Hyeon Cho et.al. | 2406.07803 | link |
| 2024-06-11 | The Interspeech 2024 Challenge on Speech Processing Using Discrete Units | Xuankai Chang et.al. | 2406.07725 | null |
| 2024-06-11 | Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Qingkai Fang et.al. | 2406.07289 | null |
| 2024-06-11 | AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Hongbin Liu et.al. | 2406.06979 | link |
| 2024-06-11 | Controlling Emotion in Text-to-Speech with Natural Language Prompts | Thomas Bott et.al. | 2406.06406 | link |
| 2024-06-10 | Meta Learning Text-to-Speech Synthesis in over 7000 Languages | Florian Lux et.al. | 2406.06403 | link |
| 2024-06-10 | MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance | Semin Kim et.al. | 2406.05965 | null |
| 2024-06-11 | WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark | Linhan Ma et.al. | 2406.05763 | link |
| 2024-06-09 | An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS | Xiaofei Wang et.al. | 2406.05699 | null |
| 2024-06-11 | Text-aware and Context-aware Expressive Audiobook Speech Synthesis | Dake Guo et.al. | 2406.05672 | link |
| 2024-06-08 | Autoregressive Diffusion Transformer for Text-to-Speech Synthesis | Zhijun Liu et.al. | 2406.05551 | null |
| 2024-06-08 | VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | Sanyuan Chen et.al. | 2406.05370 | null |
| 2024-06-07 | Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis | Ryan Langman et.al. | 2406.05298 | null |
| 2024-06-07 | XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model | Edresson Casanova et.al. | 2406.04904 | link |
| 2024-06-07 | TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking | Junzuo Zhou et.al. | 2406.04840 | link |
| 2024-06-07 | Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study | Chong Zhang et.al. | 2406.04633 | null |
| 2024-06-06 | Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis | Théodor Lemerle et.al. | 2406.04467 | link |
| 2024-06-06 | Total-Duration-Aware Duration Modeling for Text-to-Speech Systems | Sefik Emre Eskimez et.al. | 2406.04281 | null |
| 2024-06-06 | Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining | Jinlong Xue et.al. | 2406.03714 | null |
| 2024-06-06 | Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model | Jinlong Xue et.al. | 2406.03706 | null |
| 2024-06-05 | Style Mixture of Experts for Expressive Text-To-Speech Synthesis | Ahad Jawaid et.al. | 2406.03637 | null |
| 2024-06-07 | Harder or Different? Understanding Generalization of Audio Deepfake Detection | Nicolas M. Müller et.al. | 2406.03512 | null |
| 2024-06-05 | LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes | Trung Dang et.al. | 2406.02897 | null |
| 2024-06-04 | Seed-TTS: A Family of High-Quality Versatile Speech Generation Models | Philip Anastassiou et.al. | 2406.02430 | link |
| 2024-06-05 | SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models | Dongchao Yang et.al. | 2406.02328 | null |
| 2024-06-04 | BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation | Hui-Peng Du et.al. | 2406.02162 | null |
| 2024-06-04 | Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis | Kun Zhou et.al. | 2406.02009 | null |
| 2024-06-03 | ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec | Shengpeng Ji et.al. | 2406.01205 | link |
| 2024-06-03 | Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training | Jan Melechovsky et.al. | 2406.01018 | null |
| 2024-06-02 | Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | Chen Chen et.al. | 2406.00654 | null |
| 2024-05-31 | Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | Vicky Zayats et.al. | 2405.18669 | null |
| 2024-05-28 | TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Chenyang Le et.al. | 2405.17809 | link |
| 2024-05-27 | RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis | Haoxiang Shi et.al. | 2405.17028 | null |
| 2024-05-24 | Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | Zijin Gu et.al. | 2405.15216 | null |
| 2024-05-23 | Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models | Jingyi Chen et.al. | 2405.14632 | null |
| 2024-05-22 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction | Yue Li et.al. | 2405.13477 | null |
| 2024-05-20 | Multi-speaker Text-to-speech Training with Speaker Anonymized Data | Wen-Chin Huang et.al. | 2405.11767 | null |
| 2024-05-19 | VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications | Mikhail Konenkov et.al. | 2405.11537 | null |
| 2024-05-18 | Exploring speech style spaces with language models: Emotional TTS without emotion labels | Shreeram Suresh Chandra et.al. | 2405.11413 | null |
| 2024-05-16 | Faces that Speak: Jointly Synthesising Talking Face and Speech from Text | Youngjoon Jang et.al. | 2405.10272 | null |
| 2024-05-16 | Building a Luganda Text-to-Speech Model From Crowdsourced Data | Sulaiman Kagumire et.al. | 2405.10211 | null |
| 2024-05-16 | Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model | Siyang Wang et.al. | 2405.09768 | null |
| 2024-05-15 | Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | Weifei Jin et.al. | 2405.09470 | null |
| 2024-05-15 | Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis | Sho Inoue et.al. | 2405.09171 | null |
| 2024-05-14 | PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset | Yang Hou et.al. | 2405.08838 | link |
| 2024-04-30 | Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech | Hankun Wang et.al. | 2404.19723 | null |
| 2024-04-29 | MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis | Xiang Li et.al. | 2404.18398 | link |
| 2024-04-28 | USAT: A Universal Speaker-Adaptive Text-to-Speech Approach | Wenbin Wang et.al. | 2404.18094 | link |
| 2024-04-27 | TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality | Tiantian Feng et.al. | 2404.17983 | null |
| 2024-04-26 | An RFP dataset for Real, Fake, and Partially fake audio detection | Abdulazeez AlAli et.al. | 2404.17721 | null |
| 2024-04-23 | StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations | Sen Liu et.al. | 2404.14946 | link |
| 2024-04-23 | Retrieval-Augmented Audio Deepfake Detection | Zuheng Kang et.al. | 2404.13892 | link |
| 2024-04-14 | Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling | Quanxiu Wang et.al. | 2404.09192 | null |
| 2024-04-11 | Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network | Mayura Manawadu et.al. | 2404.07807 | null |
| 2024-04-18 | Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness | Xincan Feng et.al. | 2404.06714 | link |
| 2024-04-10 | CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations | Leying Zhang et.al. | 2404.06690 | link |
| 2024-04-10 | The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge | Yiwei Guo et.al. | 2404.06079 | null |
| 2024-04-07 | Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Yuang Li et.al. | 2404.04904 | null |
| 2024-04-06 | HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks | Yingting Li et.al. | 2404.04645 | link |
| 2024-04-18 | Open vocabulary keyword spotting through transfer learning from speech synthesis | Kesavaraj V et.al. | 2404.03914 | null |
| 2024-04-06 | RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | Detai Xin et.al. | 2404.03204 | null |
| 2024-04-03 | CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | Jaehyeon Kim et.al. | 2404.02781 | null |
| 2024-04-13 | PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders | Yu Pan et.al. | 2404.02702 | null |
| 2024-03-31 | Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation | Rohan Chaudhury et.al. | 2404.01339 | link |
| 2024-03-28 | A Review of Multi-Modal Large Language and Vision Models | Kilian Carolan et.al. | 2404.01322 | null |
| 2024-04-09 | KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis | Adal Abilbekov et.al. | 2404.01033 | link |
| 2024-03-31 | CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models | Xiang Li et.al. | 2404.00569 | link |
| 2024-03-25 | VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Puyuan Peng et.al. | 2403.16973 | link |
| 2024-03-20 | Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning | Shivam Ratnakant Mhaskar et.al. | 2403.15469 | null |
| 2024-03-20 | UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge | Wataru Nakata et.al. | 2403.13720 | null |
| 2024-03-20 | Building speech corpus with diverse voice characteristics for its prompt-based representation | Aya Watanabe et.al. | 2403.13353 | null |
| 2024-03-17 | Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations | Claudio Pinhanez et.al. | 2403.11209 | null |
| 2024-03-17 | EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech | Ziqi Liang et.al. | 2403.08164 | null |
| 2024-03-09 | HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling | Chunhui Wang et.al. | 2403.05989 | null |
| 2024-03-05 | AttentionStitch: How Attention Solves the Speech Editing Problem | Antonios Alexos et.al. | 2403.04804 | null |
| 2024-03-07 | Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation | Sai Akarsh et.al. | 2403.04178 | null |
| 2024-03-27 | NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | Zeqian Ju et.al. | 2403.03100 | null |
| 2024-03-04 | Brilla AI: AI Contestant for the National Science and Maths Quiz | George Boateng et.al. | 2403.01699 | link |
| 2024-03-02 | Towards Accurate Lip-to-Speech Synthesis in-the-Wild | Sindhu Hegde et.al. | 2403.01087 | link |
| 2024-02-29 | Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data | Takaaki Saeki et.al. | 2402.18932 | null |
| 2024-02-26 | An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation | Ahmet Gunduz et.al. | 2402.16380 | link |
| 2024-02-22 | Efficient data selection employing Semantic Similarity-based Graph Structures for model training | Roxana Petcu et.al. | 2402.14888 | null |
| 2024-02-22 | Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition | Rendi Chevi et.al. | 2402.14523 | link |
| 2024-02-19 | On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models | Miri Varshavsky-Hassid et.al. | 2402.12423 | null |
| 2024-02-19 | Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting | Haolin Chen et.al. | 2402.12220 | link |
| 2024-02-18 | Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Zining Wang et.al. | 2402.11571 | null |
| 2024-02-14 | MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Shengpeng Ji et.al. | 2402.09378 | null |
| 2024-02-15 | BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | Mateusz Łajszczak et.al. | 2402.08093 | null |
| 2024-03-04 | Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like | Naoyuki Kanda et.al. | 2402.07383 | null |
| 2024-02-09 | A New Approach to Voice Authenticity | Nicolas M. Müller et.al. | 2402.06304 | null |
| 2024-02-08 | Unified Speech-Text Pretraining for Spoken Dialog Modeling | Heeseung Kim et.al. | 2402.05706 | link |
| 2024-02-05 | Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations | Álvaro Martín-Cortinas et.al. | 2402.03407 | null |
| 2024-02-02 | Natural language guidance of high-fidelity text-to-speech with synthetic annotations | Dan Lyth et.al. | 2402.01912 | null |
| 2024-01-23 | Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization | Wei-Ping Huang et.al. | 2402.01692 | null |
| 2024-02-01 | Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech | Dong Yang et.al. | 2402.00288 | link |
| 2024-02-01 | PAM: Prompting Audio-Language Models for Audio Quality Assessment | Soham Deshmukh et.al. | 2402.00282 | link |
| 2024-01-31 | Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2 | Jiatong Shi et.al. | 2401.17619 | link |
| 2024-01-28 | MunTTS: A Text-to-Speech System for Mundari | Varun Gumma et.al. | 2401.15579 | link |
| 2024-01-30 | VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech | Chenpeng Du et.al. | 2401.14321 | null |
| 2024-01-25 | Text to speech synthesis | Harini s et.al. | 2401.13891 | link |
| 2024-01-25 | SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation | Dong Zhang et.al. | 2401.13527 | link |
| 2024-01-22 | Benchmarking Large Multimodal Models against Common Corruptions | Jiawei Zhang et.al. | 2401.11943 | link |
| 2024-01-22 | Adversarial speech for voice privacy protection from Personalized Speech generation | Shihao Chen et.al. | 2401.11857 | null |
| 2024-02-16 | Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis | Vinotha R et.al. | 2401.11771 | null |
| 2024-01-19 | Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech | Abhinav Garg et.al. | 2401.10465 | null |
| 2024-02-28 | MLAAD: The Multi-Language Audio Anti-Spoofing Dataset | Nicolas M. Müller et.al. | 2401.09512 | null |
| 2024-01-15 | MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory | Robert G. Kimelman et.al. | 2401.07967 | null |
| 2024-01-14 | ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering | Yakun Song et.al. | 2401.07333 | null |
| 2024-01-12 | Multi-Task Learning for Front-End Text Processing in TTS | Wonjune Kang et.al. | 2401.06321 | link |
| 2024-01-11 | End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 | Aniket Tathe et.al. | 2401.06183 | null |
| 2024-01-11 | Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection | Lian Huang et.al. | 2401.05614 | null |
| 2024-01-10 | Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters | Kenichi Fujita et.al. | 2401.05111 | null |
| 2024-01-07 | Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments | Zhonghao Shi et.al. | 2401.03581 | null |
| 2024-01-07 | Transfer the linguistic representations from TTS to accent conversion with non-parallel data | Xi Chen et.al. | 2401.03538 | null |
| 2024-01-03 | Incremental FastPitch: Chunk-based High Quality Text to Speech | Muyang Du et.al. | 2401.01755 | null |
| 2024-01-03 | Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction | Minchan Kim et.al. | 2401.01498 | null |
| 2023-12-18 | Assisting Blind People Using Object Detection with Vocal Feedback | Heba Najm et.al. | 2401.01362 | null |
| 2023-12-30 | Boosting Large Language Model for Speech Synthesis: An Empirical Study | Hongkun Hao et.al. | 2401.00246 | null |
| 2024-01-01 | Normalization of Lithuanian Text Using Regular Expressions | Pijus Kasparaitis et.al. | 2312.17660 | null |
| 2023-12-27 | AE-Flow: AutoEncoder Normalizing Flow | Jakub Mosiński et.al. | 2312.16552 | null |
| 2023-12-22 | Creating New Voices using Normalizing Flows | Piotr Bilinski et.al. | 2312.14569 | null |
| 2023-12-22 | ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations | Cheng Gong et.al. | 2312.14398 | null |
| 2023-12-19 | External Knowledge Augmented Polyphone Disambiguation Using Large Language Model | Chen Li et.al. | 2312.11920 | null |
| 2023-12-17 | A review-based study on different Text-to-Speech technologies | Md. Jalal Uddin Chowdhury et.al. | 2312.11563 | null |
| 2024-01-31 | MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis | Wenhao Guan et.al. | 2312.10687 | null |
| 2024-02-22 | Amphion: An Open-Source Audio, Music and Speech Generation Toolkit | Xueyao Zhang et.al. | 2312.09911 | link |
| 2023-12-11 | Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism | Georgios Milis et.al. | 2312.06613 | link |
| 2023-12-08 | An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis | Via Nielson et.al. | 2312.05415 | null |
| 2023-12-06 | Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis | Zehua Chen et.al. | 2312.03491 | null |
| 2023-12-02 | Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning | Raviraj Joshi et.al. | 2312.01107 | null |
| 2023-12-02 | Code-Mixed Text to Speech Synthesis under Low-Resource Constraints | Raviraj Joshi et.al. | 2312.01103 | null |
| 2023-11-29 | Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes | Pavel Korshunov et.al. | 2311.17655 | null |
| 2024-02-06 | Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech | Enting Zhou et.al. | 2311.14816 | link |
| 2023-12-07 | Guided Flows for Generative Modeling and Decision Making | Qinqing Zheng et.al. | 2311.13443 | null |
| 2023-11-27 | HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis | Sang-Hoon Lee et.al. | 2311.12454 | link |
| 2023-11-18 | Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots | Farideh Majidi et.al. | 2311.11116 | null |
| 2023-11-18 | Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys | Gabriel Cosache et.al. | 2311.11030 | null |
| 2023-11-17 | A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness | Mathias Vogel et.al. | 2311.10804 | null |
| 2023-11-16 | Improving fairness for spoken language understanding in atypical speech with Text-to-Speech | Helin Wang et.al. | 2311.10149 | link |
| 2024-02-02 | DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation | Jianzong Wang et.al. | 2311.07965 | null |
| 2023-11-12 | ChatAnything: Facetime Chat with LLM-Enhanced Personas | Yilin Zhao et.al. | 2311.06772 | null |
| 2023-11-11 | NewsGPT: ChatGPT Integration for Robot-Reporter | Abdelhadi Hireche et.al. | 2311.06640 | link |
| 2023-11-08 | Synthetic Speaking Children -- Why We Need Them and How to Make Them | Muhammad Ali Farooq et.al. | 2311.06307 | null |
| 2023-09-25 | Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image | Minki Kang et.al. | 2311.05844 | null |
| 2023-11-07 | Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning | Rishabh Jain et.al. | 2311.04313 | link |
| 2023-11-07 | Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment | Jakir Hasan et.al. | 2311.03792 | null |
| 2023-11-08 | Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction | Minchan Kim et.al. | 2311.02898 | null |
| 2023-11-02 | Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations | Hanglei Zhang et.al. | 2311.01260 | null |
| 2023-11-02 | E3 TTS: Easy End-to-End Diffusion-based Text to Speech | Yuan Gao et.al. | 2311.00945 | null |
| 2023-10-31 | An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation | Yingjie Zhou et.al. | 2310.20251 | link |
| 2023-10-27 | Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN | Neeraj Kumar et.al. | 2310.18169 | null |
| 2023-10-25 | ArTST: Arabic Text and Speech Transformer | Hawau Olamide Toyin et.al. | 2310.16621 | link |
| 2023-10-25 | Generative Pre-training for Speech with Flow Matching | Alexander H. Liu et.al. | 2310.16338 | null |
| 2023-10-23 | DPP-TTS: Diversifying prosodic features of speech via determinantal point processes | Seongho Joo et.al. | 2310.14663 | null |
| 2023-10-22 | An overview of text-to-speech systems and media applications | Mohammad Reza Hasanabadi et.al. | 2310.14301 | null |
| 2023-10-14 | Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling | Tiberiu Boros et.al. | 2310.09636 | link |
| 2023-10-14 | Attentive Multi-Layer Perceptron for Non-autoregressive Generation | Shuyang Jiang et.al. | 2310.09512 | link |
| 2023-12-22 | Crowdsourced and Automatic Speech Prominence Estimation | Max Morrison et.al. | 2310.08464 | link |
| 2023-10-12 | On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition | Nick Rossenbach et.al. | 2310.08132 | null |
| 2023-10-12 | Vec-Tok Speech: speech vectorization and tokenization for neural speech generation | Xinfa Zhu et.al. | 2310.07246 | link |
| 2023-10-10 | Prosody Analysis of Audiobooks | Charuta Pethe et.al. | 2310.06930 | link |
| 2023-10-09 | JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions | Detai Xin et.al. | 2310.06072 | null |
| 2024-01-09 | Unified speech and gesture synthesis using flow matching | Shivam Mehta et.al. | 2310.05181 | null |
| 2023-10-08 | Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset | Ze Liu et.al. | 2310.04982 | null |
| 2023-10-11 | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Jiaming Wang et.al. | 2310.04673 | null |
| 2024-01-22 | Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis | Jae-Sung Bae et.al. | 2310.03538 | null |
| 2023-10-07 | The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains | Erica Cooper et.al. | 2310.02640 | null |
| 2023-10-02 | Towards human-like spoken dialogue generation between AI agents from written dialogue | Kentaro Mitsui et.al. | 2310.01088 | null |
| 2023-10-01 | Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech | Dareen Alharthi et.al. | 2310.00706 | null |
| 2024-03-11 | Fewer-token Neural Speech Codec with Time-invariant Codes | Yong Ren et.al. | 2310.00014 | link |
| 2024-01-31 | ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech | Wenhao Guan et.al. | 2309.17056 | null |
| 2023-09-29 | Low-Resource Self-Supervised Learning with SSL-Enhanced TTS | Po-chun Hsu et.al. | 2309.17020 | null |
| 2023-09-29 | Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features | Yuxiang Zhang et.al. | 2309.16954 | null |
| 2023-12-18 | High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models | Chunyu Qiang et.al. | 2309.15512 | null |
| 2024-01-09 | BiSinger: Bilingual Singing Voice Synthesis | Huali Zhou et.al. | 2309.14089 | link |
| 2023-10-07 | HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS | Dake Guo et.al. | 2309.13907 | null |
| 2023-09-24 | VoiceLDM: Text-to-Speech with Environmental Context | Yeonghyeon Lee et.al. | 2309.13664 | null |
| 2023-09-24 | Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control | Aya Watanabe et.al. | 2309.13509 | null |
| 2023-09-22 | DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis | Yu Gu et.al. | 2309.12792 | null |
| 2023-09-22 | Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts | Shun Lei et.al. | 2309.11977 | null |
| 2023-09-21 | The Impact of Silence on Speech Anti-Spoofing | Yuxiang Zhang et.al. | 2309.11827 | null |
| 2023-09-21 | Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech | Rui Liu et.al. | 2309.11724 | link |
| 2023-09-20 | Speak While You Think: Streaming Speech Synthesis During Text Generation | Avihu Dekel et.al. | 2309.11210 | null |
| 2023-09-20 | Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model | Xinyu Zhou et.al. | 2309.11000 | link |
| 2023-09-19 | Exploring Speech Enhancement for Low-resource Speech Synthesis | Zhaoheng Ni et.al. | 2309.10795 | null |
| 2023-09-19 | Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition | Ziyang Ma et.al. | 2309.10294 | null |
| 2023-09-17 | Augmenting text for spoken language understanding with Large Language Models | Roshan Sharma et.al. | 2309.09390 | null |
| 2023-09-16 | FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework | Jianzong Wang et.al. | 2309.08837 | null |
| 2023-09-15 | Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech | Dariusz Piotrowski et.al. | 2309.08255 | null |
| 2023-09-15 | HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods | Hyun-seo Shin et.al. | 2309.08208 | link |
| 2023-12-27 | PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions | Reo Shimizu et.al. | 2309.08140 | null |
| 2023-09-15 | Diversity-based core-set selection for text-to-speech with linguistic and acoustic features | Kentaro Seki et.al. | 2309.08127 | null |
| 2023-09-14 | Direct Text to Speech Translation System using Acoustic Units | Victoria Mingote et.al. | 2309.07478 | null |
| 2023-10-07 | FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | Zhihao Du et.al. | 2309.07405 | link |
| 2023-09-13 | DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation | Zhichao Wu et.al. | 2309.06787 | null |
| 2023-09-11 | Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP | Jinzuomu Zhong et.al. | 2309.05423 | link |
| 2024-01-16 | VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching | Yiwei Guo et.al. | 2309.05027 | link |
| 2023-09-08 | Cross-Utterance Conditioned VAE for Speech Generation | Yang Li et.al. | 2309.04156 | null |
| 2023-09-07 | Large-Scale Automatic Audiobook Creation | Brendan Walsh et.al. | 2309.03926 | null |
| 2023-09-11 | GRASS: Unified Generation Model for Speech-to-Semantic Tasks | Aobo Xia et.al. | 2309.02780 | null |
| 2023-09-12 | MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 | Zhihang Xu et.al. | 2309.02743 | null |
| 2023-10-12 | PromptTTS 2: Describing and Generating Voices with Text Prompt | Yichong Leng et.al. | 2309.02285 | null |
| 2023-09-04 | A Comparative Analysis of Pretrained Language Models for Text-to-Speech | Marcel Granero-Moya et.al. | 2309.01576 | null |
| 2023-09-02 | DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin | Tao Li et.al. | 2309.00883 | null |
| 2023-12-18 | Learning Speech Representation From Contrastive Token-Acoustic Pretraining | Chunyu Qiang et.al. | 2309.00424 | null |
| 2023-09-01 | The FruitShell French synthesis system at the Blizzard 2023 Challenge | Xin Qi et.al. | 2309.00223 | null |
| 2023-08-31 | QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning | Haohan Guo et.al. | 2309.00126 | null |
| 2024-01-23 | SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | Xin Zhang et.al. | 2308.16692 | link |
| 2023-08-31 | Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis | Weiqin Li et.al. | 2308.16593 | null |
| 2023-08-31 | Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information | Jie Chen et.al. | 2308.16577 | null |
| 2023-08-31 | LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech | Jie Chen et.al. | 2308.16569 | null |
| 2023-08-30 | CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis | Yi Meng et.al. | 2308.16021 | null |
| 2023-09-01 | The DeepZen Speech Synthesis System for Blizzard Challenge 2023 | Christophe Veaux et.al. | 2308.15945 | null |
| 2023-08-28 | Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech | Hyungchan Yoon et.al. | 2308.14909 | null |
| 2023-09-04 | Rep2wav: Noise Robust text-to-speech Using self-supervised representations | Qiushi Zhu et.al. | 2308.14553 | null |
| 2023-08-28 | TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models | Shengpeng Ji et.al. | 2308.14430 | link |
| 2023-09-02 | Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder | Xuyuan Li et.al. | 2308.13365 | null |
| 2023-08-24 | Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations | Wenbin Wang et.al. | 2308.13007 | null |
| 2023-09-22 | Sparks of Large Audio Models: A Survey and Outlook | Siddique Latif et.al. | 2308.12792 | null |
| 2023-10-25 | SeamlessM4T: Massively Multilingual & Multimodal Machine Translation | Seamless Communication et.al. | 2308.11596 | link |
| 2023-08-31 | Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models | Heyang Xue et.al. | 2308.10428 | null |
| 2023-08-16 | AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis | Hrishikesh Viswanath et.al. | 2308.08577 | null |
| 2023-08-14 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | Xiaofei Wang et.al. | 2308.06873 | null |
| 2023-08-12 | Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation | Zhichao Wang et.al. | 2308.06457 | link |
| 2023-09-09 | AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | Haohe Liu et.al. | 2308.05734 | link |
| 2023-08-09 | Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay | Leixian Shen et.al. | 2308.04703 | null |
| 2023-08-08 | Towards an AI to Win Ghana's National Science and Maths Quiz | George Boateng et.al. | 2308.04333 | link |
| 2023-08-08 | WonderFlow: Narration-Centric Design of Animated Data Videos | Yun Wang et.al. | 2308.04040 | null |
| 2023-08-04 | Let's Give a Voice to Conversational Agents in Virtual Reality | Michele Yin et.al. | 2308.02665 | link |
| 2023-08-03 | Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation | Minsu Kim et.al. | 2308.01831 | link |
| 2023-08-02 | SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis | Ramanan Sivaguru et.al. | 2308.01018 | null |
| 2023-07-07 | Artificial Eye for the Blind | Abhinav Benagi et.al. | 2308.00801 | null |
| 2023-07-31 | Multilingual context-based pronunciation learning for Text-to-Speech | Giulia Comini et.al. | 2307.16709 | null |
| 2023-07-31 | Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech | Guangyan Zhang et.al. | 2307.16679 | null |
| 2023-07-31 | Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings | Manuel Sam Ribeiro et.al. | 2307.16643 | null |
| 2023-07-31 | DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training | Hyung-Seok Oh et.al. | 2307.16549 | link |
| 2023-07-31 | VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design | Jungil Kong et.al. | 2307.16430 | link |
| 2023-07-30 | Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation | Yuanhao Chen et.al. | 2307.16199 | link |
| 2023-07-29 | METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer | Xinfa Zhu et.al. | 2307.15951 | link |
| 2023-12-18 | Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding | Chunyu Qiang et.al. | 2307.15484 | null |
| 2023-07-20 | SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer | Daegyeom Kim et.al. | 2307.10550 | link |
| 2023-07-18 | SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs | Yinghao Aaron Li et.al. | 2307.09435 | null |
| 2023-09-28 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | Ziyue Jiang et.al. | 2307.07218 | null |
| 2023-07-13 | Controllable Emphasis with zero data for text-to-speech | Arnaud Joly et.al. | 2307.07062 | null |
| 2023-07-11 | On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis | Siyang Wang et.al. | 2307.05132 | null |
| 2023-07-10 | The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task | Kun Song et.al. | 2307.04630 | null |
| 2023-10-07 | ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading | Yujia Xiao et.al. | 2307.00782 | null |
| 2023-06-28 | EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech | Daria Diatlova et.al. | 2307.00024 | link |
| 2023-06-29 | High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units | Junchen Lu et.al. | 2306.17005 | null |
| 2023-06-28 | UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data | Heeseung Kim et.al. | 2306.16083 | link |
| 2023-10-19 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | Matthew Le et.al. | 2306.15687 | null |
| 2023-06-27 | GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech | Yahuan Cong et.al. | 2306.15304 | null |
| 2023-06-25 | DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech | Sen Liu et.al. | 2306.14145 | null |
| 2023-06-21 | Visual-Aware Text-to-Speech | Mohan Zhou et.al. | 2306.12020 | null |
| 2023-06-21 | Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer | Jakub Swiatkowski et.al. | 2306.11662 | null |
| 2023-06-16 | Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation | Kishor Kayyar Lakshminarayana et.al. | 2306.10152 | null |
| 2023-06-16 | CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages | Frederico S. Oliveira et.al. | 2306.10097 | null |
| 2023-06-14 | Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation | Zheng Liang et.al. | 2306.08588 | null |
| 2023-06-14 | Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects | Xinghua Qu et.al. | 2306.08219 | link |
| 2023-11-20 | StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | Yinghao Aaron Li et.al. | 2306.07691 | null |
| 2024-01-18 | UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | Chenpeng Du et.al. | 2306.07547 | null |
| 2023-06-13 | PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling | Ji-Sang Hwang et.al. | 2306.07489 | null |
| 2023-06-09 | Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech | Shijun Wang et.al. | 2306.05709 | null |
| 2023-06-08 | VIFS: An End-to-End Variational Inference for Foley Sound Synthesis | Junhyeok Lee et.al. | 2306.05004 | link |
| 2023-07-11 | Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge | Wenhao Guan et.al. | 2306.04301 | null |
| 2023-06-06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | Ziyue Jiang et.al. | 2306.03509 | null |
| 2023-08-02 | Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis | Zhenhui Ye et.al. | 2306.03504 | null |
| 2023-06-05 | Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis | Dengfeng Ke et.al. | 2306.02593 | null |
| 2023-06-05 | Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model | Hoyeon Lee et.al. | 2306.02579 | null |
| 2023-06-05 | Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming | Xinlei Niu et.al. | 2306.02568 | link |
| 2023-06-02 | Towards Robust FastSpeech 2 by Modelling Residual Multimodality | Fabian Kögel et.al. | 2306.01442 | link |
| 2023-05-30 | Towards Selection of Text-to-speech Data to Augment ASR Training | Shuo Liu et.al. | 2306.00998 | null |
| 2023-06-01 | EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis | Haobin Tang et.al. | 2306.00648 | null |
| 2023-06-01 | The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech | Phat Do et.al. | 2306.00535 | null |
| 2023-05-31 | Text-to-Speech Pipeline for Swiss German -- A comparison | Tobias Bollinger et.al. | 2305.19750 | null |
| 2023-05-31 | XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech | Linh The Nguyen et.al. | 2305.19709 | link |
| 2023-06-01 | PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions | Guanghou Liu et.al. | 2305.19522 | null |
| 2023-05-30 | Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages | Phat Do et.al. | 2305.19396 | null |
| 2023-05-30 | Make-A-Voice: Unified Voice Synthesis With Discrete Representation | Rongjie Huang et.al. | 2305.19269 | null |
| 2023-05-30 | STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions | Michel Plüss et.al. | 2305.18855 | null |
| 2023-05-30 | LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus | Yuma Koizumi et.al. | 2305.18802 | null |
| 2023-10-09 | An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization | Fei Kong et.al. | 2305.18355 | link |
| 2023-05-29 | ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation | Ambuj Mehrish et.al. | 2305.18028 | link |
| 2023-05-29 | Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis | Erik Ekstedt et.al. | 2305.17971 | null |
| 2023-07-25 | StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation | Kun Song et.al. | 2305.17732 | null |
| 2023-05-28 | Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS | Sewade Ogun et.al. | 2305.17724 | link |
| 2023-07-19 | Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing | Julia Kaiwen Lau et.al. | 2305.17445 | link |
| 2023-05-26 | DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction | Vineet Bhat et.al. | 2305.16957 | null |
| 2023-05-25 | Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion | Rui Liu et.al. | 2305.16353 | link |
| 2023-05-22 | Text Generation with Speech Synthesis for ASR Data Augmentation | Zhuangqun Huang et.al. | 2305.16333 | null |
| 2023-05-25 | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Tianrui Wang et.al. | 2305.16107 | null |
| 2023-05-25 | Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration | Rustem Yeshpanov et.al. | 2305.15749 | link |
| 2024-02-05 | LAraBench: Benchmarking Arabic AI with Large Language Models | Ahmed Abdelali et.al. | 2305.14982 | null |
| 2023-05-23 | EfficientSpeech: An On-Device Text to Speech Model | Rowel Atienza et.al. | 2305.13905 | link |
| 2023-05-23 | ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models | Minki Kang et.al. | 2305.13831 | null |
| 2023-05-22 | U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech | Xin Jing et.al. | 2305.13195 | null |
| 2023-05-25 | EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels | Kari Ali Noriy et.al. | 2305.13137 | link |
| 2023-05-22 | ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer | Huadai Liu et.al. | 2305.12708 | null |
| 2023-05-21 | VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages | Shivam Mhaskar et.al. | 2305.12518 | null |
| 2023-05-26 | Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus | Detai Xin et.al. | 2305.12442 | link |
| 2023-05-20 | ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios | Yuyue Wang et.al. | 2305.12200 | null |
| 2023-05-19 | MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting | Neil Shah et.al. | 2305.11926 | null |
| 2024-02-20 | Data Redaction from Conditional Generative Models | Zhifeng Kong et.al. | 2305.11351 | null |
| 2023-05-18 | Parameter-Efficient Learning for Text-to-Speech Accent Adaptation | Li-Jen Yang et.al. | 2305.11320 | link |
| 2023-05-19 | Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation | Martijn Bartelds et.al. | 2305.10951 | link |
| 2023-09-30 | Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data | Yusheng Tian et.al. | 2305.10891 | link |
| 2023-05-18 | FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs | Won Jang et.al. | 2305.10823 | null |
| 2023-05-18 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | Zhenhui Ye et.al. | 2305.10763 | null |
| 2023-08-29 | A unified front-end framework for English text-to-speech synthesis | Zelin Ying et.al. | 2305.10666 | null |
| 2023-09-19 | Controllable Speaking Styles Using a Large Language Model | Atli Thor Sigurgeirsson et.al. | 2305.10321 | null |
| 2023-05-23 | Better speech synthesis through scaling | James Betker et.al. | 2305.07243 | link |
| 2023-10-29 | CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model | Zhen Ye et.al. | 2305.06908 | link |
| 2023-05-08 | Accented Text-to-Speech Synthesis with Limited Data | Xuehao Zhou et.al. | 2305.04816 | null |
| 2023-05-03 | M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis | Jinlong Xue et.al. | 2305.02269 | null |
| 2023-05-30 | A Review of Deep Learning Techniques for Speech Processing | Ambuj Mehrish et.al. | 2305.00359 | null |
| 2023-04-26 | Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis | Ye-Xin Lu et.al. | 2304.13270 | null |
| 2023-04-25 | Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge | Chenpeng Du et.al. | 2304.13121 | null |
| 2023-04-24 | Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model | Kenichi Fujita et.al. | 2304.11976 | null |
| 2023-04-23 | DiffVoice: Text-to-Speech with Latent Diffusion | Zhijun Liu et.al. | 2304.11750 | null |
| 2023-04-23 | SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model | Jianzong Wang et.al. | 2304.11547 | null |
| 2023-05-31 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Kai Shen et.al. | 2304.09116 | null |
| 2023-04-16 | A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers | Juan Zuluaga-Gomez et.al. | 2304.07842 | null |
| 2023-04-13 | Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis | Shun Lei et.al. | 2304.06359 | null |
| 2023-04-10 | Enhancing Speech-to-Speech Translation with Multiple TTS Targets | Jiatong Shi et.al. | 2304.04618 | null |
| 2023-04-07 | ArmanTTS single-speaker Persian dataset | Mohammd Hasan Shamgholi et.al. | 2304.03585 | null |
| 2023-04-03 | Ensemble prosody prediction for expressive speech synthesis | Tian Huey Teh et.al. | 2304.00714 | null |
| 2023-03-29 | AraSpot: Arabic Spoken Command Spotting | Mahmoud Salhab et.al. | 2303.16621 | link |
| 2023-03-28 | Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages | Seongyeon Park et.al. | 2303.15669 | link |
| 2023-03-27 | Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis | Karren Yang et.al. | 2303.14885 | null |
| 2023-03-24 | Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis | Takuhiro Kaneko et.al. | 2303.13909 | null |
| 2023-04-02 | A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI | Chenshuang Zhang et.al. | 2303.13336 | null |
| 2023-03-20 | Code-Switching Text Generation and Injection in Mandarin-English ASR | Haibin Yu et.al. | 2303.10949 | null |
| 2023-03-14 | Controlling High-Dimensional Data With Sparse Input | Dan Andrei Iliescu et.al. | 2303.09446 | null |
| 2023-03-09 | Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports | Hyunseung Chung et.al. | 2303.09395 | link |
| 2023-03-15 | Cross-speaker Emotion Transfer by Manipulating Speech Style Latents | Suhee Jo et.al. | 2303.08329 | null |
| 2023-03-14 | QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis | Haobin Tang et.al. | 2303.07682 | null |
| 2023-03-10 | An End-to-End Neural Network for Image-to-Audio Transformation | Liu Chen et.al. | 2303.06078 | null |
| 2023-03-09 | Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation | Qi Chen et.al. | 2303.05322 | link |
| 2023-03-07 | Do Prosody Transfer Models Transfer Prosody? | Atli Thor Sigurgeirsson et.al. | 2303.04289 | null |
| 2023-03-07 | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | Ziqiang Zhang et.al. | 2303.03926 | null |
| 2023-03-02 | Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding | Yingting Li et.al. | 2303.03267 | link |
| 2023-03-08 | FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model | Ruiqing Xue et.al. | 2303.02939 | null |
| 2023-08-14 | Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations | Yuma Koizumi et.al. | 2303.01664 | null |
| 2023-03-11 | Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities | Shijun Wang et.al. | 2303.01508 | null |
| 2023-12-17 | ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations | Neil Shah et.al. | 2303.01261 | null |
| 2023-03-02 | LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion | Chunfeng Wang et.al. | 2303.01086 | null |
| 2023-03-02 | Leveraging Large Text Corpora for End-to-End Speech Summarization | Kohei Matsuura et.al. | 2303.00978 | null |
| 2023-03-01 | DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction | Raviteja Anantha et.al. | 2303.00171 | null |
| 2023-02-28 | ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus | Ajinkya Kulkarni et.al. | 2303.00069 | null |
| 2023-02-28 | Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners | Jocelyn Huang et.al. | 2302.14523 | null |
| 2023-06-12 | CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis | Ji-Hoon Kim et.al. | 2302.14370 | null |
| 2023-05-19 | UniFLG: Unified Facial Landmark Generator from Text or Speech | Kentaro Mitsui et.al. | 2302.14337 | null |
| 2023-02-27 | Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech | Jiyoung Lee et.al. | 2302.13700 | link |
| 2023-02-27 | Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech | Dong Yang et.al. | 2302.13652 | null |
| 2023-02-27 | Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow | Yoonhyung Lee et.al. | 2302.13458 | null |
| 2023-06-06 | PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS | Junhyeok Lee et.al. | 2302.12391 | link |
| 2023-02-21 | Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition | Leyuan Qu et.al. | 2302.09723 | null |
| 2023-02-23 | QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion | Houjian Guo et.al. | 2302.08296 | link |
| 2023-02-13 | Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages | Sudhanshu Srivastava et.al. | 2302.06227 | null |
| 2023-02-08 | A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech | Li-Wei Chen et.al. | 2302.04215 | link |
| 2023-02-07 | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | Eugene Kharitonov et.al. | 2302.03540 | null |
| 2023-02-15 | MAC: A unified framework boosting low resource automatic speech recognition | Zeping Min et.al. | 2302.03498 | null |
| 2023-06-25 | InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt | Dongchao Yang et.al. | 2301.13662 | link |
| 2023-03-01 | UzbekTagger: The rule-based POS tagger for Uzbek language | Maksud Sharipov et.al. | 2301.12711 | null |
| 2023-05-27 | Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining | Takaaki Saeki et.al. | 2301.12596 | link |
| 2023-01-31 | Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker | Navjot Kaur et.al. | 2301.12331 | link |
| 2023-01-26 | On granularity of prosodic representations in expressive text-to-speech | Mikolaj Babianski et.al. | 2301.11446 | null |
| 2023-01-26 | Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study | Massa Baali et.al. | 2301.09099 | link |
| 2023-01-20 | Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions | Yinghao Aaron Li et.al. | 2301.08810 | null |
| 2023-01-11 | Modelling low-resource accents without accent-specific TTS frontend | Georgi Tinchev et.al. | 2301.04606 | null |
| 2022-12-11 | BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm | Yu-Wen Chen et.al. | 2301.04120 | link |
| 2023-01-10 | UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion | Haogeng Liu et.al. | 2301.03801 | null |
| 2023-01-10 | Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation | Abdullah Shahid et.al. | 2301.03751 | null |
| 2023-09-19 | Applying Automated Machine Translation to Educational Video Courses | Linden Wang et.al. | 2301.03141 | null |
| 2023-01-06 | Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition | David M. Chan et.al. | 2301.02736 | null |
| 2023-01-05 | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | Chengyi Wang et.al. | 2301.02111 | link |
| 2022-12-11 | MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset | Kailin Liang et.al. | 2301.00657 | link |
| 2022-12-30 | ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech | Zehua Chen et.al. | 2212.14518 | null |
| 2022-12-29 | StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models | Yinghao Aaron Li et.al. | 2212.14227 | link |
| 2022-12-22 | HMM-based data augmentation for E2E systems for building conversational speech synthesis systems | Ishika Gupta et.al. | 2212.11982 | null |
| 2022-12-21 | ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement | Wei-Ning Hsu et.al. | 2212.11377 | null |
| 2022-12-20 | TTS-Guided Training for Accent Conversion Without Parallel Data | Yi Zhou et.al. | 2212.10204 | null |
| 2023-06-28 | Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling | Tuomo Raitio et.al. | 2212.10075 | null |
| 2022-12-16 | Speech Aware Dialog System Technology Challenge (DSTC11) | Hagen Soltau et.al. | 2212.08704 | null |
| 2022-12-16 | Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder | Yusuke Yasuda et.al. | 2212.08329 | null |
| 2022-12-16 | Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language | Yusuke Yasuda et.al. | 2212.08321 | null |
| 2022-12-15 | RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis | Shinhyeok Oh et.al. | 2212.07939 | link |
| 2022-12-14 | Probing Deep Speaker Embeddings for Speaker-related Tasks | Zifeng Zhao et.al. | 2212.07068 | null |
| 2022-12-08 | SpeechLMScore: Evaluating speech generation using speech language model | Soumi Maiti et.al. | 2212.04559 | link |
| 2023-04-04 | Learning to Dub Movies via Hierarchical Prosody Models | Gaoxiang Cong et.al. | 2212.04054 | link |
| 2022-12-07 | Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning | Ankur Debnath et.al. | 2212.03558 | null |
| 2022-12-07 | Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue | Daxin Tan et.al. | 2212.03398 | null |
| 2022-12-06 | UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis | Yi Lei et.al. | 2212.01546 | null |
| 2022-11-30 | SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech | Byoung Jin Choi et.al. | 2211.16866 | null |
| 2022-11-29 | Controllable speech synthesis by learning discrete phoneme-level prosodic representations | Nikolaos Ellinas et.al. | 2211.16307 | null |
| 2023-05-25 | Evaluating and reducing the distance between synthetic and real speech distributions | Christoph Minixhofer et.al. | 2211.16049 | null |
| 2022-11-26 | Contextual Expressive Text-to-Speech | Jianhong Tu et.al. | 2211.14548 | null |
| 2022-12-05 | Efficient Incremental Text-to-Speech on GPUs | Muyang Du et.al. | 2211.13939 | null |
| 2023-03-21 | Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems? | Xuan Shi et.al. | 2211.13868 | link |
| 2022-11-23 | IMaSC -- ICFOSS Malayalam Speech Corpus | Deepa P Gopinath et.al. | 2211.12796 | null |
| 2022-11-22 | PromptTTS: Controllable Text-to-Speech with Text Descriptions | Zhifang Guo et.al. | 2211.12171 | null |
| 2022-11-04 | Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech | Xin Zhang et.al. | 2211.09731 | null |
| 2023-02-17 | Towards Building Text-To-Speech Systems for the Next Billion Users | Gokul Karthik Kumar et.al. | 2211.09536 | link |
| 2023-02-16 | EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance | Yiwei Guo et.al. | 2211.09496 | null |
| 2022-11-17 | Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation | Chunyu Qiang et.al. | 2211.09495 | null |
| 2022-11-17 | NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis | Hyeong-Seok Choi et.al. | 2211.09407 | null |
| 2023-03-14 | Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models | Minki Kang et.al. | 2211.09383 | null |
| 2023-01-04 | Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation | Xin Yuan et.al. | 2211.09365 | null |
| 2022-11-14 | SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech | Perry Lam et.al. | 2211.07283 | null |
| 2023-05-25 | Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing | Jacob J Webber et.al. | 2211.06989 | null |
| 2023-05-29 | OverFlow: Putting flows on top of neural transducers for better TTS | Shivam Mehta et.al. | 2211.06892 | link |
| 2023-05-29 | Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations | Yoori Oh et.al. | 2211.06160 | null |
| 2022-12-04 | ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech | Xiaoran Fan et.al. | 2211.03545 | link |
| 2022-11-07 | Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder | Jan Melechovsky et.al. | 2211.03316 | link |
| 2022-11-06 | Parallel Attention Forcing for Machine Translation | Qingyun Dou et.al. | 2211.03237 | null |
| 2022-11-06 | An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space | Jihwan Lee et.al. | 2211.03078 | null |
| 2022-11-04 | NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS | Dongchao Yang et.al. | 2211.02448 | null |
| 2022-11-04 | Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts | Detai Xin et.al. | 2211.02336 | null |
| 2023-04-16 | Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS | Ziqi Liang et.al. | 2211.01948 | null |
| 2022-11-01 | Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages | Anusha Prakash et.al. | 2211.01338 | null |
| 2023-05-28 | DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP | Kun Song et.al. | 2211.01087 | null |
| 2022-11-22 | Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement | Wei Song et.al. | 2211.00967 | null |
| 2022-11-01 | Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers | Cheng-Ping Hsieh et.al. | 2211.00585 | link |
| 2023-06-11 | Generating Multilingual Gender-Ambiguous Text-to-Speech Voices | Konstantinos Markopoulos et.al. | 2211.00375 | null |
| 2023-05-07 | Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features | Alexandra Vioni et.al. | 2211.00342 | null |
| 2022-11-02 | Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS | Kun Song et.al. | 2210.17349 | null |
| 2024-02-27 | Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation | Nikolaos Ellinas et.al. | 2210.17264 | null |
| 2022-10-31 | Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection | Luigi Attorresi et.al. | 2210.17222 | null |
| 2022-10-31 | Structured State Space Decoder for Speech Recognition and Synthesis | Koichi Miyazaki et.al. | 2210.17098 | null |
| 2022-10-28 | Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders | Jason Fong et.al. | 2210.16045 | null |
| 2023-02-21 | Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform | Masaya Kawamura et.al. | 2210.15975 | link |
| 2023-02-22 | Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis | Yuma Shirahata et.al. | 2210.15964 | null |
| 2022-10-28 | Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation | Nobuyuki Morioka et.al. | 2210.15868 | null |
| 2023-03-15 | Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech | Takaaki Saeki et.al. | 2210.15447 | null |
| 2022-10-27 | Explicit Intensity Control for Accented Text-to-speech | Rui Liu et.al. | 2210.15364 | null |
| 2022-10-27 | FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis | Yifan Hu et.al. | 2210.15360 | link |
| 2022-10-26 | Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection | Kentaro Seki et.al. | 2210.14850 | null |
| 2022-10-25 | Semi-Supervised Learning Based on Reference Model for Low-resource TTS | Xulong Zhang et.al. | 2210.14723 | null |
| 2022-10-26 | Cover Reproducible Steganography via Deep Generative Models | Kejiang Chen et.al. | 2210.14632 | null |
| 2022-10-26 | Improving Speech-to-Speech Translation Through Unlabeled Text | Xuan-Phi Nguyen et.al. | 2210.14514 | null |
| 2022-10-26 | The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge | Yuhao Liang et.al. | 2210.14448 | null |
| 2022-10-25 | Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data | Xulong Zhang et.al. | 2210.13803 | null |
| 2023-09-17 | HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation | Chunhui Wang et.al. | 2210.12740 | null |
| 2022-10-21 | Low-Resource Multilingual and Zero-Shot Multispeaker TTS | Florian Lux et.al. | 2210.12223 | link |
| 2022-10-21 | Adaptive re-calibration of channel-wise features for Adversarial Audio Classification | Vardhan Dongre et.al. | 2210.11722 | null |
| 2022-10-20 | Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS | Chunyu Qiang et.al. | 2210.11429 | null |
| 2022-10-17 | Towards Relation Extraction From Speech | Tongtong Wu et.al. | 2210.08759 | link |
| 2023-02-08 | Generating Synthetic Speech from SpokenVocab for Speech Translation | Jinming Zhao et.al. | 2210.08174 | link |
| 2022-10-17 | LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge | Yan Jia et.al. | 2210.07749 | null |
| 2022-10-20 | Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy | Sarina Meyer et.al. | 2210.07002 | link |
| 2022-10-13 | Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar | Aolan Sun et.al. | 2210.06877 | null |
| 2022-10-12 | Can we use Common Voice to train a Multi-Speaker TTS system? | Sewade Ogun et.al. | 2210.06370 | null |
| 2023-06-01 | SQuId: Measuring Speech Naturalness in Many Languages | Thibault Sellam et.al. | 2210.06324 | null |
| 2022-11-22 | Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech | Byoung Jin Choi et.al. | 2210.05979 | null |
| 2022-10-06 | An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era | Andreas Triantafyllopoulos et.al. | 2210.03538 | null |
| 2022-09-29 | Facial Landmark Predictions with Applications to Metaverse | Qiao Han et.al. | 2209.14698 | link |
| 2022-09-26 | Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech | Yusuke Nakai et.al. | 2209.12549 | null |
| 2022-09-22 | EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models | Perry Lam et.al. | 2209.10890 | null |
| 2022-09-22 | MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline | Yifan Hu et.al. | 2209.10848 | link |
| 2022-09-22 | Controllable Accented Text-to-Speech Synthesis | Rui Liu et.al. | 2209.10804 | null |
| 2022-09-16 | TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection | Davide Salvi et.al. | 2209.08000 | null |
| 2022-09-14 | Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset | Michael Chinen et.al. | 2209.06358 | null |
| 2022-09-08 | SANIP: Shopping Assistant and Navigation for the visually impaired | Shubham Deshmukh et.al. | 2209.03570 | null |
| 2022-09-07 | Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech | Huu-Tien Dang et.al. | 2209.02971 | null |
| 2022-09-02 | Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model | Jennifer Drexler Fox et.al. | 2209.01250 | null |
| 2022-08-28 | Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks | Lev Finkelstein et.al. | 2208.13183 | null |
| 2022-10-04 | Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale | Aditya Agarwal et.al. | 2208.09796 | null |
| 2022-08-21 | Visualising Model Training via Vowel Space for Text-To-Speech Systems | Binu Abeysinghe et.al. | 2208.09775 | link |
| 2022-08-15 | Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0 | Mohammed Salah Al-Radhi et.al. | 2208.07122 | null |
| 2022-12-28 | Speech Synthesis with Mixed Emotions | Kun Zhou et.al. | 2208.05890 | null |
| 2022-08-03 | A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis | Qibing Bai et.al. | 2208.02189 | null |
| 2022-07-29 | Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation | Giulia Comini et.al. | 2207.14607 | null |
| 2022-07-25 | Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis | Raul Fernandez et.al. | 2207.12262 | null |
| 2022-07-01 | A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese | Song Zhang et.al. | 2207.12089 | null |
| 2022-07-20 | When Is TTS Augmentation Through a Pivot Language Useful? | Nathaniel Robinson et.al. | 2207.09889 | link |
| 2022-07-11 | LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech | Harshvardhan Anand et.al. | 2207.07118 | null |
| 2022-07-13 | ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | Rongjie Huang et.al. | 2207.06389 | link |
| 2022-07-13 | Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech | Zhengxi Liu et.al. | 2207.06088 | null |
| 2022-07-13 | SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate | Nabarun Goswami et.al. | 2207.06011 | null |
| 2022-07-13 | Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS | Yookyung Shin et.al. | 2207.06000 | null |
| 2022-07-13 | A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System | Yi-Chiao Wu et.al. | 2207.05913 | null |
| 2022-07-12 | Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition | Rodolfo Zevallos et.al. | 2207.05498 | null |
| 2022-07-12 | End-to-end speech recognition modeling from de-identified data | Martin Flechl et.al. | 2207.05469 | null |
| 2022-07-11 | Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data | Naoki Makishima et.al. | 2207.04659 | null |
| 2022-07-11 | DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders | Yanqing Liu et.al. | 2207.04646 | null |
| 2023-01-02 | Dreamento: an open-source dream engineering toolbox for sleep EEG wearables | Mahdad Jafarzadeh Esfahani et.al. | 2207.03977 | link |
| 2022-07-07 | BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus | Josh Meyer et.al. | 2207.03546 | link |
| 2022-07-05 | Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | Yi Lei et.al. | 2207.01832 | null |
| 2022-07-04 | BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model | Brooke Stephenson et.al. | 2207.01718 | null |
| 2022-07-04 | Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) | Ariadna Sanchez et.al. | 2207.01547 | null |
| 2022-07-04 | Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) | Ziyao Zhang et.al. | 2207.01507 | null |
| 2023-03-13 | DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech | Keon Lee et.al. | 2207.01063 | link |
| 2022-07-02 | Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need | Daniel Korzekwa et.al. | 2207.00774 | null |
| 2022-07-01 | Building African Voices | Perez Ogayo et.al. | 2207.00688 | link |
| 2022-07-01 | Automatic Evaluation of Speaker Similarity | Deja Kamil et.al. | 2207.00344 | null |
| 2022-08-03 | Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding | Wei-Ping Huang et.al. | 2206.15427 | null |
| 2022-06-30 | R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS | Kyle Kastner et.al. | 2206.15276 | null |
| 2022-07-01 | Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems | Hyun-Wook Yoon et.al. | 2206.15067 | null |
| 2022-06-30 | TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder | Eunwoo Song et.al. | 2206.14984 | null |
| 2022-06-29 | Improving Deliberation by Text-Only and Semi-Supervised Training | Ke Hu et.al. | 2206.14716 | null |
| 2022-06-29 | Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody | Peter Makarov et.al. | 2206.14643 | null |
| 2022-06-28 | Expressive, Variable, and Controllable Duration Modelling in TTS | Ammar Abbas et.al. | 2206.14165 | null |
| 2022-06-28 | Comparison of Speech Representations for the MOS Prediction System | Aki Kunikoshi et.al. | 2206.13817 | null |
| 2022-06-22 | A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data | Raviraj Joshi et.al. | 2206.13240 | null |
| 2022-06-25 | Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations | Chin-Cheng Hsu et.al. | 2206.12662 | null |
| 2022-10-21 | Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech | Florian Lux et.al. | 2206.12229 | link |
| 2022-06-24 | SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech | Hyunjae Cho et.al. | 2206.12132 | null |
| 2022-06-24 | End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue | Kentaro Mitsui et.al. | 2206.12040 | null |
| 2022-05-29 | Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning | Sameea Naeem et.al. | 2206.11860 | null |
| 2022-06-21 | Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS | Kenta Udagawa et.al. | 2206.10256 | null |
| 2022-06-24 | Towards Optimizing OCR for Accessibility | Peya Mowar et.al. | 2206.10254 | null |
| 2022-06-16 | Automatic Prosody Annotation with Pre-Trained Text-Speech Model | Ziqian Dai et.al. | 2206.07956 | link |
| 2022-11-16 | NatiQ: An End-to-end Text-to-Speech System for Arabic | Ahmed Abdelali et.al. | 2206.07373 | null |
| 2022-06-15 | Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning | Rui Liu et.al. | 2206.07229 | link |
| 2022-12-12 | A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation | Junhui Zhang et.al. | 2206.04922 | null |
| 2022-06-09 | Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos | Alexander Waibel et.al. | 2206.04523 | null |
| 2022-06-07 | FlexLip: A Controllable Text-to-Lip System | Dan Oneata et.al. | 2206.03206 | null |
| 2022-10-11 | UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder | Jiachen Lian et.al. | 2206.02512 | null |
| 2023-10-19 | Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech | Ziyue Jiang et.al. | 2206.02147 | link |
| 2022-11-02 | AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation | Kun Song et.al. | 2206.00208 | null |
| 2022-05-31 | Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish | Alp Öktem et.al. | 2205.15599 | link |
| 2023-11-20 | StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis | Yinghao Aaron Li et.al. | 2205.15439 | link |
| 2022-05-30 | Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data | Sungwon Kim et.al. | 2205.15370 | null |
| 2022-05-26 | QSpeech: Low-Qubit Quantum Speech Application Toolkit | Zhenhou Hong et.al. | 2205.13221 | link |
| 2022-11-10 | T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation | Paul-Ambroise Duquenne et.al. | 2205.12216 | null |
| 2022-05-20 | PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit | Hui Zhang et.al. | 2205.12007 | link |
| 2022-05-24 | TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS | Xulong Zhang et.al. | 2205.11824 | null |
| 2022-10-12 | GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech | Rongjie Huang et.al. | 2205.07211 | link |
| 2022-05-13 | Talking Face Generation with Multilingual TTS | Hyoung-Kyu Song et.al. | 2205.06421 | null |
| 2022-05-10 | NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | Xu Tan et.al. | 2205.04421 | link |
| 2022-05-09 | Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech | Yang Li et.al. | 2205.04120 | link |
| 2022-05-09 | ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence | Sangshin Oh et.al. | 2205.04104 | null |
| 2022-07-14 | Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss | Efthymios Georgiou et.al. | 2204.13437 | null |
| 2024-06-06 | Parallel Synthesis for Autoregressive Speech Generation | Po-chun Hsu et.al. | 2204.11806 | null |
| 2022-04-25 | SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech | Zhenhui Ye et.al. | 2204.11792 | link |
| 2022-04-22 | LibriS2S: A German-English Speech-to-Speech Translation Corpus | Pedro Jeuris et.al. | 2204.10593 | link |
| 2022-07-05 | Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation | Ryo Terashima et.al. | 2204.10020 | null |
| 2022-04-21 | FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | Rongjie Huang et.al. | 2204.09934 | link |
| 2022-04-20 | Audio Deep Fake Detection System with Neural Stitching for ADD 2022 | Rui Yan et.al. | 2204.08720 | null |
| 2022-04-14 | Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech | Cong Zhang et.al. | 2204.07228 | null |
| 2022-12-09 | Study of Indian English Pronunciation Variabilities relative to Received Pronunciation | Priyanshi Pal et.al. | 2204.06502 | null |
| 2022-04-12 | Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch | Hanbin Bae et.al. | 2204.05753 | null |
| 2023-01-30 | The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance | Lin Zhang et.al. | 2204.05177 | null |
| 2022-10-27 | Fine-grained Noise Control for Multispeaker Speech Synthesis | Karolos Nikitaras et.al. | 2204.05070 | null |
| 2022-08-31 | Karaoker: Alignment-free singing voice synthesis with speech training data | Panos Kakoulidis et.al. | 2204.04127 | null |
| 2022-08-15 | Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech | Jae-Sung Bae et.al. | 2204.04004 | null |
| 2022-04-07 | Arabic Text-To-Speech (TTS) Data Preparation | Hala Al Masri et.al. | 2204.03255 | null |
| 2022-04-07 | Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis | Yutian Wang et.al. | 2204.03238 | null |
| 2022-08-24 | SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis | Georgia Maniati et.al. | 2204.03040 | null |
| 2022-09-13 | Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation | Sravya Popuri et.al. | 2204.02967 | null |
| 2022-07-02 | Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification | Jin Woo Lee et.al. | 2204.02639 | null |
| 2023-08-28 | Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech | Hyungchan Yoon et.al. | 2204.02172 | null |
| 2022-09-07 | Deliberation Model for On-Device Spoken Language Understanding | Duc Le et.al. | 2204.01893 | null |
| 2022-12-14 | Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck | Youngsik Eom et.al. | 2204.01387 | null |
| 2022-11-11 | Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis | Yixuan Zhou et.al. | 2204.00990 | null |
| 2022-06-30 | VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature | Chenpeng Du et.al. | 2204.00768 | null |
| 2022-04-01 | AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios | Yihan Wu et.al. | 2204.00436 | null |
| 2022-04-01 | Text-To-Speech Data Augmentation for Low Resource Speech Recognition | Rodolfo Zevallos et.al. | 2204.00291 | null |
| 2022-07-19 | Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech | Guangyan Zhang et.al. | 2203.17190 | null |
| 2022-03-31 | An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer | Wenlin Dai et.al. | 2203.16954 | link |
| 2022-07-11 | WavThruVec: Latent speech representation as intermediate features for neural speech synthesis | Hubert Siuzdak et.al. | 2203.16930 | null |
| 2022-03-31 | A Character-level Span-based Model for Mandarin Prosodic Structure Prediction | Xueyuan Chen et.al. | 2203.16922 | link |
| 2022-07-01 | JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech | Dan Lim et.al. | 2203.16852 | link |
| 2022-03-31 | Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset | Zehui Yang et.al. | 2203.16844 | null |
| 2022-03-31 | NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism | Jingbei Li et.al. | 2203.16838 | link |
| 2022-03-31 | Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition | Anirudh Gupta et.al. | 2203.16823 | null |
| 2022-04-21 | Does Audio Deepfake Detection Generalize? | Nicolas M. Müller et.al. | 2203.16263 | null |
| 2022-03-30 | End to End Lip Synchronization with a Temporal AutoEncoder | Yoav Shalev et.al. | 2203.16224 | link |
| 2022-08-15 | Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition | Junrui Ni et.al. | 2203.15796 | link |
| 2022-06-29 | DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning | Takaaki Saeki et.al. | 2203.15683 | null |
| 2022-11-05 | Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation | Rendi Chevi et.al. | 2203.15643 | link |
| 2022-10-06 | Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus | Minchan Kim et.al. | 2203.15447 | null |
| 2022-07-11 | VoiceMe: Personalized voice generation in TTS | Pol van Rijn et.al. | 2203.15379 | link |
| 2021-07-13 | Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging | Tamás Gábor Csapó et.al. | 2107.05550 | null |
| 2021-07-07 | Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm | Elijah Gutierrez et.al. | 2107.02527 | null |
| 2022-02-25 | Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis | Erica Cooper et.al. | 2104.12292 | null |
| 2019-09-26 | Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities | Slava Shechtman et.al. | 1909.10302 | null |
| 2019-08-28 | Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis | Xin Wang et.al. | 1908.10256 | null |
| 2019-05-22 | Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems | Ohsung Kwon et.al. | 1905.08486 | null |
| 2017-09-26 | Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks | Yuki Saito et.al. | 1709.08041 | null |