Starred repositories
Official Implementation for our EMNLP 2025 paper: "Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation"
Beyond the Model: Scaling Medical Capability with a Large Verifier System
EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Interspeech 2025 [Project page]
PowerFM is an open-source repository for foundation models in the power and energy domain. It both maintains original projects and collects community-contributed open-source projects, featuring fin…
PowerWorkflow is an open-source collection of agentic workflows for power system applications. These workflows enable intelligent automation and coordination of power system operations, facilitatin…
A Benchmark for Evaluating Turn-Taking and Overlap Handling in Full-Duplex Spoken Dialogue Models
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
This is a simple demonstration of more advanced, agentic patterns built on top of the Realtime API.
Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
React app for inspecting, building and debugging with the Realtime API
A generative world for general-purpose robotics & embodied AI learning.
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
Deep Articulatory Synthesis and Inversion
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
Neural network-based forced alignment with bidirectional attention mechanism
Attempt at tracking the state of the art and recent results (bibliography) on speech recognition.
[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
ChatGLM-6B: An Open Bilingual Dialogue Language Model