- RØDE microphones
- https://www.robots.ox.ac.uk/~jaesung/
- @huh_jaesung
Stars
A collection of datasets for the purpose of emotion recognition/detection in speech.
Code for LiFT (Linearized Feature Trajectories) video embedding
Official PyTorch implementation of "Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models".
Command line utility for forced alignment using Kaldi
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
Local-first AI Notepad for Private Meetings
Official implementation of RAVEn (ICLR 2023) and BRAVEn (ICASSP 2024)
Foundation Models and Data for Human-Human and Human-AI interactions.
The best OSS video generation models, created by Genmo
AVES: Animal Vocalization Encoder based on Self-Supervision
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Acoustic impulse response generation using diffusion models
Code implementation for the paper "Large-scale Pre-training for Grounded Video Caption Generation" (ICCV 2025)
[CVPR 2025] Official code for Lost in Translation Found in Context
[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
Real-Time Face Recognition using SCRFD, ArcFace, ByteTrack and Similarity Measure
[NeurIPS 2025] Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Official code for the paper "Scaling Multilingual Visual Speech Recognition"
Official code for the paper "Understanding Co-speech Gestures in-the-wild"
Whisper-Flamingo [Interspeech 2024] and mWhisper-Flamingo [IEEE SPL 2025] for Audio-Visual Speech Recognition and Translation
Code for ACCV 2024 paper: "3D-Aware Instance Segmentation and Tracking in Egocentric Videos"
Multimodal language model benchmark, featuring challenging examples
Code for the paper "The Sound of Water: Inferring Physical Properties from Pouring Liquids".