Stars
[INTERSPEECH 2022] This dataset is designed for multi-modal speaker diarization and lip-speech synchronization in the wild.
Official implementation for the paper "How Can Objects Help Video-Language Understanding"
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the *cocktail party effect* from an augmented-reality (AR)-motivated multi-sensor egocentric world view.
[ICMI 2024] SEMPI: A Database for Understanding Social Engagement in Video-Mediated Multiparty Interaction
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.
[CVPR2025] Number it: Temporal Grounding Videos like Flipping Manga
Official implementation (PyTorch) of "VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning", AAAI 2025
[AAAI 2025 Oral] Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning
[CVPR 2025] 🔥 Official impl. of "Audio-Visual Instance Segmentation".
[ICLR 2026 Oral] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
This repository contains low-bit quantization papers from 2020 to 2025 published at top conferences.
[WACV 2026] LASER: Lip Landmark Assisted Speaker Detection for Robustness (official implementation)
ACM MM 2021: 'Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection'
Identifying "who speaks when" using visual speech input and a pretrained lip-sync expert
Code repo for LoCoNet: Long-Short Context Network for Active Speaker Detection
The repository for Springer IJCV 2025 (LR-ASD: Lightweight and Robust Network for Active Speaker Detection)
The repository for IEEE CVPR 2023 (A Light Weight Model for Active Speaker Detection)
EMER, OV-MER (ICML25), AffectGPT (ICML25, Oral), EmoPrefer (ICLR26)
A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine …
[ICLR 2026] Official implementation of the paper "Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs"