video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, developed by the Department of Electronic Engineering at Tsin…

Python 189 24 Updated Feb 23, 2026

JavisVerse / JavisGPT

[NeurIPS'25 Spotlight] Official implementation of "JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation"

Python 73 7 Updated Feb 26, 2026

ddlBoJack / Omni-Captioner

[ICLR 2026] Data Pipeline, Models, and Benchmark for Omni-Captioner.

Python 134 Updated Apr 7, 2026

yaolinli / TimeChat-Captioner

[ICML 2026] Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Python 38 Updated May 7, 2026

Xiaohao-Liu / Awesome-Multi-Token-Prediction

A curated list of papers, tools, and resources on Multi-Token Prediction (MTP) and related techniques in Large Language Models (LLMs), Speech-Language Models (SLMs), and more.

107 7 Updated May 16, 2026

AIDC-AI / Awesome-Unified-Multimodal-Models

Awesome Unified Multimodal Models

1,249 39 Updated Mar 24, 2026

meituan-longcat / LongCat-Next

423 21 Updated May 9, 2026

MAC-AutoML / SocialOmni

Benchmarking Audio-Visual Social Interactivity in Omni Models

Python 47 1 Updated May 7, 2026

apple / ml-egodex

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Python 258 8 Updated Aug 20, 2025

mit-han-lab / streaming-vlm

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Python 981 62 Updated Oct 15, 2025

spatigen / vhub

Python 3 Updated Mar 11, 2026

bytedance / UI-TARS-desktop

The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra

TypeScript 34,366 3,441 Updated May 15, 2026

facebookresearch / EgoAVU

[CVPR 2026 Highlight] Official release of EgoAVU: Egocentric Audio-Visual Understanding

Python 30 4 Updated Apr 26, 2026

facebookresearch / EasyComDataset

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the *cocktail party effect* from an augmented-reality (AR)-motivated, multi-sensor egocentric world view.

140 10 Updated Dec 4, 2023