Stars
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.
This repository is intended to host tools and demos for ActivityNet
[ICLR 2026] FastVGGT: Fast Visual Geometry Transformer
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
TALL: Temporal Activity Localization via Language Query
Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
PyTorch code and models for VJEPA2 self-supervised learning from video.
[ICCV 2023] UniVTG: Towards Unified Video-Language Temporal Grounding
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
A minimal, educational HEVC (H.265) encoder written in Python.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
The official PyTorch implementation of our paper "Is Space-Time Attention All You Need for Video Understanding?"
🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs.
Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Large language model knowledge explained so that anyone can understand it; a must-read before LLM interviews in spring/autumn campus recruiting, so you can hold your own with the interviewer
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
《代码随想录》 (Code Thoughts) LeetCode problem-solving guide: a recommended order for 200 classic problems, 600,000+ words of detailed illustrated explanations, video breakdowns of difficult points, 50+ mind maps, with solutions in C++, Java, Python, Go, JavaScript, and other languages. No more feeling lost when learning algorithms! 🔥🔥 Take a look; you'll wish you had found it sooner! 🚀
The code for the paper: "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models"