Stars
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
A Fair and Scalable Time Series Forecasting Benchmark and Toolkit.
Recipes to train reward models for RLHF.
A Semantic Controllable Self-Supervised Learning Framework to learn general human representations from massive unlabeled human images, which can benefit downstream human-centric tasks to the maximu…
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
Real-time and accurate open-vocabulary end-to-end object detection
[CVPR 2025] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
Official Repo For OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
VGGSfM: Visual Geometry Grounded Deep Structure From Motion
[CVPR'23] Universal Instance Perception as Object Discovery and Retrieval
Unified KV Cache Compression Methods for Auto-Regressive Models
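一账通 is an open-source unified identity authentication and authorization management solution. It supports multiple standard protocols (LDAP, OAuth2, SAML, OpenID), fine-grained permission control, full web-based management, and integration with DingTalk and WeCom, among others. QQ group: 167885406
An IDA plugin that helps analyze binary files. It can be backed by commonly used large AI models such as OpenAI and DeepSeek.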
Res-SAM Framework for GPR Underground Hazard Detection
[NeurIPS'25 Spotlight] Mulberry, an o1-like Reasoning and Reflection MLLM Implemented via Collective MCTS
[CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation
One repository is all that is necessary for Multi-agent Reinforcement Learning (MARL)
[NeurIPS 2024] DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation
AI Manus is a general-purpose AI Agent system that supports running various tools and operations in a sandbox environment.
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
PyTorch Implementation of AudioLCM (ACM-MM'24): efficient and high-quality text-to-audio generation with a latent consistency model.
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
PantoMatrix: Generating Face and Body Animation from Speech
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning.
[ICLR 2025 Oral] TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio-Motion Embedding and Diffusion Interpolation
Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.