Stars
Open-Sora: Democratizing Efficient Video Production for All
State-of-the-art 2D and 3D Face Analysis Project
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single-Image, Multi-Image and High-FPS Video Understanding on Your Phone
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Implementation of Nougat: Neural Optical Understanding for Academic Documents
BoxMOT: Pluggable SOTA multi-object tracking modules for segmentation, object detection and pose estimation models
Official repository of "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory"
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Data processing for and with foundation models!
openvla / openvla
Forked from TRI-ML/prismatic-vlms
OpenVLA: An open-source vision-language-action model for robotic manipulation.
A comprehensive list of papers using large language/multi-modal models for Robotics/RL, with links to code and related websites
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
SAPIEN Manipulation Skill Framework, an open source GPU parallelized robotics simulator and benchmark, led by Hillbot, Inc.
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
[ICLR & NeurIPS 2025] Repository for the Show-o series: One Single Transformer to Unify Multimodal Understanding and Generation.
RetinaFace: Deep Face Detection Library for Python
This is a curated list of "Embodied AI or robot with Large Language Models" research. Watch this repository for the latest updates!
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
This repository provides a valuable reference for researchers in the field of multimodality; start your exploration of RL-based Reasoning MLLMs here!
SEED-Voken: A Series of Powerful Visual Tokenizers
Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
RoboBrain 2.0: Advanced version of RoboBrain. See Better. Think Harder. Do Smarter.
Official repo and evaluation implementation of VSI-Bench
Low-level locomotion policy training in Isaac Lab
MichalZawalski / embodied-CoT
Forked from openvla/openvla
Embodied Chain of Thought: A robotic policy that reasons to solve tasks.