Stars
Data processing for and with foundation models!
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
SAPIEN Manipulation Skill Framework, an open source GPU parallelized robotics simulator and benchmark, led by Hillbot, Inc.
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
BoxMOT: Pluggable SOTA multi-object tracking modules for segmentation, object detection, and pose estimation models
A curated list of research on embodied AI and robots with Large Language Models. Watch this repository for the latest updates!
[TMLR 2024] repository for VLN with foundation models
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
SEED-Voken: A Series of Powerful Visual Tokenizers
PyTorch implementation of paper "ARTrack" and "ARTrackV2"
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
This repository provides a valuable reference for researchers in multimodality; start your exploration of RL-based reasoning MLLMs here!
A comprehensive list of papers using large language/multi-modal models for Robotics/RL, including papers, code, and related websites
[CoRL 2025] Repository relating to "TrackVLA: Embodied Visual Tracking in the Wild"
[RSS 2024 & RSS 2025] VLN-CE evaluation code of NaVid and Uni-NaVid
RoboBrain 2.0: Advanced version of RoboBrain. See Better. Think Harder. Do Smarter.
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
State-of-the-art 2D and 3D Face Analysis Project
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
RoboOS: A Universal Embodied Operating System for Cross-Embodied and Multi-Robot Collaboration
Vision-Language Navigation Benchmark in Isaac Lab
RetinaFace: Deep Face Detection Library for Python
Official repo and evaluation implementation of VSI-Bench
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Video Generation, Physical Commonsense, Semantic Adherence, VideoCon-Physics
Open-Sora: Democratizing Efficient Video Production for All
MichalZawalski / embodied-CoT
Forked from openvla/openvla. Embodied Chain of Thought: a robotic policy that reasons to solve the task.