Stars
GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Fara-7B: An Efficient Agentic Model for Computer Use
GELab: GUI Exploration Lab. One of the best GUI agent solutions in the galaxy, built by the StepFun-GELab team and powered by Step’s research capabilities.
A powerful Python library for creating and managing isolated desktop environments using Docker containers.
openvla / openvla
Forked from TRI-ML/prismatic-vlms. OpenVLA: An open-source vision-language-action model for robotic manipulation.
HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
Official PyTorch implementation for "Large Language Diffusion Models"
ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Witness the aha moment of a VLM for less than $3.
Native Multimodal Models are World Learners
A frontier collection and survey of vision-language model papers and models, hosted as a GitHub repository. Continuously updated.
verl: Volcano Engine Reinforcement Learning for LLMs
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
A high-throughput and memory-efficient inference and serving engine for LLMs
Tips and resources to prepare for Behavioral interviews.
12 Lessons to Get Started Building AI Agents
Implementation of Phenaki Video, which uses MaskGIT to produce text-guided videos of up to 2 minutes in length, in PyTorch.
LBM: Latent Bridge Matching for Fast Image-to-Image Translation ✨ (ICCV 2025 Highlight)
Reference PyTorch implementation and models for DINOv3
ModelTC / Wan2.2-Lightning
Forked from Wan-Video/Wan2.2. Wan2.2-Lightning: Speed up the Wan2.2 model with distillation.