Stars
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
PyTorch code and models for VJEPA2 self-supervised learning from video.
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
Ongoing research on training transformer models at scale
Code and model for the arXiv paper "Autoregressive Image Generation with Masked Bit Modeling"
BitDance & UniWeTok: an open-source autoregressive model with binary visual tokens. A research project for building powerful multimodal autoregressive models.
UEval: A Benchmark for Unified Multimodal Generation
ViPE: Video Pose Engine for Geometric 3D Perception
An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models
HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency
NEO Series: Native Vision-Language Models from First Principles
This repo provides the official code for: 1) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (https://arxiv.org/abs/2103.04430), accepted at MICCAI 2021; 2) TransBTSV2: Towards Bet…
Native Multimodal Models are World Learners
[ICLR 2026] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
MAGI-1: Autoregressive Video Generation at Scale
[ICLR'24 & IJCV'25] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
[ICLR 2025] Diffusion Feedback Helps CLIP See Better
An Open-source RL System from ByteDance Seed and Tsinghua AIR
openvla / openvla
Forked from TRI-ML/prismatic-vlms. OpenVLA: An open-source vision-language-action model for robotic manipulation.
[ICML 2025] Official PyTorch Implementation of "History-Guided Video Diffusion"