Zhejiang U. -> Tsinghua U.
Shenzhen
Stars
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Train Your VAE: A VAE Training and Finetuning Script for SD/FLUX
Efficient vision foundation models for high-resolution generation and perception.
[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
[ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer"
Official PyTorch implementation of our CVPR 2023 paper: "Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization"
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
MAGI-1: Autoregressive Video Generation at Scale
Adaptive Length Image Tokenization via Recurrent Allocation | How many tokens is an image worth?
PyTorch implementation of "No Fuss Distance Metric Learning using Proxies"
[NeurIPS 2025] Efficient Reasoning Vision Language Models
Video Copy Segment Localization (VCSL) dataset and benchmark [CVPR 2022]
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
State-of-the-Art Text Embeddings
Griffin: Aerial-Ground Cooperative Detection and Tracking Benchmark
Single-file implementation to advance vision-language-action (VLA) models with reinforcement learning.
verl: Volcano Engine Reinforcement Learning for LLMs
Code for ICML 2025 Paper "Highly Compressed Tokenizer Can Generate Without Training"
Official inference repo for FLUX.1 models
Official PyTorch implementation of FlowMo.
This repo contains the code for the 1D tokenizer and generator
[SIGGRAPH 2025] Official code of the paper "FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios"
High-performance Image Tokenizers for VAR and AR
Implementation of TiTok, proposed by ByteDance in "An Image is Worth 32 Tokens for Reconstruction and Generation"
The official implementation of our paper "IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis"
[CVPR 2025] Multiple Object Tracking as ID Prediction