Zhejiang U. -> Tsinghua U.
- Shenzhen
Stars
📚 "Building Agents from Scratch" (《从零开始构建智能体》): a from-scratch tutorial on the principles and practice of agents
FIVR-200K dataset from the paper "FIVR: Fine-grained Incident Video Retrieval" [TMM 2019]
Official implementation of BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning. BandPO replaces canonical clipping (PPO/GRPO) with dynamic …
My Python scripts to make high-quality figures for publications in top AI conferences and journals.
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …
[CVPR 2025] Mr. DETR: Instructive Multi-Route Training for Detection Transformers
NetworKit is a growing open-source toolkit for large-scale network analysis.
[ICCV 2023] Official repository for the paper "Global Features are All You Need for Image Retrieval and Reranking"
The iconic SVG, font, and CSS toolkit
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Train Your VAE: A VAE Training and Finetuning Script for SD/FLUX
Efficient vision foundation models for high-resolution generation and perception.
[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
[ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer"
Official PyTorch implementation of our CVPR 2023 paper: "Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization"
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
MAGI-1: Autoregressive Video Generation at Scale
Adaptive Length Image Tokenization via Recurrent Allocation | How many tokens is an image worth?
PyTorch implementation of "No Fuss Distance Metric Learning using Proxies"
[NeurIPS 2025] Efficient Reasoning Vision Language Models
Video Copy Segment Localization (VCSL) dataset and benchmark [CVPR2022]
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
State-of-the-Art Text Embeddings
Griffin: Aerial-Ground Cooperative Detection and Tracking Benchmark