- University of Oxford & Meta AI
- London
- https://junlinhan.github.io/
- @han_junlin
Stars
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
🦞 Just talk to your agent — it learns and EVOLVES 🧬.
Learning to See by Looking at Noise
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
Procedural Image Programs for Representation Learning - NeurIPS 2022
[CVPR 2026] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
A data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
A generative world for general-purpose robotics & embodied AI learning.
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
[CVPR 2026] 👋 Dataset and Benchmark code for EgoEdit
This repository contains the official code and data for CogIP-Bench (Cognition Image Property Benchmark) and the associated alignment methods described in the paper "From Pixels to Feelings: Aligni…
Implementation of Reinforcement Pre-Training (RPT) for Language Models - ArXiv:2506.08007
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Fully Open Framework for Democratized Multimodal Training
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction (ICCV 2025)
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
[ICLR 2026] Official PyTorch Implementation of RLP: Reinforcement as a Pretraining Objective
[CVPR 2025] Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
(Accepted by IJCV) Liquid: Language Models are Scalable and Unified Multi-modal Generators
🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training
Code for Words That Make Language Models Perceive
[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".