- HKU | Research Intern at ByteDance
- Hong Kong, China
- https://provencestar.github.io/
Stars
Pixio: an SSL encoder dedicated to dense CV tasks
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Code repository of "GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation"
[NeurIPS'25] Official repository of Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
LongLive: Real-time Interactive Long Video Generation
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[NeurIPS 2025] InternScenes: A Large-scale Interactive Indoor Scene Dataset with Realistic Layouts.
😎 A curated list of awesome GitHub profiles, updated in real time
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Fully Open Framework for Democratized Multimodal Training
GitHub Pages template based upon HTML and Markdown for personal, portfolio-based websites.
The missing star history graph of GitHub repos - https://star-history.com
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Official Code for "Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search"
Recommend new arXiv papers of your interest daily according to your Zotero library.
Industry-level video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Pixel-Level Reasoning Model trained with RL [NeurIPS'25]
Pointcept: Perceive the world with sparse points, a codebase for point cloud perception research. Latest works: Concerto (NeurIPS'25), Sonata (CVPR'25 Highlight), PTv3 (CVPR'24 Oral)
Solve Visual Understanding with Reinforced VLMs