About

I am a Master's student at Shanghai Jiao Tong University (SJTU), supervised by Prof. Xue Yang and Prof. Junchi Yan. I also visited the S-Lab at Nanyang Technological University, where I was supervised by Prof. Ziwei Liu. Before that, I received my Bachelor's degree from SJTU in 2023. Currently, I am interning at Microsoft Research Asia, mentored by Yifan Yang, Chong Luo, and Lijuan Wang.

My research focuses on Unified Understanding and Generation, Image/Video Tokenization and Generation, Reinforcement Learning, and Multi-modal Perception.

I am looking for PhD opportunities in the coming year and am open to academic collaborations. Please feel free to contact me if you're interested, or just to make a friend (Email | WeChat)!

News

Jun 2026 | I graduated from SJTU! Named Outstanding Graduate of Shanghai.
Jun 2026 | We released SkillOpt 🔥
Apr 2026 | We released SenseNova-U1 🔥
Feb 2026 | Our AdapTok paper got accepted at CVPR 2026! 🎉
Feb 2026 | I received the Stars of Tomorrow Award from Microsoft ⭐
Jan 2026 | Our CastDet journal extension got accepted at IJCV 2026.
Jan 2026 | I am visiting NTU S-Lab supervised by Prof. Ziwei Liu, working on native multimodal understanding and generation. 🇸🇬
Aug 2025 | I joined MSRA as a research intern.
Feb 2025 | I started interning at NVIDIA, working on 3D neural reconstruction.
Oct 2024 | I received the National Scholarship.
Jul 2024 | Our CastDet paper got accepted at ECCV 2024 🎉
Mar 2024 | I joined BOSCH as a CV intern.

Education

Visiting Student, Nanyang Technological University (S-Lab)

Jan 2026 – May 2026

Supervisor: Ziwei Liu | Topic: Unified vision-language understanding and generation

M.S. in Information Engineering, Shanghai Jiao Tong University

Sept 2023 – June 2026

GPA: 3.71/4.0 | Supervisors: Xue Yang, Junchi Yan

B.S. in Information Engineering (Elite Program), Shanghai Jiao Tong University

Sept 2019 – June 2023

GPA: 3.74/4.3 (Obtained postgraduate recommendation)

Selected Works

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Preprint

Yan Li†, Zezi Zeng†, Ziwei Zhou†, Xin Gao†, Muzhao Tian†, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li

The first comprehensive benchmark for commercial visual content generation, covering five domains, four capability dimensions, 20 evaluation tasks, and 8,000 human-verified checklist questions.

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Preprint

Yifan Yang†, Ziyang Gong†, Weiquan Huang†, Qihao Yang†, Ziwei Zhou†, Zisu Huang†, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

The first systematic, controllable text-space optimizer for agent skills: it turns scored rollouts into bounded edits on a skill document, which are accepted only when they improve held-out validation performance. Best or tied on all 52 evaluated cells across six benchmarks, seven models, and three harnesses.

EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Preprint

Yan Li†, Ning Liao†, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang

A unified image tokenizer that represents images as a residual latent evolution trajectory within a shared latent space, effective for both semantic alignment and pixel reconstruction.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Preprint

Yan Li†, Zezi Zeng†, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Xue Yang, Lijuan Wang, Chong Luo

A multimodal web agent that combines hierarchical agentic planning over native multimodal asset generation with hierarchical self-reflection for iterative webpage refinement.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Technical Report

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, ..., Yan Li, et al. (Core Contributor)

A native unified multimodal paradigm coupling autoregressive language modeling with pixel-space flow matching.

AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Yan Li†, Changyao Tian†, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang

An adaptive video tokenization framework featuring temporal causality, a 1D latent token space, and flexible adaptive token allocation. Achieves a Pareto-optimal trade-off between token quantity and reconstruction quality.

Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation

International Journal of Computer Vision (IJCV) 2026

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Shaofeng Zhang, Yi Yu, Wenxian Yu, Junchi Yan

Extends CastDet to open-vocabulary oriented detection in aerial scenes. Achieves 36.0% mAP on DIOR-R and 24.3% mAP on DOTA-R, significantly surpassing prior state-of-the-art methods.

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

European Conference on Computer Vision (ECCV) 2024

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

Pioneers open-vocabulary aerial object detection (OVAD) with CastDet, achieving 46.5% mAP on VisDroneZSD and outperforming the previous state of the art by 21.0%.

Experience

Research Intern, Microsoft Research Asia (Stars of Tomorrow Award)

Aug 2025 – Present

RL for autoregressive complex layout image generation; multimodal web generation agent (MM-WebAgent); commercial visual benchmark (BizGenEval); skill/harness evolution (SkillOpt); agentic slides generation

Visiting Student, Nanyang Technological University, S-Lab

Jan 2026 – May 2026

Core contributor to SenseNova-U1, a native unified multimodal understanding and generation model

Deep Learning Engineering Intern, NVIDIA

Feb 2025 – June 2025

Neural Reconstruction Engine (NRE) for DRIVE Sim; novel view synthesis

Academic Cooperation, Shanghai AI Laboratory (OpenGVLab)

Feb 2024 – May 2025

Open-vocabulary object detection (ECCV'24, IJCV'26); adaptive video tokenizer (CVPR'26)

Computer Vision Algorithm Intern, BOSCH

Mar 2024 – July 2024

Open-vocabulary object detection; scene graph generation; semi-automatic annotation frameworks

Awards

Stars of Tomorrow Award, Microsoft | 2026
Outstanding Graduate of Shanghai | 2026
National Scholarship | 2024
Shanghai Municipal Scholarship | 2022
SJTU Excellence Scholarship | 2021–2024
Finalist, American Mathematical Contest in Modeling (MCM) | 2022