Bin Xiao
AI Researcher in Multimodal Foundation Models, Computer Vision, and Agentic AI
Member of Technical Staff, Microsoft Superintelligence Team · Redmond, WA
Florence-2 · Phi-3 Vision · HRNet · CvT · Simple Baselines
Google Scholar Profile
About
I am an AI researcher working on multimodal foundation models, computer vision, and agentic AI. My work spans vision-language model training, post-training, visual recognition, document understanding, and human pose estimation. Representative projects include Florence-2, Phi-3 Vision, HRNet, CvT, and Simple Baselines for Human Pose Estimation.
Research Focus
- Multimodal foundation models: building general-purpose vision-language models for OCR, grounding, captioning, dense prediction, and document/image understanding.
- Agentic and reasoning systems: post-training models for tool use, coding, reasoning, and real-world task completion.
- Efficient and deployable AI: designing compact and practical models that can run in real-world product settings, including low-latency and resource-constrained scenarios.
- Vision representation learning: from high-resolution CNNs to convolutional vision transformers to unified multimodal representations.
Selected Work
- 06 · MAI-Thinking-1 and MAI-Code-1-Flash
  Microsoft AI reasoning and coding models for agentic workflows: MAI-Thinking-1 focuses on software engineering and mathematical reasoning, while MAI-Code-1-Flash is optimized for production GitHub Copilot coding workflows and efficient reasoning per token.
- 05 · Phi-3 Vision / Phi-3.5 Vision
  Compact multimodal language models for visual reasoning and OCR-centric understanding, exploring how strong multimodal capability can be delivered in resource-constrained settings.
- 04 · Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
  A unified prompt-based vision foundation model spanning captioning, OCR, grounding, and segmentation: one architecture handling many vision tasks through a single sequence-to-sequence formulation.
- 03 · CvT: Introducing Convolutions to Vision Transformers
  Introduced convolutional inductive biases into vision transformers for efficient visual recognition, bridging the strengths of CNNs and transformers in a single architecture.
- 02 · Deep High-Resolution Representation Learning for Human Pose Estimation
  Maintained high-resolution representations throughout the network; HRNet became a durable architecture family for pose estimation and dense visual recognition.
- 01 · Simple Baselines for Human Pose Estimation and Tracking
  Showed that a simple, carefully designed deconvolutional head could deliver strong human pose estimation performance, and became a durable reference baseline for the field.
Research Trajectory
- 2025 – present
  Reasoning, coding, and agentic model post-training: building models that can plan, use tools, and complete real-world tasks.
- 2024 – 2025
  Multimodal Llama post-training: extending large language models with visual understanding capabilities.
- 2024
  Led Phi-3 Vision and Phi-3.5 Vision, compact multimodal LLMs that brought strong visual reasoning to resource-constrained settings.
- 2020 – 2023
  Led the Florence-1 and Florence-2 vision foundation models, unifying captioning, OCR, grounding, and segmentation in a single sequence-to-sequence architecture.
- 2018 – 2021
  Built durable vision architectures and baselines: CvT (convolutional vision transformers), HRNet (high-resolution representations), and SimpleBaseline (pose estimation reference).
Recognition
- HRNet, CvT, and SimpleBaseline: among the most cited papers at CVPR 2019, ICCV 2021, and ECCV 2018, respectively, with 12,000+ combined citations (as of 2026).
- 1st place: Look into Person Challenge 2019, Single-Person Pose Estimation.
- 1st place: PoseTrack Multi-Person Pose Tracking Challenge 2018.
- 2nd place: COCO Keypoint Detection Challenge 2018.
- 2nd place: Objects365 Challenge 2019, Full track.