Bin Xiao

AI Researcher in Multimodal Foundation Models, Computer Vision, and Agentic AI

Member of Technical Staff, Microsoft Superintelligence Team · Redmond, WA

Florence-2 · Phi-3 Vision · HRNet · CvT · Simple Baselines

About

I am an AI researcher working on multimodal foundation models, computer vision, and agentic AI. My work spans vision-language model training, post-training, visual recognition, document understanding, and human pose estimation. Representative projects include Florence-2, Phi-3 Vision, HRNet, CvT, and Simple Baselines for Human Pose Estimation.

Research Focus

Multimodal foundation models: Building general-purpose vision-language models for OCR, grounding, captioning, dense prediction, and document/image understanding.
Agentic and reasoning systems: Post-training models for tool use, coding, reasoning, and real-world task completion.
Efficient and deployable AI: Designing compact and practical models that can run in real-world product settings, including low-latency and resource-constrained scenarios.
Vision representation learning: From high-resolution CNNs to convolutional vision transformers to unified multimodal representations.

Selected Work


06

2026 · Microsoft AI Blog / Tech Report · Reasoning Model · Coding Model · Agentic AI

MAI-Thinking-1 and MAI-Code-1-Flash

Microsoft AI reasoning and coding models for agentic workflows: MAI-Thinking-1 focuses on software engineering and mathematical reasoning, while MAI-Code-1-Flash is optimized for production GitHub Copilot coding workflows and efficient reasoning per token.

MAI-Thinking-1 · MAI-Code-1-Flash · Tech Report
05

2024 · Technical Report · Multimodal LLM

Phi-3 Vision / Phi-3.5 Vision

Compact multimodal language models for visual reasoning and OCR-centric understanding, exploring how strong multimodal capability can be delivered in resource-constrained settings.

Paper · Model
04

2024 · CVPR · Oral · Widely adopted on Hugging Face

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

A unified prompt-based vision foundation model spanning captioning, OCR, grounding, and segmentation — one architecture handling many vision tasks through a single sequence-to-sequence formulation.

Paper · Model
03

ICCV 2021 · 3,000+ citations · Most cited ICCV 2021

CvT: Introducing Convolutions to Vision Transformers

Introduced convolutional inductive biases into vision transformers for efficient visual recognition — bridging the strengths of CNNs and transformers in a single architecture.

Paper · Code
02

CVPR 2019 · 7,000+ citations · Most cited CVPR 2019

Deep High-Resolution Representation Learning for Human Pose Estimation

Maintains high-resolution representations throughout the network. HRNet has since become a durable architecture family for pose estimation and dense visual recognition.

Paper · Code
01

ECCV 2018 · 2,900+ citations · Most cited ECCV 2018

Simple Baselines for Human Pose Estimation and Tracking

Showed that a simple, carefully designed deconvolutional head could deliver strong human pose estimation performance. The method became a durable reference baseline for the field.

Paper · Code

Research Trajectory

2018 – present

2025 – present
Reasoning, coding, and agentic model post-training — building models that can plan, use tools, and complete real-world tasks.
2024 – 2025
Multimodal Llama post-training — extending large language models with visual understanding capabilities.
2024
Led Phi-3 Vision and Phi-3.5 Vision — compact multimodal LLMs that brought strong visual reasoning to resource-constrained settings.
2020 – 2023
Led the Florence-1 and Florence-2 vision foundation models — unifying captioning, OCR, grounding, and segmentation in a single sequence-to-sequence architecture.
2018 – 2021
Built durable vision architectures and baselines: CvT (convolutional vision transformers), HRNet (high-resolution representations), and SimpleBaseline (pose estimation reference).

Recognition

HRNet, CvT, and SimpleBaseline — among the most cited papers at CVPR 2019, ICCV 2021, and ECCV 2018, with 12,000+ combined citations (as of 2026).
1st place — Look into Person Challenge 2019, Single-Person Pose Estimation.
1st place — PoseTrack Multi-Person Pose Tracking Challenge 2018.
2nd place — COCO Keypoint Detection Challenge 2018.
2nd place — Objects365 Challenge 2019, Full track.

Contact

Scholar: Google Scholar
GitHub: github.com/leoxiaobin
ORCID: 0000-0001-6477-5911
LinkedIn
Hugging Face: huggingface.co/leoxiaobin
Location: Microsoft Superintelligence Team · Redmond, WA
