Bin Xiao

AI Researcher in Multimodal Foundation Models, Computer Vision, and Agentic AI

Member of Technical Staff, Microsoft Superintelligence Team · Redmond, WA

Florence-2 · Phi-3 Vision · HRNet · CvT · Simple Baselines

Google Scholar Profile

About

I am an AI researcher working on multimodal foundation models, computer vision, and agentic AI. My work spans vision-language model training, post-training, visual recognition, document understanding, and human pose estimation. Representative projects include Florence-2, Phi-3 Vision, HRNet, CvT, and Simple Baselines for Human Pose Estimation.

Research Focus

  • Multimodal foundation models: Building general-purpose vision-language models for OCR, grounding, captioning, dense prediction, and document/image understanding.
  • Agentic and reasoning systems: Post-training models for tool use, coding, reasoning, and real-world task completion.
  • Efficient and deployable AI: Designing compact and practical models that can run in real-world product settings, including low-latency and resource-constrained scenarios.
  • Vision representation learning: From high-resolution CNNs to convolutional vision transformers to unified multimodal representations.

Selected Work

  1. 06

    2026 · Microsoft AI Blog / Tech Report · Reasoning Model · Coding Model · Agentic AI

    MAI-Thinking-1 and MAI-Code-1-Flash

    Microsoft AI reasoning and coding models for agentic workflows: MAI-Thinking-1 focuses on software engineering and mathematical reasoning, while MAI-Code-1-Flash is optimized for production GitHub Copilot coding workflows and efficient reasoning per token.

  2. 05

    2024 · Technical Report · Multimodal LLM

    Phi-3 Vision / Phi-3.5 Vision

    Compact multimodal language models for visual reasoning and OCR-centric understanding, exploring how strong multimodal capability can be delivered in resource-constrained settings.

  3. 04

    2024 · CVPR · Oral · Widely adopted on Hugging Face

    Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

    A unified prompt-based vision foundation model spanning captioning, OCR, grounding, and segmentation — one architecture handling many vision tasks through a single sequence-to-sequence formulation.

  4. 03

    ICCV 2021 · 3,000+ citations · Most cited ICCV 2021

    CvT: Introducing Convolutions to Vision Transformers

    Introduced convolutional inductive biases into vision transformers for efficient visual recognition — bridging the strengths of CNNs and transformers in a single architecture.

  5. 02

    CVPR 2019 · 7,000+ citations · Most cited CVPR 2019

    Deep High-Resolution Representation Learning for Human Pose Estimation

    Maintained high-resolution representations throughout the network, becoming a durable architecture family for pose estimation and dense visual recognition.

  6. 01

    ECCV 2018 · 2,900+ citations · Most cited ECCV 2018

    Simple Baselines for Human Pose Estimation and Tracking

    Showed that a simple, carefully designed deconvolutional head could deliver strong human pose estimation performance — becoming a durable reference baseline for the field.

Research Trajectory

2018 – present
  1. 2025 – present

    Reasoning, coding, and agentic model post-training — building models that can plan, use tools, and complete real-world tasks.

  2. 2024 – 2025

    Multimodal Llama post-training — extending large language models with visual understanding capabilities.

  3. 2024

    Led Phi-3 Vision and Phi-3.5 Vision — compact multimodal LLMs that brought strong visual reasoning to resource-constrained settings.

  4. 2020 – 2023

    Led the Florence-1 and Florence-2 vision foundation models — unifying captioning, OCR, grounding, and segmentation in a single sequence-to-sequence architecture.

  5. 2018 – 2021

    Built durable vision architectures and baselines: CvT (convolutional vision transformers), HRNet (high-resolution representations), and SimpleBaseline (pose estimation reference).

Recognition

  • HRNet, CvT, and SimpleBaseline — among the most cited papers at CVPR 2019, ICCV 2021, and ECCV 2018, with 12,000+ combined citations (as of 2026).
  • 1st place — Look into Person Challenge 2019, Single-Person Pose Estimation.
  • 1st place — PoseTrack Multi-Person Pose Tracking Challenge 2018.
  • 2nd place — COCO Keypoint Detection Challenge 2018.
  • 2nd place — Objects365 Challenge 2019, Full track.

Contact

LinkedIn
Location: Microsoft Superintelligence Team · Redmond, WA