Stars
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, for both inference and training.
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Latest Advances in System-2 Reasoning
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
ScaleCUA is an open-source computer-use agent that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
Building a comprehensive and handy list of papers for GUI agents
AndroidWorld is an environment and benchmark for autonomous agents
This is a collection of resources for computer-use GUI agents, including videos, blogs, papers, and projects.
The model, data, and code for the visual GUI agent SeeClick
Paper list for Personal LLM Agents
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
[NeurIPS'25] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
GUI Grounding for Professional High-Resolution Computer Use
Neural Code Intelligence Survey 2024; Reading lists and resources
Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents
An RLHF Infrastructure for Vision-Language Models
[ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Official Code for "CoSER: Coordinating LLM-Based Persona Simulation of Established Roles"
GUICourse: From General Vision Language Models to Versatile GUI Agents
Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
[ACL 2025] A Neural-Symbolic Self-Training Framework
[ACL 2025] An inference-time decoding strategy with adaptive foresight sampling