Stars
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models across text, vision, audio, and multimodal domains, for both inference and training.
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[NeurIPS'25] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
GUI Grounding for Professional High-Resolution Computer Use
AndroidWorld is an environment and benchmark for autonomous agents
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Building a comprehensive and handy list of papers for GUI agents
[ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
ScaleCUA provides open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
Official Repo for "Why Settle for One? Text-to-ImageSet Generation and Evaluation"
Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Neural Code Intelligence Survey 2024; Reading lists and resources
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
The model, data and code for the visual GUI Agent SeeClick
Code for "From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios"
Official Code for "Coser: Coordinating LLM-Based Persona Simulation of Established Roles"
[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents
Latest Advances on System-2 Reasoning
This is a collection of resources for computer-use GUI agents, including videos, blogs, papers, and projects.