Stars
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Latest Advances on System-2 Reasoning
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
ScaleCUA is a set of open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
Building a comprehensive and handy list of papers for GUI agents
AndroidWorld is an environment and benchmark for autonomous agents
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
[NeurIPS'25] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
GUI Grounding for Professional High-Resolution Computer Use
Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents
An RLHF Infrastructure for Vision-Language Models
Official Code for "Coser: Coordinating LLM-Based Persona Simulation of Established Roles"
GUICourse: From General Vision Language Models to Versatile GUI Agents
Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
[ACL 2025] An inference-time decoding strategy with adaptive foresight sampling
A Self-Training Framework for Vision-Language Reasoning
[ACL 2025] A Generalizable and Purely Unsupervised Self-Training Framework
An Arena-style Automated Evaluation Benchmark for Detailed Captioning
HKUNLP / RSA
Forked from chang-github-00/RSA
Retrieved Sequence Augmentation for Protein Representation Learning
[ACL 2025] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"