Stars
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
Official Repo for "Why Settle for One? Text-to-ImageSet Generation and Evaluation"
ScaleCUA: open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
Code for "From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios"
[NeurIPS'25] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
[EMNLP 2025 Main] Code, results, and files for the paper "Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?"
[ACL 2025] A Generalizable and Purely Unsupervised Self-Training Framework
GUI Grounding for Professional High-Resolution Computer Use
[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Official Code for "Coser: Coordinating LLM-Based Persona Simulation of Established Roles"
[ACL 2025] An inference-time decoding strategy with adaptive foresight sampling
An Arena-style Automated Evaluation Benchmark for Detailed Captioning
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Latest Advances on System-2 Reasoning
[ACL 2025] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
The model, data and code for the visual GUI Agent SeeClick
[ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
Building a comprehensive and handy list of papers for GUI agents