Stars
Model Context Protocol (MCP) server for AI-assisted development ("vibe coding") of MDK applications.
AI agents can now use real Android and iOS apps, just like a human.
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
Every front-end GUI client for ChatGPT, Claude, and other LLMs
[NeurIPS'25] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
GUI Grounding for Professional High-Resolution Computer Use
Agent S: an open agentic framework that uses computers like a human
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" (ACL 2025 Oral).
Python script to upload videos on YouTube using Selenium
[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS …
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
ui-screenshot-to-prompt is an AI-powered tool that analyzes UI images to generate detailed prompts for AI coders. It uses computer vision and natural language processing to break down UI components…
A curated list of awesome UI agents resources, encompassing Web, App, OS, and beyond (continually updated)
A collection of AI Agents papers (Updated biweekly)
A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
JavaScript API for Chrome and Firefox
The model, data and code for the visual GUI Agent SeeClick
verl: Volcano Engine Reinforcement Learning for LLMs
[NeurIPS 2025] 🌐 WebThinker: Empowering Large Reasoning Models with Deep Research Capability
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
A simple screen-parsing tool towards a pure vision-based GUI agent