Hang0ling/GUIShow

GUIShow (鬼手, "Ghost Hand")

The fastest, most efficient, top-tier GUI Agent that actually works in the real world.

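GUIShow (鬼手, "Ghost Hand") is a prototype execution-oriented Phone Agent for real phone GUIs. It breaks phone control into a task queue, layered execution, low-cost verification, and human takeover. When a task reaches a place that needs language, such as a chat box, comment box, or DM box, it asks an external brain Agent for one exact utterance, then types it, sends it, and verifies the result.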

It is not another AutoGLM wrapper.
It is not Siri.
It is not a Mobile Agent that throws a screenshot at a model and prays.

It is a hand.

Cold, fast, few words.

Premise

Phone GUI automation is not mysticism.

Most actions do not need to "understand the screen". They only need to be routed correctly:

tap / swipe / keyevent
am start / deep link
resource-id / bounds
OCR / CV probe
VLM only when necessary
human takeover when the machine starts pretending
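
As a rough illustration of how cheap the first two routes are, the sketch below builds the ADB command lines for a tap, a swipe, a key event, and a deep link. The helper names and the example URI are hypothetical and are not taken from adb_controller.py; the underlying `input` and `am start` shell commands are standard Android tooling.

```python
import shlex
from typing import List


def tap(x: int, y: int) -> List[str]:
    """Command line for a single tap at screen coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    """Command line for a swipe from (x1, y1) to (x2, y2)."""
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return ["adb", "shell", "input", "swipe", *coords]


def keyevent(key: str) -> List[str]:
    """Command line for a key event such as KEYCODE_BACK or KEYCODE_HOME."""
    return ["adb", "shell", "input", "keyevent", key]


def deep_link(uri: str) -> List[str]:
    """Command line that opens a URI with a VIEW intent.

    The URI is quoted because adb hands the joined arguments to the device shell.
    """
    return ["adb", "shell", "am", "start",
            "-a", "android.intent.action.VIEW", "-d", shlex.quote(uri)]


# Illustrative URI only; real routes depend on the target app.
print(" ".join(deep_link("https://x.com/search?q=OpenAI")))
# adb shell am start -a android.intent.action.VIEW -d 'https://x.com/search?q=OpenAI'
```

Each list can be passed to `subprocess.run` once a device is attached. None of these paths needs a screenshot.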

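If an Agent has to take a screenshot and ask a model at every step, that is not intelligence. It is doing its homework on the spot.

GUIShow's rule is simple:

If a system capability works, don't look at the screen.
If a UI node works, don't ask the VLM.
If the brain can be waited on, don't improvise.
If a short task can jump the queue, don't idle on a long one.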

The Difference

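| Dimension | Screenshot-based Phone Agent | GUIShow |
| --- | --- | --- |
| Control | Look, guess, tap | Layered routing, direct execution |
| Search tasks | Find the button, tap the input box, take it slowly | Deep Link / Intent first |
| Input boxes | Model improvises text on the spot | External brain Agent returns exact text |
| Waiting | Stalls, hangs, wastes time | Dynamic queue releases the foreground lock |
| Verification | Screenshot again, ask the model again | focus / UI tree / region probe |
| When stuck | Keeps guessing | Human takeover is the sixth layer |
| Cost | Every step is heavy | Model calls only where necessary |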

AutoGLM, Siri, and MobileAgent each have their place.
GUIShow's place is narrower, and less polite:

Get things done on the phone.

Architecture

[Figure: GUIShow Architecture]

User / business system
  -> per-user isolated task store
  -> dynamic task queue
  -> GUIShow execution Agent
  -> six-layer execution path
  -> Android phone

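The external brain Agent never touches the phone directly.
GUIShow never generates outward-facing text on its own.
The two hand off through input requests: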

phone reaches input context
  -> request next_utterance
  -> brain returns exact text
  -> GUIShow types, sends, verifies

Six Layers

| Layer | Name | Use Case | Typical Time | Cost |
| --- | --- | --- | --- | --- |
| L0 | Direct ADB | tap, swipe, back, home, keyevent, input | 20-800ms | local |
| L1 | Intent / Deep Link | search pages, profiles, product pages, chat entry | 0.5-3s | local |
| L2 | UI Cache | known resource-id, bounds, editable nodes | 0.2-1.5s | local |
| L3 | OCR / CV Probe | keyboard state, send button state, visual confirmation | 50-300ms probe | low |
| L4 | VLM / AutoGLM | unknown screen, modal, ambiguous visual semantics | 1.5-8s | model |
| L5 | Human Takeover | risk, dead end, account-sensitive action | human time | attention |

Default policy:

Use the lowest sufficient layer.
Escalate only when evidence fails.
Stop when confidence becomes theater.
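
The default policy can be read as a simple escalation loop. The sketch below is one possible shape, not the project's router.py or executor.py: `Layer`, `run_action`, and the stub handlers are hypothetical names. Each handler is assumed to return True only when the action was both performed and verified.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layer:
    name: str                        # e.g. "L0 Direct ADB"
    attempt: Callable[[str], bool]   # hypothetical handler: True means done and verified


def run_action(action: str, layers: List[Layer]) -> Optional[str]:
    """Try layers from cheapest to most expensive; return the first one that succeeds."""
    for layer in layers:
        if layer.attempt(action):
            return layer.name
    return None  # every layer failed: there is nothing left to escalate to


# Stub handlers standing in for real executors (illustrative only).
layers = [
    Layer("L0 Direct ADB", lambda a: a in {"tap", "swipe", "back", "home"}),
    Layer("L1 Intent / Deep Link", lambda a: a.startswith("open:")),
    Layer("L2 UI Cache", lambda a: a.startswith("node:")),
    Layer("L5 Human Takeover", lambda a: True),
]

print(run_action("tap", layers))          # L0 Direct ADB
print(run_action("open:search", layers))  # L1 Intent / Deep Link
print(run_action("pay", layers))          # L5 Human Takeover
```

Ordering the list by cost is what makes "lowest sufficient layer" fall out of a plain loop. The expensive layers are reached only after the cheap ones report failure.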

Dynamic Task Queue

[Figure: GUIShow Dynamic Queue]

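GUIShow does not worship waiting.

If a long task enters a waiting state, for example:

open WhatsApp -> enter the chat box -> wait for the external brain Agent to return the reply text

the foreground should not sit idle. The queue can keep scheduling short tasks that do not depend on the awaited result:

While waiting for input:
  -> open a user's profile
  -> like the first post
  -> follow a friend
  -> return and resume the original long task

This is not "multithreaded random tapping". It is bounded dynamic scheduling:

  • Each phone runs only one foreground action sequence at any moment.
  • A long task that enters waiting releases the foreground lock.
  • Short tasks that do not depend on the awaited result may jump the queue.
  • When the external input returns, the original task resumes.
  • Account-sensitive actions remain subject to risk policy.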
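
A minimal sketch of the scheduling rules above, assuming an in-memory priority queue and a single foreground slot. `Task` and `ForegroundQueue` are hypothetical names and do not reflect task_manager.py. Dependency checks and risk policy are left out to keep the lock hand-off visible.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(order=True)
class Task:
    sort_key: int                                        # negative priority: heapq pops the highest first
    name: str = field(compare=False)
    status: str = field(default="ready", compare=False)  # ready | waiting | done


class ForegroundQueue:
    def __init__(self) -> None:
        self._ready: List[Task] = []
        self._waiting: Dict[str, Task] = {}
        self.foreground: Optional[Task] = None           # at most one task owns the phone

    def submit(self, name: str, priority: int) -> Task:
        task = Task(-priority, name)
        heapq.heappush(self._ready, task)
        return task

    def acquire(self) -> Optional[Task]:
        """Hand the foreground to the highest-priority ready task if the phone is free."""
        if self.foreground is None and self._ready:
            self.foreground = heapq.heappop(self._ready)
        return self.foreground

    def wait_for_brain(self) -> None:
        """The foreground task blocks on external input and releases the lock."""
        task = self.foreground
        task.status = "waiting"
        self._waiting[task.name] = task
        self.foreground = None

    def finish(self) -> None:
        self.foreground.status = "done"
        self.foreground = None

    def brain_responded(self, name: str) -> None:
        """External input arrived: the long task becomes schedulable again."""
        task = self._waiting.pop(name)
        task.status = "ready"
        heapq.heappush(self._ready, task)


q = ForegroundQueue()
q.submit("reply on WhatsApp", priority=8)
q.submit("like first post", priority=3)

order = [q.acquire().name]       # the long task takes the phone first
q.wait_for_brain()               # it waits for text and releases the lock
order.append(q.acquire().name)   # the short task runs in the gap
q.finish()
q.brain_responded("reply on WhatsApp")
order.append(q.acquire().name)   # the long task resumes
q.finish()
print(order)  # ['reply on WhatsApp', 'like first post', 'reply on WhatsApp']
```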

Latency is inventory.
GUIShow spends it.

Speech Belongs To The Brain

GUIShow does not write the words.

On reaching a chat box, comment box, DM box, or reply box:

detect input context
  -> /api/brain/input/request
  -> purpose = next_utterance
  -> wait for external brain response
  -> type exact response

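Search boxes are the exception. A search box uses the task keyword.
Login, OTP, payment, and private fields are never filled automatically.
Actions that express the account's will, such as public posting, liking, following, and purchasing, must go through business policy.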
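
The hand-off can be pictured as a tiny request/respond store in which an unanswered request yields nothing to type. The class below is an in-memory stand-in written for illustration. `BrainHandoff` and its methods are hypothetical and are not the interface of brain.py. The `brain_input_` prefix only mirrors the placeholder ID shown in the API section.

```python
import uuid
from typing import Dict, Optional


class BrainHandoff:
    """In-memory stand-in for the request/respond exchange (illustrative only)."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def request(self, purpose: str = "next_utterance") -> str:
        """Register a request for text and return its ID."""
        request_id = f"brain_input_{uuid.uuid4().hex[:8]}"
        self._pending[request_id] = {"purpose": purpose, "text": None}
        return request_id

    def respond(self, request_id: str, response_text: str) -> None:
        """Store the brain's exact text for a known request."""
        if request_id not in self._pending:
            raise KeyError(f"unknown request: {request_id}")
        self._pending[request_id]["text"] = response_text

    def text_to_type(self, request_id: str) -> Optional[str]:
        """Return the brain's exact text, or None while the request is unanswered."""
        return self._pending[request_id]["text"]


handoff = BrainHandoff()
rid = handoff.request()
print(handoff.text_to_type(rid))  # None: keep waiting, type nothing
handoff.respond(rid, "Worth a closer look.")
print(handoff.text_to_type(rid))  # Worth a closer look.
```

Returning None instead of a fallback string keeps the executor from inventing text. The only way to get something to type is for the brain to answer.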

Input Detection Strategy

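Input boxes should not be identified by guesswork. The phone raises its own keyboard, and the system leaves traces.

GUIShow's input detection path: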

UIAutomator focused EditText
OR input_method reports IME shown
OR Accessibility focused editable node
OR WebRTC detects keyboard / input region
OR previous action clicked reply/comment/message/search and keyboard appears
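
One of these signals, the IME state, can be read from `adb shell dumpsys input_method`. The sketch below assumes the dump contains an `mInputShown=true|false` field, which is common but may differ across Android versions and vendors. The function names are hypothetical.

```python
import re
import subprocess


def ime_shown(dumpsys_output: str) -> bool:
    """Return True if the dump reports mInputShown=true; False if absent or false."""
    match = re.search(r"mInputShown=(true|false)", dumpsys_output)
    return match is not None and match.group(1) == "true"


def read_ime_state(adb: str = "adb") -> bool:
    """Query a connected device. Requires adb on PATH and one attached device."""
    result = subprocess.run(
        [adb, "shell", "dumpsys", "input_method"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return ime_shown(result.stdout)


# Parsing works on captured text, so it can be checked without a device.
print(ime_shown("mShowRequested=true mInputShown=true"))   # True
print(ime_shown("mShowRequested=false mInputShown=false")) # False
```

Treating a missing field as "not shown" leaves the decision to the other signals in the OR chain.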

After classification, act:

| Input Type | Behavior |
| --- | --- |
| Search box | use task keyword |
| Chat / comment / DM / reply | request external brain text |
| Login / OTP / payment / private field | stop or hand over |
| Generic form | use task params or ask brain |
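
The table maps naturally onto a lookup. The sketch below classifies a field from its hint text with hypothetical keyword lists. A real detector would more likely combine resource-ids, the focused node, and the previous action. The enum, the behavior strings, and the keywords are illustrative assumptions.

```python
from enum import Enum


class InputType(Enum):
    SEARCH = "search"
    SPEECH = "speech"        # chat / comment / DM / reply
    SENSITIVE = "sensitive"  # login / OTP / payment / private field
    GENERIC = "generic"


BEHAVIOR = {
    InputType.SEARCH: "use_task_keyword",
    InputType.SPEECH: "request_brain_text",
    InputType.SENSITIVE: "stop_or_hand_over",
    InputType.GENERIC: "use_task_params_or_ask_brain",
}

# Hypothetical hint keywords, chosen only for this example.
SENSITIVE_HINTS = ("password", "otp", "verification", "card", "cvv")
SEARCH_HINTS = ("search",)
SPEECH_HINTS = ("message", "comment", "reply", "chat")


def classify(hint: str) -> InputType:
    """Classify an input field from its hint text."""
    text = hint.lower()
    # Sensitive fields are checked first so they never fall through to an automatic path.
    if any(word in text for word in SENSITIVE_HINTS):
        return InputType.SENSITIVE
    if any(word in text for word in SEARCH_HINTS):
        return InputType.SEARCH
    if any(word in text for word in SPEECH_HINTS):
        return InputType.SPEECH
    return InputType.GENERIC


for hint in ("Search X", "Add a comment...", "Enter OTP code", "Nickname"):
    print(hint, "->", BEHAVIOR[classify(hint)])
```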

Implemented

  • Local HTTP console at :8787.
  • ADB execution layer: tap, swipe, keyevent, type, activity, package launch.
  • WebRTC Companion App for live phone screen streaming.
  • Region probe over WebRTC frames: color, brightness, perceptual hash.
  • Fast state recognizer from focus and cheap visual evidence.
  • User-scoped task records: user_id, priority, status.
  • Brain input handoff:
    • POST /api/brain/input/request
    • POST /api/brain/input/respond
  • Human control:
    • Pause / Take Over
    • Resume AI
    • Interrupt AI
    • pointer event on live surface interrupts future AI actions
  • Brain-sourced text rule:
    • type_text(source=brain) requests external text.
    • no response means waiting, not hallucinated typing.
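
To make the region probe concrete, here is a small sketch of a brightness check and an average-hash comparison over a grayscale region. It is not the code in vision.py. It assumes the region has already been cropped from a frame and downsampled to a small grid. Average hash is used here as one simple kind of perceptual hash, and the source does not say which variant the project uses.

```python
from typing import Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 luminance values


def brightness(region: Gray) -> float:
    """Mean luminance of the region."""
    pixels = [p for row in region for p in row]
    return sum(pixels) / len(pixels)


def average_hash(region: Gray) -> int:
    """One bit per pixel, row by row: 1 if the pixel is brighter than the region mean."""
    mean = brightness(region)
    bits = 0
    for row in region:
        for p in row:
            bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Toy 2x2 regions: a reference, a slightly noisy copy, and a visibly different state.
reference = [[200, 200], [40, 40]]
noisy = [[198, 205], [42, 38]]
changed = [[40, 40], [200, 200]]

print(hamming(average_hash(reference), average_hash(noisy)))    # 0
print(hamming(average_hash(reference), average_hash(changed)))  # 4
```

A small Hamming distance means the region still looks like the reference state. A large one is cheap evidence that something changed, with no model call involved.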

Repository Map

.
├── mobile_agent/
│   ├── server.py              # HTTP API and browser console
│   ├── brain.py               # users, tasks, brain input requests
│   ├── task_manager.py        # scheduling, foreground lock, interrupts
│   ├── router.py              # natural language -> intent -> actions
│   ├── executor.py            # action execution
│   ├── adb_controller.py      # ADB wrapper
│   ├── vision.py              # WebRTC frames, probes, VLM hook
│   ├── state_recognizer.py    # fast state recognition
│   └── docs/
│       └── chat_app_agent_architecture.md
├── companion_app/             # Android screen capture + WebRTC companion
├── docs/assets/               # README diagrams
├── Reference/                 # upstream and external reference projects
│   ├── Open-AutoGLM/
│   └── x-plug/
├── tools/platform-tools/      # local ADB tools
└── README.md

Reference/ is a research shelf. GUIShow's active runtime path is mobile_agent/ and companion_app/.

Quick Start

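# Create and activate a virtual environment, then install dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Put the bundled ADB tools on PATH and confirm the phone is visible.
export PATH="$PWD/tools/platform-tools:$PATH"
adb devices

# Start the local server and browser console on port 8787.
PATH="$PWD/tools/platform-tools:$PATH" .venv/bin/python -m mobile_agent.server \
  --host 0.0.0.0 \
  --port 8787 \
  --fps 0.2 \
  --max-width 720 \
  --jpeg-quality 55 \
  --focus-interval 10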

Open:

http://127.0.0.1:8787

API

Submit a task:

curl -X POST http://127.0.0.1:8787/api/tasks \
  -H 'content-type: application/json' \
  -d '{
    "user_id": "demo_user",
    "priority": 8,
    "goal": "打开 X 搜索 OpenAI,准备评论"
  }'

Request speech from the external brain:

curl -X POST http://127.0.0.1:8787/api/brain/input/request \
  -H 'content-type: application/json' \
  -d '{
    "user_id": "demo_user",
    "purpose": "next_utterance",
    "context": {
      "app": "X",
      "surface": "reply_box"
    }
  }'

Answer with exact text:

curl -X POST http://127.0.0.1:8787/api/brain/input/respond \
  -H 'content-type: application/json' \
  -d '{
    "request_id": "brain_input_xxx",
    "response_text": "这条信息值得继续看。"
  }'

Interrupt and take over:

curl -X POST http://127.0.0.1:8787/api/control/interrupt \
  -H 'content-type: application/json' \
  -d '{
    "source": "human",
    "reason": "manual takeover",
    "takeover": true
  }'
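
The same calls can be made from Python with only the standard library. This is a hedged sketch. `build_request` and `post_json` are hypothetical helpers. The endpoint paths and payload fields are copied from the curl examples above. The response schema is not documented here, so `post_json` only assumes the server replies with a JSON object.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8787"  # the local console address from Quick Start


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request. Assumes the server is running and answers with JSON."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request(
    "/api/brain/input/request",
    {
        "user_id": "demo_user",
        "purpose": "next_utterance",
        "context": {"app": "X", "surface": "reply_box"},
    },
)
print(request.get_method(), request.full_url)
# POST http://127.0.0.1:8787/api/brain/input/request
```

With the server from Quick Start running, `post_json("/api/tasks", {...})` would submit a task in the same way.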

Roadmap

  • Persistent task queue: SQLite / Postgres.
  • Production external brain Agent integration.
  • Automatic input-context detector: UIAutomator + IME + Accessibility + CV.
  • App route library: X, Instagram, TikTok, Xiaohongshu, Xianyu, Amazon, WhatsApp, Snapchat.
  • Dynamic queue runner with dependency-aware scheduling.
  • OCR/CV templates for send buttons, comment boxes, search boxes, keyboard states.
  • VLM fallback loop.
  • Multi-device scheduler.
  • Risk policies for delete, payment, purchase, follow, like, public post.

Boundary

GUIShow is a research prototype for GUI automation.

It is not a spam engine.
It is not a bot farm.
It is not an excuse to outsource judgment to a process with rootless confidence.

Default rules:

  • Delete requires confirmation.
  • Payment, purchase, login, OTP, and private fields are not filled automatically.
  • Public comments, DMs, likes, follows, and posts must be governed by business policy.
  • Human takeover overrides AI.

Last Line

Others wait for the model to understand the screen.

GUIShow asks a colder question:

Why is the model involved at all?
