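Build the fastest, most efficient, top-tier GUI Agent that actually works.
鬼手 GUIShow ("Ghost Hand") is an execution-oriented Phone Agent prototype for real phone GUIs. It breaks phone control into a task queue, layered execution, low-cost verification, and human takeover. When a task reaches a chat box, comment box, DM box, or any other place that needs language, it asks an external brain Agent for one exact utterance, then types it, sends it, and verifies the result.
It is not another AutoGLM wrapper.
It is not Siri.
It is not a Mobile Agent that throws screenshots at a model and prays.
It is a hand.
Cold, fast, no small talk.
Phone GUI automation is not mysticism.
Most actions do not need to "understand the screen". They only need to be routed correctly: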
tap / swipe / keyevent
am start / deep link
resource-id / bounds
OCR / CV probe
VLM only when necessary
human takeover when the machine starts pretending
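
The two cheapest routes in the list above map onto standard `adb shell input` and `adb shell am start` invocations. The following is a minimal sketch of how such commands could be assembled; the helper names are hypothetical and are not taken from `adb_controller.py`.

```python
import shlex
from typing import List


def adb_tap(x: int, y: int) -> List[str]:
    """Build the argv for a single tap at screen coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def adb_swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    """Build the argv for a swipe; duration_ms is an illustrative default."""
    coords = [str(v) for v in (x1, y1, x2, y2)]
    return ["adb", "shell", "input", "swipe", *coords, str(duration_ms)]


def adb_keyevent(keycode: str) -> List[str]:
    """Build the argv for a key event such as KEYCODE_BACK or KEYCODE_HOME."""
    return ["adb", "shell", "input", "keyevent", keycode]


def adb_deep_link(uri: str) -> List[str]:
    """Build the argv that opens a URI through an ACTION_VIEW intent.

    The URI is quoted because the device-side shell re-parses the command
    line, so characters such as '&' would otherwise split the command.
    """
    return ["adb", "shell", "am", "start",
            "-a", "android.intent.action.VIEW", "-d", shlex.quote(uri)]
```

Each helper only returns an argument list, so it can be handed to `subprocess.run(...)` when a device is attached, or inspected in a dry run when it is not.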
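If an Agent has to take a screenshot and ask a model at every step, that is not intelligence. It is doing its homework on the spot.
GUIShow's judgment is simple:
If a system capability can do it, don't look at the screen.
If a UI node can do it, don't ask the VLM.
If the brain can be waited for, don't improvise.
If a short task can cut in, don't idle behind a long one.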
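| Dimension | Screenshot-based Phone Agent | GUIShow |
|---|---|---|
| Control | look, guess, tap | layered routing, direct execution |
| Search tasks | find the button, tap the input box, take it slow | Deep Link / Intent first |
| Input boxes | model makes up words on the spot | external brain Agent returns exact text |
| Waiting state | stuck, suspended, time wasted | dynamic queue releases the foreground lock |
| Verification | another screenshot, another model call | focus / UI tree / region probe |
| When stuck | keeps trying by trial and error | human takeover is the sixth layer |
| Cost | every step is heavy | model calls only where necessary |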
AutoGLM, Siri, and MobileAgent each have their place.
GUIShow's place is narrower, and blunter:
Get the job done on the phone.
User / business system
-> user-isolated task store
-> dynamic task queue
-> GUIShow execution Agent
-> six-layer execution path
-> Android phone
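The external brain Agent never touches the phone directly.
GUIShow never generates outward-facing text on its own.
The two hand off through an input request: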
phone reaches input context
-> request next_utterance
-> brain returns exact text
-> GUIShow types, sends, verifies
| Layer | Name | Use Case | Typical Time | Cost |
|---|---|---|---|---|
| L0 | Direct ADB | tap, swipe, back, home, keyevent, input | 20-800ms | local |
| L1 | Intent / Deep Link | search pages, profiles, product pages, chat entry | 0.5-3s | local |
| L2 | UI Cache | known resource-id, bounds, editable nodes | 0.2-1.5s | local |
| L3 | OCR / CV Probe | keyboard state, send button state, visual confirmation | 50-300ms probe | low |
| L4 | VLM / AutoGLM | unknown screen, modal, ambiguous visual semantics | 1.5-8s | model |
| L5 | Human Takeover | risk, dead end, account-sensitive action | human time | attention |
Default policy:
Use the lowest sufficient layer.
Escalate only when evidence fails.
Stop when confidence becomes theater.
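
The default policy can be read as a simple escalation loop: try the cheapest layer first, move up only when it fails to produce evidence of success, and stop when nothing is left. The following is a minimal sketch under that reading; `Layer` and `run_lowest_sufficient` are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layer:
    name: str
    # Returns True only when the action succeeded AND was verified.
    attempt: Callable[[], bool]


def run_lowest_sufficient(layers: List[Layer]) -> Optional[str]:
    """Try layers in the given order (cheapest first).

    Returns the name of the first layer whose attempt succeeds, or None
    when every layer failed, which is the signal to stop and hand over.
    """
    for layer in layers:
        if layer.attempt():
            return layer.name
    return None
```

Layers above the one that succeeds are never invoked, which is what keeps model calls out of the common path.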
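GUIShow does not worship waiting.
If a long task enters a waiting state, for example:
Open WhatsApp -> enter the chat box -> wait for the external brain Agent to return the reply text
the foreground execution slot should not sit idle. The queue can keep scheduling short tasks that do not depend on the awaited result:
While waiting for input:
-> open a user's profile
-> like the first post
-> follow a friend
-> return and resume the original long task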
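This is not "multi-threaded random tapping". It is bounded dynamic scheduling:
- One phone runs only one foreground action sequence at any moment.
- A long task that enters `waiting` releases the foreground lock.
- Short tasks that do not depend on the awaited result may jump the queue.
- When the external input arrives, the original task resumes.
- Account-sensitive actions remain subject to the risk policy.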
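
The scheduling rules above can be sketched as a single-slot foreground lock with three task states. This is an illustrative model only: the class, the `depends_on` field, and the state names other than `waiting` are assumptions, not the actual `task_manager.py` implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    task_id: str
    priority: int
    status: str = "ready"             # ready | waiting | done
    depends_on: Optional[str] = None  # task whose result this task needs


class ForegroundQueue:
    """One phone, one foreground action sequence at a time."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}
        self.foreground: Optional[str] = None  # id of the lock holder

    def add(self, task: Task) -> None:
        self.tasks[task.task_id] = task

    def _runnable(self, task: Task) -> bool:
        if task.status != "ready":
            return False
        if task.depends_on is None:
            return True
        # A dependent task may only run once its prerequisite is done.
        return self.tasks[task.depends_on].status == "done"

    def acquire(self) -> Optional[Task]:
        """Give the lock to the highest-priority runnable task, if free."""
        if self.foreground is not None:
            return None
        candidates = [t for t in self.tasks.values() if self._runnable(t)]
        if not candidates:
            return None
        task = max(candidates, key=lambda t: t.priority)
        self.foreground = task.task_id
        return task

    def _release(self, task_id: str) -> None:
        if self.foreground == task_id:
            self.foreground = None

    def wait(self, task_id: str) -> None:
        """Task is blocked on external input: release the foreground lock."""
        self.tasks[task_id].status = "waiting"
        self._release(task_id)

    def resume(self, task_id: str) -> None:
        """External input arrived: the task competes for the lock again."""
        self.tasks[task_id].status = "ready"

    def finish(self, task_id: str) -> None:
        self.tasks[task_id].status = "done"
        self._release(task_id)
```

In this model a short task never preempts a running sequence; it only takes the slot that a waiting task has already given up.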
Latency is inventory.
GUIShow spends it.
GUIShow does not write the lines.
When it meets a chat box, comment box, DM box, or reply box:
detect input context
-> /api/brain/input/request
-> purpose = next_utterance
-> wait for external brain response
-> type exact response
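Search boxes are the exception: they use the task keyword.
Login, verification-code (OTP), payment, and private fields are never filled automatically.
Actions that express the account's will, such as public posts, likes, follows, and purchases, must go through business policy.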
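
The handoff can be sketched as two small functions: one builds the request body for `/api/brain/input/request`, the other decides what, if anything, may be typed. The field names follow the `curl` examples in this README; the exact response shape and the function names are assumptions.

```python
import json
from typing import Dict, Optional


def build_input_request(user_id: str, app: str, surface: str) -> bytes:
    """Serialize a next_utterance request for the external brain."""
    payload = {
        "user_id": user_id,
        "purpose": "next_utterance",
        "context": {"app": app, "surface": surface},
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def text_to_type(response: Optional[Dict[str, str]]) -> Optional[str]:
    """Return the brain's text verbatim, or None to stay in waiting.

    Assumption: the brain's answer is a mapping with a `response_text`
    key. A missing or empty answer never falls back to generated text.
    """
    if not response:
        return None
    text = response.get("response_text")
    if not text:
        return None
    return text  # exact text: no trimming, no rewriting
```

Returning `None` instead of a default string keeps the "no response means waiting" rule enforceable at the call site.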
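Input boxes should not be recognized by guesswork. The phone raises the keyboard by itself, and the system leaves traces.
GUIShow's input detection path: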
UIAutomator focused EditText
OR input_method reports IME shown
OR Accessibility focused editable node
OR WebRTC detects keyboard / input region
OR previous action clicked reply/comment/message/search and keyboard appears
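
The OR-chain above can be sketched as a function over already-collected boolean signals, returning which signal fired so the decision stays auditable. How each signal is gathered (UI dump, IME state, frame analysis) is out of scope here, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputSignals:
    uiautomator_focused_edittext: bool = False
    ime_shown: bool = False
    accessibility_focused_editable: bool = False
    frame_keyboard_detected: bool = False
    last_action_opened_input: bool = False  # clicked reply/comment/message/search
    keyboard_appeared: bool = False


def detect_input_context(s: InputSignals) -> Optional[str]:
    """Return the name of the first signal that fired, or None.

    Signals are checked in the order listed in the README; the last one
    needs both the prior click and the keyboard to be present.
    """
    if s.uiautomator_focused_edittext:
        return "uiautomator_focused_edittext"
    if s.ime_shown:
        return "ime_shown"
    if s.accessibility_focused_editable:
        return "accessibility_focused_editable"
    if s.frame_keyboard_detected:
        return "frame_keyboard_detected"
    if s.last_action_opened_input and s.keyboard_appeared:
        return "action_then_keyboard"
    return None
```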
After classification, act:
| Input Type | Behavior |
|---|---|
| Search box | use task keyword |
| Chat / comment / DM / reply | request external brain text |
| Login / OTP / payment / private field | stop or hand over |
| Generic form | use task params or ask brain |
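
The table above amounts to a small dispatch on input type. A minimal sketch follows; the enum, the returned dictionary shape, and the `params` lookup by field name are illustrative assumptions rather than the router's real interface.

```python
from enum import Enum
from typing import Dict, Optional


class InputType(Enum):
    SEARCH = "search"
    CONVERSATION = "conversation"  # chat / comment / DM / reply
    SENSITIVE = "sensitive"        # login / OTP / payment / private field
    GENERIC_FORM = "generic_form"


def plan_input(
    input_type: InputType,
    task_keyword: Optional[str] = None,
    field_name: Optional[str] = None,
    task_params: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map an input type to the next step, following the table's rows."""
    if input_type is InputType.SENSITIVE:
        return {"action": "handover"}
    if input_type is InputType.SEARCH and task_keyword:
        return {"action": "type", "text": task_keyword, "source": "task"}
    if input_type is InputType.GENERIC_FORM:
        value = (task_params or {}).get(field_name or "")
        if value:
            return {"action": "type", "text": value, "source": "task"}
    # Conversation surfaces, and anything the task cannot answer itself,
    # go to the external brain instead of being improvised locally.
    return {"action": "request_brain", "purpose": "next_utterance"}
```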
- Local HTTP console at `:8787`.
- ADB execution layer: tap, swipe, keyevent, type, activity, package launch.
- WebRTC Companion App for live phone screen streaming.
- Region probe over WebRTC frames: color, brightness, perceptual hash.
- Fast state recognizer from focus and cheap visual evidence.
- User-scoped task records: `user_id`, `priority`, `status`.
- Brain input handoff:
  - `POST /api/brain/input/request`
  - `POST /api/brain/input/respond`
- Human control:
- Pause / Take Over
- Resume AI
- Interrupt AI
- pointer event on live surface interrupts future AI actions
- Brain-sourced text rule:
  - `type_text(source=brain)` requests external text.
  - no response means waiting, not hallucinated typing.
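
To illustrate what a cheap region probe can look like, here is a dependency-free sketch over a grayscale region given as rows of 0..255 values. It uses an average hash (aHash), one simple member of the perceptual-hash family; the README does not say which hash `vision.py` actually uses, so treat the choice and the function names as assumptions.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0..255 luminance values


def mean_brightness(region: Gray) -> float:
    """Average luminance of the region."""
    total = sum(sum(row) for row in region)
    count = sum(len(row) for row in region)
    return total / count


def average_hash(region: Gray, size: int = 8) -> int:
    """Downsample to size x size block means, then threshold at their mean.

    Returns an integer with size*size bits, row-major, most significant
    bit first. A set bit means that cell is brighter than the average.
    """
    h, w = len(region), len(region[0])
    cells: List[float] = []
    for gy in range(size):
        y0 = gy * h // size
        y1 = max((gy + 1) * h // size, y0 + 1)  # at least one row per cell
        for gx in range(size):
            x0 = gx * w // size
            x1 = max((gx + 1) * w // size, x0 + 1)
            block = [region[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    avg = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > avg else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

A probe can then compare the hash of a region such as the send button or keyboard area against a stored reference and treat a small Hamming distance as "unchanged", without calling any model.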
.
├── mobile_agent/
│ ├── server.py # HTTP API and browser console
│ ├── brain.py # users, tasks, brain input requests
│ ├── task_manager.py # scheduling, foreground lock, interrupts
│ ├── router.py # natural language -> intent -> actions
│ ├── executor.py # action execution
│ ├── adb_controller.py # ADB wrapper
│ ├── vision.py # WebRTC frames, probes, VLM hook
│ ├── state_recognizer.py # fast state recognition
│ └── docs/
│ └── chat_app_agent_architecture.md
├── companion_app/ # Android screen capture + WebRTC companion
├── docs/assets/ # README diagrams
├── Reference/ # upstream and external reference projects
│ ├── Open-AutoGLM/
│ └── x-plug/
├── tools/platform-tools/ # local ADB tools
└── README.md
Reference/ is a research shelf. GUIShow's active runtime path is mobile_agent/
and companion_app/.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export PATH="$PWD/tools/platform-tools:$PATH"
adb devices

PATH="$PWD/tools/platform-tools:$PATH" .venv/bin/python -m mobile_agent.server \
--host 0.0.0.0 \
--port 8787 \
--fps 0.2 \
--max-width 720 \
--jpeg-quality 55 \
--focus-interval 10

Open:
http://127.0.0.1:8787
Submit a task:
curl -X POST http://127.0.0.1:8787/api/tasks \
-H 'content-type: application/json' \
-d '{
"user_id": "demo_user",
"priority": 8,
"goal": "Open X, search for OpenAI, and get ready to comment"
}'

Request speech from the external brain:
curl -X POST http://127.0.0.1:8787/api/brain/input/request \
-H 'content-type: application/json' \
-d '{
"user_id": "demo_user",
"purpose": "next_utterance",
"context": {
"app": "X",
"surface": "reply_box"
}
}'

Answer with exact text:
curl -X POST http://127.0.0.1:8787/api/brain/input/respond \
-H 'content-type: application/json' \
-d '{
"request_id": "brain_input_xxx",
"response_text": "This one is worth following up on."
}'

Interrupt and take over:
curl -X POST http://127.0.0.1:8787/api/control/interrupt \
-H 'content-type: application/json' \
-d '{
"source": "human",
"reason": "manual takeover",
"takeover": true
}'

- Persistent task queue: SQLite / Postgres.
- Production external brain Agent integration.
- Automatic input-context detector: UIAutomator + IME + Accessibility + CV.
- App route library: X, Instagram, TikTok, Xiaohongshu, Xianyu, Amazon, WhatsApp, Snapchat.
- Dynamic queue runner with dependency-aware scheduling.
- OCR/CV templates for send buttons, comment boxes, search boxes, keyboard states.
- VLM fallback loop.
- Multi-device scheduler.
- Risk policies for delete, payment, purchase, follow, like, public post.
GUIShow is a research prototype for GUI automation.
It is not a spam engine.
It is not a bot farm.
It is not an excuse to outsource judgment to a process with rootless confidence.
Default rules:
- Delete requires confirmation.
- Payment, purchase, login, OTP, and private fields are not filled automatically.
- Public comments, DMs, likes, follows, and posts must be governed by business policy.
- Human takeover overrides AI.
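
The default rules can be expressed as one gate that every AI-initiated action passes through before execution. This is a sketch of that idea only; the action labels and the `Decision` values are hypothetical, and real business policy would replace the `POLICY` branch.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"    # ask the user before executing
    HANDOVER = "handover"  # never automated: stop and hand to a human
    POLICY = "policy"      # defer to the business policy layer
    BLOCKED = "blocked"    # a human currently holds control


NEVER_AUTOMATED = {"payment", "purchase", "login", "otp", "private_field"}
POLICY_GOVERNED = {"comment", "dm", "like", "follow", "post"}


def decide(action: str, human_in_control: bool = False) -> Decision:
    """Apply the default rules in priority order."""
    if human_in_control:
        return Decision.BLOCKED  # human takeover overrides AI
    if action == "delete":
        return Decision.CONFIRM
    if action in NEVER_AUTOMATED:
        return Decision.HANDOVER
    if action in POLICY_GOVERNED:
        return Decision.POLICY
    return Decision.ALLOW
```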
Others wait for the model to understand the screen.
GUIShow asks a colder question:
Why is the model involved at all?