鬼手GUIShow

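Build the fastest, most efficient, top-level GUI Agent that actually works.

鬼手GUIShow is an execution-oriented Phone Agent prototype for real phone GUIs. It breaks phone control into a task queue, layered execution, low-cost verification, and human takeover. When a task reaches a place that needs language, such as a chat box, comment box, or DM box, it asks an external brain Agent for one exact sentence, then types it, sends it, and verifies the result.

It is not another AutoGLM wrapper.
It is not Siri.
It is not a Mobile Agent that throws a screenshot at a model and prays.

It is a hand.

Cold, fast, few words.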

Premise

Phone GUI automation is not mysticism.

Most actions do not need to "understand the screen". They only need to be routed correctly:

tap / swipe / keyevent
am start / deep link
resource-id / bounds
OCR / CV probe
VLM only when necessary
human takeover when the machine starts pretending
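
A minimal sketch of how the first two routes might map onto ADB commands. The helper names (`tap`, `swipe`, `keyevent`, `deep_link`) are hypothetical and are not taken from `adb_controller.py`; only the underlying `adb shell input` and `am start` commands are standard. The helpers build argument lists and execute nothing.

```python
from typing import List

ADB = "adb"  # assumption: adb is on PATH


def tap(x: int, y: int) -> List[str]:
    """Argument list for a single tap at screen coordinates (x, y)."""
    return [ADB, "shell", "input", "tap", str(x), str(y)]


def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    """Argument list for a swipe from (x1, y1) to (x2, y2)."""
    return [ADB, "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]


def keyevent(code: str) -> List[str]:
    """Argument list for a key event such as KEYCODE_BACK or KEYCODE_HOME."""
    return [ADB, "shell", "input", "keyevent", code]


def deep_link(uri: str) -> List[str]:
    """Argument list that opens a URI with an ACTION_VIEW intent.

    URIs containing shell metacharacters such as '&' need extra quoting,
    because the device shell parses the command again.
    """
    return [ADB, "shell", "am", "start",
            "-a", "android.intent.action.VIEW", "-d", uri]


# To run one of these against a connected device:
#   import subprocess
#   subprocess.run(tap(540, 1200), check=True)
```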

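If an Agent has to take a screenshot and ask a model at every step, that is not intelligence; it is cramming homework on the spot.

GUIShow's rules are simple:

If a system capability works, don't look at the screen.
If a UI node works, don't ask the VLM.
If the brain can be waited on, don't improvise.
If a short task can jump the queue, don't idle on a long one.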

The Difference

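Dimension	Screenshot-based Phone Agent	GUIShow
Control	look, guess, tap	layered routing, direct execution
Search tasks	find the button, tap the input box, take it slow	Deep Link / Intent first
Input boxes	model makes up text on the spot	external brain Agent returns exact text
Waiting state	stuck, suspended, time wasted	dynamic queue releases the foreground lock
Verification	screenshot again, ask the model again	focus / UI tree / region probe
When stuck	keep trying by trial and error	human takeover is the sixth layer
Cost	every step is heavy	model calls only where necessary

AutoGLM, Siri, and MobileAgent each have their place.
GUIShow's place is narrower, and blunter:

Get things done on the phone.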

Architecture

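User / business system
  -> per-user isolated task store
  -> dynamic task queue
  -> GUIShow execution Agent
  -> six-layer execution path
  -> Android phone

The external brain Agent never touches the phone directly.
GUIShow never generates outward-facing text on its own.
The two hand off through input requests: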

phone reaches input context
  -> request next_utterance
  -> brain returns exact text
  -> GUIShow types, sends, verifies

Six Layers

Layer	Name	Use Case	Typical Time	Cost
L0	Direct ADB	tap, swipe, back, home, keyevent, input	20-800ms	local
L1	Intent / Deep Link	search pages, profiles, product pages, chat entry	0.5-3s	local
L2	UI Cache	known resource-id, bounds, editable nodes	0.2-1.5s	local
L3	OCR / CV Probe	keyboard state, send button state, visual confirmation	50-300ms probe	low
L4	VLM / AutoGLM	unknown screen, modal, ambiguous visual semantics	1.5-8s	model
L5	Human Takeover	risk, dead end, account-sensitive action	human time	attention

Default policy:

Use the lowest sufficient layer.
Escalate only when evidence fails.
Stop when confidence becomes theater.
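
The default policy can be sketched as a loop over layers ordered from cheapest to most expensive. This is an illustration under assumptions, not the code in `router.py` or `executor.py`: `run_step` and the handler signature are hypothetical, and each handler is assumed to return `True` only when it has evidence that the step succeeded.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical handler type: takes a step description and returns True
# only when the layer completed the step and could verify it.
Handler = Callable[[str], bool]


def run_step(step: str, layers: List[Tuple[str, Handler]]) -> Optional[str]:
    """Try each layer in order and return the name of the first that succeeds.

    Returns None when every layer fails, which the caller should treat
    as the signal to stop and hand over to a human.
    """
    for name, handler in layers:
        if handler(step):
            return name
    return None


# Stub handlers standing in for real layers.
layers = [
    ("L0", lambda step: step in {"back", "home"}),
    ("L1", lambda step: step.startswith("search ")),
    ("L4", lambda step: step.startswith("dismiss ")),
]

print(run_step("back", layers))            # handled by L0
print(run_step("search OpenAI", layers))   # escalates to L1
print(run_step("pay invoice", layers))     # no layer accepts it
```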

Dynamic Task Queue

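GUIShow does not worship waiting.

When a long task enters a waiting state, for example:

open WhatsApp -> enter the chat box -> wait for the external brain Agent to return the reply text

the foreground should not sit idle. The queue can keep scheduling short tasks that do not depend on the awaited result:

While waiting for input:
  -> open a user's profile
  -> like the first post
  -> follow a friend
  -> return and resume the original long task

This is not "multithreaded random tapping". It is bounded dynamic scheduling:

One phone runs only one foreground action sequence at a time.
A long task that enters waiting releases the foreground lock.
Short tasks that do not depend on the awaited result may jump the queue.
When the external input returns, the original task resumes.
Account-sensitive actions remain subject to the risk policy.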
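
A minimal sketch of these scheduling rules, assuming a single phone and a priority heap. `ForegroundQueue` and its method names are hypothetical and do not describe `task_manager.py`; the risk policy is left out.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(order=True)
class Task:
    sort_key: int                      # negative priority: higher priority pops first
    name: str = field(compare=False)


class ForegroundQueue:
    """One phone, one foreground task at a time."""

    def __init__(self) -> None:
        self._ready: List[Task] = []
        self._parked: Dict[str, Task] = {}
        self._current: Optional[Task] = None

    def submit(self, name: str, priority: int) -> None:
        heapq.heappush(self._ready, Task(-priority, name))

    def next_foreground(self) -> Optional[str]:
        """Hand the free foreground to the best ready task; report the holder."""
        if self._current is None and self._ready:
            self._current = heapq.heappop(self._ready)
        return self._current.name if self._current else None

    def park_current(self) -> None:
        """The current task starts waiting for external input and releases the lock."""
        if self._current is not None:
            self._parked[self._current.name] = self._current
            self._current = None

    def finish_current(self) -> None:
        self._current = None

    def resume(self, name: str) -> None:
        """External input arrived: the parked task becomes ready again."""
        heapq.heappush(self._ready, self._parked.pop(name))


queue = ForegroundQueue()
queue.submit("whatsapp_reply", priority=8)
queue.submit("like_first_post", priority=3)

first = queue.next_foreground()     # the long task takes the foreground
queue.park_current()                # it waits for the brain and releases the lock
second = queue.next_foreground()    # the short task jumps in
queue.resume("whatsapp_reply")      # brain text arrived
queue.finish_current()              # the short task completes
third = queue.next_foreground()     # the long task resumes
print(first, second, third)
```

A resumed task does not preempt the short task that currently holds the foreground; it waits until the lock is free again, which keeps a single action sequence on the phone at any moment.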

Latency is inventory.
GUIShow spends it.

Speech Belongs To The Brain

GUIShow does not write the words.

When it reaches a chat box, comment box, DM box, or reply box:

detect input context
  -> /api/brain/input/request
  -> purpose = next_utterance
  -> wait for external brain response
  -> type exact response
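
The handoff above can be sketched as a small in-memory store of pending requests. `BrainInbox` is a hypothetical name, and the `brain_input_` id prefix is only borrowed from the example request id in the API section; the real id format and storage in `brain.py` are not documented here.

```python
import uuid
from typing import Dict, Optional


class BrainInbox:
    """Tracks input requests until the external brain answers them."""

    def __init__(self) -> None:
        self._requests: Dict[str, dict] = {}

    def request(self, user_id: str, purpose: str = "next_utterance") -> str:
        """Register a pending request and return its id."""
        request_id = f"brain_input_{uuid.uuid4().hex[:8]}"
        self._requests[request_id] = {
            "user_id": user_id,
            "purpose": purpose,
            "response_text": None,
        }
        return request_id

    def respond(self, request_id: str, response_text: str) -> None:
        """Store the exact text returned by the brain."""
        if request_id not in self._requests:
            raise KeyError(f"unknown request: {request_id}")
        self._requests[request_id]["response_text"] = response_text

    def text_to_type(self, request_id: str) -> Optional[str]:
        """Return the exact text, or None while the request is still waiting."""
        return self._requests[request_id]["response_text"]


inbox = BrainInbox()
request_id = inbox.request("demo_user")
before = inbox.text_to_type(request_id)   # None: keep waiting, type nothing
inbox.respond(request_id, "Thanks, I will take a look.")
after = inbox.text_to_type(request_id)    # exact text, typed verbatim
```

Returning `None` instead of a fallback string is the point of the design: the executor has nothing to type until the brain has answered.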

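Search boxes are the exception: they use the task keyword.
Login, OTP, payment, and private fields are never filled automatically.
Actions that express the account's will, such as public posts, likes, follows, and purchases, must go through business policy.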

Input Detection Strategy

Input boxes should not be detected by guesswork. The phone raises its own keyboard, and the system leaves traces.

GUIShow's input detection path:

UIAutomator focused EditText
OR input_method reports IME shown
OR Accessibility focused editable node
OR WebRTC detects keyboard / input region
OR previous action clicked reply/comment/message/search and keyboard appears
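
A sketch of the first two signals, assuming the UI tree comes from `adb shell uiautomator dump` and the IME state from `adb shell dumpsys input_method`. The function names are hypothetical, and the `mInputShown=true` marker is an assumption that can vary across Android versions, so treat it as one signal among several rather than a guarantee.

```python
import xml.etree.ElementTree as ET


def has_focused_edit_text(ui_xml: str) -> bool:
    """True if the UI dump contains a focused node whose class ends in EditText."""
    root = ET.fromstring(ui_xml)
    for node in root.iter("node"):
        is_focused = node.get("focused") == "true"
        is_edit_text = node.get("class", "").endswith("EditText")
        if is_focused and is_edit_text:
            return True
    return False


def ime_shown(dumpsys_output: str) -> bool:
    """True if the input_method dump reports the keyboard as shown."""
    return "mInputShown=true" in dumpsys_output


def in_input_context(ui_xml: str, dumpsys_output: str) -> bool:
    """OR of the cheap signals; later signals would be added the same way."""
    return has_focused_edit_text(ui_xml) or ime_shown(dumpsys_output)


sample_ui = (
    '<hierarchy rotation="0">'
    '<node class="android.widget.FrameLayout" focused="false">'
    '<node class="android.widget.EditText" focused="true" />'
    '</node>'
    '</hierarchy>'
)
print(in_input_context(sample_ui, "mInputShown=false"))
```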

After classification, act:

Input Type	Behavior
Search box	use task keyword
Chat / comment / DM / reply	request external brain text
Login / OTP / payment / private field	stop or hand over
Generic form	use task params or ask brain
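
The table can be read as a small dispatch function. The sketch below is illustrative only: `InputType`, `plan_input`, and the action labels `type`, `ask_brain`, and `hand_over` are hypothetical names, not identifiers from the codebase.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class InputType(Enum):
    SEARCH = "search"
    CONVERSATION = "conversation"   # chat / comment / DM / reply
    SENSITIVE = "sensitive"         # login / OTP / payment / private field
    GENERIC_FORM = "generic_form"


def plan_input(
    input_type: InputType,
    keyword: Optional[str] = None,
    field_name: Optional[str] = None,
    task_params: Optional[Dict[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (action, text). Text is set only when the hand may type by itself."""
    if input_type is InputType.SEARCH:
        return ("type", keyword)
    if input_type is InputType.CONVERSATION:
        return ("ask_brain", None)
    if input_type is InputType.SENSITIVE:
        return ("hand_over", None)
    # Generic form: use a task parameter if one exists, otherwise ask the brain.
    value = (task_params or {}).get(field_name) if field_name else None
    if value is not None:
        return ("type", value)
    return ("ask_brain", None)


print(plan_input(InputType.SEARCH, keyword="OpenAI"))
print(plan_input(InputType.CONVERSATION))
print(plan_input(InputType.SENSITIVE))
print(plan_input(InputType.GENERIC_FORM, field_name="city", task_params={"city": "Berlin"}))
```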

Implemented

Local HTTP console at :8787.
ADB execution layer: tap, swipe, keyevent, type, activity, package launch.
WebRTC Companion App for live phone screen streaming.
Region probe over WebRTC frames: color, brightness, perceptual hash.
Fast state recognizer from focus and cheap visual evidence.
User-scoped task records: user_id, priority, status.
Brain input handoff:
- POST /api/brain/input/request
- POST /api/brain/input/respond
Human control:
- Pause / Take Over
- Resume AI
- Interrupt AI
- pointer event on live surface interrupts future AI actions
Brain-sourced text rule:
- type_text(source=brain) requests external text.
- no response means waiting, not hallucinated typing.
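
To make the region probe concrete, here is a sketch of two cheap signals over a small grayscale region. It uses average hash, a simple member of the perceptual hash family; whether `vision.py` uses this variant is not stated, so the choice is an assumption. The region is assumed to be already cropped, downscaled, and converted to grayscale values in 0..255.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale pixel values, 0..255


def mean_brightness(region: Gray) -> float:
    """Average pixel value of the region."""
    pixels = [p for row in region for p in row]
    return sum(pixels) / len(pixels)


def average_hash(region: Gray) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the region mean."""
    pixels = [p for row in region for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# A send button that lights up in the bottom-right corner, in two noisy frames,
# compared with a frame where the bright spot moved to the top-left corner.
frame_a = [[10, 10], [10, 200]]
frame_b = [[12, 9], [11, 190]]
frame_c = [[200, 10], [10, 10]]

same = hamming(average_hash(frame_a), average_hash(frame_b))
changed = hamming(average_hash(frame_a), average_hash(frame_c))
print(same, changed)
```

A small Hamming distance means the region still looks the same, so the state can be confirmed without a model call; a large one is evidence that something changed.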

Repository Map

.
├── mobile_agent/
│   ├── server.py              # HTTP API and browser console
│   ├── brain.py               # users, tasks, brain input requests
│   ├── task_manager.py        # scheduling, foreground lock, interrupts
│   ├── router.py              # natural language -> intent -> actions
│   ├── executor.py            # action execution
│   ├── adb_controller.py      # ADB wrapper
│   ├── vision.py              # WebRTC frames, probes, VLM hook
│   ├── state_recognizer.py    # fast state recognition
│   └── docs/
│       └── chat_app_agent_architecture.md
├── companion_app/             # Android screen capture + WebRTC companion
├── docs/assets/               # README diagrams
├── Reference/                 # upstream and external reference projects
│   ├── Open-AutoGLM/
│   └── x-plug/
├── tools/platform-tools/      # local ADB tools
└── README.md

Reference/ is a research shelf. GUIShow's active runtime path is mobile_agent/ and companion_app/.

Quick Start

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export PATH="$PWD/tools/platform-tools:$PATH"
adb devices

PATH="$PWD/tools/platform-tools:$PATH" .venv/bin/python -m mobile_agent.server \
  --host 0.0.0.0 \
  --port 8787 \
  --fps 0.2 \
  --max-width 720 \
  --jpeg-quality 55 \
  --focus-interval 10

Open:

http://127.0.0.1:8787

API

Submit a task:

curl -X POST http://127.0.0.1:8787/api/tasks \
  -H 'content-type: application/json' \
  -d '{
    "user_id": "demo_user",
    "priority": 8,
    "goal": "Open X, search for OpenAI, and get ready to comment"
  }'

Request speech from the external brain:

curl -X POST http://127.0.0.1:8787/api/brain/input/request \
  -H 'content-type: application/json' \
  -d '{
    "user_id": "demo_user",
    "purpose": "next_utterance",
    "context": {
      "app": "X",
      "surface": "reply_box"
    }
  }'

Answer with exact text:

curl -X POST http://127.0.0.1:8787/api/brain/input/respond \
  -H 'content-type: application/json' \
  -d '{
    "request_id": "brain_input_xxx",
    "response_text": "This one is worth a closer look."
  }'

Interrupt and take over:

curl -X POST http://127.0.0.1:8787/api/control/interrupt \
  -H 'content-type: application/json' \
  -d '{
    "source": "human",
    "reason": "manual takeover",
    "takeover": true
  }'
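
The same calls can be made from Python with the standard library. This sketch only builds the requests, because the response schema is not documented here; `build_post` and the wrapper names are hypothetical, while the paths and payload fields are copied from the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8787"  # local console address from Quick Start


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the local GUIShow server."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def submit_task(user_id: str, priority: int, goal: str) -> urllib.request.Request:
    return build_post("/api/tasks",
                      {"user_id": user_id, "priority": priority, "goal": goal})


def respond_to_brain_request(request_id: str, text: str) -> urllib.request.Request:
    return build_post("/api/brain/input/respond",
                      {"request_id": request_id, "response_text": text})


def interrupt(reason: str, takeover: bool = True) -> urllib.request.Request:
    return build_post("/api/control/interrupt",
                      {"source": "human", "reason": reason, "takeover": takeover})


req = submit_task("demo_user", 8, "open X and search OpenAI")
print(req.get_method(), req.full_url)

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode("utf-8"))
```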

Roadmap

Persistent task queue: SQLite / Postgres.
Production external brain Agent integration.
Automatic input-context detector: UIAutomator + IME + Accessibility + CV.
App route library: X, Instagram, TikTok, Xiaohongshu, Xianyu, Amazon, WhatsApp, Snapchat.
Dynamic queue runner with dependency-aware scheduling.
OCR/CV templates for send buttons, comment boxes, search boxes, keyboard states.
VLM fallback loop.
Multi-device scheduler.
Risk policies for delete, payment, purchase, follow, like, public post.

Boundary

GUIShow is a research prototype for GUI automation.

It is not a spam engine.
It is not a bot farm.
It is not an excuse to outsource judgment to a process with rootless confidence.

Default rules:

Delete requires confirmation.
Payment, purchase, login, OTP, and private fields are not filled automatically.
Public comments, DMs, likes, follows, and posts must be governed by business policy.
Human takeover overrides AI.

Last Line

Others wait for the model to understand the screen.

GUIShow asks a colder question:

Why is the model involved at all?
