Skip to content

showlab/AUI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

84 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Agent-friendly UI

Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Torr Vision Group @ Oxford University, Show Lab @ National University of Singapore, Microsoft

๐Ÿ“„ Paper ย  | ย  ๐Ÿค— HF Daily Paper ย  | ย  ๐Ÿค— HF Demo ย  | ย  ๐ŸŒ Project

What does Agent-friendly UI look like? Check out below demo:

aui_demo_video_h200.mp4

The left UI is designed for ๐Ÿง‘๐Ÿปโ€๐Ÿ’ปhumansโ€”prioritizing aesthetics.

The right UI is redesigned for ๐Ÿค–agentsโ€”focused on clarity and functionality.

๐Ÿ”ฅ Update

  • [2025.11.20] Huggingface Demo is released.
  • [2025.11.19] Arxiv paper is released.
  • [2025.10.30] Code is released.

๐Ÿ“– TL;DR

Can Computer-Use Agents offer feedback to assist Coders to Generate UI?


โš™๏ธ Environments

1. Requirements

  • Use Python 3.10+ in an isolated environment:
    conda create -n aui python=3.10
    conda activate aui
  • Install dependencies and Playwright browsers:
    pip install -r requirements.txt
    python -m playwright install

2. Configure Models

  • Local model servers are recommended (e.g., using VLLM).
  • Edit configs/models.yaml to point to your endpoints:
    • Coder: Qwen3-Coder-30B (http://localhost:8001/v1)
    • CUA: UI-TARS-1.5-7B (http://localhost:8002/v1)
    • Verifier: GPT-5 or Qwen2.5-VL-72B
  • Export API keys (if using proprietary models):
    export AZURE_OPENAI_API_KEY="YOUR_KEY"

๐Ÿš€ Quick Start

Our AUI contains the following stages.

Pipeline:

  • 0๏ธโƒฃ Preparation: Generate initial websites and tasks per app.
  • 1๏ธโƒฃ Task Solvability Check: Judge extracts task-state rules on initial websites to determine task validity.
  • 2๏ธโƒฃ CUA Navigation Test: CUA executes supported tasks; oracle evaluation is rule-based.
  • 3๏ธโƒฃ Iterative Refinement:
    1. Revise: Update websites based on unsupported tasks (Task Solvability Feedback) and CUA failures (Navigation Feedback via Dashboard).
    2. ReJudge: Re-evaluate task solvability on revised websites.
    3. ReTest: CUA executes tasks on revised websites.

For normal usage, you only need the single entrypoint run.py from the repo root:

cd betterui_release
/users/husiyuan/miniconda3/envs/ui/bin/python run.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps \
  --experiment exp_integrated \
  --revision-type integrated \
  --cua-models uitars

This command sequentially runs Stage 0 โ†’ Stage 3. Advanced users who want full control over each stage can expand the section below to see what each stage does and the corresponding commands.

Show Stage 0โ€“3 details

Stage 0 โ€“ Preparation

  • src/stage0_generate_websites.py: generate initial v0 websites for all apps and coder models.
  • src/stage0_generate_tasks.py: generate 30 tasks per app (GPT-5) based on app labels.

Stage 1 โ€“ Judge v0 (Task Solvability)

  • src/stage1_judge_v0.py: Judge extracts state and completion rules for each task on v0 websites.

Stage 2 โ€“ CUA Test v0 (Navigation)

  • src/stage2_cua_test_v0.py: CUA runs only supported tasks (with rules) on v0 websites; success is evaluated by rules (oracle).

Stage 3 โ€“ Revision + Re-eval

  • src/stage3_0_revise.py: revise websites using unsupported-task feedback, CUA failures, or integrated signals.
  • src/stage3_1_judge_v1.py: re-run judge on v1 websites to update task support.
  • src/stage3_2_cua_test_v1.py: re-run CUA on v1 websites with oracle evaluation.

1) Generate Initial Websites (3 coder models ร— 52 apps)

python src/stage0_generate_websites.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps

2) Generate Tasks (30 tasks per app via GPT-5)

python src/stage0_generate_tasks.py \
  --apps all \
  --v0-dir full_52_apps

3) Metric 1: Judge Initial Websites (Task Solvability)

python src/stage1_judge_v0.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps

4) Metric 2: CUA Navigation Test (Initial)

python src/stage2_cua_test_v0.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps \
  --cua-models uitars

5) Stage 3: Iterative Refinement (Choose a revision strategy)

  • Option A: CUA Revision (Fix based on navigation failures)

    python src/stage3_0_revise.py \
      --experiment exp_cua_fix \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type cua \
      --v0-dir full_52_apps
  • Option B: Unsupported Task Revision (Fix based on missing features)

    python src/stage3_0_revise.py \
      --experiment exp_func_fix \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type unsupported \
      --v0-dir full_52_apps
  • Option C: Integrated Revision (Combine both โ€“ Recommended)

    python src/stage3_0_revise.py \
      --experiment exp_integrated \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type integrated \
      --v0-dir full_52_apps

6) Re-evaluate Revised Websites

# Re-Judge Task Solvability
python src/stage3_1_judge_v1.py \
  --experiment exp_integrated \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --revision-type integrated \
  --v0-dir full_52_apps

# Re-Run CUA Navigation Test
python src/stage3_2_cua_test_v1.py \
  --experiment exp_integrated \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --revision-type integrated \
  --cua-models uitars \
  --v0-dir full_52_apps

๐Ÿ—‚๏ธ Data Layout

Initial Data (Stage 0-2)

v0/{v0_dir}/
  websites/{app}/{model}/index.html         # Initial Generated Websites
  tasks/{app}/
    tasks.json                              # Generated Tasks
    states/{model}/rules.json               # Stage 1: Validation Rules
    v0_cua_results/{model}/{cua_model}/     # Stage 2: CUA Results
      results.json
      trajectories/task_{i}/step_*.png|json # Trajectories

Experiments (Stage 3)

experiments/{experiment}/
  runs/{run_key}/
    stage3_0/{app}/{model}/v1_website/index.html    # Revised Websites
    stage3_1/{app}/{model}/rules.json               # Revised Rules
    stage3_2/{cua_model}/{app}/{model}/
      trajectories/{task_id}/                       # New Trajectories
      run_summary.json

๐Ÿ“ Evaluations

  1. Function Completeness (FC): Percentage of tasks that are functionally supported by the UI (determined by the Judge).
  2. CUA Success Rate (SR): Percentage of valid tasks successfully completed by the CUA.

Key Components:

  • Verifier: A GPT-5 based judge that extracts rule-based checks to validate task solvability.
  • CUA Dashboard: A visual summary tool that compresses long interaction trajectories into a single image, highlighting key failure points for the Coder.
  • Revision Strategies:
    • unsupported: Adds missing features for unsolvable tasks.
    • cua: Fixes usability issues preventing agent navigation (destylization, simplification).
    • integrated: Combines both for maximum performance.

๐ŸŽ“ Citations

If you find this project helpful, please consider citing our paper:

@misc{lin2025aui,
      title={Computer-Use Agents as Judges for Generative User Interface}, 
      author={Kevin Qinghong Lin and Siyuan Hu and Linjie Li and Zhengyuan Yang and Lijuan Wang and Philip Torr and Mike Zheng Shou},
      year={2025},
      eprint={2511.15567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15567}, 
}

๐Ÿ™ Acknowledgements

  • Apps are adapted from OpenAI's coding examples.
  • Thanks to the open-source community for browser automation (Playwright) and agent tooling.

About

Computer-Use Agents as Judges for Generative UI

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages