This repository is the code implementation of our paper:
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- 2025.9.18 We release the state control benchmark in our paper.
- 2025.9.17 We release the preprocessing and evaluation code of our paper.
- 2025.9.17 We release the video demo of our paper.
We provide the video demo corresponding to Section 5.5 of our paper. The target instruction is "turn wifi on", with the toggle initially set to `on`, thereby serving as a test of false-positive toggling. The video demo is available at VideoDemo.
- OS-Atlas-7B without StaR fails to execute the instruction correctly, resulting in a false-positive toggle. The agent mistakenly perceives the current toggle state as `off` and incorrectly clicks the toggle, causing an unintended state change. It then repeatedly toggles between `on` and `off`, falling into an infinite loop and ultimately failing the task.
- OS-Atlas-7B with StaR, by contrast, executes the instruction successfully. At the critical decision step, the agent adaptively applies the state-aware reasoning chain, correctly perceiving the current toggle state as `on` and appropriately deciding to finish the task, thereby completing the instruction as intended.
- Install the requirements:

```bash
pip install -r requirements.txt
```

- For training the agents, we adopt the LLaMA-Factory framework. We provide a reference copy of its source code (version 0.9.4.dev0). Navigate to the LLaMA-Factory directory and install the dependencies:

```bash
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
```
- For evaluating the agent in the dynamic environment, we adopt the AndroidWorld framework. Please refer to their repository for deployment instructions.
We construct a state control benchmark with binary toggle instructions from public datasets to evaluate agent performance on toggle execution.
Examples are provided in this repository:
- Benchmark samples: Examples
- Corresponding screenshots: ImagePaths
An example record of the state control benchmark is presented below:
```json
{
"images": [
"GUIData/stateControlBenchmark/AITW_episode_8680156250447271550_step_10.jpg"
],
"img_filename": "episode_8680156250447271550_step_10.jpg",
"bbox": [
814.4,
360.5,
914.4,
460.5
],
"image_height": 732,
"image_width": 412,
"clickCoordinate": [
864.4,
410.5
],
"useBbox": false,
"annotation": {
"is_switch": true,
"feature": "picture-in-picture",
"state_before_action": "Enabled",
"state_after_action": "Disabled",
"action_effect": "The action turn off picture-in-picture by changing the switch from Enabled to Disabled"
},
"rawClickCoordinate": [
356,
300
],
"posInstruction": "turn off picture-in-picture",
"negInstruction": "turn on picture-in-picture",
"posAtlasAction": "CLICK <point>[[864.4, 410.5]]</point>",
"negAtlasAction": "COMPLETE"
}
```

The description of each field in the state control benchmark is presented below:
| Field Name | Description |
|---|---|
| `images` | Corresponding GUI screenshot path |
| `img_filename` | Corresponding GUI screenshot filename (not used) |
| `bbox` | Bounding box of the target element, normalized to [0, 1000] |
| `image_height`, `image_width` | Height and width of the original screenshot (in pixels) |
| `clickCoordinate` | Click coordinate of the target element, normalized to [0, 1000] |
| `useBbox` | Whether to use the bounding box to locate the target element (boolean) |
| `annotation` | Annotations related to the target element interaction |
| ├── `is_switch` | Whether the target element is a toggle switch |
| ├── `feature` | Feature name controlled by the toggle (e.g., picture-in-picture) |
| ├── `state_before_action` | State of the toggle before the click (e.g., Enabled) |
| ├── `state_after_action` | State of the toggle after the click (e.g., Disabled) |
| └── `action_effect` | Description of the effect caused by the action (natural language) |
| `rawClickCoordinate` | Raw click coordinate (in pixels, not normalized) |
| `posInstruction` | Positive instruction, which varies the toggle state |
| `negInstruction` | Negative instruction, which maintains the current state |
| `posAtlasAction` | Positive label action (OS-Atlas format) |
| `negAtlasAction` | Negative label action (OS-Atlas format) |
The full benchmark is available on Hugging Face.
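For clarity, here is a minimal sketch of reading a record and mapping its normalized coordinates back to pixels. It assumes the benchmark JSON is a list of records like the one above; the file path and the helper `normalized_to_pixels` are ours for illustration, not part of the repository. For the example record, the normalized click (864.4, 410.5) on a 412×732 screenshot maps back to roughly (356, 300), matching `rawClickCoordinate`.

```python
import json

def normalized_to_pixels(coord, image_width, image_height):
    """Map an (x, y) coordinate normalized to [0, 1000] back to pixel space."""
    x, y = coord
    return round(x / 1000 * image_width), round(y / 1000 * image_height)

# Hypothetical local path; adjust to where you store the benchmark.
with open("state_control_benchmark_test.json", "r", encoding="utf-8") as f:
    records = json.load(f)

record = records[0]
px, py = normalized_to_pixels(
    record["clickCoordinate"], record["image_width"], record["image_height"]
)
print((px, py))                        # e.g. (356, 300) for the record shown above
print(record["rawClickCoordinate"])    # raw pixel coordinate stored in the record
```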
Data preprocessing scripts are provided in the dataPreprocessor directory. Hyperparameters are configured using YAML files.
Example: Preprocess State Control Benchmark for UI-TARS-7B
```yaml
type: state_cot
model: uitars
apiKey: "Your API key for zhipuai"
diversity: true
agentCount: 10
llamafactory: false
stateJsonPathTrain: ./data/state/state_control_benchmark_train.json
stateJsonPathTest: ./data/state/state_control_benchmark_test.json
```

Example: Preprocess AndroidControl Benchmark for UI-TARS-7B
```yaml
type: android_control
model: atlas
apiKey: "Your API key for zhipuai"
state: false
low_level: false
agentCount: 10
llamafactory: true
acjsonPath: GUIData/android_control/jsons
acimagePath: GUIData/android_control/images
aclayoutPath: GUIData/android_control/layouts
cot_trained: true
```

Example: Merge Data for AgentCPM-GUI-8B Training
```yaml
mergeConfigList:
- "dataPreprocessorYamls/acg/aitz_cot_trained_llamafactory.yaml"
- "dataPreprocessorYamls/acg/androidControl_high_cot_trained_llamafactory.yaml"
- "dataPreprocessorYamls/acg/androidControl_low_cot_trained_llamafactory.yaml"
- "dataPreprocessorYamls/acg/gui_odyssey_cot_trained_llamafactory.yaml"
- "dataPreprocessorYamls/acg/state_cot_llama_factory.yaml"
model: agentcpmgui
type: agentic_with_state_cotTo preprocess data, run:
```bash
python preprocessData.py --config <path to config yaml> --mergeConfig <path to merge config yaml>
```

The training of the agents is implemented with LLaMA-Factory. After preprocessing and merging the data (see Preprocess the Data above for details), configure the training settings in LLaMA-Factory; see its README for more details.
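If you train with LLaMA-Factory, the merged data file also needs to be registered in its `data/dataset_info.json`. The snippet below is only a sketch under assumptions: the file name `agentcpm_gui_state_cot_merged.json` and the dataset key are hypothetical, and the entry follows LLaMA-Factory's ShareGPT-style multimodal convention, which you should verify against the actual output of `preprocessData.py` and LLaMA-Factory's data documentation.

```python
import json
from pathlib import Path

# Hypothetical names: adjust to the file actually written by preprocessData.py
# and to your LLaMA-Factory checkout.
merged_file = "agentcpm_gui_state_cot_merged.json"
dataset_info = Path("LLaMA-Factory/data/dataset_info.json")

info = json.loads(dataset_info.read_text(encoding="utf-8"))
info["state_cot_merged"] = {
    "file_name": merged_file,                 # place the file under LLaMA-Factory/data/
    "formatting": "sharegpt",                 # assumes ShareGPT-style conversations
    "columns": {"messages": "conversations", "images": "images"},
}
dataset_info.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
```

The dataset can then be referenced by its registered name (here, `state_cot_merged`) in the LLaMA-Factory training configuration.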
Evaluation scripts are provided in the evaluator directory. Hyperparameters are configured via YAML files.
Example: Evaluate UI-TARS-7B on State Control Benchmark
```yaml
testJsonPath: "test Json path in data/GUIState/uitars"
modelPath: "path to the agent model"
devicesIDs: "CUDA device IDs for evaluation, such as [0,1,2,3]"
agentCount: 4 # number of evaluation processes
agentType: uitars
max_new_tokens: 512
benchmarkSetting: high
type: state_action # see (./evaluator/evaluators.py)
recordSavePath: uitars_state_action_predict_test.json # record file name saved in ./analyses
```

Example: Evaluate OS-Atlas-7B on AndroidControl-H Benchmark
```yaml
testJsonPath: "test Json path in data/GUIAgentic/android_control/atlas/"
modelPath: "path to the agent model"
devicesIDs: "CUDA device IDs for evaluation, such as [0,1,2,3]"
agentCount: 4 # number of evaluation processes
agentType: atlas
max_new_tokens: 512
benchmarkSetting: high
type: android_control
recordSavePath: atlas_android_control_high_action_predict_test.json # record file name saved in ./analyses
```

To evaluate, run:
```bash
python evaluate.py --config <path to config yaml>
```

To further assess real-world applicability, we construct a dynamic evaluation benchmark consisting of 20 real-world toggle control tasks. The benchmark is implemented on the Android Studio emulator and built upon the AndroidWorld framework, enabling evaluation under dynamic and realistic mobile environments. See the README for more details.
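For intuition on how the state control benchmark's labels can be checked, here is a simplified sketch we wrote (not the repository's evaluator; see `./evaluator/evaluators.py` for the real logic): for the positive instruction the predicted click should fall inside the target `bbox`, and for the negative instruction the prediction should be `COMPLETE`. The actual evaluator may instead rely on `clickCoordinate` or `useBbox`.

```python
import re

# Matches OS-Atlas-style click actions such as "CLICK <point>[[864.4, 410.5]]</point>".
CLICK_PATTERN = re.compile(r"CLICK <point>\[\[([\d.]+),\s*([\d.]+)\]\]</point>")

def parse_click(action: str):
    """Return the (x, y) point of a CLICK action, or None if it is not a click."""
    match = CLICK_PATTERN.fullmatch(action.strip())
    return (float(match.group(1)), float(match.group(2))) if match else None

def correct_positive(prediction: str, record: dict) -> bool:
    """Positive instruction: the agent should click inside the target element's bbox."""
    point = parse_click(prediction)
    if point is None:
        return False
    x1, y1, x2, y2 = record["bbox"]  # normalized to [0, 1000], same scale as the prediction
    return x1 <= point[0] <= x2 and y1 <= point[1] <= y2

def correct_negative(prediction: str) -> bool:
    """Negative instruction: the toggle is already in the desired state, so no click."""
    return prediction.strip() == "COMPLETE"

# Sanity check with the example record above: its own labels score as correct.
record = {"bbox": [814.4, 360.5, 914.4, 460.5]}
print(correct_positive("CLICK <point>[[864.4, 410.5]]</point>", record))  # True
print(correct_negative("COMPLETE"))                                       # True
```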
This work could not have been done without the help of the following repositories:
If you find this work useful, please consider citing:
```bibtex
@article{wu2025see,
  title={See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles},
  author={Zongru Wu and Rui Mao and Zhiyuan Tian and Pengzhou Cheng and Tianjie Ju and Zheng Wu and Lingzhong Dong and Haiyue Sheng and Zhuosheng Zhang and Gongshen Liu},
  year={2025},
  journal={arXiv preprint arXiv:2509.13615},
  url={https://arxiv.org/abs/2509.13615},
}
```