This repository is the code implementation of our paper:
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
We propose a pivot task named query inference to smooth coordinate-oriented grounding and action-oriented reasoning, improving the performance of MLLM-powered GUI agents in resource-constrained scenarios.
- We have released the code for data preprocessing, query refinement, and evaluation.
- We have released the training data in sharegpt format.
- The training for grounding, query inference, and reasoning is based on LLaMA-Factory. Please refer to the LLaMA-Factory repository for instructions on setting up the training environment.
- After setting up LLaMA-Factory for training, install additional requirements by running:
```bash
pip3 install -r requirements.txt
```

- We have released the JSON training data in LLaMA-Factory-supported sharegpt format, along with `dataset_info.json` for LLaMA-Factory training, in the `./data/` directory. Please unzip `./data/GUI.zip` using the following commands to obtain the JSON data. The UIBERT dataset for query inference is also provided in `./data/GUI/uibert_query_summary_polished_preprocessed_train.json`:
```bash
zip -s 0 GUI.zip --out combined.zip
unzip combined.zip
```
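To verify the extraction, you can load one of the released files and inspect a sample. This is a minimal sketch, assuming the file is a JSON list of sharegpt-style records:

```python
import json

# Load the query-inference training data obtained from the unzip step above.
with open("./data/GUI/uibert_query_summary_polished_preprocessed_train.json") as f:
    data = json.load(f)

print(f"{len(data)} training samples")
# Print a truncated view of the first record to check the sharegpt layout.
print(json.dumps(data[0], indent=2, ensure_ascii=False)[:500])
```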
The original datasets can be downloaded as follows:
- UIBERT: a subset of the OS-Atlas grounding dataset; refer to the OS-Atlas release for downloading.
- AITZ: a mobile agent dataset derived from a subset of AITW and annotated by proprietary MLLMs with CoAT components; refer to the official release for downloading.
- AndroidControl: a mobile agent dataset comprising 15,283 demonstrations with step-by-step instructions; refer to the official release for downloading the TFRecord files and extracting the image files.
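For AndroidControl specifically, the episodes ship as TFRecord shards. The snippet below is only a hedged sketch of how to iterate the raw records with TensorFlow before parsing them with the schema that accompanies the dataset release; the file pattern and GZIP compression are assumptions you may need to adjust:

```python
import tensorflow as tf

# Iterate the raw AndroidControl TFRecord shards. Each record is a serialized
# episode; parse it with the schema definition shipped with the dataset.
files = tf.io.gfile.glob("./GUIData/android_control_parsed/android_control*")
dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")  # assumption

for i, raw_record in enumerate(dataset.take(3)):
    print(f"record {i}: {len(raw_record.numpy())} bytes")
```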
After downloading the original datasets, adjust the folder structure as follows:
- UIBERT: Move all images to `./GUIData/UIBert/screenshots`.
- AITZ: Move all images to `./GUIData/android_in_the_zoo/train/images` and `./GUIData/android_in_the_zoo/test/images` according to the original split setting.
- AndroidControl: Move the original dataset to `./GUIData/android_control_parsed` and run the following command for processing:

```bash
python mergeAndroidControl.py
```

After adjusting the folder structure, it should look like this:
```
.
└── GUIData
    ├── UIBert
    │   ├── screenshots
    │   │   └── image_files
    │   └── uibert_raw.json
    ├── android_control
    │   ├── images
    │   │   └── image_files
    │   ├── jsons
    │   │   └── json_files
    │   └── layouts
    │       └── accessibility_tree_files
    └── android_in_the_zoo
        ├── test
        │   ├── images
        │   │   └── image_files
        │   └── jsons
        │       └── json_files
        └── train
            ├── images
            │   └── image_files
            └── jsons
                └── json_files
```
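Before running any preprocessing script, a quick sanity check of the layout can save a failed run. This is a minimal sketch that only verifies the paths from the tree above exist:

```python
import os

# Expected paths, taken directly from the directory tree above.
expected = [
    "./GUIData/UIBert/screenshots",
    "./GUIData/UIBert/uibert_raw.json",
    "./GUIData/android_control/images",
    "./GUIData/android_control/jsons",
    "./GUIData/android_control/layouts",
    "./GUIData/android_in_the_zoo/train/images",
    "./GUIData/android_in_the_zoo/train/jsons",
    "./GUIData/android_in_the_zoo/test/images",
    "./GUIData/android_in_the_zoo/test/jsons",
]
for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:8s} {path}")
```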
We provide the code implementation for constructing UIBERT for query inference. The detailed process is as follows:
- Prepare the original UIBERT dataset and adjust the folder structure as mentioned above.
- Configure the API key for Qwen-VL-Max in `refineQuery.py`:

```python
client = OpenAI(
    api_key="your key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
```

- Run `python refineQuery.py` to construct the refined dataset. The results will be saved at `./data/GUI/uibert_query_summary_polished_preprocessed.json`.
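For reference, a single refinement request through the OpenAI-compatible DashScope endpoint might look like the sketch below. The model name, prompt, and example image path are illustrative assumptions, not the exact ones used in `refineQuery.py`:

```python
import base64
from openai import OpenAI

client = OpenAI(
    api_key="your key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Encode a UIBert screenshot for the multimodal request (hypothetical file).
with open("./GUIData/UIBert/screenshots/example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen-vl-max",  # assumption: the Qwen-VL-Max endpoint on DashScope
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Rewrite the raw description of the highlighted UI "
                     "element into a concise, unambiguous user query."},
        ],
    }],
)
print(response.choices[0].message.content)
```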
Example triplets:
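The record below is a purely illustrative sketch of such a triplet (screenshot, target element, refined query) in sharegpt format; the prompt wording, coordinates, and file name are all hypothetical, so consult the released JSON for the actual fields:

```json
{
  "conversations": [
    {"from": "human", "value": "<image>What query would a user issue to operate the element at (540, 1210)?"},
    {"from": "gpt", "value": "Open the settings page."}
  ],
  "images": ["UIBert/screenshots/0001.png"]
}
```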
- For constructing AITZ, run `python preprocessAITZActionPredictAblations.py`. The results will be saved at `./data/GUI/aitz_ablation_[1-10]_action_predict_test.json` and `./data/GUI/aitz_ablation_[1-10]_action_predict_train.json`. For the meaning of the ID in the filenames, please refer to our paper.
- For constructing AndroidControl, run `python preprocessAndroidControlActionPredictHigh.py` and `python preprocessAndroidControlActionPredictLow.py`. The results will be saved at `./data/GUI/android_control_[high/low]_action_predict_test.json` and `./data/GUI/android_control_[high/low]_action_predict_train.json`.
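To confirm the preprocessing output, you can count the records in each generated split. A minimal sketch, assuming the filename patterns above and JSON lists of samples:

```python
import glob
import json

# Report the sample count of every generated action-prediction split.
for path in sorted(glob.glob("./data/GUI/*_action_predict_*.json")):
    with open(path) as f:
        print(f"{len(json.load(f)):6d} samples  {path}")
```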
All JSON datasets ending in `train` will be used for training, and those ending in `test` will be used for evaluation.
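If you register the data with LLaMA-Factory yourself, an entry in `dataset_info.json` for one train split might look like the sketch below; the dataset key and column mapping follow LLaMA-Factory's sharegpt schema but are assumptions here, since the released `dataset_info.json` in `./data/` is authoritative. Training then proceeds through LLaMA-Factory as usual (e.g., `llamafactory-cli train` with your config):

```json
"aitz_ablation_1_action_predict_train": {
  "file_name": "GUI/aitz_ablation_1_action_predict_train.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "images": "images"
  }
}
```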
After acquiring the fine-tuned model for reasoning, configure and run `evaluate.py` to obtain the evaluation results on the two mobile agent benchmarks.
Detailed argument settings:
```bash
python evaluate.py \
    --type [acactioneval for AndroidControl, aitzablationactioneval for AITZ] \
    --testJsonPath [the path to the test split JSON dataset] \
    --log [the path to save the result record] \
    --modelPath [the path to the fine-tuned transformer model for reasoning]
```

For the detailed implementation of the evaluation, refer to `./eval/EvalAgentAction.py`.
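For example, evaluating the high-level AndroidControl split could look like the following; the log and model paths are illustrative:

```bash
python evaluate.py \
    --type acactioneval \
    --testJsonPath ./data/GUI/android_control_high_action_predict_test.json \
    --log ./logs/ac_high_eval.log \
    --modelPath ./saves/reasoning_model
```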
This work could not have been done without the help of the following repos:
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- flash-attn: https://github.com/Dao-AILab/flash-attention
```bibtex
@article{wu2025smoothing,
  title={Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks},
  author={Wu, Zongru and Cheng, Pengzhou and Wu, Zheng and Ju, Tianjie and Zhang, Zhuosheng and Liu, Gongshen},
  journal={arXiv preprint arXiv:2503.00401},
  year={2025}
}
```