Skip to content

ZrW00/GUIPivot

Repository files navigation

GUIPivot

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

arXiv

This repository is the code implementation of our paper:

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

We propose a pivot task name query inference to smooth coordinate-oriented grounding and action-oriented reasoning, imporoving the performance of MLLM-powered GUI agents in resource-constrained scenarios.

🆙 Updates

  • We have released the code for data preprocessing, query refinement, and evaluation.
  • We have released the training data in sharegpt format.

Dependencies

  • The training for grounding, query inference, and reasoning is based on LLaMA-Factory. Please refer to the LLaMA-Factory repo for installation instructions for the training environment.
  • After setting up LLaMA-Factory for training, install additional requirements by running:
    pip3 install -r requirements.txt

Training Data

    zip -s 0 GUI.zip --out combined.zip
    unzip combined.zip
  • The original datasets can be downloaded as follows:

    • UIBERT: UIBERT is a subset of the OS-Atlas grounding dataset. Refer to the above link for downloading.
    • AITZ: AITZ is a mobile agent dataset derived from a subset of AITW and annotated by proprietary MLLMs for CoAT components. Refer to the above link for downloading.
    • AndroidControl: AndroidControl is a mobile agent dataset comprising 15,283 demonstrations with step-by-step instructions. Refer to the above link for downloading the TFRecord file and extracting the image files.
  • After downloading the original datasets, adjust the folder structure as follows:

    python mergeAndroidControl.py
  • After adjusting the folder structure, it should look like this:
.
└── GUIData
    ├── UIBert
    │   ├── screenshots
    │   │   └── image_files
    │   └── uibert_raw.json
    ├── android_control
    │   ├── images
    │   │   └── image_files
    │   ├── jsons
    │   │   └── json_files
    │   └── layouts
    │       └── accessibility_tree_files
    └── android_in_the_zoo
        ├── test
        │   ├── images
        │   │   └── image_files
        │   └── jsons
        │       └── json_files
        └── train
            ├── images
            │   └── image_files
            └── jsons
                └── json_files

Constructing Dataset for Query Inference

We provide the code implementation for constructing UIBERT for query inference. The detailed process is as follows:

  1. Prepare the original UIBERT dataset and adjust the folder structure as mentioned above.
  2. Configure the API key for Qwen-VL-Max in refineQuery.py:
client = OpenAI(
    api_key="your key", 
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
  1. Run python refineQuery.py to construct the refined dataset. The results will be saved at ./data/GUI/uibert_query_summary_polished_preprocessed.json.

Example triplets $\langle s, q_r, c \rangle$ from the refined UIBERT dataset, along with the original query $q$ are provided as follows:

Construct JSON Datasets in Sharegpt Format

  • For constructing AITZ, run python preprocessAITZActionPredictAblations.py. The results will be saved at ./data/GUI/aitz_ablation_[1-10]_action_predict_test.json and ./data/GUI/aitz_ablation_[1-10]_action_predict_train.json. For the meaning of the ID in the filenames, please refer to our paper.
  • For constructing AndroidControl, run python preprocessAndroidControlActionPredictHigh.py and python preprocessAndroidControlActionPredictLow.py. The results will be saved at ./data/GUI/android_control_[high/low]_action_predict_test.json and ./data/GUI/android_control_[high/low]_action_predict_train.json.

All JSON datasets ending in train will be used for training, and those ending in test will be used for evaluation.

Evaluation

After acquiring the fine-tuned model for reasoning, configure and run evaluate.py to obtain the evaluation results on two mobile agnet benchmarks.

Detailed arguments setting:

python evaluate.py \
    --type [acactioneval for AndroidControl, aitzablationactioneval for AITZ] \
    --testJsonPath [the path to the test split JSON dataset] \
    --log [the path to save the result record] \
    --modelPath [the path to the fine-tuned transformer model for reasoning]

For the detailed implementation of the evaluation, refer to ./eval/EvalAgentAction.py.

Acknowledgement

This work can not be done without the help of the following repos:

Citation

@article{wu2025smoothing,
  title={Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks},
  author={Wu, Zongru and Cheng, Pengzhou and Wu, Zheng and Ju, Tianjie and Zhang, Zhuosheng and Liu, Gongshen},
  journal={arXiv preprint arXiv:2503.00401},
  year={2025}
}

About

Repo for GUI pivot

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages