paper: DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
DaMo is a novel solution for predicting optimal data mixtures in multitask supervised fine-tuning of multimodal large language models (MLLMs).
```shell
pip install uv
cd DaMo
uv sync
```
First, you need to sample random proportions. Run the following code with the specified number_of_training_datasets and batch_size to obtain all possible data mixture proportion values over the complete search space:
```shell
uv run python get_P_fix.py
```
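Conceptually, enumerating the candidate mixtures means walking a grid over the probability simplex. The sketch below assumes a fixed grid granularity; the actual granularity used by get_P_fix.py may differ.

```python
from itertools import product

def enumerate_mixtures(n_datasets, step=0.25):
    """Enumerate all mixture proportions on a grid over the probability
    simplex: each dataset's share is a multiple of `step`, and the shares
    sum to 1. Illustrates the idea behind get_P_fix.py; the real grid
    granularity is an assumption here."""
    n_units = round(1 / step)  # grid units that must sum to 1
    mixtures = []
    for combo in product(range(n_units + 1), repeat=n_datasets):
        if sum(combo) == n_units:
            mixtures.append(tuple(c * step for c in combo))
    return mixtures

# 3 datasets on a 0.25 grid -> C(4+3-1, 3-1) = 15 candidate mixtures
print(len(enumerate_mixtures(3, step=0.25)))  # -> 15
```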
Randomly sample a small number of data mixture proportion values, then train and evaluate the MLLM with mllm_train_and_eval.py. For the evaluation process of PhoneAgentBench, see phoneAgentBench. We provide a small set of experimental sample points for fitting the MLP model in DaMo/src/processed_data_random_50.xlsx.
To predict the downstream task performance of the MLLM after training on unseen data mixtures, use the existing experimental sample points to fit an MLP model. The MLP takes the data mixture proportions and the training step as input and outputs the downstream task performance. Fit and evaluate the MLP model:
```shell
uv run python mlp_regressor.py
```
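The regression itself can be sketched as a small MLP mapping (proportions, step) to a performance score. The architecture, hyperparameters, and toy sample points below are illustrative, not the repo's actual choices in mlp_regressor.py.

```python
import numpy as np

def fit_mlp(X, y, hidden=16, lr=0.05, epochs=2000, seed=0):
    """Minimal one-hidden-layer MLP regressor trained with full-batch
    gradient descent on MSE. Sketches the mapping learned by
    mlp_regressor.py: (mixture proportions, training step) -> score."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));          b2 = np.zeros(1)
    n = len(y)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                 # forward pass
        pred = (h @ W2 + b2).ravel()
        err = pred - y                           # d(0.5*MSE)/d(pred) * n
        gW2 = h.T @ err[:, None] / n; gb2 = err.mean(keepdims=True)
        dh = (err[:, None] @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ dh / n; gb1 = dh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    return lambda Xq: (np.tanh(Xq @ W1 + b1) @ W2 + b2).ravel()

# toy sample points: [p1, p2, p3, training step] -> performance
X = np.array([[0.5, 0.3, 0.2, 0.5], [0.2, 0.5, 0.3, 1.0],
              [0.3, 0.3, 0.4, 0.7], [0.6, 0.2, 0.2, 1.0]])
y = np.array([0.61, 0.70, 0.66, 0.64])
model = fit_mlp(X, y)
```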
Based on the fitted MLP model, predict the downstream task performance of the MLLM after training on unseen data mixtures and obtain the optimal data mixture proportion:
```shell
uv run predict.py
```
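Selecting the optimum amounts to scoring every candidate mixture with the fitted model and keeping the argmax. The sketch below accepts any callable predictor; the grid granularity and the toy predictor are assumptions for illustration, not what predict.py actually uses.

```python
import numpy as np
from itertools import product

def best_mixture(predict, n_datasets=3, grid=0.1, train_step=1.0):
    """Exhaustively score every grid mixture with a fitted predictor and
    keep the argmax. `predict` maps [p1, ..., pn, step] -> performance."""
    units = round(1 / grid)
    best, best_score = None, -np.inf
    for combo in product(range(units + 1), repeat=n_datasets):
        if sum(combo) != units:
            continue                 # keep only points on the simplex
        mix = [c * grid for c in combo]
        score = float(predict(np.array(mix + [train_step])))
        if score > best_score:
            best, best_score = tuple(mix), score
    return best, best_score

# toy predictor that peaks when the first dataset gets 40% of the mix
toy = lambda x: -((x[0] - 0.4) ** 2)
mix, score = best_mixture(toy)
print(round(mix[0], 2))  # -> 0.4
```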
Train the MLLM using the predicted optimal data mixture proportion and evaluate its performance on the downstream task as in Step 2.
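Assembling the final training set from the predicted proportions can be sketched as sampling from each source dataset in proportion to its share. This is illustrative only: the real pipeline mixes the actual SFT corpora, and the rounding and with-replacement sampling here are assumed choices.

```python
import random

def build_mixture(datasets, proportions, total, seed=0):
    """Draw `total` training examples across the source datasets
    according to the predicted optimal mixture proportions."""
    rng = random.Random(seed)
    mixed = []
    for data, p in zip(datasets, proportions):
        k = round(p * total)                  # examples owed to this source
        mixed.extend(rng.choices(data, k=k))  # sample with replacement
    rng.shuffle(mixed)                        # interleave the sources
    return mixed
```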
We develop a novel benchmark suite specifically designed for mobile phone agents. This suite encompasses six carefully curated datasets focusing on key mobile phone application tasks, thereby offering a holistic assessment of phone agents' performance across diverse capabilities critical to real-world mobile applications.
- Multimodal Task Planning (MT-Plan)
MT-Plan is designed to evaluate multimodal task planning capabilities in phone agent scenarios. It takes <image + query> as input and outputs a plan structured as a directed acyclic graph (DAG).
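A DAG-structured plan can be represented as a dependency mapping from each sub-task to its prerequisites, from which a valid execution order follows by topological sort. The task names and schema below are invented for illustration; the benchmark defines its own format.

```python
from graphlib import TopologicalSorter

# Hypothetical MT-Plan-style output: each key is a sub-task and each
# value lists the sub-tasks it depends on (its DAG predecessors).
plan = {
    "open_camera": [],
    "take_photo": ["open_camera"],
    "open_gallery": [],
    "share_photo": ["take_photo", "open_gallery"],
}

# Any valid execution order must respect every dependency edge.
order = list(TopologicalSorter(plan).static_order())
print(order)
```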
Download embedding model from BAAI/bge-large-zh. Download MLLM checkpoint from InternVL2_5-4B, Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, InternVL3-14B or your SFT checkpoint. Modify the configuration in eval_mt-plan.py to specify the paths for the MLLM, embedding model, and dataset. Run the following code to obtain the model's predictions and evaluation results on the MT-Plan task.
```shell
uv run python eval_mt-plan.py
```
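Since the setup downloads an embedding model (BAAI/bge-large-zh), a plausible use is matching predicted plan nodes against reference nodes by embedding similarity. The sketch below shows only the cosine-similarity step on stand-in vectors; the actual scoring logic in eval_mt-plan.py may differ.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors, e.g. a predicted
    plan node and its reference node. The vectors here are placeholders
    for real BGE model embeddings."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pred_emb = [0.2, 0.9, 0.1]    # stand-in embedding of a predicted node
ref_emb = [0.25, 0.85, 0.2]   # stand-in embedding of the reference node
print(cosine_sim(pred_emb, ref_emb))
```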