This repository contains a script for training the Phi3-V model with Parameter-Efficient Fine-Tuning (PEFT) techniques under a variety of configurations. Supported features:
- Training on the mixture of NLP data and vision-language data
- Flexible selection of LoRA target modules
- DeepSpeed ZeRO-2
- DeepSpeed ZeRO-3
- PyTorch FSDP
- Gradient checkpointing (only compatible with ZeRO-3 for now)
- QLoRA
- Disable/enable Flash Attention 2
Install the required packages using either `requirements.txt` or `environment.yml`.
pip install -r requirements.txt
conda env create -f environment.yml
conda activate phi3v
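Before launching a run, a quick import check can confirm the core dependencies are in place. This is a minimal sketch, not part of the repository; exact versions are pinned in `requirements.txt` / `environment.yml`, and `flash_attn` is only needed when Flash Attention 2 is enabled:

```python
# Minimal environment sanity check (a sketch, not part of the repository).
import torch
import transformers
import peft

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)

try:
    # flash_attn is only required when Flash Attention 2 is enabled.
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; pass --disable_flash_attn2 when training")
```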
Before training, download the Phi3-V model from Hugging Face. It is recommended to use the `huggingface-cli` for this.
- Install the HuggingFace CLI:
pip install -U "huggingface_hub[cli]"
- Download the model:
huggingface-cli download microsoft/Phi-3-vision-128k-instruct --local-dir Phi-3-vision-128k-instruct --resume-download
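Optionally, verify that the downloaded checkpoint loads before starting a long training run. The snippet below is a minimal sketch (not part of the repository); Phi-3-Vision ships custom modeling code, so `trust_remote_code=True` is required:

```python
# Optional load test for the downloaded checkpoint (a sketch, not part of the repository).
from transformers import AutoConfig, AutoProcessor

model_path = "Phi-3-vision-128k-instruct"  # the --local-dir used above

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
print("Loaded config for:", config.model_type)
print("Processor:", type(processor).__name__)
```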
To run the training script, use the following command:
bash scripts/train.sh
Note: Remember to replace the paths in `train.sh` with your own paths. The script accepts the following arguments:
- `--data_path` (str): Path to the LLaVA-formatted training data (a JSON file). (Required)
- `--image_folder` (str): Path to the folder containing the images referenced in the training data. (Required)
- `--model_id` (str): Path to the Phi3-V model. (Required)
- `--proxy` (str): Proxy settings (default: None).
- `--output_dir` (str): Output directory for model checkpoints (default: "output/test_train").
- `--num_train_epochs` (int): Number of training epochs (default: 1).
- `--per_device_train_batch_size` (int): Training batch size per GPU per forward pass.
- `--gradient_accumulation_steps` (int): Number of gradient accumulation steps (default: 4).
- `--deepspeed_config` (str): Path to the DeepSpeed config file (default: "scripts/zero2.json").
- `--num_lora_modules` (int): Number of target modules to add LoRA to (-1 means all layers).
- `--lora_namespan_exclude` (str): Exclude modules whose names contain these namespans from LoRA.
- `--max_seq_length` (int): Maximum sequence length (default: 3072).
- `--quantization` (flag): Enable quantization.
- `--disable_flash_attn2` (flag): Disable Flash Attention 2.
- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none'; default: 'tensorboard').
- `--logging_dir` (str): Logging directory (default: "./tf-logs").
- `--lora_rank` (int): LoRA rank (default: 128).
- `--lora_alpha` (int): LoRA alpha (default: 256).
- `--lora_dropout` (float): LoRA dropout (default: 0.05).
- `--logging_steps` (int): Logging interval in steps (default: 1).
- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).
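For orientation, the LoRA-related flags map roughly onto a `peft` `LoraConfig` like the one below. This is an illustrative sketch, not the script's actual code; the real target modules are selected inside the training script via `--num_lora_modules` and `--lora_namespan_exclude`, so the module names here are hypothetical:

```python
# Illustrative mapping of the LoRA flags onto a peft LoraConfig (not the script's actual code).
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,              # --lora_rank
    lora_alpha=256,     # --lora_alpha
    lora_dropout=0.05,  # --lora_dropout
    # The training script builds target_modules itself from --num_lora_modules
    # and --lora_namespan_exclude; the names below are hypothetical examples.
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```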
The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder`.
Example Dataset
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
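Before starting a long run, it is worth checking that every image referenced in the JSON actually exists under `--image_folder`. The snippet below is a small sanity-check sketch (not part of the repository); the file paths are placeholders for your own `--data_path` and `--image_folder` values:

```python
# Sanity-check that every referenced image exists (a sketch, not part of the repository).
import json
import os

data_path = "data/train.json"   # your --data_path
image_folder = "data/images"    # your --image_folder

with open(data_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

missing = [
    sample["image"]
    for sample in samples
    if "image" in sample and not os.path.exists(os.path.join(image_folder, sample["image"]))
]
print(f"{len(samples)} samples checked, {len(missing)} missing images")
```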
- Add support for DeepSpeed ZeRO-3.
- Add support for FSDP
- Add support for simultaneously finetuning the `img_projector`
- Add support for full finetuning
- Add support for grounded finetuning
- Add support for multi-image finetuning
- More advanced PEFT methods (e.g., DoRA)
- FSDP with ActivationCheckpointing Wrapper
- Integration with Chuanhu Chat
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
This project borrows code from LLaVA and Microsoft's Phi-3-vision-128k-instruct. Thanks to both projects for their contributions.
If you use this codebase in your work, please cite this project:
@misc{phi3vfinetuning2023,
author = {Gai Zhenbiao and Shao Zhenwei},
title = {Phi3V-Finetuning},
year = {2023},
publisher = {GitHub},
url = {https://github.com/GaiZhenbiao/Phi3V-Finetuning},
note = {GitHub repository},
}