
Llama 3.2-11B-vision fully fine-tuned model file question #727

Open
Kidand opened this issue Oct 15, 2024 · 2 comments
Kidand commented Oct 15, 2024

LoRA fine-tuning worked fine, but the following issue came up during full-parameter fine-tuning.

I use the following script for full fine-tuning:

#!/bin/bash

NNODES=1
NPROC_PER_NODE=4
LR=1e-5
NUM_EPOCHS=1
BATCH_SIZE_TRAINING=2
MODEL_NAME="/xxx/models--meta-llama--Llama-3.2-11B-Vision-Instruct/snapshots/075e8feb24b6a50981f6fdc161622f741a8760b1"
DIST_CHECKPOINT_ROOT_FOLDER="./finetuned_model"
DIST_CHECKPOINT_FOLDER="fine-tuned"
DATASET="custom_dataset"
CUSTOM_DATASET_TEST_SPLIT="test"
CUSTOM_DATASET_FILE="recipes/quickstart/finetuning/datasets/xxx_dataset.py"
RUN_VALIDATION=True
BATCHING_STRATEGY="padding"
OUTPUT_DIR="finetune/output"

torchrun --master_port 12412 \
         --nnodes $NNODES \
         --nproc_per_node $NPROC_PER_NODE \
         recipes/quickstart/finetuning/finetuning.py \
         --enable_fsdp \
         --lr $LR \
         --num_epochs $NUM_EPOCHS \
         --batch_size_training $BATCH_SIZE_TRAINING \
         --model_name $MODEL_NAME \
         --dist_checkpoint_root_folder $DIST_CHECKPOINT_ROOT_FOLDER \
         --dist_checkpoint_folder $DIST_CHECKPOINT_FOLDER \
         --use_fast_kernels \
         --dataset $DATASET \
         --custom_dataset.test_split $CUSTOM_DATASET_TEST_SPLIT \
         --custom_dataset.file $CUSTOM_DATASET_FILE \
         --run_validation $RUN_VALIDATION \
         --batching_strategy $BATCHING_STRATEGY \
         --output_dir $OUTPUT_DIR

The model was not saved to the finetune/output folder I specified; instead, the checkpoint directory contains the following files, which I cannot use for inference:

ls
__0_0.distcp  __1_0.distcp  __2_0.distcp  __3_0.distcp  train_params.yaml

How can I save the fully fine-tuned weights to the path I specified, in the standard transformers format, so that I can run inference on them?

@wukaixingxp
Contributor

Hi! We are working on a model conversion script in this PR; you can try it as a temporary solution. Thanks!
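
In the meantime, here is a minimal sketch of one possible interim workaround (this is not the PR's script): recent PyTorch releases (2.3+) ship torch.distributed.checkpoint.format_utils, which can consolidate the sharded __*_*.distcp files into a single torch .pt checkpoint. The paths below are placeholders for wherever your shards were written.

# Sketch only: consolidate the sharded __N_0.distcp files into one .pt file.
# Assumes PyTorch >= 2.3 (torch.distributed.checkpoint.format_utils).
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_dir = "./finetuned_model/fine-tuned"          # placeholder: folder containing the __*_*.distcp shards
out_path = "./finetuned_model/consolidated.pt"    # placeholder: where to write the single-file checkpoint

dcp_to_torch_save(dcp_dir, out_path)
print(f"wrote {out_path}")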

@wukaixingxp
Contributor

The PR has been merged. Please try it and let me know if you have more questions.
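
For completeness, a rough sketch (again, not the merged conversion script) of re-saving a consolidated checkpoint in the standard transformers layout. The top-level key of the loaded dict depends on how the trainer wrapped the state dict, so inspect it before loading.

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

base = "/xxx/models--meta-llama--Llama-3.2-11B-Vision-Instruct/snapshots/075e8feb24b6a50981f6fdc161622f741a8760b1"
ckpt = torch.load("./finetuned_model/consolidated.pt", map_location="cpu")

# The weights may sit under a wrapper key such as "model"; check ckpt.keys() first.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

model = MllamaForConditionalGeneration.from_pretrained(base, torch_dtype=torch.bfloat16)
model.load_state_dict(state_dict, strict=False)  # strict=False only to tolerate wrapper-key mismatches

model.save_pretrained("finetune/output")
AutoProcessor.from_pretrained(base).save_pretrained("finetune/output")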
