Checkpoint feature via steps instead of epoch #724
Comments
Hi @mylesgoose, I think that could be a great idea. Can you share a bit how the interface would look after this integration?
Well, I actually implemented it above. :-) @mreso, I had to change quite a lot of things, so this would need to be tested by others before a pull request. This is the command I ran it with:

```bash
torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning.py \
--enable_fsdp \
--lr 1e-5 \
--num_epochs 1 \
--batch_size_training 2 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
--dist_checkpoint_root_folder ./finetuned_model \
--dist_checkpoint_folder ./finetuned_model \
--use_fast_kernels True \
--dataset "custom_dataset" \
--custom_dataset.test_split "test" \
--custom_dataset.file "/home/myles/llama-recipes/recipes/quickstart/finetuning/datasets/json_dataset.py" \
--run_validation True \
--batching_strategy padding \
--use_wandb True \
--gradient_accumulation_steps 1 \
--checkpoint_interval 5 \
--max_checkpoints_to_keep 2 \
--context_length 4096 \
--gradient_clipping False \
--gradient_clipping_threshold 1.0 \
--max_train_step 0 \
--max_eval_step 0 \
--num_workers_dataloader 16 \
--weight_decay 0.0 \
--gamma 0.85 \
--seed 42 \
--use_fp16 False \
--mixed_precision True \
--val_batch_size 1 \
--peft_method "lora" \
--use_peft False \
--from_peft_checkpoint "" \
--output_dir "./finetuned_model" \
--freeze_layers False \
--num_freeze_layers 1 \
--quantization None \
--one_gpu False \
--save_model True \
--save_optimizer True \
--save_metrics True \
--flop_counter False \
--flop_counter_start 3 \
--use_profiler False \
--profiler_dir "./finetuned_model/profiler/results"
```

And the ASCII-art banner function I added:

```python
import torch
import torch.distributed

llama_art_printed = False  # module-level flag so the art is only printed once

def display_llama_art():
    global llama_art_printed
    # Only print on rank 0, and only when a process group is actually initialized.
    if not llama_art_printed and (
        not torch.distributed.is_available()
        or not torch.distributed.is_initialized()
        or torch.distributed.get_rank() == 0
    ):
        llama_art = r"""
.
+=-
:*#+: :==
:#%#+. :+*#+
+%#+=---------+*#%#:
=+==+++====-=+*%#:
:-==+=+*++==---:
-=%+=--::+%==-.
.+@:---::@+%=:
.:::*%#*:.: .
.:-:.-+*#=..;
.:-:_+*+_....
:::------:.:.
.--======--:
.--=++++=--. .:
.:-==++++=-:. ....:::.......:=-::.
.:-==+++=-:.:--=========------::--===:
.::-==++=-:::-==+==++=====--===---::=++-.
.::-===+=--:-===++==++====---=-:---::=+-.
.:---+++=---===++++-======-=--:+----:.=+:
.::--=++=--====+++==--==-=---::+===-::==:
..:-============+=------=---:.+===+-:.=::
...::-============-:-:=----:.=++=+=-...:
..::--====+======-:--=---::-+++++-:.
...:::-:---=====::------:-+*++++=-
....::::::---=-:::------++*++==+:
.:-==--=-----:::::-===*#*++++=:
.:-=++----==-:.:--++=-+#####+=-
.-=-++-.:==+=: .:=++=-*%#**+-:
.:====::-=++=. :=++==---==--::
.:-+=-:-==++- :=++==+-:--:-:-.
:-===:=====-: :-+==*=-==-::...
.:=+=:====---: -+**++=-=++=-:...
.: :---:-====--: -=+*++*--*#+=::.
.: :=--:+##*+-:. ==++*+: .=*==-:.
.: :--::+##*- .==***= -+=-::
.: ::-.:=+-. .=++#= -++=:.
.: :--. -=: .=+++- :+*=:
.: --:. :=:. =+#*: =**-:
.: -=:. =+=: :+**= +**+.
.: ..:==: =++-.+**#*: :#*=.
.: =+++- +#*=. ..... -#+-:
.: .... -#*+-.
"""
        print(llama_art)  # Print the art
        llama_art_printed = True  # Set the flag to True
```
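For illustration, here is a minimal sketch of how a step-interval hook built around `--checkpoint_interval` could be called from the training loop. The helper name `maybe_save_step_checkpoint`, the `save_fn` callback, and the `-epoch{e}-step{s}` folder naming are assumptions for this sketch, not the exact code in the fork:

```python
import os

def maybe_save_step_checkpoint(model, optimizer, train_config, epoch, global_step, save_fn):
    """Call after each optimizer step; `save_fn` stands in for the existing
    checkpoint writer in llama-recipes (its exact signature is assumed here)."""
    interval = getattr(train_config, "checkpoint_interval", 0)
    if interval <= 0 or global_step == 0 or global_step % interval != 0:
        return None
    # Tag the folder with both epoch and global step so a specific point in
    # training is easy to resume from or convert later.
    folder = os.path.join(
        train_config.dist_checkpoint_root_folder,
        f"{train_config.dist_checkpoint_folder}-epoch{epoch}-step{global_step}",
    )
    save_fn(model, optimizer, folder)  # delegate to the existing checkpoint path
    return folder
```

Routing the actual write through `save_fn` keeps the interval logic independent of whether sharded FSDP, full-state, or PEFT checkpoints are being written.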
Great! Could you prepare the checkpointing pieces into a PR? Happy to review this.
@mreso I think that fork is fairly well prepared. However, I think it should be pulled into a separate development branch in your repo, as it still needs some work on resuming from the saved checkpoint, as discussed. I have tested running the saved checkpoint and converting it to HF, and it worked. Also, your convert-to-HF script did not work with the Llama vision models, so I made a new one; I have already opened a PR for that. I also don't use conda environments; I compiled everything from source and used the latest versions of things like CUDA and torch, so I will have to set up a conda env to test the PR and make it more reproducible with the packages listed in your requirements.txt file. The plan would then be to download your updated repo at main, modify the checkpoint file as per the one above, push to a new branch in my fork, and then open a pull request from that branch. You have had 5 pushes since I modified those files, so I will then be able to determine whether it is compatible with your changes.
Yes, separating the checkpointing changes from the dataset examples and testing within the right env would be a good idea. You can rebase your changes onto the current main if necessary. Let me know if you need help with that.
🚀 The feature, motivation and pitch
At the moment the script only saves checkpoints per epoch. For large datasets this is quite bad.
Alternatives
I created an alternative here.
Additional context
The script will now save at the specified interval during training and mark the files or folders according to the step and epoch. It also fixes some of the errors found in the original logic.
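As a complement, here is a minimal sketch of how the `--max_checkpoints_to_keep` retention could work, assuming step checkpoints are written to folders whose names end in `-step<N>` (the helper and the naming convention are assumptions, not the fork's exact implementation):

```python
import os
import re
import shutil

def prune_old_checkpoints(root_folder: str, max_to_keep: int) -> None:
    """Keep only the newest `max_to_keep` step checkpoints under root_folder."""
    pattern = re.compile(r"-step(\d+)$")
    step_dirs = []
    for name in os.listdir(root_folder):
        match = pattern.search(name)
        if match and os.path.isdir(os.path.join(root_folder, name)):
            step_dirs.append((int(match.group(1)), name))
    # Newest first; everything past the retention limit gets deleted.
    step_dirs.sort(reverse=True)
    for _, name in step_dirs[max_to_keep:]:
        shutil.rmtree(os.path.join(root_folder, name))
```

Pruning by parsing the step number out of the folder name keeps the retention logic decoupled from how the checkpoints themselves are written.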