How do I resume finetuning a model from the epoch at which it ended?

Hi. Is there any way to pick up from the epoch at which a previous training run finished, and resume training from that epoch up to a new, larger epoch count?

I recently finetuned LLaVA-v1.5-13B for 3 epochs and got results. Now I would like to gradually increase the number of epochs, first to 5 and then to more later.
Because 3 epochs took 45 steps, the checkpoint was saved as 'checkpoint-45', so I set the 'resume_from_checkpoint' argument to 'checkpoint-45'.

However, even though the previous run definitely finished at epoch 3, whenever I resume from that checkpoint, training always starts from epoch 1, not epoch 3. I am wondering if I set something up wrong.

Here is my training script for reference.

#!/bin/bash

deepspeed llava/train/train_xformers.py \
    --lora_enable True --lora_r 8 --lora_alpha 16 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path ./playground/data/floorplan_data/floorplan_vqa-train.json \
    --valid_data_path ./playground/data/floorplan_data/floorplan_vqa-valid.json \
    --image_folder ./playground/data/floorplan_data/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-13b \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 3 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --load_best_model_at_end True \
    --resume_from_checkpoint ./checkpoints/llava-v1.5-13b/checkpoint-45

I even tried setting 'model_name_or_path' to './checkpoints/llava-v1.5-13b/checkpoint-45', but that only gave me the error "ValueError: Target module Dropout(p=0.05, inplace=False) is not supported. Currently, only the following modules are supported: torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, transformers.pytorch_utils.Conv1D."

Again, when I continue finetuning, how can I make training resume from the epoch at which the previous run finished?
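
For reference, here is a minimal sketch (my own understanding of how the Hugging Face Trainer stores resume information, not LLaVA-specific code) that inspects the trainer_state.json saved inside the checkpoint folder; on resume, the Trainer reads the saved global step from this file and recomputes the epoch from it using the current steps-per-epoch:

import json
import os

# Checkpoint folder produced by the earlier 3-epoch run.
ckpt = "./checkpoints/llava-v1.5-13b/checkpoint-45"

# trainer_state.json records the global step and the (fractional) epoch reached.
with open(os.path.join(ckpt, "trainer_state.json")) as f:
    state = json.load(f)

print("global_step:", state["global_step"])  # expected: 45
print("epoch:", state["epoch"])              # expected: 3.0 under the original settings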

Thank you.


I think I have figured out why training started from epoch 1 instead of epoch 3.

It was because I had later changed the global batch size from 64 to 32 by reducing the gradient_accumulation_steps value from 8 to 4.

The Trainer tracks progress by step count rather than by epoch, so the epoch shown after resuming is derived from the saved global step and the current steps-per-epoch.

For anyone facing a similar issue when resuming from a checkpoint, be careful about changing the batch size (or gradient accumulation steps) between runs, as sketched below.
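
Here is a rough sketch of the arithmetic; the dataset size is my own back-of-the-envelope estimate from the numbers above, not an exact figure:

import math

def steps_per_epoch(dataset_size, global_batch_size):
    # Number of optimizer steps per epoch (rounding behaviour may differ slightly
    # depending on dataloader settings).
    return math.ceil(dataset_size / global_batch_size)

# Original run: 3 epochs took 45 steps at a global batch size of 64,
# i.e. 15 steps per epoch, so the dataset is roughly 15 * 64 = 960 samples.
dataset_size = 960  # illustrative estimate

print(steps_per_epoch(dataset_size, 64))  # 15 -> checkpoint-45 lines up with the end of epoch 3
print(steps_per_epoch(dataset_size, 32))  # 30 -> checkpoint-45 is only 1.5 epochs into the new run

Because the Trainer resumes by global step, checkpoint-45 corresponds to about 1.5 epochs under the new global batch size of 32, which is why the resumed run appeared to start around epoch 1 instead of epoch 3.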

I hope this helps.


I've never used the Trainer myself, but I wonder if the problem is somewhere around here…