How do I resume training a fine-tuned model from the epoch where it ended?

Hi. Is there any way to resume training from the epoch at which the previous run finished, and then continue until a new, larger epoch count?

I recently fine-tuned LLaVA-v1.5-13B for 3 epochs and got results. Now I would like to gradually increase the number of epochs, first to 5 and later beyond that.
Because 3 epochs took 45 steps, the checkpoint was saved as ‘checkpoint-45’, so I set the ‘resume_from_checkpoint’ argument to ‘checkpoint-45’.

However, although I am sure training ended at epoch 3, whenever I resume from that checkpoint, training starts from epoch 1, not 3. I am wondering if I set something wrong.

Here is my script for reference.

#!/bin/bash

deepspeed llava/train/train_xformers.py \
    --lora_enable True --lora_r 8 --lora_alpha 16 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path ./playground/data/floorplan_data/floorplan_vqa-train.json \
    --valid_data_path ./playground/data/floorplan_data/floorplan_vqa-valid.json \
    --image_folder ./playground/data/floorplan_data/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-13b \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 3 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --load_best_model_at_end True \
    --resume_from_checkpoint ./checkpoints/llava-v1.5-13b/checkpoint-45

I even tried setting ‘model_name_or_path’ to ‘./checkpoints/llava-v1.5-13b/checkpoint-45’, but that only gave me this error message: “ValueError: Target module Dropout(p=0.05, inplace=False) is not supported. Currently, only the following modules are supported: torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, transformers.pytorch_utils.Conv1D.”

Again, how can I make training continue from the epoch at which it previously finished when I resume fine-tuning?

Thank you.


I think I have figured out why training restarted from epoch 1 instead of epoch 3.

It was because I had later changed the global batch size from 64 to 32 by changing gradient_accumulation_steps from 8 to 4.

The resume position is calculated from the number of steps: checkpoint-45 stores a step count of 45, and with half the batch size each epoch takes about twice as many steps, so 45 steps no longer correspond to 3 full epochs.

For anyone facing a similar issue when resuming from a checkpoint, be careful about changing the batch size.
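
To make the arithmetic concrete, here is a minimal sketch. It is a simplification of how the Hugging Face Trainer derives the resume position from the saved step count, not its actual code. The steps-per-epoch values are assumptions inferred from the numbers in this thread (45 steps / 3 epochs = 15, and roughly double that after halving the batch size), and `resume_position` is a hypothetical helper name.

```python
from typing import Tuple


def resume_position(global_step: int, steps_per_epoch: int) -> Tuple[int, int]:
    """Return (completed_epochs, optimizer_steps_into_current_epoch)."""
    return divmod(global_step, steps_per_epoch)


# checkpoint-45 was written after 3 epochs, so one epoch was 45 / 3 = 15
# optimizer steps at the original global batch size of 64.
saved_step = 45
steps_per_epoch_bs64 = 15

# Assumption: halving the global batch size to 32 roughly doubles the number
# of optimizer steps per epoch (same dataset, same number of GPUs).
steps_per_epoch_bs32 = 30

print(resume_position(saved_step, steps_per_epoch_bs64))  # (3, 0)
print(resume_position(saved_step, steps_per_epoch_bs32))  # (1, 15)
```

With the original settings, step 45 is exactly the end of epoch 3. With the smaller batch size, the same step count lands halfway through the second epoch, which matches what I saw.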

I hope this helps.


I’ve never used the Trainer before, but I wonder if this is the same kind of problem…


Hello. Yes, I think it is similar to the problem I faced back then. In my case, once training had reached the designated number of epochs, I had to set the checkpoint from which I wanted to continue training.
As long as you keep the batch settings the same, training continues from the desired step. Changing the batch size is what changed the epoch at which training resumed.
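
If you want to check this before launching a resumed run, something like the following could work. It is only a sketch: it assumes the checkpoint folder contains the `trainer_state.json` file that the Trainer writes, with its `global_step` and `epoch` fields. The helper names and the `new_steps_per_epoch` value (which you would have to work out yourself from your dataset size and batch settings) are hypothetical.

```python
import json
from pathlib import Path


def saved_steps_per_epoch(checkpoint_dir):
    """Infer optimizer steps per epoch from the state saved in a checkpoint."""
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    state = json.loads(state_file.read_text())
    # "global_step" and "epoch" are recorded at the moment the checkpoint is saved.
    return state["global_step"] / state["epoch"]


def batch_settings_match(checkpoint_dir, new_steps_per_epoch, tolerance=1.0):
    """True if the new run has about the same steps per epoch as the saved run."""
    difference = abs(saved_steps_per_epoch(checkpoint_dir) - new_steps_per_epoch)
    return difference <= tolerance
```

For example, `batch_settings_match("./checkpoints/llava-v1.5-13b/checkpoint-45", 15)` should be true for a checkpoint saved at step 45 after 3 epochs, while passing 30 (the halved batch size) should be false.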
