Hi. Is there any way to resume training from the epoch at which a previous run finished, and continue from that epoch up to a larger number of epochs?
I recently finetuned LLaVA-v1.5-13B for 3 epochs and got results. Now I would like to increase the number of epochs gradually, first to 5 and then to more later.
Because the 3 epochs took 45 steps, the checkpoint was saved as 'checkpoint-45', so I set the 'resume_from_checkpoint' argument to 'checkpoint-45'.
However, although I am sure the previous run ended at epoch 3, whenever I start from that checkpoint the training always begins again at epoch 1 instead of continuing from epoch 3. I am wondering whether I have set something wrong.
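For reference, my understanding of how 'resume_from_checkpoint' is supposed to work with the Hugging Face Trainer is roughly the following (a minimal sketch, not the actual LLaVA training code; 'model' and 'train_dataset' are placeholders for whatever the training script builds):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints/llava-v1.5-13b",
    num_train_epochs=5,      # new, larger epoch count
    save_strategy="steps",
    save_steps=3,
)

# model and train_dataset are placeholders here, standing in for whatever
# the LLaVA training script actually constructs.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Passing the checkpoint path here should restore the optimizer, the LR
# scheduler, and the epoch/step counters from trainer_state.json, so the
# run should continue from epoch 3 rather than restarting at epoch 1.
trainer.train(resume_from_checkpoint="./checkpoints/llava-v1.5-13b/checkpoint-45")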
I will provide you with my script for your information.
#!/bin/bash
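# LoRA finetuning launch script; resuming from the checkpoint saved after
# 3 epochs (45 steps) and continuing up to 5 epochs.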
deepspeed llava/train/train_xformers.py \
    --lora_enable True --lora_r 8 --lora_alpha 16 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path ./playground/data/floorplan_data/floorplan_vqa-train.json \
    --valid_data_path ./playground/data/floorplan_data/floorplan_vqa-valid.json \
    --image_folder ./playground/data/floorplan_data/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-13b \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 3 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --load_best_model_at_end True \
    --resume_from_checkpoint ./checkpoints/llava-v1.5-13b/checkpoint-45
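One more piece of information: as far as I understand, the Trainer reads the epoch and step counters back from trainer_state.json inside the checkpoint folder, so I can at least check whether that file is there and what it records (a quick sketch; the path is the one from my script, and I am assuming the standard Trainer checkpoint layout):

import json, os

ckpt = "./checkpoints/llava-v1.5-13b/checkpoint-45"
print(os.listdir(ckpt))  # expecting trainer_state.json plus optimizer/scheduler state

with open(os.path.join(ckpt, "trainer_state.json")) as f:
    state = json.load(f)
print(state["epoch"], state["global_step"])  # should show roughly 3.0 and 45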
I even tried setting 'model_name_or_path' to './checkpoints/llava-v1.5-13b/checkpoint-45', but that only gave me this error: "ValueError: Target module Dropout(p=0.05, inplace=False) is not supported. Currently, only the following modules are supported: torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, transformers.pytorch_utils.Conv1D."
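In case it matters, my rough understanding is that a LoRA checkpoint is normally loaded back on top of the base model through peft rather than passed directly as 'model_name_or_path'. A sketch of what I mean (assumptions: checkpoint-45 actually contains adapter_config.json and the adapter weights, and LlavaLlamaForCausalLM is the model class exported by the LLaVA repo):

from llava.model import LlavaLlamaForCausalLM
from peft import PeftModel

# Load the original base weights first, then attach the saved LoRA adapter.
base = LlavaLlamaForCausalLM.from_pretrained("liuhaotian/llava-v1.5-13b")
model = PeftModel.from_pretrained(base, "./checkpoints/llava-v1.5-13b/checkpoint-45")

Even if that is the right way to reload the adapter, though, it does not answer my original question about making the epoch counter continue from 3.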
Again, how can I get training to continue from the epoch at which the previous run finished when I resume finetuning?
Thank you.