I am using the following type of script. What I want to do is train sequentially over all domains of m2d2: loop over the domains one by one, train on each for one epoch, save a checkpoint after the first domain, and then use that checkpoint as the starting point for training on the second domain, and so on.
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name machelreid/m2d2 \
    --dataset_config_name $task_name \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --num_train_epochs 1 \
    --output_dir ./tmp/$task_name \
    --resume_from_checkpoint ./tmp/$task_name \
    --cache_dir ./cache
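This command is run once per domain inside a wrapper loop, roughly like the sketch below (the domain names here are placeholders I made up, not real m2d2 config names):

# Sketch of the wrapper loop: one run_clm.py call per m2d2 domain.
# NOTE: --resume_from_checkpoint points at the current task's own output dir.
for task_name in domain_0 domain_1 domain_2; do   # placeholder domain names
    # ... the run_clm.py command shown above, using this iteration's $task_name ...
done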
However, when I do this and check the files in the corresponding task folders, I get the following.
In the task 0 dir I get checkpoint-500 through checkpoint-3500 (one every 500 steps), plus the final model and results files:
all_results.json  config.json  eval_results.json  generation_config.json  merges.txt  pytorch_model.bin  README.md  runs  special_tokens_map.json  tokenizer.json  tokenizer_config.json  trainer_state.json  train_results.json  training_args.bin  vocab.json
In the task 1 dir I get checkpoint-4000 through checkpoint-11500 (again one every 500 steps), plus the same set of final model and results files as in task 0.
In the task 2 dir I get only that same set of final model and results files; there are no checkpoint-* directories at all.
From task 3 onwards the directories are completely empty. I also notice that the checkpoint numbering in the task 1 dir continues from where task 0 stopped (checkpoint-4000 onwards) rather than restarting from checkpoint-500.
Can someone please help me figure out why this is happening and what I can do to get the desired result?
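To make the goal concrete, the chaining I am after looks like this (a sketch only; prev_model is a variable name I made up for illustration, and the domain names are placeholders):

# Desired behavior: each domain starts from the weights produced by the previous one.
prev_model=gpt2                                    # first domain starts from pretrained gpt2
for task_name in domain_0 domain_1 domain_2; do    # placeholder domain names
    # Train one epoch on this domain, initializing from the previous domain's weights.
    python run_clm.py \
        --model_name_or_path "$prev_model" \
        --dataset_name machelreid/m2d2 \
        --dataset_config_name "$task_name" \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --do_train \
        --num_train_epochs 1 \
        --output_dir "./tmp/$task_name" \
        --cache_dir ./cache
    prev_model="./tmp/$task_name"                  # next domain starts from this output
done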