I am using the following type of script. What I want to do is train sequentially over all domains of m2d2: loop over the domains one by one, train on each for one epoch, save a checkpoint after the first domain, and then use that checkpoint as the starting point for training on the second domain, and so on.
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name machelreid/m2d2 \
    --dataset_config_name $task_name \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --num_train_epochs 1 \
    --output_dir ./tmp/$task_name \
    --resume_from_checkpoint ./tmp/$task_name \
    --cache_dir ./cache
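This command is run once per domain inside a wrapper loop, roughly like the sketch below (the domain names here are placeholders I made up, not real m2d2 config names):

# Sketch of the wrapper loop: one run_clm.py call per m2d2 domain.
# NOTE: --resume_from_checkpoint points at the current task's own output dir.
for task_name in domain_0 domain_1 domain_2; do   # placeholder domain names
    # ... the run_clm.py command shown above, using this iteration's $task_name ...
done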
However, when I do this and check the files in the corresponding task folders, I get the following.
In the task 0 dir I get checkpoint-500 through checkpoint-3500 (one every 500 steps), plus the final model and results files:
all_results.json  config.json  eval_results.json  generation_config.json  merges.txt  pytorch_model.bin  README.md  runs  special_tokens_map.json  tokenizer.json  tokenizer_config.json  trainer_state.json  train_results.json  training_args.bin  vocab.json
In the task 1 dir I get checkpoint-4000 through checkpoint-11500 (again one every 500 steps), plus the same set of final model and results files as in task 0.
In the task 2 dir I get only that same set of final model and results files; there are no checkpoint-* directories at all.
From task 3 onwards the directories are completely empty. I also notice that the checkpoint numbering in the task 1 dir continues from where task 0 stopped (checkpoint-4000 onwards) rather than restarting from checkpoint-500.
Can someone please help me figure out why this is happening and what I can do to get the desired result?
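To make the goal concrete, the chaining I am after looks like this (a sketch only; prev_model is a variable name I made up for illustration, and the domain names are placeholders):

# Desired behavior: each domain starts from the weights produced by the previous one.
prev_model=gpt2                                    # first domain starts from pretrained gpt2
for task_name in domain_0 domain_1 domain_2; do    # placeholder domain names
    # Train one epoch on this domain, initializing from the previous domain's weights.
    python run_clm.py \
        --model_name_or_path "$prev_model" \
        --dataset_name machelreid/m2d2 \
        --dataset_config_name "$task_name" \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --do_train \
        --num_train_epochs 1 \
        --output_dir "./tmp/$task_name" \
        --cache_dir ./cache
    prev_model="./tmp/$task_name"                  # next domain starts from this output
done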