I am training my model from a checkpoint that I created. I have two questions:
- It seems max_steps has to be set to a value larger than the step count at which the checkpoint was saved. For example, if the checkpoint is checkpoint-15000, my max_steps when resuming from it has to be larger than 15000. Is my understanding correct? (See the sketch after this list.)
- After resuming from the checkpoint, my train_loss is reported as 0.0 and no additional checkpoint is created. Why is that?
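For context, here is a minimal sketch of how the step the checkpoint was saved at can be inspected. It assumes the standard trainer_state.json that the Hugging Face Trainer writes into every checkpoint directory; the checkpoint path is the one from my command below, and the comment about zero steps is my guess, not verified behavior.

import json

# The Trainer records its state, including the optimizer step count,
# in trainer_state.json inside each checkpoint directory.
ckpt = "/home/ubuntu/workspace/TMP_models/roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16/checkpoint-15000"

with open(f"{ckpt}/trainer_state.json") as f:
    state = json.load(f)

# My assumption: if max_steps <= global_step when resuming, zero new
# optimizer steps run, which would explain a train_loss of 0.0 and
# no new checkpoints being written.
print(state["global_step"])    # 15000 for checkpoint-15000
print(state.get("max_steps"))  # max_steps recorded at save time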
Here are my parameters:
output='roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16'
nohup accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml --num_processes 8 \
RESEARCH_run_mlm_with_hce_encoder.py \
--model_name_or_path roberta-base \
--do_train True \
--do_eval False \
--remove_unused_columns False \
--label_names labels \
--train_file s3://path/part-00058.snappy.parquet \
--validation_file s3://path2/*.parquet \
--output_dir ../TMP_models/${output} \
--save_total_limit 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--eval_steps 1 \
--logging_steps 100 \
--save_steps 100 \
--fp16 True \
--dataloader_drop_last True \
--learning_rate 2e-4 \
--weight_decay 1e-5 \
--overwrite_cache False \
--overwrite_output_dir False \
--gradient_accumulation_steps 4 \
--logging_dir ../logs/${output} \
--num_train_epochs 1 \
--streaming True \
--max_steps 16000 \
--masking_strategy "token_sep" \
--resume_from_checkpoint /home/ubuntu/workspace/TMP_models/roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16/checkpoint-15000/ \
--masking_prefix_flag True > nohup.out &
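For reference, here is the step budget I expect when resuming, as plain arithmetic from the flags above. The effective-batch formula is my assumption about how accelerate processes and gradient accumulation combine, not something I verified in the Trainer source.

num_processes = 8                # --num_processes
per_device_train_batch_size = 8  # --per_device_train_batch_size
gradient_accumulation_steps = 4  # --gradient_accumulation_steps

# Samples consumed per optimizer step, under my assumption.
samples_per_step = (num_processes
                    * per_device_train_batch_size
                    * gradient_accumulation_steps)
print(samples_per_step)          # 256

# Resuming checkpoint-15000 with --max_steps 16000 should leave
# 1000 optimizer steps, i.e. 10 saves at --save_steps 100
# (checkpoint-15100 ... checkpoint-16000), if training actually runs.
remaining_steps = 16000 - 15000
print(remaining_steps, remaining_steps // 100)  # 1000 10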
Can someone please help?