I am training my model from a checkpoint that I created. I have two questions:
- It seems max_steps has to be set to a value larger than the step count at which the checkpoint was saved. For example, if the checkpoint is checkpoint-15000, my max_steps when resuming from it has to be larger than 15000. Is my understanding correct? (See the sketch after this list.)
- After resuming from the checkpoint, my train_loss is reported as 0.0 and no additional checkpoint is created. Why is that?
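For context, here is a minimal sketch of how the step the checkpoint was saved at can be inspected. It assumes the standard trainer_state.json that the Hugging Face Trainer writes into every checkpoint directory; the checkpoint path is the one from my command below, and the comment about zero steps is my guess, not verified behavior.

import json

# The Trainer records its state, including the optimizer step count,
# in trainer_state.json inside each checkpoint directory.
ckpt = "/home/ubuntu/workspace/TMP_models/roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16/checkpoint-15000"

with open(f"{ckpt}/trainer_state.json") as f:
    state = json.load(f)

# My assumption: if max_steps <= global_step when resuming, zero new
# optimizer steps run, which would explain a train_loss of 0.0 and
# no new checkpoints being written.
print(state["global_step"])    # 15000 for checkpoint-15000
print(state.get("max_steps"))  # max_steps recorded at save time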
Here are my parameters:
output='roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16'
nohup accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml --num_processes 8 \
RESEARCH_run_mlm_with_hce_encoder.py \
--model_name_or_path roberta-base \
--do_train True \
--do_eval False \
--remove_unused_columns False \
--label_names labels \
--train_file s3://path/part-00058.snappy.parquet \
--validation_file s3://path2/*.parquet \
--output_dir ../TMP_models/${output} \
--save_total_limit 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--eval_steps 1 \
--logging_steps 100 \
--save_steps 100 \
--fp16 True \
--dataloader_drop_last True \
--learning_rate 2e-4 \
--weight_decay 1e-5 \
--overwrite_cache False \
--overwrite_output_dir False \
--gradient_accumulation_steps 4 \
--logging_dir ../logs/${output} \
--num_train_epochs 1 \
--streaming True \
--max_steps 16000 \
--masking_strategy "token_sep" \
--resume_from_checkpoint /home/ubuntu/workspace/TMP_models/roberta-base_lr_2e-4_wd_1e-5_entity_masking_fp16/checkpoint-15000/ \
--masking_prefix_flag True > nohup.out &
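For reference, here is the step budget I expect when resuming, as plain arithmetic from the flags above. The effective-batch formula is my assumption about how accelerate processes and gradient accumulation combine, not something I verified in the Trainer source.

num_processes = 8                # --num_processes
per_device_train_batch_size = 8  # --per_device_train_batch_size
gradient_accumulation_steps = 4  # --gradient_accumulation_steps

# Samples consumed per optimizer step, under my assumption.
samples_per_step = (num_processes
                    * per_device_train_batch_size
                    * gradient_accumulation_steps)
print(samples_per_step)          # 256

# Resuming checkpoint-15000 with --max_steps 16000 should leave
# 1000 optimizer steps, i.e. 10 saves at --save_steps 100
# (checkpoint-15100 ... checkpoint-16000), if training actually runs.
remaining_steps = 16000 - 15000
print(remaining_steps, remaining_steps // 100)  # 1000 10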
Can someone please help?