Resuming training BERT from scratch with run_mlm.py

I started training BERT from scratch with run_mlm.py as follows:

python run_mlm.py --model_type bert
--train_file ./data/mk.txt --output_dir ./models/bert-base-uncased
--overwrite_output_dir --tokenizer_name ./models/bert-base-uncased
--line_by_line True --do_train
--per_device_train_batch_size 4 --num_train_epochs 100
--save_steps 100000 --save_total_limit 500
--max_seq_length 512 --logging_steps 500
--use_fast_tokenizer --report_to wandb
--disable_tqdm True

Training stopped because of a power outage. The latest saved checkpoint is:
.\models\bert-base-uncased\checkpoint-1700000

Given the initial command, which command is the most appropriate for resuming training from the last saved checkpoint while preserving all of the parameters above?

Hi @striki-ai,

If you remove the --overwrite_output_dir option and run the same command again, the script will detect the last checkpoint in the output directory and resume training from there.
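
In case it helps to see which checkpoint would be picked up before re-running, here is a rough sketch of the detection step: look in the output directory for subfolders named checkpoint-N and take the one with the largest N. The function name find_last_checkpoint is made up for this illustration and uses only the standard library; it approximates what the script does, it is not the script's own code. As far as I know, the Trainer arguments also accept --resume_from_checkpoint with an explicit checkpoint path if you would rather not rely on auto-detection.

```python
import re
from pathlib import Path
from typing import Optional

# Checkpoint folders are assumed to be named "checkpoint-<global step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-N subfolder with the largest N, or None.

    Hypothetical helper for illustration; not part of run_mlm.py.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for entry in root.iterdir():
        match = _CHECKPOINT_RE.match(entry.name)
        if entry.is_dir() and match:
            # Compare step numbers as integers, not as strings, so that
            # checkpoint-1700000 ranks above checkpoint-900000.
            candidates.append((int(match.group(1)), entry))

    if not candidates:
        return None
    return str(max(candidates, key=lambda item: item[0])[1])


if __name__ == "__main__":
    # Output directory taken from the command in the question.
    print(find_last_checkpoint("./models/bert-base-uncased"))
```

If this prints the checkpoint you expect, re-running the original command without --overwrite_output_dir should continue from that step.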


Related to this question, I’ve been trying to continue training, but with a new/lower learning rate. How do I do that?