TL;DR: Any tips on adding epochs to an MLM pretraining run via Accelerate with a different batch size?
I'm looking for tips on working around the issues described in the following GitHub Q&A item and feature request:
https://github.com/huggingface/transformers/issues/7198
how to continue training from a checkpoint with Trainer?
https://github.com/huggingface/transformers/issues/21271
issue warning about different batch size being used for --resume_from_checkpoint
My problem is that I want to resume a costly masked language model (MLM) pretraining run on an AWS 4-GPU server. The run completed one epoch with a small batch size, and I want to run a few more epochs with a larger batch size for better throughput.
This is not supported, because the script re-adjusts the resume step to account for the batch size, and that arithmetic assumes the batch size is unchanged (sketched below):
transformers/run_mlm_no_trainer.py at main · huggingface/transformers · GitHub
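For context, here is a rough sketch of how I understand that resume-point calibration in run_mlm_no_trainer.py. The function and variable names below are mine, not the script's, and the exact code on main may differ, but it shows why the arithmetic drifts when the resumed run uses a different batch size.

```python
import os

def compute_resume_point(checkpoint_name: str,
                         gradient_accumulation_steps: int,
                         batches_per_epoch: int):
    """Approximate the way run_mlm_no_trainer.py turns a checkpoint folder
    name ("epoch_{N}" or "step_{N}") into a starting epoch and an in-epoch
    batch offset. Illustrative only, not the script's exact code."""
    stem = os.path.splitext(checkpoint_name)[0]
    if "epoch" in stem:
        # Epoch checkpoints simply restart at the next epoch.
        return int(stem.replace("epoch_", "")) + 1, 0
    # Step checkpoints count optimizer updates, so convert back to
    # dataloader batches before splitting into epoch + in-epoch offset.
    resume_step = int(stem.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // batches_per_epoch
    resume_step -= starting_epoch * batches_per_epoch
    return starting_epoch, resume_step

# batches_per_epoch is len(train_dataloader) in the *resumed* run, so a larger
# per-device batch size shrinks it and shifts the computed resume point away
# from where the original run actually stopped:
print(compute_resume_point("step_12000", 1, 40000))  # (0, 12000) original sizing
print(compute_resume_point("step_12000", 1, 10000))  # (1, 2000)  larger batches
```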
The workaround I implemented was to reset the step to 0 after the script does its calibration; a sketch of what I mean follows the link below. I went down this path because I couldn't find comparable last-checkpoint resume support in the Trainer-based version of the MLM script:
transformers/run_mlm.py at main · huggingface/transformers · GitHub
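Concretely, the workaround amounts to something like this, applied right after the calibration shown above. The names and numbers are again illustrative, and the exact insertion point in the script may differ:

```python
# Values the script would have derived from a "step_{N}" checkpoint name,
# following the earlier sketch (illustrative numbers):
starting_epoch, resume_step = 1, 2000

# One full epoch is already done, and the in-epoch offset is meaningless under
# the new batch size, so start the next epoch from scratch and skip nothing.
resume_step = 0

# In the script this means overriding resume_step before the batch-skipping
# logic (e.g. accelerator.skip_first_batches or the manual skip loop) runs.
# Model, optimizer, and LR-scheduler states are still restored from the
# checkpoint via accelerator.load_state; only the data pipeline restarts at
# the top of the epoch.
```

Since my original run ended exactly on an epoch boundary, skipping zero batches loses nothing; the one thing I would double-check after a change like this is whatever counter drives the learning-rate schedule in the resumed run.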