Resuming accelerate-based pretraining with different batch size

TL;DR: Any tips on adding epochs to an MLM pretraining run done via Accelerate, resuming with a different batch size?

I'm looking for tips on working around the issues described in the following GitHub Q&A item and feature request:

https://github.com/huggingface/transformers/issues/7198:
how to continue training from a checkpoint with Trainer?

https://github.com/huggingface/transformers/issues/21271: 
issue warning about different batch size being used for --resume_from_checkpoint

My problem is that I want to resume a costly masked language model (MLM) pretraining run on an AWS 4-GPU server. The run completed one epoch with a small batch size, and I want to run a few more epochs with a larger batch size for better throughput.

This is not supported by the script, because of the way it recalculates the resume point from the saved step count; that arithmetic assumes the dataloader length (and therefore the batch size) is unchanged (see the sketch after the link):
transformers/run_mlm_no_trainer.py at main · huggingface/transformers · GitHub
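Roughly, the no-trainer script turns the step count encoded in the checkpoint folder name back into an epoch number plus an offset inside the dataloader. Here is a simplified sketch of that recalibration, paraphrased from my reading of the script rather than copied verbatim (the helper name `compute_resume_position` is mine, not the script's):

```python
def compute_resume_position(checkpoint_name: str, dataloader_len: int,
                            gradient_accumulation_steps: int):
    """Paraphrase of the resume recalibration in run_mlm_no_trainer.py (not verbatim).

    `checkpoint_name` is the checkpoint folder name, e.g. "epoch_0" or "step_5000".
    Returns (starting_epoch, resume_step).
    """
    if "epoch" in checkpoint_name:
        # Epoch checkpoints resume cleanly at the start of the next epoch.
        return int(checkpoint_name.replace("epoch_", "")) + 1, None
    # Step checkpoints are converted back into a position inside the dataloader.
    # This assumes len(train_dataloader) -- and therefore the per-device batch
    # size -- is the same as in the original run; resume with a different batch
    # size and the computed offset points at the wrong place in the data.
    resume_step = int(checkpoint_name.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // dataloader_len
    resume_step -= starting_epoch * dataloader_len
    return starting_epoch, resume_step
```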

The workaround I implemented was to reset the resume step to 0 after the script finishes its recalibration (sketched below). I went this route because I couldn't find comparable support for restarting from the last checkpoint in the trainer-based version of the MLM script:
transformers/run_mlm.py at main · huggingface/transformers · GitHub
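To make the effect concrete, here is what the override looks like using the helper sketched above. The numbers are dummies and the placement is from my local copy of run_mlm_no_trainer.py, so treat it as an illustration rather than a drop-in patch:

```python
# Dummy numbers, not the real script. The first epoch had 5000 batches at the
# small batch size, and the checkpoint was saved at step 5000 (end of epoch 0).
# With the larger batch size an epoch is now only 3000 batches long.
starting_epoch, resume_step = compute_resume_position(
    "step_5000", dataloader_len=3000, gradient_accumulation_steps=1
)
# -> starting_epoch == 1 (correct), but resume_step == 2000, so the script
#    would skip the first 2000 batches of an epoch that was never trained.

# Workaround: accelerator.load_state() has already restored the model,
# optimizer and LR scheduler, so zero out the offset and let epoch 1 run from
# its first batch at the new batch size.
resume_step = 0
```

This loses exact sample-level continuity within an epoch, but since the checkpoint in my case sat at an epoch boundary, starting the next epoch from its first batch was the behavior I wanted anyway.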