TL;DR: Any tips on adding epochs to an MLM pretraining run via Accelerate with a different batch size?
I'm looking for tips on working around the issues described in the following GitHub Q&A item and feature request:
https://github.com/huggingface/transformers/issues/7198
how to continue training from a checkpoint with Trainer?
https://github.com/huggingface/transformers/issues/21271
issue warning about different batch size being used for --resume_from_checkpoint
My problem is that I want to resume a costly masked language model (MLM) pretraining run on an AWS 4-GPU server. The run completed one epoch with a small batch size, and I want to run a few more epochs with a larger batch size for better throughput.
This is not supported, because the script re-adjusts the resume step to account for the batch size, and that arithmetic assumes the batch size is unchanged (sketched below):
transformers/run_mlm_no_trainer.py at main · huggingface/transformers · GitHub
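For context, here is a rough sketch of how I understand that resume-point calibration in run_mlm_no_trainer.py. The function and variable names below are mine, not the script's, and the exact code on main may differ, but it shows why the arithmetic drifts when the resumed run uses a different batch size.

```python
import os

def compute_resume_point(checkpoint_name: str,
                         gradient_accumulation_steps: int,
                         batches_per_epoch: int):
    """Approximate the way run_mlm_no_trainer.py turns a checkpoint folder
    name ("epoch_{N}" or "step_{N}") into a starting epoch and an in-epoch
    batch offset. Illustrative only, not the script's exact code."""
    stem = os.path.splitext(checkpoint_name)[0]
    if "epoch" in stem:
        # Epoch checkpoints simply restart at the next epoch.
        return int(stem.replace("epoch_", "")) + 1, 0
    # Step checkpoints count optimizer updates, so convert back to
    # dataloader batches before splitting into epoch + in-epoch offset.
    resume_step = int(stem.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // batches_per_epoch
    resume_step -= starting_epoch * batches_per_epoch
    return starting_epoch, resume_step

# batches_per_epoch is len(train_dataloader) in the *resumed* run, so a larger
# per-device batch size shrinks it and shifts the computed resume point away
# from where the original run actually stopped:
print(compute_resume_point("step_12000", 1, 40000))  # (0, 12000) original sizing
print(compute_resume_point("step_12000", 1, 10000))  # (1, 2000)  larger batches
```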
The workaround I implemented was to reset the step to 0 after the script does its calibration; a sketch of what I mean follows the link below. I went down this path because I couldn't find comparable last-checkpoint resume support in the Trainer-based version of the MLM script:
transformers/run_mlm.py at main · huggingface/transformers · GitHub
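Concretely, the workaround amounts to something like this, applied right after the calibration shown above. The names and numbers are again illustrative, and the exact insertion point in the script may differ:

```python
# Values the script would have derived from a "step_{N}" checkpoint name,
# following the earlier sketch (illustrative numbers):
starting_epoch, resume_step = 1, 2000

# One full epoch is already done, and the in-epoch offset is meaningless under
# the new batch size, so start the next epoch from scratch and skip nothing.
resume_step = 0

# In the script this means overriding resume_step before the batch-skipping
# logic (e.g. accelerator.skip_first_batches or the manual skip loop) runs.
# Model, optimizer, and LR-scheduler states are still restored from the
# checkpoint via accelerator.load_state; only the data pipeline restarts at
# the top of the epoch.
```

Since my original run ended exactly on an epoch boundary, skipping zero batches loses nothing; the one thing I would double-check after a change like this is whatever counter drives the learning-rate schedule in the resumed run.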