Trainer with adaptive batch size?

I’m training a custom T5 model from scratch using the Hugging Face Trainer. My training examples have input and label lengths ranging from 200 to 2000 tokens, with the average example around 300 tokens for both input and label.

My GPU only has enough memory for a batch size of 1 when the input and label lengths are at their maximum. Therefore, my Trainer instance uses per_device_train_batch_size = 1 and gradient_accumulation_steps = 128. However, this is quite wasteful for batches containing only short examples. For batches where the inputs and labels are at most 500 tokens, for instance, my GPU could handle per_device_train_batch_size = 16 with gradient_accumulation_steps = 8 for the same effective batch size. If I could vary these two parameters according to input and label length as Trainer iterates through my training data (keeping the effective batch size constant), I could improve throughput and shorten training time considerably.
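
To make the idea concrete, here is roughly the length-to-settings mapping I have in mind. Only the ≤ 500 bucket is something I've actually measured; the intermediate threshold and batch size are guesses, and every pair multiplies out to the same effective batch size of 128:

```python
# Hypothetical mapping from the longest sequence in a batch to
# (per_device_train_batch_size, gradient_accumulation_steps).
# Each pair keeps the effective batch size at 16 * 8 = 4 * 32 = 1 * 128 = 128.
LENGTH_BUCKETS = [
    (500, 16, 8),     # inputs/labels up to 500 tokens (measured on my GPU)
    (1000, 4, 32),    # guess: up to 1000 tokens
    (2000, 1, 128),   # worst case: the current fixed setting
]

def batch_settings(max_len: int) -> tuple[int, int]:
    """Return (batch_size, grad_accum_steps) for a batch whose longest
    input/label is max_len tokens."""
    for threshold, batch_size, accum_steps in LENGTH_BUCKETS:
        if max_len <= threshold:
            return batch_size, accum_steps
    return 1, 128  # fall back to the conservative worst-case setting
```

The open question is how to get Trainer to respect something like this while it builds batches, since both values are normally fixed once in TrainingArguments.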

Any thoughts on how I might achieve this? (Would it be easy to modify trainer.py? If so, where in the file? Or is there a straightforward approach I’m not seeing?)
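
For reference, my current setup looks roughly like this (the model, dataset, and collator are defined elsewhere; the names and output path are placeholders):

```python
from transformers import Trainer, TrainingArguments

# Configuration sized for the worst case: only a batch size of 1 fits the
# 2000-token examples, so 128 accumulation steps give an effective batch
# size of 128 regardless of example length.
training_args = TrainingArguments(
    output_dir="t5-from-scratch",      # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=128,
)

trainer = Trainer(
    model=model,                  # custom T5 model, defined elsewhere
    args=training_args,
    train_dataset=train_dataset,  # tokenized examples, 200-2000 tokens each
    data_collator=data_collator,  # pads each batch dynamically (my setup)
)
trainer.train()
```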