I'm trying to recreate a paper's results. They used a batch size of 16, 30 epochs, and gradient accumulation steps of 3, but I can only fit a batch size of 8 on my GPU.
I'm using a linear warmup scheduler:

```python
self.scheduler = get_linear_schedule_with_warmup(
    self.opt,
    num_warmup_steps=0,
    num_training_steps=(dataset_size / effective_batch_size) * self.hparams.max_epochs,
)
```
How should I adjust my max epochs and gradient accumulation steps, given that I'm using half the batch size? Should I train for the same number of epochs but increase the gradient accumulation steps?
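For reference, here is how I'm reasoning about the step counts (a sketch only; `dataset_size` is a placeholder, and the `total_steps` helper is mine, not from the paper). If I double the accumulation steps from 3 to 6, the effective batch size stays at 48, so with the same number of epochs the scheduler sees the same number of optimizer steps:

```python
import math

dataset_size = 10_000  # placeholder; substitute the real dataset size
max_epochs = 30

def total_steps(per_device_batch, grad_accum_steps):
    """Optimizer steps the scheduler will see over all epochs."""
    effective_batch = per_device_batch * grad_accum_steps
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    return steps_per_epoch * max_epochs

# Paper's setup: batch 16, accumulation 3 -> effective batch 48
paper_steps = total_steps(16, 3)

# My setup: batch 8, accumulation 6 -> same effective batch 48
my_steps = total_steps(8, 6)

print(paper_steps, my_steps)  # the two step counts come out equal
```

Is this the right way to think about it, or does halving the per-device batch change anything else (e.g. learning rate) that I should compensate for?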