Hmm, can't you always increase the equivalent batch size using gradient accumulation? Something like:
if batch_idx % 8 == 7:
    optimizer.step()
    optimizer.zero_grad()
and you can just reduce the lr accordingly:
lr = lr / 8
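For concreteness, here is a minimal PyTorch-style sketch of that accumulation loop. The toy linear model, random data, and the 0.1 base lr are just placeholders so the snippet runs end to end; with plain SGD, summing gradients over 8 micro-batches and dividing lr by 8 gives the same update as a single batch of 8:

```python
import torch
from torch import nn

# Toy setup just to make the sketch runnable; the real model/data would go here.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 / 8)  # lr scaled down by the accumulation factor
data = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(32)]  # micro-batches of size 1

accum_steps = 8
optimizer.zero_grad()
for batch_idx, (x, y) in enumerate(data):
    loss = criterion(model(x), y)
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if batch_idx % accum_steps == accum_steps - 1:
        optimizer.step()       # one update per 8 micro-batches
        optimizer.zero_grad()  # clear the accumulated gradients for the next window
```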
Also, adding more cards has the same effect: you can basically use 8 GPUs with batch_size = 1 and it is the same as 1 GPU with batch_size = 8.
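And a rough sketch of the multi-GPU version of the same idea, assuming a single node launched with torchrun and NCCL (the model, data, lr, and file name are placeholders): the key point is that DDP averages gradients across the 8 ranks during backward(), so 8 GPUs with batch_size = 1 produce the same mean gradient as 1 GPU with batch_size = 8.

```python
# Launch with: torchrun --nproc_per_node=8 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Toy model/data so the sketch runs; each rank holds a micro-batch of size 1.
model = DDP(nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(1, 10).cuda(local_rank)
y = torch.randn(1, 1).cuda(local_rank)

loss = criterion(model(x), y)
loss.backward()   # DDP all-reduces and averages gradients across the 8 ranks here
optimizer.step()  # so this update matches a single-GPU step on a batch of 8

dist.destroy_process_group()
```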
So I think the question really is: what batch_size (or equivalent batch size, i.e. GPU_num * gradient_accumulation_steps * batch_size) is optimal?