I’m building an image captioner using Hugging Face models with PyTorch, and I’m getting different results for the first iteration (and, obviously, for the following ones) even though the effective batch size is the same (a mini-batch of 4 with 16 gradient accumulation steps vs. a batch size of 64, or any other combination that gives an effective batch size of 64).
For a batch size of 16 with 4 gradient accumulation steps, the first-iteration loss is 4.095348954200745; for a batch size of 4 with 16 gradient accumulation steps, it is 4.097771629691124.
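To make concrete what I mean by the “first iteration” loss under accumulation: I scale each micro-batch loss by the number of accumulation steps and sum them over one optimizer step. Roughly this pattern, with a toy model standing in for the captioner (illustrative only, not the exact gist code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the captioner and its data (illustrative only)
model = nn.Linear(10, 1)
data = torch.randn(64, 10)
targets = torch.randn(64, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batch_size = 4
accumulation_steps = 16  # 4 * 16 = 64 effective batch size

optimizer.zero_grad()
iteration_loss = 0.0
for i in range(accumulation_steps):
    xb = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    yb = targets[i * micro_batch_size:(i + 1) * micro_batch_size]
    loss = criterion(model(xb), yb) / accumulation_steps  # scale so grads match the big batch
    loss.backward()
    iteration_loss += loss.item()

optimizer.step()
print("first iteration loss:", iteration_loss)  # this is the number I'm comparing across configs
```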
Here’s a link to a gist with the code I’m running: https://gist.github.com/miguelscarv/55c7ce58c911a743fd54be258e3e0d9c
I’ve also included the vision_encoder_decoder.py code because I had to make some slight modifications for it to accept CLIP models.
I’ve disabled dropout, set the seed, and didn’t shuffle the dataloader, and as far as I know the models I’m using don’t have batch-specific layers like batch norm. The models I’m using are gpt2 and laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (in case you want to look at them). I also don’t think it’s a precision issue, because in the code I compare summing the losses and then dividing by the number of gradient accumulation steps against dividing each loss by the number of steps and then summing, and both give the exact same result…
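Concretely, the precision check I mean is comparing these two ways of aggregating the micro-batch losses (a simplified sketch with made-up loss values, not the gist code):

```python
import torch

grad_acc_steps = 16
# Stand-in values for the 16 micro-batch losses of the first iteration
micro_losses = torch.rand(grad_acc_steps, dtype=torch.float32) + 3.5

divide_then_sum = (micro_losses / grad_acc_steps).sum()
sum_then_divide = micro_losses.sum() / grad_acc_steps

print(divide_then_sum.item(), sum_then_divide.item())
# comes out True (dividing by a power-of-two step count before or after summing
# gives identical bits), which is why I don't think it's a rounding issue
print(torch.equal(divide_then_sum, sum_then_divide))
```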
In the same gist I added an attempt at recreating this issue with pure PyTorch and the MNIST dataset, so it’s easier to run. This is the torch_test.py file. The loss difference isn’t as large as the one I just described, but in my opinion it’s still a difference that shouldn’t be there, since there are no precision issues (I think).
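The idea there is roughly this (a self-contained sketch with synthetic data instead of MNIST, not the actual torch_test.py): run the same 64 samples once as a single batch and once as 16 accumulated micro-batches of 4, then compare the gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()

# Same 64 samples for both runs (synthetic stand-ins for MNIST images/labels)
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))

# 1) One big batch of 64
model.zero_grad()
criterion(model(x), y).backward()
big_batch_grads = [p.grad.clone() for p in model.parameters()]

# 2) 16 micro-batches of 4 with scaled losses (gradient accumulation)
model.zero_grad()
for i in range(16):
    xb, yb = x[i * 4:(i + 1) * 4], y[i * 4:(i + 1) * 4]
    (criterion(model(xb), yb) / 16).backward()
accum_grads = [p.grad for p in model.parameters()]

# In exact arithmetic these match; any differences should be down at float32 rounding level
for g1, g2 in zip(big_batch_grads, accum_grads):
    print((g1 - g2).abs().max().item())
```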