Importance of padding for tokens and same size inputs for transformers

We usually pad our inputs/tokens to transformers so they all have the same size (e.g. 512 in BERT).
I have a general question about padding to a fixed size (in this case 512).
Would transformers still be able to learn if the inputs are not padded to a fixed size?
Say I have a batch size of B. I could pad each batch to the length of its longest sample, so the batch would have a different size in each iteration.
But I was wondering whether this would actually work for transformers, and if not, why?
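For what it's worth, the per-batch padding you describe is often called "dynamic padding". A minimal sketch (using plain Python lists, a hypothetical `pad_batch` helper, and assuming 0 is the pad ID) might look like this; the attention mask is what lets the model ignore the padded positions, which is why differing batch lengths are fine:

```python
# Sketch of per-batch ("dynamic") padding. Assumptions: token IDs are
# plain Python lists and 0 is the pad ID; pad_batch is a hypothetical
# helper, not from any particular library.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence only to the longest one in *this* batch."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # Attention mask: 1 for real tokens, 0 for padding, so the model
    # can ignore the padded positions.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch1 = [[5, 6], [7, 8, 9]]      # this batch pads to length 3
batch2 = [[1, 2, 3, 4], [5]]      # this batch pads to length 4
ids1, mask1 = pad_batch(batch1)
ids2, mask2 = pad_batch(batch2)
print(ids1)   # [[5, 6, 0], [7, 8, 9]]
print(mask2)  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```

Each batch ends up with its own length, which typically saves compute compared with always padding to a global maximum like 512.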

I would really appreciate it if someone who knows the answer, or has tried this before, could help me here.


Hey @seyeeet
I have found a tutorial whose input pipeline works exactly as you described: the inputs are padded to the longest sequence in each batch, so each batch has a different padding length.

Tutorial link: Transformer model for language understanding  |  Text  |  TensorFlow

Hope this helps :slight_smile:
