Importance of padding for tokens and same size inputs for transformers

We usually pad our inputs/tokens to transformers so they all have the same size (e.g. 512 in BERT).
I have a general question about padding to a fixed size (in this case 512).
Would transformers still be able to learn if the inputs are not padded to a fixed size?
Say I have a batch size of B. I could pad each batch to the length of its longest sample, so the batch would have a different size in each iteration.
But I was wondering whether this would actually work for transformers, and if not, why?
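For what it's worth, the per-batch padding you describe is often called "dynamic padding". A minimal sketch (using plain Python lists, a hypothetical `pad_batch` helper, and assuming 0 is the pad ID) might look like this; the attention mask is what lets the model ignore the padded positions, which is why differing batch lengths are fine:

```python
# Sketch of per-batch ("dynamic") padding. Assumptions: token IDs are
# plain Python lists and 0 is the pad ID; pad_batch is a hypothetical
# helper, not from any particular library.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence only to the longest one in *this* batch."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # Attention mask: 1 for real tokens, 0 for padding, so the model
    # can ignore the padded positions.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch1 = [[5, 6], [7, 8, 9]]      # this batch pads to length 3
batch2 = [[1, 2, 3, 4], [5]]      # this batch pads to length 4
ids1, mask1 = pad_batch(batch1)
ids2, mask2 = pad_batch(batch2)
print(ids1)   # [[5, 6, 0], [7, 8, 9]]
print(mask2)  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```

Each batch ends up with its own length, which typically saves compute compared with always padding to a global maximum like 512.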

I would really appreciate it if someone who knows the answer, or has tried this before, could help me here.


Hey @seyeeet
I have found a tutorial whose input pipeline works exactly as you described: the inputs are padded to the longest sequence in each batch, so each batch has a different padding length.

Tutorial link: Transformer model for language understanding  |  Text  |  TensorFlow

Hope this helps :slight_smile:
