Importance of padding for tokens and same size inputs for transformers

We usually pad our inputs/tokens to transformers so they all have the same size (e.g. 512 in BERT).
I have a general question about padding to a fixed size (in this case 512).
Would transformers still be able to learn if the inputs are not padded to a fixed size?
Say I have a batch size of B. I could pad each batch to the length of its longest sample, so the batch would have a different size in each iteration.
But I was wondering whether this would actually work for transformers, and if not, why?
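For what it's worth, the per-batch padding you describe is often called "dynamic padding". A minimal sketch (using plain Python lists, a hypothetical `pad_batch` helper, and assuming 0 is the pad ID) might look like this; the attention mask is what lets the model ignore the padded positions, which is why differing batch lengths are fine:

```python
# Sketch of per-batch ("dynamic") padding. Assumptions: token IDs are
# plain Python lists and 0 is the pad ID; pad_batch is a hypothetical
# helper, not from any particular library.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence only to the longest one in *this* batch."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # Attention mask: 1 for real tokens, 0 for padding, so the model
    # can ignore the padded positions.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch1 = [[5, 6], [7, 8, 9]]      # this batch pads to length 3
batch2 = [[1, 2, 3, 4], [5]]      # this batch pads to length 4
ids1, mask1 = pad_batch(batch1)
ids2, mask2 = pad_batch(batch2)
print(ids1)   # [[5, 6, 0], [7, 8, 9]]
print(mask2)  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```

Each batch ends up with its own length, which typically saves compute compared with always padding to a global maximum like 512.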

I would really appreciate it if someone who knows the answer, or has tried this before, could help me here.


Hey @seyeeet
I have found a tutorial whose input pipeline works exactly as you described: the inputs are padded to the longest sequence in each batch, so each batch has a different padding length.

Tutorial link: Transformer model for language understanding  |  Text  |  TensorFlow

Hope this helps :slight_smile:
