Sequences shorter than model's input window size

Hi, I would like to understand (ideally with a pointer to the relevant code on GitHub) how the transformers library handles inputs that are shorter than the model’s input window.

For example, with dynamic batching, the longest sequence in a batch might be only 32 tokens. How does the transformers library turn that into a sequence of model_input_window_size tokens?

Does it automatically append the pad token to each sequence up to model_input_window_size and mask those tokens with 0, so that we don’t have to do it manually?


You can pass padding=True when calling your tokenizer. Every sequence in the batch that is shorter than the longest one is then padded up to that length. (That length is usually still smaller than the model's maximum input size.) Here is an example. If you drop padding=True while keeping return_tensors='pt', you will get a ValueError, because sequences of different lengths cannot be stacked into a single tensor.

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# padding=True pads both sentences to the length of the longest one in this batch
inp = tokenizer(['This is a sentence', 'This is another'], padding=True, return_tensors='pt')

Hi, thanks for your response.

So, as I understand it, padding=True pads those two inputs so that both have as many tokens as the longest sequence in that batch, and the padded positions get a 0 in the attention mask.
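To make that concrete, here is a minimal pure-Python sketch of padding to the longest sequence in a batch. It is only an illustration of the idea, not the library's actual implementation: pad_batch is a hypothetical helper, the token IDs are made up, and pad_token_id=0 is an assumed placeholder value.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    # The target length is the longest sequence in THIS batch,
    # not the model's maximum input size.
    max_len = max(len(seq) for seq in batch)

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens are kept as-is; pad tokens are appended on the right.
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs: one sequence of 4 tokens and one of 3 tokens.
padded = pad_batch([[1, 6, 22, 56], [1, 6, 278]])
print(padded["input_ids"])       # [[1, 6, 22, 56], [1, 6, 278, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both rows end up with length 4, the longest length in the batch, and only the last position of the second row is masked out.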

What I don’t understand is the following. Suppose, for the sake of simplicity, that the first string has 4 tokens and the second one has 3 (I think DistilBERT uses WordPiece tokenization, so the real counts may differ).

How does a batch like this (leaving out the tensor part for simplicity):
[[1,6,22,56],[1,6,278, pad_token_id]]
get processed in the forward pass when it is fed to the model? Both sequences have length 4, which doesn’t match the model’s 512-token input window.