Sequences shorter than the model's input window size

Hi, I wanted to better understand how this works (or find a reference on GitHub): how does the transformers library handle inputs that are smaller than the model’s input window?

For example, with dynamic batching one batch could have a maximum length of 32 tokens. How does the transformers library turn those sequences into model_input_window_size input tokens?

Does it automatically add the pad token to each sequence to complete it up to model_input_window_size, and mask those tokens with 0, so we don’t have to do it manually?
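
Right now I would do something like this by hand, a rough sketch (the pad id, window size, and token ids below are just placeholders I made up for illustration):

# Hypothetical manual padding of one tokenized sequence up to the window size
pad_token_id = 0                      # placeholder: the real value depends on the tokenizer
window = 512                          # placeholder: the model's maximum input length
ids = [101, 2023, 2003, 102]          # some tokenized sequence shorter than the window
input_ids = ids + [pad_token_id] * (window - len(ids))
attention_mask = [1] * len(ids) + [0] * (window - len(ids))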

Thanks

You can use the padding=True flag in your tokenizer. This ensures that every sequence in the batch is padded up to the length of the longest sequence in that batch (which is usually much shorter than the model’s maximum input length). Here is an example. If you drop the padding=True flag, you will get a ValueError, because sequences of different lengths cannot be stacked into a single tensor.

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# padding=True pads every sequence up to the longest sequence in this batch
inp = tokenizer(['This is a sentence', 'This is another'], padding=True, return_tensors='pt')
inp
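
If you print the result, you can see what padding=True did: the shorter sentence is padded with the tokenizer’s pad token up to the length of the longest sentence in the batch, and the attention_mask marks those positions with 0. A minimal sketch of how to inspect it (exact ids and lengths depend on the tokenizer):

# input_ids: the shorter sentence is padded with tokenizer.pad_token_id
# attention_mask: padded positions are 0, real tokens are 1
print(inp['input_ids'])
print(inp['attention_mask'])
print(inp['input_ids'].shape)  # (batch_size, longest_sequence_in_batch), not (batch_size, 512)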

Hi, thanks for your response.

So, what I understand is that adding padding=True means that for those two inputs, padding is added so that all of them have the same number of tokens as the longest sequence in that batch (adding pad tokens and masking them with zeros).

But what I don’t understand is the following: suppose, in that case, that the first string has 4 tokens and the second one has 3 tokens (I believe DistilBERT uses WordPiece tokenization, but let’s keep it simple).

How does a batch like this (leaving out the tensor part for simplicity):
[[1, 6, 22, 56], [1, 6, 278, pad_token_id]] get processed in the forward pass before being fed to the model, given that both sequences have length 4, which doesn’t match the model’s 512-token input window?
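
To make the question concrete, here is a minimal sketch of the forward pass I am asking about (assuming distilbert-base-uncased and PyTorch; the padded batch is much shorter than 512):

import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# The batch is padded only up to its longest sequence, not up to 512
inp = tokenizer(['This is a sentence', 'This is another'], padding=True, return_tensors='pt')

with torch.no_grad():
    out = model(**inp)

print(inp['input_ids'].shape)       # (batch_size, longest_sequence_in_batch)
print(out.last_hidden_state.shape)  # the sequence dimension matches the padded batch, not 512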