Padding with sliding window

Hello, friends. I’m new to the forum, so if there is some issue with this question, please let me know.

So, I want to tokenize a long sequence, and I’m trying to use the sliding window option.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('../../data/model/', do_lower_case=False)
# split the long text into overlapping 512-token windows
tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512)
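
To make the problem concrete, this is how I check what comes back (with a fast tokenizer each overflow window is returned as its own list under input_ids):

# each overflow window is its own list of token ids
print(len(tokenized_example['input_ids']))
print([len(ids) for ids in tokenized_example['input_ids']])
# every window has 512 tokens except the last one, which is shorter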

Everything worked fine, but then I wanted to obtain tensors for this encoding.

tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512,  return_tensors="pt")

Now this returns an error, because the last window is not padded: instead of 512 tokens, it has only 490. OK, so I need padding, but

tokenized_example = tokenizer(data[0], padding='max_length', return_overflowing_tokens=True, max_length=512,  return_tensors="pt")

The code above does not work either: it seems the window stops sliding once padding is activated, so instead of, say, 12 tensors of 512 tokens, I get a single tensor of 6000.

How can one circumvent this?


Hey, how's it going, Davi?

I also had an undesirable tensor shape with these arguments:

tokenized = tokenizer(
    "test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    return_tensors='pt',
    padding='max_length',
)
print(tokenized['input_ids'].shape)
>> torch.Size([1, 2002])

Try explicitly enabling truncation:

tokenized = tokenizer(
    "Test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    return_tensors='pt',
    padding='max_length',
    truncation=True,
)
print(tokenized['input_ids'].shape)
>> torch.Size([4, 512])

To fully benefit from a sliding window, also try the stride parameter. It controls the overlap between two consecutive “windows.”
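
Something like this should work (128 is just an arbitrary choice here, and I haven't checked the exact number of windows it produces):

tokenized = tokenizer(
    "Test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    stride=128,          # consecutive windows share 128 tokens
    return_tensors='pt',
    padding='max_length',
    truncation=True,
)
print(tokenized['input_ids'].shape)
# still windows of 512, but more of them, since each window repeats
# the last 128 tokens of the previous one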