Padding with sliding window

Hello, friends. I’m new to the forum, so if there is some issue with this question, please let me know.

So, I want to tokenize a long sequence, and I’m trying to use the sliding window option.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('../../data/model/', do_lower_case=False)
# split the long text into overlapping 512-token windows
tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512)
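
To make the problem concrete, this is how I check what comes back (with a fast tokenizer each overflow window is returned as its own list under input_ids):

# each overflow window is its own list of token ids
print(len(tokenized_example['input_ids']))
print([len(ids) for ids in tokenized_example['input_ids']])
# every window has 512 tokens except the last one, which is shorter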

Everything worked fine, but then I wanted to obtain tensors for this encoding.

tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512,  return_tensors="pt")

Now this returns an error, because the last window is not padded: instead of 512 tokens, it has only 490. OK, so I need padding, but

tokenized_example = tokenizer(data[0], padding='max_length', return_overflowing_tokens=True, max_length=512,  return_tensors="pt")

The code above does not work either: it seems the window stops sliding once padding is activated, so instead of, say, 12 tensors of 512 tokens, I get a single tensor of 6000.

How can one circumvent this?


Hey, how's it going, Davi?

I also had an undesirable tensor shape with these arguments:

tokenized = tokenizer(
    "test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    return_tensors='pt',
    padding='max_length',
)
print(tokenized['input_ids'].shape)
>> torch.Size([1, 2002])

Try explicitly enabling truncation:

tokenized = tokenizer(
    "Test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    return_tensors='pt',
    padding='max_length',
    truncation=True,
)
print(tokenized['input_ids'].shape)
>> torch.Size([4, 512])

To fully benefit from a sliding window, also try the stride parameter. It controls the overlap between two consecutive “windows.”
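
Something like this should work (128 is just an arbitrary choice here, and I haven't checked the exact number of windows it produces):

tokenized = tokenizer(
    "Test "*1000,
    return_overflowing_tokens=True,
    max_length=512,
    stride=128,          # consecutive windows share 128 tokens
    return_tensors='pt',
    padding='max_length',
    truncation=True,
)
print(tokenized['input_ids'].shape)
# still windows of 512, but more of them, since each window repeats
# the last 128 tokens of the previous one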