Hello, friends. I’m new to the forum, so if there is some issue with this question, please let me know.
So, I want to tokenize a long sequence, and I’m trying to use the sliding window option.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('../../data/model/', do_lower_case=False)
tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512)
Everything works fine, but I then want to obtain the tensors for this encoding:
tokenized_example = tokenizer(data[0], return_overflowing_tokens=True, max_length=512, return_tensors="pt")
Now this raises an error, because the last window is not padded: instead of 512 tokens, it has only 490. OK, so I need padding, but:
tokenized_example = tokenizer(data[0], padding='max_length', return_overflowing_tokens=True, max_length=512, return_tensors="pt")
The code above does not work either: it seems the window stops sliding once padding is activated. So instead of, say, 12 tensors of length 512, I get a single tensor of length 6000.
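To be concrete, the result I am expecting is equivalent to this manual chunking (a pure-Python sketch with dummy token IDs; I assume a pad ID of 0 and no overlap between windows, just to illustrate the shape I'm after):

```python
def chunk_with_padding(token_ids, max_length=512, stride=0, pad_id=0):
    """Split token_ids into overlapping windows of max_length tokens,
    padding the final (short) window with pad_id."""
    step = max_length - stride  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_length]
        # Pad the last window up to max_length.
        window = window + [pad_id] * (max_length - len(window))
        windows.append(window)
        if start + max_length >= len(token_ids):
            break
    return windows

# e.g. 6000 "tokens" -> 12 windows of 512 each (stride 0)
ids = list(range(1, 6001))
chunks = chunk_with_padding(ids, max_length=512, stride=0)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

That is, every window should have length 512, with the last one padded, so the whole thing can be stacked into a single tensor of shape (num_windows, 512).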
How can one circumvent this?