Tokenizer ignores repeated whitespaces

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, such as line ending (\n) and tab (\t). Adding these tokens works, but the tokenizer always ignores every second consecutive whitespace token. It tokenizes the sequence "\n\n" as a single line ending, the sequence "\n\n\n\n" as two line endings, and so on. See below to reproduce.

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])  # assumed step: the original snippet omits how the token was added

tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]

What is the reasoning behind this behaviour? Is it a bug, or is it related to how the tokenizer works? I noticed that this only happens for added whitespace tokens, not for other characters.

Is there a way to prevent the tokenizer from ignoring repeated whitespace?

It looks like you have to add it as a special token instead:

from tokenizers import AddedToken
tokenizer.add_special_tokens({"pad_token": AddedToken("\n")})

Then it should tokenize '\n' the way you want. This is a little hacky: your newline character is getting stripped because of how the tokenize method for the T5 tokenizer strips whitespace around added tokens by default. If you add it as a special token, it gets different stripping behavior that lets you keep the newline characters.
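To make the stripping effect concrete, here is a toy sketch in plain Python. It is not the transformers implementation. It assumes the slow tokenizer first splits the text on added tokens and then, for a token added as a plain string, strips whitespace from the neighboring pieces, while a token wrapped in AddedToken with its default settings is left alone. The function names and the strip_neighbors flag are hypothetical and exist only for this illustration.

```python
def split_on_added_token(text, token):
    """Split text so that each occurrence of `token` becomes its own piece."""
    pieces = []
    for i, chunk in enumerate(text.split(token)):
        if i > 0:
            pieces.append(token)
        if chunk:
            pieces.append(chunk)
    return pieces


def tokenize_sketch(text, token, strip_neighbors):
    """Toy model of added-token handling (hypothetical, for illustration only).

    strip_neighbors=True mimics the assumed default for plain-string tokens:
    whitespace is stripped from the pieces next to each added token.
    strip_neighbors=False mimics a token whose neighbors are left untouched.
    """
    pieces = split_on_added_token(text, token)
    for i, piece in enumerate(pieces):
        if piece == token and strip_neighbors:
            if i + 1 < len(pieces):
                # A following "\n" is pure whitespace, so it is stripped to "".
                pieces[i + 1] = pieces[i + 1].lstrip()
            if i > 0:
                pieces[i - 1] = pieces[i - 1].rstrip()
    # Drop the pieces that were stripped down to nothing.
    return [p for p in pieces if p]


# With neighbor stripping, every second newline disappears.
print(len(tokenize_sketch("\n\n", "\n", strip_neighbors=True)))
print(len(tokenize_sketch("\n\n\n\n", "\n", strip_neighbors=True)))
# Without it, every newline survives.
print(len(tokenize_sketch("\n\n", "\n", strip_neighbors=False)))
```

Under these assumptions the sketch reproduces the pattern from the question: two newlines collapse to one and four collapse to two, because each kept newline strips the one right after it.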


Hi @courtneysprouse131

Thank you very much for your answer! I don't want to add it as a pad token, because then all of my padded sequences would contain an EOL at the end. In HuggingFace, there are reserved special tokens, so I did the following instead.

from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
print(tokenizer.encode("\n\n")) # [32100, 32100, 1] as expected!!

Thanks a lot for pointing me in the right direction!

Ahh, makes sense! Awesome, glad I could help!
