I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, such as the line ending (\n) and the tab (\t). Adding these tokens works, but the tokenizer seems to collapse repeated whitespace: the sequence "\n\n" is tokenized as a single line ending, "\n\n\n\n" as two line endings, and so on. See below to reproduce.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])
tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]
What is the reasoning behind this behaviour? Is it a bug, or is it related to how the tokenizer works? I noticed that this only happens for added whitespace tokens, not for other characters.
Is there a way to prevent the tokenizer from ignoring repeated whitespace?
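For context, one workaround I have been considering (my own idea, not something from the transformers docs; the sentinel name is an arbitrary placeholder) is to map each newline to an explicit sentinel string before encoding, so repeats cannot be collapsed:

```python
# Hypothetical workaround sketch: replace each "\n" with a sentinel
# string, which could then be registered as an ordinary added token.
NEWLINE_SENTINEL = "<newline>"  # arbitrary placeholder name

def protect_newlines(text: str) -> str:
    # Each "\n" becomes one sentinel, so "\n\n" yields two sentinels
    # instead of being collapsed into a single token.
    return text.replace("\n", NEWLINE_SENTINEL)

print(protect_newlines("\n\n"))  # prints "<newline><newline>"
```

I would then add the sentinel via tokenizer.add_tokens(["<newline>"]) and encode the mapped text, but I would prefer a solution that keeps the raw "\n", if one exists.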