Tokenizer ignores repeated whitespaces

berkayberabi · May 12, 2022, 10:44am

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, such as line ending (\n) and tab (\t). Adding these tokens works, but the tokenizer always ignores the second whitespace token. It tokenizes the sequence "\n\n" as a single line ending, the sequence "\n\n\n\n" as two line endings, and so on. See below to reproduce.

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])

tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]

What is the reasoning behind this behaviour? Is it a bug, or is it related to how the tokenizer works? I noticed that this only happens for added whitespace tokens, not for other characters.

Is there a way to prevent the tokenizer from ignoring the repeated whitespace?

courtneysprouse131 · May 16, 2022, 5:14pm

It looks like you have to add it as a special token instead

from tokenizers import AddedToken
tokenizer.add_special_tokens({"pad_token": AddedToken("\n")})

Then it should tokenize '\n' the way you want. This is a little hacky: your newline char is getting stripped because of how the tokenize method for the T5 tokenizer strips tokens by default. If you add it as a special token, it gets different stripping behavior that lets you keep the newline chars.
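
To illustrate why every second newline can disappear, here is a toy model in plain Python. It is not the transformers source. It only assumes that the tokenizer splits the text on added tokens and then strips whitespace from the pieces next to a plain-string added token, which would be consistent with the outputs reported above. The names split_keep, toy_tokenize and strip_neighbours are made up for this sketch.

```python
# Toy model only: NOT the transformers implementation.
# Assumption: pieces adjacent to a plain-string added token are
# whitespace-stripped, and empty pieces are dropped afterwards.

def split_keep(text, token):
    """Split text on token, keeping each occurrence of token as its own piece."""
    pieces = []
    parts = text.split(token)
    for idx, chunk in enumerate(parts):
        if chunk:
            pieces.append(chunk)
        if idx < len(parts) - 1:
            pieces.append(token)
    return pieces


def toy_tokenize(text, token="\n", strip_neighbours=True):
    """Return the surviving pieces after optional neighbour stripping."""
    pieces = split_keep(text, token)
    for i in range(len(pieces)):
        # A neighbour that was already stripped to "" no longer counts
        # as an occurrence of the added token.
        if pieces[i] != token:
            continue
        if strip_neighbours:
            if i + 1 < len(pieces):
                pieces[i + 1] = pieces[i + 1].lstrip()
            if i > 0:
                pieces[i - 1] = pieces[i - 1].rstrip()
    return [p for p in pieces if p]


print(len(toy_tokenize("\n")))          # 1
print(len(toy_tokenize("\n\n")))        # 1, the second newline is stripped away
print(len(toy_tokenize("\n\n\n\n")))    # 2
print(len(toy_tokenize("\n\n", strip_neighbours=False)))  # 2
```

Here strip_neighbours=False stands in for a token registered as AddedToken("\n"), whose lstrip and rstrip flags default to False, so each newline survives as its own token.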

berkayberabi · May 19, 2022, 1:54pm

Hi @courtneysprouse131

Thank you very much for your answer! I don't want to add it as a pad token, because then all of my padded sequences would contain EOL at the end. In Hugging Face, there are reserved special tokens, so I did the following.

from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
print(tokenizer.special_tokens_map)
print(tokenizer.encode("\n\n")) # [32100, 32100, 1] as expected!!

Thanks a lot for pointing me in the right direction!

courtneysprouse131 · May 19, 2022, 2:12pm

Ahh makes sense! Awesome glad I could help!
