T5Tokenizer adds a whitespace token after added special tokens

xianf · November 22, 2023, 6:44am

I am using the mT5 tokenizer and I want to add "\n" as a token to the original tokenizer. I added "additional_special_tokens": ["\n"] to tokenizer_config.json, but the output is not what I want.
The script looks like this:

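from transformers import AutoTokenizer
# model_path: path to the mT5 checkpoint whose tokenizer_config.json was edited
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
print(tokenizer.tokenize("你好\n啊", add_special_tokens=False))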

and the output is:

['▁', '你', '好', '\n', '▁', '啊']

I want to remove the '▁' token that appears right after '\n'.
Of course, I can do this with some extra Python code, but can it be done by the tokenizer itself?
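
For reference, the Python workaround I have in mind is roughly the sketch below. The helper name `drop_marker_after` is made up for illustration and is not part of transformers. It assumes the extra '▁' always shows up as a standalone token directly after '\n', as in the output above, and it would not handle a token such as '▁abc' where the marker is merged into the next piece.

```python
SPIECE_UNDERLINE = "\u2581"  # the '▁' marker SentencePiece uses for a leading space


def drop_marker_after(tokens, special="\n", marker=SPIECE_UNDERLINE):
    """Return a copy of `tokens` without a standalone `marker` that directly follows `special`.

    Hypothetical helper: it only post-processes a token list and does not
    change the tokenizer itself.
    """
    cleaned = []
    prev = None
    for tok in tokens:
        # Skip a bare marker only when the previous original token was the special token.
        if not (tok == marker and prev == special):
            cleaned.append(tok)
        prev = tok
    return cleaned


tokens = ["▁", "你", "好", "\n", "▁", "啊"]
print(drop_marker_after(tokens))  # ['▁', '你', '好', '\n', '啊']
```

This works on the token strings only. If I need input IDs, I would have to map the cleaned tokens back with `tokenizer.convert_tokens_to_ids`, which is why I would prefer a setting in the tokenizer.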
