I am using the mT5 tokenizer and I want to add "\n" as a token to the original tokenizer. I added "additional_special_tokens": ["\n"] to tokenizer_config.json, but the output is not what I want.
The script looks like this:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
print(tokenizer.tokenize("你好\n啊", add_special_tokens=False))
and the output is:
['▁', '你', '好', '\n', '▁', '啊']
I want to remove the '▁' token that appears after '\n'.
Of course, I can do this with some extra Python code after tokenizing. But can I do it with the tokenizer itself?
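For reference, the Python workaround I have in mind is roughly the following. It is only a minimal sketch: the helper name drop_marker_after_newline is made up for illustration, it works on the returned token list rather than on the tokenizer, and it assumes the '▁' comes out as a standalone token right after '\n' (as in the output above), not merged into the next piece.

```python
def drop_marker_after_newline(tokens, marker="▁", newline="\n"):
    """Drop standalone marker tokens that directly follow a newline token."""
    cleaned = []
    for token in tokens:
        # Skip a bare marker when the last kept token is the newline token.
        if token == marker and cleaned and cleaned[-1] == newline:
            continue
        cleaned.append(token)
    return cleaned


# Token list copied from the output shown in the question.
tokens = ['▁', '你', '好', '\n', '▁', '啊']
print(drop_marker_after_newline(tokens))
# → ['▁', '你', '好', '\n', '啊']
```

This only changes the list of token strings; if token ids are needed, they would have to be looked up again afterwards (for example with tokenizer.convert_tokens_to_ids), which is why I would prefer a tokenizer-level solution.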