I am adding { and } to the t5-base tokenizer with

tokenizer.add_tokens(['{', '}'])
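For reference, the complete minimal setup (a sketch assuming transformers with sentencepiece installed; I load the stock t5-base tokenizer via AutoTokenizer):

from transformers import AutoTokenizer

# Load the stock t5-base tokenizer and register the braces as new tokens.
tokenizer = AutoTokenizer.from_pretrained('t5-base')
num_added = tokenizer.add_tokens(['{', '}'])  # returns how many tokens were actually added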
This successfully adds the tokens, but with a slight problem: when I tokenize the following sentence, the space before { and } is gone:
>>> tokenizer.tokenize('hello { this is a sentence } bye')
['▁hello', '{', '▁this', '▁is', '▁', 'a', '▁sentence', '}', '▁by', 'e']
This results in an imperfect reconstruction of the sentence when decoding:
>>> tokenizer.batch_decode(tokenizer(["hello { this is a sentence } bye"])['input_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=False)
['hello{ this is a sentence} bye']
I also tried adding the tokens with different configurations of AddedToken, without success. Any ideas on how I can make { and } standalone words that keep the spaces around them?
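For concreteness, one of the AddedToken variants I tried looked roughly like this (a sketch; the exact lstrip/rstrip combinations varied across attempts):

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')

# lstrip/rstrip control whether whitespace to the left/right of the
# token is consumed when the added token is matched; single_word
# controls whether it can match inside a larger word.
tokenizer.add_tokens([
    AddedToken('{', lstrip=False, rstrip=False, single_word=False),
    AddedToken('}', lstrip=False, rstrip=False, single_word=False),
])

print(tokenizer.tokenize('hello { this is a sentence } bye'))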