I am adding { and } to the t5-base tokenizer with
tokenizer.add_tokens(['{', '}'])
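For reference, the full setup is nothing unusual (I'm loading the stock checkpoint via AutoTokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokenizer.add_tokens(['{', '}'])  # returns 2, i.e. both tokens were added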
This successfully adds the tokens, but with one problem: when I tokenize the following sentence, the space before { and } is lost (those two tokens are missing the ▁ prefix that SentencePiece uses to mark a preceding space):
>>> tokenizer.tokenize('hello { this is a sentence } bye')
['▁hello', '{', '▁this', '▁is', '▁', 'a', '▁sentence', '}', '▁by', 'e']
This results in an imperfect reconstruction of the sentence when decoding:
>>> tokenizer.batch_decode(tokenizer(["hello { this is a sentence } bye"])['input_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=False)
['hello{ this is a sentence} bye']
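Here's a self-contained round-trip check showing the mismatch (variable names are mine):

expected = 'hello { this is a sentence } bye'
ids = tokenizer(expected)['input_ids']
decoded = tokenizer.batch_decode([ids], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(decoded)              # 'hello{ this is a sentence} bye'
print(decoded == expected)  # False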
I also tried adding the tokens with various AddedToken configurations, without success. One of the variants I tried looked roughly like the following (the exact flag combinations are from memory):
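from transformers import AddedToken

# one of several flag combinations I tried; none preserved the spaces
tokenizer.add_tokens([
    AddedToken('{', single_word=True, lstrip=False, rstrip=False),
    AddedToken('}', single_word=True, lstrip=False, rstrip=False),
])

Any ideas on how I can make { and } their own words without losing the surrounding spaces?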