I am adding { and } to the t5-base tokenizer with

tokenizer.add_tokens(['{', '}'])
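For reference, the complete minimal setup (a sketch assuming transformers with sentencepiece installed; I load the stock t5-base tokenizer via AutoTokenizer):

from transformers import AutoTokenizer

# Load the stock t5-base tokenizer and register the braces as new tokens.
tokenizer = AutoTokenizer.from_pretrained('t5-base')
num_added = tokenizer.add_tokens(['{', '}'])  # returns how many tokens were actually added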
This successfully adds the tokens, but with a slight problem: when I tokenize the following sentence, the space before { and } is gone:
>>> tokenizer.tokenize('hello { this is a sentence } bye')
['▁hello', '{', '▁this', '▁is', '▁', 'a', '▁sentence', '}', '▁by', 'e']
This results in an imperfect reconstruction of the sentence when decoding:
>>> tokenizer.batch_decode(tokenizer(["hello { this is a sentence } bye"])['input_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=False)
['hello{ this is a sentence} bye']
I also tried adding the tokens with different configurations of AddedToken, without success. Any ideas on how I can make { and } standalone words that keep the spaces around them?
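For concreteness, one of the AddedToken variants I tried looked roughly like this (a sketch; the exact lstrip/rstrip combinations varied across attempts):

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')

# lstrip/rstrip control whether whitespace to the left/right of the
# token is consumed when the added token is matched; single_word
# controls whether it can match inside a larger word.
tokenizer.add_tokens([
    AddedToken('{', lstrip=False, rstrip=False, single_word=False),
    AddedToken('}', lstrip=False, rstrip=False, single_word=False),
])

print(tokenizer.tokenize('hello { this is a sentence } bye'))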