I use a RobertaTokenizer to tokenize sentences that contain French characters like é or ç. I need the generated tokens to keep the Ġ character and to have the French characters well formatted.
For instance, with the input
input = "3 allées paris, 75000"
[tokenizer.decode([token]) for token in tokenizer.encode(input)] outputs
['<s>', ' 3', ' all', 'ées', ' paris', ',', ' 7', '5000', '</s>']
so the Ġ characters are replaced by spaces. The raw tokens, on the other hand, are
['Ġ3', 'Ġall', 'Ã©es', 'Ġparis', ',', 'Ġ7', '5000']
so the Ġ is kept but the French characters are not well formatted (é shows up as Ã©).
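For context, RoBERTa's tokenizer is byte-level: each token is a string over a 256-character alphabet in which every UTF-8 byte is mapped to a printable character (the space byte becomes Ġ, and the two UTF-8 bytes of é become Ã and ©). Below is a minimal, self-contained sketch of that mapping and its inverse, assuming the standard GPT-2 byte-to-unicode table; the function names are mine, not part of the transformers API:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character
    (mirrors the table used by GPT-2/RoBERTa byte-level BPE)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # non-printable bytes get shifted code points
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Invert the table: printable character -> original byte value.
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def token_to_text(token):
    """Map each character of a token back to its byte, then decode UTF-8."""
    return bytes(BYTE_DECODER[c] for c in token).decode("utf-8")

print(repr(token_to_text("Ã©es")))  # 'ées'
print(repr(token_to_text("Ġ3")))    # ' 3' (Ġ is the mapped space byte)
```

In practice, tokenizer.convert_tokens_to_string([token]) should perform this inversion for you one token at a time, which keeps the French characters readable while letting you check for the leading space yourself.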
I used to do this, and it used to work:
inputs = self.tokenizer.encode_plus(input, return_tensors="pt")
ids = inputs['input_ids'].cpu().tolist()
clean_tokens = [self.tokenizer.decode([token]) for token in ids]
But for some reason I cannot understand, it no longer outputs the tokens with the Ġ character, and I cannot figure out what the breaking change was. Do you have any idea?