I use a RobertaTokenizer
to tokenize sentences that contain French characters like é or ç.
I need the generated tokens to keep the Ġ
character and have the French characters well formatted.
For instance, with the input
input = "3 allées paris, 75000"
[tokenizer.decode([token]) for token in tokenizer.encode(input)]
outputs ['<s>', ' 3', ' all', 'ées', ' paris', ',', ' 7', '5000', '</s>']
so the Ġ
characters are replaced by spaces.
And tokenizer.tokenize(input)
outputs ['Ġ3', 'Ġall', 'Ã©es', 'Ġparis', ',', 'Ġ7', '5000']
so the French characters are not well formatted.
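(For context, this rendering seems to come from the byte-level BPE that RoBERTa inherits from GPT-2: every UTF-8 byte is mapped to a printable unicode character, so a two-byte character like é shows up as two symbols in the raw token strings, and a leading space shows up as Ġ. A minimal sketch of that mapping, adapted from my understanding of the reference implementation:)

```python
def bytes_to_unicode():
    """Sketch of GPT-2/RoBERTa's byte-to-printable-character mapping."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remap non-printable bytes to unused code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()

# 'é' is two UTF-8 bytes (0xC3 0xA9), which render as 'Ã' and '©'.
print("".join(mapping[b] for b in "é".encode("utf-8")))  # Ã©

# A space (0x20) renders as 'Ġ'.
print(mapping[ord(" ")])  # Ġ
```

So tokenizer.tokenize shows the raw byte-level symbols, while decode converts them back to real text.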
I used to do this, and it used to work:
inputs = self.tokenizer.encode_plus(input, return_tensors="pt")
ids = inputs['input_ids'].cpu().tolist()
clean_tokens = [self.tokenizer.decode([token]) for token in ids[0]]
But for some reason I cannot understand, it no longer outputs the tokens with the Ġ characters, and I cannot figure out what the breaking change was.
Do you have any idea?