I use a RobertaTokenizer
to tokenize sentences that contain French characters like é or ç.
I need the generated tokens to keep the Ġ
character and have the French characters well formatted.
For instance, with the input
input = "3 allées paris, 75000"
[tokenizer.decode([token]) for token in tokenizer.encode(input)]
outputs ['<s>', ' 3', ' all', 'ées', ' paris', ',', ' 7', '5000', '</s>']
so the Ġ
characters are replaced by spaces.
And tokenizer.tokenize(input)
outputs ['Ġ3', 'Ġall', 'Ã©es', 'Ġparis', ',', 'Ġ7', '5000']
so the French characters are not well formatted.
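(For context, this rendering seems to come from the byte-level BPE that RoBERTa inherits from GPT-2: every UTF-8 byte is mapped to a printable unicode character, so a two-byte character like é shows up as two symbols in the raw token strings, and a leading space shows up as Ġ. A minimal sketch of that mapping, adapted from my understanding of the reference implementation:)

```python
def bytes_to_unicode():
    """Sketch of GPT-2/RoBERTa's byte-to-printable-character mapping."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remap non-printable bytes to unused code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()

# 'é' is two UTF-8 bytes (0xC3 0xA9), which render as 'Ã' and '©'.
print("".join(mapping[b] for b in "é".encode("utf-8")))  # Ã©

# A space (0x20) renders as 'Ġ'.
print(mapping[ord(" ")])  # Ġ
```

So tokenizer.tokenize shows the raw byte-level symbols, while decode converts them back to real text.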
I used to do this, and it used to work:
inputs = self.tokenizer.encode_plus(input, return_tensors="pt")
ids = inputs['input_ids'].cpu().tolist()
clean_tokens = [self.tokenizer.decode([token]) for token in ids[0]]
But for some reason I cannot understand, it no longer outputs the tokens with the Ġ characters, and I cannot figure out what the breaking change was.
Do you have any idea?