I have tokenised and normalised output produced by a RoBERTa-style model and would like to convert it back to its untokenised and unnormalised form.
The output texts look like this:
output_text="1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre"
And I would like to get the following output (I know it’s not good French, but that’s ok):
1. Louis aurait tous les corps ont jeter à se séparer l'un de l'autre
My tokeniser is stored in a local file (tokenizer.json). How can I detokenise and denormalise the text as described above?
Here is what I have tried so far:
from transformers import PreTrainedTokenizerFast

t = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
outids = t.convert_tokens_to_ids("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre".split())
t.decode(outids)
This returns "1. ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl'un Ġde Ġl'autre", i.e. it has not detokenised or denormalised at all.
I have also tried the convert_tokens_to_string function, but I get the following error:
File "/home/ME/miniconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 532, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
AttributeError: 'NoneType' object has no attribute 'decode'
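From the traceback it looks as if the decoder field of my tokenizer.json is null, so the fast tokeniser has nothing to decode with. One idea I am considering (this is an assumption on my part: decoders.ByteLevel comes from the tokenizers library, and I am guessing it matches the byte-level pretokenisation that produced the Ġ symbols):

```python
# Guess: the Ġ-marked tokens come from GPT-2-style byte-level
# pretokenisation, so the matching decoder would be ByteLevel
# from the tokenizers library (an assumption, not confirmed).
from tokenizers import decoders

byte_decoder = decoders.ByteLevel()
output_text = ("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter "
               "ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre")
print(byte_decoder.decode(output_text.split()))
```

If this produces the right text, presumably the same decoder could also be attached to the fast tokeniser (t.backend_tokenizer.decoder = decoders.ByteLevel()) so that t.decode and t.convert_tokens_to_string work too, but I have not verified this.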
I can quite easily write a custom function to detokenise based on the Ġ symbol, but I still have the problem of decoding the normalised text (e.g. recovering à from Ãł). The problem seems to lie in the byte-level pretokenisation. I have tried using the encode function in Python (with various different encodings) followed by decoding to UTF-8, and this does not work: I cannot see which encoding has been used, or whether it is even an encoding from which the text can be recovered.
I would be really grateful for any help!