I have tokenised and normalised output produced by a RoBERTa-style model and would like to convert it back to its untokenised and unnormalised form.
The output texts look like this:
output_text="1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre"
And I would like to get the following output (I know it’s not good French, but that’s ok):
1. Louis aurait tous les corps ont jeter à se séparer l'un de l'autre
My tokeniser is stored in a local file (tokenizer.json). How can I detokenise and denormalise the text as described above?
Here is what I have tried so far:
from transformers import PreTrainedTokenizerFast

t = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
outids = t.convert_tokens_to_ids("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre".split())
t.decode(outids)
This returns "1. ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl'un Ġde Ġl'autre", i.e. it has not detokenised or denormalised at all.
I have also tried the convert_tokens_to_string function, but I get the following error:
File "/home/ME/miniconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 532, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
AttributeError: 'NoneType' object has no attribute 'decode'
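From the traceback it looks as if the decoder field of my tokenizer.json is null, so the fast tokeniser has nothing to decode with. One idea I am considering (this is an assumption on my part: decoders.ByteLevel comes from the tokenizers library, and I am guessing it matches the byte-level pretokenisation that produced the Ġ symbols):

```python
# Guess: the Ġ-marked tokens come from GPT-2-style byte-level
# pretokenisation, so the matching decoder would be ByteLevel
# from the tokenizers library (an assumption, not confirmed).
from tokenizers import decoders

byte_decoder = decoders.ByteLevel()
output_text = ("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter "
               "ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre")
print(byte_decoder.decode(output_text.split()))
```

If this produces the right text, presumably the same decoder could also be attached to the fast tokeniser (t.backend_tokenizer.decoder = decoders.ByteLevel()) so that t.decode and t.convert_tokens_to_string work too, but I have not verified this.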
I can quite easily write a custom function to detokenise based on the Ġ symbol, but I still have the problem of decoding the normalised text (e.g. recovering à from Ãł). The problem seems to lie in the byte-level pretokenisation. I have tried using the encode function in Python (with various different encodings) followed by decoding to UTF-8, and this does not work: I cannot see which encoding has been used, or whether it is even an encoding from which the text can be recovered.
I would be really grateful for any help!