I have tokenised and normalised output produced by a RoBERTa-style model and would like to convert it back to its untokenised and unnormalised form.
The output texts look like this:
output_text="1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre"
And I would like to get the following output (I know it’s not good French, but that’s ok):
1. Louis aurait tous les corps ont jeter à se séparer l'un de l'autre
My tokeniser is stored in a local file (tokenizer.json). How can I detokenise and denormalise the text as described above?
Here is what I have tried so far:
from transformers import PreTrainedTokenizerFast

t = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
outids = t.convert_tokens_to_ids("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre".split())
t.decode(outids)
This returns "1. ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter ĠÃł Ġse ĠsÃ©parer Ġl'un Ġde Ġl'autre", i.e. it has not detokenised or denormalised at all.
I have also tried the convert_tokens_to_string function, but I get the following error:
File "/home/ME/miniconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 532, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
AttributeError: 'NoneType' object has no attribute 'decode'
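From the traceback it looks as if the decoder field of my tokenizer.json is null, so the fast tokeniser has nothing to decode with. One idea I am considering (this is an assumption on my part: decoders.ByteLevel comes from the tokenizers library, and I am guessing it matches the byte-level pretokenisation that produced the Ġ symbols):

```python
# Guess: the Ġ-marked tokens come from GPT-2-style byte-level
# pretokenisation, so the matching decoder would be ByteLevel
# from the tokenizers library (an assumption, not confirmed).
from tokenizers import decoders

byte_decoder = decoders.ByteLevel()
output_text = ("1 . ĠLouis Ġaurait Ġtous Ġles Ġcorps Ġont Ġjeter "
               "ĠÃł Ġse ĠsÃ©parer Ġl ' un Ġde Ġl ' autre")
print(byte_decoder.decode(output_text.split()))
```

If this produces the right text, presumably the same decoder could also be attached to the fast tokeniser (t.backend_tokenizer.decoder = decoders.ByteLevel()) so that t.decode and t.convert_tokens_to_string work too, but I have not verified this.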
I can quite easily write a custom function to detokenise based on the Ġ symbol, but I still have the problem of decoding the normalised text (e.g. recovering à from Ãł). The problem seems to lie in the byte-level pretokenisation. I have tried using the encode function in Python (with various different encodings) followed by decoding to UTF-8, and this does not work: I cannot see which encoding has been used, or whether it is even an encoding from which the text can be recovered.
I would be really grateful for any help!