Hi, I'm new to transformer models. When I run this code from https://huggingface.co/microsoft/xprophetnet-large-wiki100-cased-xglue-qg:

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
```

I get an `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 51: invalid start byte` error on the `ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')` line. Can anyone help? Thank you!
Slight hack to fix the decode error: I edited ProphetNetTokenizer itself, specifically the load_vocab method, and changed the encoding from 'utf-8' to 'latin-1'.
The edit is made here; I did it locally, so I'm not sure whether the same can be done in a Colab notebook.
This Stack Overflow page gave some suggestions for a fix, of which the latin-1 encoding was the one that worked for me.
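For anyone who'd rather not edit the installed library file directly, the change above can be sketched as a small monkey-patch. This is a hedged sketch, not the official fix: `load_vocab_latin1` is a hypothetical replacement that mirrors the usual shape of a vocab-loading helper, and the demo file just reproduces the offending `0xaf` byte.

```python
import collections
import tempfile
import os

def load_vocab_latin1(vocab_file):
    """Hypothetical replacement for the tokenizer's vocab loader:
    same token-per-line logic, but reads with latin-1 instead of utf-8.
    latin-1 maps every possible byte to a character, so it never raises
    a UnicodeDecodeError."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="latin-1") as reader:
        for index, token in enumerate(reader):
            vocab[token.rstrip("\n")] = index
    return vocab

# Demo: a tiny vocab file containing the offending 0xaf byte
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as f:
    f.write(b"[PAD]\n[SEP]\nhello\n\xaf\n")
    path = f.name

try:
    open(path, encoding="utf-8").read()  # raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print("utf-8 fails:", e.reason)      # "invalid start byte"

vocab = load_vocab_latin1(path)          # latin-1 succeeds
print(len(vocab))                        # 4 tokens loaded
os.remove(path)
```

If that works for you, you could point the tokenizer module at the replacement before calling `from_pretrained` (the exact module path depends on your transformers version). One caveat: latin-1 silently reinterprets the bytes rather than fixing them, so any token that was genuinely multi-byte UTF-8 will come out as mangled characters.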
Caveat: I haven’t tried using the tokenizer any further than just getting the ProphetNetTokenizer.from_pretrained() step working.
It gave a warning of "Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.", so there may be some extra steps needed afterwards.
The underlying issue is that the vocabulary file contains something that can't be decoded as utf-8. Hope this helps!