UnicodeDecodeError with xprophetnet-large-wiki100-cased-xglue-qg model

Hi, I’m new to transformer models, and when I run this code

```
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig

model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
```

from https://huggingface.co/microsoft/xprophetnet-large-wiki100-cased-xglue-qg, I get a `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 51: invalid start byte` error on the line `tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')`. Can anyone help? Thank you!

Slight hack to fix the decode error: I edited ProphetNetTokenizer itself, specifically the load_vocab function, and changed the encoding to ‘latin-1’.

The edit is made here; I did it locally, so I’m not sure whether the same thing can be done in a Colab notebook, though a runtime monkey-patch like the sketch below might work there too.
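If you’d rather not touch the installed source, here is a minimal sketch of the same idea as a runtime monkey-patch. It assumes the transformers 4.x module layout (`transformers.models.prophetnet.tokenization_prophetnet`); older releases keep `load_vocab` in `transformers.tokenization_prophetnet` instead. Since latin-1 maps every byte value to a character, the read itself can no longer fail:

```
import collections

import transformers.models.prophetnet.tokenization_prophetnet as prophetnet_tok


def load_vocab_latin1(vocab_file):
    """Same shape as the library's load_vocab, but decoding with latin-1."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="latin-1") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        vocab[token.rstrip("\n")] = index
    return vocab


# ProphetNetTokenizer looks up load_vocab in this module at call time,
# so replacing it here changes what from_pretrained uses.
prophetnet_tok.load_vocab = load_vocab_latin1

from transformers import ProphetNetTokenizer

tokenizer = ProphetNetTokenizer.from_pretrained(
    "microsoft/xprophetnet-large-wiki100-cased-xglue-qg"
)
```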

This Stack Overflow page gave some suggestions for a fix, of which switching to latin-1 encoding was the one that worked for me.

Caveat: I haven’t tried using the tokenizer any further than just getting the ProphetNetTokenizer.from_pretrained() step working.

It gave the warning "Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.", so there may be some extra steps needed afterwards.

The overall issue is that the vocabulary file for this checkpoint contains bytes that can’t be decoded as UTF-8, which is exactly what the error is complaining about; the sketch below is one way to confirm that on the downloaded file. Hope this helps!
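For reference, you can reproduce the decode failure outside the tokenizer by reading the raw vocab file yourself. This is just a diagnostic sketch, and it assumes the vocab file in the hub repo is named `prophetnet.tokenizer` (check the repo’s "Files and versions" tab if the download fails):

```
from huggingface_hub import hf_hub_download

# Assumption: the vocab file in this repo is called "prophetnet.tokenizer".
path = hf_hub_download(
    "microsoft/xprophetnet-large-wiki100-cased-xglue-qg", "prophetnet.tokenizer"
)

with open(path, "rb") as f:
    raw = f.read()

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # Same error as from_pretrained, e.g.
    # 'utf-8' codec can't decode byte 0xaf in position 51: invalid start byte
    print(err)
```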
