Hi, I'm new to transformer models. When I run this code from https://huggingface.co/microsoft/xprophetnet-large-wiki100-cased-xglue-qg:

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
```

I get an `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 51: invalid start byte` error on the `ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')` line. Can anyone help? Thank you!
Slight hack to fix the decode error: I edited ProphetNetTokenizer itself, specifically the load_vocab method, and changed the encoding from 'utf-8' to 'latin-1'.
The edit is made here; I did it locally, so I'm not sure whether the same can be done in a Colab notebook.
This Stack Overflow page gave some suggestions for a fix, of which the latin-1 encoding was the one that worked for me.
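For anyone who'd rather not edit the installed library file directly, the change above can be sketched as a small monkey-patch. This is a hedged sketch, not the official fix: `load_vocab_latin1` is a hypothetical replacement that mirrors the usual shape of a vocab-loading helper, and the demo file just reproduces the offending `0xaf` byte.

```python
import collections
import tempfile
import os

def load_vocab_latin1(vocab_file):
    """Hypothetical replacement for the tokenizer's vocab loader:
    same token-per-line logic, but reads with latin-1 instead of utf-8.
    latin-1 maps every possible byte to a character, so it never raises
    a UnicodeDecodeError."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="latin-1") as reader:
        for index, token in enumerate(reader):
            vocab[token.rstrip("\n")] = index
    return vocab

# Demo: a tiny vocab file containing the offending 0xaf byte
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as f:
    f.write(b"[PAD]\n[SEP]\nhello\n\xaf\n")
    path = f.name

try:
    open(path, encoding="utf-8").read()  # raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print("utf-8 fails:", e.reason)      # "invalid start byte"

vocab = load_vocab_latin1(path)          # latin-1 succeeds
print(len(vocab))                        # 4 tokens loaded
os.remove(path)
```

If that works for you, you could point the tokenizer module at the replacement before calling `from_pretrained` (the exact module path depends on your transformers version). One caveat: latin-1 silently reinterprets the bytes rather than fixing them, so any token that was genuinely multi-byte UTF-8 will come out as mangled characters.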
Caveat: I haven’t tried using the tokenizer any further than just getting the ProphetNetTokenizer.from_pretrained() step working.
It gave a warning of "Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.", so there may be some extra steps needed afterwards.
The underlying issue is that the vocabulary file contains something that can't be decoded as utf-8. Hope this helps!