I’ve been trying to generate paraphrases of input sentences with a pre-trained Pegasus model, using the most popular paraphrasing model on the Hugging Face model hub. However, I’m running into an index-out-of-range error, and what’s strange is that it only happens occasionally: most sentences are paraphrased correctly, but maybe one in every 100 hits the error. If I run:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer(text, truncation=True, padding='longest', max_length=200, return_tensors="pt")
translated = para_model.generate(**batch, max_length=200, num_beams=10, num_return_sequences=1, temperature=1.5)
I get the following error (when run on CPU):
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1911         # remove once script supports set_grad_enabled
   1912         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1913     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1914
   1915

IndexError: index out of range in self
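For context, this is the same IndexError that any nn.Embedding lookup raises when an id falls outside the table, which is a quick way to see what the message means (this toy snippet is mine, not from the traceback above):

import torch
import torch.nn as nn

# A table with 10 rows: valid ids are 0..9.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
emb(torch.tensor([3, 5]))    # fine: both ids are within range
emb(torch.tensor([3, 12]))   # IndexError: index out of range in self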
At first I assumed it was the tokenization scheme assigning indices beyond the shape of the embedding matrix. Strangely though, if I change the input text to:
text = [' (Chng et alt']
then the tokens are very similar: [143, 19152, 4652, 3256, 2700, 1] for the previous error-inducing input versus [143, 19152, 4652, 3256, 20913, 1] now, yet the model works fine. This seems backwards, since the input with the higher maximum token ID is the one that runs without error. So I’m a bit stuck. I don’t know whether the model internally generates out-of-vocabulary tokens, but that seems implausible given how popular the model is (it has been downloaded 80,000 times this month), so any help would be greatly appreciated.
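In case it helps, here is the sanity check I would use to rule out an encoder-side overflow, comparing the largest input token ID against the number of rows in the model’s input embedding matrix (this uses the standard get_input_embeddings() accessor from transformers; the check itself is my own sketch, not something from the model card):

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

# Tokenize the error-inducing input and inspect the ids directly.
batch = para_tokenizer([' (Chng et al'], return_tensors='pt')
num_rows = para_model.get_input_embeddings().weight.shape[0]

print('max input token id :', batch['input_ids'].max().item())
print('embedding rows     :', num_rows)
print('tokenizer vocab    :', para_tokenizer.vocab_size)

If the maximum id printed here is below the number of embedding rows (as the token lists above suggest it is), the overflow would have to be happening on the decoder side during generation rather than in the input encoding.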
Thank you very much!