Out of index error when using pre-trained Pegasus model

Hey everyone,

I’ve been trying to use a pre-trained Pegasus model to generate paraphrases of an input sentence, using the most popular paraphrasing model on the Hugging Face model hub. However, I’m running into an index out of range error, and what’s strange is that it only happens occasionally: most sentences are paraphrased correctly, but maybe one in every 100 triggers the error. If I run:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer(text, truncation=True, padding='longest', max_length=200,
                       return_tensors="pt")
translated = para_model.generate(**batch, max_length=200, num_beams=10,
                                 num_return_sequences=1, temperature=1.5)

I get the following error (when run on cpu):

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
    1911         # remove once script supports set_grad_enabled
    1912         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1913     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    1914 
    1915 

IndexError: index out of range in self   

At first I assumed the problem was the tokenization scheme, i.e. that it assigned indices beyond the shape of the embedding matrix. Strangely, however, if I change the input text to:

text = [' (Chng et alt']

then the tokens are very similar: [143, 19152, 4652, 3256, 2700, 1] for the previous error-inducing input and [143, 19152, 4652, 3256, 20913, 1] now, but the model works. This seems backwards, since the input with the higher maximum token id is the one that runs without errors. So I’m a bit stuck. I don’t know whether the model internally generates out-of-vocabulary tokens, but that seems implausible given how popular the model is (it’s been downloaded 80000 times this month). Any help would be greatly appreciated.

Thank you very much!

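It seems that the problem is related to the shape of the “embed_positions” layer of the decoder: generating with max_length=200 can apparently push the decoder past the positions that layer covers, which would also explain why only the occasional long output fails.
Your code snippet:

  • works fine (throws no error) with a max_length of 60 or less in ‘generate’
    and
  • also works fine with max_length=200 and a different Pegasus model, e.g. “sshleifer/distill-pegasus-xsum-16-4”
    (examples below)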

The two models differ in the shape of this layer:

tuner007/pegasus_paraphrase:
(decoder): PegasusDecoder(
(embed_tokens): Embedding(96103, 1024, padding_idx=0)
(embed_positions): PegasusSinusoidalPositionalEmbedding(60, 1024)

sshleifer/distill-pegasus-xsum-16-4:
(decoder): PegasusDecoder(
(embed_tokens): Embedding(96103, 1024, padding_idx=0)
(embed_positions): PegasusSinusoidalPositionalEmbedding(1024, 1024)
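
To illustrate why that shape matters, here is a minimal sketch in plain Python (no model download needed). The sizes and token ids are copied from this thread; `first_bad_position` is a hypothetical helper, and the sketch assumes the decoder looks up one position row per decoding step, numbered from 0. It is a rough model of the failure, not the actual transformers code.

```python
# Sizes taken from the module printouts above.
VOCAB_SIZE = 96103           # rows in embed_tokens (same for both models)
PARAPHRASE_POSITIONS = 60    # rows in embed_positions, tuner007/pegasus_paraphrase
XSUM_POSITIONS = 1024        # rows in embed_positions, sshleifer/distill-pegasus-xsum-16-4

# Token ids from the question: both inputs are far below VOCAB_SIZE,
# so the token embedding lookup is not what goes out of range.
failing_ids = [143, 19152, 4652, 3256, 2700, 1]
working_ids = [143, 19152, 4652, 3256, 20913, 1]
assert max(failing_ids) < VOCAB_SIZE and max(working_ids) < VOCAB_SIZE


def first_bad_position(num_decoder_steps, num_positions):
    """Return the first position index with no row in the table, or None."""
    # Assumption: positions are numbered 0, 1, 2, ... one per decoding step.
    for pos in range(num_decoder_steps):
        if pos >= num_positions:
            return pos
    return None


print(first_bad_position(200, PARAPHRASE_POSITIONS))  # 60
print(first_bad_position(200, XSUM_POSITIONS))        # None
print(first_bad_position(60, PARAPHRASE_POSITIONS))   # None
```

Under that assumption, a generation that stops early never reaches position 60, which would fit the error showing up for only a small fraction of inputs.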

tuner007/pegasus_paraphrase, max_length=60, config.max_position_embeddings=60 :

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer(text, truncation=True, padding='longest', max_length=200,
                       return_tensors="pt")
translated = para_model.generate(**batch, 
                                 #max_length=200, 
                                 max_length=60,
                                 num_beams=10, 
                                 num_return_sequences=1, 
                                 temperature=1.5)

print(para_model.config.max_position_embeddings)

sshleifer/distill-pegasus-xsum-16-4, max_length=200, config.max_position_embeddings=1024 :

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = "sshleifer/distill-pegasus-xsum-16-4" 
para_tokenizer_s = PegasusTokenizer.from_pretrained(para_model_name)
para_model_s = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer_s(text, truncation=True, padding='longest', max_length=200,
                         return_tensors="pt")
translated = para_model_s.generate(**batch, max_length=200, num_beams=10,
                                   num_return_sequences=1, temperature=1.5)

print(para_model_s.config.max_position_embeddings)
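
If you want to keep using tuner007/pegasus_paraphrase, one possible workaround is to cap max_length at the model's position limit before calling generate. The sketch below is only that idea in isolation: `capped_max_length` is a hypothetical helper (not part of transformers), and the commented usage assumes the `para_model` and `batch` variables from the snippets above.

```python
def capped_max_length(requested, max_position_embeddings):
    """Clamp a requested generation length to the model's position limit."""
    if requested < 1:
        raise ValueError("requested length must be positive")
    return min(requested, max_position_embeddings)


# Usage with the models from this thread (needs transformers and a model download):
# limit = para_model.config.max_position_embeddings
# translated = para_model.generate(**batch,
#                                  max_length=capped_max_length(200, limit),
#                                  num_beams=10,
#                                  num_return_sequences=1)

print(capped_max_length(200, 60))    # 60
print(capped_max_length(200, 1024))  # 200
```

Reading the limit from the config rather than hard-coding 60 means the same call should work unchanged for both models discussed here.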

Amazing! I just couldn’t figure out what was wrong before, but that makes sense. Thank you so much!