Out of index error when using pre-trained Pegasus model

adianl · March 30, 2021, 5:10am

Hey everyone,

I’ve been trying to use a pre-trained Pegasus model to generate paraphrases of an input sentence, using the most popular paraphrasing model on the Hugging Face model hub. However, I’m running into an index-out-of-range error, and strangely it only happens occasionally: most sentences are paraphrased correctly, but roughly one in every 100 triggers the error. If I run:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer(text, truncation=True, padding='longest', max_length=200,
                       return_tensors="pt")
translated = para_model.generate(**batch, max_length=200, num_beams=10,
                                 num_return_sequences=1, temperature=1.5)

I get the following error (when run on cpu):

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
    1911         # remove once script supports set_grad_enabled
    1912         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1913     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    1914 
    1915 

IndexError: index out of range in self

At first I assumed the tokenizer was assigning indices beyond the size of the embedding matrix. Strangely, however, if I change the input text to:

text = [' (Chng et alt']

then the tokens are very similar: [143, 19152, 4652, 3256, 2700, 1] for the error-inducing input and [143, 19152, 4652, 3256, 20913, 1] for the new one, yet the model now works. This seems backwards, since the input with the higher maximum token id is the one that runs without error. So I’m a bit stuck. I don’t know whether the model internally generates out-of-vocabulary tokens, but that seems implausible given how popular the model is (it’s been downloaded 80000 times this month). Any help would be greatly appreciated.

Thank you very much!

elsanns · March 31, 2021, 3:28pm

The problem seems to be related to the shape of the “embed_positions” layer of the decoder.

Your code snippet runs without error when max_length in ‘generate’ is 60 or less. It also runs with max_length=200 on a different Pegasus model, e.g. “sshleifer/distill-pegasus-xsum-16-4” (examples below).

The two models differ in the shape of this layer. The paraphrase model's decoder has only 60 position embeddings, which would explain why the error shows up only for the occasional input whose generated output runs past 60 tokens:

tuner007/pegasus_paraphrase:
(decoder): PegasusDecoder(
  (embed_tokens): Embedding(96103, 1024, padding_idx=0)
  (embed_positions): PegasusSinusoidalPositionalEmbedding(60, 1024)

sshleifer/distill-pegasus-xsum-16-4:
(decoder): PegasusDecoder(
  (embed_tokens): Embedding(96103, 1024, padding_idx=0)
  (embed_positions): PegasusSinusoidalPositionalEmbedding(1024, 1024)
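Both decoders have the same embed_tokens shape, so the token ids themselves can probably be ruled out as the cause. A minimal sketch of that check, using the ids quoted in the original post and the vocabulary size from the printouts above (`ids_within_vocab` is a hypothetical helper name for illustration, not part of transformers):

```python
def ids_within_vocab(token_ids, vocab_size):
    """Return True if every token id is a valid row index of the embedding table."""
    return all(0 <= token_id < vocab_size for token_id in token_ids)


VOCAB_SIZE = 96103  # rows of embed_tokens in both printouts above

failing_ids = [143, 19152, 4652, 3256, 2700, 1]   # ' (Chng et al'
working_ids = [143, 19152, 4652, 3256, 20913, 1]  # ' (Chng et alt'

print(ids_within_vocab(failing_ids, VOCAB_SIZE))  # True
print(ids_within_vocab(working_ids, VOCAB_SIZE))  # True
```

Both inputs pass, so the out-of-range index has to come from a different embedding lookup, which points at embed_positions.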

tuner007/pegasus_paraphrase, max_length=60, config.max_position_embeddings=60 :

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = 'tuner007/pegasus_paraphrase'
para_tokenizer = PegasusTokenizer.from_pretrained(para_model_name)
para_model = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer(text, truncation=True, padding='longest', max_length=200,
                       return_tensors="pt")
translated = para_model.generate(**batch, 
                                 #max_length=200, 
                                 max_length=60,
                                 num_beams=10, 
                                 num_return_sequences=1, 
                                 temperature=1.5)

print(para_model.config.max_position_embeddings)

sshleifer/distill-pegasus-xsum-16-4, max_length=200, config.max_position_embeddings=1024 :

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

para_model_name = "sshleifer/distill-pegasus-xsum-16-4" 
para_tokenizer_s = PegasusTokenizer.from_pretrained(para_model_name)
para_model_s = PegasusForConditionalGeneration.from_pretrained(para_model_name)

text = [' (Chng et al']
batch = para_tokenizer_s(text, truncation=True, padding='longest', max_length=200,
                         return_tensors="pt")
translated = para_model_s.generate(**batch, max_length=200, num_beams=10,
                                   num_return_sequences=1, temperature=1.5)

print(para_model_s.config.max_position_embeddings)
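If you want to keep one call site for both models, one option is to cap the requested length at whatever the config reports. This is only a sketch: `clamp_max_length` and `generate_within_limit` are hypothetical helper names, and it assumes `config.max_position_embeddings` matches the size of the decoder's position table, as it does for the two models above (60 and 1024).

```python
def clamp_max_length(requested_max_length, max_position_embeddings):
    """Cap the generation length at the size of the decoder's position table."""
    if max_position_embeddings is None:
        # No limit known for this config; leave the request unchanged.
        return requested_max_length
    return min(requested_max_length, max_position_embeddings)


def generate_within_limit(model, batch, requested_max_length=200, **generate_kwargs):
    """Call model.generate with max_length capped by config.max_position_embeddings."""
    limit = getattr(model.config, "max_position_embeddings", None)
    max_length = clamp_max_length(requested_max_length, limit)
    return model.generate(**batch, max_length=max_length, **generate_kwargs)


print(clamp_max_length(200, 60))    # limit reported by tuner007/pegasus_paraphrase -> 60
print(clamp_max_length(200, 1024))  # limit reported by sshleifer/distill-pegasus-xsum-16-4 -> 200
```

With the objects from the examples above, the call would look like `generate_within_limit(para_model, batch, num_beams=10, num_return_sequences=1)`. The trade-off is that outputs from the paraphrase model are cut off at 60 tokens instead of raising an error.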

adianl · April 1, 2021, 12:39pm

Amazing! I just couldn’t figure out what was wrong before, but that makes sense. Thank you so much!
