Out of index error in pipeline

Hi,

I tried this on both the downloaded pretrained Pegasus model ('google/pegasus-xsum') and on a model I fine-tuned from it.

But when trying to predict on some text I get IndexError: index out of range in self.

Not sure what to tweak. For my fine-tuned model I tried setting
max_length=512, truncation=True, padding=True, but it didn't help.

Error from original downloaded model:

tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum').to(device)


pipe_tuned = pipeline(
    "summarization", 
    model=model, 
    tokenizer=tokenizer,
    device=device_id)

pipe_tuned('some long text')

Any recommendations to try?

Hi,

You may need to specify the arguments "max_length", "padding" and "truncation" in the tokenizer.

I tried tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum', max_length=512, truncation=True, padding=True), but I get the same error.

hi @yulgm,

I had the same error before.

When I debugged the encoding's data, I found the tensor contained a -1 value.

So check whether the values are loaded within the correct range.

Or, change padding=True to padding='max_length'?
It seems that True and 'max_length' behave slightly differently.

Regards.

I tried it, but it didn't work. I tried one of the failing texts and got this error: "Token indices sequence length is longer than the specified maximum sequence length for this model (753 > 512). Running this sequence through the model will result in indexing errors". Does this mean truncation doesn't work?

So I’m wondering if it is a bug or intended behaviour.

I tested passing truncation and padding into the definition of the tokenizer vs. when applying it.

When adding the parameters to the definition directly:

tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum', model_max_length=512, max_length=512, truncation=True, padding=True)

Then I tested on my long text and got the following results:

len(tokenizer(text).input_ids)
→ returns 713

BUT

len(tokenizer(text, max_length=512, truncation=True, padding=True).input_ids)
→ returns 512.

So settings passed in the tokenizer's definition aren't kept? But how do I pass them into a pipeline then? Is it not possible?

My failing pipeline is:
pipeline("summarization", model=model, tokenizer=tokenizer, device=device_id)

In the HF docs, the example looks like this:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = "google/pegasus-xsum"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert (
    tgt_text[0]
    == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
)

So, if you want to truncate your text,
pass truncation=True to tokenizer(), not to from_pretrained().

Try this code. It shows the input text being correctly truncated.

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

text_over_512 = "i" * 756
text_under_112 = "i" * 112
print(len(text_over_512))
print(len(text_under_112))

encoding_1 = tokenizer(text_over_512, truncation=True, padding="max_length")
encoding_2 = tokenizer(text_under_112, truncation=True, padding="max_length")

print(f"text_over_512's input_ids length : {len(encoding_1['input_ids'])}")
print(f"text_under_112's input_ids length : {len(encoding_2['input_ids'])}")

(screenshot of the output: both input_ids lengths print as 512)


So this means I have to use tokenizer.batch_decode(translated, skip_special_tokens=True) to make predictions, correct? I cannot use the pipeline?

I tried to reproduce the error in your environment.

Declare the pipeline with just the model and tokenizer,

then add the truncation parameter when you call the pipeline.

The pipeline's argument parser forwards such extra arguments to the tokenizer.

Summarization pipeline example:

from transformers import pipeline, PegasusTokenizer, PegasusForConditionalGeneration


# Summarization pipeline, passing in a specific model and tokenizer
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

summarize_pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
str_input_ = "A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network."

# make a long input
str_input_over512 = str_input_ * 4

summarize_pipe(str_input_over512, truncation=True)

Output:

[{'summary_text': 'A Highway Network is an architecture designed to ease gradient-based training of very deep networks.'}]

Thank you very much! This worked for my code.
