I tried it, but it didn’t work. I ran one of the failing texts and got this error: “Token indices sequence length is longer than the specified maximum sequence length for this model (753 > 512). Running this sequence through the model will result in indexing errors”. Does this mean truncation doesn’t work?
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
src_text = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]
model_name = "google/pegasus-xsum"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert (
    tgt_text[0]
    == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
)
So, if you want to truncate your text, pass truncation=True to tokenizer(), not to from_pretrained().
Try this code. It shows that the input text is correctly truncated.
from transformers import PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
text_over_512 = "i" * 756
text_under_112 = "i" * 112
print(len(text_over_512))
print(len(text_under_112))
#
encoding_1 = tokenizer(text_over_512, truncation=True, padding="max_length")
encoding_2 = tokenizer(text_under_112, truncation=True, padding="max_length")
print(f"text_over_512's input_ids length: {len(encoding_1['input_ids'])}")
print(f"text_under_112's input_ids length: {len(encoding_2['input_ids'])}")
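Conceptually, what truncation=True does is clip the tokenized sequence down to the model's maximum length (512 for google/pegasus-xsum) while keeping the end-of-sequence token at the end. Here is a minimal pure-Python sketch of that idea — this is an illustration, not the actual transformers implementation, and the EOS id of 1 is just Pegasus's convention:

```python
# Sketch of truncation: clip a token-id sequence to model_max_length,
# reserving the last slot for the end-of-sequence (EOS) token.
# (Illustrative only -- not the real transformers code.)

def truncate_ids(token_ids, model_max_length=512, eos_id=1):
    """Return token_ids unchanged if short enough, else clip to model_max_length."""
    if len(token_ids) <= model_max_length:
        return token_ids
    # keep the first model_max_length - 1 ids, then append EOS
    return token_ids[:model_max_length - 1] + [eos_id]

long_ids = list(range(753))   # pretend tokenization produced 753 ids, as in the error
short_ids = list(range(100))

print(len(truncate_ids(long_ids)))   # 512
print(len(truncate_ids(short_ids)))  # 100
```

So the warning in the question is printed when the tokenizer is called without truncation; with truncation=True the sequence never exceeds 512 ids.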
from transformers import pipeline, PegasusTokenizer, PegasusForConditionalGeneration
# Summarization pipeline, passing in a specific model and tokenizer
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
summarize_pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
str_input = "A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network."
# make a long input (well over 512 tokens)
str_input_over512 = str_input * 4
summarize_pipe(str_input_over512, truncation=True)
Output:
[{'summary_text': 'A Highway Network is an architecture designed to ease gradient-based training of very deep networks.'}]