I’m getting this warning, and its meaning is clear:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (2215 > 2048)
```
But I don’t understand why I’m getting it, since I’m explicitly passing a tokenizer to the pipeline:

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm import tqdm

pp = pipeline(task="text-generation", model="awesome_model", tokenizer="awesome_model")
results = [out for out in tqdm(pp(KeyDataset(self.ds['test'], "text")))]
```
Earlier, during training, I tokenize the dataset like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(self.pretrained_model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenized_ds = self.ds.map(
    lambda x: tokenizer(x['text'], max_length=700, truncation=True),
    batched=True,
)
```
How should I run predictions on the test set so that the inputs are truncated the same way they were during training?
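For reference, here is the workaround I’m considering: tokenizing manually with the same `truncation`/`max_length` settings I used in training and calling `generate` directly, instead of relying on the pipeline to truncate (which, as far as I can tell, it does not do by default). This is only a sketch; it uses `sshleifer/tiny-gpt2`, a tiny public model, as a stand-in for my `awesome_model`, and a repeated string in place of my dataset’s `text` column:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"  # stand-in for "awesome_model"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
tok.pad_token = tok.eos_token  # same pad-token setup as in training

# A prompt far longer than the model's context window, standing in for ds['test']['text'].
text = "hello world " * 2000

# Truncate exactly as during training: max_length=700, truncation=True.
enc = tok(text, truncation=True, max_length=700, return_tensors="pt")

with torch.no_grad():
    gen = model.generate(**enc, max_new_tokens=20, pad_token_id=tok.eos_token_id)

# Decode only the newly generated tokens, dropping the (truncated) prompt.
completion = tok.decode(gen[0][enc["input_ids"].shape[1]:])
```

This guarantees the model never sees more than 700 prompt tokens, at the cost of bypassing the pipeline’s batching conveniences.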