Truncating sequence -- within a pipeline

AlanFeder · July 16, 2020, 11:25pm

Hi all,

Thanks for making this forum!

I have a list of texts, one of which happens to be 516 tokens long. I have been using the feature-extraction pipeline to process the texts, created with the simple call:

nlp = pipeline('feature-extraction')

When it gets up to the long text, I get an error:

Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors

Alternatively, if I use the sentiment-analysis pipeline (created by nlp2 = pipeline('sentiment-analysis')), I do not get the error.

Is there a way for me to put an argument in the pipeline function to make it truncate at the max model input length? I tried reading this, but I was not sure how to keep everything else in the pipeline the same/default, except for this truncation.

AlanFeder · July 20, 2020, 7:50pm

One quick follow-up – I just realized that the message earlier is just a warning, and not an error, which comes from the tokenizer portion. I then get an error on the model portion:

IndexError: index out of range in self

So I have two questions:

Is there a way to just add an argument somewhere that does the truncation automatically?
Is there a way for me to split out the tokenizer/model, truncate in the tokenizer, and then run that truncated in the model?

Thank you!
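The warning-then-error sequence described above is consistent with how BERT-style models look up position embeddings: the table has a fixed number of rows, so any position past the end cannot be indexed. The following is only a toy illustration in plain Python, not the library's code; the table size of 512 is taken from the warning message, and `position_table`, `lookup_positions`, and `truncate` are hypothetical names.

```python
MAX_POSITIONS = 512  # assumed table size, taken from the "(516 > 512)" warning

# Toy stand-in for a learned position-embedding table: one row per position.
position_table = [[float(i)] for i in range(MAX_POSITIONS)]


def lookup_positions(token_ids):
    """Return one position row per token; raises IndexError past the table."""
    return [position_table[pos] for pos in range(len(token_ids))]


def truncate(token_ids, max_length=MAX_POSITIONS):
    """Keep only the first max_length token ids."""
    return token_ids[:max_length]


token_ids = list(range(516))  # a 516-token input, as in the warning

try:
    lookup_positions(token_ids)
    failed = False
except IndexError:
    failed = True  # position 512 has no row in the table

# After truncation every position has a row, so the lookup succeeds.
embeddings = lookup_positions(truncate(token_ids))
print(failed, len(embeddings))
```

A real tokenizer called with truncation=True also keeps the special tokens in place rather than slicing blindly, so this only shows why the length limit matters.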

cx00 · April 12, 2021, 2:59pm

Hello, have you found a solution to this? I have also come across this problem and haven’t found a solution.

AlanFeder · April 13, 2021, 10:15pm

I have not – I just moved out of the “pipeline” framework, and used the building blocks. It wasn’t too bad
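A minimal sketch of what the building-block approach could look like for feature extraction. This is not the poster's actual code: the checkpoint name "distilbert-base-uncased" and the helper names `load_building_blocks` and `extract_features` are assumptions for illustration, and the PyTorch backend is assumed.

```python
def load_building_blocks(model_name="distilbert-base-uncased"):
    """Load a tokenizer and a bare encoder (checkpoint name is an assumption)."""
    # Imported lazily so extract_features can be exercised without the library.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model


def extract_features(text, tokenizer, model, max_length=512):
    """Tokenize with truncation, run the model, return per-token hidden states."""
    inputs = tokenizer(
        text,
        truncation=True,        # cut the input down to max_length tokens
        max_length=max_length,
        return_tensors="pt",    # PyTorch tensors
    )
    outputs = model(**inputs)
    # Shape is (batch, tokens, hidden_size); detach drops the autograd graph.
    return outputs.last_hidden_state.detach()
```

Usage would be along the lines of tokenizer, model = load_building_blocks() followed by features = extract_features(long_text, tokenizer, model), where long_text is your own string.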

cx00 · April 27, 2021, 3:40pm

I see, do you think you could share a snippet of how you did that? I am unsure of what to do with the output of a model. Thanks again!

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pt_batch = tokenizer(
    "We are very happy to show you the 🤗 Transformers library.",
    truncation=True,
    max_length=10,
    return_tensors="pt"
)

pt_outputs = pt_model(**pt_batch)

pt_outputs

Which gives…

SequenceClassifierOutput(loss=None, logits=tensor([[-4.2644, 4.6002]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

AlanFeder · April 27, 2021, 7:15pm

Try adding something like the following:

from torch.nn import Softmax

smax = Softmax(dim=-1)

probs0 = smax(pt_outputs.logits)
probs0 = probs0.flatten().detach().numpy()

prob_pos = probs0[1]

Now prob_pos should be the probability that the sentence is positive.

Does that work/make sense?
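To see what the softmax step does to the logits printed earlier in the thread, here is a small worked example in plain Python, with no torch needed. The `softmax` helper is written out by hand for illustration, and the label order (index 0 negative, index 1 positive) follows the post above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput shown above.
logits = [-4.2644, 4.6002]
probs = softmax(logits)
prob_neg, prob_pos = probs

print(f"negative={prob_neg:.4f} positive={prob_pos:.4f}")
```

With these logits the positive class ends up with almost all of the probability mass, which matches the clearly positive example sentence.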

ttj · July 15, 2021, 11:53am

Passing truncation=True in __call__ seems to suppress the error.
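A minimal sketch of the call-time approach, assuming a pipeline whose __call__ forwards tokenizer keyword arguments (the text-classification pipeline does this in recent transformers releases; behavior in older versions may differ). The helper names `build_sentiment_pipeline` and `classify_truncated` are hypothetical.

```python
def build_sentiment_pipeline():
    """Create the default sentiment-analysis pipeline used in the thread."""
    # Imported lazily so classify_truncated can be exercised without the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis")


def classify_truncated(pipe, texts, max_length=512):
    """Run the pipeline with tokenizer truncation enabled at call time."""
    # truncation and max_length are passed through to the tokenizer.
    return pipe(texts, truncation=True, max_length=max_length)
```

Usage would be along the lines of nlp2 = build_sentiment_pipeline() followed by classify_truncated(nlp2, my_texts), where my_texts is your own list of strings.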

Mohit2112 · May 3, 2024, 5:59pm

Just add truncation=True when you create your pipeline:
pipe = pipeline("text-classification", max_length=512, truncation=True)
