Truncating sequence -- within a pipeline

Hi all,

Thanks for making this forum!

I have a list of texts, one of which happens to be 516 tokens long. I have been using the feature-extraction pipeline to process the texts, just using the simple function:

from transformers import pipeline
nlp = pipeline('feature-extraction')

When it gets up to the long text, I get an error:

Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors

Alternatively, if I use the sentiment-analysis pipeline (created by nlp2 = pipeline('sentiment-analysis')), I do not get the error.

Is there a way for me to put an argument in the pipeline function to make it truncate at the model's maximum input length? I tried reading this, but I was not sure how to keep everything else in the pipeline the same/default, except for the truncation.


One quick follow-up – I just realized that the message above is just a warning coming from the tokenizer, not an error. I then get an error from the model portion:

IndexError: index out of range in self

So I have two questions:

  1. Is there a way to just add an argument somewhere that does the truncation automatically?
  2. Is there a way for me to split out the tokenizer/model, truncate in the tokenizer, and then run the truncated output through the model?

Thank you!


Hello, have you found a solution to this? I have also come across this problem and haven't found one.

I have not – I just moved out of the “pipeline” framework, and used the building blocks. It wasn’t too bad :smiley:


I see, do you think you could share a snippet of how you did that? I am unsure of what to do with the output of a model. Thanks again!

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the text, truncating to at most max_length tokens
pt_batch = tokenizer(
    "We are very happy to show you the 🤗 Transformers library.",
    truncation=True,
    max_length=10,
    return_tensors="pt"
)

pt_outputs = pt_model(**pt_batch)

pt_outputs

Which gives…

SequenceClassifierOutput(loss=None, logits=tensor([[-4.2644, 4.6002]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

Try adding something like the following:

from torch.nn import Softmax

# Softmax over the last dimension turns the two logits into probabilities
smax = Softmax(dim=-1)

probs0 = smax(pt_outputs.logits)
probs0 = probs0.flatten().detach().numpy()

# Index 0 is the negative-class probability, index 1 the positive-class probability
prob_pos = probs0[1]

Now prob_pos should be the probability that the sentence is positive.

Does that work/make sense?

Passing truncation=True in __call__ seems to suppress the error.
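
In other words, something along these lines (a minimal sketch; long_text is a stand-in for the 516-token string, and exactly how the kwarg is forwarded to the tokenizer may depend on your transformers version):

from transformers import pipeline

nlp = pipeline('feature-extraction')

# Ask the pipeline's tokenizer to truncate the input to the model's maximum length (512)
features = nlp(long_text, truncation=True)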

Just add “truncation=True” in your pipeline initialization:
pipe = pipeline("text-classification", max_length=512, truncation=True)
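
As a usage sketch (long_text again being a placeholder for an over-length input), the pipeline should then truncate to 512 tokens instead of hitting the IndexError:

result = pipe(long_text)
print(result)  # a list of dicts like [{'label': ..., 'score': ...}]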