I have a list of texts, one of which happens to be 516 tokens long. I have been processing the texts with the feature-extraction pipeline, just using the simple function:
nlp = pipeline('feature-extraction')
When it gets up to the long text, I get an error:
Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors
Alternatively, if I use the sentiment-analysis pipeline (created by nlp2 = pipeline('sentiment-analysis')), I do not get the error.
Is there a way for me to put an argument in the pipeline function to make it truncate at the model's max input length? I tried reading this, but I was not sure how to keep everything else in the pipeline the same/default, except for this truncation.
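Something like this is what I'm hoping exists — I'm not sure whether truncation is actually forwarded to the tokenizer when passed at call time like this:

```python
from transformers import pipeline

# Sketch of what I'm hoping for; I am not certain the feature-extraction
# pipeline forwards `truncation` to its tokenizer like this.
nlp = pipeline('feature-extraction')

long_text = "word " * 600  # long enough to exceed the 512-token limit
features = nlp(long_text, truncation=True)
```
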
One quick follow-up – I just realized that the message above is only a warning, not an error, and it comes from the tokenizer portion. I then get an actual error from the model portion:
IndexError: index out of range in self
So I have two questions:
Is there a way to just add an argument somewhere that does the truncation automatically?
Is there a way for me to split out the tokenizer and model, truncate in the tokenizer, and then run the truncated output through the model?
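For the second question, here is roughly what I have in mind — does this look right? (I'm assuming distilbert-base-cased here, which I believe is the feature-extraction pipeline's default model, but I'm not sure; substitute whatever model applies.)

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Assumption: distilbert-base-cased stands in for whatever model the
# feature-extraction pipeline loads by default.
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

long_text = "word " * 600  # longer than the 512-token limit

# Truncate in the tokenizer...
inputs = tokenizer(
    long_text,
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)

# ...then run the truncated ids through the model.
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state  # shape: (1, <=512, hidden_size)
```
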