Zero-Shot Classification Pipeline - Truncating


Is it possible to specify arguments for truncating and padding the text input to a certain length when using the transformers pipeline for zero-shot classification?

For instance, if I am using the following:
classifier = pipeline("zero-shot-classification", device=0)

Do I need to first specify those arguments, such as truncation=True, padding='max_length', max_length=256, etc., in the tokenizer / config, and then pass it to the pipeline?

Thank you in advance

hey @valkyrie the pipelines in transformers call a _parse_and_tokenize function that automatically takes care of padding and truncation - see here for the zero-shot example.

so the short answer is that you shouldn’t need to provide these arguments when using the pipeline. do you have a special reason to want to do so?

Hey @lewtun, the reason I wanted to specify those is that I am comparing against other text classification methods, like DistilBERT and BERT for sequence classification, where I have set the maximum length parameter (and therefore the length to truncate and pad to) to 256 tokens. I wanted to do the same for zero-shot classification, and I was also hoping it would make things more efficient.

hey @valkyrie i had a bit of a closer look at the _parse_and_tokenize function of the zero-shot pipeline and indeed it seems that you cannot specify the max_length parameter for the tokenizer.

so if you really want to change this, one idea could be to subclass ZeroShotClassificationPipeline and then override _parse_and_tokenize to include the parameters you’d like to pass to the tokenizer’s __call__ method.
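to make that concrete, a subclass along these lines might work. this is a rough sketch against the pipeline internals discussed in this thread (the exact `_parse_and_tokenize` signature varies between transformers releases, and newer versions refactored this logic into `preprocess`, so double-check against your installed version):

```python
# Rough sketch, assuming the _parse_and_tokenize-based pipeline internals
# discussed in this thread: override it so the tokenizer always truncates
# and pads to 256 tokens.
from transformers import ZeroShotClassificationPipeline

class TruncatingZeroShotPipeline(ZeroShotClassificationPipeline):
    def _parse_and_tokenize(self, sequences, candidate_labels, hypothesis_template, **kwargs):
        # _args_parser builds the (premise, hypothesis) pairs fed to the NLI model
        sequence_pairs = self._args_parser(sequences, candidate_labels, hypothesis_template)
        return self.tokenizer(
            sequence_pairs,
            add_special_tokens=True,
            return_tensors=self.framework,
            padding="max_length",  # pad every example up to max_length
            truncation=True,       # cut anything longer than max_length
            max_length=256,
        )
```

you could then instantiate the subclass directly with your model and tokenizer, the same way you would the stock pipeline.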

hope that helps!


I will try that, thank you!
