Is it possible to specify arguments for truncating and padding the text input to a certain length when using the transformers pipeline for zero-shot classification?
For instance, if I am using the following:
classifier = pipeline("zero-shot-classification", device=0)
Do I need to first specify those arguments, such as truncation=True, padding='max_length', max_length=256, etc., in the tokenizer / config, and then pass it to the pipeline?
hey @valkyrie the pipelines in transformers call a _parse_and_tokenize function that automatically takes care of padding and truncation - see here for the zero-shot example.
so the short answer is that you shouldn’t need to provide these arguments when using the pipeline. do you have a special reason to want to do so?
Hey @lewtun, the reason I wanted to specify those is that I am comparing against other text classification methods like DistilBERT and BERT for sequence classification, where I have set the maximum length parameter (and therefore the length to truncate and pad to) to 256 tokens. I wanted to do the same with zero-shot learning for a fair comparison, and also hoped to make it more efficient.
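for reference, padding='max_length' with truncation=True and max_length=256 just means every encoded sequence is forced to exactly 256 token ids. a toy pure-Python illustration of that behaviour (pad_or_truncate is a made-up helper for explanation, not part of transformers):

```python
def pad_or_truncate(token_ids, max_length=256, pad_id=0):
    # clip anything longer than max_length, then right-pad shorter
    # sequences with pad_id so every output has exactly max_length ids
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))
```

this is what the tokenizer does for you under the hood when those arguments are honoured.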
hey @valkyrie i had a bit of a closer look at the _parse_and_tokenize function of the zero-shot pipeline and indeed it seems that you cannot specify the max_length parameter for the tokenizer.
so if you really want to change this, one idea could be to subclass ZeroShotClassificationPipeline and then override _parse_and_tokenize to include the parameters you’d like to pass to the tokenizer’s __call__ method.
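a rough sketch of that idea, assuming a transformers version where _parse_and_tokenize builds the premise/hypothesis pairs via self._args_parser and then calls the tokenizer (the exact signature varies between releases, so check your installed version first; the class name FixedLengthZeroShotPipeline is made up):

```python
from transformers import ZeroShotClassificationPipeline

class FixedLengthZeroShotPipeline(ZeroShotClassificationPipeline):
    # hypothetical subclass that pins the tokenizer to 256 tokens;
    # an illustration of the override, not a drop-in fix
    def _parse_and_tokenize(self, *args, **kwargs):
        # build the premise/hypothesis pairs the same way the parent does ...
        sequence_pairs = self._args_parser(*args, **kwargs)
        # ... but call the tokenizer's __call__ with explicit length settings
        return self.tokenizer(
            sequence_pairs,
            add_special_tokens=True,
            return_tensors=self.framework,
            padding="max_length",
            truncation="only_first",
            max_length=256,
        )
```

then you'd instantiate it with your model and tokenizer the same way pipeline() does internally.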