Zero-Shot Classification Pipeline - Truncating

valkyrie · May 20, 2021, 10:33am

Hi,

Is it possible to specify arguments for truncating and padding the text input to a certain length when using the transformers pipeline for zero-shot classification?

For instance, if I am using the following:
classifier = pipeline("zero-shot-classification", device=0)

Do I need to first specify arguments such as truncation=True, padding='max_length', max_length=256, etc. in the tokenizer / config, and then pass it to the pipeline?

Thank you in advance
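
For readers unfamiliar with the three arguments in the question, here is a rough, library-free sketch of what truncating and padding to a fixed length does to a list of token ids. The helper `truncate_and_pad` and the `pad_id=0` default are made up for illustration and are not part of transformers; a real tokenizer also handles special tokens and sequence pairs, which this ignores.

```python
def truncate_and_pad(token_ids, max_length=256, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length.

    Hypothetical helper for illustration only; not a transformers function.
    Returns (ids, attention_mask), where the mask is 1 for real tokens and 0 for padding.
    """
    clipped = list(token_ids[:max_length])           # truncation: drop everything past max_length
    attention_mask = [1] * len(clipped)
    n_pad = max_length - len(clipped)                 # padding: fill the remainder, if any
    return clipped + [pad_id] * n_pad, attention_mask + [0] * n_pad


# A short input gets padded, a long input gets cut; both end up the same length.
short_ids, short_mask = truncate_and_pad(range(1, 11), max_length=16)
long_ids, long_mask = truncate_and_pad(range(1, 301), max_length=256)
print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Every input comes out at exactly max_length, which is what makes a comparison across models at a fixed 256 tokens possible.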

lewtun · May 20, 2021, 12:23pm

hey @valkyrie the pipelines in transformers call a _parse_and_tokenize function that automatically takes care of padding and truncation - see here for the zero-shot example.

so the short answer is that you shouldn’t need to provide these arguments when using the pipeline. do you have a special reason to want to do so?

valkyrie · May 20, 2021, 1:19pm

Hey @lewtun, the reason I wanted to specify those is that I am doing a comparison with other text classification methods like DistilBERT and BERT for sequence classification, where I have set the maximum length parameter (and therefore the length to truncate and pad to) to 256 tokens. Because of that I wanted to do the same with zero-shot learning, and I am also hoping to make it more efficient.

lewtun · May 26, 2021, 5:19pm

hey @valkyrie i had a bit of a closer look at the _parse_and_tokenize function of the zero-shot pipeline and indeed it seems that you cannot specify the max_length parameter for the tokenizer.

so if you really want to change this, one idea could be to subclass ZeroShotClassificationPipeline and then override _parse_and_tokenize to include the parameters you’d like to pass to the tokenizer’s __call__ method.
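
a minimal sketch of the subclass-and-override pattern, using stand-in classes so it runs without downloading a model. `StandInTokenizer`, `StandInPipeline` and `FixedLengthPipeline` are hypothetical names, not transformers classes; the real `_parse_and_tokenize` signature is not shown in this thread, so check it in your installed version before copying the idea over.

```python
class StandInTokenizer:
    """Hypothetical stand-in for a tokenizer; splits on whitespace instead of using a vocabulary."""

    def __call__(self, text, truncation=False, padding=False, max_length=None):
        tokens = text.split()
        if truncation and max_length is not None:
            tokens = tokens[:max_length]                      # cut to max_length
        if padding == "max_length" and max_length is not None:
            tokens = tokens + ["[PAD]"] * (max_length - len(tokens))  # fill to max_length
        return tokens


class StandInPipeline:
    """Hypothetical stand-in for the pipeline; its tokenizer arguments are hard-coded."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _parse_and_tokenize(self, text):
        # The caller has no way to pass max_length through here.
        return self.tokenizer(text, truncation=True, padding=False)

    def __call__(self, text):
        return self._parse_and_tokenize(text)


class FixedLengthPipeline(StandInPipeline):
    """Subclass that overrides _parse_and_tokenize to forward extra tokenizer arguments."""

    def __init__(self, tokenizer, max_length=256):
        super().__init__(tokenizer)
        self.max_length = max_length

    def _parse_and_tokenize(self, text):
        return self.tokenizer(
            text, truncation=True, padding="max_length", max_length=self.max_length
        )


base = StandInPipeline(StandInTokenizer())
fixed = FixedLengthPipeline(StandInTokenizer(), max_length=4)
base_out = base("one two three four five six")
fixed_long = fixed("one two three four five six")
fixed_short = fixed("one two")
print(base_out, fixed_long, fixed_short)
```

with the real pipeline the idea would be the same: keep everything else from the parent class and change only the call to the tokenizer.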

hope that helps!

valkyrie · May 27, 2021, 7:57am

I will try that, thank you!
