Is zero-shot classification tokenizing the input sequence more than once?

I am digging into the zero-shot classification pipeline, and it seems that for every candidate label, both the premise and the hypothesis are encoded from scratch.

I could be reading the code incorrectly, so I wanted to verify here.

There are gains to be made if the first sequence (the user-provided input) were tokenized only once and reused across all label pairs. But it seems to me that neither the transformers library nor the tokenizers library does this?
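To make the idea concrete, here is a rough sketch of the optimization I have in mind. This is not the pipeline's actual code; `encode`, `classify`, and the BERT-style special-token ids (101/102) are stand-ins I made up for illustration:

```python
# Hypothetical sketch: tokenize the premise once, then reuse its ids
# when assembling one premise/hypothesis pair per candidate label.
def classify(encode, premise, labels, cls_id=101, sep_id=102):
    premise_ids = encode(premise)  # tokenized only once, reused below
    pairs = []
    for label in labels:
        hypothesis_ids = encode(f"This example is {label}.")
        # Assemble the pair BERT-style: [CLS] premise [SEP] hypothesis [SEP]
        pairs.append([cls_id] + premise_ids + [sep_id] + hypothesis_ids + [sep_id])
    return pairs

# Toy encoder standing in for a real tokenizer (word lengths as "ids").
toy_encode = lambda text: [len(w) for w in text.split()]
batch = classify(toy_encode, "Who are you voting for in 2020?", ["politics", "sports"])
```

The point is only that `encode(premise)` is called once, no matter how many labels there are.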

Example of what I mean: for the input “Who are you voting for in 2020?” with labels [“politics”, “sports”, “technology”], the sequence “Who are you voting for in 2020?” is tokenized 3 separate times. Even if this hits the tokenizer's cache, it seems to be doing unnecessary work.
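A toy illustration of the pattern I'm describing (not the actual pipeline code — `CountingTokenizer` and `encode_pair` are invented for this example): a tokenizer that counts its calls shows the premise being encoded once per candidate label.

```python
from collections import Counter

class CountingTokenizer:
    """Hypothetical whitespace tokenizer that records how often
    each sequence is encoded."""
    def __init__(self):
        self.encode_counts = Counter()

    def encode_pair(self, premise, hypothesis):
        # Both halves of the pair are encoded on every call.
        self.encode_counts[premise] += 1
        self.encode_counts[hypothesis] += 1
        return premise.split() + ["[SEP]"] + hypothesis.split()

tokenizer = CountingTokenizer()
premise = "Who are you voting for in 2020?"
labels = ["politics", "sports", "technology"]

# Mirrors how the zero-shot pipeline builds one NLI pair per label.
for label in labels:
    tokenizer.encode_pair(premise, f"This example is {label}.")

print(tokenizer.encode_counts[premise])  # 3: once per label
```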