Is zero-shot classification tokenizing the input sequence more than once?

I am digging into the zero-shot classification pipeline, and it seems that for every candidate label, both the premise and the hypothesis are encoded from scratch.

I could be reading the code incorrectly, so I wanted to verify here.

There are gains to be made if the first sequence (the user-provided input) were tokenized only once and reused across all label pairs. But it seems to me that neither the transformers library nor the tokenizers library does this?
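To make the idea concrete, here is a rough sketch of the optimization I have in mind. This is not the pipeline's actual code; `encode`, `classify`, and the BERT-style special-token ids (101/102) are stand-ins I made up for illustration:

```python
# Hypothetical sketch: tokenize the premise once, then reuse its ids
# when assembling one premise/hypothesis pair per candidate label.
def classify(encode, premise, labels, cls_id=101, sep_id=102):
    premise_ids = encode(premise)  # tokenized only once, reused below
    pairs = []
    for label in labels:
        hypothesis_ids = encode(f"This example is {label}.")
        # Assemble the pair BERT-style: [CLS] premise [SEP] hypothesis [SEP]
        pairs.append([cls_id] + premise_ids + [sep_id] + hypothesis_ids + [sep_id])
    return pairs

# Toy encoder standing in for a real tokenizer (word lengths as "ids").
toy_encode = lambda text: [len(w) for w in text.split()]
batch = classify(toy_encode, "Who are you voting for in 2020?", ["politics", "sports"])
```

The point is only that `encode(premise)` is called once, no matter how many labels there are.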

Example of what I mean: for the input “Who are you voting for in 2020?” with labels [“politics”, “sports”, “technology”], the sequence “Who are you voting for in 2020?” is tokenized 3 separate times. Even if this hits the tokenizer's cache, it seems to be doing unnecessary work.
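A toy illustration of the pattern I'm describing (not the actual pipeline code — `CountingTokenizer` and `encode_pair` are invented for this example): a tokenizer that counts its calls shows the premise being encoded once per candidate label.

```python
from collections import Counter

class CountingTokenizer:
    """Hypothetical whitespace tokenizer that records how often
    each sequence is encoded."""
    def __init__(self):
        self.encode_counts = Counter()

    def encode_pair(self, premise, hypothesis):
        # Both halves of the pair are encoded on every call.
        self.encode_counts[premise] += 1
        self.encode_counts[hypothesis] += 1
        return premise.split() + ["[SEP]"] + hypothesis.split()

tokenizer = CountingTokenizer()
premise = "Who are you voting for in 2020?"
labels = ["politics", "sports", "technology"]

# Mirrors how the zero-shot pipeline builds one NLI pair per label.
for label in labels:
    tokenizer.encode_pair(premise, f"This example is {label}.")

print(tokenizer.encode_counts[premise])  # 3: once per label
```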