I was wondering if there was a way to speed up zero-shot classification as outlined here if I were to use PyTorch directly.
For example, I’m guessing this default method tokenises and pads to length 512, whereas most of my text is < 50 words. I’ve had some experience using BertWordPieceTokenizer, so I’m guessing it would also be faster to tokenise everything in one go and send it to a PyTorch model directly, rather than one by one, which is what seems to be happening here?
Would really appreciate even a starting point if such a thing is possible.
By default it actually pads to the length of the longest sequence in the batch, so that part is efficient. The thing to keep in mind with this method is that each sequence/label pair has to be fed through the model together. So if you are running with a large # of candidate labels, that’s going to be your bottleneck. The other thing is that the default model, bart-large-mnli, is pretty big. Theoretically, the pipeline can be used with any model that has been trained on NLI, which you can specify with the model parameter when you create the pipeline. So you could try out some smaller models, but you probably won’t get anything to work as well as Bart or Roberta in terms of accuracy.
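For example, something along these lines swaps in a different NLI checkpoint via the model parameter (a rough sketch, not benchmarked; roberta-large-mnli is just one option, and any NLI-trained model on the hub could be substituted the same way):

from transformers import pipeline

# Any checkpoint fine-tuned on NLI can be passed via `model`;
# smaller/distilled NLI models plug in the same way, with some accuracy trade-off.
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

result = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "economy", "public health"],
)
print(result["labels"], result["scores"])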
Thanks for the quick response @joeddav. I tried the following:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModel.from_pretrained("facebook/bart-large-mnli")
sequence = "Who are you voting for in 2020?"
template = "This example is {}."
x = tokenizer(sequence, template.format("politics"), return_tensors="pt")
y = model(**x)
print([a.shape for a in y])
# [torch.Size([1, 17, 1024]), torch.Size([1, 17, 1024])]
whereas I was expecting the output to be of size (1, 3), giving the logits/probabilities of entailment etc.
So it seems like the model I got was a headless model? Is there a way to get the model with the head?
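My guess is that something like the sequence-classification auto class is what I’m after, roughly along these lines, continuing from the snippet above (not verified against what the pipeline does internally):

from transformers import AutoModelForSequenceClassification

# Same checkpoint, but loaded with the NLI classification head attached
nli_model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

y = nli_model(**x)
logits = y[0]
print(logits.shape)  # expecting (1, 3): contradiction / neutral / entailment per this checkpoint's config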
I see your point about max length being set, but considering my classes are always the same, I probably don’t need to tokenise them every time. I could in theory tokenise the inputs and the classes separately and join them with a separator. At least this is what I was hoping to do with the above code segment.
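In the meantime, what I was hoping to do is batch all the sequence/label pairs through the tokenizer in one call and read the entailment column off the logits, roughly like this (a sketch that reuses sequence, template, tokenizer and nli_model from above, and assumes the contradiction/neutral/entailment label ordering of this checkpoint):

import torch

labels = ["politics", "economy", "public health"]
premises = [sequence] * len(labels)
hypotheses = [template.format(label) for label in labels]

# One tokenizer call for all pairs, padded only to the longest pair in the batch
batch = tokenizer(premises, hypotheses, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = nli_model(**batch)[0]  # shape (num_labels, 3)

# Softmax over entailment vs. contradiction, dropping the neutral class
entail_contra = logits[:, [2, 0]]
probs = entail_contra.softmax(dim=1)[:, 0]  # probability each label is entailed
print(dict(zip(labels, probs.tolist())))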
Oh, I also realized I should add here for any readers that if you want to use pipelines on GPU, you can just pass device=0 (where 0 is the device number) to the pipeline factory:
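from transformers import pipeline

# device=0 puts the pipeline's model on the first GPU; device=-1 (the default) stays on CPU
classifier = pipeline("zero-shot-classification", device=0)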
Or you could try the onnx_transformers project, which lets you speed up HF pipelines using ONNX and also includes zero-shot-classification. Note that BART is not tested on ONNX yet, so it uses roberta-large-mnli instead of BART.
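Going by that project’s README, the usage is meant to mirror the HF pipeline factory, roughly like this (an unverified sketch; the onnx flag is taken from the README and the exact API may differ):

from onnx_transformers import pipeline

# Same pipeline interface, but runs the model through ONNX Runtime under the hood
classifier = pipeline("zero-shot-classification", onnx=True)
print(classifier("Who are you voting for in 2020?", candidate_labels=["politics", "economy"]))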