Way to make Zero Shot pipeline inference faster?

Can you guys give me tips on how to make Zero Shot pipeline inference faster?

My current approach is to reduce the model size/parameter count
(trying to train a "base" model instead of a "large" model).

Is there another approach?

There’s some discussion in this topic that you could check out.

Here are a few things you can do:

  • Try out one of the community-uploaded distilled models on the hub (thx @valhalla). I’ve found them to get pretty similar performance on zero shot classification and some of them are much smaller and faster. I’d start with valhalla/distilbart-mnli-12-3 (models can be specified by passing e.g. pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3") when you construct the pipeline).
  • If you’re on GPU, make sure you’re passing device=0 to the pipeline factory to utilize CUDA.
  • If you’re on CPU, try running the pipeline with ONNX Runtime. You should get a boost. Here’s a project (thx again @valhalla) that lets you use HF pipelines with ORT automatically.
  • If you have a lot of candidate labels, try to get clever about passing just the most likely ones to the pipeline. Passing a large # of labels for each sentence is really going to slow you down since each sentence/label pair has to be passed to the model together. If you have 100 possible labels but you can use some kind of heuristic or simpler model to narrow it down, that will help a lot.
  • Use mixed precision. This is pretty easy if using PyTorch 1.6.
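
The label-narrowing tip above can be sketched roughly as follows. This is only an illustration, not part of the pipeline API: `narrow_labels`, `classify_with_prefilter`, and the keyword map are hypothetical names, and keyword overlap is just one assumed heuristic for shortlisting labels. The classifier argument is any callable that takes a sequence and a list of candidate labels, which is how the zero-shot pipeline is called.

```python
import re
from typing import Any, Callable, Dict, List


def narrow_labels(text: str, label_keywords: Dict[str, List[str]], top_k: int = 3) -> List[str]:
    """Rank labels by how many of their keywords occur in the text; keep top_k.

    Hypothetical heuristic for illustration; a small, cheap model could be
    used here instead.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {
        label: sum(1 for keyword in keywords if keyword.lower() in tokens)
        for label, keywords in label_keywords.items()
    }
    # sorted() is stable, so tied labels keep their original order.
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    return ranked[:top_k]


def classify_with_prefilter(
    text: str,
    label_keywords: Dict[str, List[str]],
    classifier: Callable[[str, List[str]], Any],
    top_k: int = 3,
) -> Any:
    """Run the expensive classifier on the shortlisted labels only."""
    shortlist = narrow_labels(text, label_keywords, top_k)
    # Each sequence/label pair is a separate model input, so a shorter
    # shortlist means proportionally less work.
    return classifier(text, shortlist)


# Usage with the pipeline (requires transformers and a GPU for device=0):
# from transformers import pipeline
# classifier = pipeline("zero-shot-classification",
#                       model="valhalla/distilbart-mnli-12-3", device=0)
# keywords = {"sports": ["match", "goal"], "politics": ["election", "vote"]}
# result = classify_with_prefilter("They scored a late goal", keywords, classifier, top_k=1)
```

How aggressive `top_k` can be depends on how reliable the shortlist is; if the heuristic drops the correct label, the model never sees it.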

Thanks for the advice @joeddav, and thanks to @valhalla for the amazing project on distilled models.

The distilled models seem interesting, I will look into them.
It would be great if the approach supported more models, such as xlm-roberta.

I’m trying to train/reproduce your XLM-R model,
but using the base model instead of the large one to improve inference speed.

I might try following this step.

I’m using XLM-R because it supports more languages.

Any tips I should be aware of?
Thanks again.

I noticed when using the zero-shot-classification pipeline that loading the model (i.e. this line: classifier = pipeline("zero-shot-classification", device=0)) takes about 60 seconds, but that inference afterward is quite fast. Is there a way to speed up the model/tokenizer loading process? Thanks!

The pipeline actually loads the model twice: once in the get_framework function and then again on line 3296.

For now, if you want, you could just hardcode the framework as pt and remove the call to get_framework to load the model once.

Seems like we should add a utility function to file_utils.py to check whether a tf/pt model file exists at a path without having to download it so that we don’t have to do this. Thoughts from @sgugger or @lysandre maybe?

I’m confused about what the question is. You can pass a local path to XyzModel.from_pretrained and it won’t download anything.

Awesome, thanks! I made the change below and load time dropped from 61 seconds to 32 seconds:

classifier = pipeline("zero-shot-classification", device=0) ----> classifier = pipeline("zero-shot-classification", framework="pt", device=0)
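
To check the effect on your own setup, the load can be timed directly. This is a minimal sketch: `timed` and `build_classifier` are hypothetical helper names, and the numbers you get will depend on model size, the local cache, and hardware.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def build_classifier(framework: str = "pt", device: int = 0):
    """Construct the zero-shot pipeline with the framework passed explicitly."""
    # Imported lazily so the timing helper works without transformers installed.
    from transformers import pipeline

    return pipeline("zero-shot-classification", framework=framework, device=device)


# Usage (requires transformers, PyTorch and a CUDA GPU for device=0):
# classifier, seconds = timed(build_classifier)
# print(f"pipeline loaded in {seconds:.1f}s")
```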

@sgugger the issue is that, whether or not the model is local, the pipeline loads it twice, which adds significant time for big models like bart-large.

Yes, so this has nothing to do with file_utils.py.

I should have been clearer. The problem is that the get_framework function in the pipelines implementation determines the framework with a try/except, attempting to load the model in PyTorch and, if that fails, loading it with TensorFlow. But then it just throws the model away, so the pipeline constructor has to load it again later.
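
As a toy illustration of the pattern described above (this is not the transformers source; `load_model`, `detect_framework`, and `build_pipeline` are hypothetical stand-ins), probing the framework by loading the model costs one extra load unless the framework is given up front:

```python
from typing import List, Optional, Tuple

# Records every (model name, framework) load so the cost is visible.
load_calls: List[Tuple[str, str]] = []


def load_model(name: str, framework: str) -> dict:
    """Stand-in for an expensive model load; only "pt" weights exist here."""
    load_calls.append((name, framework))
    if framework == "pt":
        return {"name": name, "framework": framework}
    raise OSError(f"no {framework} weights found for {name}")


def detect_framework(name: str) -> str:
    """Probe by loading: try PyTorch, then TensorFlow, and discard the model."""
    for framework in ("pt", "tf"):
        try:
            load_model(name, framework)  # the loaded model is thrown away
            return framework
        except OSError:
            continue
    raise RuntimeError(f"could not load {name} in any framework")


def build_pipeline(name: str, framework: Optional[str] = None) -> dict:
    """Load the model, detecting the framework first if none was given."""
    if framework is None:
        framework = detect_framework(name)  # first load, result discarded
    return load_model(name, framework)      # second load, the one that is kept


build_pipeline("some-model")
print(len(load_calls))  # two loads when the framework has to be detected

load_calls.clear()
build_pipeline("some-model", framework="pt")
print(len(load_calls))  # one load when the framework is passed explicitly
```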

What I was trying to say is: would it make sense to have a utility in file_utils.py that can tell you whether a file exists without having to download the whole thing? In this case, it would allow us to check whether a model file exists for a particular framework (e.g. pytorch_model.bin) without having to wait for the large file to download if it does. I imagine that could be useful in other places too, but I’m not the expert, so I thought I’d see what you thought. If it wouldn’t be useful elsewhere, there’s probably an easier workaround without leaving pipelines.py.
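
One way such a check could look, using only the standard library: send an HTTP HEAD request and inspect the status code. This is a sketch of the idea rather than an existing file_utils.py function; `remote_file_exists` is a hypothetical name, and how the URL of a given weights file is built is assumed to be handled elsewhere.

```python
import urllib.error
import urllib.request
from typing import Callable


def remote_file_exists(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 10.0,
) -> bool:
    """Return True if the server reports that url exists.

    Only headers are requested and the response body is never read, so a
    large weights file is not downloaded just to learn that it is there.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other errors (auth, server failure) are not "missing file"


# Usage (weights_url is a placeholder for wherever the file is served from):
# if remote_file_exists(weights_url):
#     framework = "pt"
```

The `opener` parameter exists only so the function can be exercised without network access; by default it is `urllib.request.urlopen`.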

(sorry for the belated response)

Understood now! This function could certainly be useful, yes.