Hi,
Can you guys give me tips on how to make zero-shot pipeline inference faster?
My current approach right now is reducing the model size/parameter count
(training a "base" model instead of a "large" model).
Is there another approach?
There’s some discussion in this topic that you could check out.
Here are a few things you can do:
- Use a distilled model such as valhalla/distilbart-mnli-12-3 (models can be specified by passing e.g. pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3") when you construct the pipeline).
- Pass device=0 to the pipeline factory to utilize CUDA (see the sketch below).
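Putting those two tips together, a minimal sketch (the example text and candidate labels are just placeholders, and device=0 assumes a CUDA GPU is available):

from transformers import pipeline

# Distilled MNLI model instead of the default bart-large-mnli, placed on GPU 0.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
    device=0,  # assumes a CUDA GPU; use -1 (or omit) to stay on CPU
)

result = classifier(
    "The new phone has an amazing camera and battery life.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])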
Thanks for the advice @joeddav, and thanks also to @valhalla for the amazing project on distilled models.
The distilled model seems interesting, I will try to look into it,
and it would be great if it supported more models, such as xlm-roberta.
Anyway, I'm trying to train/reproduce your xlm model,
but using the base model instead of the large one to improve inference speed.
I will maybe try to follow these steps,
using xlm-r because it supports more languages.
Any tips I should be aware of?
Thanks again.
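For reference, a rough sketch of what fine-tuning xlm-roberta-base on an NLI dataset for zero-shot use might look like. The dataset choice, label mapping, and hyperparameters here are assumptions for illustration, not the recipe used for the original xlm model:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumption: train on MNLI; the original multilingual model may have used a
# different NLI mixture (e.g. XNLI or translated data).
dataset = load_dataset("multi_nli")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Keep an explicit "entailment" label so the zero-shot pipeline can find it
# in the model config at inference time.
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=3,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

def tokenize(batch):
    # Premise/hypothesis pairs, the same format the pipeline builds at inference.
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xlm-r-base-nli",
    per_device_train_batch_size=32,
    num_train_epochs=2,        # illustrative hyperparameters only
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation_matched"],
    tokenizer=tokenizer,       # enables dynamic padding via the default collator
)
trainer.train()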
I noticed using the zero-shot-classification pipeline that loading the model (i.e. this line: classifier = pipeline("zero-shot-classification", device=0)) takes about 60 seconds, but that inference afterward is quite fast. Is there a way to speed up the model/tokenizer loading process? Thanks!
The pipeline actually loads the model twice: once in the get_framework function and then again on line 3296. For now, if you want, you could just hardcode the framework as "pt" and remove the call to get_framework so the model is only loaded once.

Seems like we should add a utility function to file_utils.py to check whether a TF/PT model file exists at a path without having to download it, so that we don't have to do this. Thoughts from @sgugger or @lysandre maybe?
I'm confused about what the question is. You can pass a local path to XyzModel.from_pretrained and it won't download anything.
Awesome, thanks! I made the change below and load time dropped from 61 seconds to 32 seconds:
classifier = pipeline("zero-shot-classification", device=0) ----> classifier = pipeline("zero-shot-classification", framework="pt", device=0)
@sgugger the issue is that, whether the model is local or not, the pipeline loads it twice, which adds significant time for big models like bart-large.
Yes, so this has nothing to do with file_utils.
I should have been clearer. The problem is that the get_framework function in the pipelines implementation determines the framework with a try/except, attempting to load the model in PyTorch and, if that fails, loading it with TensorFlow. But then it just throws the model away, so the pipeline constructor has to load it again later.

What I was trying to say is: would it make sense to have a utility in file_utils.py that can tell you whether a file exists without having to download the whole thing? In this case, it would allow us to check whether a model file exists for a particular framework (e.g. pytorch_model.bin) without having to wait for the large file to download if it does. I imagine that could be useful in other places too, but I'm not the expert, so I thought I'd see what you thought. If it wouldn't be useful elsewhere, there's probably an easier workaround without leaving pipelines.py.
(sorry for the belated response)
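For illustration, a minimal sketch of the kind of check being proposed, written here as a plain HTTP HEAD request against the Hub's resolve URL rather than as any existing file_utils.py API (the helper name and the hardcoded "main" revision are assumptions for the example):

import requests

def remote_file_exists(model_id: str, filename: str) -> bool:
    # Hypothetical helper: a HEAD request tells us whether the file exists on
    # the Hugging Face Hub without downloading its contents.
    url = f"https://huggingface.co/{model_id}/resolve/main/{filename}"
    response = requests.head(url, allow_redirects=True, timeout=10)
    return response.status_code == 200

# e.g. pick the framework without loading any weights
framework = "pt" if remote_file_exists("valhalla/distilbart-mnli-12-3", "pytorch_model.bin") else "tf"
print(framework)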
Understood now! This function could certainly be useful, yes.
@joeddav @valhalla
Thanks for the amazing work on the original zero-shot classification model/pipeline and the distilled versions. I am a bit late to the party, but I want to know if there are new techniques that I can apply to speed up inference on a single GPU when using valhalla/distilbart-mnli-12-3 or a fine-tuned version of it. I read through this guide: GPU inference, but to be honest I am not sure whether these techniques can be applied to the valhalla/distilbart-mnli-12-3 model, given that some layers have been removed during distillation. I would really appreciate some pointers from you guys. Thank you!
Techniques which are often used to speed up models include quantization. I’d recommend taking a look at the Optimum library which implements a lot of optimizations for models available in the Transformers library, tailored towards various hardware providers: 🤗 Optimum. ONNX is a framework often used by companies in production to speed up inference.
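As a rough illustration of the Optimum/ONNX route for this model (a sketch only; whether the distilled checkpoint exports cleanly and how much speedup you get are things to verify on your own hardware):

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "valhalla/distilbart-mnli-12-3"

# Export the checkpoint to ONNX on the fly and run it with ONNX Runtime.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("zero-shot-classification", model=ort_model, tokenizer=tokenizer)

print(classifier(
    "I love this new camera",
    candidate_labels=["electronics", "food", "travel"],
))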
Setting torch_dtype=torch.float16 will likely be faster; see the pipelines documentation. For example: pipe = pipeline("text-generation", model=model_id, torch_dtype=torch.float16, batch_size=2, device=0)
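Applied to the zero-shot setup discussed above, that might look like the following sketch (half precision assumes a GPU; the distilled checkpoint is just the example used earlier in the thread):

import torch
from transformers import pipeline

# Load the zero-shot model in half precision on the GPU, which usually cuts
# memory use and speeds up inference at a small cost in numerical precision.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
    torch_dtype=torch.float16,
    device=0,
)

print(classifier(
    "The match went to extra time and ended in a penalty shootout.",
    candidate_labels=["sports", "politics", "technology"],
))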