Way to make Zero Shot pipeline inference faster?

munggok · October 6, 2020, 1:27pm

Hi,
Can you give me tips on how to make Zero Shot pipeline inference faster?

My current approach is to reduce the model size/parameter count
(trying to train a "base" model instead of a "large" model).

Is there another approach?

joeddav · October 6, 2020, 1:39pm

There’s some discussion in this topic that you could check out.

Here are a few things you can do:

Try out one of the community-uploaded distilled models on the hub (thx @valhalla). I’ve found them to get pretty similar performance on zero shot classification, and some of them are much smaller and faster. I’d start with valhalla/distilbart-mnli-12-3 (models can be specified by passing e.g. pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3") when you construct the pipeline).
If you’re on GPU, make sure you’re passing device=0 to the pipeline factory in order to utilize CUDA.
If you’re on CPU, try running the pipeline with ONNX Runtime. You should get a boost. Here’s a project (thx again @valhalla) that lets you use HF pipelines with ORT automatically.
If you have a lot of candidate labels, try to get clever about passing just the most likely ones to the pipeline. Passing a large # of labels for each sentence is really going to slow you down, since each sentence/label pair has to be passed to the model together. If you have 100 possible labels but you can use some kind of heuristic or simpler model to narrow them down, that will help a lot.
Use mixed precision. This is pretty easy if you’re using PyTorch 1.6.
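
A minimal sketch of the label-narrowing tip above, combined with the distilled model and device settings already mentioned. The keyword table and the names shortlist_labels, classify_with_shortlist, and build_classifier are hypothetical and only for illustration; keyword overlap is just one possible heuristic, and a small classifier could play the same role.

```python
import re
from typing import Any, Callable, Dict, List, Sequence


def shortlist_labels(text: str, hints: Dict[str, Sequence[str]], top_k: int = 5) -> List[str]:
    """Return the top_k labels whose hint keywords overlap most with the text.

    Keyword overlap is an assumed, deliberately cheap heuristic.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    scored = [
        (sum(1 for word in words if word.lower() in tokens), label)
        for label, words in hints.items()
    ]
    # Most keyword hits first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:top_k]]


def classify_with_shortlist(
    text: str,
    hints: Dict[str, Sequence[str]],
    classifier: Callable[..., Any],
    top_k: int = 5,
) -> Any:
    """Run the expensive classifier on the shortlisted labels only."""
    labels = shortlist_labels(text, hints, top_k)
    return classifier(text, candidate_labels=labels)


def build_classifier(use_gpu: bool) -> Any:
    """Construct the zero-shot pipeline with a distilled model (needs transformers)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "zero-shot-classification",
        model="valhalla/distilbart-mnli-12-3",
        device=0 if use_gpu else -1,  # 0 = first CUDA device, -1 = CPU
    )
```

With 100 candidate labels and top_k=5, each sentence is scored against 5 sentence/label pairs instead of 100, at the risk of the heuristic dropping the correct label before the model ever sees it.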

munggok · October 6, 2020, 3:30pm

Thanks for the advice @joeddav, and thanks also to @valhalla for the amazing project on distilled models.

The distilled models seem interesting; I’ll look into them.
It would be great if they supported more models, such as xlm-roberta.

Anyway,
I’m trying to train/reproduce your XLM-R model,
but using the base model instead of the large one to improve inference speed.

I’ll maybe try using this step.

I’m using XLM-R because it supports more languages.

Any tips that I should be aware of?
Thanks again.

Buckeyes2019 · December 23, 2020, 3:45am

When using the zero-shot-classification pipeline, I noticed that loading the model (i.e. this line: classifier = pipeline("zero-shot-classification", device=0)) takes about 60 seconds, but that inference afterward is quite fast. Is there a way to speed up the model/tokenizer loading process? Thanks!

valhalla · December 23, 2020, 6:05am

The pipeline actually loads the model twice: once in the get_framework function and then again on line 3296 of pipelines.py:

github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py#L3296

    logger.warning(
        "Model might be a PyTorch model (ending with `.bin`) but PyTorch is not available. "
        "Trying to load the model with Tensorflow."
    )
if model_class is None:
    raise ValueError(
        f"Pipeline using {framework} framework, but this framework is not supported by this pipeline."
    )
model = model_class.from_pretrained(model, config=config, revision=revision, **model_kwargs)
if task == "translation" and model.config.task_specific_params:
    for key in model.config.task_specific_params:
        if key.startswith("translation"):
            task = key
            warnings.warn(
                '"translation" task was used, instead of "translation_XX_to_YY", defaulting to "{}"'.format(
                    task
                ),
                UserWarning,
            )

For now, if you want, you could just hardcode the framework as pt and remove the call to get_framework to load the model once.

joeddav · December 23, 2020, 5:17pm

Seems like we should add a utility function to file_utils.py to check whether a tf/pt model file exists at a path without having to download it so that we don’t have to do this. Thoughts from @sgugger or @lysandre maybe?

sgugger · December 23, 2020, 5:48pm

I’m confused about what the question is. You can pass a local path to XyzModel.from_pretrained and it won’t download anything.

Buckeyes2019 · December 23, 2020, 6:07pm

Awesome, thanks! I made the change below and load time dropped from 61 seconds to 32 seconds:

Before: classifier = pipeline("zero-shot-classification", device=0)
After: classifier = pipeline("zero-shot-classification", framework="pt", device=0)

valhalla · December 23, 2020, 6:26pm

@sgugger The issue is that, whether the model is local or not, the pipeline loads it twice, which adds significant time for big models like bart-large.

sgugger · December 23, 2020, 6:38pm

Yes, so this has nothing to do with file_utils.

joeddav · January 7, 2021, 2:37pm

I should have been clearer. The problem is that the get_framework function in the pipelines implementation determines the framework with a try/except, attempting to load the model in PyTorch and, if that fails, loading it with TensorFlow. But then it just throws the model away, so the pipeline constructor has to load it again later.
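
The double load described here can be illustrated with a small stand-in. This is a sketch of the pattern only, not the actual get_framework source: probe_and_discard, probe_and_keep, CountingLoader, and the load_pt/load_tf callables are hypothetical placeholders for the real from_pretrained calls.

```python
from typing import Any, Callable, Tuple

Loader = Callable[[str], Any]


def probe_and_discard(name: str, load_pt: Loader, load_tf: Loader) -> str:
    """Detect the framework by loading the model, then throw the model away."""
    try:
        load_pt(name)
        return "pt"
    except Exception:
        load_tf(name)
        return "tf"


def probe_and_keep(name: str, load_pt: Loader, load_tf: Loader) -> Tuple[str, Any]:
    """Same detection, but hand the loaded model back so it can be reused."""
    try:
        return "pt", load_pt(name)
    except Exception:
        return "tf", load_tf(name)


class CountingLoader:
    """Stand-in for an expensive model load; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, name: str) -> str:
        self.calls += 1
        return f"model:{name}"


def unavailable(name: str) -> Any:
    """Stand-in for a framework that is not installed."""
    raise RuntimeError("framework not installed")


# Pattern described in the post: probe, discard, then load again.
wasteful = CountingLoader()
framework = probe_and_discard("bart-large", wasteful, unavailable)
model = wasteful("bart-large")  # the constructor has to load a second time

# Alternative: keep the model that the probe already loaded.
frugal = CountingLoader()
framework_kept, model_kept = probe_and_keep("bart-large", frugal, unavailable)
```

In this toy setup the first pattern calls the loader twice and the second only once. Passing framework="pt" explicitly, as reported earlier in the thread, sidesteps the probing load altogether.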

What I was trying to say is: would it make sense to have a utility in file_utils.py that can tell you whether a file exists without having to download the whole thing? In this case, it would allow us to check whether a model file exists for a particular framework (e.g. pytorch_model.bin) without having to wait for the large file to download if it does. I imagine that could be useful in other places too, but I’m not the expert, so I thought I’d see what you thought. If it wouldn’t be useful elsewhere, there’s probably an easier workaround without leaving pipelines.py.
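
One way such a check might look, sketched under assumptions: it decides the framework from a repository's file listing instead of loading any weights. pick_framework and framework_for_repo are hypothetical names, not part of file_utils.py. tf_model.h5 is the conventional TensorFlow weights filename alongside the pytorch_model.bin mentioned above, and list_repo_files comes from the separate huggingface_hub package, which is an assumption here rather than something this thread uses.

```python
from typing import Iterable, Optional

# Conventional weight filenames per framework, checked in this order.
WEIGHT_FILES = (("pt", "pytorch_model.bin"), ("tf", "tf_model.h5"))


def pick_framework(filenames: Iterable[str]) -> Optional[str]:
    """Return "pt" or "tf" depending on which weights file is listed, else None."""
    names = set(filenames)
    for framework, weights in WEIGHT_FILES:
        if weights in names:
            return framework
    return None


def framework_for_repo(repo_id: str) -> Optional[str]:
    """Fetch only the file listing from the Hub; no weights are downloaded."""
    from huggingface_hub import list_repo_files  # lazy import, needs network access

    return pick_framework(list_repo_files(repo_id))
```

Repositories that ship only other weight formats (for example safetensors) would return None with this table, so the filename list would need extending before relying on it.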

(sorry for the belated response)

sgugger · January 7, 2021, 3:36pm

Understood now! This function could certainly be useful, yes.

sc3051 · May 31, 2024, 11:20pm

@joeddav @valhalla
Thanks for the amazing work on the original zero-shot classification model/pipeline and the distilled versions. I am a bit late to the party but I want to know if there are new techniques that I can apply to speed up inference on a single GPU if I am using valhalla/distilbart-mnli-12-3 or a fine-tuned version of that. I read through this guide: GPU inference but to be honest I am not sure if these can be applied to the valhalla/distilbart-mnli-12-3 model given that some layers have been removed during distillation. I would really appreciate some pointers from you guys. Thank you!

nielsr · June 3, 2024, 2:14pm

Techniques which are often used to speed up models include quantization. I’d recommend taking a look at the Optimum library which implements a lot of optimizations for models available in the Transformers library, tailored towards various hardware providers: 🤗 Optimum. ONNX is a framework often used by companies in production to speed up inference.

hgbchatfp · November 24, 2024, 6:19pm

Setting torch_dtype=torch.float16 is likely faster. For reference, see the pipelines doc, e.g. pipe = pipeline("text-generation", model=model_id, torch_dtype=torch.float16, batch_size=2, device=0)
