In many cases, ONNX-optimized models offer significantly better inference performance than their PyTorch counterparts.
This performance boost, combined with the pipelines offered by HuggingFace, makes for a great experience in terms of both inference speed and model quality.
Right now, it’s possible to use ONNX models with a small amount of modification to the pipeline.py code. In my tests of the QuestionAnsweringPipeline on the SQuADv2 dataset, I see performance improvements of 1.5-2x with models like bert-large-uncased-whole-word-masking-finetuned-squad and distilbert-base-cased-distilled-squad.
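To make the idea concrete, here’s a rough sketch of the kind of change I mean: swapping the PyTorch forward pass for an onnxruntime session while keeping the tokenizer and post-processing the same. The `qa_model.onnx` path is just a placeholder, and I’m assuming the exported graph takes `input_ids`/`attention_mask` and returns start/end logits, as the SQuAD-finetuned models do:

```python
# Minimal sketch: running the QA forward pass through onnxruntime instead of
# PyTorch. Assumes the model was already exported to "qa_model.onnx" (example
# path) with input_ids / attention_mask inputs and start/end logit outputs.
import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
session = InferenceSession("qa_model.onnx")  # path to the exported model (assumption)

question = "What does ONNX stand for?"
context = "ONNX stands for Open Neural Network Exchange."

inputs = tokenizer(question, context, return_tensors="np")
# onnxruntime expects plain numpy arrays keyed by the graph's input names
onnx_inputs = {k: v for k, v in inputs.items() if k in {"input_ids", "attention_mask"}}
start_logits, end_logits = session.run(None, onnx_inputs)

# Same answer-span post-processing the pipeline already does
start = int(np.argmax(start_logits))
end = int(np.argmax(end_logits)) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```

In my hacky version this logic basically replaces the `self.model(**inputs)` call inside the pipeline, so everything upstream (tokenization) and downstream (span selection, scoring) stays untouched.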
I’m wondering whether this is something the devs/community would consider worthwhile to support directly in the transformers repo. I’m doing some work on this for my own project, but it’s mostly a hacky implementation at this point.
If there’s broader interest, I could try to integrate ONNX support for inference more fully into the codebase.
Would love to hear everyone’s thoughts.
Thanks