Supporting ONNX optimized models

In many cases, ONNX-optimized models offer significantly better inference performance than their PyTorch counterparts.
This speed-up, combined with the pipelines offered by HuggingFace, makes for a great experience in terms of both inference speed and model quality.

Right now, it’s possible to use ONNX models with a little bit of modification to the code. In my tests of the QuestionAnsweringPipeline on the SQuAD v2 dataset, I see speed-ups of 1.5-2x with models like bert-large-uncased-whole-word-masking-finetuned-squad and distilbert-base-cased-distilled-squad.

I’m wondering if this is something that the devs/community consider worthwhile to support directly in the transformers repo. I’m doing some work on this for my own project, but it’s mostly a hacky implementation at this point.
If there’s greater interest in this then I could try to integrate ONNX support for inference more fully in the code base.

Would love to hear everyone’s thoughts.



I know @mfuntowicz is working on this so he may have more input.


We would love to see ONNX support in the pipelines; it would make serving models much more scalable.
We have actually never worked with ONNX ourselves, but the statistics look pretty good.

What do you think about automatically generating ONNX models alongside the TF and PyTorch models, so that over time any model uploaded to the HF hub would be available as an ONNX model too?

We would love to help you there, but we need to read more about ONNX Runtime first.


Very cool pipelines implementation

Hi there,

Are there any plans for Huggingface to distribute pre-trained models as ONNX files? My use case is embedded pre-trained model inference.

I don’t necessarily need the raw PyTorch or TensorFlow model. The default quantized ONNX export would be enough for me.

I can do that myself using a combination of the huggingface download utils and the ONNX conversion script. Ideally, I would get the ONNX file directly from :hugs:

[ADDITION] It would probably not be as simple as “get the ONNX file” if we need a plug-and-play experience. Happy to brainstorm on the right format!


Alex Combessie


Hi @dataiku, give this a try

Note: this is a utility project that allows easy inference; it creates the ONNX graph on the fly.

Pinging @julien-c to see if it’s possible to host ONNX files on the hub.


Hi @valhalla,

Just had a look. I am exactly looking for that kind of simplicity: nlp = pipeline("sentiment-analysis", onnx=True) :slight_smile:

Good prototype. The main caveats I can think of are:
(i) ONNX conversion is done on-device - I’d rather pull pre-computed ONNX files from the model hub
(ii) some level of duplication of huggingface source code
(iii) possible improvements in the tokenizer choices to use fast versions when available

I would really like to have this option native in huggingface so I can use it in production applications when inference speed matters a lot.



i) We could pull them if they are available.
ii) Duplicated to provide the same API.
iii) Easy to add. Will add it.

HF is considering this in the pipeline V2
see [RFC] Transformers Pipeline v2

Thanks for pointing me to [RFC] Transformers Pipeline v2. That’s good to know.

For (ii) is it possible to leverage Python class inheritance features to only add/override the methods which are necessary for ONNX?
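Something like this hypothetical sketch, maybe (all class and method names here are made up for illustration and are not the actual transformers internals): the subclass keeps the inherited pre/post-processing and overrides only the model-execution step.

```python
class Pipeline:
    """Toy stand-in for a pipeline: preprocess -> _forward -> postprocess."""

    def preprocess(self, text):
        # toy "tokenization": scaled character codes of the first few chars
        return [ord(ch) / 128.0 for ch in text[:4]]

    def _forward(self, features):
        # the "PyTorch" forward pass in the real library; a toy sum here
        return sum(features)

    def postprocess(self, score):
        return {"label": "POSITIVE" if score > 2.0 else "NEGATIVE",
                "score": score}

    def __call__(self, text):
        return self.postprocess(self._forward(self.preprocess(text)))


class OnnxSentimentPipeline(Pipeline):
    """Only the model-execution step changes; preprocessing and
    postprocessing are inherited unchanged."""

    def _forward(self, features):
        # In a real implementation this would call an onnxruntime
        # InferenceSession instead of the PyTorch model, e.g.:
        #   return self.session.run(None, {"input": features})[0]
        return sum(features)  # same toy computation for illustration
```

The appeal is that a fix to preprocessing or postprocessing upstream would be picked up for free, instead of having to be re-copied.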

Definitely possible! I copied everything so that I could have full control.

We do have a few models in ONNX on the hub, and we support it as a searchable file type:

(cc @mfuntowicz)

We’ll look into auto-converting models to ONNX in the future. For now, feel free to ask model authors (or us) to upload converted files manually.


If we upload ONNX models, will they appear in the ONNX filter?

For now there are only two models at

And I agree with @bdalal: it’s “quite easy” to convert a model to an ONNX checkpoint and use it in, say, NodeJS, which doesn’t have a mature ecosystem for deep learning.

I would convert them myself if I needed to finetune a model, but for hub models it would be more convenient to download them directly (if converted automatically). I had almost envisaged converting the models I use and hosting them for my team…

Also, an onnx=True option to save an ONNX file locally (by calling convert_graph_to_onnx) would be really useful!

So here are the 2 cents of a really happy user :slight_smile:

Thanks for this wonderful tool!
Have a great day!

Is it possible to use ONNX models directly from the Hub, either by directly referencing the file, or via the from_pretrained() method?

That is, before converting to ONNX, I instantiate a (private) model via model = AutoModelForSeq2SeqLM.from_pretrained('org/model_name', use_auth_token=True). If I push a .onnx file up to 'org/model_name', how can I load it?

@mfuntowicz @lysandre or @Narsil might know!

Hi @sam-writer, there’s currently no way to do that with from_pretrained, and I don’t think there will be in the foreseeable future (first of all, we would have to choose a runtime for that ONNX file, and using onnxruntime would force a lot onto the library; transformers is probably not the best place for that).

That being said, if you have a .onnx file in your repo, the best way to use it is with your preferred runtime (most likely onnxruntime):

import os

from huggingface_hub import cached_download
from onnxruntime import InferenceSession

# cached_download expects a full URL; "resolve" (rather than "raw")
# returns the actual file for LFS-tracked weights
file = cached_download(
    "https://huggingface.co/{USERNAME}/{MODEL_ID}/resolve/main/{FILE}.onnx",
    use_auth_token=os.getenv("API_TOKEN"),
)

sess = InferenceSession(file)
# go on with inference with ONNX

Does that help?


YES! Thank you Narsil, I didn’t know about cached_download, nor how the file structure worked with /raw/main. This is great, thanks so much.