In a lot of cases, ONNX-optimized models offer much better inference performance than their PyTorch counterparts.
This performance boost, coupled with the pipelines offered by HuggingFace, makes for a great combination in terms of both inference speed and model quality.
I’m wondering whether this is something the devs/community consider worth supporting directly in the transformers repo. I’m doing some work on this for my own project, but it’s mostly a hacky implementation at this point.
If there’s broader interest, I could try to integrate ONNX support for inference more fully into the code base.
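For a concrete sense of the gap, here is a rough sketch of the comparison I have in mind. It assumes the model has already been exported to a local model.onnx file (e.g. with the conversion script); the model name, iteration count and file path are just placeholders:

import time

import torch
from onnxruntime import InferenceSession
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

text = "ONNX Runtime can be noticeably faster for CPU inference."
pt_inputs = tokenizer(text, return_tensors="pt")
np_inputs = {k: v for k, v in tokenizer(text, return_tensors="np").items()}

# Eager PyTorch inference
start = time.perf_counter()
with torch.no_grad():
    for _ in range(100):
        pt_model(**pt_inputs)
print(f"PyTorch:      {(time.perf_counter() - start) / 100:.4f} s/inference")

# ONNX Runtime inference on the exported graph (feed names must match the export)
sess = InferenceSession("model.onnx")
start = time.perf_counter()
for _ in range(100):
    sess.run(None, np_inputs)
print(f"ONNX Runtime: {(time.perf_counter() - start) / 100:.4f} s/inference")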
We would love to see ONNX support in the pipelines; it would make serving models much more scalable.
We have actually never worked with ONNX ourselves, but the statistics look pretty good.
What do you think about automatically generating ONNX models alongside the TF and PyTorch ones, so that over time any model uploaded to the HF hub would be available as an ONNX model too?
We would love to help out here, but we’d need to read up more on ONNX Runtime first.
Are there any plans for Huggingface to distribute pre-trained models as ONNX files? My use case is embedded pre-trained model inference.
I don’t necessarily need the raw pytorch or tensorflow model. The default ONNX quantized export would be enough for me.
I can do that myself using a combination of the huggingface download utils and the ONNX conversion script. Ideally, though, I would get the ONNX file directly from the hub.
Edit: It would probably not be as simple as “get the ONNX file” if we need a plug-and-play experience. Happy to brainstorm on the right format!
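For reference, the “do it myself” route looks roughly like the following today. The convert/optimize/quantize helpers live next to the conversion script in transformers, but their exact signatures vary a bit across versions, so treat this as a sketch rather than gospel; the model name and output path are placeholders:

from pathlib import Path

from transformers.convert_graph_to_onnx import convert, optimize, quantize

onnx_path = Path("onnx/distilbert-base-uncased.onnx")

# Export the PyTorch checkpoint to an ONNX graph (opset 11 is a common default)
convert(framework="pt", model="distilbert-base-uncased", output=onnx_path, opset=11)

# Graph-level optimizations, then dynamic quantization of the weights;
# both helpers write a new file next to the original and return its path
optimized_path = optimize(onnx_path)
quantized_path = quantize(optimized_path)
print(f"Quantized ONNX model at {quantized_path}")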
Just had a look. That kind of simplicity is exactly what I am looking for: nlp = pipeline("sentiment-analysis", onnx=True)
Good prototype. The main caveats I can think of are:
(i) ONNX conversion is done on-device - I’d rather pull pre-computed ONNX files from the model hub
(ii) some level of duplication of huggingface source code
(iii) possible improvements in the tokenizer choices to use fast versions when available (a rough sketch of what this could look like follows below)
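To make that concrete, here is a minimal sketch of where the prototype could head: a pre-exported .onnx file pulled from the hub plus a fast tokenizer behind a pipeline-like call. The class name, label mapping and file path are made up for illustration; only the onnxruntime and tokenizer APIs are real:

import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

class OnnxSentimentPipeline:
    """Pipeline-like wrapper around a pre-exported ONNX classifier (illustrative only)."""

    def __init__(self, onnx_path, tokenizer_name):
        # use_fast=True addresses (iii): prefer fast tokenizers when available
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)
        # loading a pre-computed .onnx file addresses (i): no on-device conversion
        self.session = InferenceSession(onnx_path)
        self.labels = ["NEGATIVE", "POSITIVE"]

    def __call__(self, text):
        inputs = self.tokenizer(text, return_tensors="np")
        # feed names must match the input names the graph was exported with
        logits = self.session.run(None, {k: v for k, v in inputs.items()})[0]
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        idx = int(probs.argmax(-1)[0])
        return [{"label": self.labels[idx], "score": float(probs[0, idx])}]

# nlp = OnnxSentimentPipeline("distilbert-sst2.onnx", "distilbert-base-uncased-finetuned-sst-2-english")
# nlp("That kind of simplicity would be great!")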
I would really like to have this option natively in huggingface so I can use it in production applications where inference speed matters a lot.
And I agree with @bdalal: it’s “quite easy” to convert a model to an ONNX checkpoint and use it in, say, NodeJS, which doesn’t have a mature ecosystem for deep learning.
I would do the conversion myself if I needed to fine-tune a model, but for hub models it would be more convenient to download them directly (if converted automatically). I was almost considering converting the models I use and hosting them for my team…
A model.save(args, onnx=True) to save an ONNX file locally (by calling convert_graph_to_onnx) would be really useful as well!
So here are the two cents of a really happy user.
Thanks for this wonderful tool!
Have a great day
Is it possible to use ONNX models directly from the Hub, either by directly referencing the file, or via the from_pretrained() method?
That is, before converting to ONNX, I instantiate a (private) model via model = AutoModelForSeq2SeqLM.from_pretrained('org/model_name', use_auth_token=True). If I push a .onnx file up to 'org/model_name', how can I load it?
Hi @sam-writer, there’s currently no way to do that with from_pretrained, and I don’t think there will be in the foreseeable future (first of all, we would have to choose a runtime for that ONNX file, and pulling in onnxruntime forces a lot onto the library; transformers is probably not the best place for that).
That being said, if you have a .onnx file in your repo, the best way to use it is with your preferred runtime (onnxruntime most likely).
import os
from huggingface_hub import cached_download
from onnxruntime import InferenceSession

# use /resolve/ (not /raw/) so the actual LFS-stored .onnx weights are downloaded, not the pointer file
file = cached_download("https://huggingface.co/{USERNAME}/{MODEL_ID}/resolve/main/{FILE}.onnx", use_auth_token=os.getenv("API_TOKEN"))
sess = InferenceSession(file)
# go on with inferencing with onnx
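Continuing that snippet, inference is then just a matter of tokenizing to NumPy arrays and feeding the session; the repo id below is the same placeholder, and the feed keys have to match the input names of the exported graph:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("{USERNAME}/{MODEL_ID}", use_auth_token=os.getenv("API_TOKEN"))
inputs = tokenizer("Some text to run through the ONNX model", return_tensors="np")
outputs = sess.run(None, {k: v for k, v in inputs.items()})
print(outputs[0])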