In many cases, ONNX-optimized models offer significantly better inference performance than their PyTorch counterparts.
This performance boost, combined with the pipelines offered by HuggingFace, makes for a great experience in terms of both inference speed and model quality.
Right now, it’s possible to use ONNX models with a small amount of modification to the pipeline.py code. In my tests of the QuestionAnsweringPipeline on the SQuADv2 dataset, I see performance improvements of 1.5-2x with models like bert-large-uncased-whole-word-masking-finetuned-squad and distilbert-base-cased-distilled-squad.
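To make the idea concrete, here’s a rough sketch of the kind of change I mean: swapping the PyTorch forward pass for an onnxruntime session while keeping the tokenizer and post-processing the same. The `qa_model.onnx` path is just a placeholder, and I’m assuming the exported graph takes `input_ids`/`attention_mask` and returns start/end logits, as the SQuAD-finetuned models do:

```python
# Minimal sketch: running the QA forward pass through onnxruntime instead of
# PyTorch. Assumes the model was already exported to "qa_model.onnx" (example
# path) with input_ids / attention_mask inputs and start/end logit outputs.
import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
session = InferenceSession("qa_model.onnx")  # path to the exported model (assumption)

question = "What does ONNX stand for?"
context = "ONNX stands for Open Neural Network Exchange."

inputs = tokenizer(question, context, return_tensors="np")
# onnxruntime expects plain numpy arrays keyed by the graph's input names
onnx_inputs = {k: v for k, v in inputs.items() if k in {"input_ids", "attention_mask"}}
start_logits, end_logits = session.run(None, onnx_inputs)

# Same answer-span post-processing the pipeline already does
start = int(np.argmax(start_logits))
end = int(np.argmax(end_logits)) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```

In my hacky version this logic basically replaces the `self.model(**inputs)` call inside the pipeline, so everything upstream (tokenization) and downstream (span selection, scoring) stays untouched.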
I’m wondering whether this is something the devs/community would consider worthwhile to support directly in the transformers repo. I’m doing some work on this for my own project, but it’s mostly a hacky implementation at this point.
If there’s broader interest, I could try to integrate ONNX support for inference more fully into the codebase.
Would love to hear everyone’s thoughts.
Thanks