Is PyTorch’s native quantization not the same as the dynamic quantization provided by the Hugging Face pipeline?
The convert_graph_to_onnx.py script in transformers uses ONNX Runtime (ORT) for dynamic quantization: transformers/convert_graph_to_onnx.py at af8afdc88dcb07261acf70aee75f2ad00a4208a4 · huggingface/transformers · GitHub
This is different from the dynamic quantization provided in PyTorch which only supports quantization of the linear layers, whereas ORT has a larger set of operators to work with (and hence gets better compression). The reason I suggested the PyTorch approach first is that it’s one line of code and you can quickly see if it meets your latency requirements or not.
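To make the "one line of code" concrete, here is a minimal sketch of PyTorch's dynamic quantization. The toy model stands in for a real transformer; with a Hugging Face model you would pass the loaded model object instead:

```python
import torch
from torch import nn

# Toy stand-in for a transformer; in practice this would be e.g.
# AutoModelForSequenceClassification.from_pretrained(...)
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()

# The one-liner: only nn.Linear layers are converted to int8,
# which is exactly the limitation discussed above.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

The activations are quantized on the fly at inference time, so no calibration data is needed, which is why this is so quick to try before reaching for ORT.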
Secondly, are ONNX and ORT separate things? Is the acceleration not provided by the ONNX model itself, so that it needs to be run by ONNX Runtime?
Yes, they are different things. ONNX is the specification (sometimes called the intermediate representation), while ORT is an accelerator that applies specific optimisations to the ONNX graph to speed up inference (you can find a list of other accelerators here: ONNX | Supported Tools).
A nice picture of how the pieces fit together can be seen here:
The basic idea is that ONNX acts as a kind of “common denominator” for various accelerators / backends, and each accelerator can specify its own execution provider for the target hardware. For example, when you create an InferenceSession with ORT you can specify whether you are running on CPU / GPU: Execution Providers - onnxruntime
Could you tell me more about what you mean by a web application? My goal isn’t to provide a website where clients could run inference on their data via a web browser. I just want to put my model on the cloud and provide an API to which I can send requests in real time from my global application. Would FastAPI + Docker fill those needs?
Sorry for being imprecise - I just meant that you can create an endpoint like you said. FastAPI + Docker can certainly meet those needs, but it might not be worth the effort if Cortex can do the job.
Is it possible to deploy on EC2 instances without k8s? If so, what are the methods?
Yes, you could either deploy the FastAPI app directly on the EC2 instance or containerize the app with Docker and then deploy the container on the instance. There are many tutorials online, e.g. Deploying FastAPI Web Application in AWS | by Meetakoti Kirankumar | Medium