Deploy a multilingual sentence transformer to the cloud

Hi community,

I am new to transformer models and particularly interested in a multilingual sentence transformer (stsb-xlm-r-multilingual, 1.1 GB).

I have spent a lot of time searching for how to deploy this model to the cloud with the highest throughput (up to 1000 requests/sec) and the lowest latency (<1 sec), while trying to keep costs down. It seems to be painful.

Would anyone have advice on deployment requirements (CPU, GPU, RAM, …), cloud providers (AWS, GCP, …), and serving frameworks (TorchServe, Triton, …) to meet these needs?

Thanks!

Hi,

Would anyone have advice?

Thanks!

Hi @Matthieu,

The answer to your question would require some experimentation with the model in the various scenarios, but it’s generally cheaper to use CPUs and scale horizontally so you can achieve high throughput. You can find some nice case studies here:

Regarding frameworks, I usually deploy my models with FastAPI + Docker and have not found a strong need to use TorchServe yet (which is experimental in any case). For clouds I don’t have a strong opinion, but I have generally found the UX of GCP to be better than AWS :slight_smile: Other alternatives include Azure or Paperspace Gradient, but I’d definitely do some investigation before committing to any of the providers.

HTH!


Hi @lewtun,

It is indeed cheaper to use CPUs and scale horizontally to achieve better throughput. However, this approach doesn’t improve latency in the case of real-time inference with batch size = 1.

https://toriml.medium.com/why-you-should-use-different-backends-when-deploying-dl-models-e3d23ee58a9d

Is using FastAPI + Docker a common, easy way to deploy on Kubernetes? Do you have any reference that would guide me step by step?

Thanks for the guidance on which cloud has the better UX, but does GCP constrain the range of target hardware architectures available?
Would you use EC2 instances directly, or SageMaker?

Thanks!

You’re right about latency - so to improve that I’d first try using PyTorch’s native quantization to see if it meets your needs, followed by ONNX + ORT (as you seem to be exploring in a separate thread :))
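To make that first step concrete, here is a minimal sketch of what the PyTorch dynamic-quantization experiment could look like (the model ID is the one from your original post; everything else is illustrative rather than a tuned setup):

```python
# Minimal sketch (not production code): dynamic quantization of the model
# from the original question with PyTorch's built-in one-liner.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/stsb-xlm-r-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# The "one line": quantize the nn.Linear layers to int8 weights.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Quick latency sanity check with batch size = 1.
inputs = tokenizer("A single sentence to embed.", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)
print(outputs.last_hidden_state.shape)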

If that’s still not fast enough, you could try fine-pruning your models with the brand-new nn_pruning library (link) followed by ORT optimization

Is using FastAPI + Docker a common, easy way to deploy on Kubernetes? Do you have any reference that would guide me step by step?

This is definitely a common workflow when you want to deploy your model as a web application that other services can interact with (an alternative to FastAPI is Flask, but I personally find it clunkier to use). Here’s a pretty good tutorial on getting started with them (just ignore the dependency-injection stuff because your model is way too big!): How to properly ship and deploy your machine learning model | by Tivadar Danka | Towards Data Science
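For a rough idea of what such an app looks like, here is a minimal FastAPI sketch (illustrative only; a production version would need batching, input validation, warm-up, health checks, etc.):

```python
# Minimal FastAPI sketch of an embedding endpoint (illustrative only).
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/stsb-xlm-r-multilingual"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

app = FastAPI()


class EmbedRequest(BaseModel):
    text: str


@app.post("/embed")
def embed(request: EmbedRequest):
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Simple mean pooling over tokens to get one sentence vector.
    embedding = hidden.mean(dim=1).squeeze(0).tolist()
    return {"embedding": embedding}

# Run locally with: uvicorn main:app --host 0.0.0.0 --port 8000
# Containerizing is then just a matter of copying this file and its
# requirements into a Docker image.
```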

Depending on your use case / environment restrictions, an alternative to the whole web app / docker / k8s complexity could be cortex labs: GitHub - cortexlabs/cortex: Production infrastructure for machine learning at scale

I haven’t tried it myself, but it looks like a nice approach to deploying models by just defining a few Python classes and performance-testing them via a CLI.

Thanks for the guidance on which cloud has the better UX, but does GCP constrain the range of target hardware architectures available?
Would you use EC2 instances directly, or SageMaker?

Ah that’s a good point re hardware - as far as I know AWS has more options than GCP (sometimes I couldn’t get access to a specific VM on GCP). I’ve usually used EC2 instances in the past so can’t comment on Sagemaker, although some friends of mine warned me that Sagemaker can be pricey :money_mouth_face:

You’re right about latency - so to improve that I’d first try using PyTorch’s native quantization to see if it meets your needs, followed by ONNX + ORT (as you seem to be exploring in a separate thread :))

Isn’t PyTorch’s native quantization the same as the dynamic quantization proposed by the Hugging Face pipeline? Secondly, are ONNX and ORT separate things? Isn’t the acceleration provided by the ONNX model itself, or does it need to be run with ONNX Runtime?

This is definitely a common workflow when you want to deploy your model as a web application that other services can interact with (an alternative to FastAPI is Flask, but I personally find it clunkier to use)

Could you clarify what you mean by a web application? My goal isn’t to provide a website where clients can run inference on their data through a web browser. I just want to put my model on the cloud and expose an API to which I can send real-time requests from my broader application. Does FastAPI + Docker fit those needs?

Depending on your use case / environment restrictions, an alternative to the whole web app / docker / k8s complexity could be cortex labs: GitHub - cortexlabs/cortex: Deploy, manage, and scale machine learning models in production

Thanks for the link; I had indeed already taken a look at Cortex, and it seems interesting!

Ah that’s a good point re hardware - as far as I know AWS has more options than GCP (sometimes I couldn’t get access to a specific VM on GCP). I’ve usually used EC2 instances in the past so can’t comment on Sagemaker, although some friends of mine warned me that Sagemaker can be pricey :money_mouth_face:

Is it possible to deploy on EC2 instances without k8s? If so, what are the methods?

Isn’t PyTorch’s native quantization the same as the dynamic quantization proposed by the Hugging Face pipeline?

The convert_graph_to_onnx.py script in transformers uses ONNX Runtime (ORT) for dynamic quantization: https://github.com/huggingface/transformers/blob/af8afdc88dcb07261acf70aee75f2ad00a4208a4/src/transformers/convert_graph_to_onnx.py#L421

This is different from the dynamic quantization provided in PyTorch which only supports quantization of the linear layers, whereas ORT has a larger set of operators to work with (and hence gets better compression). The reason I suggested the PyTorch approach first is that it’s one line of code and you can quickly see if it meets your latency requirements or not.
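As a rough sketch, the ORT quantization step could look like the following (the file names are placeholders, and the ONNX export is assumed to have been done already, e.g. with that script):

```python
# Sketch of the ORT dynamic-quantization step. File names are placeholders;
# "model.onnx" is assumed to be the graph already exported from the
# transformers model (e.g. via convert_graph_to_onnx.py).
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",
    model_output="model-quantized.onnx",
    weight_type=QuantType.QInt8,
)
```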

Secondly, are ONNX and ORT separate things? Isn’t the acceleration provided by the ONNX model itself, or does it need to be run with ONNX Runtime?

Yes, they are different things. ONNX is the specification (sometimes called the intermediate representation), while ORT is an accelerator which uses specific optimisations of the ONNX graph to speed up inference (you can find a list of other accelerators here: ONNX | Supported Tools)

A nice picture of how the pieces fit together can be seen here:

[image: diagram of how ONNX, accelerators, and target hardware fit together]

The basic idea is that ONNX acts as a kind of “common denominator” for various accelerators / backends, and each accelerator can specify its own execution provider for the target hardware. For example, when you create an InferenceSession with ORT you can specify whether you are running on CPU / GPU: https://www.onnxruntime.ai/docs/reference/execution-providers/
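For illustration, here is a minimal sketch of creating a session with explicit providers (the file name and input names are assumptions that depend on how the graph was exported):

```python
# Sketch: picking an execution provider when creating an ORT InferenceSession.
# Providers are tried left to right, so this falls back to CPU if no GPU build
# is available. The file name and input names are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model-quantized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs just to show the call shape.
outputs = session.run(
    None,  # return all outputs
    {
        "input_ids": np.array([[0, 9, 2]], dtype=np.int64),
        "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
    },
)
print(outputs[0].shape)
```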

Could you clarify what you mean by a web application? My goal isn’t to provide a website where clients can run inference on their data through a web browser. I just want to put my model on the cloud and expose an API to which I can send real-time requests from my broader application. Does FastAPI + Docker fit those needs?

Sorry for being imprecise - I just meant that you can create an endpoint like you said. FastAPI + Docker certainly can meet those needs, but might not be worth the effort if Cortex can do the job :slight_smile:

Is it possible to deploy on EC2 instances without k8s? If so, what are the methods?

Yes, you could either deploy the FastAPI app directly on the EC2 instance, or containerize the app with Docker and then deploy the container on the instance. There are many tutorials online, e.g. Deploying FastAPI Web Application in AWS | by Meetakoti Kirankumar | Medium


Hi @lewtun many thanks for the detailed answers!

In terms of latency and throughput, which of the options you mentioned for deploying to EC2 is best: FastAPI, FastAPI + Docker, or Cortex?

Hey @Matthieu, I haven’t tried Cortex so cannot comment on its latency / throughput (although I guess it is pretty good since their whole business model is based on production ML). For FastAPI vs FastAPI + Docker, I would not expect any noticeable differences, although the advantage of using Docker is that you can quickly deploy the API on different machines without needing to manually install everything each time.

If you try cortex, I would be really interested to hear what you think of it!

Thanks @lewtun for your feedback!

I will send you feedback about Cortex if I try it!


Hi @Matthieu,

Any updates on your deployment process?