Best practices for choosing instance size for inference


I was wondering what a good procedure is for choosing instance sizes when deploying Hugging Face models as SageMaker inference endpoints. Are there resources available on which instance sizes work well for different model sizes, and on the corresponding performance (latency, invocations per minute, etc.)?

Right now I’m deploying facebook/bart-large-cnn and just manually testing different instance types to find out what works for our use case, but I feel this could be done a bit faster.
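For the manual approach, a small timing harness at least makes the per-instance-type comparison systematic. Below is a minimal sketch: the `benchmark` helper and its parameter names are my own invention, and the commented-out deployment lines assume the SageMaker Python SDK's `HuggingFaceModel`, so treat them as a rough outline rather than copy-paste code.

```python
import statistics
import time

def benchmark(predict_fn, payload, warmup=3, runs=20):
    """Time repeated calls to predict_fn and report latency percentiles in ms."""
    for _ in range(warmup):
        predict_fn(payload)                      # warm up model workers/caches
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p90_ms": latencies[int(0.9 * (len(latencies) - 1))],
        "max_ms": latencies[-1],
    }

# With the SageMaker SDK, this would run once per candidate instance type, e.g.:
# predictor = HuggingFaceModel(
#     env={"HF_MODEL_ID": "facebook/bart-large-cnn", "HF_TASK": "summarization"},
#     role=role, transformers_version="4.26", pytorch_version="1.13", py_version="py39",
# ).deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
# stats = benchmark(lambda p: predictor.predict(p), {"inputs": article_text})
```

Comparing the resulting p50/p90 numbers (and the instance's hourly price) across a handful of types is essentially what Inference Recommender automates.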


Don’t know of anything on the Hugging Face side. It would be a great resource for the month that it stays current. :smiley:

From the docs:

The size and type of data can have a great effect on which hardware configuration is most effective. When the same model is trained on a recurring basis, initial testing across a spectrum of instance types can discover configurations that are more cost-effective in the long run. Additionally, algorithms that train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine the most cost-effective solution. To get an automatic instance recommendation or conduct custom load tests, use Amazon SageMaker Inference Recommender.
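To make the Inference Recommender pointer concrete: the service is driven through the `create_inference_recommendations_job` API on the SageMaker boto3 client. The sketch below only builds the request dictionary; the job name, role ARN, and model package ARN are placeholders, and the field set is the minimal one I'd expect for a "Default" job, so check the API reference before relying on it.

```python
def build_recommender_request(job_name, role_arn, model_package_arn):
    """Assemble a minimal Inference Recommender job request (sketch)."""
    return {
        "JobName": job_name,
        "JobType": "Default",  # "Default" = instance recommendations;
                               # "Advanced" = custom load tests
        "RoleArn": role_arn,
        "InputConfig": {
            "ModelPackageVersionArn": model_package_arn,
        },
    }

request = build_recommender_request(
    "bart-large-cnn-recommender",
    "arn:aws:iam::111122223333:role/SageMakerRole",                    # placeholder
    "arn:aws:sagemaker:us-east-1:111122223333:model-package/bart/1",   # placeholder
)
# Submitting it would look like:
# boto3.client("sagemaker").create_inference_recommendations_job(**request)
```

A "Default" job returns ranked instance-type recommendations with latency and cost estimates, which replaces most of the manual trial-and-error described above.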