How to deploy a fine-tuned T5 model in production

Hi All,

I am trying to deploy a fine-tuned T5 model in production. Deploying a PyTorch model in production is new to me. I went through the Hugging Face presentation on YouTube about how they deploy their models, as well as some other blog posts.

HF mentions that they deploy the model in a Cython environment because it gives a ~100x boost to inference. Is it always advisable to run a model in production on Cython?
Does converting a PyTorch model to TF help, and is it advisable?
What is the preferred container approach for running multiple models on a set of GPUs?

I know some of these questions may be basic, and I apologize for that, but I want to make sure I follow the correct guidelines for deploying a model in production.

Thank you
Amit

Hi @as-stevens,

I don’t know which blog post you’re referring to for getting a 100x speedup with Cython, but I guess it really depends on where the bottleneck is.
T5 models are Seq2Seq models, so I would recommend sticking with PyTorch and finding a way to optimize the hot path (the decoder path). TF could work, but transformers currently can’t use various graph optimizations in TF (we’re working on it).
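
To make the "where is the bottleneck" point concrete, here is a minimal sketch of timing each stage of an inference pipeline separately before deciding what to optimize. The helper names (`profile_stages`, `slowest_stage`) and the stand-in stages are hypothetical and only there so the example runs without a model; in a real setup you would pass your own tokenizer call, the model's generation call, and the decoding call as the three stages.

```python
import time
from collections import defaultdict


def profile_stages(stages, payload, repeats=5):
    """Run `payload` through each (name, fn) stage in order, `repeats` times.

    Returns (output of the last run, mean seconds spent in each stage).
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    totals = defaultdict(float)
    output = None
    for _ in range(repeats):
        value = payload
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)
            totals[name] += time.perf_counter() - start
        output = value
    means = {name: total / repeats for name, total in totals.items()}
    return output, means


def slowest_stage(means):
    """Return the name of the stage with the highest mean time."""
    return max(means, key=means.get)


# Stand-in stages (assumptions for illustration, not real model calls).
def fake_tokenize(text):
    return text.split()


def fake_generate(tokens):
    time.sleep(0.01)  # pretend the decoder loop dominates the runtime
    return [token.upper() for token in tokens]


def fake_decode(tokens):
    return " ".join(tokens)


stages = [
    ("tokenize", fake_tokenize),
    ("generate", fake_generate),
    ("decode", fake_decode),
]
output, means = profile_stages(stages, "translate this sentence", repeats=3)
print(output)
print("slowest stage:", slowest_stage(means))
```

If the generation stage dominates, as it usually would for a Seq2Seq model that decodes token by token, that is the part worth optimizing first. One caveat when timing on a GPU: CUDA calls are asynchronous, so you would need to call `torch.cuda.synchronize()` before reading the clock, otherwise the measured times can be misleading.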

Alternatively, you can run it on our hosted Inference API to avoid the hassle of managing all the different layers yourself: https://huggingface.co/pricing (some optimizations are only enabled for customers).

Hope that helps.
Cheers,
Nicolas
