I am not new to ML/NLP, but I am new to HuggingFace, and I am trying to piece together what is involved in deploying a full solution, rather than just a model.
All the tutorials/docs I have read concentrate on deploying a model using Inference Endpoints or SageMaker; they all assume that the response from the model is all there is to the solution (e.g., get sentiment analysis from this blurb of text).
However, in my case, the model is just one part of a larger solution that also includes some post-processing and database querying after the NLP step. That means I would not only need to deploy the model, but also the database and all the other Python code needed for a complete solution.
In such a case, it seems like I would need, at a minimum:
- a reverse proxy to hide the secret tokens and limit bogus traffic
- deployed Python code containing the custom algorithms and the calls to the model and database
- the deployed model
- the deployed database
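To make that concrete, here is a very rough sketch of what I imagine the custom Python layer looking like. Everything in it is a placeholder (the endpoint URL, the environment variable names, the Postgres table), not working code from my project; the model call is just an HTTPS POST to the Inference Endpoint, with the token kept server-side so clients never see it:

```python
import os

import psycopg2  # assuming a Postgres database; could be any DB client
import requests

# Placeholders: the endpoint URL and token live in environment variables on the
# server, behind the proxy, so they are never exposed to callers.
HF_ENDPOINT_URL = os.environ["HF_ENDPOINT_URL"]
HF_TOKEN = os.environ["HF_TOKEN"]


def call_model(text: str):
    """POST the text to the deployed Inference Endpoint and return its JSON."""
    resp = requests.post(
        HF_ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def postprocess(model_output) -> str:
    """Stand-in for my custom algorithms.

    The exact response shape depends on the task/model behind the endpoint.
    """
    return model_output[0]["label"] if isinstance(model_output, list) else str(model_output)


def handle_request(text: str) -> dict:
    """Model call -> custom post-processing -> database query."""
    label = postprocess(call_model(text))

    # Look up related records based on the NLP result (table/columns are made up).
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT payload FROM results WHERE label = %s", (label,))
            rows = cur.fetchall()

    return {"label": label, "matches": rows}
```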
I really, really want to limit the complexity of doing all of this, so Inference Endpoints seems worth the extra cost for the convenience. However, since I would also be deploying a database, additional Python code, and a proxy server, and HuggingFace does not provide those pieces, I am worried about latency between my Python code and the model calls. Is that mitigated at all if you deploy the Python code and database to the same cloud provider that you use for Inference Endpoints (AWS or Azure)?
Does anyone have a recommendation on how best to deploy all of this? Before learning about Inference Endpoints, I was assuming I would have to package everything into one or more Docker images and deploy them to something like Cloud Run. But as I said, I would rather not spend a lot of time and effort on DevOps, figuring out how to package things up and deploy them.
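For what it's worth, my mental model of the Cloud Run route is just a thin HTTP wrapper around the pipeline sketched above, baked into a Docker image. Again, this is only a sketch with made-up module names, not something I have running:

```python
# app.py - minimal HTTP wrapper that could be built into a Docker image
# and deployed to something like Cloud Run.
from fastapi import FastAPI
from pydantic import BaseModel

from pipeline import handle_request  # the function sketched above, assumed to live in pipeline.py

app = FastAPI()


class Query(BaseModel):
    text: str


@app.post("/analyze")
def analyze(query: Query) -> dict:
    # Secrets stay in the container's environment; the client only sees this route.
    return handle_request(query.text)
```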
Any help would be greatly appreciated!