I wanted to contribute to an open-source project by adding the very nice “FullStop” punctuation restoration model to it.
And I did; Hugging Face makes it easy to find the right tools. It ran well while I developed it as a single-process Flask project, but now that I'm thinking about deployment, I wonder how it will use memory. If I run it as a 4-worker Gunicorn project, will it take 4 times more memory? (My contribution would then cost around 2.5 GB × 4 = 10 GB of memory, less than ideal for a non-essential feature.)
If so, is there a recommended way to keep the model in some kind of shared memory? Is there a way to load it only once? I'm only doing CPU-based inference.
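For what it's worth, the behaviour I'm hoping for is the classic fork copy-on-write trick, which I believe is what Gunicorn's `--preload` flag enables: load the model in the master process, then fork the workers so they share the parent's memory pages. Here is a minimal stdlib-only sketch of the idea, with a plain dict standing in for the real model (a hypothetical placeholder, not the actual project code). I've also read that CPython's reference counting writes to object headers and can gradually defeat copy-on-write, so I'm not sure how much memory this saves in practice:

```python
import os

# Stand-in for the ~2.5 GB model: loaded ONCE, in the parent process,
# before any worker is forked. (Hypothetical placeholder; the real app
# would load the transformers pipeline here instead.)
big_object = {"weights": list(range(1_000_000))}

def worker(idx: int) -> None:
    # Forked children inherit the parent's memory pages copy-on-write,
    # so merely *reading* big_object should not duplicate it.
    print(f"worker {idx} sees {len(big_object['weights'])} weights")

if __name__ == "__main__":
    pids = []
    for i in range(2):  # mimic Gunicorn forking 2 workers
        pid = os.fork()
        if pid == 0:    # child process
            worker(i)
            os._exit(0)
        pids.append(pid)
    for pid in pids:    # parent waits for both workers to finish
        os.waitpid(pid, 0)
```

With Gunicorn itself, I assume the equivalent would be something like `gunicorn --preload -w 4 app:app`, so the module-level model load happens once before forking — but I'd love confirmation that this actually works for large PyTorch tensors on CPU.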
I could put the feature behind a single-purpose API and hope punctuation inference is fast enough that a single worker suffices, but I'm afraid I'd run into the same problem later with heavier tasks, and I'd have moved a lot of things around for a new design that doesn't solve the memory usage for most models.
I know the Hugging Face hosted inference service is an excellent way to solve this, but since this is an open-source program, I would prefer that users have a fully self-hosted way to make it work.