Memory usage of gunicorn workers?

Hello,

I wanted to contribute to an open-source project by adding the very nice "FullStop" punctuation restoration model to it.

And I did; Hugging Face makes it easy to find the right tools. It ran well while I developed it as a single-process Flask project, but now that I'm thinking about deployment, I wonder how it's going to use memory. If I keep it as a 4-worker gunicorn project, is it going to take 4 times more memory? (My contribution would cost around 2.5 GB × 4 = 10 GB of memory, less than ideal for a non-essential feature.)
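For context, here is roughly what my prototype looks like (the model id and route names are illustrative, not the project's actual code):

```python
# app.py: roughly my single-process prototype
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Loaded at import time, so every gunicorn worker that imports this
# module holds its own ~2.5 GB copy of the weights.
punct = pipeline(
    "token-classification",
    model="oliverguhr/fullstop-punctuation-multilang-large",
)

@app.route("/punctuate", methods=["POST"])
def punctuate():
    text = request.get_json()["text"]
    preds = punct(text)
    # scores come back as numpy floats; cast so jsonify accepts them
    return jsonify([{**p, "score": float(p["score"])} for p in preds])
```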

If memory really does scale per worker like that, is there a recommended way to load the model into some kind of shared memory? Is there a way to load it only once? I'm only doing CPU-based inference.
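The closest thing I've found so far is gunicorn's `preload_app`, which loads the application once in the master before forking, so the weights should be shared between workers via copy-on-write, at least until Python starts writing to those pages. A sketch of what I mean, with torch's `share_memory()` thrown in as a guess on my part:

```python
# gunicorn.conf.py: the "load once, then fork" idea I'm asking about
# (leans on OS copy-on-write; I'm genuinely unsure how much stays
#  shared once Python's refcounting dirties those pages)
import app  # importing app.py loads the model in the master process

# Guess: torch can move the weights into shared memory, which forked
# workers would then inherit instead of copying.
app.punct.model.share_memory()

preload_app = True   # fork the workers *after* the imports above
workers = 4
bind = "0.0.0.0:8000"
```

Running it would just be `gunicorn -c gunicorn.conf.py app:app`, but I don't know whether this actually keeps the full 2.5 GB shared in practice.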

Alternatively, I could put the feature behind a single-purpose API and hope punctuation inference is fast enough that a single worker suffices, but I'm afraid that at some point I'll run into the same problem with heavier tasks, and I will have moved a lot of things around for a new design that doesn't solve the memory usage for most models.
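For the record, that workaround would look something like this (URL, port, and timeout are made up for the sketch): the model lives in its own tiny service pinned to one worker, and the main app calls it over HTTP.

```python
# punct_client.py: how the main app would call the single-purpose
# service; the service itself is just app.py above, run with one worker:
#   gunicorn -w 1 -b 127.0.0.1:9000 app:app
import requests

PUNCT_URL = "http://127.0.0.1:9000/punctuate"

def restore_punctuation(text: str) -> list:
    """Forward text to the one-worker model service."""
    resp = requests.post(PUNCT_URL, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()
```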

I know Hugging Face's hosted inference service is an excellent way to solve this, but since this is an open-source program, I would prefer that users have a fully self-hosted way to make it work.

Best regards,

s.

We could somewhat work around the issue for our use case by limiting the number of processes, but I'm still looking for a real solution in case the same model is used many times. So feel free to share any advice on the topic.