Memory usage of gunicorn workers?


I wanted to contribute to an open-source project by adding the very nice "fullstop" punctuation restoration to it.

And I did; Hugging Face makes it easy to find the right tools. It ran well while I developed it as a single-process Flask project, but now that I'm thinking about deployment, I wonder how it will use memory. If I keep it as a 4-worker gunicorn project, will it take 4 times more memory? (My contribution would then cost around 2.5 GB × 4 = 10 GB of memory, which is less than ideal for a non-essential feature.)

If so, is there a recommended way to have the model loaded as some kind of shared-memory data? Is there a way to load it only once? I'm only doing CPU-based inference.
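One approach I've been reading about, though I'm not sure how well it holds up in practice, is gunicorn's `preload_app` setting: the app (and therefore the model) is imported once in the master process before the workers are forked, so the read-only weight buffers could be shared between workers via copy-on-write instead of being loaded 4 times. A minimal config sketch, assuming a hypothetical `app.py` that loads the model at module import time (CPython refcounting can still touch some shared pages over time, so the savings may be partial rather than a full 4× reduction):

```python
# gunicorn.conf.py -- hypothetical config sketch, not tested at scale
#
# preload_app makes the gunicorn master import the WSGI app (and thus
# load the model) BEFORE forking, so the workers inherit the already
# loaded weights through copy-on-write pages.
preload_app = True
workers = 4
# CPU-only inference here, so there is no CUDA context to worry about
# after fork; with a GPU model, preloading before fork is more fragile.
```

For this to help, `app.py` would need to build the model at module level (i.e., at import time), not lazily inside a request handler, otherwise each worker still loads its own copy on first use.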

I could put the feature behind a single-purpose API and hope punctuation inference is fast enough that a single worker suffices. But I'm afraid that at some point I'd run into the same problem with heavier tasks, and I would have moved a lot of things around for a new design that still doesn't solve the memory usage for most models.

I know the Hugging Face hosted inference service is an excellent way to solve this, but since this is an open-source program, I would prefer that users have a fully self-hosted way to make it work.

Best regards,


We could somewhat work around the issue for our use case by limiting the number of processes, but I'm still looking for a real solution in case the same model is used many times. So feel free to share any advice on the topic.