Scaling inference with GPT2 on Docker; CPU only

tensorphlo · November 13, 2020, 3:08am

I’m trying to deploy a fine-tuned GPT2LMHeadModel in a Docker container using gunicorn. With 2 workers, I was surprised to find that I can serve one request at a time in about 40 seconds, but two concurrent requests take more than 10 minutes each.

If I reduce it to 1 worker and send 2 concurrent requests, then it blocks the second while serving the first (again taking about 40 seconds) and then serves the second, so the total latency for the second is about 80 seconds. Much better than the first case, but who wants a completely blocking server?!

My understanding is that each worker runs in its own process and has its own copy of the model, so I am surprised to see this evidence of competition for CPU. I did some profiling with cProfile, and in general the PyTorch matrix-multiply functions (there are several) all take significantly longer per call when there are concurrent requests. I don’t see why that would be, unless PyTorch is implicitly parallelizing each request across multiple cores, so that concurrent requests compete for cores that are already fully used. And even then, if each request gets half as many resources, I’d expect it to take about twice as long, not 20 times as long.
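One way to test the oversubscription theory is to cap the number of threads each worker may use, so that the workers together do not ask for more cores than exist. The following is only a sketch under assumptions: the helper names `threads_per_worker` and `cap_threads` are hypothetical, and an even split of cores between workers is just one reasonable choice. `torch.set_num_threads` and the `OMP_NUM_THREADS` / `MKL_NUM_THREADS` environment variables are the real knobs.

```python
import os


def threads_per_worker(total_cpus: int, workers: int) -> int:
    """Split the available cores evenly between workers (at least 1 each)."""
    return max(1, total_cpus // max(1, workers))


def cap_threads(workers: int) -> int:
    """Limit the math-library thread pools of the current process.

    Hypothetical helper: call it once per worker process, before the
    first inference call.
    """
    total = os.cpu_count() or 1
    n = threads_per_worker(total, workers)

    # Read by OpenMP / MKL when their thread pools start up, so these
    # should be set as early as possible in the process.
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)

    try:
        import torch  # optional here so the sketch runs without PyTorch
        torch.set_num_threads(n)  # intra-op parallelism for this process
    except ImportError:
        pass
    return n
```

If per-call matmul times stop growing under concurrent load once the threads are capped, that would point at thread oversubscription rather than lock contention.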

Any ideas about where I should look to find some kind of lock contention, or how to profile with greater granularity? I’ll eventually put this on GPU, but I want to solve this problem first (also, I’m noticing similar trends when I do try on GPU).

UPDATE: For various reasons, I assumed that this was NOT an issue with Docker, but it actually does appear to be. Latency increases slightly when I run it directly on my machine, but on the order of 5-10 seconds, not 10 minutes. Then when I run in a Docker container, I get the results described above.
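Since the slowdown only shows up inside the container, one thing worth checking is whether the container has fewer CPUs than `os.cpu_count()` reports. A CPU quota (for example from `docker run --cpus`) is not reflected in `os.cpu_count()`, so thread pools may be sized for the host rather than for the container. The sketch below assumes cgroup v2, where the quota is exposed at `/sys/fs/cgroup/cpu.max`; the function names are hypothetical, and whether a quota is actually the cause here is a guess.

```python
import os
from pathlib import Path


def parse_cpu_max(text: str):
    """Parse cgroup v2 cpu.max content: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit as a float, or None if there is no quota.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpus(cpu_max_path: str = "/sys/fs/cgroup/cpu.max"):
    """Return the smaller of the host CPU count and the cgroup quota."""
    host = os.cpu_count() or 1
    try:
        limit = parse_cpu_max(Path(cpu_max_path).read_text())
    except (OSError, ValueError):
        # File missing (not cgroup v2, or not Linux) or unexpected format.
        limit = None
    return host if limit is None else min(host, limit)


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("effective CPUs:", effective_cpus())
```

Running this inside and outside the container shows whether the two environments really see the same CPU budget.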

Will keep digging and report back in case I find anything interesting.

rbarria · April 15, 2024, 10:00am

I am speculating here, but one possible reason for the 10 minutes with 2 workers is that the model is not pre-loaded for the 2nd worker.

So when that worker receives a request, it first loads the model… and only then serves the request.

Gunicorn has a --preload option (the preload_app setting) for this kind of problem: it loads the application in the master process before the workers are forked.
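As a rough sketch, a `gunicorn.conf.py` using that option could look like the following. `workers`, `preload_app`, `timeout` and the `post_fork` hook are real Gunicorn settings; the specific values, and capping PyTorch to one thread per worker, are assumptions for illustration. Preloading only helps if the model is loaded when the application module is imported, not lazily on the first request.

```python
# gunicorn.conf.py (sketch; values are assumptions, adjust to your setup)

workers = 2

# Same effect as passing --preload on the command line: the app module is
# imported once in the master process, then the workers are forked from it.
preload_app = True

# Gunicorn's default worker timeout is 30 s; a request that takes ~40 s
# needs a larger value or the worker will be killed mid-request.
timeout = 120


def post_fork(server, worker):
    """Runs in each worker right after it is forked."""
    try:
        import torch  # optional here so the sketch loads without PyTorch
    except ImportError:
        return
    # Assumption: one intra-op thread per worker, to keep the workers
    # from competing for the same cores.
    torch.set_num_threads(1)
```

Start the server with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the actual module and application object.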
