I’m trying to deploy a fine-tuned GPT2LMHeadModel in a Docker container using gunicorn. With 2 workers, I was surprised to find that I can serve a single request in about 40 seconds, but two concurrent requests take more than 10 minutes each.
If I reduce it to 1 worker and send 2 concurrent requests, then it blocks the second while serving the first (again taking about 40 seconds) and then serves the second, so the total latency for the second is about 80 seconds. Much better than the first case, but who wants a completely blocking server?!
My understanding is that each worker runs in its own process and has its own copy of the model, so I am surprised to see this evidence of competition for resources/CPU. I did some profiling with cProfile, and in general the PyTorch matrix-multiply functions (there are several) all take significantly longer per call when there are concurrent requests. I don’t see why that would be, unless PyTorch is implicitly parallelizing each request across multiple cores, so that concurrent requests end up competing for cores that are already fully used. And even then, if there are half as many resources available per request, I’d expect each request to take about twice as long, not 20 times as long.
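One way to test the implicit-parallelism hypothesis would be to cap the number of native threads each worker may use and see whether the concurrent-request latency changes. Below is a minimal sketch, not a confirmed fix. It assumes the PyTorch build uses OpenMP and/or MKL for CPU math, which read the `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables when they load, so the function must run before `import torch`. The helper names and the `WORKERS` variable are hypothetical, chosen for illustration only.

```python
import os

# Environment variables read by OpenMP and MKL when the native libraries load.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def threads_per_worker(workers, cpus=None):
    """Split the CPUs evenly across workers, never going below one thread."""
    if cpus is None:
        cpus = os.cpu_count() or 1
    return max(1, cpus // workers)


def cap_native_threads(workers, cpus=None):
    """Set the thread-count variables; call this before `import torch`."""
    n = threads_per_worker(workers, cpus)
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)
    return n


if __name__ == "__main__":
    # WORKERS is a hypothetical variable name, not something gunicorn sets.
    workers = int(os.environ.get("WORKERS", "2"))
    print("threads per worker:", cap_native_threads(workers))
```

If latency with two concurrent requests drops back toward the single-request figure once the cap is in place, that would point to thread oversubscription rather than lock contention. `torch.set_num_threads` is the in-process equivalent if setting environment variables is inconvenient.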
Any ideas about where I should look to find some kind of lock contention, or how to profile with greater granularity? I’ll eventually put this on GPU, but I want to solve this problem first (also, I’m noticing similar trends when I do try on GPU).
UPDATE: For various reasons, I assumed that this was NOT an issue with Docker, but it actually does appear to be. Latency increases slightly when I run it directly on my machine, but on the order of 5-10 seconds, not 10 minutes. Then when I run in a Docker container, I get the results described above.
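Since the slowdown only shows up inside the container, one check that might narrow it down is comparing the CPU count Python reports with the CPU quota the container is actually allowed. The sketch below assumes a Linux container and the standard cgroup file locations (`/sys/fs/cgroup/cpu.max` for cgroup v2, `cpu.cfs_quota_us` and `cpu.cfs_period_us` for v1); the function names are made up for illustration. Whether a quota is in play at all depends on how the container was started, which I have not verified here.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("<quota> <period>"); None means no limit."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_cfs(quota_text, period_text):
    """Parse cgroup v1 CFS quota/period; a non-positive quota means no limit."""
    quota, period = int(quota_text), int(period_text)
    if quota <= 0:
        return None
    return quota / period


def cpu_quota():
    """Return the CPU quota in cores, or None if unlimited or not found."""
    v2 = Path("/sys/fs/cgroup/cpu.max")
    v1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    v1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    try:
        if v2.exists():
            return parse_cpu_max(v2.read_text())
        if v1_quota.exists() and v1_period.exists():
            return parse_cfs(v1_quota.read_text(), v1_period.read_text())
    except (OSError, ValueError):
        pass  # Unreadable or unexpected format: treat as "unknown".
    return None


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("cgroup CPU quota (cores):", cpu_quota())
```

If `os.cpu_count()` is much larger than the quota, each worker may be starting far more compute threads than the container can schedule at once, and CPU throttling could plausibly account for a slowdown well beyond 2x.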
Will keep digging and report back in case I find anything interesting.