Hey, I am currently setting up Qwen2 on my server via Docker. I got it working so far, but it answers super, and I mean super, slow. A simple "Hello!" took about 3 minutes to respond to. I set it up to use the CPU, which is of course slower than a GPU, but I guess it shouldn't be that slow. I gave it 16 GB of RAM and 6 threads. However, whenever it runs it doesn't show any significant CPU usage: it sits at around just 1.5 threads and 2 GB of RAM used. Do I need to do something to make it use multiple cores and more memory so it can be sped up?

Edit: I also noticed I am getting this warning in the console while starting:

/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py:2525: UserWarning: Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory exceeds the shard_size (5GB default)

From what I have read, this happens because the model had to be offloaded to disk, which doesn't make sense to me because I gave it more than enough memory and it just doesn't use it. My entire code is here: Dockerfile · GitHub
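Is something like this what I'm missing, i.e. pinning the PyTorch thread count and loading the model without any offloading? (Just a rough sketch of what I mean; the model tag and arguments are examples, not what my actual code uses.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(6)  # let PyTorch use all 6 threads given to the container

model_id = "Qwen/Qwen2-7B-Instruct"  # example tag; substitute the variant you use
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading without device_map keeps every weight in RAM; device_map="auto"
# combined with a tight memory budget is what typically makes accelerate
# offload shards to disk and print the "offloaded modules" warning.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("Hello!", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```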
Hi,
I am afraid 16 GB may not be enough: the model weights alone are around 15 GB, and I guess the OS gives the model loading a lower memory priority, which causes the model to be offloaded. I suggest using more memory; if the warning disappears, you will know you are on the right track. What cloud instance do you use?
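If you want to double-check where the weights actually ended up, you can inspect the device map after loading; a rough sketch (hf_device_map is only set when the model was loaded with a device_map, e.g. device_map="auto"):

```python
# Check whether accelerate placed any modules on disk instead of in RAM.
device_map = getattr(model, "hf_device_map", None)
if device_map is None:
    print("No device_map was used - the whole model sits in CPU RAM.")
else:
    offloaded = [name for name, dev in device_map.items() if dev == "disk"]
    print(f"{len(offloaded)} modules offloaded to disk:", offloaded[:5])
```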
Thank you for your response!
Fair enough. I have now also tried running it with 128 GB of RAM, but it still only used a single GB of it. So it seems like something is preventing it from taking more than that, and I just can't find the reason for it.
The server I am using isn't a cloud instance; it's a Core i9-9900K with 128 GB of RAM and a 1 TB NVMe drive. No GPU, unfortunately, but since I don't plan to train the model and just want to use it, I guess that should be fine.
Hi,
Do you use Windows? Linux is much better and more thoroughly tested for working with LLMs, and that also applies to memory management. Would it be an option to reproduce your case on Colab Pro or Colab Pro+?
What I also tried is running the same model via Ollama, and with that it became usable: from almost 5 minutes for a simple "Hello!" down to a second or two.
Out of curiosity, I also tried a model far too big for my PC/server via Ollama, and it ran at the same speed as the smaller model did with the Python code. So my guess is that the Python library just stores everything on disk, which makes the whole thing super slow.
I tried it on Windows and Linux and had the issue on both.
Anyway, I would consider the issue resolved for now: I just had to run Ollama in a Docker image and can then use the API it offers for working with the model.
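For anyone who finds this later, this is roughly how I call it now from Python; it assumes Ollama's default port 11434 and that a Qwen2 model has already been pulled (the qwen2 tag is just an example, use whichever tag you pulled):

```python
import requests

# One-shot generation against the Ollama HTTP API running in the container.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2",   # example tag; use the model you pulled with `ollama pull`
        "prompt": "Hello!",
        "stream": False,    # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```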
Thank you nonetheless, @nimatov!