Hey, I am currently setting up Qwen2 on my server via Docker. I got it working so far, but it answers super, and I mean super, slow. A simple "Hello!" took about 3 minutes to respond to. I set it up to use the CPU, which is of course slower than a GPU, but I guess it shouldn't be that slow. I gave it 16 GB of RAM and 6 threads. However, whenever it runs it doesn't show any significant CPU usage: it sits at around just 1.5 threads and 2 GB of RAM used. Do I need to do something to make it use multiple cores and more memory so it can be sped up?

Edit: I also noticed I am getting this warning in the console while starting:

/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py:2525: UserWarning: Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory exceeds the shard_size (5GB default)

From what I have read, this happens because the model had to be offloaded to disk, which doesn't make sense to me because I gave it more than enough memory and it just doesn't use it. My entire code is here: Dockerfile · GitHub
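Is something like this what I'm missing, i.e. pinning the PyTorch thread count and loading the model without any offloading? (Just a rough sketch of what I mean; the model tag and arguments are examples, not what my actual code uses.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(6)  # let PyTorch use all 6 threads given to the container

model_id = "Qwen/Qwen2-7B-Instruct"  # example tag; substitute the variant you use
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading without device_map keeps every weight in RAM; device_map="auto"
# combined with a tight memory budget is what typically makes accelerate
# offload shards to disk and print the "offloaded modules" warning.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("Hello!", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```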
Hi,
I am afraid 16 GB may not be enough: the model weights alone are around 15 GB, and I guess the OS gives the model loading a lower memory priority, which causes the model to be offloaded. I suggest using more memory; if the warning disappears, you will know you are on the right track. What cloud instance do you use?
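If you want to double-check where the weights actually ended up, you can inspect the device map after loading; a rough sketch (hf_device_map is only set when the model was loaded with a device_map, e.g. device_map="auto"):

```python
# Check whether accelerate placed any modules on disk instead of in RAM.
device_map = getattr(model, "hf_device_map", None)
if device_map is None:
    print("No device_map was used - the whole model sits in CPU RAM.")
else:
    offloaded = [name for name, dev in device_map.items() if dev == "disk"]
    print(f"{len(offloaded)} modules offloaded to disk:", offloaded[:5])
```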
Thank you for your response!
Fair enough. I have now also tried running it with 128 GB of RAM, but it still only used a single GB of it. So it seems like something is preventing it from taking more than that, and I just can't find the reason for it.
The server I am using isn't a cloud instance; it's a Core i9-9900K with 128 GB of RAM and a 1 TB NVMe drive. No GPU, unfortunately, but since I don't plan to train the model and just want to use it, I guess that should be fine.
Hi,
Do you use Windows? Linux is much better and more thoroughly tested for working with LLMs, and that also applies to memory management. Would it be an option to reproduce your case on Colab Pro or Colab Pro+?
What I also tried is running the same model via Ollama, and with that it became usable: from almost 5 minutes for a simple "Hello!" down to a second or two.
Out of curiosity, I also tried a model far too big for my PC/server via Ollama, and it ran at the same speed as the smaller model did with the Python code. So my guess is that the Python library just stores everything on disk, which makes the whole thing super slow.
I tried it on Windows and Linux and had the issue on both.
Anyway, I would consider the issue resolved for now: I just had to run Ollama in a Docker image and can then use the API it offers for working with the model.
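For anyone who finds this later, this is roughly how I call it now from Python; it assumes Ollama's default port 11434 and that a Qwen2 model has already been pulled (the qwen2 tag is just an example, use whichever tag you pulled):

```python
import requests

# One-shot generation against the Ollama HTTP API running in the container.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2",   # example tag; use the model you pulled with `ollama pull`
        "prompt": "Hello!",
        "stream": False,    # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```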
Thank you nonetheless, @nimatov!