I am trying to run a large DeepSeek-R1-Distill-Qwen-32B-Uncensored-Q8_0-GGUF language model (~34.8 GB) on the Hugging Face Spaces platform using an Nvidia L40S GPU (48 GB VRAM). The model loads into VRAM successfully, but a runtime error occurs during initialization, after which the model starts loading again, eventually exhausting memory. There are no specific error messages in the logs; the failure happens a few minutes after initialization starts, with no explicit indication that a time limit was exceeded.
I need help diagnosing and solving this problem. Below I provide all the configuration details, steps taken, and application code.
Ollama? Llama.cpp? Ollama seems to have a model-specific issue.
If you know exactly how to run it, it would be easier if you could just tell me )
I'm sorry… If I knew, I would tell you straight away, but I haven't succeeded in building llama-cpp-python 0.3.5 or later in the Hugging Face GPU Gradio Space either. DeepSeek should require at least 0.3.5 or 0.3.6. Ollama is not available because it is not in the system to begin with. Perhaps it is available in a Docker Space…?
Works, but old:
https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu124/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl
Doesn't work (or rather, only works in CPU mode…):
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
llama-cpp-python
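For completeness, here is a minimal llama-cpp-python loading sketch, assuming the 0.3.4 cu124 wheel above actually supports this model's architecture; the GGUF path, context size, and generation settings are illustrative, not taken from the Space:

```python
from llama_cpp import Llama

# Sketch: offload every layer to the L40S and keep the context modest, since
# the ~35 GB Q8_0 weights plus the KV cache must fit in 48 GB of VRAM.
# The file path below is only an example.
llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Uncensored-Q8_0.gguf",
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```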
It can't use GGUF, but I'll leave the code I made for the Zero GPU Space using Transformers and BnB. This should make the model usable. I hope llama-cpp-python will be available soon…
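(The Space code itself isn't pasted here, so below is a minimal sketch of that Transformers + BitsAndBytes route; the model id, 4-bit NF4 settings, and generation parameters are assumptions based on this thread, not the exact code from the Space.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored"  # original (non-GGUF) weights

# 4-bit NF4 quantization so the 32B model fits comfortably within 48 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Build the prompt with the model's own chat template and generate a reply.
messages = [{"role": "user", "content": "hi"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```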
Huge respect )) I have been trying for 5 days to get it up and running with no luck, but now it's already working, thanks!
I got excited too early. It responded to a "hi" message normally once; the rest of the time it just repeats my message back and that's it. But the fact that it's already running is progress, and I'll look into it further.
===== Application Startup at 2025-03-14 18:08:23 =====
Could not load bitsandbytes native library: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 85, in <module>
    lib = get_native_library()
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 72, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so)
Those bitsandbytes warnings are expected on ZeroGPU.
`GLIBCXX_3.4.32' not found
Don't worry about what this message means; it's expected and harmless.
By the way, it was buggy, so I fixed it.
Out of 10 times it responds normally to "hello" maybe once, and it can't handle anything more complicated than that, so I'm still looking for a solution.
I think I probably made a mistake somewhere. I'll check it tomorrow.
thank you
Maybe fixed.
Unfortunately no. I tried disabling quantization, but then the model does not fit in memory; I also tried relaxing it to 8-bit quantization, but that did not change things significantly.
I tried adding a system prompt, but it doesn't affect the result either.
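(Roughly like this, reusing the tokenizer and model from the Transformers sketch earlier in the thread; the system prompt text itself is only an example, not the one actually used:)

```python
# A system message is simply prepended before applying the chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer concisely."},
    {"role": "user", "content": "hello"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```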
That's strange… I wonder if it's different from the model I'm using for testing…
I'm testing it again now. BTW, that's normal for quantization-related things. I quantized it because I didn't have enough VRAM.
Yes, I saw in the code that you applied 4-bit quantization. I'm trying a different model now; I'll report back soon.
I cannot find the original model DeepSeek-R1-Distill-Qwen-32B-Uncensored in the search. I only see quantized versions of this model, but there is no original file. Or is it not available on Hugging Face and has to be taken from elsewhere?
This one: nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored · Hugging Face
I've figured out the cause, but it's a problem with the VRAM. The standard Transformers cache implementation is easy to use, but it eats up VRAM…
I think I'll try to implement a better version tomorrow.
For now, I've uploaded a version that doesn't remember the conversation history, but otherwise it works without problems.
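(For anyone wondering what "doesn't remember the conversation history" looks like in code, here is a minimal sketch of a stateless Gradio chat handler; it reuses the tokenizer and model from the earlier sketch, and the function body is illustrative rather than the actual Space code:)

```python
import gradio as gr

def respond(message, history):
    # History is deliberately ignored: only the latest user message is sent to
    # the model, so the prompt and KV cache never grow across turns and VRAM
    # usage stays roughly constant.
    messages = [{"role": "user", "content": message}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(respond).launch()
```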
I'm running it on:
Nvidia 1x L40S
vCPU: 8
RAM: ~62 GB
VRAM (GPU memory): 48 GB
The model responds much faster and always answers the first message, but it is not stable: after the first message it hangs and does not respond to subsequent messages.