I am trying to run a large DeepSeek-R1-Distill-Qwen-32B-Uncensored-Q8_0-GGUF language model (~34.8 GB) on the Hugging Face Spaces platform using an Nvidia L40S GPU (48 GB VRAM). The model loads into VRAM successfully, but a runtime error occurs during initialization, after which the model starts loading again, eventually exhausting memory. There are no specific error messages in the logs; the failure happens a few minutes after initialization starts, with no explicit indication that a time limit was exceeded.
I need help diagnosing and solving this problem. Below I provide all the configuration details, steps taken, and application code.
Ollama? Llama.cpp? Ollama seems to have a model-specific issue.
If you know exactly how to run it, it would be easier if you could just tell me )
I'm sorry… If I knew, I would tell you straight away, but I haven't succeeded in building llama-cpp-python 0.3.5 or later in the Hugging Face GPU Gradio Space either. DeepSeek should require at least 0.3.5 or 0.3.6. Ollama is not available because it is not in the system to begin with. Perhaps it is available in a Docker Space…?
Works, but old:
https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu124/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl
Doesn't work (or rather, only works in CPU mode…):
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
llama-cpp-python
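For completeness, here is a minimal llama-cpp-python loading sketch, assuming the 0.3.4 cu124 wheel above actually supports this model's architecture; the GGUF path, context size, and generation settings are illustrative, not taken from the Space:

```python
from llama_cpp import Llama

# Sketch: offload every layer to the L40S and keep the context modest, since
# the ~35 GB Q8_0 weights plus the KV cache must fit in 48 GB of VRAM.
# The file path below is only an example.
llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Uncensored-Q8_0.gguf",
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```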
It can't use GGUF, but I'll leave the code I made for the Zero GPU Space using Transformers and BnB. This should make the model usable. I hope llama-cpp-python will be available soon…
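(The Space code itself isn't pasted here, so below is a minimal sketch of that Transformers + BitsAndBytes route; the model id, 4-bit NF4 settings, and generation parameters are assumptions based on this thread, not the exact code from the Space.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored"  # original (non-GGUF) weights

# 4-bit NF4 quantization so the 32B model fits comfortably within 48 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Build the prompt with the model's own chat template and generate a reply.
messages = [{"role": "user", "content": "hi"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```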
Huge respect )) I have been trying for 5 days to get it up and running with no luck, but now it's already working, thanks!
I got excited too early. It responded to a "hi" message normally once; the rest of the time it just repeats my message back and that's it. But the fact that it's already running is progress, and I'll look into it further.
===== Application Startup at 2025-03-14 18:08:23 =====
Could not load bitsandbytes native library: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 85, in <module>
    lib = get_native_library()
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 72, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so)
Those bitsandbytes warnings are expected on ZeroGPU.
`GLIBCXX_3.4.32' not found
Don't worry about what this message means; it's expected and harmless.
By the way, it was buggy, so I fixed it.
Out of 10 times it responds normally to "hello" maybe once, and it can't handle anything more complicated than that, so I'm still looking for a solution.
I think I probably made a mistake somewhere. I'll check it tomorrow.
thank you
Maybe fixed.
Unfortunately no. I tried disabling quantization, but then the model does not fit in memory; I also tried relaxing it to 8-bit quantization, but that did not change things significantly.
I tried adding a system prompt, but it doesn't affect the result either.
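(Roughly like this, reusing the tokenizer and model from the Transformers sketch earlier in the thread; the system prompt text itself is only an example, not the one actually used:)

```python
# A system message is simply prepended before applying the chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer concisely."},
    {"role": "user", "content": "hello"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```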
That's strange… I wonder if it's different from the model I'm using for testing…
I'm testing it again now. BTW, that's normal for quantization-related things. I quantized it because I didn't have enough VRAM.
Yes, I saw in the code that you applied 4-bit quantization. I'm trying a different model now; I'll report back soon.
I cannot find the original model DeepSeek-R1-Distill-Qwen-32B-Uncensored in the search. I only see quantized versions of this model, but there is no original file. Or is it not available on Hugging Face and has to be taken from elsewhere?
This one: nicoboss/DeepSeek-R1-Distill-Qwen-32B-Uncensored · Hugging Face
I've figured out the cause, but it's a problem with the VRAM. The standard Transformers cache implementation is easy to use, but it eats up VRAM…
I think I'll try to implement a better version tomorrow.
For now, I've uploaded a version that doesn't remember the conversation history, but otherwise it works without problems.
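(For anyone wondering what "doesn't remember the conversation history" looks like in code, here is a minimal sketch of a stateless Gradio chat handler; it reuses the tokenizer and model from the earlier sketch, and the function body is illustrative rather than the actual Space code:)

```python
import gradio as gr

def respond(message, history):
    # History is deliberately ignored: only the latest user message is sent to
    # the model, so the prompt and KV cache never grow across turns and VRAM
    # usage stays roughly constant.
    messages = [{"role": "user", "content": message}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(respond).launch()
```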
I'm running it on:
Nvidia 1x L40S
vCPU: 8
RAM: ~62 GB
VRAM (GPU memory): 48 GB
The model responds much faster and always answers the first message, but it is not stable: after the first message it hangs and does not respond to subsequent messages.