DeepSeek-R1-Distill-Llama-8B - CUDA out of Memory - RTX 4090 24GB

First of all, I am very, very new to LLMs and Hugging Face Transformers!
But I thought I would share this in case it helps another newbie along the learning journey.

from transformers import pipeline

pipe = pipeline("text-generation", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B")

Results in:
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 245.38 MiB is free.
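
A quick way to check how much VRAM is actually free before loading (a minimal sketch, assuming PyTorch with a CUDA device):

import torch

free, total = torch.cuda.mem_get_info(0)  # bytes free / total on GPU 0
print(f"Free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")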

So I tried a smaller model:
pipe = pipeline("question-answering", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Qwen-1.5B")

nvtop shows:
GPU MEM - 6374MiB 26%

So then I tried passing the same dtype setting from config.json as an argument:
import torch

pipe = pipeline("question-answering", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype=torch.bfloat16)

nvtop shows:
GPU MEM - 3518MiB 14%

That's about 45% less memory than the plain load, which I assumed was relying on the config.json file.

So I tried passing the same argument for Llama-8B:
pipe = pipeline("text-generation", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B", torch_dtype=torch.bfloat16)

Now it loads and is usable.
NVTOP shows: 15830MiB 64%

config.json snippet:
"torch_dtype": "bfloat16",

The Hugging Face Model Memory Calculator says about 16 GB for this model.

But I am guessing it is trying to load just under 29 GB when not passing the parameter and relying on the config.json file.
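
A rough back-of-the-envelope check (a sketch, assuming roughly 8 billion parameters; the exact count is a bit higher):

params = 8.0e9                       # approximate parameter count (assumption)
fp32 = params * 4 / 2**30            # float32: 4 bytes per parameter
bf16 = params * 2 / 2**30            # bfloat16: 2 bytes per parameter
print(f"float32:  ~{fp32:.1f} GiB")  # ~29.8 GiB -> too big for a 24 GB card
print(f"bfloat16: ~{bf16:.1f} GiB")  # ~14.9 GiB -> in line with the ~15.8 GiB nvtop reports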

Your results may vary.


I don't know the exact reason, but as your tests show, some of the Transformers pipelines consume significantly more VRAM and RAM than just loading the model directly. Since there are no open issues about this, I wonder if that's just the way it's designed…
So I often load the models and tokenizers myself, as sketched below.
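
A minimal sketch of a manual load (using the path from the post above; device_map="auto" assumes accelerate is installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # or torch_dtype="auto" to pick up the dtype from config.json
    device_map="auto",           # requires accelerate; places the weights on the GPU
)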

Well, even if you stick with a pipeline, 8-bit or 4-bit quantization is sufficient in this case, so I think it's easier to just quantize.
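
Something along these lines (a sketch, assuming bitsandbytes is installed; untested on this exact setup):

import torch
from transformers import pipeline, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipe = pipeline(
    "text-generation",
    model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B",
    model_kwargs={"quantization_config": quant_config},
)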

Also, if your goal is just to run inference, it's easier and uses less memory to use something like Ollama rather than Transformers.

Snippet from this page:


By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set torch_dtype="auto" to load the weights in the data type defined in a model's config.json file to automatically load the most memory-optimal data type.

pipe = pipeline("text-generation", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B", torch_dtype="auto")

NVTOP shows: 15830MiB 64%

So now it uses the value in the config.json file.
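
You can also confirm which dtype actually ended up loaded (a quick sketch):

print(pipe.model.dtype)  # expect torch.bfloat16 when torch_dtype="auto" picks it up from config.json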
