First of all, I am very, very new to LLMs and Hugging Face Transformers!
But I thought I would share this in case it helps another newbie along the learning journey.
from transformers import pipeline
pipe = pipeline("text-generation", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B")
Results in:
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 245.38 MiB is free.
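One quick sanity check when this happens is to look at how much GPU memory is actually free before loading anything. A minimal sketch, assuming PyTorch can see the GPU:

import torch

# Quick check of free vs. total GPU memory on the current CUDA device.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 1024**3:.2f} GiB / total: {total_b / 1024**3:.2f} GiB")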
So I tried a smaller model:
pipe = pipeline("question-answering", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Qwen-1.5B")
nvtop shows:
GPU MEM - 6374MiB 26%
So then I tried passing the same setting that appears in config.json as an argument to pipeline():
import torch
pipe = pipeline("question-answering", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype=torch.bfloat16)
nvtop shows:
GPU MEM - 3518MiB 14%
That is about 45% smaller in memory than the straight load. I had assumed the straight load would pick up the bfloat16 setting from config.json, but it looks like it defaults to float32 instead.
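The numbers line up with simple bytes-per-parameter arithmetic. A rough sketch, assuming the Qwen distill has about 1.8 billion parameters (the exact count may differ slightly):

params = 1.8e9  # approximate parameter count for the 1.5B Qwen distill
print(f"float32 : {params * 4 / 1024**3:.2f} GiB")  # ~6.7 GiB, in the ballpark of the 6374 MiB reading
print(f"bfloat16: {params * 2 / 1024**3:.2f} GiB")  # ~3.4 GiB, in the ballpark of the 3518 MiB reading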
So I tried passing the same argument for the Llama-8B model:
pipe = pipeline("text-generation", model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B", torch_dtype=torch.bfloat16)
Now it loads and is usable.
nvtop shows: GPU MEM - 15830MiB 64%
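A minimal generation call then looks something like this (the prompt and max_new_tokens value are just placeholders):

# Hypothetical prompt; a text-generation pipeline returns a list of dicts with "generated_text".
out = pipe("Explain bfloat16 in one sentence.", max_new_tokens=64)
print(out[0]["generated_text"])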
config.json snippet:
"torch_dtype": "bfloat16",
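If you would rather not hard-code the dtype, transformers also accepts torch_dtype="auto", which is meant to pick up whatever dtype is recorded in config.json (so it should resolve to bfloat16 here):

# "auto" asks transformers to use the torch_dtype stored in the model's config.json.
pipe = pipeline("text-generation",
                model="/media/charlesh11/LLM/Hugging_Face/Models/DeepSeek-R1-Distill-Llama-8B",
                torch_dtype="auto")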
The Hugging Face Model Memory Calculator says about 16 GB for the model, which matches bfloat16 at 2 bytes per parameter.
So I am guessing it was trying to load roughly twice that, close to 30 GiB, when not passing the parameter, because the default load appears to use float32 (4 bytes per parameter) rather than the bfloat16 listed in config.json.
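Plugging the larger parameter count into the same bytes-per-parameter math (assuming roughly 8 billion parameters for the Llama distill):

params = 8.0e9  # approximate parameter count for the Llama-8B distill
print(f"float32 : {params * 4 / 1024**3:.1f} GiB")  # ~29.8 GiB, more than a 24 GiB card, hence the OOM
print(f"bfloat16: {params * 2 / 1024**3:.1f} GiB")  # ~14.9 GiB, consistent with the ~16 GB calculator figure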
Your results may vary.