CUDA out of memory on Nvidia A10G + Codellama on HuggingFace Spaces


The 2xA10G large hardware already provides 48 GB of VRAM, but I am still getting an out-of-memory error. How can I fix this?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 599.00 MiB is free. Process 252091 has 21.39 GiB memory in use. Of the allocated memory 20.77 GiB is allocated by PyTorch, and 345.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
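The error also mentions max_split_size_mb. I assume that means setting something like the following before the model is loaded, but I don't know whether fragmentation is really the problem here (the 512 MiB value is just a guess):

import os

# Must be set before the first CUDA allocation is made.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"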

Hey @lkthomas!
Which version of the model are you using, in terms of the number of parameters? The 70B version? (I think that one needs around 140 GB.)
If that is the case, you can either try a smaller model or load the model with a bitsandbytes configuration at lower precision.
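As a rough back-of-the-envelope estimate (weights only, ignoring activations and the KV cache, so the real footprint is a bit higher):

# Approximate weight memory: number of parameters x bytes per parameter.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 2.0))  # 70B in fp16  -> ~140 GB
print(weight_memory_gb(70e9, 0.5))  # 70B in 4-bit -> ~35 GB
print(weight_memory_gb(7e9, 2.0))   # 7B  in fp16  -> ~14 GB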

Let me test out codellama/CodeLlama-7b-Python-hf and see if it works. And how can I configure bitsandbytes in the Spaces settings?

For me, the performance of most 7B models is good enough. I have to admit that I have never used Code Llama myself.

bitsandbytes
You probably need to install it first, and you may have to restart the notebook kernel afterwards.

pip install bitsandbytes
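The snippet below uses device_map="auto", so accelerate is also needed if it is not already installed (I am assuming the Space image does not ship it by default):

pip install accelerate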

Here is a good blog post that covers Code Llama in general.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-34b-hf"

# Load the weights in 4-bit to cut the memory footprint to roughly a quarter of fp16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # spread the layers across the available GPUs
)

prompt = 'def remove_non_ascii(s: str) -> str:\n    """ '
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))

If I can help any further, please let me know :hugs:

Sorry, I am using ChatUI and not a Python notebook. Is there a variable I could set?

Hmm, I have never used ChatUI.
If I see it correctly, you can specify the model, right? Then you could at least try loading the 7B model. I don't think you can load it at lower precision, because the configurable parameters look like they only control generation. But I'm not sure about that.
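I have only glanced at the chat-ui repository, but it looks like models are configured through a MODELS entry in .env.local, roughly like this (an untested sketch; the exact keys may differ for the Spaces template):

MODELS=`[
  {
    "name": "codellama/CodeLlama-7b-Python-hf",
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.9,
      "max_new_tokens": 200
    }
  }
]`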

The 7B model seems to load just fine.