CUDA out of memory on Nvidia A10G + Codellama on HuggingFace Spaces


The 2xA10G large hardware already provides 48 GB of VRAM, but I am still getting an out-of-memory error. How can I fix this?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 599.00 MiB is free. Process 252091 has 21.39 GiB memory in use. Of the allocated memory 20.77 GiB is allocated by PyTorch, and 345.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
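The error also mentions max_split_size_mb. I assume that means setting something like the following before the model is loaded, but I don't know whether fragmentation is really the problem here (the 512 MiB value is just a guess):

import os

# Must be set before the first CUDA allocation is made.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"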

Hey @lkthomas!
Which version of the model are you using, in terms of the number of parameters? The 70B version? (I think that one needs around 140 GB.)
If that is the case, you can either try a smaller model or load the model with a bitsandbytes configuration at lower precision.
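As a rough back-of-the-envelope estimate (weights only, ignoring activations and the KV cache, so the real footprint is a bit higher):

# Approximate weight memory: number of parameters x bytes per parameter.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 2.0))  # 70B in fp16  -> ~140 GB
print(weight_memory_gb(70e9, 0.5))  # 70B in 4-bit -> ~35 GB
print(weight_memory_gb(7e9, 2.0))   # 7B  in fp16  -> ~14 GB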

Let me test out codellama/CodeLlama-7b-Python-hf and see if it works. And how can I configure bitsandbytes in the Spaces settings?

For me, the performance of most 7B models is good enough. I have to admit that I have never used Code Llama myself.

bitsandbytes
You probably need to install it first, and you may have to restart the notebook kernel afterwards.

pip install bitsandbytes
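The snippet below uses device_map="auto", so accelerate is also needed if it is not already installed (I am assuming the Space image does not ship it by default):

pip install accelerate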

Here is a good blog post that covers Code Llama in general.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-34b-hf"

# Load the weights in 4-bit to cut the memory footprint to roughly a quarter of fp16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # spread the layers across the available GPUs
)

prompt = 'def remove_non_ascii(s: str) -> str:\n    """ '
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))

If I can help any further, please let me know :hugs:

Sorry, I am using ChatUI and not a Python notebook. Is there a variable I could set?

Hmm, I have never used ChatUI.
If I see it correctly, you can specify the model, right? Then you could at least try loading the 7B model. I don't think you can load it at lower precision, because the configurable parameters look like they only control generation. But I'm not sure about that.
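I have only glanced at the chat-ui repository, but it looks like models are configured through a MODELS entry in .env.local, roughly like this (an untested sketch; the exact keys may differ for the Spaces template):

MODELS=`[
  {
    "name": "codellama/CodeLlama-7b-Python-hf",
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.9,
      "max_new_tokens": 200
    }
  }
]`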

The 7B model seems to load just fine.