torch.cuda.OutOfMemoryError for CodeLlama models in H100 single GPU inference

Hi, I am facing an out-of-memory issue on an H100 for CodeLlama 34b/70b inference (it sometimes works for a single output sequence). With the 13b model I also couldn't generate more than 5 completion sequences. My script and package versions are below.

accelerate==0.28.0
tokenizers==0.13.3
torch==2.2.1 (torchaudio-2.2.1|py310_cu121)
transformers==4.33.0
triton==2.2.0
evaluate==0.4.1

Also tried with torch built against CUDA 11.8:

torch==2.2.1 (pytorch-2.2.1|py3.10_cuda11.8_cudnn8.7.0_0)

NVIDIA driver and CUDA (tried both CUDA 11.8 and 12.1):

NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2   
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf", cache_dir=self.cache_dir)
model = AutoModelForCausalLM.from_pretrained(self.model_name, cache_dir=self.cache_dir).to(self.device)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:3")

try:
    samples = model.generate(
        input_ids, 
        temperature=0.9,
        top_k=40, 
        top_p=0.5,
        repetition_penalty=1.1,
        max_new_tokens=1000,
        min_new_tokens=3, 
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        num_return_sequences=20)
except ...:
    pass
Traceback (most recent call last):
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 505, in <module>
    runMain()
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 471, in runMain
    runEval(config)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 482, in runEval
    model = create_model(testing_model, config)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 438, in create_model
    model = HuggingFace_API(model_cfg)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 10, in __init__
    self.init()
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 32, in init
    cache_dir=self.cache_dir).to(self.device)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
    return super().to(*args, **kwargs)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB. GPU 3 has a total capacity of 79.11 GiB of which 306.56 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 78.12 GiB is allocated by PyTorch, and 120.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Hello,

You might consider loading it in a quantized version or using a smaller model. A 34 Bn parameter model does not fit in a single H100 GPU in full precision (float32, the default in from_pretrained): the weights alone need roughly 136 GB.
Passing “load_in_8bit=True” to AutoModelForCausalLM.from_pretrained might be a good start, even though I think it could still fail at generation.
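A minimal sketch of what 8-bit loading could look like (assuming bitsandbytes is installed; device_map="auto" and dropping the explicit .to(self.device) call are my additions here, since 8-bit weights are placed on devices by accelerate):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-34b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the linear layers to int8, roughly halving memory vs. fp16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",  # accelerate places the quantized weights; do not call .to() afterwards
)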

I advise you to start with a 7B one, which still does a fantastic job!

Hope this helps!


Can you share more information about the sequence length you are using? If the sequence lengths are too long, then the memory requirements will be very high.
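As a quick check, you can print the prompt length in tokens (reusing the tokenizer and prompt from your snippet); the KV cache during generation grows with (prompt length + max_new_tokens) and with num_return_sequences:

# Rough indicator of how large the KV cache will get during generation
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print("prompt tokens:", input_ids.shape[-1])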

The easiest thing you can try that could reduce the OOM errors is to use flash attention or memory-efficient attention.

When loading the model, you can do:

model = AutoModelForCausalLM.from_pretrained(
    self.model_name,
    cache_dir=self.cache_dir,
    device_map=self.device,
    attn_implementation="sdpa",
)

The sdpa implementation will automatically choose flash attention or memory-efficient attention, and it can significantly help reduce memory and increase speed at long sequence lengths.

Apart from that, your other choices are to:

  1. Use quantization to make the model take less memory.
  • The downside is that the model quality takes a hit and generation won't be as fast as half precision.
  2. Use shorter sequences.
  • This might be a non-starter.
  3. Return fewer sequences per call. In this case, you can make multiple calls to achieve the same result (see the sketch below).
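For option 3, a rough sketch of what splitting the calls could look like (the helper name and batch size are mine; the generation arguments are copied from your post):

def generate_in_batches(model, tokenizer, input_ids, total_sequences=20, batch_size=5):
    # Run several smaller generate() calls instead of one call with
    # num_return_sequences=20, so each call needs a smaller KV cache.
    completions = []
    remaining = total_sequences
    while remaining > 0:
        n = min(batch_size, remaining)
        samples = model.generate(
            input_ids,
            temperature=0.9,
            top_k=40,
            top_p=0.5,
            repetition_penalty=1.1,
            max_new_tokens=1000,
            min_new_tokens=3,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            num_return_sequences=n,
        )
        completions.extend(tokenizer.batch_decode(samples, skip_special_tokens=True))
        remaining -= n
    return completions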