torch.cuda.OutOfMemoryError for CodeLlama models in H100 single GPU inference

Hi, I am facing an out-of-memory issue on an H100 for CodeLlama 34b/70b inference (it sometimes works for a single output sequence). With the 13b model I also couldn't generate more than 5 completion sequences. My script and package versions are below.

accelerate==0.28.0
tokenizers==0.13.3
torch==2.2.1 (torchaudio-2.2.1|py310_cu121)
transformers==4.33.0
triton==2.2.0
evaluate==0.4.1

Also tried with torch built against CUDA 11.8:

torch==2.2.1 (pytorch-2.2.1|py3.10_cuda11.8_cudnn8.7.0_0)

NVIDIA driver and CUDA (tried both CUDA 11.8 and 12.1):

NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2   
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf", cache_dir=self.cache_dir)
model = AutoModelForCausalLM.from_pretrained(self.model_name, cache_dir=self.cache_dir).to(self.device)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:3")

try:
    samples = model.generate(
        input_ids, 
        temperature=0.9,
        top_k=40, 
        top_p=0.5,
        repetition_penalty=1.1,
        max_new_tokens=1000,
        min_new_tokens=3, 
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        num_return_sequences=20)
except ...:
    pass
Traceback (most recent call last):
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 505, in <module>
    runMain()
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 471, in runMain
    runEval(config)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 482, in runEval
    model = create_model(testing_model, config)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 438, in create_model
    model = HuggingFace_API(model_cfg)
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 10, in __init__
    self.init()
  File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 32, in init
    cache_dir=self.cache_dir).to(self.device)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
    return super().to(*args, **kwargs)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB. GPU 3 has a total capacity of 79.11 GiB of which 306.56 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 78.12 GiB is allocated by PyTorch, and 120.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Hello,

You might consider loading it in a quantized version or using a smaller model. A 34 Bn parameter model does not fit in a single H100 GPU in full precision (float32, the default in from_pretrained): the weights alone need roughly 136 GB.
Passing “load_in_8bit=True” to AutoModelForCausalLM.from_pretrained might be a good start, even though I think it could still fail at generation.
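A minimal sketch of what 8-bit loading could look like (assuming bitsandbytes is installed; device_map="auto" and dropping the explicit .to(self.device) call are my additions here, since 8-bit weights are placed on devices by accelerate):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-34b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the linear layers to int8, roughly halving memory vs. fp16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",  # accelerate places the quantized weights; do not call .to() afterwards
)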

I advise you to start with a 7B one, which still does a fantastic job!

Hope this helps!


Can you share more information about the sequence length you are using? If the sequence lengths are too long, then the memory requirements will be very high.
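As a quick check, you can print the prompt length in tokens (reusing the tokenizer and prompt from your snippet); the KV cache during generation grows with (prompt length + max_new_tokens) and with num_return_sequences:

# Rough indicator of how large the KV cache will get during generation
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print("prompt tokens:", input_ids.shape[-1])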

The easiest thing you can try that could reduce the OOM errors is to use flash attention or memory-efficient attention.

When loading the model, you can do:

model = AutoModelForCausalLM.from_pretrained(
    self.model_name,
    cache_dir=self.cache_dir,
    device_map=self.device,
    attn_implementation="sdpa",
)

The sdpa implementation will automatically choose flash attention or memory-efficient attention, and it can significantly help reduce memory and increase speed at long sequence lengths.

Apart from that, your other choices are to:

  1. Use quantization to make the model take less memory.
  • The downside is that the model quality takes a hit and generation won't be as fast as half precision.
  2. Use shorter sequences.
  • This might be a non-starter.
  3. Return fewer sequences per call. In this case, you can make multiple calls to achieve the same result (see the sketch below).
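For option 3, a rough sketch of what splitting the calls could look like (the helper name and batch size are mine; the generation arguments are copied from your post):

def generate_in_batches(model, tokenizer, input_ids, total_sequences=20, batch_size=5):
    # Run several smaller generate() calls instead of one call with
    # num_return_sequences=20, so each call needs a smaller KV cache.
    completions = []
    remaining = total_sequences
    while remaining > 0:
        n = min(batch_size, remaining)
        samples = model.generate(
            input_ids,
            temperature=0.9,
            top_k=40,
            top_p=0.5,
            repetition_penalty=1.1,
            max_new_tokens=1000,
            min_new_tokens=3,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            num_return_sequences=n,
        )
        completions.extend(tokenizer.batch_decode(samples, skip_special_tokens=True))
        remaining -= n
    return completions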