torch.cuda.OutOfMemoryError for CodeLlama models in H100 single GPU inference

Can you share more information about the sequence length you are using? If the sequences are long, the memory requirements will be very high, since the attention computation and the KV cache both grow with sequence length.

The easiest change that could reduce the OOM errors is to use FlashAttention or memory-efficient attention.

When loading the model, you can do:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    self.model_name,
    cache_dir=self.cache_dir,
    device_map=self.device,
    attn_implementation="sdpa",
)
```

The sdpa implementation will automatically choose FlashAttention or memory-efficient attention, which can significantly reduce memory use and increase speed at long sequence lengths.

Apart from that, your other options are to:

1. Use quantization so the model takes less memory (see the 4-bit sketch after this list).
   - The downside is that model quality takes a hit and generation won't be as fast as half precision.
2. Use shorter sequences.
   - This might be a non-starter for your use case.
3. Return fewer sequences per call and make multiple calls to get the same total (see the generation sketch after this list).
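
For option 1, here is a minimal 4-bit quantization sketch using `BitsAndBytesConfig` from transformers. It assumes the `bitsandbytes` package is installed; the checkpoint name is a placeholder, so substitute your own model name and cache directory:

```python
# Minimal 4-bit quantization sketch (assumes bitsandbytes is installed;
# the checkpoint name below is a placeholder).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-hf",          # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="sdpa",
)
```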
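
For options 2 and 3, a sketch of splitting one large sampling call into several smaller ones while also capping input and output length. The `model`, `tokenizer`, `prompt`, and the specific numbers (8 sequences split into 4 calls of 2, `max_length`, `max_new_tokens`) are placeholders for your own setup:

```python
# Sketch: 4 calls x 2 sequences instead of 1 call x 8 sequences,
# with truncated input and a capped number of new tokens.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=4096,          # shorter input -> smaller attention/KV cache
).to(model.device)

outputs = []
for _ in range(4):
    out = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=2,
        max_new_tokens=256,   # capping new tokens also keeps the KV cache small
    )
    outputs.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
```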