torch.cuda.OutOfMemoryError for CodeLlama models in H100 single GPU inference

Can you share more information about the sequence length you are using? If the sequences are long, the memory requirements will be very high, since the attention computation and the KV cache both grow with sequence length.

The easiest change that could reduce the OOM errors is to use FlashAttention or memory-efficient attention.

When loading the model, you can do:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    self.model_name,
    cache_dir=self.cache_dir,
    device_map=self.device,
    attn_implementation="sdpa",
)
```

The sdpa implementation will automatically choose FlashAttention or memory-efficient attention, which can significantly reduce memory use and increase speed at long sequence lengths.

Apart from that, your other options are to:

1. Use quantization so the model takes less memory (see the 4-bit sketch after this list).
   - The downside is that model quality takes a hit and generation won't be as fast as half precision.
2. Use shorter sequences.
   - This might be a non-starter for your use case.
3. Return fewer sequences per call and make multiple calls to get the same total (see the generation sketch after this list).
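
For option 1, here is a minimal 4-bit quantization sketch using `BitsAndBytesConfig` from transformers. It assumes the `bitsandbytes` package is installed; the checkpoint name is a placeholder, so substitute your own model name and cache directory:

```python
# Minimal 4-bit quantization sketch (assumes bitsandbytes is installed;
# the checkpoint name below is a placeholder).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-hf",          # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="sdpa",
)
```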
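
For options 2 and 3, a sketch of splitting one large sampling call into several smaller ones while also capping input and output length. The `model`, `tokenizer`, `prompt`, and the specific numbers (8 sequences split into 4 calls of 2, `max_length`, `max_new_tokens`) are placeholders for your own setup:

```python
# Sketch: 4 calls x 2 sequences instead of 1 call x 8 sequences,
# with truncated input and a capped number of new tokens.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=4096,          # shorter input -> smaller attention/KV cache
).to(model.device)

outputs = []
for _ in range(4):
    out = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=2,
        max_new_tokens=256,   # capping new tokens also keeps the KV cache small
    )
    outputs.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
```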