Hello,
You might consider loading it in a quantized version, or using a smaller model. A 34B-parameter model does not fit comfortably on a single H100: in half precision (fp16/bf16) the weights alone take roughly 68 GB, which leaves very little headroom on an 80 GB card once the KV cache and activations are added.
Using `load_in_8bit=True` (note the exact spelling) with `AutoModelForCausalLM.from_pretrained` might be a good start, even though I think it could still fail at generation once the KV cache grows.
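Here is a minimal sketch of what that could look like, using `BitsAndBytesConfig` (the current way to request 8-bit loading in `transformers`); the checkpoint name is just an example, swap in whichever 34B model you are actually using:

```python
# Minimal sketch: 8-bit loading of a large causal LM.
# Assumes bitsandbytes and accelerate are installed
# (pip install bitsandbytes accelerate).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-34b-hf"  # example checkpoint, replace with yours

# 8-bit quantization cuts the weights to ~1 byte per parameter
# (~34 GB for a 34B model), which fits on a single 80 GB H100.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place the weights on the GPU
)

# Quick smoke test: generate a few tokens to check memory headroom.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```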
I'd advise starting with a 7B model, which still does a fantastic job!
Hope this helps!