Running into OOM on GPU with quantized llama-3-8b for text generation inference

I am using llama-3-8b-instruct for prompting. My GPU only has 12 GB of memory, so I use the 8-bit quantized version of llama-3-8b-instruct by passing load_in_8bit=True. I am running 500 prompts through the model, one prompt at a time rather than batched, because I don't have enough memory for larger batches. After every 5 prompts, I also clear the cache via torch.cuda.empty_cache(). However, I still run into an out-of-memory error.
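
For reference, here is roughly what my setup and loop look like (the prompt list and generation settings are illustrative, not my exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

prompts = [...]   # my ~500 prompt strings
results = []

for i, prompt in enumerate(prompts):
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # one prompt at a time, no batching
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=256)
    results.append(tokenizer.decode(output[0], skip_special_tokens=True))

    # clear the CUDA cache every 5 prompts
    if (i + 1) % 5 == 0:
        torch.cuda.empty_cache()
```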

After 26 prompts, I get the error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.10 GiB. GPU 0 has a total capacity of 11.91 GiB of which 965.25 MiB is free. Including non-PyTorch memory, this process has 10.97 GiB memory in use. Of the allocated memory 9.54 GiB is allocated by PyTorch, and 762.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.3 documentation)

Any suggestions to address this issue? Why would running inference just 1 prompt at a time even result in OOM? The prompts are roughly the same size, and the model had already processed 26 of them.

I have since tried 4-bit quantization, and that has been running so far. I also retried 8-bit quantization, as above, but with the setting suggested in the error message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. So far, that has been running too. [UPDATE: 8-bit quantization with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True eventually ran out of memory as well.]
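
For completeness, this is roughly the 4-bit configuration I am running now, with the allocator setting from the error message applied before CUDA is initialized (the specific quantization options are just the ones I picked, not the only possibility):

```python
import os
# must be set before the first CUDA allocation to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```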

Does anyone have more suggestions to address this issue? Thanks!