Running into OOM on GPU with quantized llama-3-8b for text generation inference

I am using llama-3-8b-instruct for prompting. My GPU only has 12 GB of memory, so I use the 8-bit quantized version of llama-3-8b-instruct by passing load_in_8bit=True. I am running 500 prompts through the model, one prompt at a time rather than batched, because I don't have enough memory for larger batches. After every 5 prompts, I also clear the cache via torch.cuda.empty_cache(). However, I still run into an out-of-memory error.
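
For reference, here is roughly what my setup and loop look like (the prompt list and generation settings are illustrative, not my exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

prompts = [...]   # my ~500 prompt strings
results = []

for i, prompt in enumerate(prompts):
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # one prompt at a time, no batching
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=256)
    results.append(tokenizer.decode(output[0], skip_special_tokens=True))

    # clear the CUDA cache every 5 prompts
    if (i + 1) % 5 == 0:
        torch.cuda.empty_cache()
```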

After 26 prompts, I get the error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.10 GiB. GPU 0 has a total capacity of 11.91 GiB of which 965.25 MiB is free. Including non-PyTorch memory, this process has 10.97 GiB memory in use. Of the allocated memory 9.54 GiB is allocated by PyTorch, and 762.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.3 documentation)

Any suggestions to address this issue? Why would running inference just 1 prompt at a time even result in OOM? The prompts are roughly the same size, and the model had already processed 26 of them.

I have since tried 4-bit quantization, and that has been running so far. I also retried 8-bit quantization, as above, but with the setting suggested in the error message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. So far, that has been running too. [UPDATE: 8-bit quantization with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True eventually ran out of memory as well.]
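
For completeness, this is roughly the 4-bit configuration I am running now, with the allocator setting from the error message applied before CUDA is initialized (the specific quantization options are just the ones I picked, not the only possibility):

```python
import os
# must be set before the first CUDA allocation to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```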

Does anyone have more suggestions to address this issue? Thanks!