GPU Optimisation: Quantised Llama x NVIDIA T4

Hi all,

Currently I’ve quantised my Llama 3.2 8B model, and it takes up about 5GB of the NVIDIA T4’s memory when I load it in. I also load all my texts into a dataframe with n rows, and the longest text requires around 10GB of GPU memory to process.

I’ve run !nvidia-smi and it shows approximately 5GB out of 16GB used when I load in the quantised Llama model.

I’ve run the following script on just 1 row (just 1 text):

with torch.no_grad():
    # tokenise the prompt plus one text and generate with the quantised model
    inputs = tokenizer(PROMPT + text, return_tensors="pt").to("cuda")
    result = llama_model.generate(**inputs, max_new_tokens=256)

torch.cuda.empty_cache()
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.reset_peak_memory_stats()

And the memory now sits at 10GB. Whatever I do, I cannot seem to get it back down to the 5GB it was at when I first loaded the quantised model. Is this supposed to happen? I’m also hitting OOM issues very often, for most transcripts. I’m hoping to save cost by staying on a T4 GPU, but would this be possible?

Furthermore, is it wise to use a long-context transformer like Linformer to pre-process the texts (create summaries), and then have my quantised model loop through the dataframe and generate answers? My end goal is to fit one entire text, or even two texts together, into the prompt and let Llama do its thing on it.

Thank you for reading through this :smile: and I appreciate your help!


Hi @Bdg01!

The memory behaviour you’re seeing is expected. torch.cuda.empty_cache() doesn’t force GPU memory back to the post-load baseline; it only returns cached blocks that no live tensor is still using. As long as Python keeps references to result or to intermediate tensors, that memory stays allocated. Delete those references and call gc.collect() before clearing the cache, and process your data in smaller chunks to keep the peak down.
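Something like this, as a minimal sketch (assuming your variables are named tokenizer, llama_model, PROMPT and text, as in your snippet):

import gc
import torch

with torch.no_grad():
    inputs = tokenizer(PROMPT + text, return_tensors="pt").to("cuda")
    output_ids = llama_model.generate(**inputs, max_new_tokens=256)
    # decode on CPU so the result no longer pins GPU memory
    answer = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)

# drop every reference to GPU tensors, then collect and release the cache
del inputs, output_ids
gc.collect()
torch.cuda.empty_cache()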

The spike from 5GB to 10GB most likely comes from inference-time memory rather than the weights themselves: activations and the KV cache grow with the length of the input, so a long transcript can add several GB on top of the loaded model. To reduce this:

  • Use dynamic padding to avoid padding to the model’s max context length.
  • Explore 4-bit/8-bit quantization (e.g., bitsandbytes) or CPU offloading with Hugging Face Accelerate; see the sketch below.
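For example, a rough sketch of a 4-bit load with dynamic padding (the model id and batch_of_texts below are placeholders; use your own checkpoint and data):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder, use your checkpoint

# 4-bit NF4 quantisation keeps the weights at roughly the 5GB you're already seeing
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama tokenizers ship without a pad token

llama_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets Accelerate offload layers to CPU if the GPU fills up
)

# dynamic padding: pad each batch only to its longest member,
# not to the model's maximum context length
inputs = tokenizer(batch_of_texts, padding=True, truncation=True,
                   return_tensors="pt").to(llama_model.device)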

Preprocessing with a long-context model (like Linformer) to create summaries first is a smart approach, but be cautious about information loss. For combining two texts, consider a hierarchical setup: summarise or extract the key content from each text first, then hand the condensed versions to Llama together.
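Very roughly, that chunk-then-combine idea could look like this (chunks() is a hypothetical splitter you'd write yourself, and the prompts are only illustrative):

def summarise(chunk):
    # summarise one chunk with the quantised model
    prompt = "Summarise the following transcript section:\n\n" + chunk + "\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to(llama_model.device)
    with torch.no_grad():
        out = llama_model.generate(**inputs, max_new_tokens=128)
    # keep only the newly generated tokens, not the echoed prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# summarise each long transcript chunk by chunk, then ask the real question
# over the two condensed texts in a single final prompt
summary_a = " ".join(summarise(c) for c in chunks(text_a, max_tokens=2000))
summary_b = " ".join(summarise(c) for c in chunks(text_b, max_tokens=2000))
final_prompt = PROMPT + "\n\nText 1:\n" + summary_a + "\n\nText 2:\n" + summary_b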

If memory remains tight, upgrading to a GPU with more VRAM might be necessary.
Hope this helps! :blush:
Alan



Thank you @Alanturner2 for your help!
