GPU Optimisation: Quantised Llama x NVIDIA T4

Hi all,

Currently I’ve quantised my Llama 3.2 8B model, and it takes up about 5GB of the NVIDIA T4’s memory when I load it in. I also load all my texts into a dataframe with n rows, and the longest text requires around 10GB of GPU memory to process.

I’ve run !nvidia-smi and it shows approximately 5GB out of 16GB used when I load in the quantised Llama model.

I’ve run the following script on just 1 row (just 1 text):

with torch.no_grad():
    # tokenise the prompt plus one text and generate with the quantised model
    inputs = tokenizer(PROMPT + text, return_tensors="pt").to("cuda")
    result = llama_model.generate(**inputs, max_new_tokens=256)

torch.cuda.empty_cache()
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.reset_peak_memory_stats()

And the memory now sits at 10GB. Whatever I do, I cannot seem to get it back down to the 5GB it was at when I first loaded the quantised model. Is this supposed to happen? I’m also hitting OOM issues very often, for most transcripts. I’m hoping to save cost by staying on a T4 GPU, but would this be possible?

Furthermore, is it wise to use a long-context transformer like Linformer to pre-process the texts (create summaries), and then have my quantised model loop through the dataframe and generate answers? My end goal is to fit one entire text, or even two texts together, into the prompt and let Llama do its thing on it.

Thank you for reading through this :smile: and I appreciate your help!


Hi @Bdg01!

The memory behaviour you’re seeing is expected. torch.cuda.empty_cache() doesn’t force GPU memory back to the post-load baseline; it only returns cached blocks that no live tensor is still using. As long as Python keeps references to result or to intermediate tensors, that memory stays allocated. Delete those references and call gc.collect() before clearing the cache, and process your data in smaller chunks to keep the peak down.
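Something like this, as a minimal sketch (assuming your variables are named tokenizer, llama_model, PROMPT and text, as in your snippet):

import gc
import torch

with torch.no_grad():
    inputs = tokenizer(PROMPT + text, return_tensors="pt").to("cuda")
    output_ids = llama_model.generate(**inputs, max_new_tokens=256)
    # decode on CPU so the result no longer pins GPU memory
    answer = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)

# drop every reference to GPU tensors, then collect and release the cache
del inputs, output_ids
gc.collect()
torch.cuda.empty_cache()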

The spike from 5GB to 10GB most likely comes from inference-time memory rather than the weights themselves: activations and the KV cache grow with the length of the input, so a long transcript can add several GB on top of the loaded model. To reduce this:

  • Use dynamic padding to avoid padding to the model’s max context length.
  • Explore 4-bit/8-bit quantization (e.g., bitsandbytes) or CPU offloading with Hugging Face Accelerate; see the sketch below.
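For example, a rough sketch of a 4-bit load with dynamic padding (the model id and batch_of_texts below are placeholders; use your own checkpoint and data):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder, use your checkpoint

# 4-bit NF4 quantisation keeps the weights at roughly the 5GB you're already seeing
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama tokenizers ship without a pad token

llama_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets Accelerate offload layers to CPU if the GPU fills up
)

# dynamic padding: pad each batch only to its longest member,
# not to the model's maximum context length
inputs = tokenizer(batch_of_texts, padding=True, truncation=True,
                   return_tensors="pt").to(llama_model.device)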

Preprocessing with a long-context model (like Linformer) to create summaries first is a smart approach, but be cautious about information loss. For combining two texts, consider a hierarchical setup: summarise or extract the key content from each text first, then hand the condensed versions to Llama together.
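Very roughly, that chunk-then-combine idea could look like this (chunks() is a hypothetical splitter you'd write yourself, and the prompts are only illustrative):

def summarise(chunk):
    # summarise one chunk with the quantised model
    prompt = "Summarise the following transcript section:\n\n" + chunk + "\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to(llama_model.device)
    with torch.no_grad():
        out = llama_model.generate(**inputs, max_new_tokens=128)
    # keep only the newly generated tokens, not the echoed prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# summarise each long transcript chunk by chunk, then ask the real question
# over the two condensed texts in a single final prompt
summary_a = " ".join(summarise(c) for c in chunks(text_a, max_tokens=2000))
summary_b = " ".join(summarise(c) for c in chunks(text_b, max_tokens=2000))
final_prompt = PROMPT + "\n\nText 1:\n" + summary_a + "\n\nText 2:\n" + summary_b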

If memory remains tight, upgrading to a GPU with more VRAM might be necessary.
Hope this helps! :blush:
Alan



Thank you @Alanturner2 for your help!
