Hi all,
Currently I’ve quantised my Llama 3.2 8B model, and it takes up about 5GB of VRAM on an NVIDIA T4 when I load it in. I then load all my texts into a dataframe with n rows; running inference on the longest text pushes GPU usage to about 10GB.
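For context, the way I load it is roughly this (a minimal sketch assuming the standard transformers + bitsandbytes 4-bit route; the model ID is just a placeholder for the checkpoint I actually quantised):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; substitute your exact checkpoint

# 4-bit NF4 quantisation via bitsandbytes is what gets an 8B model down to roughly 5GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llama_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)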
I’ve run !nvidia-smi, and the quantised Llama model takes up approximately 5GB out of 16GB when I load it in.
I’ve applied the following script on just one row (a single text):
with torch.no_grad():
    # the actual call: tokenise PROMPT plus the text, then generate
    # (tokenizer/generate are the standard transformers API; PROMPT and text are strings)
    inputs = tokenizer(PROMPT + text, return_tensors="pt").to(llama_model.device)
    result = llama_model.generate(**inputs, max_new_tokens=512)
torch.cuda.empty_cache()
torch.cuda.reset_accumulated_memory_stats()  # these reset the stats counters, not the memory itself
torch.cuda.reset_peak_memory_stats()
And nvidia-smi now reports 10GB. Whatever I do, I cannot seem to get the memory back down to the 5GB it sat at when I first loaded the quantised model. Is this supposed to happen? I’m also hitting OOM issues very often for most transcripts. I’m hoping to save cost by staying on a T4 GPU, but would this be possible?
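To spell out the “whatever I do” part, this is the kind of cleanup I’ve been attempting between rows (a sketch; gc and the memory_allocated/memory_reserved checks are standard Python/torch calls, not anything from my pipeline):

import gc

# drop every reference to tensors from the last generation, then collect
del inputs, result
gc.collect()
torch.cuda.empty_cache()  # releases the allocator's cached blocks back to the driver

# torch's own view of memory: allocated = live tensors, reserved = allocator cache
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved")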
Furthermore, is it wise to use a long-context transformer like Linformer to pre-process the texts (create summaries) and then have my quantised model loop through the dataframe and generate answers? My end goal, though, is to fit one entire text, or two texts together, into the context and let Llama do its thing on it.
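In case it helps clarify the idea, the pre-processing loop I have in mind looks roughly like this (a sketch only; summarise_chunk is a hypothetical stand-in for whichever summariser I’d use, and the chunking is plain token-window splitting):

def chunk_text(text, tokenizer, max_tokens=1024):
    # split a long transcript into token windows the summariser can handle
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    for i in range(0, len(ids), max_tokens):
        yield tokenizer.decode(ids[i:i + max_tokens])

def summarise_then_answer(text, tokenizer, llama_model, summarise_chunk):
    # map: summarise each chunk with the cheaper long-context model
    summaries = [summarise_chunk(c) for c in chunk_text(text, tokenizer)]
    # reduce: let the quantised Llama answer over the concatenated summaries
    prompt = PROMPT + "\n".join(summaries)
    inputs = tokenizer(prompt, return_tensors="pt").to(llama_model.device)
    with torch.no_grad():
        out = llama_model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)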
Thank you for reading through this, and I appreciate your help!