Streamlit + Llama 3: takes too much GPU memory?

Thanks for taking the time out of your day to read this,

I got meta-llama/Meta-Llama-3-8B-Instruct running on my PC, which went perfectly. I then decided to add a Streamlit UI to make the model easier to access, but I ran into a plethora of issues, mainly around quantization: it keeps saying I don't have enough RAM. Are there any settings I can change in my code to make this work, or should I just use a different model? I understand Llama 3 is a huge model, but does Streamlit really add so much overhead that it can't even render a UI on top of it? (A stripped-down version of my Streamlit code is further down.)

Here are my quantization settings, using a BitsAndBytesConfig:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # nested quantization for a little extra memory savings
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
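The model itself gets loaded with that config, roughly like this (simplified; device_map="auto" is how I let accelerate handle placement):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the 4-bit config above
    device_map="auto",               # let accelerate decide GPU/CPU placement
)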
and here is my text-generation pipeline:

from transformers import pipeline

text_generator = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=64,  # this was originally 128, but I decreased it
)
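In case it matters, the Streamlit side is essentially this, stripped down (all the quantization and pipeline setup above lives inside the cached loader, so it only runs once per session rather than on every rerun):

import streamlit as st

@st.cache_resource  # cache so the model loads once per session, not on every Streamlit rerun
def load_generator():
    # the bnb_config / model / tokenizer / pipeline code from above goes here
    return text_generator

generator = load_generator()

st.title("Llama 3 chat")
prompt = st.text_area("Prompt")
if st.button("Generate") and prompt:
    with st.spinner("Generating..."):
        result = generator(prompt)
    st.write(result[0]["generated_text"])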
I also tried loading in 8-bit with fp32, but it seems the Llama model doesn't support that.
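For reference, that attempt was basically just swapping the config, something like this (reconstructed from memory, so the exact flags may have differed):

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # LLM.int8() quantization instead of 4-bit nf4
)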

To wrap up: I have an RTX 2060 Super with 8 GB of dedicated GPU memory. Any tips would be helpful.
Thanks for your time!