Streamlit + Llama 3: takes too much GPU memory?

Thanks for taking the time out of your day to read this,

I got meta-llama/Meta-Llama-3-8B-Instruct running on my PC, which went perfectly. I then decided to add a Streamlit UI to make the model easier to access, but I ran into a plethora of issues, mainly around quantization: it keeps saying I don't have enough RAM. Are there any settings I can change in my code to make this work, or should I just use a different model? I understand Llama 3 is a huge model, but does Streamlit really add so much overhead that it can't even render a UI on top of it? (A stripped-down version of my Streamlit code is further down.)

Here are my quantization settings, using a BitsAndBytesConfig:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # nested quantization for a little extra memory savings
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
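The model itself gets loaded with that config, roughly like this (simplified; device_map="auto" is how I let accelerate handle placement):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the 4-bit config above
    device_map="auto",               # let accelerate decide GPU/CPU placement
)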
and here is my text-generation pipeline:

from transformers import pipeline

text_generator = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=64,  # this was originally 128, but I decreased it
)
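In case it matters, the Streamlit side is essentially this, stripped down (all the quantization and pipeline setup above lives inside the cached loader, so it only runs once per session rather than on every rerun):

import streamlit as st

@st.cache_resource  # cache so the model loads once per session, not on every Streamlit rerun
def load_generator():
    # the bnb_config / model / tokenizer / pipeline code from above goes here
    return text_generator

generator = load_generator()

st.title("Llama 3 chat")
prompt = st.text_area("Prompt")
if st.button("Generate") and prompt:
    with st.spinner("Generating..."):
        result = generator(prompt)
    st.write(result[0]["generated_text"])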
I also tried loading in 8-bit with fp32, but it seems the Llama model doesn't support that.
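For reference, that attempt was basically just swapping the config, something like this (reconstructed from memory, so the exact flags may have differed):

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # LLM.int8() quantization instead of 4-bit nf4
)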

To wrap up: I have an RTX 2060 Super with 8 GB of dedicated GPU memory. Any tips would be helpful.
Thanks for your time!