Loading Llama-3.2-1B in quantized config shows no change in size

Hi,

This is what I have:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})
print(model.get_memory_footprint())

The result is: 1012013184

Why don't I see a reduction in memory footprint? When I use "facebook/opt-350m" I see a footprint of about 207 MB. I have a GTX 1080 on my system. What am I missing?

Thanks
Mohan


I measured it again with the code below, which reports both the model footprint as seen by PyTorch and the peak GPU memory allocated.

import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats(device=None)
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map={"": 0})
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0})
print(model.get_memory_footprint())
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)}")

It was about 4.9 GB for the original model vs about 1 GB for the 4-bit one (the reported footprint and the allocated CUDA memory were pretty close in both cases), so the quantized load does reduce memory after all. Sorry for the confusion.
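For anyone else sanity-checking these numbers, here is a rough back-of-envelope estimate (my own assumptions, not official figures) of why ~1 GB is about what you'd expect for this model in 4-bit: most of the roughly 1.24B parameters are stored at 4 bits under NF4, but the embedding/tied lm_head weights are not quantized and stay in 16-bit precision.

# Rough, assumption-laden estimate of the 4-bit footprint of Llama-3.2-1B.
# Parameter counts are approximate; bitsandbytes leaves some modules
# (embeddings, norms, tied lm_head) un-quantized, which is why the real
# footprint sits above the pure-4-bit number.
total_params = 1.24e9          # approx. total parameters (assumption)
unquantized_params = 0.26e9    # approx. embedding + norm params kept in 16-bit (assumption)

quantized_bytes = (total_params - unquantized_params) * 0.5   # 4 bits per weight
fp16_bytes = unquantized_params * 2                           # 16 bits per weight

print(f"estimated 4-bit footprint ~ {(quantized_bytes + fp16_bytes) / 1e9:.2f} GB")
# -> roughly 1 GB, in the same ballpark as the 1012013184 bytes reported above

That also explains the original confusion: the quantized footprint is ~1 GB not because nothing changed, but because the un-quantized fp32 model is closer to ~4.9 GB.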

-thanks
Mohan
