Loading Llama-3.2-1B in quantized config shows no change in size

Hi,

This is what I have:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})
print(model.get_memory_footprint())

The result is: 1012013184

Why don't I see a reduction in memory footprint? When I use "facebook/opt-350m" I see a footprint of about 207 MB. I have a GTX 1080 on my system. What am I missing?

Thanks
Mohan


I measured it again with the code below, which reports both the model footprint as seen by PyTorch and the peak GPU memory allocated.

import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats(device=None)
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map={"": 0})
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0})
print(model.get_memory_footprint())
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)}")

It was about 4.9 GB for the original model vs about 1 GB for the 4-bit one (the reported footprint and the allocated CUDA memory were pretty close in both cases), so the quantized load does reduce memory after all. Sorry for the confusion.
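For anyone else sanity-checking these numbers, here is a rough back-of-envelope estimate (my own assumptions, not official figures) of why ~1 GB is about what you'd expect for this model in 4-bit: most of the roughly 1.24B parameters are stored at 4 bits under NF4, but the embedding/tied lm_head weights are not quantized and stay in 16-bit precision.

# Rough, assumption-laden estimate of the 4-bit footprint of Llama-3.2-1B.
# Parameter counts are approximate; bitsandbytes leaves some modules
# (embeddings, norms, tied lm_head) un-quantized, which is why the real
# footprint sits above the pure-4-bit number.
total_params = 1.24e9          # approx. total parameters (assumption)
unquantized_params = 0.26e9    # approx. embedding + norm params kept in 16-bit (assumption)

quantized_bytes = (total_params - unquantized_params) * 0.5   # 4 bits per weight
fp16_bytes = unquantized_params * 2                           # 16 bits per weight

print(f"estimated 4-bit footprint ~ {(quantized_bytes + fp16_bytes) / 1e9:.2f} GB")
# -> roughly 1 GB, in the same ballpark as the 1012013184 bytes reported above

That also explains the original confusion: the quantized footprint is ~1 GB not because nothing changed, but because the un-quantized fp32 model is closer to ~4.9 GB.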

-thanks
Mohan
