Why does QLoRA fine-tuning of a 3B model with 4-bit quantization use ~10GB of GPU memory?

I am experimenting with QLoRA fine-tuning. Here are my settings:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_id = "openlm-research/open_llama_3b_v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)

# formatted_dataset (prepared separately) contains the "text" field used below
supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=1000,
        output_dir="./SFTOpenLM-Dolly15k",
        optim="paged_adamw_8bit",
        fp16=True,
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512
)
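
For completeness, this is how I would check what is actually trainable (a minimal sketch; it assumes that, because peft_config is passed, SFTTrainer wraps base_model in a PEFT model exposed via the trainer's model attribute):

# Assumption: SFTTrainer has wrapped base_model with the LoRA adapters,
# so trainer.model is a PeftModel where only the adapter weights are trainable.
peft_model = supervised_finetuning_trainer.model
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts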

base_model.get_memory_footprint() returns ~2.5GB. Shouldn't it be ~1.5GB with 4-bit quantization, i.e. half a byte per weight: 3 × 10^9 parameters × 0.5 bytes = 1.5GB? Also, training with SFTTrainer using the above settings takes ~10GB of GPU memory. Is that too much? I was expecting around 4-5GB.
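
For context, a quick way to break that footprint down by parameter dtype (a minimal sketch, assuming base_model is loaded with the 4-bit config above; the 4-bit weights are stored packed, so they show up as uint8 tensors):

from collections import defaultdict

# Sum the storage of every parameter, grouped by dtype. Quantized layers
# appear as uint8 (packed 4-bit data); anything left unquantized keeps its
# 16/32-bit dtype and contributes full-size bytes to the footprint.
bytes_per_dtype = defaultdict(int)
for _, p in base_model.named_parameters():
    bytes_per_dtype[str(p.dtype)] += p.numel() * p.element_size()

for dtype, n_bytes in sorted(bytes_per_dtype.items()):
    print(f"{dtype}: {n_bytes / 1e9:.2f} GB")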

I have watched this video: https://www.youtube.com/watch?v=g68qlo9Izf0&t=13m14s. Its calculations do not seem to align with what I see in the experiment above.

Does anyone know what I am missing, and how memory requirements like these can be estimated in advance?