I am experimenting with QLoRA fine-tuning. Here are my settings:
```python
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_id = "openlm-research/open_llama_3b_v2"

qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)

supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=1000,
        output_dir="./SFTOpenLM-Dolly15k",
        optim="paged_adamw_8bit",
        fp16=True,
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512,
)
```
`base_model.get_memory_footprint()` returns ~2.5 GB. Shouldn't it be ~1.5 GB with 4-bit quantization, i.e. half a byte per weight: 3×10⁹ × 0.5 bytes ≈ 1.5 GB? Also, training with SFTTrainer with the settings above uses ~10 GB of GPU memory. Is that too much? I was expecting around 4-5 GB.
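For context, here is the back-of-envelope estimate I had in mind, written out as a small sketch. My assumption (which may be where I'm wrong) is that 4-bit quantization only applies to the Linear layers, while embeddings, layer norms, and the LM head stay in a 16-bit dtype; the parameter split below is a hypothetical illustration, not the real OpenLLaMA numbers:

```python
def estimate_4bit_footprint(n_params_total, n_params_unquantized, unquantized_bytes=2):
    """Rough model-weights footprint in GB for a 4-bit quantized model.

    n_params_total:       total parameter count
    n_params_unquantized: parameters left in a full-precision dtype
    unquantized_bytes:    bytes per unquantized weight (2 for fp16/bf16)
    """
    quantized = (n_params_total - n_params_unquantized) * 0.5  # 4 bits = 0.5 bytes
    rest = n_params_unquantized * unquantized_bytes
    return (quantized + rest) / 1e9


# If everything were quantized, a 3B model would be 1.5 GB:
print(estimate_4bit_footprint(3e9, 0))  # 1.5
# With a hypothetical ~0.5B params kept in 16-bit, it's already ~2.25 GB:
print(estimate_4bit_footprint(3e9, 0.5e9))
```

This is just how I'm reasoning about it; I'd like to know whether this accounting is even the right mental model.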
I have watched this video: https://www.youtube.com/watch?v=g68qlo9Izf0&t=13m14s. The calculations there don't seem to align with what I see in the experiment above.
Does anyone know what I am missing? And does anyone know how to calculate memory requirements in advance?
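For reference, here is a small helper I was planning to use to see where the bytes in the footprint actually go. It just groups parameter storage by dtype; my assumption is that the bitsandbytes 4-bit weights show up under a packed integer storage dtype while unquantized modules show up as fp16/fp32, so the split should reveal what stayed unquantized:

```python
from collections import defaultdict


def footprint_by_dtype(model):
    """Sum parameter bytes grouped by storage dtype, in GB.

    Works on anything exposing .parameters() where each parameter
    has .dtype, .numel(), and .element_size() (as torch tensors do).
    """
    sizes = defaultdict(int)
    for p in model.parameters():
        sizes[str(p.dtype)] += p.numel() * p.element_size()
    return {dtype: round(nbytes / 1e9, 3) for dtype, nbytes in sizes.items()}
```

Usage would be `footprint_by_dtype(base_model)` right after loading; the per-dtype totals should add up to roughly what `get_memory_footprint()` reports.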