I would like to fine-tune large models (1.3B and 7B parameters) using LoRA, but I need to reduce the cost further than LoRA alone provides. A tutorial suggested quantising the weights to 8-bit, so my code is this:
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

bandb_config = BitsAndBytesConfig(load_in_4bit=True)
config = AutoConfig.from_pretrained(cfg.model.architecture, quantization_config=bandb_config)
model = AutoModelForCausalLM.from_pretrained(cfg.model.architecture, config=config, cache_dir=model_dir)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=cfg.training.lora.rank,
    lora_alpha=cfg.training.lora.alpha,
    lora_dropout=cfg.training.lora.dropout,
)
model = get_peft_model(model, peft_config)
print("Trainable Parameters:")
model.print_trainable_parameters()
training_args = TrainingArguments(bf16=True, ...)
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
However, it has no effect: using 4-bit, 8-bit, and no quantization all require the same memory and time to train. Furthermore, after defining the model I checked the weights that were loaded, and they are all fp32.
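For reference, this is roughly how I checked (a quick sketch of the checks rather than the exact script; torch.cuda.max_memory_allocated is simply what I used to compare peak memory across the three settings):

import torch

# Distinct dtypes of the loaded parameters; this prints {torch.float32} for me.
print({param.dtype for param in model.parameters()})

# Peak GPU memory after training, to compare 4-bit, 8-bit, and no quantization.
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")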