I’m fine-tuning a Llama-2 sequence classification model with PEFT and QLoRA, evaluating every 100 steps and also saving a checkpoint every 100 steps. When I load a checkpoint and run inference on the same validation set used during training, the accuracy is much lower than the accuracy reported during training. Here’s the relevant code:
Training:
import torch
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    target_modules=['v_proj', 'down_proj', 'up_proj', 'q_proj', 'gate_proj', 'k_proj', 'o_proj'],
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
training_args = TrainingArguments(...)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("final-checkpoint")
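For reference, compute_metrics only reports accuracy; it is roughly this (simplified):
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair from the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}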
Then for inference, I load the model as follows:
model = AutoModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    device_map="auto",
    num_labels=n_labels,
    quantization_config=q_config,
)
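and then run predictions roughly like this (simplified; my real loop batches the inputs, and "text" stands in for the actual input column):
model.eval()
predictions = []
for example in ds_test:
    inputs = tokenizer(example["text"], return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions.append(int(logits.argmax(dim=-1)))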
Doing inference with this model gives much worse predictions on the test set than during training. I’ve tried loading with other instantiation methods (AutoPeftModelForSequenceClassification …), but the result is the same.
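For example, one of those attempts looked roughly like this (arguments mirrored from the load above):
from peft import AutoPeftModelForSequenceClassification

model = AutoPeftModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    device_map="auto",
    num_labels=n_labels,
    quantization_config=q_config,
)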
What am I doing wrong? Is it the saving that is wrong, or the loading? Something in the parameters?
The thing is: if a model has been training for days and you cannot save it and load it back, then …?
Thank you for your help.