I’m fine-tuning a Llama-2 sequence classification model with PEFT and QLoRA, evaluating and saving a checkpoint every 100 steps. During training the eval accuracy looks good, but when I load a saved checkpoint and run inference on the same validation set, the accuracy is much lower. Here’s the relevant code:
```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# QLoRA setup: 4-bit NF4 quantization with double quantization, bf16 compute
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=q_config,
    device_map="auto",
    num_labels=n_labels,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False  # cache is incompatible with gradient checkpointing

# LoRA adapters on all attention and MLP projections, sequence-classification task
peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    target_modules=["v_proj", "down_proj", "up_proj", "q_proj", "gate_proj", "k_proj", "o_proj"],
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(...)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("final-checkpoint")
```
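The elided `TrainingArguments` look roughly like this (most hyperparameters omitted, and `output_dir` is a placeholder); the relevant part is that evaluation and checkpointing run on the same 100-step cadence:

```python
# Sketch of my TrainingArguments -- learning rate, batch sizes, etc. omitted.
# Eval and checkpoint saving both happen every 100 steps.
training_args = TrainingArguments(
    output_dir="checkpoints",       # placeholder path
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
)
```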
Then for inference, I load the model as follows:
```python
model = AutoModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    device_map="auto",
    num_labels=n_labels,
    quantization_config=q_config,
)
```
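To measure the gap, I evaluate the reloaded model with the same machinery used during training. A simplified sketch (`ds_test`, `compute_metrics`, and `tokenizer` are the same objects as above; the batch size is arbitrary):

```python
# Evaluate the reloaded model with the same metrics and dataset as during
# training, then compare against the eval metrics the Trainer logged.
eval_args = TrainingArguments(output_dir="eval-tmp", per_device_eval_batch_size=8)
eval_trainer = Trainer(
    model=model,              # the model reloaded from "final-checkpoint"
    args=eval_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
print(eval_trainer.evaluate())
```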
Doing inference with this reloaded model gives far worse predictions on the test set than I saw during training. I’ve tried loading it with other instantiation methods (AutoPeftModelForSequenceClassification, …), but the result is the same.
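For example, this variant (sketched from memory; same `q_config` as above) produces the same degraded accuracy:

```python
from peft import AutoPeftModelForSequenceClassification

# Alternative loading path via peft's auto class -- result is the same.
model = AutoPeftModelForSequenceClassification.from_pretrained(
    "final-checkpoint",
    device_map="auto",
    num_labels=n_labels,
    quantization_config=q_config,
)
```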
What am I doing wrong? Is the saving wrong, or the loading? Or something in the parameters?
The thing is: if a model has been training for days and you can’t save it and load it back again, then what’s the point?
Thank you for your help.