Trainer in PEFT doesn't report evaluation metrics

Hi,

I am fine-tuning the Llama3-8B model for sequence classification using QLoRA. Whenever I use QLoRA, the evaluation metrics never get printed (see the output at the bottom): only the runtime statistics show up, and the values returned by compute_metrics are missing. However, when I use the full 32-bit precision model without PEFT, the metrics are printed. I am not sure if this is a bug, so I am asking here. I also observed the same output with other models such as BERT, so it’s not limited to Llama models. If you need any other details, please let me know. Thanks!

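import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig, Trainer, TrainingArguments

# 4-bit NF4 quantization with double quantization; matrix multiplications run in fp16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.half,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)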

# quantization_config=BitsAndBytesConfig(
#     quant_type="dynamic",  # Use dynamic quantization
#     bits=4  # Specify 4-bit quantization
# )

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Ensure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    # torch_dtype=torch.half,
    quantization_config=quantization_config,
)

Below is my compute_metrics function:

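import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(p):
    # p is the EvalPrediction from the Trainer: logits in p.predictions, gold labels in p.label_ids
    preds = np.argmax(p.predictions, axis=1)
    # Weighted averaging weights each class by its support
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='weighted')
    acc = accuracy_score(p.label_ids, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }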

These are my training arguments and trainer:

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=500,
    weight_decay=0.01,
    #logging_dir='./logs',
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=50
)


# Initialize the Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

Outputs:

{'eval_runtime': 1229.9596, 'eval_samples_per_second': 20.326, 'eval_steps_per_second': 2.541, 'epoch': 1.92}
{'loss': 0.2013, 'grad_norm': 90.90315246582031, 'learning_rate': 6.956521739130435e-07, 'epoch': 1.94}
{'loss': 0.124, 'grad_norm': 1.2543754577636719, 'learning_rate': 5.217391304347826e-07, 'epoch': 1.95}
{'loss': 0.1548, 'grad_norm': 0.5100873708724976, 'learning_rate': 3.4782608695652175e-07, 'epoch': 1.97}
{'loss': 0.2046, 'grad_norm': 0.08409222960472107, 'learning_rate': 1.7391304347826088e-07, 'epoch': 1.98}
{'loss': 0.2284, 'grad_norm': 0.019938422366976738, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 20509.1509, 'train_samples_per_second': 2.438, 'train_steps_per_second': 0.305, 'train_loss': 0.24805847625732422, 'epoch': 2.0}
Validation Results: {'eval_runtime': 1226.4144, 'eval_samples_per_second': 20.385, 'eval_steps_per_second': 2.548, 'epoch': 2.0}
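
To make the symptom concrete, here is a small check of the final validation dictionary above against the keys I expected to see. This is only a sketch: the expected key names assume that the Trainer adds its default `eval_` prefix to whatever compute_metrics returns and also reports `eval_loss`, and `missing_metric_keys` is a helper name made up for this post.

```python
# Final evaluation output copied from the log above.
validation_results = {
    "eval_runtime": 1226.4144,
    "eval_samples_per_second": 20.385,
    "eval_steps_per_second": 2.548,
    "epoch": 2.0,
}

# Keys returned by compute_metrics; the "eval_" prefix is assumed to be the Trainer default.
COMPUTE_METRICS_KEYS = ("accuracy", "precision", "recall", "f1")


def missing_metric_keys(results, metric_names=COMPUTE_METRICS_KEYS, prefix="eval_"):
    """Return the expected evaluation keys that are absent from `results`."""
    expected = [f"{prefix}loss"] + [f"{prefix}{name}" for name in metric_names]
    return [key for key in expected if key not in results]


missing = missing_metric_keys(validation_results)
print(missing)
```

Every expected key comes back as missing, including `eval_loss`, so it looks as if neither the loss nor compute_metrics is being evaluated at all in the QLoRA run.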