Trainer in PEFT doesn't report evaluation metrics

pursuitofds · July 5, 2024, 6:05pm

Hi,

I am fine-tuning the Llama3-8B model for sequence classification using QLoRA. Whenever I use QLoRA, the evaluation metrics never get printed (see the results at the bottom). However, when I use the full 32-bit precision model without PEFT, the metrics do get printed. I am not sure if this is a bug, so I am asking here. I also observed the same output with other models such as BERT, so it’s not limited to Llama models. If you need any other details, please let me know. Thanks!

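import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; compute in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.half,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)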

# quantization_config=BitsAndBytesConfig(
#     quant_type="dynamic",  # Use dynamic quantization
#     bits=4  # Specify 4-bit quantization
# )

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Ensure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    # torch_dtype=torch.half,
    quantization_config=quantization_config,
)

Below is my compute_metrics function:

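import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(p):
    # p.predictions holds the logits; pick the highest-scoring class per example
    preds = np.argmax(p.predictions, axis=1)
    # Weighted average accounts for class imbalance in the label distribution
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='weighted')
    acc = accuracy_score(p.label_ids, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }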

These are my training arguments and trainer:

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=500,
    weight_decay=0.01,
    #logging_dir='./logs',
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=50
)


# Initialize the Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

Outputs:

{'eval_runtime': 1229.9596, 'eval_samples_per_second': 20.326, 'eval_steps_per_second': 2.541, 'epoch': 1.92}
{'loss': 0.2013, 'grad_norm': 90.90315246582031, 'learning_rate': 6.956521739130435e-07, 'epoch': 1.94}
{'loss': 0.124, 'grad_norm': 1.2543754577636719, 'learning_rate': 5.217391304347826e-07, 'epoch': 1.95}
{'loss': 0.1548, 'grad_norm': 0.5100873708724976, 'learning_rate': 3.4782608695652175e-07, 'epoch': 1.97}
{'loss': 0.2046, 'grad_norm': 0.08409222960472107, 'learning_rate': 1.7391304347826088e-07, 'epoch': 1.98}
{'loss': 0.2284, 'grad_norm': 0.019938422366976738, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 20509.1509, 'train_samples_per_second': 2.438, 'train_steps_per_second': 0.305, 'train_loss': 0.24805847625732422, 'epoch': 2.0}
Validation Results: {'eval_runtime': 1226.4144, 'eval_samples_per_second': 20.385, 'eval_steps_per_second': 2.548, 'epoch': 2.0}

Conor-13 · November 17, 2024, 8:38pm

Hi, I am running into this same problem using TinyBERT + LoRA. Were you ever able to figure out how to fix this? Thanks!

John6666 · November 18, 2024, 1:39am

I haven’t done any in-depth research on this, but I’ve seen more than a few people on this forum, Discord, and GitHub who are having the same problem…
It seems that some people have fixed it by upgrading PEFT, but there may still be a bug somewhere, possibly involving other libraries.

pip install -U peft bitsandbytes
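
Before and after upgrading, it can help to note which versions are actually installed, so that a fix (or the lack of one) can be tied to specific releases. This is only a minimal sketch using the standard library; the installed_versions helper is a hypothetical name, and including transformers in the list is an assumption on the grounds that Trainer lives in that package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["peft", "bitsandbytes", "transformers"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```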

BenjaminB · November 29, 2024, 3:09pm

The problem you encounter is possibly the same as the one described in this issue: "eval_loss missing when using peft model with STFTrainer" (Issue #1881, huggingface/peft). Please check if the solution (or “hack”) described there works for you.

Z063 · June 17, 2025, 9:18am

The problem is in Trainer.evaluation_loop(), where the call "losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)" doesn’t return the labels.
I added label_names=["labels"] to my TrainingArguments, and after that the evaluation metrics were reported.
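
One plausible reading of why label_names matters, offered as a sketch and not a confirmed diagnosis: when label_names is not set, Trainer infers the label names from the parameter names of the model's forward method. A wrapper whose forward only accepts *args and **kwargs exposes no parameter containing "label", so the inferred list comes back empty, the evaluation loop sees no labels, and compute_metrics is never called. The two classes and both helper functions below are hypothetical stand-ins that mimic this behaviour; they are not the library's code.

```python
import inspect


class BaseClassifier:
    """Hypothetical stand-in for a model whose forward names its label argument."""

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        return None


class GenericWrapper:
    """Hypothetical stand-in for a wrapper that passes all arguments through."""

    def __init__(self, base):
        self.base = base

    def forward(self, *args, **kwargs):
        return self.base.forward(*args, **kwargs)


def infer_label_names(model_class):
    """Return forward() parameter names containing 'label' (illustrative only)."""
    signature = inspect.signature(model_class.forward)
    return [name for name in signature.parameters if "label" in name]


def resolve_label_names(model_class, label_names=None):
    """Prefer an explicitly supplied list; otherwise fall back to inference."""
    if label_names is not None:
        return list(label_names)
    return infer_label_names(model_class)


base_names = resolve_label_names(BaseClassifier)
wrapped_names = resolve_label_names(GenericWrapper)
explicit_names = resolve_label_names(GenericWrapper, label_names=["labels"])

print(base_names)      # label argument is visible on the bare model
print(wrapped_names)   # hidden behind *args/**kwargs on the wrapper
print(explicit_names)  # an explicit list bypasses the inference
```

In the setup from the original post, the corresponding change would be adding label_names=["labels"] to the TrainingArguments(...) call.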
