Hallucination with trainer.evaluate() on LLMs

I am trying to use the Trainer to run an evaluation on decoder-only LLMs and compute custom metrics. This is specifically about reloading a quantized LoRA checkpoint with do_eval=True and do_train=False.
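For context, the evaluation-only run is configured roughly like this (a sketch; the output directory, batch size, and precision flag are placeholders, not my exact settings):

    from transformers import TrainingArguments

    training_arguments = TrainingArguments(
        output_dir="./eval_out",        # placeholder
        do_train=False,
        do_eval=True,
        per_device_eval_batch_size=4,   # placeholder
        bf16=True,                      # matches the bfloat16 base model
    )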

The issue is that the predictions contain a lot of hallucinations, even in the prompt portion, while the same model generates almost perfectly when loaded externally and queried with the model.generate() method.

This is specifically with Llama-2, but the same issue also shows up with Falcon.
The code is mostly based on the run_clm.py example and the Falcon training script.

Here is the code that I came up with:

    peft_config = PeftConfig.from_pretrained(script_args.peft_model)
    base_model = AutoModelForCausalLM.from_pretrained(
        peft_config.base_model_name_or_path,
        trust_remote_code=True, torch_dtype=torch.bfloat16,
        device_map={"": 0}, load_in_8bit=True,  # quantization_config=bnb_config
    )
    # .to("cuda") dropped: device_map={"": 0} already places the model on GPU 0
    lora_model = PeftModel.from_pretrained(base_model, script_args.peft_model)
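The tokenizer used by the metrics and the collator is loaded from the same base checkpoint (a sketch; reusing the EOS token as the pad token is my assumption, since Llama-2 ships without a pad token):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        peft_config.base_model_name_or_path, trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as pad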

Here are the helper functions for the metrics:

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # preds are already token ids (argmax taken in preprocess_logits_for_metrics);
        # the Trainer pads predictions across batches with -100, and masked label
        # positions are -100 as well
        acc_metric = evaluate.load("accuracy")
        mask = labels != -100  # score only positions that are not masked out
        acc = acc_metric.compute(predictions=preds[mask], references=labels[mask])

        preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds)
        print(decoded_preds)
        return {"ACC": acc["accuracy"]}

    def preprocess_logits_for_metrics(logits, labels):
        # reduce the logits to token ids here so the Trainer does not
        # accumulate the full vocab-size logits tensor during evaluation
        if isinstance(logits, tuple):
            logits = logits[0]
        return logits.argmax(dim=-1)
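For debugging, the labels can be decoded the same way inside compute_metrics so the argmax predictions can be compared position by position with the references (a sketch; replacing -100 before decoding is the same trick used for the predictions):

    # inside compute_metrics, after computing the accuracy
    labels_for_decode = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels_for_decode)
    for p, l in zip(decoded_preds, decoded_labels):
        print("PRED :", p)
        print("LABEL:", l)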

Here is the trainer and the evaluation call:

    trainer = Trainer(
        model=lora_model,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        args=training_arguments,
        data_collator=DataCollatorSupervisedFineTuning(tokenizer),
        compute_metrics=compute_metrics,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    )

    if training_arguments.do_eval:
        metrics = trainer.evaluate()
        print(metrics)
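DataCollatorSupervisedFineTuning is a small custom collator (not shown above). Roughly, it pads the batch and sets the padded label positions to -100 so they are ignored by the loss; the following is a minimal sketch of that idea, assuming each dataset example already carries input_ids and labels, not the exact implementation:

    from dataclasses import dataclass

    import torch
    from transformers import PreTrainedTokenizerBase

    @dataclass
    class DataCollatorSupervisedFineTuning:
        tokenizer: PreTrainedTokenizerBase

        def __call__(self, features):
            input_ids = [torch.tensor(f["input_ids"]) for f in features]
            labels = [torch.tensor(f["labels"]) for f in features]
            input_ids = torch.nn.utils.rnn.pad_sequence(
                input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
            )
            labels = torch.nn.utils.rnn.pad_sequence(
                labels, batch_first=True, padding_value=-100  # ignored by the loss
            )
            return {
                "input_ids": input_ids,
                "attention_mask": input_ids.ne(self.tokenizer.pad_token_id),
                "labels": labels,
            }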

Here is an example of what decoded_preds looks like vs. the output of model.generate().

decoded_preds:

    ## Exampleru
    ##ract the from the file text and the above above above.

vs. model.generate() with default parameters, except for an extended max_new_tokens:

    # Instruction
    Extract data from the following document using the rules specified below.
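For reference, the external generation is done roughly like this (a sketch; the prompt string and the max_new_tokens value are placeholders, not my exact settings):

    import torch

    prompt = "# Instruction\nExtract data from the following document ..."  # placeholder
    inputs = tokenizer(prompt, return_tensors="pt").to(lora_model.device)
    with torch.no_grad():
        output_ids = lora_model.generate(**inputs, max_new_tokens=512)  # placeholder value
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))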

Library versions:
transformers 4.32.0.dev0
peft 0.5.0.dev0

Did anyone find a solution to this?