Identical Evaluation Metrics for SFT & DPO–Fine-Tuned LoRA Adapter on SeaLLMs-v3-7B

Hello everyone,

I’m running into a puzzling situation where my SFT and DPO evaluations produce exactly the same n-gram metrics—even after fine-tuning via DPO. I expected DPO to alter the model’s behavior (and thus change BLEU/ROUGE/etc.), but instead both runs yield:

| model | exact_match | rouge1_f1 | rouge2_f1 | rougeL_f1 | bleu | meteor | inference_time_s |
|---|---|---|---|---|---|---|---|
| SeaLLMs-v3-7B (SFT) | 0 | 0.715663 | 0.652622 | 0.709211 | 0.558454 | 0.732766 | ~58 |
| SeaLLMs-v3-7B (DPO) | 0 | 0.715663 | 0.652622 | 0.709211 | 0.558454 | 0.732766 | ~60 |

1. My workflow

  1. SFT training via TRL’s SFTTrainer
  • QLoRA (r=16, α=32, dropout=0.05), bf16, 3 epochs
  • Saved adapter in sft_output_SeaLLMs-v3-7B/
  2. Preference dataset creation (pairwise “chosen vs rejected”) → cleaned JSONL
  3. DPO training via TRL’s DPOTrainer:
```python
# Continue from the SFT adapter: load it onto the quantized base model, then run DPO
base_model.config.use_cache = False
base_model.enable_input_require_grads()
base_model.gradient_checkpointing_enable()
model = PeftModel.from_pretrained(base_model, sft_output_dir, ...)
trainer = DPOTrainer(model=model, args=dpo_args, train_dataset=..., processing_class=tokenizer)
trainer.train()
model.save_pretrained("dpo_output_SeaLLMs-v3-7B/")
```
  4. Evaluation notebooks
  • SFT_Evaluation.ipynb loads PeftModel.from_pretrained("sft_output_…")
  • DPO_Evaluation.ipynb loads PeftModel.from_pretrained("dpo_output_…")
  • Both run 4-bit quantized inference (BitsAndBytesConfig), batch-generate, then compute EM / ROUGE-1/2/L / BLEU / METEOR on the same held-out test set (a loading sketch follows this list).
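
Roughly, each evaluation notebook loads its adapter like this (simplified sketch; the hub ID and paths below are placeholders rather than my exact notebook code):

```python
# Simplified sketch of the evaluation loading path (placeholders, not the exact notebook code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "SeaLLMs/SeaLLMs-v3-7B"           # placeholder for the actual base checkpoint
ADAPTER_DIR = "dpo_output_SeaLLMs-v3-7B/"   # or "sft_output_SeaLLMs-v3-7B/" for the SFT run

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, ADAPTER_DIR)
model.eval()
```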

2. Environment

  • Transformers 4.40.0
  • 🤗 TRL 0.11.3
  • PEFT 0.15.0
  • bitsandbytes (4-bit NF4 quant)
  • Python 3.10
  • 🤗 Evaluate library
  • GPU: A100 (4-bit inference on GPUs 3,4,5)

3. Questions

  1. Why are the SFT & DPO metrics identical?
    Is there a scenario where DPO doesn’t actually modify the n-gram outputs, or am I accidentally evaluating the same checkpoint twice?
  2. Adapter loading sanity
  • Should I be calling model.merge_and_unload() before eval?
  • Any quick tricks to diff the state dicts of the SFT vs DPO adapters? (See the sketch after this list for what I mean.)
  3. Debugging DPO updates
    How can I inspect reward/loss signals or gradient norms during DPO training to confirm that the policy is actually being updated?
  4. Best practices for “before vs after” sampling
    Do you recommend a lightweight workflow/snippet for sampling a few prompts pre- and post-DPO to spot qualitative changes?
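
For context on question 2, this is the kind of state-dict diff I had in mind (sketch only; it assumes both runs saved their adapter as adapter_model.safetensors via save_pretrained):

```python
# Sketch: compare the saved SFT and DPO adapters tensor by tensor.
from safetensors.torch import load_file

sft = load_file("sft_output_SeaLLMs-v3-7B/adapter_model.safetensors")
dpo = load_file("dpo_output_SeaLLMs-v3-7B/adapter_model.safetensors")

assert sft.keys() == dpo.keys(), "adapter key sets differ"
max_diff = max((sft[k].float() - dpo[k].float()).abs().max().item() for k in sft)
print(f"max abs difference across LoRA tensors: {max_diff:.3e}")
# A value of ~0 would mean the DPO run never changed the LoRA weights.
```

If that difference is essentially zero, the DPO run never touched the adapter. On question 3, my understanding is that TRL’s DPOTrainer logs rewards/chosen, rewards/rejected, rewards/accuracies, and rewards/margins (plus the Trainer’s grad_norm), so flat or missing values there would point to the same problem.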

I’d really appreciate any pointers, example snippets, or pitfalls to watch out for. Thank you! 🙏


The fact that the metrics are identical for the SFT and DPO runs suggests that, although I don’t know the exact reason, the model weights are probably not being overwritten. For example, requires_grad=False may be set on the adapter parameters, and loading the adapter with is_trainable=True may also be necessary:

```python
# Drop .to(device) if the base model is 4-bit quantized or dispatched with device_map
model = PeftModel.from_pretrained(model, peft_dir, is_trainable=True).to(device)
```
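
As a quick sanity check before starting DPO, you can confirm that the loaded adapter actually has trainable parameters (reusing base_model and sft_output_dir from your snippet):

```python
# Reload the SFT adapter in trainable mode and verify that LoRA parameters require grads.
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, sft_output_dir, is_trainable=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # should be > 0 before DPOTrainer runs
model.print_trainable_parameters()         # PEFT helper; also prints the percentage
```

If trainable is 0, the DPO loss has nothing to update and the saved adapter will be identical to the SFT one.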