Identical Evaluation Metrics for SFT & DPO–Fine-Tuned LoRA Adapter on SeaLLMs-v3-7B

Hello everyone,

I’m running into a puzzling situation where my SFT and DPO evaluations produce exactly the same n-gram metrics—even after fine-tuning via DPO. I expected DPO to alter the model’s behavior (and thus change BLEU/ROUGE/etc.), but instead both runs yield:

| model | exact_match | rouge1_f1 | rouge2_f1 | rougeL_f1 | bleu | meteor | inference_time_s |
|---|---|---|---|---|---|---|---|
| SeaLLMs-v3-7B (SFT) | 0 | 0.715663 | 0.652622 | 0.709211 | 0.558454 | 0.732766 | ~58 |
| SeaLLMs-v3-7B (DPO) | 0 | 0.715663 | 0.652622 | 0.709211 | 0.558454 | 0.732766 | ~60 |

1. My workflow

  1. SFT training via TRL’s SFTTrainer
  • QLoRA (r=16, α=32, dropout=0.05), bf16, 3 epochs
  • Saved adapter in sft_output_SeaLLMs-v3-7B/
  2. Preference dataset creation (pairwise “chosen vs rejected”) → cleaned JSONL
  3. DPO training via TRL’s DPOTrainer:
```python
# Continue from the SFT adapter: load it onto the quantized base model, then run DPO
base_model.config.use_cache = False
base_model.enable_input_require_grads()
base_model.gradient_checkpointing_enable()
model = PeftModel.from_pretrained(base_model, sft_output_dir, ...)
trainer = DPOTrainer(model=model, args=dpo_args, train_dataset=..., processing_class=tokenizer)
trainer.train()
model.save_pretrained("dpo_output_SeaLLMs-v3-7B/")
```
  4. Evaluation notebooks
  • SFT_Evaluation.ipynb loads PeftModel.from_pretrained("sft_output_…")
  • DPO_Evaluation.ipynb loads PeftModel.from_pretrained("dpo_output_…")
  • Both run 4-bit quantized inference (BitsAndBytesConfig), batch-generate, then compute EM / ROUGE-1/2/L / BLEU / METEOR on the same held-out test set (a loading sketch follows this list).
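
Roughly, each evaluation notebook loads its adapter like this (simplified sketch; the hub ID and paths below are placeholders rather than my exact notebook code):

```python
# Simplified sketch of the evaluation loading path (placeholders, not the exact notebook code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "SeaLLMs/SeaLLMs-v3-7B"           # placeholder for the actual base checkpoint
ADAPTER_DIR = "dpo_output_SeaLLMs-v3-7B/"   # or "sft_output_SeaLLMs-v3-7B/" for the SFT run

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, ADAPTER_DIR)
model.eval()
```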

2. Environment

  • Transformers 4.40.0
  • 🤗 TRL 0.11.3
  • PEFT 0.15.0
  • bitsandbytes (4-bit NF4 quant)
  • Python 3.10
  • 🤗 Evaluate library
  • GPU: A100 (4-bit inference on GPUs 3,4,5)

3. Questions

  1. Why are the SFT & DPO metrics identical?
    Is there a scenario where DPO doesn’t actually modify the n-gram outputs, or am I accidentally evaluating the same checkpoint twice?
  2. Adapter loading sanity
  • Should I be calling model.merge_and_unload() before eval?
  • Any quick tricks to diff the state dicts of the SFT vs DPO adapters? (See the sketch after this list for what I mean.)
  3. Debugging DPO updates
    How can I inspect reward/loss signals or gradient norms during DPO training to confirm that the policy is actually being updated?
  4. Best practices for “before vs after” sampling
    Do you recommend a lightweight workflow/snippet for sampling a few prompts pre- and post-DPO to spot qualitative changes?
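
For context on question 2, this is the kind of state-dict diff I had in mind (sketch only; it assumes both runs saved their adapter as adapter_model.safetensors via save_pretrained):

```python
# Sketch: compare the saved SFT and DPO adapters tensor by tensor.
from safetensors.torch import load_file

sft = load_file("sft_output_SeaLLMs-v3-7B/adapter_model.safetensors")
dpo = load_file("dpo_output_SeaLLMs-v3-7B/adapter_model.safetensors")

assert sft.keys() == dpo.keys(), "adapter key sets differ"
max_diff = max((sft[k].float() - dpo[k].float()).abs().max().item() for k in sft)
print(f"max abs difference across LoRA tensors: {max_diff:.3e}")
# A value of ~0 would mean the DPO run never changed the LoRA weights.
```

If that difference is essentially zero, the DPO run never touched the adapter. On question 3, my understanding is that TRL’s DPOTrainer logs rewards/chosen, rewards/rejected, rewards/accuracies, and rewards/margins (plus the Trainer’s grad_norm), so flat or missing values there would point to the same problem.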

I’d really appreciate any pointers, example snippets, or pitfalls to watch out for. Thank you! 🙏


The fact that the metrics are identical for the SFT and DPO runs suggests that, although I don’t know the exact reason, the model weights are probably not being overwritten. For example, requires_grad=False may be set on the adapter parameters, and loading the adapter with is_trainable=True may also be necessary:

```python
# Drop .to(device) if the base model is 4-bit quantized or dispatched with device_map
model = PeftModel.from_pretrained(model, peft_dir, is_trainable=True).to(device)
```
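
As a quick sanity check before starting DPO, you can confirm that the loaded adapter actually has trainable parameters (reusing base_model and sft_output_dir from your snippet):

```python
# Reload the SFT adapter in trainable mode and verify that LoRA parameters require grads.
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, sft_output_dir, is_trainable=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # should be > 0 before DPOTrainer runs
model.print_trainable_parameters()         # PEFT helper; also prints the percentage
```

If trainable is 0, the DPO loss has nothing to update and the saved adapter will be identical to the SFT one.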