Fine tuning and it's effects on model safety

foo-barrr · July 19, 2025, 3:52pm

Hey everyone,

I’m doing some research into the practical safety challenges that arise after a base model has been released.

We all start with safety-aligned models like Llama 3, Mistral, Gemma, etc. But the real world requires us to adapt them. I was reading a fascinating paper out of Princeton/OpenAI : [2310.03693] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! that states that fine-tuning can compromise a model’s safety alignment.

The paper showed that even fine-tuning on benign datasets can degrade safety. This got me thinking beyond academic benchmarks and red-teaming. I was curious how much this happens in the real world.

My question for all of you who are building with these models is:

Have you seen this in your own projects? When you’ve fine-tuned a model (using PEFT, full fine-tuning, etc.) or even just done some complex prompt engineering for a specific business use case, have you noticed any of the following?

Emergent Toxicity/Bias: Did the model start making subtly (or overtly) toxic, biased, or inappropriate comments in contexts where the base model never would have?
Instruction-Following Breakdowns: Did it start getting “dumber” in a sense, ignoring parts of its system prompt or safety instructions that it followed perfectly before tuning?
Increased Hallucinations: Did it begin to confidently invent facts or details relevant to the new domain you trained it on?
“Weird” Unpredictable Behavior: Any other strange, unexpected, or misaligned behaviors that made you hesitant to ship it?

I’m particularly interested in hearing from people who are using these models in business settings (e.g., customer support bots, content generation, internal knowledge tools). What challenges did you face, and how did you try to solve them?

Thanks in advance for sharing your insights!

tamtambaby · July 19, 2025, 10:56pm

One who doesn’t know wrong cannot distinguish right. You know that ‘toxite’ is wrong, but does the model know? It cannot reject something it doesn’t know. So what I mean is, if you don’t put the wrong into the model, it cannot distinguish the right.

John6666 · July 20, 2025, 2:18am

There seem to be several established methods for ensuring safety and know-how for destroying safety valves. If you don’t want to compromise safety through normal fine-tuning, it may be best to avoid practices similar to these methods.

Also, personally, I think that since teaching something new inevitably causes something else to be forgotten, given the current small number of parameters (far fewer than the human brain can handle), it is necessary to continuously teach new safety measures in applications where safety is essential.

OneStarDao · July 21, 2025, 12:46pm

Really appreciate this question — I’ve been chewing on the same problem for a while, especially around hallucinations and prompt obedience post-finetune.

What I’ve noticed in my own projects is: even without fine-tuning, just doing prompt engineering at scale can gradually break model alignment in weird ways. Especially if you’re stacking instructions over time, or building long-chain prompt logic — the cracks start to show.

That’s actually why I started building a side tool (yep, I’m the author ) that kinda wraps around any LLM and tests it with 50+ structured prompts at once.
The idea is: instead of just hoping your new prompt/fine-tune worked, you stress test the model by asking it to generate 50 logically consistent answers to the same query. If the answers diverge wildly, it’s a sign something got distorted (hallucination, bias drift, etc.).

I’ve been using it more as a semantic coherence scanner than a normal chatbot.

Totally free/open source if anyone wants to test this angle of model safety:
WFGY/OS/BlahBlahBlah at main · onestardao/WFGY · GitHub

Also curious if anyone’s tried similar “external alignment validators” instead of touching the model weights directly.

Topic		Replies	Views
Safe_Mode = [True, False] Community Calls	0	258	December 27, 2023
Finetuning options with SAM? Models	4	5241	May 11, 2023
Monitoring ML and LLM models in production for drift, trust, and safety Show and Tell	2	1	July 21, 2025
Guidance on getting started with fine tuned uncensored model Beginners	2	1265	March 8, 2025
Facing issues in fine-tuning Vicuna-7b model Research	0	499	January 18, 2024

Fine tuning and it's effects on model safety

Related topics