Hey everyone,
I’m doing some research into the practical safety challenges that arise after a base model has been released.
We all start with safety-aligned models like Llama 3, Mistral, Gemma, etc. But the real world requires us to adapt them. I was reading a fascinating paper out of Princeton/OpenAI : [2310.03693] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! that states that fine-tuning can compromise a model’s safety alignment.
The paper showed that even fine-tuning on benign datasets can degrade safety. This got me thinking beyond academic benchmarks and red-teaming. I was curious how much this happens in the real world.
My question for all of you who are building with these models is:
Have you seen this in your own projects? When you’ve fine-tuned a model (using PEFT, full fine-tuning, etc.) or even just done some complex prompt engineering for a specific business use case, have you noticed any of the following?
-
Emergent Toxicity/Bias: Did the model start making subtly (or overtly) toxic, biased, or inappropriate comments in contexts where the base model never would have?
-
Instruction-Following Breakdowns: Did it start getting “dumber” in a sense, ignoring parts of its system prompt or safety instructions that it followed perfectly before tuning?
-
Increased Hallucinations: Did it begin to confidently invent facts or details relevant to the new domain you trained it on?
-
“Weird” Unpredictable Behavior: Any other strange, unexpected, or misaligned behaviors that made you hesitant to ship it?
I’m particularly interested in hearing from people who are using these models in business settings (e.g., customer support bots, content generation, internal knowledge tools). What challenges did you face, and how did you try to solve them?
Thanks in advance for sharing your insights!
1 Like
One who doesn’t know wrong cannot distinguish right. You know that ‘toxite’ is wrong, but does the model know? It cannot reject something it doesn’t know. So what I mean is, if you don’t put the wrong into the model, it cannot distinguish the right.
1 Like
There seem to be several established methods for ensuring safety and know-how for destroying safety valves. If you don’t want to compromise safety through normal fine-tuning, it may be best to avoid practices similar to these methods.
Also, personally, I think that since teaching something new inevitably causes something else to be forgotten, given the current small number of parameters (far fewer than the human brain can handle), it is necessary to continuously teach new safety measures in applications where safety is essential.
1 Like
Really appreciate this question — I’ve been chewing on the same problem for a while, especially around hallucinations and prompt obedience post-finetune.
What I’ve noticed in my own projects is: even without fine-tuning, just doing prompt engineering at scale can gradually break model alignment in weird ways. Especially if you’re stacking instructions over time, or building long-chain prompt logic — the cracks start to show.
That’s actually why I started building a side tool (yep, I’m the author
) that kinda wraps around any LLM and tests it with 50+ structured prompts at once.
The idea is: instead of just hoping your new prompt/fine-tune worked, you stress test the model by asking it to generate 50 logically consistent answers to the same query. If the answers diverge wildly, it’s a sign something got distorted (hallucination, bias drift, etc.).
I’ve been using it more as a semantic coherence scanner than a normal chatbot.
Totally free/open source if anyone wants to test this angle of model safety:
WFGY/OS/BlahBlahBlah at main · onestardao/WFGY · GitHub
Also curious if anyone’s tried similar “external alignment validators” instead of touching the model weights directly.
1 Like