Hi everyone,
I’m working on a domain-specific fine-tuning task involving formula calculations and long-form generation, but after SFT the model’s quality drops significantly. I’d appreciate insights into possible causes and how to address them.
Problem Description
I performed SFT fine-tuning on a model for a task involving:
- Domain-specific technical content
- Formula calculations, derivations, structured explanations
- Long outputs (often several thousand tokens)
After fine-tuning, the model exhibits the following issues:
1. Significant degradation in output quality
- Logical inconsistency
- More formula reasoning errors
- Long-form generation collapses in the second half
2. Random language switching from Chinese → English
The model sometimes switches to English mid-generation, even though all training data are Chinese. This NEVER happens with the base model or with other models fine-tuned on the same data.
3. Same data + same fine-tuning method
When the same data and fine-tuning method are applied to other models, the results are good, so the issue is likely model-specific rather than data- or pipeline-related.
Key Questions I Would Like to Ask
Q1: What could cause this “quality degradation + random language switching” after fine-tuning?
Q2: From the SFT / fine-tuning perspective, how can I fix this?
I would appreciate advice in the following areas:
Data construction
- How should the dataset be structured?
- Should I mix in some high-quality base-model data to prevent catastrophic forgetting?
- Should I explicitly enforce language consistency?
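On the replay question above, one common mitigation is to mix a fixed fraction of high-quality general-purpose examples back into the domain SFT set. A minimal sketch, assuming your examples are simple Python objects and that a replay ratio around 0.1–0.3 is a reasonable starting point (an assumption to tune, not a recommendation from this post):

```python
import random

def build_mixed_dataset(domain_examples, general_examples,
                        replay_ratio=0.2, seed=0):
    """Mix general-purpose 'replay' examples into a domain SFT set.

    replay_ratio is the fraction of the final dataset drawn from
    general_examples, intended to reduce catastrophic forgetting.
    """
    rng = random.Random(seed)
    # Number of general examples so they make up replay_ratio of the total.
    n_general = int(len(domain_examples) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(general_examples, min(n_general, len(general_examples)))
    mixed = domain_examples + replay
    rng.shuffle(mixed)  # avoid all replay data clustering at the end
    return mixed
```

The key design choice is keeping the ratio fixed and shuffled, so the general-domain gradient signal is spread across the whole run instead of concentrated in a few batches.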
Training / fine-tuning methods
Should I:
- Use continual pretraining instead of direct SFT?
- Mix short-form and long-form examples to stabilize long-context modeling?
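On mixing short- and long-form examples: one simple way to do this is a batching heuristic that guarantees every batch contains a few long examples, so long-context gradients appear throughout training rather than clustering. A hypothetical sketch (the threshold, batch size, and `length_fn` are all assumptions you would adapt to your tokenizer and data):

```python
import random

def mixed_length_batches(examples, length_fn, batch_size=8,
                         long_threshold=2048, long_per_batch=2, seed=0):
    """Build batches that each contain up to long_per_batch long examples.

    length_fn maps an example to its token count; long_threshold splits
    the data into 'short' and 'long' pools.
    """
    rng = random.Random(seed)
    short = [e for e in examples if length_fn(e) < long_threshold]
    long_ = [e for e in examples if length_fn(e) >= long_threshold]
    rng.shuffle(short)
    rng.shuffle(long_)
    batches = []
    # Stop when either pool can no longer fill a batch.
    while long_ and len(short) >= batch_size - long_per_batch:
        batch = [long_.pop() for _ in range(min(long_per_batch, len(long_)))]
        batch += [short.pop() for _ in range(batch_size - len(batch))]
        rng.shuffle(batch)  # don't always place long examples first
        batches.append(batch)
    return batches
```

This is only a sampling-side stabilizer; it does not replace long-context positional fixes (RoPE scaling etc.) if the base model's trained context is too short for your outputs.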
Inference
- Do decoding parameters (top-k, temperature, etc.) worsen formula-heavy tasks?
- Should I apply penalties or constraints to avoid language switching?
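On constraining language switching at inference time: one option is a logits-level penalty on English-only tokens. The sketch below works on a plain list of scores; in real use you would wrap this in a decoding hook (e.g. a Hugging Face `LogitsProcessor`), and `english_token_ids` would be built by scanning the tokenizer vocabulary for ASCII-letter-only tokens. Both of those are assumptions, not something from the original setup:

```python
def penalize_language_switch(scores, english_token_ids, penalty=5.0):
    """Return a copy of scores with a fixed penalty subtracted from the
    logits of tokens in english_token_ids, discouraging mid-generation
    drift from Chinese into English.

    A soft penalty is usually safer than hard masking, since formula-heavy
    text legitimately needs Latin letters and symbols.
    """
    return [s - penalty if i in english_token_ids else s
            for i, s in enumerate(scores)]
```

Pairing this with more conservative decoding (lower temperature, smaller top-k) is also worth testing for formula-heavy outputs, where sampling noise tends to compound across long derivations.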
Q3: Are there related research papers I can read?
Especially papers on:
- Fine-tuning degradation
- Language drift after SFT
- Mode collapse / repetition in long-form generation
- Stability of formula or symbolic reasoning tokens
- Long-context scaling techniques (RoPE / NTK / YaRN)
If there are known benchmarks or studies describing similar symptoms, I’d love to learn about them.
Summary
Using the same dataset and same fine-tuning pipeline, other models behave normally.
But this particular model shows:
- Quality degradation
- Long-context collapse
- Random switching between Chinese and English
I would greatly appreciate insights, suggestions, or relevant papers from the community.
Thanks in advance!