Fine-tuned Model Shows Severe Quality Degradation (Long-Form + Formula Reasoning Task). Possible Causes and Solutions?

Hi everyone,
I’m working on a domain-specific fine-tuning task involving formula calculations and long-form generation. After SFT, the model’s output quality drops significantly. I’d like to ask for insights into the possible causes and how to address them.

:pushpin: Problem Description

I performed SFT fine-tuning on a model for a task involving:

  • Domain-specific technical content
  • Formula calculations, derivations, structured explanations
  • Long outputs (often several thousand tokens)

After fine-tuning, the model exhibits the following issues:


1. Significant degradation in output quality

  • Logical inconsistency
  • More formula reasoning errors
  • Long-form generation collapses in the second half

2. Random language switching from Chinese → English

Sometimes the model suddenly switches to English mid-generation, even though all of the training data is in Chinese. (This never happens with the base model or with other models fine-tuned on the same data.)


3. Same data + same fine-tuning method

When the same data and fine-tuning method are applied to other models, the results are good, so the issue is likely model-specific rather than data- or pipeline-related.


:pushpin: Key Questions I Would Like to Ask

Q1: What could cause this “quality degradation + random language switching” after fine-tuning?


Q2: From the SFT / fine-tuning perspective, how can I fix this?

I would appreciate advice in the following areas:

:check_mark: Data construction

  • How should the dataset be structured?
  • Should I mix in some high-quality base-model data to prevent catastrophic forgetting?
  • Should I explicitly enforce language consistency?

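To make the "mix in base-model data" question concrete, here is the kind of replay mixing I have in mind. This is only a rough sketch: the JSONL file layout, the `replay_ratio` value, and the function name are placeholders, not anything from an established recipe.

```python
import json
import random

def mix_datasets(domain_path, general_path, replay_ratio=0.25, seed=0):
    """Blend domain SFT data with general-purpose 'replay' data to
    reduce catastrophic forgetting. replay_ratio is the fraction of
    the final mix drawn from the general set. Both files are JSONL,
    one example per line."""
    with open(domain_path, encoding="utf-8") as f:
        domain = [json.loads(line) for line in f]
    with open(general_path, encoding="utf-8") as f:
        general = [json.loads(line) for line in f]

    rng = random.Random(seed)
    # Number of general examples needed so they form replay_ratio
    # of the combined dataset.
    n_general = int(len(domain) * replay_ratio / (1 - replay_ratio))
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed
```

Is something like this (with, say, 20–30% general data) a reasonable starting point, or is there a better-established ratio for preventing forgetting?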
:check_mark: Training / fine-tuning methods

  • Should I:

    • Use continual pretraining instead of direct SFT?
    • Mix short-form and long-form examples to stabilize long-context modeling?

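For the short-form/long-form mixing question, what I currently imagine is deliberately interleaving the two length buckets rather than letting long samples cluster. A toy sketch (the `n_tokens` field, `boundary`, and `long_frac` are placeholders I made up for illustration):

```python
import random

def interleave_by_length(examples, boundary=2048, long_frac=0.5, seed=0):
    """Bucket examples by target length, then emit them with a fixed
    long/short ratio so long-context behavior is exercised throughout
    training instead of only where long samples happen to cluster."""
    rng = random.Random(seed)
    short = [e for e in examples if e["n_tokens"] < boundary]
    long_ = [e for e in examples if e["n_tokens"] >= boundary]
    rng.shuffle(short)
    rng.shuffle(long_)
    mixed = []
    while short or long_:
        # Pick a long example with probability long_frac, unless one
        # bucket is already exhausted.
        pick_long = long_ and (not short or rng.random() < long_frac)
        mixed.append(long_.pop() if pick_long else short.pop())
    return mixed
```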
:check_mark: Inference

  • Could decoding parameters (top-k, temperature, etc.) be making formula-heavy generation worse?
  • Should I apply penalties or constraints to avoid language switching?

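On the language-switching constraint, one low-tech mitigation I've been considering is hard-masking English word-piece tokens at decode time. Below is a deliberately framework-agnostic toy sketch (real tokenizers need care with byte-level and subword pieces; `vocab` here is just a plain id→string dict I'm assuming for illustration):

```python
import math

def build_ban_list(vocab):
    """Token ids whose surface form is purely ASCII letters — a crude
    proxy for 'English word pieces'. vocab maps token id -> string."""
    return {
        i for i, tok in vocab.items()
        if tok.strip() and all(c.isascii() and c.isalpha() for c in tok.strip())
    }

def suppress(logits, banned_ids):
    """Mask banned token ids before sampling; apply at each decode step."""
    return [
        -math.inf if i in banned_ids else score
        for i, score in enumerate(logits)
    ]
```

In a real stack this would presumably live in a logits-processor hook rather than a bare list, and for formula-heavy outputs I'd also pair it with greedy or low-temperature decoding. Does anyone know whether this kind of hard masking causes problems for math/LaTeX tokens that legitimately contain Latin letters?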
Q3: Are there related research papers I can read?

Especially papers on:

  • Fine-tuning degradation
  • Language drift after SFT
  • Mode collapse / repetition in long-form generation
  • Stability of formula or symbolic reasoning tokens
  • Long-context scaling techniques (RoPE / NTK / YaRN)

If there are known benchmarks or studies describing similar symptoms, I’d love to learn about them.


:pushpin: Summary

Using the same dataset and same fine-tuning pipeline, other models behave normally.
But this particular model shows:

  • Quality degradation
  • Long-context collapse
  • Random switching between Chinese and English

I would greatly appreciate insights, suggestions, or relevant papers from the community.
Thanks in advance! :folded_hands:


For now, here are the resources.