No improvement on test data after finetuning

Hello all,

I’ve finetuned a LLaMA-like LLM (llm-stacking/StackLLM_410M_750BToken) that was pretrained with model-growth techniques. With the finetuned model at hand, I compared it against the initial model on my test data. Here are some details about the finetuning process:

Dataset: Wiki Auto (text simplification, 99k train, 1k eval and 8k test samples)
An example (picked at random):

Input: 'Make this text simpler: "A romantic friendship , passionate friendship , or affectionate friendship is a very close but typically non-sexual relationship between friends , often involving a degree of physical closeness beyond that which is common in the contemporary Western societies [.\n]"'
Output: 'A romantic friendship , passionate friendship or affectionate friendship is a close but non-sexual relationship between friends that often involves a degree of physical and emotional closeness [.\n]'
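For context, the instruction-style pairs are built roughly like this (a minimal sketch; the dataset id and the source/target column names are assumptions from memory, my actual preprocessing script differs):

from datasets import load_dataset

# Assumption: a Wiki Auto variant with a complex sentence in "source"
# and its simplified counterpart in "target".
raw = load_dataset("GEM/wiki_auto_asset_turk")

def to_pair(example):
    # Wrap the complex sentence in the simplification instruction.
    return {
        "input": f'Make this text simpler: "{example["source"]}"',
        "output": example["target"],
    }

pairs = raw["train"].map(to_pair, remove_columns=raw["train"].column_names)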

During training I use a constant learning rate (2e-5) and max_length = 512. I’ve also prepended a system prompt to every training example, like this:

SYSTEM_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."
)
example["input"] = (
    f"<<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
    f"[INST]\n{example['input']}\n[/INST]\n"
)
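The rest is a fairly standard Trainer setup; roughly like this (a sketch with placeholder paths and an assumed batch size, not my exact script):

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("llm-stacking/StackLLM_410M_750BToken")
model = AutoModelForCausalLM.from_pretrained("llm-stacking/StackLLM_410M_750BToken")

def tokenize(example):
    # Prompt and target are concatenated and truncated to the 512-token budget.
    text = example["input"] + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="stack_410m_m1",        # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="constant",      # constant LR, no warmup or decay
    num_train_epochs=3,
    per_device_train_batch_size=8,     # assumption, not my actual batch size
)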

After training, I checked SARI, ROUGE, and BLEU scores, expecting an improvement. Note that the new special tokens (<<SYS>>, <</SYS>>, [INST], [/INST]) were also added to the tokenizer.
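Token registration looks roughly like this (a sketch, continuing from the loading snippet above; the embedding matrix has to be resized after adding tokens so the new ids actually have embedding rows):

special_tokens = {
    "additional_special_tokens": ["<<SYS>>", "<</SYS>>", "[INST]", "[/INST]"]
}
tokenizer.add_special_tokens(special_tokens)

# Grow the input/output embeddings to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))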

Finetuned model:
{
  "bleu": 0.040806923682440786,
  "predict_runtime": 6586.8188,
  "rouge1": 0.13235145023440137,
  "rouge2": 0.08571665482229097,
  "rougeL": 0.12005023055259136,
  "run_name": "…/models/stack_410m_m1/test",
  "sari": 47.72432723208482
}

compared to the initial model:
{
  "bleu": 0.040121076764105444,
  "predict_runtime": 1652.42,
  "rouge1": 0.1798148817997393,
  "rouge2": 0.11915879579507814,
  "rougeL": 0.16164123735653954,
  "run_name": "…/models/stack_410m_m0/test",
  "sari": 47.569474387579355
}
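The scores are computed with the evaluate library, roughly like this (a sketch with toy placeholder data; in the real run, sources are the complex test sentences, predictions the decoded generations, and references the gold simplifications):

import evaluate

sources = ["A romantic friendship is a very close but typically non-sexual relationship ."]
predictions = ["A romantic friendship is a close but non-sexual relationship ."]
references = [["A romantic friendship is a close but non-sexual relationship ."]]

sari = evaluate.load("sari")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

scores = {
    "sari": sari.compute(sources=sources, predictions=predictions, references=references)["sari"],
    "bleu": bleu.compute(predictions=predictions, references=references)["bleu"],
    **rouge.compute(predictions=predictions, references=[r[0] for r in references]),
}
print(scores)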

I’m wondering what is wrong with my finetuned model, since the initial model outperforms it on the ROUGE scores and the improvement in SARI is quite small (negligible, I would say).

Potential reasons that came to my mind:

  • Overfitting to the training data (the loss decreased steadily with each epoch; 3 epochs total)
  • Low quality of training data
  • The model size is too small to achieve such an improvement (410M parameters)
  • An unsuitable learning-rate schedule (constant LR; see the sketch after this list)
  • The system prompt is unnecessary
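For the learning-rate point, what I have in mind is simply swapping the constant schedule for warmup plus cosine decay, roughly like this (the values are guesses, nothing I have validated yet):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="stack_410m_m1_cosine",   # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="cosine",          # decay towards zero instead of constant
    warmup_ratio=0.03,                   # short warmup at the start of training
    num_train_epochs=3,
)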

PS: I also tested inference on a sample of the training dataset to see whether overfitting occurs (a short sketch of that check follows the numbers). Here are the results:

Finetuned model:
{
  "bleu": 0.051644126350148596,
  "predict_runtime": 804.9154,
  "rouge1": 0.1435397489247206,
  "rouge2": 0.10497246085615825,
  "rougeL": 0.13358851856751341,
  "run_name": "…/models/stack_410m_m1/test_sanity_check",
  "sari": 50.117559958469016
}

Initial model:
{
  "bleu": 0.04839316130281333,
  "predict_runtime": 812.8757,
  "rouge1": 0.1919509731503432,
  "rouge2": 0.13766835268258099,
  "rougeL": 0.17782691229698713,
  "run_name": "…/models/stack_410m_m0/test_sanity_check",
  "sari": 49.52337018081088
}
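For completeness, the sanity check is just the same generation-plus-metrics pipeline as for the test set, pointed at a random slice of the training data; evaluate_model below is a hypothetical stand-in for that pipeline:

# Run the usual evaluation on a slice of the training data to see whether
# the finetuned model at least reproduces what it was trained on.
train_sample = pairs.shuffle(seed=42).select(range(1000))
train_scores = evaluate_model(model, tokenizer, train_sample)  # hypothetical helper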

Any suggestion or idea is much appreciated!


Hello all who visit this page: it turned out that the 410M model does not have the capacity required to generate simplified text well. I am now trying 3B models and can already see an improvement in the SARI score.
