No improvement on test data after finetuning

Hello all,

I’ve finetuned a LLaMA-like LLM (llm-stacking/StackLLM_410M_750BToken) that was pretrained with model-growth techniques. With the finetuned model at hand, I compared it against the initial model on my test data. Here are some details about the finetuning process:

Dataset: Wiki Auto (text simplification, 99k train, 1k eval and 8k test samples)
An example (picked at random):

Input: 'Make this text simpler: "A romantic friendship , passionate friendship , or affectionate friendship is a very close but typically non-sexual relationship between friends , often involving a degree of physical closeness beyond that which is common in the contemporary Western societies [.\n]"'
Output: 'A romantic friendship , passionate friendship or affectionate friendship is a close but non-sexual relationship between friends that often involves a degree of physical and emotional closeness [.\n]'
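For context, the instruction-style pairs are built roughly like this (a minimal sketch; the dataset id and the source/target column names are assumptions from memory, my actual preprocessing script differs):

from datasets import load_dataset

# Assumption: a Wiki Auto variant with a complex sentence in "source"
# and its simplified counterpart in "target".
raw = load_dataset("GEM/wiki_auto_asset_turk")

def to_pair(example):
    # Wrap the complex sentence in the simplification instruction.
    return {
        "input": f'Make this text simpler: "{example["source"]}"',
        "output": example["target"],
    }

pairs = raw["train"].map(to_pair, remove_columns=raw["train"].column_names)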

During training I use a constant learning rate (2e-5) and max_length = 512. I’ve also prepended a system prompt to every training example, like this:

SYSTEM_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."
)
example["input"] = (
    f"<<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
    f"[INST]\n{example['input']}\n[/INST]\n"
)
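The rest is a fairly standard Trainer setup; roughly like this (a sketch with placeholder paths and an assumed batch size, not my exact script):

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("llm-stacking/StackLLM_410M_750BToken")
model = AutoModelForCausalLM.from_pretrained("llm-stacking/StackLLM_410M_750BToken")

def tokenize(example):
    # Prompt and target are concatenated and truncated to the 512-token budget.
    text = example["input"] + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="stack_410m_m1",        # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="constant",      # constant LR, no warmup or decay
    num_train_epochs=3,
    per_device_train_batch_size=8,     # assumption, not my actual batch size
)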

After training, I checked SARI, ROUGE, and BLEU scores, expecting an improvement. Note that the new special tokens (<<SYS>>, <</SYS>>, [INST], [/INST]) were also added to the tokenizer.
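Token registration looks roughly like this (a sketch, continuing from the loading snippet above; the embedding matrix has to be resized after adding tokens so the new ids actually have embedding rows):

special_tokens = {
    "additional_special_tokens": ["<<SYS>>", "<</SYS>>", "[INST]", "[/INST]"]
}
tokenizer.add_special_tokens(special_tokens)

# Grow the input/output embeddings to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))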

Finetuned model:
{
  "bleu": 0.040806923682440786,
  "predict_runtime": 6586.8188,
  "rouge1": 0.13235145023440137,
  "rouge2": 0.08571665482229097,
  "rougeL": 0.12005023055259136,
  "run_name": "…/models/stack_410m_m1/test",
  "sari": 47.72432723208482
}

compared to the initial model:
{
  "bleu": 0.040121076764105444,
  "predict_runtime": 1652.42,
  "rouge1": 0.1798148817997393,
  "rouge2": 0.11915879579507814,
  "rougeL": 0.16164123735653954,
  "run_name": "…/models/stack_410m_m0/test",
  "sari": 47.569474387579355
}
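The scores are computed with the evaluate library, roughly like this (a sketch with toy placeholder data; in the real run, sources are the complex test sentences, predictions the decoded generations, and references the gold simplifications):

import evaluate

sources = ["A romantic friendship is a very close but typically non-sexual relationship ."]
predictions = ["A romantic friendship is a close but non-sexual relationship ."]
references = [["A romantic friendship is a close but non-sexual relationship ."]]

sari = evaluate.load("sari")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

scores = {
    "sari": sari.compute(sources=sources, predictions=predictions, references=references)["sari"],
    "bleu": bleu.compute(predictions=predictions, references=references)["bleu"],
    **rouge.compute(predictions=predictions, references=[r[0] for r in references]),
}
print(scores)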

I’m wondering what is wrong with my finetuned model, since the initial model outperforms it on the ROUGE scores and the improvement in SARI is quite small (negligible, I would say).

Potential reasons that came to my mind:

  • Overfitting to the training data (the loss decreased steadily with each epoch; 3 epochs total)
  • Low quality of training data
  • The model size is too small to achieve such an improvement (410M parameters)
  • An unsuitable learning-rate schedule (constant LR; see the sketch after this list)
  • The system prompt is unnecessary
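For the learning-rate point, what I have in mind is simply swapping the constant schedule for warmup plus cosine decay, roughly like this (the values are guesses, nothing I have validated yet):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="stack_410m_m1_cosine",   # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="cosine",          # decay towards zero instead of constant
    warmup_ratio=0.03,                   # short warmup at the start of training
    num_train_epochs=3,
)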

PS: I also tested inference on a sample of the training dataset to see whether overfitting occurs (a short sketch of that check follows the numbers). Here are the results:

Finetuned model:
{
  "bleu": 0.051644126350148596,
  "predict_runtime": 804.9154,
  "rouge1": 0.1435397489247206,
  "rouge2": 0.10497246085615825,
  "rougeL": 0.13358851856751341,
  "run_name": "…/models/stack_410m_m1/test_sanity_check",
  "sari": 50.117559958469016
}

Initial model:
{
  "bleu": 0.04839316130281333,
  "predict_runtime": 812.8757,
  "rouge1": 0.1919509731503432,
  "rouge2": 0.13766835268258099,
  "rougeL": 0.17782691229698713,
  "run_name": "…/models/stack_410m_m0/test_sanity_check",
  "sari": 49.52337018081088
}
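For completeness, the sanity check is just the same generation-plus-metrics pipeline as for the test set, pointed at a random slice of the training data; evaluate_model below is a hypothetical stand-in for that pipeline:

# Run the usual evaluation on a slice of the training data to see whether
# the finetuned model at least reproduces what it was trained on.
train_sample = pairs.shuffle(seed=42).select(range(1000))
train_scores = evaluate_model(model, tokenizer, train_sample)  # hypothetical helper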

Any suggestion or idea is much appreciated!


Hello all who visit this page: it turned out that the 410M model does not have the capacity required to generate simplified text well. I am now trying 3B models and can already see an improvement in the SARI score.
