Hello everyone,
I’m fine-tuning a LLaMA-3 model (meta-llama/Llama-3.2-3B-Instruct) to build a chatbot that answers questions related to a specific topic (Smartlog and its services). However, I’m encountering a strange issue:
What I’m Doing:
- Preparing the dataset:
  - The dataset consists of question-answer pairs with additional metadata (category, context).
  - I use the SFTTrainer for fine-tuning.
  - I preprocess the data by formatting it into a chat-like structure:
def preprocess_data(self, dataset):
    tokenizer = AutoTokenizer.from_pretrained(self.model_name, token=config.TOKEN_READ_HUGGING_FACE)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    def format_chat_template(row):
        row_json = [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"] + tokenizer.eos_token}
        ]
        row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
        return row

    tokenized_datasets = dataset.map(
        format_chat_template,
        num_proc=4,
    )
    return tokenized_datasets, tokenizer
- Training with SFTTrainer:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,
)
The Problem:
When I test the fine-tuned model in chat format, the following happens:
- At first, the model gives a coherent answer to the prompt.
- Immediately after, it continues generating text with unrelated questions or statements, extending the response in a way that makes no sense.
Example:
Input:
"What is Smartlog's main purpose?"
Output:
"Smartlog's main purpose is to optimize logistics processes. What are some common challenges in logistics? How do companies overcome these challenges? Can you provide examples of successful implementations of Smartlog’s services?"
This issue happens even if the original answer in the training dataset is short and direct.
Any help or suggestions would be greatly appreciated! Thank you in advance! 

1 Like
There have been some bugs reported for the Llama 3.1 models and model classes, but I haven’t seen anything like that for 3.2. As far as I can tell from checking with Hugging Chat, it seems likely that there is a problem with the way EOS tokens are handled.
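A quick check is to print what the tokenizer and model actually treat as EOS (a minimal sketch, assuming model and tokenizer are already loaded with the usual transformers APIs):

# Which token(s) will generation actually stop on?
print(tokenizer.eos_token, tokenizer.eos_token_id)       # the tokenizer's EOS token and id
print(model.generation_config.eos_token_id)              # id(s) generate() stops on by default
print(tokenizer.convert_tokens_to_ids("<|eot_id|>"))     # Llama 3's end-of-turn token id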
Based on the issue you’re facing, it seems that the model is continuing to generate text beyond the expected answer. This can happen due to several reasons related to how the model is fine-tuned and how the prompt is structured. Below are some possible solutions and explanations:
1. Prompt Formatting Issues
- Your format_chat_template function appends tokenizer.eos_token to the end of the answer. This might be causing the model to interpret the generated text as a continuation of the conversation, leading to the generation of unrelated questions or statements.
- Try removing tokenizer.eos_token from the answer or adjusting the prompt format to clearly separate messages. For example, ensure that each message (user or assistant) is properly separated by a newline or a specific token [4].
Modification Suggestion:
row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)  # with row_json built without the manually appended eos_token
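To confirm what the template produces, it can help to print one formatted example and check that the assistant turn is already closed with <|eot_id|> (a quick sanity check; the question and answer below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
example = [
    {"role": "user", "content": "What is Smartlog's main purpose?"},
    {"role": "assistant", "content": "Smartlog's main purpose is to optimize logistics processes."},
]
# The rendered string should already end the assistant turn with <|eot_id|>;
# if it does, manually appending tokenizer.eos_token is redundant.
print(tokenizer.apply_chat_template(example, tokenize=False))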
2. Model Fine-Tuning Configuration
- The SFTTrainer uses a specific loss function and training strategy for fine-tuning. If the model is still generating unrelated text, it might mean that the training process is not properly reinforcing the task-specific behavior (see the sketch after this list).
- Consider adding a system prompt to guide the model’s responses more effectively. For example:
system_prompt = "You are a helpful assistant that provides concise answers to questions about Smartlog and its services. Do not ask follow-up questions unless explicitly instructed."
chat = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": row["question"]},
    {"role": "assistant", "content": row["answer"]}
]
- This can help constrain the model’s responses to the specific task [3].
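One concrete way to reinforce the task-specific behavior is to compute the loss only on the assistant tokens. Below is a minimal sketch using TRL’s DataCollatorForCompletionOnlyLM; it assumes a TRL version whose SFTTrainer still accepts tokenizer= and a plain "text" column, that packing is disabled, and the exact response_template string may need adjusting to your tokenizer version:

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Llama 3's chat template opens the assistant turn with these special tokens;
# everything up to and including them is masked out of the loss.
response_template = "<|start_header_id|>assistant<|end_header_id|>"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,        # packing must be off for completion-only loss
    data_collator=collator,
)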
3. Model Bias or Generalization
- AI models, including LLaMA-3, can exhibit biases or generalize in unexpected ways, especially if the training data is not sufficiently diverse or if the prompt format is not strictly enforced [2].
- To mitigate this, ensure that your dataset includes clear examples of concise, task-specific responses without follow-up questions. This can help the model learn to stop after providing the relevant answer.
4. Post-Processing or Termination Tokens
- Make sure decoding actually stops at the end of the assistant turn. If generation only stops on <|end_of_text|> (or never reaches an EOS), the model will keep sampling past the intended answer; pass the appropriate eos_token_id values (or a stopping criterion) to generate, and truncate the output in post-processing if needed.
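A generation-time sketch (this assumes model and tokenizer are already loaded and that the tokenizer includes Llama 3’s <|eot_id|> end-of-turn token; the prompt is just an example):

# Stop on both the tokenizer's EOS and Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is Smartlog's main purpose?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    eos_token_id=terminators,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))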
5. Check for Unrelated Data in Fine-Tuning
- Verify that your training data does not contain any examples of follow-up questions or unrelated text that the model might be overfitting to.
- Clean your dataset to ensure that each example is a clear question-answer pair without additional text.
6. Inspect the Fine-Tuned Model
- If the issue persists, test the model with different prompt formats or fewer examples to isolate whether the problem is with the fine-tuning process or the data.
- Use a model checkpoint from earlier stages of training to see if the behavior improves.
Example Code Modification
Here’s a revised version of your preprocess_data function, incorporating some of these suggestions:
def preprocess_data(self, dataset):
    tokenizer = AutoTokenizer.from_pretrained(self.model_name, token=config.TOKEN_READ_HUGGING_FACE)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    def format_chat_template(row):
        system_prompt = "You are a helpful assistant that provides concise answers to questions about Smartlog and its services. Do not ask follow-up questions."
        row_json = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]}
        ]
        row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
        return row

    tokenized_datasets = dataset.map(
        format_chat_template,
        num_proc=4,
    )
    return tokenized_datasets, tokenizer
Final Thoughts
The issue likely stems from how the prompt is formatted or how the model is being guided during fine-tuning. By enforcing a clear format, adding a system prompt, and ensuring proper termination of responses, you can reduce the likelihood of the model generating unrelated text.
1 Like
Hey! You’re really close, and the issue you’re seeing is a classic one when fine-tuning chat-style models: the model doesn’t know when to stop semantically, only syntactically. Even though you’re using eos_token, that’s a rigid, low-level stop signal.
What’s missing here is a soft boundary — a symbolic signal that the assistant has fulfilled the intent, not just hit the end of a string. Right now, the model is still “looking for what else to say” because it was trained in a format that doesn’t let it disperse its weights or close a loop of meaning.
You might want to:
- Add a meta-tag like [END OF ANSWER] inside your content, and train the model to stop on that instead of just the EOS token (see the sketch at the end of this post for stopping on it at inference time).
- Or consider encoding topic closure explicitly in your prompt/response pair (e.g., a system message indicating concise replies only).
The structure is not wrong, but it’s too rigid to capture semantic closure — which is what your model is currently missing.
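If you try the meta-tag route, you also need to stop generation on the tag at inference time and strip it before showing the reply. A minimal sketch with a custom StoppingCriteria (it assumes model, tokenizer, and a tokenized prompt inputs are already set up; [END OF ANSWER] is just the tag suggested above):

from transformers import StoppingCriteria, StoppingCriteriaList

STOP_TAG = "[END OF ANSWER]"

class StopOnTag(StoppingCriteria):
    """Stop decoding as soon as the newly generated text contains the meta-tag."""

    def __init__(self, tokenizer, prompt_len):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:], skip_special_tokens=True)
        return STOP_TAG in new_text

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    stopping_criteria=StoppingCriteriaList([StopOnTag(tokenizer, inputs.shape[-1])]),
)
answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
answer = answer.split(STOP_TAG)[0].strip()  # remove the tag before showing the reply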
2 Likes
OK, a thousand thanks! I’m trying your proposals now… I’ll report back on my progress.
1 Like
Good morning everybody!
As I mentioned before, I’m trying to apply the suggestions from @John6666 and @aaac12345.
I have incorporated a system prompt into the training QA dataset and added a meta-tag [END OF ANSWER] to try to clearly mark the end of the assistant’s response. You can see the changes here:
def format_chat_template(row):
    SYSTEM_PROMPT = (
        f"You are an assistant providing answers related to the following context: {config.TOPIC}.\n"
        f"Finish your response with this meta-tag: [END OF ANSWER]\n"
    )
    row_json = [
        {"role": "system", "content": SYSTEM_PROMPT},  # <===========
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"] + " [END OF ANSWER]"}  # <===========
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row
After making these changes, when I launch the chatbot, it no longer adds questions to the end of its response, which is good. However, a new problem has emerged:
- The model outputs the [END OF ANSWER] meta-tag repeatedly at the end of each response.
- Also, the responses are very short and not specific.
For example, if I ask about the services of the company, the chatbot replies:
“A lot of services [END OF ANSWER] [END OF ANSWER] [END OF ANSWER] [END OF ANSWER]…”
This is not expected since the QA dataset used for training contains much longer and detailed answers.
My questions:
- Is there a way to properly detect the end of the assistant’s response? Should I handle the [END OF ANSWER] tag differently during generation, or is this issue caused by the training process itself?
- Why are the answers so short and lacking detail? The training data contains complete, descriptive answers, so I don’t understand why the model is responding so briefly.
Thank you very much! I really appreciate your help and guidance. 
2 Likes
f"Finish your response with this meta-tag: [END OF ANSWER]\n"
{"role": "system", "content": SYSTEM_PROMPT},  # <===========
{"role": "assistant", "content": row["answer"] + " [END OF ANSWER]"}  # <===========
I think you’ve added [END OF ANSWER] twice (once via the instruction in the system prompt and once appended directly to the assistant message).
I’m not sure why the response is so short…
Is max_new_tokens too small, or did something go wrong when training the model…?
I don’t think you’re aiming for long context this time, but it’s possible you’ve done something that is the opposite of what you’d do when training a long-context model.
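For what it is worth, a minimal sketch that keeps the tag in one place only (appended to the answer, with the repeated instruction removed from the system prompt; whether this alone fixes the repetition is an assumption worth testing):

def format_chat_template(row):
    # The system prompt no longer tells the model to emit the tag; the tag appears
    # exactly once, appended to the training answer.
    SYSTEM_PROMPT = (
        f"You are an assistant providing answers related to the following context: {config.TOPIC}.\n"
    )
    row_json = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"] + " [END OF ANSWER]"}
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

At inference, pair this with a reasonably large max_new_tokens and a stopping criterion on the tag, as sketched earlier in the thread.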
1 Like