Fine-Tuning + RAG based Chatbot: Dataset Structure & Instruction Adherence Issues

:rocket: Fine-Tuning a Document-Based Chatbot – Issues and Questions

Hello everyone!
I am working on fine-tuning a chatbot that generates answers based on documents (RAG + Fine-tuning).
During the tuning process, I encountered several issues, and I would appreciate any insights or solutions from those with experience in this area.


:rocket: Question 1: How should the dataset be structured for training a document-based chatbot?
When training a model to generate document-based answers,
:white_check_mark: Should I use a question-answer dataset?
:white_check_mark: Or should I build a question-document-answer dataset?

I’d love to know the common approach! Rough sketches of both record shapes are below.
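For concreteness, here is a rough sketch of the two record shapes I am deciding between (the contents are made up, and the field names just follow the Alpaca-style layout I use below):

# Option A: question + answer only (no document in the training record)
qa_record = {
    "instruction": "Answer the employee's question about internal policy.",
    "input": "How many days of annual leave do new hires get?",
    "output": "New hires receive 15 days of annual leave in their first year.",
}

# Option B: question + document + answer, so the model learns to ground its
# answer in the same kind of retrieved passage it will see at inference time.
q_doc_a_record = {
    "instruction": "Answer using only the reference document. If the answer is not "
                   "in the document, say you could not find relevant information.",
    "input": "Question: How many days of annual leave do new hires get?\n"
             "Reference document: Employees accrue 1.25 leave days per month ...",
    "output": "New hires accrue 1.25 days per month, i.e. 15 days in their first year.",
}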


:rocket: Question 2: Issues encountered after experimenting with two training methods

:one: Training with a Question + Answer dataset
:heavy_check_mark: The responses were natural, but hallucination (incorrect information generation) occurred.
:heavy_check_mark: The model generated answers even when the provided document contained no relevant information.
:heavy_check_mark: To prevent this, I added the following instructions to the inference-time prompt:

  • “If the document does not contain relevant information for the question, respond with: ‘Sorry, I couldn’t find any relevant information.’”
  • “End the response with: ‘Thank you :):)’”
    :heavy_check_mark: However, the model did not follow these instructions.

:pushpin: Here is the prompt I used:

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

:heavy_check_mark: I included multiple instructions in the Instruction section, but the model did not adhere to them.
:heavy_check_mark: Additionally, I did not include documents during training but added them only at inference time, roughly as in the sketch below.
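For reference, this is roughly how the template gets filled at inference time (the question and document values are placeholders; during training the Input slot was empty):

instruction = (
    "Answer the question using only the reference document. "
    "If the document does not contain relevant information, respond with: "
    "'Sorry, I couldn't find any relevant information.' "
    "End the response with: 'Thank you :):)'"
)
question = "How many days of annual leave do new hires get?"
retrieved_docs = "..."  # passages returned by the RAG retriever at inference time

# Fill the Alpaca template; the Response slot is left empty for generation.
prompt = alpaca_prompt.format(
    instruction,
    f"Question: {question}\nReference document: {retrieved_docs}",
    "",
)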


:two: Training with a Question + Document + Answer dataset
:heavy_check_mark: When training with documents, the generated answers were strange and inconsistent.
:heavy_check_mark: In some cases, the model directly copied parts of the document instead of generating a proper response.
:heavy_check_mark: The documents I used for training were quite long—could this be the reason?


:question: What is the standard approach for training a document-based chatbot?
For tasks involving document-based answer generation, how is training typically conducted?
Is there a better approach than what I have tried?

I would really appreciate any insights or advice! :blush:

#fine-tuning #llama #rag #instruction-tuning #hallucination #dataset-preparation #inference #prompt-engineering #large-language-models #document-based-chatbot

1 Like

Hi leedahyeon,
I have built several document chatbots; however, none of them needed fine-tuning. So I’m just sharing this in case you’re open to exploring an alternative solution…

I remember that for one particular project, my research into whether we should fine-tune led me to the conclusion that fine-tuning was not necessary (at least for my client’s chatbot requirements). My conclusion was that fine-tuning can cause a model to lose its ability to generalise. Off the shelf, LLMs have already been fine-tuned in their final training stages to generalise quite well, and my experience is that I could get great results by instead using a carefully constructed prompt to help the LLM focus on the specific requirements.

My general advice would be to fine-tune a model only after you have tested various prompt engineering strategies first. In many cases, better results can be achieved with a well-engineered prompt. This is something that even OpenAI recommends:
https://platform.openai.com/docs/guides/fine-tuning#when-to-use-fine-tuning

If you do plan to go down the road of fine-tuning, then after exhausting various prompt engineering strategies and testing, you will have a much better understanding of what the prompt and generated response should look like. These prompt/response pairs can then be used in the fine-tuning process to guide the LLM towards the expected responses.

It is recommended to have a dataset of about 50 to 100 prompt/response pairs for fine-tuning. These pairs will emerge naturally from the previous step of prompt testing, where they can be collected and refined.
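As a rough illustration only (the schema and file name are arbitrary; use whatever format your fine-tuning framework expects):

import json

# Hypothetical prompt/response pairs collected during prompt testing.
pairs = [
    {
        "prompt": "Question: What is the daily travel reimbursement limit?\n"
                  "Reference document: Travel expenses are reimbursed up to $75 per day ...",
        "response": "The reimbursement limit is $75 per day, as stated in the travel policy.",
    },
    {
        "prompt": "Question: Can I expense a home office chair?\n"
                  "Reference document: Travel expenses are reimbursed up to $75 per day ...",
        "response": "Sorry, I couldn't find any relevant information.",
    },
]

with open("finetune_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

Including a few "no relevant information" pairs like the second one is one way to teach the refusal behaviour you are after.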

It is also worth mentioning that the length of the prompt and response in each pair should be carefully considered. If, for example, the training responses are short, the fine-tuned model will most likely respond in kind and generate shorter responses than you may have expected. Therefore, it is important to provide full and complete prompt/response pairs of the kind you expect at inference time.

The above is really general advice, so apply it accordingly; I don’t know your specific requirements.

2 Likes

This is why I came here. What a quality response. Thank you for taking the time.

I agree that prompting is its own form of fine-tuning. However, long prompts eat into your context and burn tokens, do they not?

In my use case I’m designing some bespoke “AI personas” intended for long conversations. As such, I’m drawn to the idea of LoRA fine-tuning. I’m using fireworks.ai.

Thoughts?

2 Likes

Hi leebase,
Here are my thoughts. I think it depends on how long your prompts would need to be to realise your AI personas. They may not need to be as long as you might think. Generally, a good prompt would include instructions and rules along with a few examples to follow. Such a prompt would be fairly small compared to the long conversational context, and it would act as a system-style prompt included with each new response generation.
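For example, a persona prompt might be no bigger than something like this (the persona, rules and example are entirely made up):

persona_system_prompt = """You are 'Aria', a thoughtful book-club companion.

Rules:
1. Always stay in character as Aria.
2. Keep answers under 150 words unless asked for more detail.
3. If you don't know something, say so rather than guessing.

Example:
User: What did you think of the ending?
Aria: Honestly, it split me in two! The reveal felt earned, but I wished the
epilogue had given the sister more room to breathe.
"""

This system-style prompt is resent with every generation, so its token cost stays small and fixed compared to the growing conversation history.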

I would weigh up the benefits of just using prompt engineering against fine-tuning, and consider whether preserving the LLM’s generalisation and general knowledge is more important to your use case, or whether token usage and cost matter more. It might be worth trying both and testing the results.

You could potentially test both directions on a small locally hosted model like llama3.2 and see how each affects the outcome. From my experience, smaller models tend to show the weaknesses of a particular method quickly and more obviously than larger models, where these effects may not be so noticeable. It would also be cheaper, or potentially free, to train locally while testing both methods out.
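For instance, one quick way to try the prompt-only direction against a small local model is the ollama Python client (a sketch only; the model name and messages are placeholders):

import ollama  # pip install ollama; assumes a local Ollama server is running

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are 'Aria', a thoughtful book-club companion."},
        {"role": "user", "content": "What did you think of the ending?"},
    ],
)
print(response["message"]["content"])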

Additionally, if you are concerned about your context window in long conversations, I would also consider adding RAG to your pipeline so that only the previous conversational context relevant to the current exchange is retrieved and added to the prompt, rather than the entire conversation history. This reduces the tokens needed and also helps the LLM focus much better on the current conversation.
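A bare-bones sketch of that idea using sentence-transformers for the embeddings (the model name and in-memory list are simplifications; a real pipeline would use a vector store):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model will do

# Past conversation turns, kept as plain strings for the sake of the sketch.
history = [
    "User mentioned they prefer sci-fi novels over fantasy.",
    "We chatted about the weather in April.",
    "User said their favourite author is Ursula K. Le Guin.",
]
history_embeddings = embedder.encode(history, convert_to_tensor=True)

def relevant_memories(current_message: str, top_k: int = 2) -> list[str]:
    """Return only the past turns most similar to the current message."""
    query_embedding = embedder.encode(current_message, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, history_embeddings)[0]
    top = scores.topk(k=min(top_k, len(history)))
    return [history[i] for i in top.indices.tolist()]

# Only these snippets, not the full history, get prepended to the prompt.
memory_context = "\n".join(relevant_memories("Any book recommendations for me?"))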

I have tried two different methods for RAG memories. The first was retrieving over the entire conversation history. This works, but it also needs an additional LLM step of internal dialogue to filter what was retrieved down to only what is relevant, which can get messy. The second approach, which I now prefer, is a system that learns important information from previous conversations through an emotional triggering system. If the conversation has sparked the LLM’s emotion triggers, it will “learn” from the conversation and store the information, using an internal reflection step to decide whether it should be a short-term, mid-term, or long-term memory. With this approach the RAG retrieval system is much cleaner and simpler to manage, and it tends to retrieve more relevant information rather than being overwhelmed with a lot of memorized noise.
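Very roughly, the tiering step could look something like this (the thresholds are placeholders, and the emotion score is assumed to come from the LLM reflection step described above):

from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    tier: str  # "short", "mid" or "long"

def store_if_triggered(turn: str, emotion_score: float, memories: list[Memory]) -> None:
    """Store a turn only if it crossed the emotion trigger, tiered by salience.

    emotion_score (0..1) is assumed to come from a separate LLM reflection call
    that rates how significant the turn was.
    """
    if emotion_score < 0.3:
        return  # not salient enough: don't memorise noise
    if emotion_score < 0.6:
        tier = "short"
    elif emotion_score < 0.85:
        tier = "mid"
    else:
        tier = "long"
    memories.append(Memory(text=turn, tier=tier))

memories: list[Memory] = []
store_if_triggered("User shared that they just got engaged!", emotion_score=0.9, memories=memories)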

Anyway, that’s probably more than you wanted to know but hopefully helpful.

4 Likes

It is most definitely NOT more than I WANT to know. It is more than I have the context to understand. I will save this and ponder. And yes, feed it to the AI to discuss the implications for my current plan.

I am quite grateful for this well thought out and generous-with-your-time post.

1 Like

BTW, I thought you might be interested in GPT-4o’s integration of your thoughts with the plans we had already been discussing.

ChatGPT Integration of your feedback with my plans

That’s a solid, well-thought-out response. Let’s break it down in the context of your LoRA fine-tuning plans and your broader business model.

Key Takeaways for Your LoRA Plans

  1. Prompt Engineering vs. LoRA Fine-Tuning

• The response suggests that your AI personas might not require extensive fine-tuning if well-crafted prompts can achieve the desired behavior.

Test Idea: Try refining your personas using advanced prompt engineering before investing too much into LoRA fine-tuning.

Consideration: If your personas are meant to be adaptable across multiple scenarios while still retaining strong consistency, LoRA fine-tuning might still be useful.

  2. Preserving Generalization vs. Cost Trade-Off

• Fine-tuning with LoRA narrows down the model’s response style, potentially losing some general-purpose knowledge.

• If generalization is crucial (e.g., clients want AI that retains broad knowledge while behaving a certain way), LoRA may not be the best approach.

Test Idea: Compare a fine-tuned model with an optimized prompt-based approach to see which retains better general knowledge while meeting your persona goals.

  3. Local Testing with LLaMA 3.2

• This is a great idea for your use case!

• Since you’re already running models locally (like with FireworksAI), using LLaMA for small-scale testing can help identify issues before scaling up.

• Smaller models exaggerate weaknesses, making them ideal for testing the effectiveness of your methods.

For Your Business Model

  1. Handling Long Conversations: RAG Integration

• Your AI personas are meant for deep, engaging discussions, so managing long conversations efficiently is crucial.

• The suggested RAG-based approach with memory filtering is interesting: prioritizing emotionally triggered data rather than storing everything.

This could be a unique selling point for your AI personas, ensuring they recall only the most relevant context instead of overwhelming clients with excessive memory.

  2. Scalability & Cost Efficiency

• If you’re considering a business model where customers use these AI personas regularly, token cost efficiency is key.

• LoRA fine-tuning increases inference efficiency but requires upfront training investment.

• A hybrid approach—structured prompting with RAG and selective LoRA fine-tuning—might be the best balance for both customization and cost control.

Suggested Next Steps

  1. Prototype a persona using ONLY advanced prompt engineering and test its depth.

  2. Run small-scale LoRA fine-tuning (on LLaMA 3.2 locally) to compare effectiveness.

  3. Implement a RAG memory system with emotion-based filtering and test relevance.

  4. Decide on the best balance between efficiency, customization, and cost for your business model.

Your instincts about fine-tuning are still solid, but this response provides an alternative path that could reduce complexity and costs while maintaining flexibility. It’s definitely worth testing both approaches! :rocket:

2 Likes

Hello everyone, thank you for sharing your insights!
I have considered your feedback. Before diving into fine-tuning, I wanted to test the impact of system message tuning on the base model. However, I ran into an issue: even the base model does not consistently follow the system message instructions.

To check whether the system message is being properly applied, I used the following prompt:

system_message = (
    "You are an **internal policy expert AI** responsible for answering employees' questions! "
    "Follow these rules strictly!\n\n"
    "1. You must only refer to the provided internal documents when answering questions!\n"
    "   - Never assume or generate information that is not in the documents!\n"
    "   - Provide concise and to-the-point answers that directly address the core of the question!\n\n"
    "2. If the answer cannot be found in the internal documents, respond with the following sentence only:\n"
    "   - 'It is difficult to determine based solely on the internal documents! Please contact the relevant team!'\n\n"
    "3. End every sentence with an exclamation mark (!) instead of a period (.)!\n\n"
    "4. Every response must end with the following sentence:\n"
    "   - 'Thanks :) Come back anytime!'\n\n"
    "Strictly follow these rules and provide employees with prompt and accurate responses!"
)

However, when running inference, the base model does not fully adhere to these instructions—at best, it only applies one or two of the rules inconsistently.

I tested this with both:

  • unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit
  • meta-llama/Meta-Llama-3.1-8B-Instruct

But the issue persists in both cases. I’m trying to determine whether this is a model limitation or an issue in my code implementation.

  1. Inference with Unsloth Model
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"query: {query}"},
    # Note: the reference docs are sent as a second system message after the user turn.
    {"role": "system", "content": f"reference docs:\n{docs}"},
]

# Build the Llama 3.1 chat prompt and move the token IDs to the model's device.
inputs = self.tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(self.device)

outputs = self.model.generate(
    input_ids=inputs,
    pad_token_id=self.tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=True,
    temperature=creativity_index,
    top_p=diversity_index,
    repetition_penalty=frequency_index + 1,
    penalty_alpha=regnt_penalty_index,
)

# Decode only the newly generated tokens (everything after the prompt).
generated_text = self.tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens=True)[0]

  2. Inference with Standard Model
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"query: {query}"},
    {"role": "system", "content": f"reference docs:\n{docs}"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    inputs,
    max_new_tokens=256,  # Limit response length
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.05,  # Prevent repetitive outputs
)

# Decodes the full sequence (prompt + completion) and keeps only the last line;
# unlike the slicing above, this can discard earlier lines of a multi-line answer.
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).split("\n")[-1].strip()

Could you provide feedback on this issue? Specifically, I’m trying to determine whether the problem stems from the model itself or from my implementation.
Thanks.

2 Likes

Hi leedahyeon,
These are fairly small models, so this is most likely due to the nature of such small models. My observation has been that the longer the system prompt, the less they adhere to it, so you are most likely seeing that phenomenon here. Also check the documentation for the specific models you are using for the temperature values considered best for maximising accuracy (unless you need more creativity). You may find that high temperature values cause the model to veer away from your instructions from time to time, depending on how “creative” it is behaving.

You could also add another step per generated response where the LLM validates the response against your rules. Small models work much better when the total number of input tokens is kept low, including the instructions. So if you break your system prompt up into single instructions and check whether the response satisfies each instruction inside a loop, you can help the LLM focus on accuracy. If any of the validations fail, exit the loop and regenerate the LLM response. For each validation, you could also return a score instead of true or false; that way you can add another step that checks whether the response passes each rule against a weighted validation score, with more important rules given a higher weight. You can then loop the generative process (the first step) with a maximum number of retries, say 5, if it fails the first time, then sort the results and return the best answer. A skeleton of this loop is sketched below.
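A skeleton of that loop might look like this (the rules, weights and threshold are illustrative; judge stands in for the per-rule LLM validation call, and generate for your existing generation step):

from typing import Callable

# Illustrative rules and weights; more important rules carry a higher weight.
RULES: list[tuple[str, float]] = [
    ("Only uses information from the provided documents", 3.0),
    ("Ends every sentence with an exclamation mark", 1.0),
    ("Ends with 'Thanks :) Come back anytime!'", 2.0),
]
PASS_THRESHOLD = 0.8  # arbitrary cut-off for the weighted score
MAX_RETRIES = 5

def weighted_score(response: str, judge: Callable[[str, str], float]) -> float:
    """judge(response, rule) returns 0..1; in practice a small, single-rule LLM call."""
    total_weight = sum(weight for _, weight in RULES)
    return sum(judge(response, rule) * weight for rule, weight in RULES) / total_weight

def best_response(generate: Callable[[], str], judge: Callable[[str, str], float]) -> str:
    """Regenerate up to MAX_RETRIES times and return the highest-scoring answer."""
    candidates: list[tuple[float, str]] = []
    for _ in range(MAX_RETRIES):
        response = generate()
        score = weighted_score(response, judge)
        candidates.append((score, response))
        if score >= PASS_THRESHOLD:
            break  # passed the weighted validation, stop early
    return max(candidates)[1]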

Alternatively, if you can run a larger model like Mistral Nemo:12b or 12b-instruct-2407-fp16, I think you’ll find you get somewhat better results and may not need the validation steps (although they could still help). If your GPU is the limitation, you could try running it with Ollama, which can split the layers across GPU and CPU. It is slower, but it might provide some insights for your testing.

3 Likes