How to fine-tune on my own private data and then build a chatbot on it?

So far, the fine-tuning examples I have seen cover summarisation, chatbots for specific use cases, etc. However, I want to build a chatbot based on my own private data (hundreds of PDF and Word files). How can I fine-tune on this? The approach I am thinking of is (a rough sketch of step 1 follows the list):
1-> LoRA fine-tuning of the base Alpaca model on my own private data
2-> LoRA fine-tuning of the above model on some input/output prompts.
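
Below is a minimal sketch of what step 1 could look like with Hugging Face `transformers`, `peft`, and `datasets`. The base model name, the `my_private_docs.txt` file, and the hyperparameters are placeholders I am assuming, not a confirmed recipe:

```python
# Sketch: LoRA fine-tuning of a causal LM on raw text extracted from private documents.
# Base model, file path, and hyperparameters are illustrative placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base_model = "huggyllama/llama-7b"            # placeholder: any Alpaca-style base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token      # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Step 1: adapt to the domain by training on raw document text (one chunk per line).
dataset = load_dataset("text", data_files={"train": "my_private_docs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# LoRA: train small low-rank adapters instead of updating all model weights.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-domain", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, fp16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-domain")
# Step 2 would repeat the same loop on formatted instruction/response pairs.
```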

Is this a good technique for building a chatbot on private datasets? Could someone please suggest a good way of building a model based on private data?


Hi, I found Abishek Thakur's YouTube channel to be really helpful for fine-tuning; he posts incredible related content. As for the video I followed, I took the lesson from 1littlecoder on training Falcon-7B on Colab.
Here is the link: How-To Instruct Fine-Tuning Falcon-7B [Google Colab Included] - YouTube

@Saugatkafley, thank you for your response. I have already experimented with this type of training, which involves prompt-based fine-tuning, and it has been effective for me.
To elaborate further, consider a scenario where I possess private documents and wish to answer prompts based on that data, but the language model lacks any knowledge of these specific documents. Even if I fine-tune the model using the prompts demonstrated in the video, it would likely miss crucial information present in the private documents. I want the model to have knowledge of those documents as well.

Have you considered using a QA model?

Prompt: Tell me about X.
The QA/retrieval model retrieves relevant text chunks based on "Tell me about X".
The text chunks are placed into a new prompt that is then fed to the LLM: `f'{original_prompt} using this context: {relevant_text_chunks}'`. A minimal sketch of this flow is below.
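
Here is one way that retrieve-then-prompt flow could look, using `sentence-transformers` for embedding-based retrieval. The embedding model, the example chunks, and the prompt template are illustrative assumptions rather than the original poster's setup:

```python
# Sketch: retrieve relevant chunks for a question, then fold them into the LLM prompt.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Chunks extracted from the private PDF/Word files (placeholder content).
chunks = [
    "Project X was started in 2021 to automate invoice processing.",
    "The onboarding policy requires security training within 30 days.",
    "Quarterly reports are stored on the internal document share.",
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 2) -> str:
    """Retrieve the most similar chunks and insert them into the prompt."""
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=top_k)[0]
    relevant_text_chunks = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
    return f"{question} using this context: {relevant_text_chunks}"

print(build_prompt("Tell me about X."))
# The returned string is what you would feed to the (optionally LoRA-tuned) LLM.
```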


Have you looked into RAG (retrieval-augmented generation)?