I am trying to train a Q/A chatbot on a very large dataset. I have the raw text itself and another dataset formatted as user/assistant conversations. I want the model to learn the contents of the text and then also answer questions about it.
I know RAG is definitely a good option here, but would fine-tuning twice also work in this case? The first fine-tune would run on the raw text so the model can learn from it directly, and the second on the Q/A conversations so it can act as a chatbot.
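For reference, here's roughly how I'm planning to prepare the two datasets - the chunk size, field names, and chat markers below are just placeholders, not from any specific framework:

```python
# Sketch of data prep for the two fine-tuning stages.
# Chunk size, field names, and chat markers are placeholder choices.

def chunk_raw_text(text, max_chars=2000):
    """Stage 1: split the raw corpus into chunks for continued training on the text."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def format_qa_pair(question, answer):
    """Stage 2: render one Q/A pair in a simple chat template."""
    return (
        "<|user|>\n" + question.strip() + "\n"
        "<|assistant|>\n" + answer.strip()
    )

corpus = "Some very long document ... " * 500
stage1_examples = chunk_raw_text(corpus)

qa_dataset = [{"question": "What is the doc about?", "answer": "A long document."}]
stage2_examples = [format_qa_pair(d["question"], d["answer"]) for d in qa_dataset]
```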
Your fine-tuning approach sounds interesting - you could also try leveraging the T5 models for this, which have been pre-trained on multiple tasks, including QA in the way you described.
But this approach could lead to overfitting: questions that are out of distribution but look similar to the fine-tuning data may get the training-set answers back, leading to wrong outputs, since the model has seen each training answer twice (once in the raw text and once in Q/A form).
RAG would definitely be worth checking out, with a vector database to do similarity search and retrieval - they're built for exactly the purpose you're aiming for, and retrieval could also be faster than regenerating knowledge from weights.
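To make the retrieval idea concrete, here's a toy sketch in plain Python. A real pipeline would embed text with a learned model (e.g. a sentence transformer) and index the vectors in a store like FAISS or Chroma; the bag-of-words counts here just illustrate the similarity-search step:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts instead of a learned vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "The treaty was signed in 1848 after the war ended.",
    "Photosynthesis converts sunlight into chemical energy.",
    "The chatbot answers questions about the source text.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    # Rank all indexed chunks by similarity to the query and return the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("When was the treaty signed?"))
```

The retrieved chunk(s) would then be prepended to the prompt before the model generates its answer, so the model quotes the text instead of relying on memorized weights.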