I’m trying to fine-tune a LLaMA model in a causal language modelling fashion (i.e. no instruction fine-tuning) on a domain-specific dataset, so that the model becomes more knowledgeable about that domain and a better starting point for instruction-based fine-tuning.
However, the fine-tuned model seems to just overfit to the training dataset, almost always producing responses similar in structure and content to the documents in the training set. The ideal outcome would instead be that the model learns the domain knowledge rather than the structure of the documents, without losing too much of its original capabilities.
My questions are the following:
- Has anyone had any experience with this?
- Is it even feasible to achieve the desired goal without resorting to pre-training from scratch?
- What can be done from a training perspective? E.g. does it make sense to gradually unfreeze weights, as used to be done with deep CNNs?
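The gradual-unfreezing idea boils down to a name-based filter over the model’s parameters: train only the top few transformer blocks (plus the final norm and LM head) and freeze everything below. The sketch below is only an illustration on a hypothetical 4-layer toy naming scheme that mirrors Hugging Face’s LLaMA parameter layout (`model.layers.N....`); in practice the real names and layer count should be read off `model.named_parameters()`.

```python
import re

def trainable_param_names(all_names, num_layers, unfreeze_top_k):
    """Names of parameters to leave trainable when only the top
    `unfreeze_top_k` transformer blocks (plus final norm and LM head)
    are unfrozen; everything else would get requires_grad = False."""
    cutoff = num_layers - unfreeze_top_k
    keep = set()
    for name in all_names:
        m = re.search(r"\blayers\.(\d+)\.", name)
        if m:
            if int(m.group(1)) >= cutoff:
                keep.add(name)            # one of the top-K blocks
        elif "lm_head" in name or name.endswith("norm.weight"):
            keep.add(name)                # final norm + output head
    return keep

# Hypothetical 4-layer toy naming, mirroring LLaMA's parameter layout:
names = (
    ["model.embed_tokens.weight"]
    + [f"model.layers.{i}.self_attn.q_proj.weight" for i in range(4)]
    + ["model.norm.weight", "lm_head.weight"]
)
print(sorted(trainable_param_names(names, num_layers=4, unfreeze_top_k=2)))
```

In an actual gradual-unfreezing schedule you would widen `unfreeze_top_k` every so many steps while keeping the learning rate low, so the lower layers are touched last and least.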
I am experiencing the same issue with a very similar task. The new model does seem to pick up some domain knowledge, but it massively loses the original model’s conversational/English capabilities when I do causal fine-tuning. I also observe frequent hallucinations related to my dataset. Please share if you have thoughts on any of these questions.
I thought about gradually unfreezing the model weights and training with a very low learning rate, but in my opinion that would alter the original model even more.
This is expected and one of the main areas of research now.
Think about it:
LLaMA was trained on 1.4 trillion tokens. If you fine-tune on 1 billion tokens (which is already a lot for fine-tuning), that is less than 0.1% of what it saw in pretraining, and that’s before considering differences in epoch count and learning-rate schedule.
So it would be unfair to expect the model to absorb new knowledge from such a small fraction the way it did during pretraining.
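To make that fraction concrete (1.4T is LLaMA’s reported pretraining token count; the 1B fine-tuning budget is an assumed, already-generous figure):

```python
# Back-of-the-envelope: fine-tuning data as a share of pretraining data.
pretrain_tokens = 1.4e12   # LLaMA's reported pretraining corpus (~1.4T tokens)
finetune_tokens = 1.0e9    # an assumed, already-large fine-tuning set (~1B tokens)
share = finetune_tokens / pretrain_tokens
print(f"{share:.4%}")      # well under 0.1%
```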
What we mostly see with fine-tuning is that the model learns the format, e.g. the question-answer structure for QA.
Right now it is very hard to inject knowledge through fine-tuning the way pretraining does, but we expect this to get easier with more research.