I’m trying to fine-tune a LLaMA model in a causal language modelling fashion (i.e. no instruction fine-tuning) on a domain-specific dataset, so that the model becomes more knowledgeable about that domain and a better starting point for instruction-based fine-tuning.
However, the fine-tuned model seems to just overfit to the training dataset, almost always producing responses similar in structure and content to the documents in the training set. The ideal outcome would instead be that the model learns the domain knowledge rather than the structure of the documents, without losing too much of its original capabilities.
My questions are the following:
- Has anyone had any experience with this?
- Is it even feasible to achieve the desired goal without resorting to pre-training from scratch?
- What can be done from a training perspective? E.g. does it make sense to gradually unfreeze weights, as used to be done with deep CNNs?
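The gradual-unfreezing idea boils down to a name-based filter over the model’s parameters: train only the top few transformer blocks (plus the final norm and LM head) and freeze everything below. The sketch below is only an illustration on a hypothetical 4-layer toy naming scheme that mirrors Hugging Face’s LLaMA parameter layout (`model.layers.N....`); in practice the real names and layer count should be read off `model.named_parameters()`.

```python
import re

def trainable_param_names(all_names, num_layers, unfreeze_top_k):
    """Names of parameters to leave trainable when only the top
    `unfreeze_top_k` transformer blocks (plus final norm and LM head)
    are unfrozen; everything else would get requires_grad = False."""
    cutoff = num_layers - unfreeze_top_k
    keep = set()
    for name in all_names:
        m = re.search(r"\blayers\.(\d+)\.", name)
        if m:
            if int(m.group(1)) >= cutoff:
                keep.add(name)            # one of the top-K blocks
        elif "lm_head" in name or name.endswith("norm.weight"):
            keep.add(name)                # final norm + output head
    return keep

# Hypothetical 4-layer toy naming, mirroring LLaMA's parameter layout:
names = (
    ["model.embed_tokens.weight"]
    + [f"model.layers.{i}.self_attn.q_proj.weight" for i in range(4)]
    + ["model.norm.weight", "lm_head.weight"]
)
print(sorted(trainable_param_names(names, num_layers=4, unfreeze_top_k=2)))
```

In an actual gradual-unfreezing schedule you would widen `unfreeze_top_k` every so many steps while keeping the learning rate low, so the lower layers are touched last and least.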
I am experiencing the same issue with a very similar task. The new model does seem to pick up some domain knowledge, but it massively loses the original model’s conversational/English capabilities when I do causal fine-tuning. I also observe frequent hallucinations related to my dataset. Please share if you have thoughts on any of these questions.
I thought about gradually unfreezing the model weights and training with a very low learning rate, but in my opinion that would alter the original model even more.
This is expected and one of the main areas of research now.
Think about it:
LLaMA was trained on 1.4 trillion tokens. If you fine-tune on 1 billion tokens (which is already a lot for fine-tuning), that is less than 0.1% of what it saw in pretraining, and that’s before considering differences in epoch count and learning-rate schedule.
So it would be unfair to expect the model to absorb new knowledge from such a small fraction the way it did during pretraining.
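To make that fraction concrete (1.4T is LLaMA’s reported pretraining token count; the 1B fine-tuning budget is an assumed, already-generous figure):

```python
# Back-of-the-envelope: fine-tuning data as a share of pretraining data.
pretrain_tokens = 1.4e12   # LLaMA's reported pretraining corpus (~1.4T tokens)
finetune_tokens = 1.0e9    # an assumed, already-large fine-tuning set (~1B tokens)
share = finetune_tokens / pretrain_tokens
print(f"{share:.4%}")      # well under 0.1%
```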
What we mostly see with fine-tuning is that the model learns the format, e.g. the question-answer structure for QA.
Right now it is very hard to inject knowledge through fine-tuning the way pretraining does, but we expect this to get easier with more research.