Continued pre-training

Hi,

I have been working on continued pre-training of the meta-llama/Llama-2-7b-chat-hf model.
The literature and articles say continued pre-training works well with more data (something like 100M tokens or more). I have been experimenting with only around 2M tokens, and I have seen a decent improvement in the model's relevance on the new data after continued pre-training. However, the results are not great!
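
For context, here is a simplified sketch of the kind of setup I am using (the data file name and all hyperparameters below are illustrative, not my exact values):

```python
# Simplified sketch of my continued pre-training setup.
# File name and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Raw domain text (~2M tokens), one document per line (hypothetical file).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM objective (no masking), same as the original pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama2-continued-pretrain",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```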

Now, the domain I have been working on does not really have a lot of data. So, if adding more data helps, can I also add data from outside the domain? Maybe something similar to my domain, just to increase the dataset size.

Another question: suppose the same rare data points appear in a small dataset (say 1M tokens) and in a bigger dataset (100M tokens, grown by adding more similar data). Will the model trained on 100M tokens answer questions about those rare data points better, even though both datasets contain exactly the same amount of rare data?

Thanks in advance for any answers!