Continued pre-training

Hi,

I have been working on continued pre-training of the meta-llama/Llama-2-7b-chat-hf model.
The literature and articles say continued pre-training works well with more data (something like 100M tokens or more). I have been experimenting with only around 2M tokens, and I have seen a decent improvement in the model's relevance on the new data after continued pre-training. However, the results are not great!
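
For context, here is a simplified sketch of the kind of setup I am using (the data file name and all hyperparameters below are illustrative, not my exact values):

```python
# Simplified sketch of my continued pre-training setup.
# File name and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Raw domain text (~2M tokens), one document per line (hypothetical file).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM objective (no masking), same as the original pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="llama2-continued-pretrain",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```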

Now, the domain I have been working on does not really have a lot of data. So, if adding more data helps, can I also add data from outside the domain? Maybe something similar to my domain, just to increase the dataset size.

Another question: suppose the same rare data points appear in a small dataset (say 1M tokens) and in a bigger dataset (100M tokens, grown by adding more similar data). Will the model trained on 100M tokens answer questions about those rare data points better, even though both datasets contain exactly the same amount of rare data?

Thanks in advance for any answers!