How to deal with new vocabulary?

Hi, the project I am working on has a lot of domain-specific vocabulary. Could you please suggest techniques for adapting BERT to domain data? I have over 1 million unlabeled sentences, which I am hoping is enough to continue pre-training the language model.
My end goal is to train a multi-class classification model, but my immediate interest is to pre-train the BERT language model on the domain data (the 1 million texts), take the embeddings from the adapted model, and feed them into a traditional classifier like a Random Forest. Thanks!
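A minimal sketch of what continued (domain-adaptive) masked-language-model pre-training could look like with the transformers `Trainer`, assuming one unlabeled sentence per line in a plain-text file. The file name `domain_sentences.txt`, the `bert-base-uncased` checkpoint, the output directory, and all hyperparameters below are placeholders, not values from the post:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One unlabeled domain sentence per line in a plain-text file (placeholder name).
dataset = load_dataset("text", data_files={"train": "domain_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamically masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-domain-mlm")
tokenizer.save_pretrained("bert-domain-mlm")
```

For the second step, one possible way to use the adapted model as a frozen feature extractor for a Random Forest is to mean-pool the encoder's last hidden state into one vector per text. The `train_texts`/`train_labels` variables here are tiny placeholders standing in for the real labeled multi-class data:

```python
import torch
from sklearn.ensemble import RandomForestClassifier
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-domain-mlm")
encoder = BertModel.from_pretrained("bert-domain-mlm")
encoder.eval()

def embed(texts, batch_size=32):
    """Mean-pool the last hidden state into one vector per text."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(
                texts[i : i + batch_size],
                padding=True,
                truncation=True,
                max_length=128,
                return_tensors="pt",
            )
            out = encoder(**enc)
            mask = enc["attention_mask"].unsqueeze(-1)
            pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
            chunks.append(pooled)
    return torch.cat(chunks).numpy()

# Placeholder labeled data; replace with the real multi-class dataset.
train_texts = ["first labeled example", "second labeled example"]
train_labels = [0, 1]

clf = RandomForestClassifier(n_estimators=300)
clf.fit(embed(train_texts), train_labels)
```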


I’d also be very interested to see if/how this could be done for BART’s encoder, since this might be a solution to this problem.
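I don't know the exact setup here, but a minimal sketch of just reading sentence features out of BART's encoder with the transformers library is below; the `facebook/bart-base` checkpoint and the example sentence are placeholders. Note that continued pre-training for BART would use its denoising objective rather than plain MLM, which is a more involved setup than the BERT sketch above:

```python
import torch
from transformers import BartModel, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")
model.eval()

inputs = tokenizer(["an example domain sentence"], return_tensors="pt")
with torch.no_grad():
    # get_encoder() returns BART's encoder stack on its own, so the decoder
    # is never run; the output carries a standard last_hidden_state tensor.
    enc_out = model.get_encoder()(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# Mean-pool token states into a single sentence vector, as with BERT above.
sentence_embedding = enc_out.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bart-base
```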