How to deal with new vocabulary?

Hi, the project I am working on has a lot of domain-specific vocabulary. Could you please suggest techniques for adapting BERT to domain data? I have over 1 million unlabeled sentences, which I am hoping is enough to continue pre-training the language model.
My end goal is to train a multi-class classification model, but my immediate interest is to pre-train the BERT language model on the domain data (the 1 million texts), take the embeddings from the adapted model, and feed them into a traditional classifier like a Random Forest. Thanks!
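A minimal sketch of what continued (domain-adaptive) masked-language-model pre-training could look like with the transformers `Trainer`, assuming one unlabeled sentence per line in a plain-text file. The file name `domain_sentences.txt`, the `bert-base-uncased` checkpoint, the output directory, and all hyperparameters below are placeholders, not values from the post:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One unlabeled domain sentence per line in a plain-text file (placeholder name).
dataset = load_dataset("text", data_files={"train": "domain_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamically masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-domain-mlm")
tokenizer.save_pretrained("bert-domain-mlm")
```

For the second step, one possible way to use the adapted model as a frozen feature extractor for a Random Forest is to mean-pool the encoder's last hidden state into one vector per text. The `train_texts`/`train_labels` variables here are tiny placeholders standing in for the real labeled multi-class data:

```python
import torch
from sklearn.ensemble import RandomForestClassifier
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-domain-mlm")
encoder = BertModel.from_pretrained("bert-domain-mlm")
encoder.eval()

def embed(texts, batch_size=32):
    """Mean-pool the last hidden state into one vector per text."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(
                texts[i : i + batch_size],
                padding=True,
                truncation=True,
                max_length=128,
                return_tensors="pt",
            )
            out = encoder(**enc)
            mask = enc["attention_mask"].unsqueeze(-1)
            pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
            chunks.append(pooled)
    return torch.cat(chunks).numpy()

# Placeholder labeled data; replace with the real multi-class dataset.
train_texts = ["first labeled example", "second labeled example"]
train_labels = [0, 1]

clf = RandomForestClassifier(n_estimators=300)
clf.fit(embed(train_texts), train_labels)
```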


I’d also be very interested to see if/how this could be done for BART’s encoder, since this might be a solution to this problem.
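I don't know the exact setup here, but a minimal sketch of just reading sentence features out of BART's encoder with the transformers library is below; the `facebook/bart-base` checkpoint and the example sentence are placeholders. Note that continued pre-training for BART would use its denoising objective rather than plain MLM, which is a more involved setup than the BERT sketch above:

```python
import torch
from transformers import BartModel, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")
model.eval()

inputs = tokenizer(["an example domain sentence"], return_tensors="pt")
with torch.no_grad():
    # get_encoder() returns BART's encoder stack on its own, so the decoder
    # is never run; the output carries a standard last_hidden_state tensor.
    enc_out = model.get_encoder()(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# Mean-pool token states into a single sentence vector, as with BERT above.
sentence_embedding = enc_out.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bart-base
```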