Fine-tune model for domain or create language model from scratch

Hello, I am new to language modelling and have followed the Hugging Face course on transformer models.

My goal is to build a text-classification model that is trained on a specific domain (insurance), so that it gives better results and needs less labeled training data.

What I have read so far is that I could take a pre-trained language model such as bert-base-german-cased and fine-tune it on my labeled dataset for the classification task.
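To make that first option concrete, I think the fine-tuning route would look roughly like the sketch below (the dataset variable, number of labels, and output directory are just placeholders for my own data):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased",
    num_labels=5,  # placeholder: number of my insurance categories
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# labeled_dataset is a placeholder for my own datasets.DatasetDict
# with "text" and "label" columns
train_ds = labeled_dataset["train"].map(tokenize, batched=True)
eval_ds = labeled_dataset["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-german-insurance-clf",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
```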

Or I could train my own language model from scratch, which means I would have to build my own tokenizer and a custom model.

Note: I have a large corpus of unlabeled texts that I could use for masked language modelling (MLM) training, roughly as sketched below.
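This is how I imagine the from-scratch option would look: first train a WordPiece tokenizer on the raw corpus, then pretrain a randomly initialized BERT with MLM (file paths, directory names, and the unlabeled dataset variable are placeholders):

```python
import os

from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Train a custom WordPiece tokenizer on the raw insurance texts
os.makedirs("insurance-tokenizer", exist_ok=True)
tok_trainer = BertWordPieceTokenizer()
tok_trainer.train(files=["insurance_corpus.txt"], vocab_size=30_000)  # placeholder file
tok_trainer.save_model("insurance-tokenizer")

tokenizer = BertTokenizerFast.from_pretrained("insurance-tokenizer")

# 2. Build a BERT model with random weights and pretrain it with MLM
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# unlabeled_dataset is a placeholder for a datasets.Dataset built from the same corpus
tokenized = unlabeled_dataset.map(tokenize, batched=True, remove_columns=["text"])

# randomly masks 15% of tokens so the model learns the domain vocabulary in context
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="insurance-bert-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```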

Would you advise me to take a pre-trained model and fine-tune it with my already labeled data, or should I build my own language model from scratch with MLM?

Thank you in advance.
