Hello, I am new to language modelling and followed the Hugging Face course about transformer models.
My goal is to have a text-classification model that is trained on a specific domain (insurance) so that it gets better results and needs less labeled training data.
What I have read so far is that I could use a pre-trained language model like bert-base-german-cased and fine-tune it on my labeled dataset for the classification task.
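To make sure I understand this option correctly, here is a minimal sketch of what I picture for the fine-tuning step (the CSV file names, column names, and `num_labels=3` are just placeholders for my data):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder for however many classes my task has
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# assuming CSV files with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-german-insurance-clf",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```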
Or I could train my own language model, which means I would have to build my own tokenizer and a custom model.
Note: I have a large corpus of unlabeled texts which I could use for training (MLM).
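If I went the MLM route, I think the training step would look roughly like the sketch below. I have written it as continuing MLM from the pre-trained checkpoint on my unlabeled corpus; for a truly from-scratch model I assume I would instead train my own tokenizer and initialize the model from a fresh config rather than with `from_pretrained` (the corpus file name here is just a placeholder):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# unlabeled insurance texts, one per line (file name is a placeholder)
dataset = load_dataset("text", data_files={"train": "insurance_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# masks 15% of tokens on the fly so the model trains on the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-german-insurance-mlm",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    data_collator=collator,
)
trainer.train()
```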
Would you advise me to take a pre-trained model and fine-tune it with my already labeled data, or should I build my own language model from scratch with MLM?
Thank you in advance.