Fine-tune model for domain or create language model from scratch

Hello, I am new to language modelling and have followed the Hugging Face course on transformer models.

My goal is to build a text-classification model that is trained on a specific domain (insurance), so that it gives better results and needs less labeled training data.

What I have read so far is that I could take a pre-trained language model such as bert-base-german-cased and fine-tune it on my labeled dataset for the classification task.
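To make that first option concrete, I think the fine-tuning route would look roughly like the sketch below (the dataset variable, number of labels, and output directory are just placeholders for my own data):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased",
    num_labels=5,  # placeholder: number of my insurance categories
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# labeled_dataset is a placeholder for my own datasets.DatasetDict
# with "text" and "label" columns
train_ds = labeled_dataset["train"].map(tokenize, batched=True)
eval_ds = labeled_dataset["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-german-insurance-clf",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
```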

Or I could train my own language model from scratch, which means I would have to build my own tokenizer and a custom model.

Note: I have a large corpus of unlabeled texts that I could use for masked language modelling (MLM) training, roughly as sketched below.
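This is how I imagine the from-scratch option would look: first train a WordPiece tokenizer on the raw corpus, then pretrain a randomly initialized BERT with MLM (file paths, directory names, and the unlabeled dataset variable are placeholders):

```python
import os

from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Train a custom WordPiece tokenizer on the raw insurance texts
os.makedirs("insurance-tokenizer", exist_ok=True)
tok_trainer = BertWordPieceTokenizer()
tok_trainer.train(files=["insurance_corpus.txt"], vocab_size=30_000)  # placeholder file
tok_trainer.save_model("insurance-tokenizer")

tokenizer = BertTokenizerFast.from_pretrained("insurance-tokenizer")

# 2. Build a BERT model with random weights and pretrain it with MLM
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# unlabeled_dataset is a placeholder for a datasets.Dataset built from the same corpus
tokenized = unlabeled_dataset.map(tokenize, batched=True, remove_columns=["text"])

# randomly masks 15% of tokens so the model learns the domain vocabulary in context
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="insurance-bert-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```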

Would you advise me to take a pre-trained model and fine-tune it with my already labeled data, or should I build my own language model from scratch with MLM?

Thank you in advance.
