Train MLM on my own domain and fine tune on downstream classification task

I understand that to fine-tune an available BERT model for text classification, I can do something like the example below:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

But since I have a specific domain, I would rather:

  1. Further train a tokenizer on my text corpus (without the labels) and save the trained tokenizer
  2. Instantiate tokenizer = BertTokenizerFast.from_pretrained(path_to_my_new_saved_tokenizer)
  3. Instantiate model = BertForMaskedLM(config=model_config)
  4. Train a masked language model on the unlabeled data and save the pretrained model (a rough sketch of these four steps is shown right after this list)
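For reference, here is a rough sketch of how I am approaching steps 1-4, assuming the transformers and datasets libraries. The file "domain_corpus.txt", the vocabulary size, and the output directories are placeholders for my own setup, not anything from the course:

from transformers import (
    BertTokenizerFast,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Step 1: train a tokenizer on the unlabeled domain corpus.
# "domain_corpus.txt" and the output paths are placeholders for my own files.
base_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(raw), batch_size):
        yield raw[i : i + batch_size]["text"]

tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=30522)
tokenizer.save_pretrained("my_domain_tokenizer")

# Steps 2-3: load the new tokenizer and build a fresh BertForMaskedLM from a config.
tokenizer = BertTokenizerFast.from_pretrained("my_domain_tokenizer")
model_config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config=model_config)

# Step 4: masked-language-model training on the unlabeled text.
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="my_domain_mlm", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("my_domain_mlm")
tokenizer.save_pretrained("my_domain_mlm")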

My question is: how can I now fine-tune the model saved in step 4 on my labeled text data? I am looking for a code example for that next step. Can I just do something like:

BertForSequenceClassification.from_pretrained(path_to_my_new_trained_mlm_model)
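To make the question concrete, here is a rough sketch of what I imagine that fine-tuning step would look like. This is my guess, not working code from the course: "my_domain_mlm" is the directory where I saved the MLM model and tokenizer in step 4, and the CSV files with "text" and "label" columns are placeholders for my labeled data.

from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load the tokenizer and the MLM-pretrained weights saved in step 4.
tokenizer = BertTokenizerFast.from_pretrained("my_domain_mlm")
model = BertForSequenceClassification.from_pretrained("my_domain_mlm", num_labels=2)
# The classification head is newly initialized; only the encoder weights come
# from the MLM checkpoint, so transformers warns about it (which is expected).

# Placeholder labeled data: CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="my_domain_classifier", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()

Is something along those lines the intended way to do it, or is there an extra step needed to transfer the MLM weights into the classification model?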

Most of the examples I find online only show the first part, which is "how to train BERT from scratch", but don't show the next step, which is "how to fine-tune that trained MLM model on labeled data".

Something like the last section (the IMDb classifier) of the diagram from the Hugging Face course (assuming I have completed steps 1 and 2).

Hi @Alex18,
Did you find a solution?