I understand that to fine-tune an available BERT model for text classification, I can do something like the following example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
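For context, this is how I understand those two objects fit together (the example texts and labels below are made up just to illustrate):

import torch

# tokenize a toy batch and run it through the classification head
inputs = tokenizer(["great movie", "terrible movie"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1, 0]))
print(outputs.loss)  # fine-tuning would minimize this loss over my labeled data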
but if I have a specific domain, I would rather:
- Further train a tokenizer on my text corpus (without the labels) and save the trained tokenizer
- Instantiate tokenizer = BertTokenizerFast.from_pretrained(path_to_my_new_saved_tokenizer)
- Instantiate model = BertForMaskedLM(config=model_config)
- Train a masked language model on the unlabeled data and save the trained model (a rough sketch of these steps follows this list)
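To make the above concrete, here is roughly what I mean by those steps; the corpus, paths, vocabulary size, and training arguments are just placeholders, not my real setup:

from datasets import Dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

corpus = ["unlabeled domain sentence one ...", "unlabeled domain sentence two ..."]  # my raw text, no labels

# train a new tokenizer on the domain corpus and save it
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
new_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=30522)  # vocab_size is a placeholder
new_tokenizer.save_pretrained("path_to_my_new_saved_tokenizer")

# reload the saved tokenizer
tokenizer = BertTokenizerFast.from_pretrained("path_to_my_new_saved_tokenizer")

# a fresh BERT with a masked-LM head
model_config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config=model_config)

# masked-language-model training on the unlabeled data, then save
dataset = Dataset.from_dict({"text": corpus})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("path_to_my_new_trained_mlm_model")
tokenizer.save_pretrained("path_to_my_new_trained_mlm_model")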
My question is: how can I now fine-tune the model (step 4) on my labeled text data? I am looking for a code example for what comes after step 3. Can I just do something like:
model = BertForSequenceClassification.from_pretrained(path_to_my_new_trained_mlm_model)
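In other words, something along these lines, assuming the MLM checkpoint and tokenizer were both saved to path_to_my_new_trained_mlm_model (the labeled examples and training arguments are placeholders; I am not sure this is the right approach, hence the question):

from datasets import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("path_to_my_new_trained_mlm_model")

# load the MLM-pretrained weights and add a freshly initialized classification head on top
model = BertForSequenceClassification.from_pretrained("path_to_my_new_trained_mlm_model", num_labels=2)

# my labeled data (placeholder examples)
labeled = Dataset.from_dict({"text": ["domain text A", "domain text B"], "label": [0, 1]})
labeled = labeled.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf_out", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=labeled,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()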
Most of the examples I find online only show the first part, i.e. "how to train BERT from scratch", but don't show the next step, which is "how to fine-tune that trained MLM model on labeled data".
Something like the last section (IMDb Classifier) of the following image (assuming I have completed steps 1 and 2), taken from the Hugging Face course here: