BertForMaskedLM on a fine-tuned base model

Is there a way to fine-tune the base BERT/RoBERTa architecture on a task like sequence classification, and then use the fine-tuned model as the base model for MLM predictions? I tried this by copying the state dict from the sequence classification model into the MLM architecture, but that did not work at all. It seems that the weights I swap in from the sequence classification task do not play well with the MLM objective.

Here is a code snippet:

from transformers import RobertaForMaskedLM

# Load the fine-tuned 'roberta-base' checkpoint into RobertaForMaskedLM
roberta_mlm_model = RobertaForMaskedLM.from_pretrained(MODEL_FILE)

# Load the default pretrained model
default_model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Swap the weights for the head

Can someone tell me if I am thinking in the right direction here?


This will be problematic because the heads are not compatible. In other words, you can fine-tune the model and load its weights into the other architecture, but the heads are different and cannot be mapped onto each other. So after fine-tuning for sequence classification, saving the model, and loading it into the MLM version of that architecture, the LM head will not have pretrained weights.
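
To make the mismatch concrete, here is a minimal sketch that compares the parameter names of the two architectures and then copies only the shared encoder across. It assumes a tiny, randomly initialized `RobertaConfig` (the sizes are hypothetical, chosen so nothing has to be downloaded); `clf_model` stands in for the fine-tuned classifier and `mlm_model` for the pretrained `roberta-base` MLM model. It illustrates the idea and is not a recipe that is guaranteed to give good MLM predictions.

```python
import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
)

# Tiny, randomly initialized config (hypothetical sizes) so nothing is downloaded.
config = RobertaConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    max_position_embeddings=40, num_labels=2,
)
clf_model = RobertaForSequenceClassification(config)  # stand-in: fine-tuned classifier
mlm_model = RobertaForMaskedLM(config)                # stand-in: pretrained MLM model

# Compare parameter names: the encoder is shared, the heads are not.
clf_keys = set(clf_model.state_dict())
mlm_keys = set(mlm_model.state_dict())
only_clf = sorted(clf_keys - mlm_keys)  # classification head parameters
only_mlm = sorted(mlm_keys - clf_keys)  # LM head parameters
print("only in classifier:", only_clf)
print("only in MLM model:", only_mlm)

# Keep a copy of one LM head weight to confirm the head is left untouched.
head_before = mlm_model.lm_head.dense.weight.detach().clone()

# Transplant only the shared encoder; its key names match in both models.
mlm_model.roberta.load_state_dict(clf_model.roberta.state_dict())

encoder_copied = all(
    torch.equal(value, mlm_model.roberta.state_dict()[name])
    for name, value in clf_model.roberta.state_dict().items()
)
head_unchanged = torch.equal(head_before, mlm_model.lm_head.dense.weight)
print("encoder copied:", encoder_copied, "| LM head unchanged:", head_unchanged)
```

With real checkpoints, the two constructors would be replaced by `from_pretrained` calls. Note that RoBERTa ties the LM head's output projection to the input embeddings by default, so that part of the head follows the transplanted embeddings, while the rest of the head keeps its pretrained values. Whether a pretrained head still works well on top of an encoder that has drifted during fine-tuning is something to check empirically.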

Thanks @BramVanroy! That is what I thought. The weights will not be relevant. How do you suggest I solve this? I was thinking of training both the sequence classification task and the MLM task on my dataset in a multi-task setting. Not sure if you or anyone has any pointers.
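
For reference, a rough sketch of what that multi-task idea could look like: one shared encoder, two heads, and a weighted sum of the two losses. Everything here is an assumption for illustration: the tiny config, the random token ids, the `MASK_ID` value, and the `mlm_weight` factor are all hypothetical, and real training would use a tokenizer, a proper masking collator, and a data loader.

```python
import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
)

# Tiny, randomly initialized config (hypothetical sizes) so nothing is downloaded.
config = RobertaConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    max_position_embeddings=40, num_labels=2,
)
mlm_model = RobertaForMaskedLM(config)
clf_model = RobertaForSequenceClassification(config)

# Share a single encoder between both heads.
clf_model.roberta = mlm_model.roberta

# Collect each parameter once, since the encoder appears in both models.
unique_params = list(
    {id(p): p for m in (mlm_model, clf_model) for p in m.parameters()}.values()
)
optimizer = torch.optim.AdamW(unique_params, lr=1e-4)

MASK_ID = 4       # hypothetical mask token id for this toy vocabulary
mlm_weight = 0.5  # hypothetical weighting between the two losses

# Toy batch: 4 sequences of 10 random token ids (no padding tokens).
input_ids = torch.randint(5, 100, (4, 10))
attention_mask = torch.ones_like(input_ids)
clf_labels = torch.tensor([0, 1, 0, 1])

# Mask position 3 in every sequence; -100 marks positions ignored by the loss.
masked_ids = input_ids.clone()
masked_ids[:, 3] = MASK_ID
mlm_labels = torch.full_like(input_ids, -100)
mlm_labels[:, 3] = input_ids[:, 3]

# One joint training step over both objectives.
clf_loss = clf_model(
    input_ids=input_ids, attention_mask=attention_mask, labels=clf_labels
).loss
mlm_loss = mlm_model(
    input_ids=masked_ids, attention_mask=attention_mask, labels=mlm_labels
).loss
loss = clf_loss + mlm_weight * mlm_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because both heads are trained on top of the same encoder, the LM head stays in sync with the encoder as it adapts to the classification task, which is the incompatibility the weight-swapping approach runs into.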