MLM vs CLM: can they be exchanged?

Hello everyone, I have been building a BERT model with a masked language model (MLM) objective as my pretraining goal. Now I suddenly need the outputs of a causal language model (CLM) instead, i.e. predicting the (n+1)-th token of the sequence.

I was wondering whether it would be correct to keep this model, mask only the last token of each sequence during preprocessing, and re-train it. Do you think that's a good approach? Or should I build a new model specifically for autoregressive behaviour? A rough sketch of what I mean is below.
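
To make the idea concrete, here is a minimal sketch of the preprocessing I have in mind, using Hugging Face Transformers (the `bert-base-uncased` checkpoint is just a stand-in for my own model):

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is Paris"
enc = tokenizer(text, return_tensors="pt")

# Replace the last real token (the one before [SEP]) with [MASK],
# so the MLM head has to predict the "next" token from left context only.
input_ids = enc["input_ids"].clone()
last_pos = input_ids.shape[1] - 2  # index of the token just before [SEP]

labels = torch.full_like(input_ids, -100)  # -100 is ignored by the loss
labels[0, last_pos] = input_ids[0, last_pos]
input_ids[0, last_pos] = tokenizer.mask_token_id

outputs = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    labels=labels,
)
print(outputs.loss)
```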

Any references?

Thank you very much