Hello everyone, I have been building a BERT model with Masked Language Modeling as my pretraining objective. Now, all of a sudden, instead of a Masked Language Model I need the outputs of a Causal Language Model (predicting the n+1-th token of the sequence).
I was wondering whether it would be correct to keep this model, simply mask only the last token of each sequence during preprocessing, and re-train it. Do you think that's a good approach? Or should I build a new model specifically for autoregressive behaviour?
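To make the difference concrete, here is a minimal sketch (plain Python, no libraries, illustrative only) of the attention masks I understand to be involved: a BERT-style MLM lets every position attend to the full sequence, whereas a causal LM restricts each position to its prefix, which is why I'm unsure masking only the last token is enough.

```python
def bidirectional_mask(n):
    # MLM-style (BERT) mask: every position attends to every other position,
    # so even a masked last token can "see" the whole sequence.
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    # Causal-LM-style mask: position i attends only to positions 0..i,
    # which enforces next-token prediction at every position.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
```

My doubt is exactly this: with the bidirectional mask, predicting the last token is not the same training signal as a truly autoregressive model trained with the triangular mask.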
Thank you very much!