Encoder-decoder model

I want to use an encoder-decoder model, but before fine-tuning it I want to pre-train it on my own data with unsupervised objectives, namely masked language modeling and causal language modeling, so that it understands the context I am working in. What is the best way to go about this?
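
To make the question concrete, here is a minimal sketch of how the two objectives might be phrased as (encoder input, decoder target) pairs. It is only an illustration under my own assumptions, not a settled method: the masked objective is written as a simplified T5-style span corruption (the `<extra_id_N>` sentinel naming is borrowed from T5, and the closing sentinel T5 appends to the target is omitted), and the causal objective is written as a prefix split where the decoder continues the text left to right. The function names `make_denoising_pair` and `make_prefix_lm_pair` are hypothetical, and the mask positions are chosen by hand rather than sampled.

```python
def make_denoising_pair(tokens, mask_positions, sentinel="<extra_id_{}>"):
    """Masked-LM analogue for an encoder-decoder (simplified span corruption).

    Each run of consecutive masked positions is replaced by one sentinel in
    the encoder input; the decoder target lists each sentinel followed by
    the tokens it hides.
    """
    masked = set(mask_positions)
    encoder_input, decoder_target = [], []
    sentinel_id = 0
    i = 0
    while i < len(tokens):
        if i in masked:
            marker = sentinel.format(sentinel_id)
            encoder_input.append(marker)
            decoder_target.append(marker)
            # Move the whole masked run into the decoder target.
            while i < len(tokens) and i in masked:
                decoder_target.append(tokens[i])
                i += 1
            sentinel_id += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return encoder_input, decoder_target


def make_prefix_lm_pair(tokens, split):
    """Causal-LM analogue: the encoder reads a prefix, the decoder
    generates the remainder left to right."""
    return tokens[:split], tokens[split:]


tokens = "the cat sat on the mat".split()
enc, dec = make_denoising_pair(tokens, mask_positions=[1, 2, 5])
print(enc)  # ['the', '<extra_id_0>', 'on', 'the', '<extra_id_1>']
print(dec)  # ['<extra_id_0>', 'cat', 'sat', '<extra_id_1>', 'mat']
print(make_prefix_lm_pair(tokens, split=3))
```

In both cases the model would be trained with the usual cross-entropy loss on the decoder target, so what I am really asking is whether pairs like these are a sensible way to do the unsupervised stage before fine-tuning.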