Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

Thank you, @nielsr, for your useful explanations and suggestions. I really appreciate it.