Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

I am working on warm-starting models for the summarization task, based on @patrickvonplaten's great blog: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models. However, I have a few questions regarding these models, especially Bert2Gpt2 and Bert2Bert:

1- As we all know, summarization is a sequence-to-sequence task. In @patrickvonplaten's blog, when warm-starting the Bert2Gpt2 model:

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?
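
For context, this is roughly the Seq2SeqTrainer setup I would have expected instead (just a sketch, not the blog's code; the checkpoint names and the `train_data` / `val_data` variables are placeholders):

```python
from transformers import (
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Warm-start a Bert2Gpt2 model (checkpoint names are just examples).
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2gpt2-summarization",
    predict_with_generate=True,  # lets the trainer call model.generate() during evaluation
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

# `train_data` and `val_data` stand in for already-tokenized datasets.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
```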

2- For the Bert2Gpt2 model, how can the decoder (GPT-2) understand the output of the encoder (BERT) when the two use different vocabularies?
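
To make the question concrete, this is how I prepare the two sides, each with its own tokenizer (a minimal sketch, assuming the bert-base-uncased and gpt2 checkpoints):

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast

bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token  # GPT-2 has no pad token by default

article = "The quick brown fox jumps over the lazy dog."
summary = "A fox jumps over a dog."

# Encoder inputs are built with BERT's vocabulary ...
inputs = bert_tokenizer(article, return_tensors="pt")
# ... while the decoder labels are built with GPT-2's vocabulary.
labels = gpt2_tokenizer(summary, return_tensors="pt").input_ids
```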

3- For the Bert2Bert and Roberta2Roberta models, how can BERT and RoBERTa be used as decoders when they are encoder-only models?
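
For reference, this is how I create the Bert2Bert model. I can see that the decoder config gets modified when the second checkpoint is loaded, but I'd like to understand what that actually changes inside the model (a minimal sketch):

```python
from transformers import EncoderDecoderModel

# The same encoder-only checkpoint is loaded twice: once as encoder, once as decoder.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# These flags are set automatically on the decoder copy:
print(bert2bert.config.decoder.is_decoder)           # True -> causal (left-to-right) attention mask
print(bert2bert.config.decoder.add_cross_attention)  # True -> cross-attention layers added (randomly initialized)
```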

Best Regards :slight_smile: