Warm-started encoder-decoder models (Bert2Gpt2 and Bert2Bert)

I am working on warm-starting models for the summarization task, based on @patrickvonplaten's great blog: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models. However, I have a few questions regarding these models, especially Bert2Gpt2 and Bert2Bert:

1- As we all know, summarization is a sequence-to-sequence task. In @patrickvonplaten's blog, when warm-starting the Bert2Gpt2 model:

Why do we use Trainer and TrainingArguments instead of Seq2SeqTrainer and Seq2SeqTrainingArguments?
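
For context, this is roughly the Seq2SeqTrainer setup I would have expected instead (just a sketch, not the blog's code; the checkpoint names and the `train_data` / `val_data` variables are placeholders):

```python
from transformers import (
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Warm-start a Bert2Gpt2 model (checkpoint names are just examples).
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2gpt2-summarization",
    predict_with_generate=True,  # lets the trainer call model.generate() during evaluation
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

# `train_data` and `val_data` stand in for already-tokenized datasets.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
```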

2- For the Bert2Gpt2 model, how can the decoder (GPT-2) understand the output of the encoder (BERT) when the two use different vocabularies?
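
To make the question concrete, this is how I prepare the two sides, each with its own tokenizer (a minimal sketch, assuming the bert-base-uncased and gpt2 checkpoints):

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast

bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token  # GPT-2 has no pad token by default

article = "The quick brown fox jumps over the lazy dog."
summary = "A fox jumps over a dog."

# Encoder inputs are built with BERT's vocabulary ...
inputs = bert_tokenizer(article, return_tensors="pt")
# ... while the decoder labels are built with GPT-2's vocabulary.
labels = gpt2_tokenizer(summary, return_tensors="pt").input_ids
```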

3- For the Bert2Bert and Roberta2Roberta models, how can BERT and RoBERTa be used as decoders when they are encoder-only models?
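
For reference, this is how I create the Bert2Bert model. I can see that the decoder config gets modified when the second checkpoint is loaded, but I'd like to understand what that actually changes inside the model (a minimal sketch):

```python
from transformers import EncoderDecoderModel

# The same encoder-only checkpoint is loaded twice: once as encoder, once as decoder.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# These flags are set automatically on the decoder copy:
print(bert2bert.config.decoder.is_decoder)           # True -> causal (left-to-right) attention mask
print(bert2bert.config.decoder.add_cross_attention)  # True -> cross-attention layers added (randomly initialized)
```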

Best Regards :slight_smile: