The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in: https://arxiv.org/abs/1907.12461.

Similarly, the `EncoderDecoderModel` framework of Transformers can be used to initialize Encoder-Decoder models from "bert-base-cased" or "roberta-base" for summarization.

One can initialize such a model with weights from pre-trained checkpoints via:

```
from transformers import EncoderDecoderModel
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
```
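
For summarization, the warm-started model also needs a few decoder-specific settings (most importantly the decoder start token) before it can generate. Below is a minimal sketch, assuming BERT's [CLS], [SEP], and [PAD] tokens are used for these roles; this is a common convention, not the only possible choice:

```
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# Tell the decoder which tokens start, end, and pad generation
# (assumption: BERT's [CLS], [SEP], and [PAD] tokens are used for these roles).
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

# The cross-attention weights are freshly initialized, so generations are
# essentially random until the model has been fine-tuned on a summarization dataset.
inputs = tokenizer("A long news article to summarize ...", return_tensors="pt")
output_ids = bert2bert.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```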

A couple of models based on "bert-base-cased" or "roberta-base" have been trained this way for the CNN/Daily-Mail summarization task to verify that the `EncoderDecoderModel` framework is functional.

Below are the Rouge2 F-measure results on the CNN/Daily-Mail test set:

**Bert2GPT2**: *15.19* - https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16

**Bert2Bert**: *16.1* - https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16

**Roberta2Roberta**: *16.79* - https://huggingface.co/patrickvonplaten/roberta2roberta-cnn_dailymail-fp16

**Roberta2Roberta (shared)**: *16.59* - https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16

**Note**: The models above were trained without any hyper-parameter search and with fp16 precision. For more details, please refer to the respective model cards.
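
Any of the checkpoints above can be loaded directly for inference. The sketch below assumes the tokenizer files are stored in the same repository; if they are not, the tokenizer of the underlying base checkpoint (see the respective model card) can be used instead:

```
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "patrickvonplaten/bert2bert-cnn_dailymail-fp16"
tokenizer = AutoTokenizer.from_pretrained(ckpt)  # assumption: tokenizer files live in the same repo
model = EncoderDecoderModel.from_pretrained(ckpt)

article = "(CNN) A long news article that should be summarized ..."
inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")
summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```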

UPDATE:

Better models, trained with the Seq2Seq Trainer and the code on current master, give the following results:

**BERT2BERT** on CNN/Dailymail: *18.22* - https://huggingface.co/patrickvonplaten/bert2bert_cnn_daily_mail

**Roberta2Roberta (shared)** on BBC/XSum: *16.89* - https://huggingface.co/patrickvonplaten/roberta_shared_bbc_xsum

Two notebooks showing how Encoder-Decoder models can be trained on current master are also attached to the model cards.
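
For reference, a rough fine-tuning sketch with the Seq2Seq Trainer is shown below. The dataset slice, column names, and hyper-parameters are illustrative assumptions rather than the exact setup used in the notebooks:

```
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Small slice of CNN/Daily-Mail for illustration only.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

def preprocess(batch):
    # Tokenize articles as inputs and highlights as labels.
    inputs = tokenizer(batch["article"], truncation=True, max_length=512)
    targets = tokenizer(batch["highlights"], truncation=True, max_length=128)
    inputs["labels"] = targets["input_ids"]
    return inputs

dataset = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

# Pads inputs and labels dynamically per batch (labels padded with -100).
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-cnn",       # hypothetical output directory
    per_device_train_batch_size=4,       # illustrative hyper-parameters
    fp16=True,                           # requires a GPU with fp16 support
    predict_with_generate=True,
    logging_steps=100,
    save_steps=1000,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```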