The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks (https://arxiv.org/abs/1907.12461).
The EncoderDecoderModel framework of Transformers can be used to initialize Encoder-Decoder models from pre-trained encoder checkpoints such as “bert-base-cased” or “roberta-base” for summarization.
One can initialize such a model with weights from pre-trained checkpoints via:
from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
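As a rough, version-dependent sketch (not taken from the original training code), the warm-started model can then be given the usual seq2seq special tokens and fine-tuned with a standard cross-entropy loss; the article/summary strings and the maximum lengths below are illustrative assumptions:

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# the decoder has to know which token starts decoding and which one is used for padding
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

article = "A long news article that should be summarized ..."  # illustrative input
summary = "A short summary ..."                                 # illustrative target

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(summary, return_tensors="pt", truncation=True, max_length=128).input_ids

# a single forward pass returns the cross-entropy loss used for fine-tuning
loss = bert2bert(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss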
A couple of models based on “bert-base-cased” or “roberta-base” have been trained this way for the CNN/Daily-Mail summarization task with the purpose of verifying that the
EncoderDecoderModel framework is functional.
Below are the Rouge2 f-measure results on the test set of CNN/Daily-Mail:
Bert2GPT2: 15.19 - https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16
Bert2Bert: 16.1 - https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Roberta2Roberta: 16.79 - https://huggingface.co/patrickvonplaten/roberta2roberta-cnn_dailymail-fp16
Roberta2Roberta (shared): 16.59 - https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16
Note: The models above were trained without any hyper-parameter search and in fp16 precision. For more details, please refer to the respective model cards.
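Any of these checkpoints can be loaded directly from the hub for inference; a minimal sketch (the generation hyper-parameters below are illustrative rather than the exact settings from the model cards):

from transformers import AutoTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")

article = "..."  # a CNN/Daily-Mail style news article
input_ids = tokenizer(article, return_tensors="pt", truncation=True, max_length=512).input_ids

summary_ids = model.generate(input_ids, num_beams=4, max_length=142, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))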
Better models, trained with the Seq2Seq Trainer and the code on the current master, give the following results (a minimal training sketch follows below the list):
BERT2BERT on CNN/Dailymail: 18.22 - https://huggingface.co/patrickvonplaten/bert2bert_cnn_daily_mail
Roberta2Roberta (shared) on BBC/XSum: 16.89 - https://huggingface.co/patrickvonplaten/roberta_shared_bbc_xsum
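The exact training setup lives in the notebooks mentioned below; purely as orientation, a warm-started model can be plugged into the Seq2Seq Trainer roughly as follows. The dataset name, column names, sequence lengths and batch sizes are illustrative assumptions and do not reproduce the hyper-parameters behind the numbers above; the precise transformers/datasets APIs vary by version:

from datasets import load_dataset
from transformers import (BertTokenizerFast, EncoderDecoderModel,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

def preprocess(batch):
    # encode articles as encoder inputs and highlights as decoder labels
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=128)
    # ignore padding in the loss by replacing pad token ids with -100
    inputs["labels"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in seq] for seq in targets["input_ids"]
    ]
    return inputs

train_data = load_dataset("cnn_dailymail", "3.0.0", split="train").map(
    preprocess, batched=True, remove_columns=["article", "highlights", "id"])
val_data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:1%]").map(
    preprocess, batched=True, remove_columns=["article", "highlights", "id"])

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-cnn_dailymail",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    fp16=True,  # assumes a GPU with fp16 support
)

trainer = Seq2SeqTrainer(
    model=bert2bert,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()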
Two notebooks are also attached to the model cards, showing how Encoder-Decoder models can be trained using the code on master.
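All numbers above are Rouge2 f-measure values. As an aside, such a score can be computed with the rouge metric from the datasets library (backed by the rouge_score package); the toy strings below and the exact metric API are version-dependent assumptions:

from datasets import load_metric

rouge = load_metric("rouge")

predictions = ["sara went to the market to buy vegetables"]
references = ["sara went to the market to buy fresh vegetables"]

results = rouge.compute(predictions=predictions, references=references)
# "mid" is the point estimate of the bootstrap aggregate returned by the metric
print(round(results["rouge2"].mid.fmeasure, 4))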