The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in: https://arxiv.org/abs/1907.12461.
Similarly, the EncoderDecoderModel framework of Transformers can be used to initialize Encoder-Decoder models from “bert-base-cased” or “roberta-base” for summarization.
One can initialize such a model with weights from pre-trained checkpoints via:
from transformers import EncoderDecoderModel

# Warm-start both the encoder and the decoder from the same pre-trained BERT checkpoint
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
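A shared-weight variant (encoder and decoder initialized from the same RoBERTa checkpoint and tied) can be created similarly. The snippet below is a minimal sketch that assumes the tie_encoder_decoder flag of from_encoder_decoder_pretrained is available in your Transformers version; please double-check against the version you have installed.

from transformers import EncoderDecoderModel

# Shared variant: encoder and decoder come from the same checkpoint and,
# if the flag is supported, share their parameters.
roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True
)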
A couple of models based on “bert-base-cased” or “roberta-base” have been trained this way on the CNN/Daily-Mail summarization task in order to verify that the EncoderDecoderModel framework is functional.
Below are the Rouge2 f-measure results on the test set of CNN/Daily-Mail:
Would love to know how fine-tuning/inference times compare to bart-base/bart-large. These are roughly bart-base size, right?
Would also love to know the numbers on XSum, where gaps between good and worse models get magnified in ROUGE space.
Feels like we desperately need some sort of leaderboard/aggregator, like the one you tried to get going for benchmarking. I know bart-large takes ~24h to get to ~21 ROUGE on CNN. @VictorSanh got 15.5 ROUGE-2 with bart-base on XSum, which felt a little low to me.
In my Roberta2Roberta experiment, inference on the CNN test dataset on a P100 took 2 hours, 22 minutes.
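For reference, the kind of generation loop being timed looks roughly like the sketch below. This is simplified: the checkpoint name is the public one mentioned later in this thread, the batch size and exact generation settings may differ from what was actually used, and the special-token ids are assumed to be stored in the saved config.

import torch
from transformers import EncoderDecoderModel, RobertaTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# Fine-tuned checkpoint (assumed name, taken from later in this thread)
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/roberta2roberta-cnn_dailymail-fp16").to(device)
model.eval()

def summarize(articles):
    # Tokenize a batch of articles, run beam search, decode the summaries.
    # decoder_start_token_id etc. are expected to come from the saved config.
    inputs = tokenizer(articles, padding="max_length", truncation=True, max_length=512, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            num_beams=4,
            max_length=142,
            early_stopping=True,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)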
I fine-tuned for 16 hours but got much worse results than Patrick: the ROUGE-2 F-measure was just 9.9.
Hi @patrickvonplaten,
I tried to reproduce your Bert2GPT2-CNN_dailymail, but when I train it I get the following error message: TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'. The gist of my notebook: https://gist.github.com/cahya-wirawan/b36e91cae21a6a7f9a10e1c85f59d9ae
I also use the branch bert2gpt2-cnn_dailymail-fp16 as suggested. It would be nice if you could point out where I went wrong.
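For context, the model in my notebook is created roughly as sketched below (a sketch only, not the exact notebook code); my current guess is that the GPT-2 decoder's forward() simply does not accept encoder_hidden_states in the version I have installed, which would explain the TypeError.

from transformers import EncoderDecoderModel

# BERT encoder warm-started from bert-base-cased, GPT-2 decoder warm-started from gpt2.
# The decoder needs cross-attention layers so that its forward() accepts
# encoder_hidden_states; older GPT-2 implementations did not have them.
bert2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")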
Thanks.
The models below were trained without any hyper-parameter search. Does that claim mean that no other hyper-parameter settings were tried and compared? I also wonder whether fp16 vs. fp32 significantly affects the performance of the model.
Hey, do you know why your result is worse than patrickvonplaten/roberta2roberta-cnn_dailymail-fp16?
I also want to reproduce the result, and I got a similar ROUGE-2 score of around 9.6 or 9.7.
But in the original paper, the ROUGE-2 score is 18.9.
That is strange.
Do we need to increase the batch size or train for more steps?
I just added two notebooks showing how to reproduce the results in the paper.
Here is one for Bert2Bert on CNN/Dailymail:
Here is one for RobertaShared on BBC XSum:
The Bert2Bert model actually performs a bit better than reported in the paper, the roberta_shared model a bit worse (but training roberta_shared a bit longer would probably close that gap).
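If you just want the gist without opening the notebooks, the Bert2Bert setup boils down to something like the condensed sketch below. The hyper-parameter values shown are indicative only and the actual training is done with a Seq2SeqTrainer loop over the full dataset; the notebook has the exact settings.

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# Special-token and beam-search settings in the spirit of the notebook
# (indicative values only):
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
bert2bert.config.max_length = 142
bert2bert.config.min_length = 56
bert2bert.config.no_repeat_ngram_size = 3
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 4

# A single training step: labels are the tokenized reference summaries,
# with padding positions set to -100 so they are ignored by the loss.
inputs = tokenizer(["an article ..."], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(["its summary ..."], padding="max_length", truncation=True, max_length=128, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100
outputs = bert2bert(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels, return_dict=True)
outputs.loss.backward()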
The motivation for doing this is to provide some educational material on how to use the EncoderDecoderModel - the exact performance was less important here.
I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!