Leveraging pre-trained checkpoints for summarization

The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in: https://arxiv.org/abs/1907.12461.

Similarly, the EncoderDecoderModel framework of 🤗 Transformers can be used to initialize Encoder-Decoder models from “bert-base-cased” or “roberta-base” for summarization.

One can initialize such a model with weights from pre-trained checkpoints via:

from transformers import EncoderDecoderModel
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
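
Before fine-tuning a warm-started model, the combined config also needs `decoder_start_token_id` and `pad_token_id`, because the decoder inputs are derived by shifting the labels to the right. The following is a minimal sketch of that step, not the exact training setup used for the checkpoints in this thread. To keep it runnable without downloading anything, it assumes tiny, randomly initialized BERT configs (the `tiny_bert_config` helper and all sizes are hypothetical), arbitrary token ids, and random tensors in place of real articles and summaries.

```python
import torch
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

torch.manual_seed(0)


def tiny_bert_config():
    # Hypothetical tiny config so the sketch runs without downloading checkpoints.
    return BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=64,
        max_position_embeddings=64,
    )


# from_encoder_decoder_configs marks the decoder config as a decoder with cross-attention.
config = EncoderDecoderConfig.from_encoder_decoder_configs(tiny_bert_config(), tiny_bert_config())
model = EncoderDecoderModel(config=config)

# Required so that decoder inputs can be built from the labels.
# The ids 0 and 1 are arbitrary assumptions; with a real tokenizer one would
# typically use its pad token id and its [CLS]/BOS token id.
model.config.pad_token_id = 0
model.config.decoder_start_token_id = 1

input_ids = torch.randint(2, 100, (2, 12))  # stand-in for tokenized articles
labels = torch.randint(2, 100, (2, 6))      # stand-in for tokenized summaries

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss  # scalar language-modeling loss over the label tokens
print(loss.item())
```

With a model created via `from_encoder_decoder_pretrained`, the same two config attributes would be set on `bert2bert.config` before training.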

A couple of models based on “bert-base-cased” or “roberta-base” have been trained this way for the CNN/Daily-Mail summarization task with the purpose of verifying that the EncoderDecoderModel framework is functional.

Below are the ROUGE-2 F-measure results on the CNN/Daily-Mail test set:

Bert2GPT2: 15.19 - https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16
Bert2Bert: 16.1 - https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Roberta2Roberta: 16.79 - https://huggingface.co/patrickvonplaten/roberta2roberta-cnn_dailymail-fp16
Roberta2Roberta (shared): 16.59 - https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16

Note: These models were trained without any hyper-parameter search and with fp16 precision. For more detail, please refer to the respective model cards.
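
For readers unfamiliar with the metric, here is a simplified sketch of how a ROUGE-2 F-measure is computed from bigram overlap. It assumes lowercased whitespace tokenization and a single reference; the scores reported in this thread come from a full ROUGE implementation with its own tokenization, so this sketch will not reproduce them exactly. The function names are illustrative only.

```python
from collections import Counter


def bigrams(text):
    # Count adjacent token pairs after lowercasing and whitespace splitting.
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f(reference, candidate):
    ref, cand = bigrams(reference), bigrams(candidate)
    # Counter intersection clips each bigram match to its minimum count.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
score = rouge2_f(reference, candidate)

# 3 of 5 bigrams match on each side, so precision = recall = 0.6.
print(f"{100 * score:.2f}")  # 60.00
```

The numbers in this thread are on the same 0-100 scale as the printed value.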


Better models using the Seq2Seq Trainer and code on the current master give the following results:

BERT2BERT on CNN/Dailymail: 18.22 - https://huggingface.co/patrickvonplaten/bert2bert_cnn_daily_mail
Roberta2Roberta (shared) on BBC/XSum: 16.89 - https://huggingface.co/patrickvonplaten/roberta_shared_bbc_xsum

Also, two notebooks attached to the model cards show how Encoder-Decoder models can be trained using master.


Interesting results!

  1. Would love to know how fine-tuning and inference times compare to bart-base/bart-large. These are roughly bart-base size, right?

  2. Would also love to see results on XSum, where gaps between better and worse models get magnified in ROUGE space.

  3. Feels like we desperately need some sort of leaderboard/aggregator, like the one you tried to get going for benchmarking. I know bart-large takes ~24h to get to ~21 ROUGE on CNN. @VictorSanh got 15.5 ROUGE-2 with bart-base on XSum, which felt a little low to me.

  4. Are you using wandb (pip install wandb)? Could you share your logs?

Does this mean we can delete bertabs!?

  1. In my Roberta2Roberta experiment, inference on the CNN test set took 2 hours 22 minutes on a P100.
    I fine-tuned for 16 hours but got much worse results than Patrick: the ROUGE-2 F-measure was just 9.9.

I got 16.6 ROUGE-2 fine-tuning bart-base on XSum in 3 epochs / 7.5 hrs.

Still way worse than distilbart-xsum-6-6 (20.92) and not that much faster.

Hi @patrickvonplaten,
I tried to reproduce your Bert2GPT2-CNN_dailymail model, but when I train it I get the following error message:
TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'. The gist of my notebook: https://gist.github.com/cahya-wirawan/b36e91cae21a6a7f9a10e1c85f59d9ae
I also use the branch bert2gpt2-cnn_dailymail-fp16 as suggested. It would be nice if you could point out where I went wrong.

Hey @patrickvonplaten, it seems like your results don’t match the results in the paper.

For example, for the roberta2roberta model,
the ROUGE-2 F-measure in the paper for the CNN dataset is 18.5,
but your result is 16.79.

You note that the models were trained without any hyper-parameter search. Does that mean no other parameter settings were compared? I also wonder whether fp16 vs. fp32 significantly affects the performance of the model.

Hey, do you know why your result is worse than
patrickvonplaten/roberta2roberta-cnn_dailymail-fp16?
I also want to reproduce the result, and I got a similar ROUGE-2 score of 9.6 or 9.7.
But in the original paper, the ROUGE-2 score is 18.9.
It is weird.
Do we need to increase the batch size or train for more steps?

I didn’t investigate it much, @patrickvonplaten will have some ideas about this

It seems that the tokenizer.batch_decode method has been removed.
I can’t find it in the documentation.

Seems like it’s not included in the docs, but it’s available. See

I just added two notebooks showing how to reproduce the results in the paper.
Here is one for Bert2Bert on CNN/Dailymail:

Here is one for RobertaShared on BBC XSum:

The Bert2Bert model actually performs a bit better than reported in the paper, while the roberta_shared model performs a bit worse (though training roberta_shared a bit longer would probably close that gap).

The motivation for doing this is to provide some educational material on how to use the EncoderDecoderModel; the exact performance was less important here.

I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!


A longer blog post on this topic is online now: https://huggingface.co/blog/warm-starting-encoder-decoder


Thanks so much for your blog @patrickvonplaten, you are the hero!

How many epochs did you run for full training?

I ran for 3 epochs.

Is it possible to use a pretrained encoder and then an untrained decoder that is just defined by a config (and also vice versa)?

Yes! In this case I’d recommend loading the encoder / decoder directly.

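from transformers import AutoModel, BertConfig, BertLMHeadModel, EncoderDecoderModel

encoder = AutoModel.from_pretrained(...)
# Randomly initialized decoder; these flags enable causal masking and cross-attention layers.
decoder = BertLMHeadModel(BertConfig(is_decoder=True, add_cross_attention=True))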

enc_dec_model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
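
As a runnable sketch of this pattern: a decoder built from a config needs `is_decoder=True` and `add_cross_attention=True` at construction time, otherwise its layers are created without cross-attention and cannot attend to the encoder outputs. To avoid any download, the example below assumes a tiny, randomly initialized BertModel as a stand-in for the pre-trained encoder; all sizes and the random input tensors are hypothetical.

```python
import torch
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

# Hypothetical tiny encoder, standing in for one loaded with from_pretrained.
encoder = BertModel(BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                               num_attention_heads=2, intermediate_size=64))

# Untrained decoder defined only by its config.
decoder_config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                            num_attention_heads=2, intermediate_size=64,
                            is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel(decoder_config)

enc_dec_model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
enc_dec_model.eval()

input_ids = torch.randint(0, 100, (2, 8))          # batch of 2 source sequences
decoder_input_ids = torch.randint(0, 100, (2, 5))  # batch of 2 target prefixes
with torch.no_grad():
    logits = enc_dec_model(input_ids=input_ids,
                           decoder_input_ids=decoder_input_ids).logits

print(tuple(logits.shape))  # (2, 5, 100): batch, target length, vocabulary size

# The decoder layers carry a cross-attention module because of the config flags.
has_cross_attention = hasattr(decoder.bert.encoder.layer[0], "crossattention")
```

The reverse case (pre-trained decoder, config-defined encoder) follows the same pattern with the roles swapped.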

Amazing!! Thanks so much!