Leveraging pre-trained checkpoints for summarization

patrickvonplaten · August 23, 2020, 10:51pm

The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in: https://arxiv.org/abs/1907.12461.

Similarly, the EncoderDecoderModel framework of Transformers can be used to initialize Encoder-Decoder models from “bert-base-cased” or “roberta-base” for summarization.

One can initialize such a model with weights from pre-trained checkpoints via:

from transformers import EncoderDecoderModel
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

A couple of models based on “bert-base-cased” or “roberta-base” have been trained this way for the CNN/Daily-Mail summarization task with the purpose of verifying that the EncoderDecoderModel framework is functional.

Below are the ROUGE-2 F-measure results on the test set of CNN/Daily-Mail:

Bert2GPT2: 15.19 - https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16
Bert2Bert: 16.1 - https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
Roberta2Roberta: 16.79 - https://huggingface.co/patrickvonplaten/roberta2roberta-cnn_dailymail-fp16
Roberta2Roberta (shared): 16.59 - https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16

Note: These models were trained without any hyper-parameter search and in fp16 precision. For more detail, please refer to the respective model card.
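For readers unfamiliar with the metric, here is a minimal sketch of how a ROUGE-2 F-measure can be computed for a single reference/prediction pair. It assumes plain lowercased whitespace tokenization and no stemming, and the function names and example sentences are made up for illustration. The scores in this thread appear to be on a 0-100 scale and were produced with a proper ROUGE implementation, so this sketch will not reproduce them exactly.

```python
from collections import Counter


def bigrams(text):
    """Count bigrams of a lowercased, whitespace-tokenized string (simplifying assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_fmeasure(reference, prediction):
    """Harmonic mean of bigram precision and recall between prediction and reference."""
    ref, pred = bigrams(reference), bigrams(prediction)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair: 3 of the 5 bigrams on each side match.
reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge2_fmeasure(reference, prediction)
print(f"ROUGE-2 F-measure: {100 * score:.2f}")
```

A corpus-level score is then typically an average over all test examples, which is how a single number per model is obtained.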

UPDATE:

Better models using the Seq2Seq Trainer and code on the current master give the following results:

BERT2BERT on CNN/Dailymail: 18.22 - https://huggingface.co/patrickvonplaten/bert2bert_cnn_daily_mail
Roberta2Roberta (shared) on BBC/XSum: 16.89 - https://huggingface.co/patrickvonplaten/roberta_shared_bbc_xsum

Two notebooks attached to the model cards also show how Encoder-Decoder models can be trained using master.

sshleifer · August 25, 2020, 2:22am

Interesting results!

Would love to know how fine-tuning and inference times compare to bart-base/bart-large. These are roughly bart-base size, right?
Would also love to know the results on XSum, where gaps between better and worse models get magnified in ROUGE space.
Feels like we desperately need some sort of leaderboard/aggregator, like the one you tried to get going for benchmarking. I know bart-large takes ~24h to get to ~21 ROUGE on CNN. @VictorSanh got 15.5 ROUGE-2 with bart-base on XSum, which felt a little low to me.

Are you using wandb (pip install wandb)? Could you share your logs?

sshleifer · August 25, 2020, 4:44am

Does this mean we can delete bertabs!?

valhalla · August 25, 2020, 9:10am

In my Roberta2Roberta experiment, inference on the CNN test set took 2 hours, 22 minutes on a P100.
I fine-tuned for 16 hours but got much worse results than Patrick: the ROUGE-2 F-measure was just 9.9.

sshleifer · August 25, 2020, 3:49pm

I got 16.6 ROUGE-2 fine-tuning bart-base on XSum, in 3 epochs / 7.5 hrs.

Still way worse than distilbart-xsum-6-6 (20.92) and not that much faster.

cahya · September 8, 2020, 11:24am

Hi @patrickvonplaten,
I tried to reproduce your Bert2GPT2-CNN_dailymail model, but when I train it I get the following error message:
TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'. The gist of my notebook: https://gist.github.com/cahya-wirawan/b36e91cae21a6a7f9a10e1c85f59d9ae
I also use the branch bert2gpt2-cnn_dailymail-fp16 as suggested. It would be nice if you could point out where I went wrong.
Thanks.

guoziyuan · October 27, 2020, 9:40am

Hey @patrickvonplaten, it seems like your results don’t match the paper’s results.

For example, for the roberta2roberta model, the ROUGE-2 F-measure reported in the paper for the CNN dataset is 18.5, but your result is 16.79.

guoziyuan · October 27, 2020, 9:45am

You note that the models were trained without any hyper-parameter search. Does that mean no other hyper-parameter settings were compared? I also wonder whether fp16 vs. fp32 significantly affects the performance of the model.

guoziyuan · October 28, 2020, 10:13am

Hey, do you know why your result is worse than patrickvonplaten/roberta2roberta-cnn_dailymail-fp16?
I also want to reproduce the result, and I got similar ROUGE-2 scores, around 9.6 or 9.7.
But in the original paper, the ROUGE-2 score is 18.9.
It is weird.
Do we need to increase the batch size or train for more steps?

valhalla · October 29, 2020, 5:17pm

I didn’t investigate it much, @patrickvonplaten will have some ideas about this

guoziyuan · October 30, 2020, 5:09am

It seems that the tokenizer.batch_decode method has been removed; I can’t find it in the documentation.

valhalla · October 31, 2020, 2:42pm

Seems like it’s not included in the docs, but it’s available. See

https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py#L2943


def batch_decode(
    self,
    sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = True,
    **kwargs
) -> List[str]:
    """
    Convert a list of lists of token ids into a list of strings by calling decode.

    Args:

patrickvonplaten · November 2, 2020, 7:26pm

I just added two notebooks showing how to reproduce the results in the paper.
Here is one for Bert2Bert on CNN/Dailymail:

https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb


Here is one for RobertaShared on BBC XSum:

https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb


The Bert2Bert model actually performs a bit better than reported in the paper, the roberta_shared model a bit worse (but training roberta_shared a bit longer would probably close that gap).

The motivation for doing this was to provide some educational material on how to use the EncoderDecoderModel - the exact performance was less important here.

I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!

patrickvonplaten · November 9, 2020, 5:17pm

A longer blog post on this topic is online now: https://huggingface.co/blog/warm-starting-encoder-decoder

Jung · November 10, 2020, 12:30pm

Thanks so much for your blog @patrickvonplaten, you are the hero!

guoziyuan · November 17, 2020, 6:30am

How many epochs did you run for the full training?

patrickvonplaten · November 17, 2020, 2:42pm

I ran for 3 epochs.

ncoop57 · November 17, 2020, 8:11pm

Is it possible to use a pretrained encoder and then an untrained decoder that is just defined by a config (and also vice versa)?

patrickvonplaten · November 18, 2020, 7:28am

Yes! In this case I’d recommend loading the encoder / decoder directly.

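from transformers import AutoModel, BertConfig, BertLMHeadModel, EncoderDecoderModel

encoder = AutoModel.from_pretrained(...)  # pre-trained encoder
# Untrained decoder defined only by a config; it needs causal masking and cross-attention layers.
decoder = BertLMHeadModel(BertConfig(is_decoder=True, add_cross_attention=True))

enc_dec_model = EncoderDecoderModel(encoder=encoder, decoder=decoder)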
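
As a rough offline sanity check of this kind of wiring, the sketch below builds both halves from tiny, randomly initialized configs and runs one forward pass. The tiny sizes and the random `BertModel` standing in for the pre-trained encoder are assumptions made only to avoid a download; in practice the encoder would come from `from_pretrained`. It also assumes the decoder config sets `is_decoder=True` and `add_cross_attention=True`, which the decoder needs in order to attend to the encoder outputs.

```python
import torch
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

# Hypothetical tiny sizes, chosen only so the example runs quickly on CPU.
tiny = dict(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# Stand-in for the pre-trained encoder (randomly initialized here, an assumption).
encoder = BertModel(BertConfig(**tiny))

# Untrained decoder defined purely by a config, with cross-attention enabled.
decoder = BertLMHeadModel(BertConfig(is_decoder=True, add_cross_attention=True, **tiny))

enc_dec_model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
enc_dec_model.eval()

# Random token ids (avoiding the pad id 0): one source sequence of length 8, target of length 4.
input_ids = torch.randint(1, 100, (1, 8))
decoder_input_ids = torch.randint(1, 100, (1, 4))

with torch.no_grad():
    outputs = enc_dec_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

# One vocabulary-sized score vector per decoder position.
logits_shape = tuple(outputs.logits.shape)
print(logits_shape)
```

The same pattern should work vice versa (config-only encoder, pre-trained decoder), as long as the decoder checkpoint is loaded with cross-attention enabled.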

ncoop57 · November 19, 2020, 6:33pm

Amazing!! Thanks so much!
