Reproducing bert2bert_cnn_dm result

Hi @patrickvonplaten, I was trying to evaluate your bert2bert_cnn_daily_mail model on the CNN-DM test set, but the ROUGE scores I got are much lower than those reported in the model card (I didn't do any fine-tuning). I followed the generation parameters in your colab notebook. I'd really appreciate it if you could point out what I did wrong. Below is my complete code:

```python
import torch
from transformers import BertTokenizerFast, AutoModelForSeq2SeqLM
from datasets import load_dataset, load_metric
from tqdm.auto import trange

device = "cuda" if torch.cuda.is_available() else "cpu"
metric = load_metric("rouge")

model_string = "patrickvonplaten/bert2bert_cnn_daily_mail"
tokenizer = BertTokenizerFast.from_pretrained(model_string)
model = AutoModelForSeq2SeqLM.from_pretrained(model_string, return_dict=True).to(device)

test_data = load_dataset("cnn_dailymail", "3.0.0", split="test[:10%]")
inputs = tokenizer(test_data["article"], padding="max_length", truncation=True,
                   max_length=512, return_tensors="pt")
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)
batch_size = 24

# references are the "highlights" column of CNN-DM
ground_truths = test_data["highlights"]
baseline_preds = []
for i in trange(0, len(test_data), batch_size):
    outputs = model.generate(input_ids[i:i + batch_size],
                             attention_mask=attention_mask[i:i + batch_size],
                             max_length=142, min_length=56, no_repeat_ngram_size=3,
                             early_stopping=True, length_penalty=2.0, num_beams=4)
    baseline_preds.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

metric.compute(predictions=baseline_preds, references=ground_truths,
               rouge_types=["rouge1", "rouge2", "rougeL"])
```

The ROUGE-1/-2/-L F1 scores I got were 30.07/12.16/21.59. I only used the first 10% of the examples in the test set (1,149) because I have limited computing resources, but I don't expect the ROUGE-2 score to be that far from the 18.22 reported in the model card.
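For anyone wanting to sanity-check scores independently of the `datasets` metric wrapper, here is a minimal sketch of the core of ROUGE-N F1 (n-gram overlap only; the official `rouge_score` package additionally applies stemming, sentence splitting for ROUGE-L, and bootstrap aggregation, so its numbers will differ somewhat — this is only a rough cross-check, not a replacement):

```python
from collections import Counter

def rouge_n_f1(prediction: str, reference: str, n: int = 1) -> float:
    """N-gram overlap F1, the core of ROUGE-N (no stemming, no aggregation)."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    if not pred or not ref:
        return 0.0
    # clipped overlap: each n-gram counts at most as often as it appears in both
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge_n_f1("the cat sat on the mat", "the cat is on the mat"), 3))  # → 0.833
```

A few spot checks like this on individual prediction/reference pairs can tell you quickly whether the generated summaries are genuinely off or whether something in the metric bookkeeping (e.g. misaligned predictions and references) is the culprit.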