facebook/bart-large-cnn has a low ROUGE score on cnn_dailymail

I had similar issues reproducing the ROUGE scores of facebook/bart-large-cnn on the CNN/Daily Mail dataset.

First I tried the evaluation script examples/seq2seq/run_eval.py with the following arguments:

export DATA_DIR=cnn_dm
python3 ./run_eval.py \
   facebook/bart-large-cnn \
   $DATA_DIR/val.source \
   dbart_val_generations.txt \
   --reference_path $DATA_DIR/val.target \
   --score_path cnn_rouge.json \
   --task summarization \
   --device cuda \
   --max_source_length 1024 \
   --max_target_length 56 \
   --fp16 \
   --bs 32

This returned the following results:
{"rouge1": 44.8251, "rouge2": 21.6955, "rougeL": 31.2251, "n_obs": 13368, "runtime": 12773, "seconds_per_sample": 0.9555}

In addition, I wrote my own minimal script for debugging:

from datasets import load_dataset, load_metric
from transformers import BartForConditionalGeneration, BartTokenizer
from tqdm import tqdm
import torch
DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def generate_summaries(lns, metric, batch_size=16, device=DEFAULT_DEVICE):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

    # Split articles and reference summaries into parallel batches.
    article_batches = list(chunks(lns['article'], batch_size))
    target_batches = list(chunks(lns['highlights'], batch_size))

    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        dct = tokenizer.batch_encode_plus(article_batch,
                                          max_length=1024,
                                          truncation=True,
                                          padding='max_length',
                                          return_tensors="pt")
        # Generation settings mirror the defaults shipped in the
        # checkpoint's config (task_specific_params for summarization).
        summaries = model.generate(
            input_ids=dct["input_ids"].to(device),
            attention_mask=dct["attention_mask"].to(device),
            num_beams=4,
            length_penalty=2.0,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
            early_stopping=True,
            decoder_start_token_id=tokenizer.eos_token_id,
        )
        
        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]

        metric.add_batch(predictions=dec, references=target_batch)
    score = metric.compute()
    return score

dataset = load_dataset("cnn_dailymail", "3.0.0")
rouge_metric = load_metric('rouge')
score = generate_summaries(dataset['test'], rouge_metric)

This results in the following scores:

score['rouge1'].high.fmeasure, score['rougeL'].high.fmeasure
(0.42937086857299406, 0.30214912710797714)

score['rouge1'].mid.fmeasure, score['rougeL'].mid.fmeasure
(0.4270572516743017, 0.29984448611319414)

score['rouge1'].low.fmeasure, score['rougeL'].low.fmeasure
(0.4248562688572123, 0.297676569998942)
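One thing I noticed while debugging: the rouge metric from datasets wraps Google's rouge_score package, which distinguishes rougeL (LCS over the whole text as one string) from rougeLsum (the summary-level variant computed over newline-separated sentences, which, as far as I understand, is closer to what the original ROUGE script reports). A minimal sketch of the difference; the example strings are made up:

from rouge_score import rouge_scorer

# rougeLsum splits the texts on "\n" and scores at summary level
# (union LCS over sentences); rougeL runs LCS over the whole string.
scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"], use_stemmer=True)
reference = "Police arrested two suspects on Friday.\nThe investigation is ongoing."
prediction = "Two suspects were arrested by police.\nPolice say the investigation continues."
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure, scores["rougeLsum"].fmeasure)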

In both cases, the ROUGE-L scores in particular are massively underwhelming. Do you have any idea what the problem might be? There seems to be quite a large gap between the reported and the measured results. Thanks for any pointers you can provide.