facebook/bart-large-cnn has a low ROUGE score on cnn_dailymail

I tested on the test split from tensorflow_datasets and used the Python rouge library to compute the ROUGE scores. The scores are quite low compared to those reported in the paper.
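The scoring step looks roughly like this (a sketch; hyps and refs are placeholder lists standing in for the generated summaries and the reference highlights):

from rouge import Rouge

# hyps: generated summaries, refs: reference highlights (placeholders here)
hyps = ["generated summary text ..."]
refs = ["reference highlights text ..."]

scores = Rouge().get_scores(hyps, refs, avg=True)

which produces the dict below: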

{'rouge-1': {'f': 0.38628074837405213,
             'p': 0.38253581551915466,
             'r': 0.4136606028772784},
 'rouge-2': {'f': 0.1810805831229415,
             'p': 0.17948749808930747,
             'r': 0.193921872080545},
 'rouge-l': {'f': 0.3747852342130126,
             'p': 0.37128779953880464,
             'r': 0.3958861147871471}}

The scores reported in the paper:
BART 44.16 21.28 40.90 (R1, R2, RL)

Parameters passed to the generate function:

num_beams=4,
length_penalty=2.0,
max_length=256,
min_length=10,
no_repeat_ngram_size=3

Hi @LiuYangyang,

Can you post the parameters you used for the generate function? (assuming you used generate)


num_beams=4,
length_penalty=2.0,
max_length=256,
min_length=10,
no_repeat_ngram_size=3

Can you try using the reported parameters? These are the params used by the authors:

num_beams=4,
length_penalty=2.0,
max_length=142,
min_length=56,
no_repeat_ngram_size=3,
do_sample=False,
early_stopping=True,
decoder_start_token_id=tokenizer.eos_token_id,
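
For a single article, plugging these in looks roughly like this (a sketch, assuming facebook/bart-large-cnn and a placeholder article string):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

article = "(CNN) Placeholder article text ..."
inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    inputs["input_ids"].to(device),
    attention_mask=inputs["attention_mask"].to(device),
    num_beams=4,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    no_repeat_ngram_size=3,
    do_sample=False,
    early_stopping=True,
    decoder_start_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))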

Using the reported parameters, the score is still quite far from the reported one.

{'rouge-1': {'f': 0.3857450692467576,
             'p': 0.3623898533303822,
             'r': 0.4360411721146263},
 'rouge-2': {'f': 0.17910119007512806,
             'p': 0.168426671688575,
             'r': 0.20252494493851977},
 'rouge-l': {'f': 0.3742737179767776,
             'p': 0.3537086320034924,
             'r': 0.4147416244432235}}

Can you send the full code you ran, including the tokenizer?

I just kicked off a command to replicate; it will be done in 45 mins.

I just reran XSum and got results close to the paper:

python run_eval.py facebook/bart-large-xsum \
  xsum/test.source \
  bart_xsum_test_gens.txt \
  --reference_path xsum/test.target \
  --bs 16 --score_path tmp_rouge.json --fp16

Runtime: 40mins.

{'rouge1': 0.4518,
 'rouge2': 0.21746,
 'rougeL': 0.3636}

(these are mid.fmeasure values, see calculate_rouge_score)

This is my code to evaluate:

https://gist.github.com/WangHexie/859e946947d275f959083a93f0fa2486

I used the code from run_eval.py:

from rouge_score import rouge_scorer, scoring

ROUGE_KEYS = ["rouge1", "rouge2", "rougeL"]

scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)
aggregator = scoring.BootstrapAggregator()

# context_sum / highlights hold the texts being compared, line by line
for reference_ln, output_ln in zip(context_sum, highlights):
    scores = scorer.score(reference_ln, output_ln)
    aggregator.add_scores(scores)

result = aggregator.aggregate()
t = {k: v.mid.fmeasure for k, v in result.items()}

By switching the ROUGE module and changing from the average to the mid fmeasure, the scores get closer, though rougeL is worse:

{'rouge1': 0.4372446280018615,
 'rouge2': 0.2083375655890657,
 'rougeL': 0.30426740583778056}

Try changing 512->1024 in BatchBuilder:

    sentences_input_ids = tokenizer.batch_encode_plus(sentences, return_tensors='pt', max_length=512,  pad_to_max_length=True,  truncation=True).to(device)
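
i.e., the encoding call would become something like this (same call, only the max length changed):

    sentences_input_ids = tokenizer.batch_encode_plus(
        sentences,
        return_tensors='pt',
        max_length=1024,  # use BART's full 1024-token context instead of 512
        pad_to_max_length=True,
        truncation=True,
    ).to(device)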

I had similar issues reproducing the ROUGE scores of facebook/bart-large-cnn on the CNN/Daily Mail dataset.

First I tried the evaluation script examples/seq2seq/run_eval.py with the following arguments:

export DATA_DIR=cnn_dm
python3 ./run_eval.py \
    facebook/bart-large-cnn \
    $DATA_DIR/val.source \
    dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 1024 \
    --max_target_length 56 \
    --fp16 \
    --bs 32

This returned the following results:
{"rouge1": 44.8251, "rouge2": 21.6955, "rougeL": 31.2251, "n_obs": 13368, "runtime": 12773, "seconds_per_sample": 0.9555}

In addition I tried to create my own minimal script for debugging:

from datasets import load_dataset, load_metric
from transformers import BartForConditionalGeneration, BartTokenizer
from tqdm import tqdm
import torch
DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def generate_summaries(lns, metric, batch_size=16, device=DEFAULT_DEVICE):
    
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn") 
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
    
    article_batches = list(chunks(lns['article'], batch_size))
    target_batches = list(chunks(lns['highlights'], batch_size))
    
    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        dct = tokenizer.batch_encode_plus(article_batch,
                                          max_length=1024,
                                          truncation=True,
                                          padding='max_length',
                                          return_tensors="pt")
        summaries = model.generate(
            input_ids=dct["input_ids"].to(device),
            attention_mask=dct["attention_mask"].to(device),
            num_beams=4,
            length_penalty=2.0,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
            early_stopping=True,
            decoder_start_token_id=tokenizer.eos_token_id,
        )
        
        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]

        metric.add_batch(predictions=dec, references=target_batch)
    score = metric.compute()
    return score

dataset = load_dataset("cnn_dailymail", "3.0.0")
rouge_metric = load_metric("rouge")
score = generate_summaries(dataset['test'], rouge_metric)

This results in the following scores:

score['rouge1'].high.fmeasure, score['rougeL'].high.fmeasure
(0.42937086857299406, 0.30214912710797714)

score['rouge1'].mid.fmeasure, score['rougeL'].mid.fmeasure
(0.4270572516743017, 0.29984448611319414)

score['rouge1'].low.fmeasure, score['rougeL'].low.fmeasure
(0.4248562688572123, 0.297676569998942)

In both cases, the ROUGE-L scores in particular are massively underwhelming. Do you have any idea what the problem might be? There seems to be quite a large gap between the reported and measured results. Thanks for any pointer you can provide.

An update from my side: after some additional experimenting and digging, I found the potential issue in my script. This issue on the datasets repo explains the differences in sentence splitting between different ROUGE implementations. Since the ROUGE metric in the datasets library corresponds to (2), its ROUGE-L ignores sentence splitting. I had to add the following step after decoding to insert a newline after each sentence:

dec = [d.replace('. ', '.\n') for d in dec]
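
To make the effect concrete, this is what the replacement does to a (hypothetical) decoded summary:

dec = ["First sentence. Second sentence."]
dec = [d.replace('. ', '.\n') for d in dec]
print(dec[0])
# First sentence.
# Second sentence.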

And then I got the following scores with the new datasets==1.1.0 release, which also returns ROUGE-Lsum by default:

for metric in ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']:
    print(f'{metric}:\t{score[metric].mid.fmeasure:.4f}')

rouge1: 0.4238
rouge2: 0.2033
rougeL: 0.2977
rougeLsum: 0.3956

This is much closer to the reported scores, assuming rougeLsum corresponds to the rougeL measured by other libraries. Can you confirm that this is the underlying issue? Also, there is still a discrepancy, especially in ROUGE-1, which remains unexplained.

calculate_rouge_score has this fix: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L481!
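
The fix there amounts to splitting each text into sentences with nltk and re-joining them with newlines before scoring, roughly like this (a sketch; the helper name here is made up):

import nltk

nltk.download("punkt", quiet=True)

def add_newline_after_each_sentence(text: str) -> str:
    # Put every sentence on its own line so that ROUGE-L / rougeLsum sees sentence boundaries.
    return "\n".join(nltk.sent_tokenize(text))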

Thanks for the pointer - that seems to do pretty much the same fix as I did. Should this be added to the datasets.metrics ROUGE implementation? It would be nice if one could use it to reproduce the results.

It would be a fair amount of work to remove the nltk dependency, but it would definitely be a useful contribution. I would guess (though I'm not sure) that assuming some fixed string marks the end of a sentence would affect the metrics.
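
For example, an nltk-free variant along those lines might look like this (a hypothetical sketch, not the library's implementation):

import re

def naive_sentence_split(text: str) -> str:
    # Assumes '.', '!' or '?' followed by whitespace marks the end of a sentence.
    return "\n".join(re.split(r"(?<=[.!?])\s+", text.strip()))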