I had similar issues reproducing the ROUGE scores of facebook/bart-large-cnn on the CNN/Daily Mail dataset.
First I tried the evaluation script examples/seq2seq/run_eval.py
with the following arguments:
export DATA_DIR=cnn_dm
python3 ./run_eval.py \
    facebook/bart-large-cnn \
    $DATA_DIR/val.source \
    dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 1024 \
    --max_target_length 56 \
    --fp16 \
    --bs 32
This returned the following results:
{"rouge1": 44.8251, "rouge2": 21.6955, "rougeL": 31.2251, "n_obs": 13368, "runtime": 12773, "seconds_per_sample": 0.9555}
In addition, I created my own minimal script for debugging:
from datasets import load_dataset, load_metric
from transformers import BartForConditionalGeneration, BartTokenizer
from tqdm import tqdm
import torch

DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def generate_summaries(lns, metric, batch_size=16, device=DEFAULT_DEVICE):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
    article_batches = list(chunks(lns['article'], batch_size))
    target_batches = list(chunks(lns['highlights'], batch_size))
    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        dct = tokenizer.batch_encode_plus(
            article_batch,
            max_length=1024,
            truncation=True,
            padding='max_length',
            return_tensors="pt",
        )
        # Beam search settings (intended to match the bart-large-cnn defaults).
        summaries = model.generate(
            input_ids=dct["input_ids"].to(device),
            attention_mask=dct["attention_mask"].to(device),
            num_beams=4,
            length_penalty=2.0,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
            early_stopping=True,
            decoder_start_token_id=tokenizer.eos_token_id,
        )
        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]
        metric.add_batch(predictions=dec, references=target_batch)
    score = metric.compute()
    return score
dataset = load_dataset("cnn_dailymail", "3.0.0")
rouge_metric = load_metric('rouge')
score = generate_summaries(dataset['test'], rouge_metric)
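(For faster debugging iterations I first ran it on a small slice; roughly like this, using the standard datasets select API:)

# Quick sanity check on a small subset before running the full test split.
subset = dataset['test'].select(range(100))
quick_score = generate_summaries(subset, load_metric('rouge'), batch_size=8)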
Running the script on the full test split results in the following scores:
score['rouge1'].high.fmeasure, score['rougeL'].high.fmeasure
(0.42937086857299406, 0.30214912710797714)
score['rouge1'].mid.fmeasure, score['rougeL'].mid.fmeasure
(0.4270572516743017, 0.29984448611319414)
score['rouge1'].low.fmeasure, score['rougeL'].low.fmeasure
(0.4248562688572123, 0.297676569998942)
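To put these on the same scale as the run_eval.py output, I multiply the mid f-measures by 100, roughly like this:

# Convert the aggregated `datasets` ROUGE scores (f-measure in [0, 1])
# to the 0-100 scale that run_eval.py reports, using the mid estimate.
comparable = {name: round(agg.mid.fmeasure * 100, 4) for name, agg in score.items()}
print(comparable)  # e.g. rouge1 ~ 42.71 and rougeL ~ 29.98 for the run above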
In both cases the ROUGE-L scores in particular fall well short of the reported numbers. Do you have any idea what the problem might be? There seems to be quite a large gap between the reported and the measured results. Thanks for any pointers you can provide.