I had similar issues reproducing the ROUGE scores of facebook/bart-large-cnn on the CNN/Daily Mail dataset.
First I tried the evaluation script examples/seq2seq/run_eval.py
with the following arguments:
export DATA_DIR=cnn_dm
python3 ./run_eval.py \
    facebook/bart-large-cnn \
    $DATA_DIR/val.source \
    dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 1024 \
    --max_target_length 56 \
    --fp16 \
    --bs 32
This returned the following results:
{"rouge1": 44.8251, "rouge2": 21.6955, "rougeL": 31.2251, "n_obs": 13368, "runtime": 12773, "seconds_per_sample": 0.9555}
In addition, I created my own minimal script for debugging:
from datasets import load_dataset, load_metric
from transformers import BartForConditionalGeneration, BartTokenizer
from tqdm import tqdm
import torch

DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def generate_summaries(lns, metric, batch_size=16, device=DEFAULT_DEVICE):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
    article_batches = list(chunks(lns['article'], batch_size))
    target_batches = list(chunks(lns['highlights'], batch_size))
    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        dct = tokenizer.batch_encode_plus(
            article_batch,
            max_length=1024,
            truncation=True,
            padding='max_length',
            return_tensors="pt",
        )
        # Beam search settings (intended to match the bart-large-cnn defaults).
        summaries = model.generate(
            input_ids=dct["input_ids"].to(device),
            attention_mask=dct["attention_mask"].to(device),
            num_beams=4,
            length_penalty=2.0,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
            early_stopping=True,
            decoder_start_token_id=tokenizer.eos_token_id,
        )
        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]
        metric.add_batch(predictions=dec, references=target_batch)
    score = metric.compute()
    return score
dataset = load_dataset("cnn_dailymail", "3.0.0")
rouge_metric = load_metric('rouge')
score = generate_summaries(dataset['test'], rouge_metric)
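(For faster debugging iterations I first ran it on a small slice; roughly like this, using the standard datasets select API:)

# Quick sanity check on a small subset before running the full test split.
subset = dataset['test'].select(range(100))
quick_score = generate_summaries(subset, load_metric('rouge'), batch_size=8)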
Running the script on the full test split results in the following scores:
score['rouge1'].high.fmeasure, score['rougeL'].high.fmeasure
(0.42937086857299406, 0.30214912710797714)
score['rouge1'].mid.fmeasure, score['rougeL'].mid.fmeasure
(0.4270572516743017, 0.29984448611319414)
score['rouge1'].low.fmeasure, score['rougeL'].low.fmeasure
(0.4248562688572123, 0.297676569998942)
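To put these on the same scale as the run_eval.py output, I multiply the mid f-measures by 100, roughly like this:

# Convert the aggregated `datasets` ROUGE scores (f-measure in [0, 1])
# to the 0-100 scale that run_eval.py reports, using the mid estimate.
comparable = {name: round(agg.mid.fmeasure * 100, 4) for name, agg in score.items()}
print(comparable)  # e.g. rouge1 ~ 42.71 and rougeL ~ 29.98 for the run above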
In both cases the ROUGE-L scores in particular fall well short of the reported numbers. Do you have any idea what the problem might be? There seems to be quite a large gap between the reported and the measured results. Thanks for any pointers you can provide.