Which tokenizer does the "rouge" metric use under the hood?

I have a question regarding the datasets library's implementation of the ROUGE score metric for NLP text summarization; for the avoidance of doubt, I am referring to the implementation loaded as follows:

from datasets import load_metric
rouge_score = load_metric("rouge")
rouge_score.compute(
    predictions=..., references=..., use_stemmer=False
)

For predictions and references, the compute method accepts a list of strings, where each string corresponds to one reference summary or one predicted summary. These summaries are passed in untokenized (otherwise predictions and references would have to be lists of lists of strings, which is not the case). Since the various ROUGE scores must be calculated from tokenized text, I was wondering which tokenizer is used by default under the hood. I am assuming it is an English tokenizer. Is it possible to change the tokenizer choice?
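
For concreteness, here is a minimal sketch of the kind of call I mean; the example strings are made up, and the keys/attributes I read from the result are my assumptions about the returned object rather than anything taken from the documentation:

from datasets import load_metric

rouge_score = load_metric("rouge")

predictions = ["the cat sat on the mat"]        # model-generated summaries
references = ["a cat was sitting on the mat"]   # gold summaries

# Plain (untokenized) strings go in; the metric tokenizes internally.
result = rouge_score.compute(
    predictions=predictions, references=references, use_stemmer=False
)

# Each entry (e.g. "rouge1", "rouge2", "rougeL") appears to hold low/mid/high
# aggregates, each with precision, recall and fmeasure.
print(result["rouge1"].mid.fmeasure)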

Hi! We internally use Google Research’s implementation of ROUGE, and its (default) tokenization code is available here: google-research/tokenize.py at master · google-research/google-research · GitHub. Note that this library recently added support for passing a custom tokenizer, so feel free to open an issue in the evaluate repo if you are interested in that feature (evaluate is our metrics library now, and the metric scripts in datasets are in maintenance mode).
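
For illustration, here is a rough sketch of how the underlying rouge_score package can be used directly, first with its default tokenization (roughly: lowercase, strip non-alphanumeric characters, split on whitespace, per the tokenize.py file linked above) and then with a custom tokenizer object. The exact argument names and supported versions may differ, so treat this as a sketch rather than the canonical API:

from rouge_score import rouge_scorer

# Default tokenization, as used by the datasets/evaluate "rouge" metric.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
print(scorer.score("A cat was sitting on the mat.", "The cat sat on the mat."))

# Recent versions accept a tokenizer object exposing tokenize(text) -> list of tokens.
class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.lower().split()

custom_scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rougeL"], tokenizer=WhitespaceTokenizer()
)
print(custom_scorer.score("A cat was sitting on the mat.", "The cat sat on the mat."))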

Hi @mariosasko
Many thanks for your reply and for letting me know about the switch to the evaluate library.
Would you mind answering another quick query vaguely related to this? Happy to open a new question if needed. At the moment I am getting an F1 score (which is called fmeasure in the library) that is lower than both the ROUGE precision and the ROUGE recall. How is this possible? Is fmeasure the same as F1, i.e. calculated with the formula 2 * precision * recall / (precision + recall)?
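
For reference, this is the quick numeric check I did with the formula I have in mind (the precision and recall values are made up, purely to illustrate what I mean by F1):

# Hypothetical per-example values, only to illustrate the formula.
precision = 0.50
recall = 0.25

fmeasure = 2 * precision * recall / (precision + recall)
print(fmeasure)  # 0.3333..., which lies between precision and recall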