I have a question regarding the datasets library's implementation of the ROUGE score metric for NLP text summarization; for the avoidance of doubt, I am referring to the implementation loaded as follows:
from datasets import load_metric

rouge_score = load_metric("rouge")
rouge_score.compute(
    predictions=..., references=..., use_stemmer=False
)
For predictions and references, the compute method accepts a list of strings, where each string corresponds to one reference summary or one predicted summary. These summaries are passed in untokenized (otherwise predictions and references would have to be lists of lists of strings, which is not the case). Since the various ROUGE scores must be calculated from tokenized text, I was wondering which tokenizer is used by default under the hood; I am assuming it is an English tokenizer. Is it possible to change the tokenizer choice?
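For concreteness, this is roughly how I am calling it (I believe each entry in the returned dictionary is an aggregate of Score(precision, recall, fmeasure) tuples, but I may be misreading the output format):

from datasets import load_metric

rouge_score = load_metric("rouge")
results = rouge_score.compute(
    # one untokenized summary string per example
    predictions=["the cat sat on the mat"],
    references=["the cat was sitting on the mat"],
    use_stemmer=False,
)
# e.g. results["rouge1"].mid.fmeasure, if I am reading the aggregate output correctly
print(results["rouge1"])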
Hi! We internally use Google Research's implementation of ROUGE, and its (default) tokenization code is available here: google-research/tokenize.py at master · google-research/google-research · GitHub. Note that this library recently added support for passing a custom tokenizer, so feel free to open an issue in the evaluate repo if you are interested in that feature (evaluate is now our metrics library, and the metric scripts in datasets are in maintenance mode).
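For example, using Google Research's rouge_score package directly, a custom tokenizer would look roughly like this (a sketch only: it assumes a recent rouge_score version whose RougeScorer accepts a tokenizer argument, i.e. any object exposing a tokenize(text) method; argument names may differ in older versions):

from rouge_score import rouge_scorer

class WhitespaceTokenizer:
    # hypothetical custom tokenizer: plain whitespace splitting, no lowercasing or stemming
    def tokenize(self, text):
        return text.split()

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"],
    use_stemmer=False,
    tokenizer=WhitespaceTokenizer(),  # drop this argument to fall back to the default tokenizer
)

# score(target, prediction) returns a dict mapping each ROUGE type to a
# Score(precision, recall, fmeasure) tuple
scores = scorer.score(
    "The cat sat on the mat.",        # reference
    "A cat was sitting on the mat.",  # prediction
)
print(scores["rouge1"])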
Hi @mariosasko, many thanks for your reply and for letting me know about the switch to the evaluate library.
Would you mind answering another quick query vaguely related to this? Happy to open a new question if needed. At the moment I am getting an F1 score (called fmeasure in the library) that is lower than both the ROUGE precision and recall. How is this possible? Is fmeasure the same as F1, calculated with the formula 2 * precision * recall / (precision + recall)?
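For concreteness, here is the small check I ran. For a single example the harmonic mean always lies between precision and recall, so my guess is that this has to do with how scores are aggregated across examples; the toy numbers below just illustrate that possibility and are not the library's actual aggregation logic:

def fmeasure(p, r):
    # the standard F1 / harmonic-mean formula
    return 2 * p * r / (p + r) if (p + r) else 0.0

# for one example, fmeasure sits between precision and recall
print(fmeasure(0.8, 0.4))  # 0.533...

# but averaged over examples, F1 can end up below both average precision and average recall
examples = [(1.0, 0.1), (0.1, 1.0)]  # (precision, recall) per example
avg_p = sum(p for p, _ in examples) / len(examples)                # 0.55
avg_r = sum(r for _, r in examples) / len(examples)                # 0.55
avg_f = sum(fmeasure(p, r) for p, r in examples) / len(examples)   # ~0.18
print(avg_p, avg_r, avg_f)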