Which tokenizer does the "rouge" metric use under the hood?

I have a question regarding the datasets library's implementation of the ROUGE score metric for NLP text summarization; for the avoidance of doubt, I am referring to the implementation loaded as follows:

from datasets import load_metric
rouge_score = load_metric("rouge")
rouge_score.compute(
    predictions=..., references=..., use_stemmer=False
)

For predictions and references, the compute method accepts a list of strings, where each string corresponds to one reference summary or one predicted summary. These summaries are passed in untokenized (otherwise predictions and references would have to be lists of lists of strings, which is not the case). Since the various ROUGE scores must be calculated from tokenized text, I was wondering which tokenizer is used by default under the hood. I am assuming it is an English tokenizer. Is it possible to change the tokenizer choice?
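
For concreteness, here is a minimal sketch of the kind of call I mean; the example strings are made up, and the keys/attributes I read from the result are my assumptions about the returned object rather than anything taken from the documentation:

from datasets import load_metric

rouge_score = load_metric("rouge")

predictions = ["the cat sat on the mat"]        # model-generated summaries
references = ["a cat was sitting on the mat"]   # gold summaries

# Plain (untokenized) strings go in; the metric tokenizes internally.
result = rouge_score.compute(
    predictions=predictions, references=references, use_stemmer=False
)

# Each entry (e.g. "rouge1", "rouge2", "rougeL") appears to hold low/mid/high
# aggregates, each with precision, recall and fmeasure.
print(result["rouge1"].mid.fmeasure)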

Hi! We internally use Google Research’s implementation of ROUGE, and its (default) tokenization code is available here: google-research/tokenize.py at master · google-research/google-research · GitHub. Note that this library recently added support for passing a custom tokenizer, so feel free to open an issue in the evaluate repo if you are interested in that feature (evaluate is our metrics library now, and the metric scripts in datasets are in maintenance mode).
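
For illustration, here is a rough sketch of how the underlying rouge_score package can be used directly, first with its default tokenization (roughly: lowercase, strip non-alphanumeric characters, split on whitespace, per the tokenize.py file linked above) and then with a custom tokenizer object. The exact argument names and supported versions may differ, so treat this as a sketch rather than the canonical API:

from rouge_score import rouge_scorer

# Default tokenization, as used by the datasets/evaluate "rouge" metric.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
print(scorer.score("A cat was sitting on the mat.", "The cat sat on the mat."))

# Recent versions accept a tokenizer object exposing tokenize(text) -> list of tokens.
class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.lower().split()

custom_scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rougeL"], tokenizer=WhitespaceTokenizer()
)
print(custom_scorer.score("A cat was sitting on the mat.", "The cat sat on the mat."))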

Hi @mariosasko
Many thanks for your reply and for letting me know about the switch to the evaluate library.
Would you mind answering another quick query vaguely related to this? Happy to open a new question if needed. At the moment I am getting an F1 score (which is called fmeasure in the library) that is lower than both the ROUGE precision and the ROUGE recall. How is this possible? Is fmeasure the same as F1, i.e. calculated with the formula 2 * precision * recall / (precision + recall)?
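
For reference, this is the quick numeric check I did with the formula I have in mind (the precision and recall values are made up, purely to illustrate what I mean by F1):

# Hypothetical per-example values, only to illustrate the formula.
precision = 0.50
recall = 0.25

fmeasure = 2 * precision * recall / (precision + recall)
print(fmeasure)  # 0.3333..., which lies between precision and recall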