ROUGE implementation in Hugging Face Datasets

Hello, I’m using the ROUGE metric implemented in Hugging Face Datasets to score my summarization outputs.

The result looks like this:

{'rouge1': AggregateScore(low=Score(precision=0.16620308156871524, recall=0.18219819615984395, fmeasure=0.16226017699359463), mid=Score(precision=0.17274338501705871, recall=0.1890957812369246, fmeasure=0.16823877588620403), high=Score(precision=0.17934569582981455, recall=0.1965626706042028, fmeasure=0.17491509794856058)), 
'rouge2': AggregateScore(low=Score(precision=0.12478835737689957, recall=0.1362113231755514, fmeasure=0.12055941950062395), mid=Score(precision=0.1303967602691664, recall=0.1423747229852964, fmeasure=0.1258363976151122), high=Score(precision=0.13654527560789362, recall=0.1488071465116122, fmeasure=0.13184989406704056)), 
'rougeL': AggregateScore(low=Score(precision=0.16568068818352072, recall=0.1811919016674486, fmeasure=0.1614784523482225), mid=Score(precision=0.17156684723552357, recall=0.1879777628247058, fmeasure=0.16720699286250762), high=Score(precision=0.17788847350584547, recall=0.1948899838530898, fmeasure=0.17316501523379826))}

What do the low, mid, and high categories mean? If I only want one ROUGE F1 value, which one should I choose?
Thanks for your attention!

HuggingFace Datasets implements the ROUGE metric using the rouge_score Python package, as seen here.

Looking at the source code of this package, low, mid and high appear to be the bounds of a bootstrap confidence interval on the scores. As seen here, there are three bounds: low (row 0), mid (row 1) and high (row 2). Mid is the 50th percentile of the resampled means, so in practice it is very close to the plain mean, while the low and high bounds are set by confidence_interval (which defaults to 0.95, meaning they are the 2.5th and 97.5th percentiles, i.e. a 95% confidence interval on the mean). If you only want one ROUGE F1 value, the fmeasure of mid (for whichever variant you care about, e.g. rougeL) is the usual one to report.
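
To make the low/mid/high idea more concrete, here is a minimal sketch of how such a triple can be produced from per-example scores by bootstrap resampling. This is not the rouge_score code itself: the function name bootstrap_interval, the nearest-rank percentile and the example scores are my own assumptions for illustration, and it uses only the standard library.

```python
import random
import statistics


def bootstrap_interval(scores, confidence_interval=0.95, n_samples=1000, seed=0):
    """Return (low, mid, high) percentiles of bootstrap-resampled means.

    Hypothetical helper for illustration; not part of rouge_score.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-example scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_samples)
    )

    def percentile(q):
        # Nearest-rank percentile: a simplification chosen to avoid NumPy.
        idx = min(len(means) - 1, max(0, round(q * (len(means) - 1))))
        return means[idx]

    delta = (1 - confidence_interval) / 2
    return percentile(delta), percentile(0.5), percentile(1 - delta)


# Made-up per-example F1 scores, for illustration only.
f1_scores = [0.10, 0.25, 0.15, 0.30, 0.05, 0.20, 0.18, 0.12]
low, mid, high = bootstrap_interval(f1_scores)
print(f"low={low:.4f} mid={mid:.4f} high={high:.4f}")
```

With enough resamples, mid lands close to the plain average of the scores, and low/high bracket it. That is why mid is the number to quote when a single value is needed.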

Thanks for your reply~~ This helps me a lot. :smile: @nielsr