Where is the source for the benchmark dataset entries on a model's page?

If I go to https://huggingface.co/Helsinki-NLP/opus-mt-ru-en, I see a bunch of dataset entries in the benchmark table. How do I know what they are so that if I do my own training I can compare apples to apples?

For example this particular model has a list:

[...]
|newstest2015-enru.ru.en |30.4 |0.568|
|newstest2016-enru.ru.en |30.1 |0.565|
[...]
|newstest2019-ruen.ru.en |31.4 |0.576|
|Tatoeba.ru.en |61.1 |0.736|

After some research I have concluded that these are most likely WMT datasets (e.g. https://www.statmt.org/wmt16/), but I could be wrong. And even if I got it right, I can’t tell whether newstest2015 belongs to wmt16 or wmt15. This is because wmt16 doesn’t include any data from 2016. It’s made from News Crawl articles up to and including 2015, according to http://www.statmt.org/wmt16/translation-task.html. So I can’t tell whether the year in newstest2016-enru.ru.en refers to the name of the dataset or to the last included year of the News Crawl dump.

Any suggestions as to how I could find which entry is the right one if I finetune on wmt16?
edit: Since wmt19 is already out and this model's scorecard lists “newstest2019” as the most recent entry, the year in each entry most likely refers to the WMT release and not to the News Crawl data. So if I train on wmt16 I’d compare with newstest2016-enru.ru.en.
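To make that reading of the entry names concrete, here is a minimal Python sketch that splits an entry such as newstest2016-enru.ru.en into its parts and guesses the matching WMT release. The helper name `parse_entry`, the regular expression, and the newstestYYYY to wmtYY mapping are all my assumptions, inferred from the table above and the reasoning in the edit; none of this comes from the model authors.

```python
import re

# Assumed naming pattern, inferred from the benchmark table only:
#   <testset><year>-<pair>.<src>.<tgt>   e.g. newstest2016-enru.ru.en
#   <testset>.<src>.<tgt>                e.g. Tatoeba.ru.en
ENTRY_RE = re.compile(
    r"^(?P<testset>[A-Za-z]+?)(?P<year>\d{4})?"
    r"(?:-(?P<pair>[a-z]{4}))?"
    r"\.(?P<src>[a-z]{2,3})\.(?P<tgt>[a-z]{2,3})$"
)


def parse_entry(name):
    """Split a benchmark entry name into test set, year, and languages.

    Hypothetical helper: the 'wmt_guess' field encodes the assumption that
    newstestYYYY is the test set of the WMT release of year YYYY.
    """
    match = ENTRY_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised entry name: {name!r}")
    parts = match.groupdict()
    year = parts["year"]
    if parts["testset"] == "newstest" and year is not None:
        parts["wmt_guess"] = "wmt" + year[2:]
    else:
        parts["wmt_guess"] = None
    return parts


if __name__ == "__main__":
    for entry in ("newstest2016-enru.ru.en", "Tatoeba.ru.en"):
        print(entry, parse_entry(entry))
```

Under that assumption, the entry newstest2016-enru.ru.en maps to wmt16 with source ru and target en, while Tatoeba.ru.en has no year and no WMT guess. Whether the guess is right still has to be confirmed with the model authors.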

That said, it would be easier for users if a contributed model’s webpage identified which datasets appear in its benchmark table, with a link to the source or at least an official name so that the source can be found. Also, since link rot is common, perhaps a backup link to the Wayback Machine.

Thank you.

p.s. now that I have investigated this model, the helpful links from the benchmark entry would have been:

I guess the only way is to open an issue on their GitHub repo or contact them directly to ask.

Oh, I see. Looking at the other models, a model card is just a text entry. For some reason I thought there was a specific API/form for each field.

Then, yes, it’s just a matter of putting the notes in place - I will contact them via their github - thank you for this suggestion, @RichardWang!

Filed an issue there: https://github.com/Helsinki-NLP/OPUS-MT-train/issues/17