If I go to https://huggingface.co/Helsinki-NLP/opus-mt-ru-en, I see a bunch of dataset entries in the benchmark table. How do I know what they are so that if I do my own training I can compare apples to apples?
For example, this particular model has a list:
[...]
|newstest2015-enru.ru.en |30.4 |0.568|
|newstest2016-enru.ru.en |30.1 |0.565|
[...]
|newstest2019-ruen.ru.en |31.4 |0.576|
|Tatoeba.ru.en |61.1 |0.736|
After some research I've concluded that these are most likely WMT test sets (e.g. https://www.statmt.org/wmt16/), but I could be wrong. And even if I got that right, I can't tell whether newstest2015 belongs to wmt15 or wmt16. The reason is that wmt16 doesn't include any data from 2016: it's built from News Crawl articles up to and including 2015, according to http://www.statmt.org/wmt16/translation-task.html. So I can't tell whether the year in newstest2016-enru.ru.en refers to the name of the WMT release or to the last included year of the News Crawl dump.
Any suggestions on how I could figure out which entry is the right one to compare against if I fine-tune on wmt16?
edit: Since wmt19 exists and its scoreboard lists "newstest2019" as its most recent entry, the year in the entry most likely refers to the WMT release and not to the News Crawl data. So if I train on wmt16 I'd compare against newstest2016-enru.ru.en.
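In case it helps anyone reading later, here is a rough sketch of how I'd reproduce that comparison myself. It is not the script the model authors used; it assumes the `test` split of the wmt16 ru-en config on the Hugging Face hub is newstest2016, and that `transformers`, `datasets`, `sacrebleu`, and `torch` are installed:

```python
# Sketch: score Helsinki-NLP/opus-mt-ru-en on the wmt16 ru-en test split
# (newstest2016) with sacreBLEU, to compare against the model card's
# newstest2016-enru.ru.en row. Assumption: the HF wmt16 "test" split
# matches the newstest2016 set used on the model card.
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu

model_name = "Helsinki-NLP/opus-mt-ru-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

test = load_dataset("wmt16", "ru-en", split="test")
sources = [ex["translation"]["ru"] for ex in test]
references = [ex["translation"]["en"] for ex in test]

# Translate in small batches (slow on CPU; use a GPU if available).
hypotheses = []
batch_size = 16
for i in range(0, len(sources), batch_size):
    batch = tokenizer(sources[i:i + batch_size], return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    hypotheses.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
# Note: the chr-F scale (0-1 vs 0-100) may differ between sacrebleu
# versions and the model card, and tokenization choices can shift BLEU
# a bit, so expect numbers close to, not identical to, the card's.
print(f"BLEU: {bleu.score:.1f}  chr-F: {chrf.score:.3f}")
```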
That said, perhaps it would make things easier for users if the contributed model's page identified which datasets appear in its benchmark table, with a link to the source, or at least an official name so the source can be found. Also, since link rot is common, perhaps a backup link to the Wayback Machine could be added.
Thank you.
p.s. now that I have investigated this model, the helpful links for its benchmark entries would have been:
- all but the last entry: http://opus.nlpl.eu/WMT-News.php and maybe the original http://www.statmt.org/wmt19/
- the last entry: http://opus.nlpl.eu/Tatoeba.php and maybe the original https://tatoeba.org/eng/