If I go to https://huggingface.co/Helsinki-NLP/opus-mt-ru-en, I see a bunch of dataset entries in the benchmark table. How do I know what they are so that if I do my own training I can compare apples to apples?
For example, this particular model has a list:
[...]
|newstest2015-enru.ru.en |30.4 |0.568|
|newstest2016-enru.ru.en |30.1 |0.565|
[...]
|newstest2019-ruen.ru.en |31.4 |0.576|
|Tatoeba.ru.en |61.1 |0.736|
After some research I've concluded that these are most likely WMT test sets (e.g. https://www.statmt.org/wmt16/), but I could be wrong. And even if I got that right, I can't tell whether newstest2015 belongs to wmt15 or wmt16. The reason is that wmt16 doesn't include any data from 2016: it's built from News Crawl articles up to and including 2015, according to http://www.statmt.org/wmt16/translation-task.html. So I can't tell whether the year in newstest2016-enru.ru.en refers to the name of the WMT release or to the last included year of the News Crawl dump.
Any suggestions on how I could figure out which entry is the right one to compare against if I fine-tune on wmt16?
edit: Since wmt19 exists and its scoreboard lists "newstest2019" as its most recent entry, the year in the entry most likely refers to the WMT release and not to the News Crawl data. So if I train on wmt16 I'd compare against newstest2016-enru.ru.en.
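In case it helps anyone reading later, here is a rough sketch of how I'd reproduce that comparison myself. It is not the script the model authors used; it assumes the `test` split of the wmt16 ru-en config on the Hugging Face hub is newstest2016, and that `transformers`, `datasets`, `sacrebleu`, and `torch` are installed:

```python
# Sketch: score Helsinki-NLP/opus-mt-ru-en on the wmt16 ru-en test split
# (newstest2016) with sacreBLEU, to compare against the model card's
# newstest2016-enru.ru.en row. Assumption: the HF wmt16 "test" split
# matches the newstest2016 set used on the model card.
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu

model_name = "Helsinki-NLP/opus-mt-ru-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

test = load_dataset("wmt16", "ru-en", split="test")
sources = [ex["translation"]["ru"] for ex in test]
references = [ex["translation"]["en"] for ex in test]

# Translate in small batches (slow on CPU; use a GPU if available).
hypotheses = []
batch_size = 16
for i in range(0, len(sources), batch_size):
    batch = tokenizer(sources[i:i + batch_size], return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    hypotheses.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
# Note: the chr-F scale (0-1 vs 0-100) may differ between sacrebleu
# versions and the model card, and tokenization choices can shift BLEU
# a bit, so expect numbers close to, not identical to, the card's.
print(f"BLEU: {bleu.score:.1f}  chr-F: {chrf.score:.3f}")
```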
That said, perhaps it would make things easier for users if the contributed model's page identified which datasets appear in its benchmark table, with a link to the source, or at least an official name so the source can be found. Also, since link rot is common, perhaps a backup link to the Wayback Machine could be added.
Thank you.
p.s. now that I have investigated this model, the helpful links for its benchmark entries would have been:
- all but the last entry: http://opus.nlpl.eu/WMT-News.php and maybe the original http://www.statmt.org/wmt19/
- the last entry: http://opus.nlpl.eu/Tatoeba.php and maybe the original https://tatoeba.org/eng/