A few wmt seq2seq dataset-related scripts

stas · August 12, 2020, 2:34am

Here are a couple of new wmt dataset seq2seq related scripts:

1.

If you need to compare the performance of your transformers code against fairseq, note that fairseq requires a different dataset format.

I wrote a little script to convert from the dataset format used by transformers for summarization-like (seq2seq) tasks (i.e. train.source, train.target, val.source, val.target, test.source, test.target) to the format fairseq expects for MBART (Multilingual Denoising Pre-training for Neural Machine Translation).

You can use this sample dataset, but also see the next item below.

For more details, please see: https://github.com/stas00/nlp-helpers/tree/master/fairseq/wmt/
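To give a rough idea of the renaming involved in item 1, here is a minimal sketch. It is not the script from the repo linked above. It assumes the target layout is one file per split and language, named `{prefix}.{lang}`, with `val` renamed to `valid`; the function name `convert_to_lang_suffixed` and the `SPLIT_MAP` table are hypothetical. Any tokenization or binarization that fairseq needs afterwards is not covered here.

```python
import shutil
from pathlib import Path

# Assumption: split names used on the transformers side mapped to the
# prefixes used on the fairseq side ("val" becomes "valid").
SPLIT_MAP = {"train": "train", "val": "valid", "test": "test"}


def convert_to_lang_suffixed(src_dir, dst_dir, src_lang, tgt_lang):
    """Copy {split}.source/{split}.target to {prefix}.{src_lang}/{prefix}.{tgt_lang}.

    Returns the list of files written. Missing input files are skipped.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split, prefix in SPLIT_MAP.items():
        for side, lang in (("source", src_lang), ("target", tgt_lang)):
            in_path = src_dir / f"{split}.{side}"
            if not in_path.exists():
                continue  # e.g. a dataset that ships without a test split
            out_path = dst_dir / f"{prefix}.{lang}"
            shutil.copyfile(in_path, out_path)
            written.append(out_path)
    return written
```

For example, `convert_to_lang_suffixed("wmt_en_ro", "wmt_en_ro_fairseq", "en_XX", "ro_RO")` would produce `train.en_XX`, `train.ro_RO`, `valid.en_XX` and so on. The directory names and the language codes here are only illustrative.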

2.

Related, @sshleifer and I have just added a script to download wmt data for any year and language pair and save it in seq2seq format.

You will find it here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
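For readers who just want to see what "seq2seq format" means in practice, here is a minimal sketch of the writing step. It is not the code of download_wmt.py. It assumes each record looks like `{"translation": {"ru": "...", "en": "..."}}`, which is my understanding of how the wmt datasets are exposed by the nlp module; the function name `write_seq2seq_split` is hypothetical.

```python
from pathlib import Path


def write_seq2seq_split(records, src_lang, tgt_lang, out_dir, split):
    """Write one split as line-aligned {split}.source and {split}.target files.

    Assumption: every record is a dict of the form
    {"translation": {src_lang: "...", tgt_lang: "..."}}.
    Returns the number of sentence pairs written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{split}.source"
    tgt_path = out_dir / f"{split}.target"
    count = 0
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for record in records:
            pair = record["translation"]
            # Embedded newlines would break the one-pair-per-line alignment,
            # so they are flattened to spaces.
            src_f.write(pair[src_lang].replace("\n", " ").strip() + "\n")
            tgt_f.write(pair[tgt_lang].replace("\n", " ").strip() + "\n")
            count += 1
    return count
```

The key property is that line N of the .source file and line N of the .target file always form one sentence pair.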

The output of download_wmt.py is only as good as the data in huggingface/nlp. For example, for wmt19-ru-en the current version of the nlp module is missing the test data and has other problems.

Thanks to the tip by @sshleifer, I was able to get the missing files with:

pip install sacrebleu
sacrebleu -t wmt19 -l ru-en --echo src > test.source
sacrebleu -t wmt19 -l ru-en --echo ref > test.target
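After fetching the files with the commands above, it may be worth checking that the two files are still line-aligned. The sketch below is only a sanity check under the assumption that both files are UTF-8 text with one sentence per line; the helper names `count_lines` and `check_parallel` are hypothetical.

```python
def count_lines(path):
    """Count the lines in a UTF-8 text file without loading it all into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(source_path, target_path):
    """Return the number of sentence pairs, or raise if the line counts differ."""
    n_src = count_lines(source_path)
    n_tgt = count_lines(target_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{source_path} has {n_src} lines but {target_path} has {n_tgt}"
        )
    return n_src
```

Calling `check_parallel("test.source", "test.target")` in the directory where the two files were saved returns the pair count, or raises `ValueError` if the files got out of sync.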

Enjoy.
