A few wmt seq2seq dataset-related scripts

Here are a couple of new scripts for working with WMT seq2seq datasets:

1.

If you need to test your transformers code performance against fairseq, the latter requires a different format for the dataset.

I wrote a little script that converts from the dataset format used by transformers summarization-like (seq2seq) tasks (i.e. train.source, train.target, val.source, val.target, test.source, test.target) to the format fairseq expects for MBART: Multilingual Denoising Pre-training for Neural Machine Translation.

You can use this sample dataset, but also see item 2 below.

For more details, please see: https://github.com/stas00/nlp-helpers/tree/master/fairseq/wmt/
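
To give a rough idea of what such a conversion involves, here is a minimal sketch. It is not the script from the repo above. It assumes that fairseq's MBART recipe wants raw text files named by split and language code (e.g. train.en_XX, valid.ro_RO) and that the validation split is called "valid" rather than "val". The function name and the default language codes are made up for illustration, and the later fairseq steps (sentencepiece encoding, fairseq-preprocess) are not covered.

```python
import shutil
from pathlib import Path

# transformers seq2seq split name -> fairseq split name (assumed mapping)
SPLIT_MAP = {"train": "train", "val": "valid", "test": "test"}


def convert_to_fairseq_layout(src_dir, dst_dir, src_lang="en_XX", tgt_lang="ro_RO"):
    """Copy {split}.source/{split}.target to {split}.{lang} files.

    Hypothetical helper: the language codes are placeholders, pick the
    ones that match your language pair.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split, fairseq_split in SPLIT_MAP.items():
        for suffix, lang in (("source", src_lang), ("target", tgt_lang)):
            in_file = src_dir / f"{split}.{suffix}"
            if not in_file.exists():
                continue  # e.g. a dataset that has no test split
            out_file = dst_dir / f"{fairseq_split}.{lang}"
            shutil.copyfile(in_file, out_file)
            written.append(out_file.name)
    return written
```

Calling convert_to_fairseq_layout("wmt_en_ro", "wmt_en_ro_fairseq") on a directory in the transformers layout would leave the originals untouched and return the list of file names it created.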

2.

Relatedly, @sshleifer and I have just added a script that downloads WMT data for any year and language pair and saves it in seq2seq format.

You will find it here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py

The output of this script is only as good as the data in huggingface/nlp. For example, for wmt19-ru-en the current incarnation of the nlp module is missing the test data and has other problems.

Thanks to the tip by @sshleifer, I was able to get the missing files with:

pip install sacrebleu
sacrebleu -t wmt19 -l ru-en --echo src > test.source
sacrebleu -t wmt19 -l ru-en --echo ref > test.target
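
For reference, the seq2seq format mentioned in item 2 is just two line-aligned files per split: line N of {split}.source pairs with line N of {split}.target. Here is a minimal sketch of writing one split in that format. It is not the code of download_wmt.py; the function name is hypothetical, and it assumes each example is a dict keyed by language code, such as {"ru": "...", "en": "..."}.

```python
from pathlib import Path


def write_seq2seq_split(pairs, src_lang, tgt_lang, save_dir, split):
    """Write an iterable of translation dicts as {split}.source/{split}.target.

    Hypothetical helper; `pairs` is assumed to yield dicts such as
    {"ru": "...", "en": "..."}.
    """
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    src_path = save_dir / f"{split}.source"
    tgt_path = save_dir / f"{split}.target"
    count = 0
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for pair in pairs:
            # One example per line: an embedded newline would shift
            # every following line and break the source/target alignment.
            src_f.write(pair[src_lang].replace("\n", " ") + "\n")
            tgt_f.write(pair[tgt_lang].replace("\n", " ") + "\n")
            count += 1
    return count
```

For example, write_seq2seq_split(examples, "ru", "en", "wmt19-ru-en", "val") would create wmt19-ru-en/val.source and wmt19-ru-en/val.target and return the number of examples written.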

Enjoy.
