Here are a couple of new WMT dataset seq2seq-related scripts:
1.
If you need to benchmark your transformers code against fairseq, the latter requires a different dataset format.
I wrote a little script to convert from the dataset format used by transformers summarization-style (seq2seq) tasks (i.e. train.source, train.target, val.source, val.target, test.source, test.target) to what fairseq expects for MBART: Multilingual Denoising Pre-training for Neural Machine Translation.
You can use this sample dataset. But also see the next item below.
For more details, please see: https://github.com/stas00/nlp-helpers/tree/master/fairseq/wmt/
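The core of such a conversion is just renaming and copying files. Here is a minimal sketch of the idea; the language codes, the `val` → `valid` rename, and the function name are my assumptions for illustration, not taken from the actual script linked above:

```python
import shutil
from pathlib import Path

def seq2seq_to_fairseq(data_dir, out_dir, src_lang="en_XX", tgt_lang="ru_RU"):
    """Copy {split}.source/{split}.target into {split}.{lang} pairs.

    Hypothetical sketch: fairseq conventionally names the validation
    split "valid", and MBART uses language codes like en_XX/ru_RU --
    both are assumptions here, so adjust for your setup.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    split_map = {"train": "train", "val": "valid", "test": "test"}
    for split, fs_split in split_map.items():
        for ext, lang in [("source", src_lang), ("target", tgt_lang)]:
            in_file = data_dir / f"{split}.{ext}"
            if in_file.exists():
                shutil.copyfile(in_file, out_dir / f"{fs_split}.{lang}")
```

After this you would still run fairseq's own preprocessing/binarization on the renamed files.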
2.
Related, @sshleifer and I have just added a script to download wmt data for any year and language pair and save it in seq2seq format.
You will find it here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
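The seq2seq on-disk format the script targets is simple: one example per line, with parallel `{split}.source`/`{split}.target` files. A minimal sketch of writing that format (the function name and signature are illustrative, not the script's actual API):

```python
from pathlib import Path

def save_split(pairs, split, data_dir):
    """Write (source, target) string pairs as {split}.source/{split}.target.

    Illustrative helper, not the API of download_wmt.py: one example per
    line, newlines inside an example flattened to spaces so the two files
    stay line-aligned.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with open(data_dir / f"{split}.source", "w") as src_f, \
         open(data_dir / f"{split}.target", "w") as tgt_f:
        for src, tgt in pairs:
            src_f.write(src.replace("\n", " ").strip() + "\n")
            tgt_f.write(tgt.replace("\n", " ").strip() + "\n")
```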
This script's output is only as good as huggingface/nlp is. E.g., for wmt19-ru-en the current incarnation of the nlp module is missing the test data and has other problems.
Thanks to a tip from @sshleifer, I was able to get the missing files with:

```
pip install sacrebleu
sacrebleu -t wmt19 -l ru-en --echo src > test.source
sacrebleu -t wmt19 -l ru-en --echo ref > test.target
```
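After pulling the test set this way, it's worth a quick sanity check that the two files are still line-aligned. A small sketch (my own helper, not part of any of the scripts above):

```python
def check_parallel(src_path, tgt_path):
    """Assert the two sides of a parallel corpus line up and are non-empty.

    Returns the number of examples. Illustrative helper for sanity-checking
    files like test.source/test.target produced by sacrebleu.
    """
    with open(src_path) as f:
        src = f.read().splitlines()
    with open(tgt_path) as f:
        tgt = f.read().splitlines()
    assert len(src) == len(tgt), f"{len(src)} source vs {len(tgt)} target lines"
    assert src, "source file is empty"
    assert all(line.strip() for line in src), "blank source line(s)"
    assert all(line.strip() for line in tgt), "blank target line(s)"
    return len(src)
```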
Enjoy.