How to train a translation model from scratch to reproduce "Attention Is All You Need"?

Hi,

Motivation

I want to reproduce the experiments described in "Attention Is All You Need", i.e., train the transformer base model from scratch with exactly the architecture used in the paper.
I am looking for an implementation of the transformer base model that I can train from scratch with Hugging Face, but I cannot find one.

I found that there are many pre-trained models (e.g., T5, BART, MarianMT), but I would like to train a transformer base model from scratch so that I can compare different optimizers during pre-training.

The experiments are based on WMT14 English-German.
I am currently using FSMT because I cannot find an implementation of the original transformer (I am not sure whether it is a good choice).
I was wondering which model implementation, dataset, and tokenizer would be the best choices.
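
For context, this is roughly how I imagine building the model from a fresh config so that it matches the base transformer (6 layers, d_model 512, 8 heads, FFN dim 2048, dropout 0.1). The vocabulary sizes below are placeholders, and the rest is just my guess at mapping the paper's hyperparameters onto FSMTConfig:

```python
from transformers import FSMTConfig, FSMTForConditionalGeneration

# "Base" transformer hyperparameters from Attention Is All You Need:
# 6 encoder/decoder layers, d_model=512, 8 attention heads, FFN dim 2048, dropout 0.1.
# The vocab sizes are placeholders and should match whatever tokenizer I end up using.
config = FSMTConfig(
    langs=["en", "de"],
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    dropout=0.1,
)

# Instantiating from the config (instead of from_pretrained) gives randomly
# initialized weights, i.e. training from scratch.
model = FSMTForConditionalGeneration(config)
```

The two dataset/tokenizer combinations I am considering are: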

  1. the stas/wmt14-en-de-pre-processed dataset with the facebook/wmt19-en-de tokenizer
  2. the raw wmt14 dataset with the facebook/wmt19-en-de tokenizer

In particular, I do not know which tokenizer should be used; a rough sketch of what I mean is below.
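
Here is roughly what loading option 2 together with the WMT19 tokenizer would look like. The preprocessing function is my own guess, and I am not sure whether the FSMT tokenizer handles the target side correctly when reused on WMT14 data:

```python
from datasets import load_dataset
from transformers import FSMTTokenizer

# Option 2: the raw WMT14 de-en pairs from the Hub
# (option 1 would be load_dataset("stas/wmt14-en-de-pre-processed") instead).
raw = load_dataset("wmt14", "de-en")

# Tokenizer taken from the FSMT WMT19 checkpoint; I am not sure whether a
# vocabulary built for WMT19 is appropriate for reproducing the WMT14 numbers.
tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")

def preprocess(batch):
    # wmt14 examples look like {"translation": {"en": "...", "de": "..."}}.
    src = [ex["en"] for ex in batch["translation"]]
    tgt = [ex["de"] for ex in batch["translation"]]
    model_inputs = tokenizer(src, truncation=True, max_length=256)
    # Tokenize the German side as labels; I am not certain this is the right
    # way to obtain target-side ids with the FSMT tokenizer.
    labels = tokenizer(text_target=tgt, truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)
```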

In summary, I want to reproduce the results of "Attention Is All You Need" but have no idea how to train a translation model from scratch.
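
To make the question concrete, this is roughly what I imagine the training setup could look like, reusing `model`, `tokenizer`, and `tokenized` from the snippets above. It is only a sketch: all hyperparameters are placeholders, and I do not know whether Seq2SeqTrainer can reproduce the paper's inverse-square-root learning-rate schedule out of the box:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder hyperparameters, not the paper's exact schedule.
args = Seq2SeqTrainingArguments(
    output_dir="transformer-base-wmt14-en-de",
    per_device_train_batch_size=32,
    learning_rate=7e-4,
    warmup_steps=4000,
    max_steps=100_000,
    logging_steps=100,
    save_steps=10_000,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

If this is roughly right, I would then compare optimizers by passing a custom `optimizers=(optimizer, lr_scheduler)` pair to the trainer, but please correct me if there is a better way.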

Thanks in advance for any suggestions!
@patrickvonplaten @valhalla