Hi,
Motivation
I want to reproduce the experiments described in Attention Is All You Need, i.e., train a Transformer-base model from scratch with the same architecture as in the paper.
I am looking for a Transformer-base implementation that I can train from scratch with HuggingFace, but I cannot find one. There are many pre-trained models (e.g., T5, BART, MarianMT), but I would like to train a Transformer-base model from scratch so that I can compare different optimizers during pre-training.
The experiments are based on WMT14.
I am currently using FSMT because I cannot find an implementation of the original Transformer, but I am not sure whether it is a good choice.
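Here is roughly how I am initializing it from scratch (just a sketch of what I have so far; I am assuming that reusing the `facebook/wmt19-en-de` tokenizer's vocab sizes is an acceptable stand-in for the joint ~37k BPE vocabulary the paper learns on WMT14 itself):

```python
from transformers import FSMTConfig, FSMTForConditionalGeneration, FSMTTokenizer

# Reuse the WMT19 en-de tokenizer vocab (assumption: stand-in for the
# joint BPE vocabulary that the paper learns on WMT14 itself)
tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")

# Transformer-base hyperparameters from "Attention Is All You Need"
config = FSMTConfig(
    langs=["en", "de"],
    src_vocab_size=tokenizer.src_vocab_size,
    tgt_vocab_size=tokenizer.tgt_vocab_size,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    dropout=0.1,
)

# Constructing from a config (not from_pretrained) gives randomly
# initialized weights, i.e., training really starts from scratch
model = FSMTForConditionalGeneration(config)
```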
I was wondering which model implementation, dataset, and tokenizer are the best choices. The combinations I am considering:
- `stas/wmt14-en-de-pre-processed` with `facebook/wmt19-en-de`
- `wmt14` with `facebook/wmt19-en-de`
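For the second combination, this is how I imagine loading and preprocessing the data (a sketch; I am assuming `de-en` is the right `wmt14` config name on the Hub and that the `text_target` keyword routes the labels through the target-side vocab of `FSMTTokenizer`):

```python
from datasets import load_dataset
from transformers import FSMTTokenizer

# "de-en" is the WMT14 German-English config on the Hub
raw = load_dataset("wmt14", "de-en")

tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")

def preprocess(batch):
    src = [pair["en"] for pair in batch["translation"]]
    tgt = [pair["de"] for pair in batch["translation"]]
    # text_target should tokenize the labels as target-language text
    return tokenizer(src, text_target=tgt, truncation=True, max_length=128)

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)
```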
In particular, I do not know which tokenizer should be used, since the paper learns its BPE vocabulary on WMT14 itself while `facebook/wmt19-en-de` comes from the WMT19 submission.
In summary, I want to reproduce the results of Attention Is All You Need but have no idea how to train a translation model from scratch.
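Roughly, I imagine wiring it together like this (a sketch building on the snippets above; the paper's learning-rate schedule is expressed as a `LambdaLR`, and I am assuming that `Seq2SeqTrainer`'s `optimizers` argument is the right place to plug in the optimizers I want to compare, and that `DataCollatorForSeq2Seq` prepares the decoder inputs correctly for FSMT):

```python
import torch
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# model, tokenizer, tokenized: created in the snippets above

# Adam with the paper's betas/eps; base lr 1.0 so the LambdaLR below
# emits the actual learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

# Schedule from the paper: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
d_model, warmup = 512, 4000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: d_model ** -0.5 * min(max(step, 1) ** -0.5,
                                       max(step, 1) * warmup ** -1.5),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="transformer-base-wmt14",
                                  per_device_train_batch_size=32,
                                  max_steps=100_000),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    optimizers=(optimizer, scheduler),  # swap this pair to compare optimizers
)
trainer.train()
```

Is this a sensible way to set it up, or is there a better-suited model class for this?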
Thanks in advance if you could provide some suggestions!
@patrickvonplaten @valhalla