How to train a translation model from scratch to reproduce "Attention Is All You Need"?

Hi,

Motivation

I want to reproduce the experiments described in "Attention Is All You Need", i.e., train the transformer base model from scratch with exactly the architecture used in the paper.
I am looking for an implementation of the transformer base model that I can train from scratch with Hugging Face, but I cannot find one.

I found that there are many pre-trained models (e.g., T5, BART, MarianMT), but I would like to train a transformer base model from scratch so that I can compare different optimizers during pre-training.

The experiments are based on WMT14 English-German.
I am currently using FSMT because I cannot find an implementation of the original transformer (I am not sure whether it is a good choice).
I was wondering which model implementation, dataset, and tokenizer would be the best choices.
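
For context, this is roughly how I imagine building the model from a fresh config so that it matches the base transformer (6 layers, d_model 512, 8 heads, FFN dim 2048, dropout 0.1). The vocabulary sizes below are placeholders, and the rest is just my guess at mapping the paper's hyperparameters onto FSMTConfig:

```python
from transformers import FSMTConfig, FSMTForConditionalGeneration

# "Base" transformer hyperparameters from Attention Is All You Need:
# 6 encoder/decoder layers, d_model=512, 8 attention heads, FFN dim 2048, dropout 0.1.
# The vocab sizes are placeholders and should match whatever tokenizer I end up using.
config = FSMTConfig(
    langs=["en", "de"],
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    dropout=0.1,
)

# Instantiating from the config (instead of from_pretrained) gives randomly
# initialized weights, i.e. training from scratch.
model = FSMTForConditionalGeneration(config)
```

The two dataset/tokenizer combinations I am considering are: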

  1. the stas/wmt14-en-de-pre-processed dataset with the facebook/wmt19-en-de tokenizer
  2. the raw wmt14 dataset with the facebook/wmt19-en-de tokenizer

In particular, I do not know which tokenizer should be used; a rough sketch of what I mean is below.
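
Here is roughly what loading option 2 together with the WMT19 tokenizer would look like. The preprocessing function is my own guess, and I am not sure whether the FSMT tokenizer handles the target side correctly when reused on WMT14 data:

```python
from datasets import load_dataset
from transformers import FSMTTokenizer

# Option 2: the raw WMT14 de-en pairs from the Hub
# (option 1 would be load_dataset("stas/wmt14-en-de-pre-processed") instead).
raw = load_dataset("wmt14", "de-en")

# Tokenizer taken from the FSMT WMT19 checkpoint; I am not sure whether a
# vocabulary built for WMT19 is appropriate for reproducing the WMT14 numbers.
tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")

def preprocess(batch):
    # wmt14 examples look like {"translation": {"en": "...", "de": "..."}}.
    src = [ex["en"] for ex in batch["translation"]]
    tgt = [ex["de"] for ex in batch["translation"]]
    model_inputs = tokenizer(src, truncation=True, max_length=256)
    # Tokenize the German side as labels; I am not certain this is the right
    # way to obtain target-side ids with the FSMT tokenizer.
    labels = tokenizer(text_target=tgt, truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)
```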

In summary, I want to reproduce the results of "Attention Is All You Need" but have no idea how to train a translation model from scratch.
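
To make the question concrete, this is roughly what I imagine the training setup could look like, reusing `model`, `tokenizer`, and `tokenized` from the snippets above. It is only a sketch: all hyperparameters are placeholders, and I do not know whether Seq2SeqTrainer can reproduce the paper's inverse-square-root learning-rate schedule out of the box:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder hyperparameters, not the paper's exact schedule.
args = Seq2SeqTrainingArguments(
    output_dir="transformer-base-wmt14-en-de",
    per_device_train_batch_size=32,
    learning_rate=7e-4,
    warmup_steps=4000,
    max_steps=100_000,
    logging_steps=100,
    save_steps=10_000,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

If this is roughly right, I would then compare optimizers by passing a custom `optimizers=(optimizer, lr_scheduler)` pair to the trainer, but please correct me if there is a better way.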

Thanks in advance for any suggestions!
@patrickvonplaten @valhalla