Hi @Hildweig, it was fine-tuned the way you fine-tune/train any encoder-decoder Transformer model.
On a high level, the encoder takes the input sequence and creates a hidden representation of it.
The decoder then receives the encoder representation and is trained to generate the output sequence auto-regressively using teacher forcing.
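In case it helps, here's a rough sketch of what a single teacher-forcing training step looks like with the Transformers API. The checkpoint name, the toy input/target pair, and the learning rate below are just placeholders for illustration, not the actual setup used here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # placeholder checkpoint, not the one used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encoder input and target output for one toy example.
inputs = tokenizer("translate English to German: Hello, how are you?",
                   return_tensors="pt")
labels = tokenizer("Hallo, wie geht es dir?", return_tensors="pt").input_ids

# Passing `labels` makes the model shift them right internally to build the
# decoder inputs, so at every step the decoder is fed the gold previous
# token (teacher forcing) and the loss is computed against the true targets.
outputs = model(**inputs, labels=labels)
loss = outputs.loss

# One optimizer step (lr is a placeholder).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```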
You may have already read this, but The Illustrated Transformer by Jay Alammar explains it really well.
All the training hparams are in this file.