Finetuning T5 on a translation task

I am finetuning T5 on a translation task. I am using Flax.
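For context, my setup looks roughly like this (a minimal sketch; the checkpoint name, the sequence length, and the example sentence pair are placeholders, not my actual data):

```python
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

# Placeholder checkpoint; I am finetuning a standard T5 checkpoint with Flax
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = FlaxT5ForConditionalGeneration.from_pretrained("t5-small")

# One made-up source/target pair; the two dialects are fairly similar
source = "translate dialect A to dialect B: a sentence written in dialect A"
target = "a sentence written in dialect B"

inputs = tokenizer(source, return_tensors="np", padding="max_length",
                   max_length=64, truncation=True)
labels = tokenizer(target, return_tensors="np", padding="max_length",
                   max_length=64, truncation=True).input_ids
```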

The translation task is between two fairly similar dialects.

Watching how the quality of the translation progresses, I see that it starts by predicting “” (an essentially empty output). The loss is then calculated between the prediction and the target, and after several iterations the output slowly becomes something fairly similar to the source, and then keeps improving, finally ending up with a translation. Even though I end up with something qualitatively decent (and a BLEU score above 80), the process still seems slow and “unstable”.
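For reference, the per-step loss I am computing is, as far as I understand it, the usual teacher-forced cross-entropy between the decoder's predictions and the target tokens. A minimal sketch, assuming the `model`, `tokenizer`, `inputs` and `labels` from above (dropout and the optimizer step omitted for brevity):

```python
import numpy as np
import optax


def shift_tokens_right(label_ids, pad_token_id, decoder_start_token_id):
    # Teacher forcing: the decoder is fed the target shifted one position right
    decoder_input_ids = np.zeros_like(label_ids)
    decoder_input_ids[:, 1:] = label_ids[:, :-1]
    decoder_input_ids[:, 0] = decoder_start_token_id
    return decoder_input_ids


def loss_fn(params):
    decoder_input_ids = shift_tokens_right(
        labels, model.config.pad_token_id, model.config.decoder_start_token_id
    )
    logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        params=params,
    ).logits

    # Cross-entropy between the predicted token distributions and the target
    # tokens, averaged over the non-padding positions
    token_loss = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    padding_mask = labels != model.config.pad_token_id
    return (token_loss * padding_mask).sum() / padding_mask.sum()


loss = loss_fn(model.params)
```

In the actual training loop this goes through `jax.value_and_grad` and an optax optimizer, but the loss itself is just this prediction-vs-target cross-entropy.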

Since this is a seq2seq model, I guess it already has a fairly good way of doing source -> encode/decode -> source, and is then effectively calculating the loss between the [encoded-decoded] source and the target, instead of calculating the loss directly between the source and the target.
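To make that guess a bit more concrete: if I just generate from the model early in finetuning, the output looks much more like the source than like the target, which is what makes me think the pretrained encode/decode path is essentially copying the source at first. A small check along these lines (again assuming the names from the sketch above):

```python
# Greedy generation from the current model; early in training the decoded
# string resembles the source sentence rather than the target
generated = model.generate(inputs["input_ids"], max_length=64)
print(tokenizer.decode(generated.sequences[0], skip_special_tokens=True))
```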

I am fairly new to these models. Does this make sense?