I am fine-tuning T5 on a translation task, using Flax.
The translation task is between two fairly similar dialects.
Watching how the quality of the translation progresses, I see that it starts by predicting something close to the source. Then it calculates the loss between that (near-)source prediction and the target, and then slowly, after several iterations, gets to something that is fairly similar to the target.
Since this is a seq2seq model, I guess it already has a fairly good way of doing source -> encode/decode -> source, and it then calculates the loss based on the [encoded-decoded] source vs the target, instead of calculating the loss based on the raw source vs the target.
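For reference, here is a minimal sketch of the loss computation as I understand it from the Flax seq2seq examples (the toy sentences and the inline shift-right are my own, just for illustration): the decoder is teacher-forced with the right-shifted target, and the cross-entropy is taken between the decoder logits and the target tokens.

```python
import jax.numpy as jnp
import optax
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = FlaxT5ForConditionalGeneration.from_pretrained("t5-small")

# Toy source/target pair standing in for my two dialects
source = tokenizer("translate: sentence in dialect A", return_tensors="np")
labels = tokenizer("sentence in dialect B", return_tensors="np").input_ids

# Teacher forcing: the decoder input is the target shifted right by one,
# starting from decoder_start_token_id (the pad token for T5)
decoder_input_ids = jnp.concatenate(
    [
        jnp.full((labels.shape[0], 1), model.config.decoder_start_token_id),
        labels[:, :-1],
    ],
    axis=-1,
)

logits = model(
    input_ids=source.input_ids,
    attention_mask=source.attention_mask,
    decoder_input_ids=decoder_input_ids,
).logits  # (batch, target_len, vocab_size)

# Cross-entropy between decoder logits and target tokens: as far as I can
# tell, the loss is always taken against the target, never against the source
loss = optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
print(loss)
```

If that is right, the loss is never computed against the source directly, which is why the early source-like predictions surprised me.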
I am fairly new to these models. Does this make sense?