Question on HuggingFace's T5 documentation

I have a few questions about how T5 is trained after reading HuggingFace’s T5 doc.

  1. Is this statement actually accurate?

“T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format”

Isn’t the unsupervised objective (filling in masked tokens) used for unsupervised (self-supervised) pre-training, and the supervised one (e.g. “Summarize: …”) used for fine-tuning?

  2. Did I understand the following correctly?

So this is for pre-training
“The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.”
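To make the question concrete, here is a toy sketch of how I picture the corrupted sentence and the sentinel-delimited target. It is plain Python on whitespace tokens, not the actual HuggingFace or T5 preprocessing code; `corrupt_spans` is a name I made up, and the spans to drop are chosen by hand rather than sampled.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (encoder_input, target). Hypothetical helper for illustration:
    real T5 pre-training samples the spans randomly and works on token ids.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel replaces the span
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = "The cute dog walks in the park".split()
enc_in, target = corrupt_spans(tokens, [(1, 3), (5, 6)])
print(" ".join(enc_in))  # The <extra_id_0> walks in <extra_id_1> park
print(" ".join(target))  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

If this picture is right, the target holds only the dropped-out tokens and their sentinels, not the whole original sentence.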

and this is for fine-tuning
“It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input sequence is fed to the model using input_ids. The target sequence is shifted to the right, i.e., prepended by a start-sequence token and fed to the decoder using the decoder_input_ids.”
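This is how I read the "shifted to the right" part, as a minimal sketch in plain Python. The `shift_right` helper and the token ids are made up for illustration; I am assuming the start-sequence token is the pad token with id 0 and that id 1 is the end-of-sequence token, which I believe matches T5's defaults.

```python
def shift_right(labels, decoder_start_token_id=0):
    """Build decoder inputs from the target: prepend the start token, drop the last id.

    Hypothetical helper; assumes the decoder start token is the pad token (id 0).
    """
    return [decoder_start_token_id] + labels[:-1]


# Arbitrary example ids for a target sequence; 1 stands for the end-of-sequence token.
labels = [11, 12, 13, 1]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 11, 12, 13]

# Teacher forcing: at step t the decoder sees the ground-truth prefix
# decoder_input_ids[: t + 1] and is trained to predict labels[t].
steps = [(decoder_input_ids[: t + 1], labels[t]) for t in range(len(labels))]
for seen, expected in steps:
    print(seen, "->", expected)
```

Written this way, the same shift would apply to any input/target pair, whether the target is a sentinel-delimited sequence or a summary.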