mT5/T5v1.1 Fine-Tuning Results

Hey everybody,

The mT5 and improved T5v1.1 models have been added:

Improved T5 models (small to large):

and mT5 models (small to large):

are now in the model hub. I'll upload the 3b and 11b versions in the coming days…

I want to start a thread here to collect some fine-tuning results and possibly some notebooks & tips and tricks.

If anyone has fine-tuned an mT5 or T5v1.1 model, it would be awesome to share the results here :slight_smile:

Also, it might be interesting to see whether fp16 is compatible with the new T5 models, cf.
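The usual reason fp16 breaks for T5-style models is float16's small dynamic range: activations larger than roughly 65504 overflow to inf, and a later subtraction turns that into NaN in the loss. A minimal numpy sketch of the failure mode (illustrative only, not the actual T5 internals):

```python
import numpy as np

# float16's largest finite value is 65504; anything above overflows.
print(np.finfo(np.float16).max)   # 65504.0

x = np.float32(70000.0)           # fine in float32
y = x.astype(np.float16)          # overflows to inf in float16
print(y)                          # inf

# A downstream inf - inf then poisons the loss with NaN:
print(y - y)                      # nan
```

This is why mixed-precision setups typically keep overflow-prone reductions in float32 and use loss scaling.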

I’ll try to allocate some time this week for fine-tuning, but I’m very excited about some possible discussions here.

Tagging some of our power contributors @valhalla @mrm8488 @beltagy @Jung (just FYI :slight_smile: )


I was trying to fine-tune it on a Chinese short text classification task, but found that MT5ForConditionalGeneration is not in transformers 3.5.1 yet, while it is here?

Congrats :clap::clap::clap: @patrickvonplaten

I added it yesterday, so it's only on master for now :slight_smile: But we'll release 4.0 very soon :slight_smile:


And even more T5 pre-trained checkpoints for closed book question answering are here: . The official paper was released just a couple of days ago :slight_smile:


This issue might also be of interest:

Hi, I fine-tuned mT5-small and mT5-base separately on three datasets:
the English STS-B dataset, the KorSTS dataset, and my personal Korean news classification dataset.

I want results of the same form as T5's, which look like <pad> 5.0 </s>. But my model's results look like this instead: <pad> <extra_id_0>SOMETHING</s>.

Changing factors such as the learning rate and the number of epochs does not change the form of the output at all.
My code is at this address.

What's the problem? Please give me some advice on the code.
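The <extra_id_*> tokens are T5/mT5's span-corruption sentinels from pretraining, so an output starting with <extra_id_0> usually means the model is still reproducing its pretraining target format rather than the task format. A toy sketch (hypothetical strings, not the poster's code) contrasting the two target formats:

```python
# During span-corruption pretraining, T5/mT5 targets contain sentinel
# tokens; fine-tuning targets should be plain task text instead.

# Pretraining-style pair (what the model learned before fine-tuning):
pretrain_input  = "The cat <extra_id_0> on the <extra_id_1>."
pretrain_target = "<extra_id_0> sat <extra_id_1> mat"

# Fine-tuning-style pair for an STS-B-like regression-as-text task
# (hypothetical strings, mirroring the original T5 recipe):
finetune_input  = "stsb sentence1: A cat sits. sentence2: A cat is sitting."
finetune_target = "5.0"   # plain text, no sentinel tokens

# If a generation still begins with a sentinel, the decoder is
# reproducing the pretraining format rather than the task format:
generated = "<pad> <extra_id_0> 5.0"
print(generated.split()[1].startswith("<extra_id_"))  # True -> the symptom above
```

One thing worth checking in a setup like this is whether the target strings fed to the trainer accidentally contain sentinel tokens, and whether training ran long enough to move the decoder off its pretraining behavior.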

Hi Patrick!

Since we have many great new pretrained T5 models (not yet considering mT5), I would love to try to summarize the meaning of their suffixes, to make sure we understand them correctly.

v1_1, xl, or xxl – these use a slightly changed architecture compared to the original T5, detailed here and here. Due to these minor changes, the number of parameters changes a bit, so xl replaces 3B and xxl replaces 11B (not sure whether they are bigger or smaller than before).

They were also pretrained only on C4 (i.e., not pretrained on multi-task supervised datasets like the original T5).
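One of the documented v1.1 architecture changes is replacing the ReLU feed-forward block with a gated-GELU (GEGLU) feed-forward. A minimal numpy sketch of that block, with toy dimensions and the tanh approximation of GELU (an illustration of the idea, not the actual library code):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_down):
    # T5 v1.1 feed-forward: gated GELU instead of the original ReLU.
    # Two input projections (gate and value) are multiplied elementwise,
    # then projected back down; biases are omitted, as in T5.
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32          # toy sizes for illustration
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up   = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
print(geglu_ffn(x, w_gate, w_up, w_down).shape)  # (2, 8)
```

The gating also explains part of the parameter-count shift mentioned above: the feed-forward block now has two input projections instead of one.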

ssm – uses salient span masking, detailed in Section 3 of the paper. This special masking significantly improves the model's world knowledge.
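The idea behind salient span masking is to mask spans that carry world knowledge (named entities, dates) instead of random spans. A toy sketch of the masking step, where a hardcoded span list stands in for the NER/date tagger a real pipeline would use (all names here are illustrative):

```python
# Salient span masking: mask spans that carry world knowledge
# (named entities, dates) rather than random spans. A real setup
# uses an NER/date tagger; here a hardcoded list stands in for it.

SENTINEL = "<extra_id_0>"

def salient_span_mask(text, salient_spans):
    # Toy version: mask the first salient span found with one sentinel.
    for span in salient_spans:
        if span in text:
            return text.replace(span, SENTINEL, 1), f"{SENTINEL} {span}"
    return text, ""

sentence = "Franklin D. Roosevelt was born in January 1882."
spans = ["Franklin D. Roosevelt", "January 1882"]   # pretend tagger output
masked, target = salient_span_mask(sentence, spans)
print(masked)   # <extra_id_0> was born in January 1882.
print(target)   # <extra_id_0> Franklin D. Roosevelt
```

Because the masked span is always a fact-bearing entity, the model is forced to recall world knowledge to fill it in, which is why this pretraining helps closed-book QA.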

tqa – finetuned on the TriviaQA dataset, using 100% of the training data
tqao – like above, but using only 90% of the training data

wq – finetuned on the WebQuestions dataset, using 100% of the training data
wqo – like above, but using only 90% of the training data

nq – finetuned on the Google Natural Questions dataset, using 100% of the training data
nqo – like above, but using only 90% of the training data

I also want to note that although the official metric performance of these SSM-pretrained models looks inferior to open-book models like DPR, the authors note in the paper, based on manual evaluation, that around 30% of the "officially wrong answers" are false negatives: T5's freely generated answers may not match the gold truth exactly, even though they are in fact correct.

For example, on the closed-book NQ task, taking these false negatives into account, T5-XXL-SSM is estimated to score 0.57, compared to its official metric of 0.37 and DPR's SOTA of 0.42.
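The false negatives come from exact-match scoring: a freely generated answer must match a gold string after light normalization, so a paraphrased but correct answer scores 0. A sketch of such a scorer, in the spirit of the standard SQuAD-style exact-match normalization (an illustration, not the paper's exact evaluation code):

```python
import re
import string

def normalize(s):
    # Lowercase, drop punctuation and articles, squeeze whitespace
    # (in the spirit of the standard SQuAD exact-match normalization).
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

gold = ["President Franklin D. Roosevelt"]
print(exact_match("president franklin d. roosevelt", gold))  # True
# A freely generated answer that is arguably correct still scores 0:
print(exact_match("Franklin Roosevelt", gold))               # False
```

The second case is exactly the kind of "officially wrong" answer the manual evaluation reclassified as correct.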


That's an awesome summary of the new models, thank you :slight_smile:



Thank you for adding mT5 to Transformers!
I tried to fine-tune it on a dataset that includes Japanese, but the generation results don't seem good. I think there are some problems with my fine-tuning settings.
May I ask for tips on fine-tuning mT5 here? Or should I ask in T5 Finetuning Tips?

Thank you in advance.

It would be nice if you asked it in T5 Finetuning Tips and posted a link here.


Thank you! I’ll do so.