mT5/T5v1.1 Fine-Tuning Results

Hey everybody,

The mT5 and improved T5v1.1 models have been added:

Improved T5 models (small to large):

and mT5 models (small to large):

are in the model hub. I'll upload the 3b and 11b versions in the coming days…

I want to start a thread here to collect some fine-tuning results and possibly some notebooks & tips and tricks.

If anyone has fine-tuned an mT5 or T5v1.1 model, it would be awesome to share the results here :slight_smile:

Also, it might be interesting to see whether fp16 is compatible with the new T5 models, cf.

I’ll try to allocate some time this week for fine-tuning, but I’m very excited about some possible discussions here.

Tagging some of our power contributors @valhalla @mrm8488 @beltagy @Jung (just FYI :slight_smile: )


I was trying to fine-tune it on a Chinese short-text classification task and found that MT5ForConditionalGeneration is not in transformers 3.5.1 yet, while it is here?

Congrats :clap::clap::clap: @patrickvonplaten

Added it yesterday, so it's only on master for now :slight_smile: But we'll release 4.0 very soon :slight_smile:


And even more T5 pre-trained checkpoints for closed-book question answering are here: . The official paper was released just a couple of days ago :slight_smile:


This issue might also be of interest:

Hi, I fine-tuned the mT5-small and base models separately on three datasets.
The three datasets are the English STSb dataset, the KorSTS dataset, and my personal Korean news classification dataset.

I want a result of the same form as T5's, which looks like `<pad> 5.0 </s>`. But my model's results are in this form: `<pad> <extra_id_0>SOMETHING</s>`.

Changing factors such as the learning rate and number of epochs does not change the form of the output at all.
My code is at this address.

What's the problem? Please give me some advice on the code.
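Without seeing the linked code it's hard to say, but one common cause of the `<extra_id_0>` output (not necessarily the issue here) is that the fine-tuning labels are still in T5's span-corruption format, or that padded label positions aren't masked out of the loss. A minimal sketch in plain Python, with made-up token ids, of the two things worth checking:

```python
# Two quick checks when mT5 keeps emitting <extra_id_0> after fine-tuning.
# All token ids here are fake; this is illustrative, not anyone's real code.

PAD_ID = 0  # hypothetical pad token id

def prepare_labels(label_ids, pad_id=PAD_ID):
    """Replace pad ids in a padded label sequence with -100,
    so the cross-entropy loss ignores the padding positions."""
    return [tok if tok != pad_id else -100 for tok in label_ids]

def contains_sentinel(label_text):
    """The target text should be the plain answer (e.g. "5.0"),
    not a span-corruption target containing sentinel tokens."""
    return "<extra_id_" in label_text

labels = prepare_labels([262, 5, 632, 1, 0, 0])  # "5.0</s>" + padding (fake ids)
print(labels)                                 # padding masked with -100
print(contains_sentinel("<extra_id_0> 5.0"))  # True  -> bad target
print(contains_sentinel("5.0"))               # False -> fine
```

With real tensors, the masking step is typically `labels[labels == tokenizer.pad_token_id] = -100` before passing `labels` to the model.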

Hi Patrick!

Since we have many great new pretrained T5 models (not yet considering mT5), I would love to try to summarize the meaning of their suffixes, to make sure we understand them correctly.

v1_1, xl, or xxl – use an architecture with minor changes relative to the original T5, detailed here and here. Due to these minor changes, the number of parameters changes a bit, so xl replaces 3B and xxl replaces 11B (I'm not sure whether they are bigger or smaller than before).

They are also pretrained only on C4 (i.e., unlike the original T5, not pretrained on multi-task supervised datasets).

ssm – uses salient span masking, detailed in Section 3 of the paper. This special masking significantly improves the model's world knowledge.

tqa – fine-tuned on the TriviaQA dataset, using 100% of the training data
tqao – like above, but using only 90% of the training data

wq – fine-tuned on the WebQuestions dataset, using 100% of the training data
wqo – like above, but using only 90% of the training data

nq – fine-tuned on the Google Natural Questions dataset, using 100% of the training data
nqo – like above, but using only 90% of the training data

I also want to note that although the official metric performance of these SSM-pretrained models looks inferior to open-book models like DPR, the authors note in the paper, based on manual evaluation, that around 30% of the "officially wrong answers" are false negatives: T5's freely generated answers may not match the gold truth exactly, but are in fact correct.

For example, on the closed-book NQ task, taking these false negatives into account, T5-XXL-SSM is estimated to score 0.57, compared to its official metric of 0.37 and DPR's SOTA of 0.42.
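As a concrete illustration of the false-negative point (my own sketch, not code from the paper): closed-book QA is usually scored with a SQuAD-style normalized exact match, and even that normalization only catches trivial surface mismatches, not genuine paraphrases:

```python
# SQuAD-style answer normalization: lowercase, strip punctuation and
# English articles, collapse whitespace, then compare exactly.
import re
import string

def normalize_answer(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())             # collapse whitespace

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print("The Eiffel Tower." == "eiffel tower")             # raw comparison fails
```

A freely generated answer like "the tower built by Gustave Eiffel" would still count as wrong under this metric, which is exactly the kind of false negative the manual evaluation surfaces.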


That's an awesome summary of the new models, thank you :slight_smile:



Thank you for adding mT5 to Transformers!
I tried to fine-tune it on a dataset including Japanese, but the generation results don't seem good. I think there are some problems with my fine-tuning settings.
May I ask for tips on fine-tuning mT5 here? Or should I ask in T5 Finetuning Tips?

Thank you in advance.


It would be nice if you asked it in T5 Finetuning Tips and posted a link here.


Thank you! I’ll do so.

I have the same issue: `<extra_id_0>` always appears. Does anyone know how to solve it?



Could anyone share an example of code for fine-tuning mT5?

I am trying to fine-tune it for QA and abstractive summarization with Spanish datasets, and I think it would be great to share the results here afterwards.

Thanks in advance! :hugs:


[UPDATE]: Hi! I might have this done by the end of the month (training mT5 for Spanish QA and abstractive summarization); I will comment here again once I upload it to the model hub. I am first checking the results I get with English datasets, and I will switch to Spanish after that. (I don't have many compute resources, so this takes me a lot of time; I prefer to first test with English, which I'm fairly sure will work, to sort out any issues before spending too much time on something that might not work :slight_smile: )
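While full examples are being worked on, the usual shape of the preprocessing step for seq2seq fine-tuning (summarization or QA with mT5) can be sketched like this. This is not anyone's actual training code; the toy tokenizer below just stands in for a real SentencePiece tokenizer so the snippet runs on its own:

```python
# Build (input_ids, labels) pairs from (document, summary) examples,
# truncating each side to a maximum length. With a real tokenizer you
# would also pad and mask the labels with -100.

def make_example(document, summary, tokenize,
                 max_source_len=512, max_target_len=64):
    input_ids = tokenize(document)[:max_source_len]
    label_ids = tokenize(summary)[:max_target_len]
    return {"input_ids": input_ids, "labels": label_ids}

# toy "tokenizer": one id per whitespace-separated token
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

ex = make_example("a long article about some news story",
                  "short summary", toy_tokenize)
print(ex)
```

The same structure maps directly onto `tokenizer(...)` calls and a `Seq2SeqTrainer` in transformers; only the truncation lengths and the label masking are task-specific choices.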


In case it’s of interest, I’ve uploaded a large mT5 model fine-tuned on MNLI and xtreme-XNLI to the model hub: alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli

I ended up tuning using the original Google repo (with some nice pointers from Stephen Mayhew's notebook), so I can't really offer much in the way of tips for tuning with transformers, unfortunately.

I've seen some fairly encouraging early results comparing the tuned model to joeddav's excellent XLM-R model (I've run out of links for new users in this post) for zero-shot classification on some benchmark data (yinwenpeng's BenchmarkingZeroShot; link limit again!), but I've only run over a small subset so far. My institute is planning to trial these models in an upcoming project, so hopefully I'll be able to update at some point :slight_smile:
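For context, zero-shot classification with an XNLI-tuned model is typically scored like this (an illustrative sketch, not the exact transformers pipeline code): each candidate label is turned into an entailment hypothesis, the model scores premise/hypothesis entailment, and the entailment logits are softmaxed across the candidate labels:

```python
# Turn per-label entailment logits into a probability distribution
# over candidate labels (single-label zero-shot classification).
import math

def zero_shot_scores(entailment_logits, labels):
    exps = [math.exp(x) for x in entailment_logits]
    total = sum(exps)
    return dict(zip(labels, (e / total for e in exps)))

# fake entailment logits for three hypotheses like
# "This example is about {label}."
scores = zero_shot_scores([2.0, 0.1, -1.5], ["sports", "politics", "tech"])
best = max(scores, key=scores.get)
print(best)  # "sports"
```

The multi-label variant instead softmaxes entailment vs. contradiction independently per label; which variant you want depends on whether exactly one label can be true.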


I've spent quite a while fine-tuning mt5-small for German-to-English translation, but with only mediocre results. I've incorporated the suggestions from the T5 Finetuning Tips thread, but the model still only reaches a BLEU score of about 9 (based on published results, I was expecting a score in the 30-40 range). I'd be very interested to know if anyone has achieved better translation results with this model.

FYI: I'm limiting the fine-tuning to a 10k-example subset of the wmt16 dataset.
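For anyone wanting to sanity-check BLEU numbers like this, here is a bare-bones corpus BLEU (uniform 4-gram weights, single reference per sentence). Real evaluations should use an established implementation such as sacreBLEU; this just makes the formula concrete:

```python
# Minimal corpus BLEU: geometric mean of modified n-gram precisions
# (n = 1..4) times a brevity penalty. Inputs are lists of token lists.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # some n-gram order had no matches at all
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyp = ["the cat sat on the mat".split()]
ref = ["the cat sat on the mat".split()]
print(corpus_bleu(hyp, ref))  # 1.0 for a perfect match
```

Note that scores also depend heavily on tokenization, which is one reason sacreBLEU (with its fixed tokenization) is the standard for comparable numbers.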