mT5/T5v1.1 Fine-Tuning Results

Hey everybody,

The mT5 and improved T5v1.1 models have been added:

Improved T5 models (small to large):

and mT5 models (small to large):

are in the model hub. I will upload the 3B and 11B versions in the coming days…

I want to start a thread here to collect some fine-tuning results and possibly some notebooks & tips and tricks.

If anyone has fine-tuned an mT5 or T5v1.1 model, it would be awesome to share the results here :slight_smile:

Also, it might be interesting to see whether fp16 is compatible with the new T5 models; cf. https://github.com/huggingface/transformers/issues/4287
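For anyone who wants to poke at this before the issue is resolved, here is a minimal fp16 sanity-check sketch (the model id and toy inputs are just illustrative, and it assumes a CUDA GPU):

```python
# Minimal sketch: check whether a T5v1.1 forward pass stays finite in fp16.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "google/t5-v1_1-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).half().cuda()

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt").to("cuda")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(**inputs, labels=labels)

# A NaN/inf loss here is the overflow symptom discussed in the linked issue.
print("loss:", out.loss.item(), "finite:", torch.isfinite(out.loss).item())
```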

I’ll try to allocate some time this week for fine-tuning, but I’m very excited about some possible discussions here.

Tagging some of our power contributors @valhalla @mrm8488 @beltagy @Jung (just FYI :slight_smile: )


I was trying to fine-tune it on a Chinese short-text classification task and found that MT5ForConditionalGeneration is not in transformers 3.5.1 yet, while it is here?

Congrats :clap::clap::clap: @patrickvonplaten

Added it yesterday, so it’s only on master for now :slight_smile: But we’ll release 4.0 very soon :slight_smile:
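In the meantime, installing from source gets you the new class; a quick sketch:

```python
# MT5ForConditionalGeneration lives only on master right now, so first:
#   pip install git+https://github.com/huggingface/transformers
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
```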


And even more pretrained T5 checkpoints for closed-book question answering are here:
https://huggingface.co/models?search=ssm . The official paper was released just a couple of days ago :slight_smile:
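A hypothetical usage sketch for one of these closed-book QA checkpoints (the model id is assumed from the hub search above; the question is just an example):

```python
# Closed-book QA: the model answers from its parameters, with no context passage.
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "google/t5-small-ssm-nq"  # assumed id: small SSM model fine-tuned on Natural Questions
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("When was Franklin D. Roosevelt born?", return_tensors="pt")
answer_ids = model.generate(inputs.input_ids)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```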


This issue might also be of interest: https://github.com/huggingface/transformers/issues/8704#issuecomment-734731916

Hi, I fine-tuned mT5-small and mT5-base separately on three datasets:
the English STS-B dataset, the KorSTS dataset, and my personal Korean news classification dataset.

I want results of the same form as T5’s, which look like <pad> 5.0 </s>. But my model’s results come out in this form: <pad> <extra_id_0>SOMETHING</s>.

Changing factors such as the learning rate and the number of epochs does not change the form of the output at all.
My code is at this address.

What’s the problem? Please give me some advice on the code.
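One likely explanation for the recurring <extra_id_0>: mT5 and T5v1.1 checkpoints are pretrained only on span corruption, where every target begins with a sentinel token, so an under-trained model defaults to emitting <extra_id_0>. It’s worth checking that the fine-tuning targets are plain text with no sentinels; a minimal sketch (the STS-B-style prefix and example are illustrative):

```python
# Sketch: encode an STS-B style (input, target) pair for mT5 fine-tuning.
# The target is plain text ("5.0"); the tokenizer appends </s> automatically,
# and no <extra_id_*> sentinel should appear in the labels.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

source = "stsb sentence1: A plane is taking off. sentence2: An air plane is taking off."
target = "5.0"

labels = tokenizer(target, return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(labels[0].tolist()))  # no sentinel tokens expected
```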


Hi Patrick!

Since we have many great new pretrained T5 models (not yet considering mT5), I would love to try to summarize the meaning of their suffixes to make sure we understand them correctly.

v1_1, xl, and xxl – use a slightly modified architecture relative to the original T5, detailed here and here. Due to these minor changes, the number of parameters changes a bit, so xl replaces 3B and xxl replaces 11B (not sure whether they are bigger or smaller than before).

They were also pretrained only on C4 (i.e., not pretrained on multi-task supervised datasets like the original T5).

ssm – use salient span masking, detailed in Section 3 of the paper. This special masking significantly improves the model’s world knowledge.

tqa – fine-tuned on the TriviaQA dataset, using 100% of the training data
tqao – like the above, but using only 90% of the training data

wq – fine-tuned on the WebQuestions dataset, using 100% of the training data
wqo – like the above, but using only 90% of the training data

nq – fine-tuned on the Google Natural Questions dataset, using 100% of the training data
nqo – like the above, but using only 90% of the training data

I also want to note that although the official metric performance of these SSM-pretrained models looks inferior to open-book models like DPR, the authors note in the paper, based on manual evaluation, that around 30% of “officially wrong answers” are false negatives, since T5’s freely generated answers may not match the gold truth perfectly (but are in fact correct).

For example, on the closed-book NQ task, taking these false negatives into account, T5-XXL-SSM is estimated to score 0.57, compared to its official metric of 0.37 and DPR’s SOTA of 0.42.


That’s an awesome summary of the new models, thank you :slight_smile:


Hello,

Thank you for adding mT5 to Transformers!
I tried to fine-tune it using a dataset that includes Japanese, but the generation results don’t seem good. I think there are some problems with my fine-tuning setup.
May I ask for tips on fine-tuning mT5 here, or should I ask in T5 Finetuning Tips?

Thank you in advance.


It would be nice if you asked it in T5 Finetuning Tips and posted a link here.


Thank you! I’ll do so.

I have the same issue where <extra_id_0> always appears. Does anyone know how to solve it?


Hello!

Could anyone share an example of code for fine-tuning mT5?

I am trying to fine-tune it for QA and abstractive summarization with Spanish datasets, and I think it would be great to share the results here afterwards.

Thanks in advance! :hugs:
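Until someone posts a full script, here is a minimal fine-tuning sketch (a single toy example; the "resumir:" prefix is an arbitrary choice, and a real run needs a proper dataset, batching, and many more steps):

```python
# Minimal mT5 fine-tuning loop (illustrative only, not a tested recipe).
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy Spanish summarization pair.
source = "resumir: El modelo mT5 fue preentrenado en el corpus mC4, que cubre 101 idiomas."
target = "mT5 fue preentrenado en mC4."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=64).input_ids

model.train()
for step in range(10):  # a real run needs many more steps over a real dataset
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```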


[UPDATE]: Hi! I might have this done by the end of the month (training mT5 for Spanish QA and abstractive summarization); I will comment here again once I upload it to the model hub. I am first checking the results I get with English datasets, and I will move to Spanish after that. I don’t have many compute resources, so it takes me a lot of time, and I prefer to test with English first, which should surface any issues I may have, and then switch to Spanish once I can be sure I am not going to spend too much time on something that may not work :slight_smile:


In case it’s of interest, I’ve uploaded a large mT5 model fine-tuned on MNLI and xtreme-XNLI to the model hub: alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli

I ended up tuning using the original Google repo (with some nice pointers from Stephen Mayhew’s notebook), so I can’t really offer much in the way of tips for tuning with transformers, unfortunately.

I’ve seen some fairly encouraging early results comparing the tuned model to joeddav’s excellent XLM-R model (I’ve run out of links in this post as a new user) for zero-shot classification on some benchmark data (yinwenpeng, BenchmarkingZeroShot – link limit again!), but I have only run over a small subset so far. My institute is planning to trial these models in a project we’ve got coming up, so hopefully I will be able to update at some point :slight_smile:
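For anyone curious how an NLI-tuned mT5 might be used for zero-shot classification, here is a hypothetical sketch. The "xnli: premise: … hypothesis: …" input format is an assumption on my part; check the model card for the exact format this checkpoint expects:

```python
# Hypothetical zero-shot classification via NLI with the model above.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

name = "alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = MT5ForConditionalGeneration.from_pretrained(name)

text = "¿A qué hora es la reunión de mañana?"
label = "horarios"
# Assumed input format; an "entailment" prediction means the label applies.
prompt = f"xnli: premise: {text} hypothesis: Este texto trata de {label}."

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(inputs.input_ids)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```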


I’ve spent quite a while fine-tuning mt5-small for German-to-English translation, but with only mediocre results. I’ve incorporated the suggestions given in the T5 Finetuning Tips forum, but the model still only reaches a BLEU score of about 9 (based on published results, I was expecting a score likely in the 30-40 range). I’d be very interested to know if anyone has achieved better translation results with this model.

FYI: I’m limiting the fine-tuning to a 10k subset of the wmt16 dataset.
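For anyone comparing numbers on the same footing, here is a small sketch of corpus-level scoring with sacrebleu, the usual tool for WMT-style scores (toy hypotheses and references):

```python
# Corpus-level BLEU with sacrebleu (toy data; scores are on a 0-100 scale).
import sacrebleu

hypotheses = ["The house is wonderful.", "I like trains a lot."]
# One list per reference *stream*, each aligned with the hypotheses.
references = [["The house is wonderful.", "I really like trains."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```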