## Environment info
- `transformers` version: 4.0.0-rc-1
- Platform: Linux
- Python version: 3.7.9
- PyTorch version (GPU?): 1.4.0
- Tensorflow version (GPU?): NA
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
### Who can help
@patrickvonplaten
## Information
Model I am using (Bert, XLNet ...): MT5ForConditionalGeneration.from_pretrained('google/mt5-small')
The problem arises when using:
* [ ] the official example scripts: (give details below)
* [x] my own modified scripts: (give details below)
The task I am working on is:
* [ ] an official GLUE/SQUaD task: (give the name)
* [x] my own task or dataset: (give details below)
KoreanSTS dataset
https://github.com/kakaobrain/KorNLUDatasets
## To reproduce
Steps to reproduce the behavior:
1. Fine-tune mT5-small on the Korean STS-B dataset
2. Run inference on the test set
3. The generated outputs are malformed (see below)
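For reference, the preprocessing in the script below boils down to this T5-style text-to-text format (a minimal sketch; `make_example` is just an illustrative helper, not part of the script):

```python
# T5-style STS-B text-to-text format: the input is a prompted sentence
# pair, the target is the similarity score snapped to the nearest 0.2
# and rendered as a string.
def make_example(sentence1, sentence2, score):
    input_text = f"stsb sentence1: {sentence1} sentence2: {sentence2}"
    target_text = str(round(score * 5) / 5)
    return input_text, target_text

print(make_example("한 남자가 기타를 치고 있다.", "남자가 악기를 연주한다.", 3.14))
```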
```python
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import random
import time
import datetime
import numpy as np
import os
from tqdm.notebook import tqdm
import logging
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import Adafactor, get_linear_schedule_with_warmup, MT5ForConditionalGeneration, T5Tokenizer
from scipy.stats import spearmanr, pearsonr
def format_time(elapsed):
    # Helper used in the training loop below: format seconds as hh:mm:ss.
    return str(datetime.timedelta(seconds=int(round(elapsed))))
tokenizer = T5Tokenizer.from_pretrained('google/mt5-small')
model = MT5ForConditionalGeneration.from_pretrained('google/mt5-small', return_dict=True)
GPU_NUM = 4
device = torch.device(f'cuda:{GPU_NUM}' if torch.cuda.is_available() else 'cpu')
torch.cuda.set_device(device) # change allocation of current GPU
print ('Current cuda device ', torch.cuda.current_device()) # check
data_path = "../dataset"
train = os.path.join(data_path,'sts-train.tsv')
test = os.path.join(data_path,'sts-test.tsv')
dev = os.path.join(data_path,'sts-dev.tsv')
train_data = pd.read_csv(train, delimiter='\t', error_bad_lines=False)
test_data = pd.read_csv(test, delimiter='\t', error_bad_lines=False)
dev_data = pd.read_csv(dev, delimiter='\t', error_bad_lines=False)
train_data.score = round(train_data.score*5)/5
train_data = train_data.applymap(str)
train_data['input']=''
for i in range(len(train_data)):
    strs_to_join = ['stsb sentence1:', train_data.iloc[i]['sentence1'], 'sentence2:', train_data.iloc[i]['sentence2']]
    train_data['input'].iloc[i] = " ".join(strs_to_join)
train_target = train_data.score
dev_data.score = round(dev_data.score*5)/5
dev_data = dev_data.applymap(str)
dev_data['input']=''
for i in range(len(dev_data)):
    strs_to_join = ['stsb sentence1:', dev_data.iloc[i]['sentence1'], 'sentence2:', dev_data.iloc[i]['sentence2']]
    dev_data['input'].iloc[i] = " ".join(strs_to_join)
dev_target = dev_data.score
test_data.score = round(test_data.score*5)/5
test_data = test_data.applymap(str)
test_data['input']=''
for i in range(len(test_data)):
    strs_to_join = ['stsb sentence1:', test_data.iloc[i]['sentence1'], 'sentence2:', test_data.iloc[i]['sentence2']]
    test_data['input'].iloc[i] = " ".join(strs_to_join)
test_target = test_data.score
train_inputs, train_targets, dev_inputs, dev_targets, test_inputs, test_targets = [],[],[],[],[],[]
for text in train_data.input:
    tokenized_inputs = tokenizer.encode_plus(text, max_length=283, padding='max_length', truncation=True, return_tensors="pt").input_ids
    train_inputs.append(tokenized_inputs)
for target in train_target:
    tokenized_targets = tokenizer.encode_plus(target, max_length=2, padding='max_length', truncation=True, return_tensors="pt").input_ids
    train_targets.append(tokenized_targets)
for text in dev_data.input:
    tokenized_inputs = tokenizer.encode_plus(text, max_length=283, padding='max_length', truncation=True, return_tensors="pt").input_ids
    dev_inputs.append(tokenized_inputs)
for target in dev_target:
    tokenized_targets = tokenizer.encode_plus(target, max_length=2, padding='max_length', truncation=True, return_tensors="pt").input_ids
    dev_targets.append(tokenized_targets)
for text in test_data.input:
    tokenized_inputs = tokenizer.encode_plus(text, max_length=283, padding='max_length', truncation=True, return_tensors="pt").input_ids
    test_inputs.append(tokenized_inputs)
for target in test_target:
    tokenized_targets = tokenizer.encode_plus(target, max_length=2, padding='max_length', truncation=True, return_tensors="pt").input_ids
    test_targets.append(tokenized_targets)
train_input_ids = torch.cat(train_inputs, dim=0)
train_labels = torch.cat(train_targets, dim=0)
dev_input_ids = torch.cat(dev_inputs, dim=0)
dev_labels = torch.cat(dev_targets, dim=0)
test_input_ids = torch.cat(test_inputs, dim=0)
test_labels = torch.cat(test_targets, dim=0)
train_dataset = TensorDataset(train_input_ids, train_labels)
dev_dataset = TensorDataset(dev_input_ids, dev_labels)
test_dataset = TensorDataset(test_input_ids, test_labels)
batch_size = 16
train_dataloader = DataLoader(
    train_dataset,  # The training samples.
    sampler=RandomSampler(train_dataset),  # Select batches randomly.
    batch_size=batch_size  # Train with this batch size.
)
dev_dataloader = DataLoader(
    dev_dataset,  # The validation samples.
    sampler=SequentialSampler(dev_dataset),  # Pull out batches sequentially.
    batch_size=batch_size  # Evaluate with this batch size.
)
test_dataloader = DataLoader(
    test_dataset,  # The test samples.
    sampler=SequentialSampler(test_dataset),  # Pull out batches sequentially.
    batch_size=batch_size  # Evaluate with this batch size.
)
model.cuda()
params = list(model.named_parameters())
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    relative_step=False
)
epochs = 30
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # Default value in run_glue.py
    num_training_steps=total_steps
)
predictions_all=[]
seed_val = 0
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
training_stats = []
total_t0 = time.time()
for epoch_i in tqdm(range(0, epochs)):
    # Training
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    t0 = time.time()
    total_train_loss = 0
    model.train()
    for step, batch in tqdm(enumerate(train_dataloader)):
        if step % 50 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
        b_input_ids = batch[0].to(device)
        b_labels = batch[1].to(device)
        model.zero_grad()
        output = model(input_ids=b_input_ids, labels=b_labels, return_dict=True)
        loss = output.loss
        logits = output.logits
        total_train_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_train_loss / len(train_dataloader)
    training_time = format_time(time.time() - t0)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    # Validation
    print("")
    print("Running Validation...")
    t0 = time.time()
    model.eval()
    total_eval_loss = 0
    nb_eval_steps = 0
    for batch in tqdm(dev_dataloader):
        b_input_ids = batch[0].to(device)
        b_labels = batch[1].to(device)
        with torch.no_grad():
            output = model(input_ids=b_input_ids, labels=b_labels, return_dict=True)
            loss = output.loss
            logits = output.logits
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
    avg_val_loss = total_eval_loss / len(dev_dataloader)
    validation_time = format_time(time.time() - t0)
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )
    # Test
    print('Predicting labels for {:,} test sentences...'.format(len(test_input_ids)))
    model.eval()
    predictions = []
    for batch in tqdm(test_dataloader):
        b_input_ids = batch[0].to(device)
        with torch.no_grad():
            outputs = model.generate(b_input_ids)
        predictions.append(outputs)
    print('DONE.')
    predictions_all.append(predictions)
print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_t0)))
for i in range(10):
    output = model.generate(test_input_ids[i].cuda().reshape(1, -1))
    print(tokenizer.decode(output[0]))
```
Output:
```
<pad> <extra_id_0></s>
<pad> <extra_id_0>.</s>
<pad> <extra_id_0>.</s>
<pad> <extra_id_0></s>
<pad> <extra_id_0>합니다.</s>
<pad> <extra_id_0></s>
<pad> <extra_id_0>.</s>
<pad> <extra_id_0>.</s>
<pad> <extra_id_0>.</s>
<pad> <extra_id_0>.</s>
```
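To evaluate, the generated strings have to be parsed back into floats before computing Spearman/Pearson correlation; a minimal sketch of that step (my `decode_score` helper here is illustrative, not part of the script above), which is exactly where the outputs above break down:

```python
def decode_score(text):
    # Strip the T5 special tokens and parse the remainder as a float;
    # fall back to 0.0 when no parsable number was generated.
    cleaned = text.replace("<pad>", "").replace("</s>", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return 0.0

print(decode_score("<pad> 3.2</s>"))           # a well-formed prediction parses to 3.2
print(decode_score("<pad> <extra_id_0></s>"))  # the outputs above parse to 0.0
```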
## Expected behavior
Thank you for making T5 and mT5 available in PyTorch.
1. I fine-tuned mt5-small on the Korean STS-B dataset, but instead of similarity scores the model generates strangely shaped output (shown above). There are about 5,700 training examples. I wonder whether I made a mistake in the training procedure, whether the dataset is too small, or whether the model simply needs more training.
2. Also, when running inference with mT5 (or T5), what is the difference between calling `model.generate()` and calling `model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)`?
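To make the question concrete, here is a toy model (pure Python, purely illustrative, not the real mT5 API) contrasting a single teacher-forced forward pass with the autoregressive loop that `generate()` performs:

```python
def toy_forward(decoder_input_ids):
    # Stand-in for one forward pass: given the decoder prefix, return
    # scores over a 5-token vocabulary (here always favoring last_token + 1).
    last = decoder_input_ids[-1]
    return [1.0 if tok == (last + 1) % 5 else 0.0 for tok in range(5)]

def toy_generate(start_token, max_length):
    # What generate() does conceptually: repeatedly feed the growing
    # prefix back through the model, appending the argmax token (greedy).
    ids = [start_token]
    for _ in range(max_length - 1):
        scores = toy_forward(ids)
        ids.append(max(range(len(scores)), key=lambda t: scores[t]))
    return ids

# One forward pass with decoder_input_ids scores a sequence you supply
# (teacher forcing); generate() builds the sequence token by token.
print(toy_generate(0, 4))  # [0, 1, 2, 3]
```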