System Info
- Transformers 3.0.2
- PyTorch 1.6.0
- Python 3.6.7
Train and Test Examples
Train: context: The enhanced system could supply more cooling capacity to car compartment under all test conditions because of higher performance heat exchangers. answer: under all test conditions question: Where did the enhanced system supply could more coolingcapacity to car compartment because of higher performance heat exchangers?
Test: context: In this paper, machine learning is implemented in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms question: Where did machine learning implement?
Example Code
Train Script:
import datetime
from torch.utils.data import Dataset
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments, GPT2Tokenizer
from typing import List
# Zero-padded timestamp used to name this run's directories (format: YYYYMonDD_HHMM)
time_str = datetime.datetime.now().strftime('%Y%b%d_%H%M')
train_dataset = '/path/to/train_data/train.txt'
output_dir = '/path/to/models_dir/qgen/' + time_str + '_out'
logging_dir = '/path/to/models_dir/qgen/' + time_str + '_logs'
model_save_dir = '/path/to/models_dir/qgen/' + time_str + '_model'
class QGenDataset(Dataset):
    def __init__(self, text_file_path):
        if '.txt' in text_file_path:
            tmp_data_list = []
            with open(text_file_path, 'r') as text_file:
                for line in text_file:
                    tmp_data_list.append(line.replace('\n', ''))
            self.data_df = pd.DataFrame({'input_text': tmp_data_list})

    def __len__(self):
        return len(self.data_df)

    def __getitem__(self, idx: int):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        text = self.data_df.iloc[idx, 0]
        return text
def qgen_data_collator(text_list: List[str]) -> dict:
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

    # Id of the ' question' token, used to locate where the question starts
    q_id = tokenizer(' question', return_tensors='pt')['input_ids'][0][0]

    encoded_results = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt',
                                return_attention_mask=True)

    # One (row, column) pair per match; assumes exactly one ' question' token per example
    q_idxs = (encoded_results['input_ids'] == q_id).nonzero()

    # Zero the attention mask from the question onward
    for idx, attn_mask in enumerate(encoded_results['attention_mask']):
        attn_mask[q_idxs[idx][1]:] = 0

    # Labels: ignore (-100) everything before the question, keep the question tokens
    tmp_labels = []
    for idx, input_id in enumerate(encoded_results['input_ids']):
        label = input_id.detach().clone()
        label[:q_idxs[idx][1]] = -100
        tmp_labels.append(label)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['labels'] = torch.stack([result for result in tmp_labels])
    return batch
qgen_dataset_train = QGenDataset(train_dataset)
model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(
    output_dir=output_dir,
    do_train=True,
    per_device_train_batch_size=16,
    logging_dir=logging_dir)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=qgen_data_collator,
    train_dataset=qgen_dataset_train)
trainer.train()
trainer.save_model(model_save_dir)
Generate Script:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
gpt2_qgen = GPT2LMHeadModel.from_pretrained('/path/to/models_dir/qgen/my_model')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
test_q_list = ['context: In this paper, machine learning is implemented in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms.']
for test_q in test_q_list:
    encoded_results = tokenizer(test_q, return_tensors='pt')
    gen_max_length = encoded_results['input_ids'].shape[1] + 50
    gen_ids = gpt2_qgen.generate(input_ids=encoded_results['input_ids'],
                                 max_length=gen_max_length,
                                 do_sample=True,
                                 top_k=80,
                                 top_p=0.95)
    q_ids = gen_ids[0][encoded_results['input_ids'].shape[1]:]
    full_decode_str = tokenizer.decode(gen_ids[0])
    q_decoded = tokenizer.decode(q_ids)
Problem
I fine-tuned GPT-2 for question generation using the suggestions found here. For the tokens in each training example that make up question: ..., I set the attention_mask to 0 and the labels to those same tokens; every label before question: ... is set to -100.
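To make the label masking concrete, here is a minimal sketch on plain Python lists rather than tensors. It is only an illustration of the idea, not the collator itself: the helper name mask_before_marker and the token ids are made up, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_before_marker(input_ids: List[int], marker_id: int) -> List[int]:
    """Return labels that keep tokens from the first marker onward and ignore the rest.

    Hypothetical helper for illustration; raises if the marker is absent.
    """
    if marker_id not in input_ids:
        raise ValueError("marker token not found in sequence")
    start = input_ids.index(marker_id)
    return [IGNORE_INDEX] * start + input_ids[start:]


# Made-up ids: 7 stands in for the ' question' token.
example_ids = [11, 12, 13, 7, 21, 22]
example_labels = mask_before_marker(example_ids, marker_id=7)
print(example_labels)  # [-100, -100, -100, 7, 21, 22]
```

The sketch uses the first occurrence of the marker and fails loudly when it is missing. The collator in the train script instead indexes q_idxs by batch row, which only lines up if every example contains exactly one ' question' token.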
Before running any tests, I wanted to see how well the model generates text on a single test example, so I ran the generate script. What I’m finding is:
- q_decoded is exactly the label for the test example, even though the label is not fed into the generation script.
- encoded_results['input_ids'].shape = (1, 60) and gen_ids.shape = (1, 69), despite gen_max_length being 110.
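One possible reading of the shorter output, offered as a sketch rather than a diagnosis: in generate, max_length is an upper bound, and decoding also stops once the end-of-sequence token is produced. The toy loop below mimics that stopping rule with a scripted stand-in for the model. The names toy_decode and next_token_fn and all ids are invented for illustration, and this is not how transformers implements generation.

```python
from typing import Callable, List


def toy_decode(prompt: List[int],
               next_token_fn: Callable[[List[int]], int],
               max_length: int,
               eos_id: int) -> List[int]:
    """Append tokens until eos_id is produced or max_length is reached."""
    ids = list(prompt)
    while len(ids) < max_length:
        token = next_token_fn(ids)
        ids.append(token)
        if token == eos_id:
            break  # stop early: the sequence is finished
    return ids


EOS = 0  # made-up end-of-sequence id
scripted = iter([5, 6, 7, EOS, 8, 9])  # tokens after EOS are never consumed
toy_output = toy_decode(prompt=[1, 2, 3],
                        next_token_fn=lambda ids: next(scripted),
                        max_length=20,
                        eos_id=EOS)
print(len(toy_output))  # 7, well under max_length=20
```

Under this reading, the shapes above would mean that 69 - 60 = 9 tokens were generated before the end-of-sequence token ended the run.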
EDIT
Let me add a little clarification. For the following text:
context: In this paper, machine learning is implemented in a simulated air-conditioning
system based on evolutionary computing methods involving the use of classifier systems
and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary
computing methods involving the use of classifier systems and genetic algorithms.
whose label is:
question: Where did machine learning implement?
the generate method of my model produces the following:
context: In this paper, machine learning is implemented in a simulated air-conditioning
system based on evolutionary computing methods involving the use of classifier systems
and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary
computing methods involving the use of classifier systems and genetic algorithms. question:
Where did machine learning implement?<|endoftext|>
despite setting max_length to 110 in generate.
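For inspecting only the newly generated part of an output like the one above, a small helper along these lines can slice off the prompt and cut at the end-of-text token. It is a sketch that works on plain lists of ids; split_generation and the example ids are hypothetical, while 50256 is the id of <|endoftext|> in the GPT-2 vocabulary.

```python
from typing import List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def split_generation(gen_ids: List[int], prompt_len: int,
                     eos_id: int = GPT2_EOS_ID) -> List[int]:
    """Return the tokens generated after the prompt, cut at the first eos_id."""
    continuation = gen_ids[prompt_len:]
    if eos_id in continuation:
        continuation = continuation[:continuation.index(eos_id)]
    return continuation


# Made-up ids: a 3-token prompt, 2 generated tokens, then end-of-text.
new_tokens = split_generation([10, 11, 12, 40, 41, GPT2_EOS_ID], prompt_len=3)
print(new_tokens)  # [40, 41]
```

With the tensors from the generate script, this could be called as split_generation(gen_ids[0].tolist(), encoded_results['input_ids'].shape[1]).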