Finetuned model generating test label exactly

System Info

  • Transformers 3.0.2
  • PyTorch 1.6.0
  • Python 3.6.7

Train and Test Examples
Train: context: The enhanced system could supply more cooling capacity to car compartment under all test conditions because of higher performance heat exchangers. answer: under all test conditions question: Where did the enhanced system supply could more coolingcapacity to car compartment because of higher performance heat exchangers?

Test: context: In this paper, machine learning is implemented in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms question: Where did machine learning implement?

Example Code
Train Script:

import datetime
from torch.utils.data import Dataset
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments, GPT2Tokenizer
from typing import List

# Timestamp such as 2020Aug12_0905; zero-padded so hour and minute stay unambiguous.
time_str = datetime.datetime.now().strftime('%Y%b%d_%H%M')

train_dataset = '/path/to/train_data/train.txt'
output_dir = '/path/to/models_dir/qgen/' + time_str + '_out'
logging_dir = '/path/to/models_dir/qgen/'  + time_str + '_logs'
model_save_dir = '/path/to/models_dir/qgen/'  + time_str + '_model'

class QGenDataset(Dataset):
    def __init__(self, text_file_path):
        if '.txt' in text_file_path:
            tmp_data_list = []
            with open(text_file_path, 'r') as text_file:
                for line in text_file:
                    tmp_data_list.append(line.replace('\n',''))
            self.data_df = pd.DataFrame({'input_text':tmp_data_list})

    def __len__(self):
        return len(self.data_df)

    def __getitem__(self, idx: int):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        text = self.data_df.iloc[idx, 0]
        return text


# Load the tokenizer once instead of on every batch.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
# ID of the ' question' token, which marks where the target text starts.
q_id = tokenizer(' question', return_tensors='pt')['input_ids'][0][0]


def qgen_data_collator(text_list: List[str]) -> dict:
    encoded_results = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt',
                                return_attention_mask=True)

    # (row, column) of every ' question' token. This assumes the token occurs exactly
    # once per example; otherwise q_idxs[idx] no longer lines up with row idx.
    q_idxs = (encoded_results['input_ids'] == q_id).nonzero()

    # Zero the attention mask from ' question' onward.
    for idx, attn_mask in enumerate(encoded_results['attention_mask']):
        attn_mask[q_idxs[idx][1]:] = 0

    # Exclude everything before ' question' from the loss (-100 is ignored).
    tmp_labels = []
    for idx, input_id in enumerate(encoded_results['input_ids']):
        label = input_id.detach().clone()
        label[:q_idxs[idx][1]] = -100
        tmp_labels.append(label)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['labels'] = torch.stack([result for result in tmp_labels])
    return batch


qgen_dataset_train = QGenDataset(train_dataset)
model = GPT2LMHeadModel.from_pretrained('gpt2')

training_args = TrainingArguments(
    output_dir=output_dir,
    do_train=True,
    per_device_train_batch_size=16,
    logging_dir=logging_dir)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=qgen_data_collator,
    train_dataset=qgen_dataset_train)

trainer.train()
trainer.save_model(model_save_dir)

Generate Script:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_qgen = GPT2LMHeadModel.from_pretrained('/path/to/models_dir/qgen/my_model')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
test_q_list = ['context: In this paper, machine learning is implemented in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary computing methods involving the use of classifier systems and genetic algorithms.']

for test_q in test_q_list:
    encoded_results = tokenizer(test_q, return_tensors='pt')
    gen_max_length = encoded_results['input_ids'].shape[1] + 50

    gen_ids = gpt2_qgen.generate(input_ids=encoded_results['input_ids'],
                                 max_length=gen_max_length,
                                 do_sample=True,
                                 top_k=80,
                                 top_p=0.95)
    q_ids = gen_ids[0][encoded_results['input_ids'].shape[1]:]
    full_decode_str = tokenizer.decode(gen_ids[0])
    q_decoded = tokenizer.decode(q_ids)

Problem
I finetuned GPT2 for question generation using the suggestions found here. In each training example, for the tokens from question: ... onward, I set the attention_mask to 0 and use those tokens as the labels; every earlier position in the labels is set to -100.
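To make the masking concrete, here is a minimal sketch on a single unpadded example, using plain Python lists rather than tensors. The function name `mask_from_question` and the toy token IDs are made up for illustration; only the -100 convention (the default `ignore_index` of PyTorch's cross-entropy loss) comes from the real libraries.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_from_question(input_ids: List[int], q_id: int) -> Tuple[List[int], List[int]]:
    """Return (attention_mask, labels) for one unpadded example."""
    q_pos = input_ids.index(q_id)  # first occurrence of the ' question' token
    # Attend to the context and answer only; mask ' question' and everything after it.
    attention_mask = [1] * q_pos + [0] * (len(input_ids) - q_pos)
    # Compute the loss only on ' question' and the tokens that follow it.
    labels = [IGNORE_INDEX] * q_pos + input_ids[q_pos:]
    return attention_mask, labels


# Hypothetical token IDs: 7 stands in for ' question', 0 for <|endoftext|>.
toy_ids = [11, 12, 13, 7, 21, 22, 0]
toy_mask, toy_labels = mask_from_question(toy_ids, q_id=7)
print(toy_mask)    # [1, 1, 1, 0, 0, 0, 0]
print(toy_labels)  # [-100, -100, -100, 7, 21, 22, 0]
```

This mirrors what the collator in the train script does per row, except that the collator also operates on padded positions.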

Before running any tests, I wanted to see how well the model generated text on a single test example. To do this, I ran the generate script. What I'm finding is:

  1. q_decoded is exactly the label for the test example even though the label is not fed into the generation script.

  2. encoded_results['input_ids'].shape = (1,60) and gen_ids.shape = (1,69) despite gen_max_length being 110.
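One likely explanation for the second observation is that max_length is only an upper bound: generate stops early once the sequence emits the end-of-sequence token. The toy loop below sketches that behaviour. It is not the transformers implementation; `toy_generate`, the scripted token source, and the token IDs are all hypothetical stand-ins.

```python
from typing import Callable, List

EOS_ID = 0  # hypothetical stand-in for the <|endoftext|> token ID


def toy_generate(prompt: List[int],
                 next_token: Callable[[List[int]], int],
                 max_length: int) -> List[int]:
    """Append tokens until EOS is produced or max_length is reached."""
    ids = list(prompt)
    while len(ids) < max_length:
        tok = next_token(ids)
        ids.append(tok)
        if tok == EOS_ID:  # stop early; max_length is never reached
            break
    return ids


prompt = [1] * 60                                # 60 input tokens, as in the test example
script = iter([5, 5, 5, 5, 5, 5, 5, 5, EOS_ID])  # 8 ordinary tokens, then EOS
out = toy_generate(prompt, lambda ids: next(script), max_length=110)
print(len(out))  # 69
```

With 60 prompt tokens and an end-of-sequence token as the ninth generated token, the output has 69 tokens, which matches the shape reported above.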

EDIT
Let me add a little clarification. For the following text:

context: In this paper, machine learning is implemented in a simulated air-conditioning
system based on evolutionary computing methods involving the use of classifier systems
and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary
computing methods involving the use of classifier systems and genetic algorithms.

whose label is:

question: Where did machine learning implement?

the generate method of my model produces the following:

context: In this paper, machine learning is implemented in a simulated air-conditioning
system based on evolutionary computing methods involving the use of classifier systems
and genetic algorithms. answer: in a simulated air-conditioning system based on evolutionary
computing methods involving the use of classifier systems and genetic algorithms. question:
Where did machine learning implement?<|endoftext|>

despite setting max_length to 110 in generate.
