Finetuning GPT2 with user-defined loss

@npsri, I think you’ll want to include padding and truncation. The Overview section here mentions that it’s probably best to pad GPT2 on the right side of the sequence. Thankfully, this is pretty simple to do with the HF methods! Here’s how I set up my tokenizer:

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

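        # the keys here line up with the keyword arguments of the GPT2 model's forward() call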
        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

You’ll see that in the GPT2Tokenizer instantiation in the __init__, I added the right-side padding. Then, when I go to tokenize a sequence in __getitem__, I use the padding, truncation, and return_attention_mask parameters.

If you’ve been following this post, you’ll also notice some changes to this Dataset subclass compared to the one I posted before. Just something to keep in mind as I continue experimenting.

Currently, however, this tokenizer is giving me the following error:

    "Asking to pad but the tokenizer does not have a padding token. "
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

which makes sense, since:

training_tokenizer = sd_dataset.tokenizer
print('training_tokenizer.pad_token = {}'.format(training_tokenizer.pad_token))

[Out]: training_tokenizer.pad_token = None
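For what it’s worth, the tokenizer does ship with an EOS token (GPT2 uses <|endoftext|>), it just has no pad token:

print('training_tokenizer.eos_token = {}'.format(training_tokenizer.eos_token))

[Out]: training_tokenizer.eos_token = <|endoftext|>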

So now I’m just looking through the documentation to see if there is any preference about what the padding token should be.
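In the meantime, here’s a quick sketch of the two options the error message itself suggests; I haven’t confirmed yet which one is preferred for GPT2:

# Option 1: reuse the existing EOS token as the pad token
training_tokenizer.pad_token = training_tokenizer.eos_token

# Option 2: add a dedicated [PAD] token instead
# (the model's token embeddings would then need to be resized to match the
# larger vocabulary, e.g. via model.resize_token_embeddings)
training_tokenizer.add_special_tokens({'pad_token': '[PAD]'})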