Finetuning GPT2 with user-defined loss

@npsri, I think you’ll want to include padding and truncation. The Overview section here mentions that it’s probably best to pad GPT2 on the right side of the sequence. Thankfully, this is pretty simple to do with the HF methods! Here’s how I set up my tokenizer:

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

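        # the keys here line up with the keyword arguments of the GPT2 model's forward() call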
        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

You’ll see that in the GPT2Tokenizer instantiation in the __init__, I added the right-side padding. Then, when I go to tokenize a sequence in __getitem__, I use the padding, truncation, and return_attention_mask parameters.

If you’ve been following this post, you’ll also notice some changes to this Dataset subclass compared to the one I posted before. Just something to keep in mind as I continue experimenting.

Currently, however, this tokenizer is giving me the following error:

    "Asking to pad but the tokenizer does not have a padding token. "
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

which makes sense, since:

training_tokenizer = sd_dataset.tokenizer
print('training_tokenizer.pad_token = {}'.format(training_tokenizer.pad_token))

[Out]: training_tokenizer.pad_token = None
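For what it’s worth, the tokenizer does ship with an EOS token (GPT2 uses <|endoftext|>), it just has no pad token:

print('training_tokenizer.eos_token = {}'.format(training_tokenizer.eos_token))

[Out]: training_tokenizer.eos_token = <|endoftext|>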

So now I’m just looking through the documentation to see if there is any preference about what the padding token should be.
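In the meantime, here’s a quick sketch of the two options the error message itself suggests; I haven’t confirmed yet which one is preferred for GPT2:

# Option 1: reuse the existing EOS token as the pad token
training_tokenizer.pad_token = training_tokenizer.eos_token

# Option 2: add a dedicated [PAD] token instead
# (the model's token embeddings would then need to be resized to match the
# larger vocabulary, e.g. via model.resize_token_embeddings)
training_tokenizer.add_special_tokens({'pad_token': '[PAD]'})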