GPT-2 full python tokenizer example for Q/A finetuning

I’ve been doing a ton of reading and found out that the reason there aren’t many Q/A examples for GPT-2 is that most of the example code expects a Rust-based "fast" tokenizer.

Fortunately I found a repo that does exactly what I want, but I can’t make sense of how to extract the specific tokenizer example.

My end goal is to fine-tune GPT-Neo on the SQuAD v2.0 dataset for Q/A. Most examples I see are for GPT-J or GPT-NeoX, which do support the fast tokenizer, but my use case calls for a smaller model (the 125M-parameter GPT-Neo).
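
For reference, this is roughly the setup I’m aiming for (the model and dataset identifiers are what I believe they are on the Hub; treat this as a sketch, not something I’ve confirmed end to end):

from datasets import load_dataset
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

# SQuAD v2.0 from the Hugging Face Hub
squad = load_dataset("squad_v2")
# the 125M-parameter GPT-Neo with its slow/Python tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")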

I’ve been banging my head for weeks trying to get it to work, using OpenAI’s ChatGPT to assist, but it keeps giving me confusing advice (I don’t know whether I need to enclose the prompt in {} or not, or even whether I need to put “Prompt” at the beginning). This site has an example, How To Fine-Tune GPT-NeoX | Forefront, of how they set up a JSONL file for Q/A for use with Forefront’s cloud services.
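
The closest I’ve gotten to understanding their format is something like one JSON object per line, but the key names and the exact prompt layout are guesses on my part:

import json

# My guess at a Forefront-style JSONL line; the "prompt"/"completion" keys and
# the layout of the prompt text are assumptions, not confirmed from their docs.
example = {
    "prompt": "Context: ...\nQuestion: ...\nAnswer:",
    "completion": " ...the answer text...<|endoftext|>",
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")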

I have an example QADataset class I was working on:


import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split


class QADataset(Dataset):
    def __init__(self, df, tokenizer, max_length):
        # lists that will hold one entry per dataframe row
        self.input_ids = []
        self.attn_masks = []
        self.answers = []
        # iterate over the rows of the pandas dataframe
        for index, row in df.iterrows():
            # prepare the text for one example
            prep_txt = (f"<|startoftext|>Context: {row['context']}<|pad|>"
                        f"Question: {row['question']}<|pad|>"
                        f"Answer: {row['answer']}<|endoftext|>")
            # tokenize (inside the loop, so every row gets encoded)
            encodings_dict = tokenizer(prep_txt, truncation=True,
                                       max_length=max_length, padding="max_length")
            # append to the lists
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.answers.append(row['answer'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, item):
        return {'input_ids': self.input_ids[item],
                'attention_masks': self.attn_masks[item],
                'answers': self.answers[item]}

def load_qa_dataset(tokenizer):
    # okay_set and max_length are defined elsewhere in my notebook;
    # take the context/question/answer columns and sample 500 rows
    df = okay_set[['context', 'question', 'answer']]
    df = df.sample(500, random_state=1)

    # divide into train and test splits
    train, test = train_test_split(df, shuffle=True,
                                   test_size=0.05, random_state=1)

    # wrap each split in the QADataset class
    train_dataset = QADataset(train, tokenizer, max_length=max_length)
    test_dataset = QADataset(test, tokenizer, max_length=max_length)

    return train_dataset, test_dataset
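
One thing I’m unsure about is the special tokens: as far as I can tell, the GPT-2/GPT-Neo tokenizer doesn’t know about <|startoftext|> or <|pad|> out of the box, so my assumption is that I need to register them before building the datasets, roughly like this:

# Register the custom tokens used in prep_txt (my assumption of what's needed);
# without this the slow tokenizer would split them into ordinary sub-word pieces.
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M",
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>",
                                          pad_token="<|pad|>")
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens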

But I get an error

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 1
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 5
  Number of trainable parameters = 125200128
The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: attention_masks, answers. If attention_masks, answers are not expected by `GPTNeoForCausalLM.forward`,  you can safely ignore this message.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-96-c6f5b732d8ed> in <module>
     16 """
     17 
---> 18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
     19         data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
     20                                     'attention_masks': torch.stack([f[1] for f in data]),

7 frames
<ipython-input-96-c6f5b732d8ed> in <listcomp>(.0)
     17 
     18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
---> 19         data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
     20                                     'attention_masks': torch.stack([f[1] for f in data]),
     21                                     'answers': torch.stack([f[0] for f in data])}).train()

KeyError: 0
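
My best guess at the moment is that the KeyError comes from the collator indexing batch items by position (f[0], f[1]) while my __getitem__ returns a dict, so the next thing I was going to try is a dict-keyed collator along these lines (untested sketch; the labels=input_ids part is just my assumption for causal-LM fine-tuning):

def collate_fn(batch):
    # batch is a list of the dicts returned by QADataset.__getitem__
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_masks'] for example in batch])
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': input_ids}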

Any help would be much appreciated, and no, I don’t want to use BERT (there are plenty of working BERT examples I could use, but I want to use GPT-Neo and scale up to its larger models).

Alternatively, I’ll probably just use Dr. Tarlaci’s code with little modification, without really understanding how to tokenize a full-Python Q/A setup on the SQuAD 2.0 dataset.

Alternatively, I have code from one of my first attempts (after 14+ revisions) that runs all the way through but produces garbage output. After reading through Question answering - Hugging Face Course, though, I’m concerned that I’m not setting up the evaluation properly, since questions in the eval dataset can have multiple gold answers.
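
For that evaluation step, my current understanding (again, not something I’ve verified) is that each generated answer has to be scored against all of the gold answers for its question, which is what the squad_v2 metric in the evaluate library seems to do:

import evaluate

# Sketch of how I think the multi-answer references are passed; the id and the
# answer strings below are placeholders, not real SQuAD entries.
squad_v2_metric = evaluate.load("squad_v2")
predictions = [{"id": "example-0",
                "prediction_text": "some generated answer",
                "no_answer_probability": 0.0}]
references = [{"id": "example-0",
               "answers": {"text": ["gold answer 1", "gold answer 2"],
                           "answer_start": [0, 0]}}]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results)  # exact match / F1 against the best-matching gold answer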