GPT-2 full python tokenizer example for Q/A finetuning

I’ve been doing a ton of reading and found out that the reason there aren’t many Q/A examples for GPT-2 is that most of the example code expects a Rust-based "fast" tokenizer.

Fortunately I found a repo that does exactly what I want, but I can’t make sense of how to extract the specific tokenizer example.

My end goal is to fine-tune GPT-Neo on the SQuAD v2.0 dataset for Q/A. Most examples I see are for GPT-J or GPT-NeoX, which do support the fast tokenizer, but my use case calls for a smaller model (the 125M-parameter GPT-Neo).
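
For reference, this is roughly the setup I’m aiming for (the model and dataset identifiers are what I believe they are on the Hub; treat this as a sketch, not something I’ve confirmed end to end):

from datasets import load_dataset
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

# SQuAD v2.0 from the Hugging Face Hub
squad = load_dataset("squad_v2")
# the 125M-parameter GPT-Neo with its slow/Python tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")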

I’ve been banging my head for weeks trying to get it to work, using OpenAI’s ChatGPT to assist, but it keeps giving me confusing advice (I don’t know whether I need to enclose the prompt in {} or not, or even whether I need to put “Prompt” at the beginning). This site has an example, How To Fine-Tune GPT-NeoX | Forefront, of how they set up a JSONL file for Q/A for use with Forefront’s cloud services.
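
The closest I’ve gotten to understanding their format is something like one JSON object per line, but the key names and the exact prompt layout are guesses on my part:

import json

# My guess at a Forefront-style JSONL line; the "prompt"/"completion" keys and
# the layout of the prompt text are assumptions, not confirmed from their docs.
example = {
    "prompt": "Context: ...\nQuestion: ...\nAnswer:",
    "completion": " ...the answer text...<|endoftext|>",
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")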

I have an example QADataset class I was working on:


import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split


class QADataset(Dataset):
    def __init__(self, df, tokenizer, max_length):
        # lists that will hold one entry per dataframe row
        self.input_ids = []
        self.attn_masks = []
        self.answers = []
        # iterate over the rows of the pandas dataframe
        for index, row in df.iterrows():
            # prepare the text for one example
            prep_txt = (f"<|startoftext|>Context: {row['context']}<|pad|>"
                        f"Question: {row['question']}<|pad|>"
                        f"Answer: {row['answer']}<|endoftext|>")
            # tokenize (inside the loop, so every row gets encoded)
            encodings_dict = tokenizer(prep_txt, truncation=True,
                                       max_length=max_length, padding="max_length")
            # append to the lists
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.answers.append(row['answer'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, item):
        return {'input_ids': self.input_ids[item],
                'attention_masks': self.attn_masks[item],
                'answers': self.answers[item]}

def load_qa_dataset(tokenizer):
    # okay_set and max_length are defined elsewhere in my notebook;
    # take the context/question/answer columns and sample 500 rows
    df = okay_set[['context', 'question', 'answer']]
    df = df.sample(500, random_state=1)

    # divide into train and test splits
    train, test = train_test_split(df, shuffle=True,
                                   test_size=0.05, random_state=1)

    # wrap each split in the QADataset class
    train_dataset = QADataset(train, tokenizer, max_length=max_length)
    test_dataset = QADataset(test, tokenizer, max_length=max_length)

    return train_dataset, test_dataset
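
One thing I’m unsure about is the special tokens: as far as I can tell, the GPT-2/GPT-Neo tokenizer doesn’t know about <|startoftext|> or <|pad|> out of the box, so my assumption is that I need to register them before building the datasets, roughly like this:

# Register the custom tokens used in prep_txt (my assumption of what's needed);
# without this the slow tokenizer would split them into ordinary sub-word pieces.
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M",
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>",
                                          pad_token="<|pad|>")
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens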

But I get an error

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 1
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 5
  Number of trainable parameters = 125200128
The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: attention_masks, answers. If attention_masks, answers are not expected by `GPTNeoForCausalLM.forward`,  you can safely ignore this message.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-96-c6f5b732d8ed> in <module>
     16 """
     17 
---> 18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
     19         data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
     20                                     'attention_masks': torch.stack([f[1] for f in data]),

7 frames
<ipython-input-96-c6f5b732d8ed> in <listcomp>(.0)
     17 
     18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
---> 19         data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
     20                                     'attention_masks': torch.stack([f[1] for f in data]),
     21                                     'answers': torch.stack([f[0] for f in data])}).train()

KeyError: 0
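
My best guess at the moment is that the KeyError comes from the collator indexing batch items by position (f[0], f[1]) while my __getitem__ returns a dict, so the next thing I was going to try is a dict-keyed collator along these lines (untested sketch; the labels=input_ids part is just my assumption for causal-LM fine-tuning):

def collate_fn(batch):
    # batch is a list of the dicts returned by QADataset.__getitem__
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_masks'] for example in batch])
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': input_ids}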

Any help would be much appreciated, and no, I don’t want to use BERT (there are plenty of working BERT examples I could use, but I want to use GPT-Neo and scale up to its larger models).

Alternatively, I’ll probably just use Dr. Tarlaci’s code with little modification, without really understanding how to tokenize a full-Python Q/A setup on the SQuAD 2.0 dataset.

Alternatively, I have code from one of my first attempts (after 14+ revisions) that runs all the way through but produces garbage output. After reading through Question answering - Hugging Face Course, though, I’m concerned that I’m not setting up the evaluation properly, since questions in the eval dataset can have multiple gold answers.
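
For that evaluation step, my current understanding (again, not something I’ve verified) is that each generated answer has to be scored against all of the gold answers for its question, which is what the squad_v2 metric in the evaluate library seems to do:

import evaluate

# Sketch of how I think the multi-answer references are passed; the id and the
# answer strings below are placeholders, not real SQuAD entries.
squad_v2_metric = evaluate.load("squad_v2")
predictions = [{"id": "example-0",
                "prediction_text": "some generated answer",
                "no_answer_probability": 0.0}]
references = [{"id": "example-0",
               "answers": {"text": ["gold answer 1", "gold answer 2"],
                           "answer_start": [0, 0]}}]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results)  # exact match / F1 against the best-matching gold answer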