I’ve been doing a ton of reading and found out that the reason there aren’t many Q/A examples for GPT-2 is that most of them expect a Rust/fast tokenizer.
Fortunately I found a repo that does exactly what I want, but I can’t make sense of how to extract the specific tokenizer example.
My end goal is to fine-tune GPT-Neo on the SQuAD v2.0 dataset for Q/A. Most examples I see are for GPT-J or GPT-NeoX, which do support the fast tokenizer, but my use case calls for a smaller model (the 125M-parameter one).
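For reference, here is roughly how I’m loading the 125M model and its (slow) tokenizer, with the special tokens I use in the dataset class further down. The exact token choices and the resize_token_embeddings call are assumptions I carried over from GPT-2 fine-tuning examples, not something I’ve confirmed is required for GPT-Neo:

from transformers import GPT2Tokenizer, GPTNeoForCausalLM

model_name = "EleutherAI/gpt-neo-125M"

# GPT-Neo reuses the GPT-2 BPE vocabulary, so the plain (non-fast) GPT2Tokenizer loads it
tokenizer = GPT2Tokenizer.from_pretrained(model_name,
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>",
                                          pad_token="<|pad|>")

model = GPTNeoForCausalLM.from_pretrained(model_name)
# <|startoftext|> and <|pad|> are new tokens, so the embedding matrix has to grow with the vocab
model.resize_token_embeddings(len(tokenizer))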
I’ve been banging my head for weeks trying to get it to work, using OpenAI’s ChatGPT to assist, and it keeps giving me confusing advice (I don’t know whether I need to enclose the prompt in {}, or even whether I need to put “Prompt” at the beginning). This site has an example, How To Fine-Tune GPT-NeoX | Forefront, of how they set up a JSONL file for Q/A for use with Forefront’s cloud services.
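To show what I mean about the braces: my current guess is that the {} in those examples are just Python f-string placeholders, not literal characters, and that each JSONL line is one flat text field, roughly like this (the "text" field name and the layout are my guesses, not something I’ve confirmed against the Forefront docs):

import json

context = "Normandy is a region of France."
question = "In what country is Normandy located?"
answer = "France"

# my guess at one training line; the braces below are f-string placeholders, not literal text
line = {"text": f"Context: {context}\nQuestion: {question}\nAnswer: {answer}<|endoftext|>"}

with open("squad_qa.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(line) + "\n")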
Here is an example QADataset class I was working on:
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split


class QADataset(Dataset):
    def __init__(self, df, tokenizer, max_length):
        # storage for the encoded examples
        self.input_ids = []
        self.attn_masks = []
        self.answers = []
        # iterate over the rows of the pandas dataframe
        for index, row in df.iterrows():
            # prepare the text
            prep_txt = (f"<|startoftext|>Context: {row['context']}<|pad|>"
                        f"Question: {row['question']}<|pad|>"
                        f"Answer: {row['answer']}<|endoftext|>")
            # tokenize
            encodings_dict = tokenizer(prep_txt, truncation=True,
                                       max_length=max_length, padding="max_length")
            # append to the lists
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.answers.append(row['answer'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, item):
        return {'input_ids': self.input_ids[item],
                'attention_masks': self.attn_masks[item],
                'answers': self.answers[item]}


def load_qa_dataset(tokenizer):
    # okay_set is a pandas DataFrame of SQuAD v2.0 examples; sample 500 rows
    df = okay_set[['context', 'question', 'answer']]
    df = df.sample(500, random_state=1)
    # divide into train and test splits
    train, test = train_test_split(df, shuffle=True,
                                   test_size=0.05, random_state=1)
    # wrap into QADataset instances
    train_dataset = QADataset(train, tokenizer, max_length=max_length)
    test_dataset = QADataset(test, tokenizer, max_length=max_length)
    return train_dataset, test_dataset
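For context, okay_set and max_length come from a setup cell roughly like this (max_length=512 is just what I picked); flattening SQuAD v2’s answers column to a single string, with an empty string for unanswerable questions, is my own hack and may well be part of my problem:

from datasets import load_dataset

max_length = 512

# roughly how I build the DataFrame that load_qa_dataset() slices
squad = load_dataset("squad_v2", split="train")
df = squad.to_pandas()

# SQuAD v2 stores answers as {"text": [...], "answer_start": [...]}; take the first
# answer text and fall back to an empty string for unanswerable questions
df["answer"] = df["answers"].apply(lambda a: a["text"][0] if len(a["text"]) > 0 else "")
okay_set = df[["context", "question", "answer"]]

train_dataset, test_dataset = load_qa_dataset(tokenizer)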
But when I run training I get this error:
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
Num examples = 1
Num Epochs = 5
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient Accumulation steps = 1
Total optimization steps = 5
Number of trainable parameters = 125200128
The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: attention_masks, answers. If attention_masks, answers are not expected by `GPTNeoForCausalLM.forward`, you can safely ignore this message.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-96-c6f5b732d8ed> in <module>
16 """
17
---> 18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
19 data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
20 'attention_masks': torch.stack([f[1] for f in data]),
7 frames
<ipython-input-96-c6f5b732d8ed> in <listcomp>(.0)
17
18 Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
---> 19 data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
20 'attention_masks': torch.stack([f[1] for f in data]),
21 'answers': torch.stack([f[0] for f in data])}).train()
KeyError: 0
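My guess at the immediate cause is that __getitem__ returns a dict, so each f in the data_collator lambda is a dict and f[0] isn’t a valid key; presumably the collator should index by name and hand the model labels instead. Is something like this closer to what it should be? Just a sketch of what I would try; I’m not sure that reusing input_ids as labels, or my renaming to attention_mask, is actually right.

def qa_collator(batch):
    # index the dataset items by key instead of position
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_masks'] for item in batch])
    # for causal-LM fine-tuning, reuse the input ids as labels (my assumption)
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': input_ids}

Trainer(model=model, args=training_args,
        train_dataset=train_dataset, eval_dataset=test_dataset,
        data_collator=qa_collator).train()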
Any help would be much appreciated. And no, I don’t want to use BERT (there are plenty of working BERT examples I could use), but I want to use the more advanced GPT-Neo and then scale up to its larger models.
Alternatively, I’m probably going to just use Dr. Tarlaci’s code with little modification, without really understanding how to tokenize a full Q/A setup in Python using the SQuAD 2.0 dataset.