How to understand the answer_start parameter of the SQuAD dataset for training a BERT QA model + practical implications for creating a custom dataset?

Hello,

This is my first time posting so I’m sorry if another section would be more appropriate for this question.

I am in the process of creating a custom dataset to benchmark the accuracy of the ‘bert-large-uncased-whole-word-masking-finetuned-squad’ model on my domain, to understand whether I need to fine-tune further, etc.

When looking at the different question answering datasets on the Hugging Face Hub (squad, adversarial_qa, etc.), I see that the answers field is commonly formatted as a dictionary with two keys: text (a list of answer strings) and answer_start (a list of character indices marking where each answer begins in the context).

I’m trying to understand:

  • The intuition behind how the model uses answer_start when calculating the loss, accuracy, etc. (my current understanding is sketched right after this list).
  • Whether I need to go through the process of adding answer_start to my custom dataset (e.g. to make it easier to run existing evaluation code).
  • If so, whether anyone has a link to pre-written code or knows of packages that help create it (I don’t want to reinvent the wheel if I don’t have to). The simple approach I’m considering is sketched after the format example below.
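
On the first point, my current understanding (please correct me if this is wrong) is that answer_start itself never reaches the model: during preprocessing, a fast tokenizer’s offset mapping converts the character index into token-level start/end positions, and BertForQuestionAnswering is then trained with cross-entropy loss on its start and end logits against those positions. A minimal sketch of that conversion, with a made-up context and question:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad')

# Toy example (context, question, and answer are made up).
context = "BERT was published by researchers at Google in 2018."
question = "Who published BERT?"
answer_text = "researchers at Google"
answer_start = context.find(answer_text)      # char index, like SQuAD's answer_start
answer_end = answer_start + len(answer_text)  # char index just past the answer

# Tokenize question and context together; offset_mapping ties each token
# back to a character span in the original strings.
enc = tokenizer(question, context, return_offsets_mapping=True)

# Restrict the search to context tokens (sequence id 1), then find the
# tokens whose character spans contain the answer's char boundaries.
start_token = end_token = None
for i, (s, e) in enumerate(enc['offset_mapping']):
    if enc.sequence_ids()[i] != 1:
        continue
    if s <= answer_start < e:
        start_token = i
    if s < answer_end <= e:
        end_token = i

print(start_token, end_token)
# These token indices become the training labels: the model predicts start/end
# logits over all tokens, and the loss averages the cross-entropy of the start
# logits vs start_token and the end logits vs end_token.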

Any help or direction would be greatly appreciated!

Code example to show format:

import datasets

# Load SQuAD and inspect the answers field of the first training example.
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example:\n')
print(train['answers'][0])
# e.g. {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
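
On the custom-dataset side, if every answer appears verbatim in its context, it looks like answer_start can be computed with plain str.find. A minimal sketch, with made-up field names and records:

# Hypothetical raw records for a custom dataset (field names are mine).
raw = [
    {'id': '0',
     'context': 'The warranty covers parts and labor for two years.',
     'question': 'How long does the warranty last?',
     'answer_text': 'two years'},
]

squad_style = []
for r in raw:
    start = r['context'].find(r['answer_text'])  # char index, or -1 if absent
    if start == -1:
        continue  # answer must appear verbatim in the context; skip or fix otherwise
    squad_style.append({
        'id': r['id'],
        'context': r['context'],
        'question': r['question'],
        'answers': {'text': [r['answer_text']], 'answer_start': [start]},
    })

print(squad_style[0]['answers'])
# {'text': ['two years'], 'answer_start': [40]}

From what I can tell, the official SQuAD metric only compares answer text for exact match / F1, so answer_start matters mainly for producing the token-level training labels shown above; but keeping the field means the data drops into existing SQuAD-style preprocessing and evaluation scripts unchanged.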

I’m currently going through this as well. Were you able to find out how to work with a custom dataset?