How to understand the answer_start parameter of the SQuAD dataset for training a BERT QA model + practical implications for creating a custom dataset?

Hello,

This is my first time posting so I’m sorry if another section would be more appropriate for this question.

I am in the process of creating a custom dataset to benchmark the accuracy of the ‘bert-large-uncased-whole-word-masking-finetuned-squad’ model on my domain, to understand whether I need to fine-tune further, etc.

When looking at the different question answering datasets on the Hugging Face Hub (squad, adversarial_qa, etc.), I see that the answers field is commonly formatted as a dictionary with two keys: text (a list of answer strings) and answer_start (a list of character indices marking where each answer begins in the context).

I’m trying to understand:

  • The intuition behind how the model uses answer_start when calculating the loss, accuracy, etc. (my current understanding is sketched right after this list).
  • Whether I need to go through the process of adding answer_start to my custom dataset (e.g. to make it easier to run existing evaluation code).
  • If so, whether anyone has a link to pre-written code or knows of packages that help create it (I don’t want to reinvent the wheel if I don’t have to). The simple approach I’m considering is sketched after the format example below.
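
On the first point, my current understanding (please correct me if this is wrong) is that answer_start itself never reaches the model: during preprocessing, a fast tokenizer’s offset mapping converts the character index into token-level start/end positions, and BertForQuestionAnswering is then trained with cross-entropy loss on its start and end logits against those positions. A minimal sketch of that conversion, with a made-up context and question:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad')

# Toy example (context, question, and answer are made up).
context = "BERT was published by researchers at Google in 2018."
question = "Who published BERT?"
answer_text = "researchers at Google"
answer_start = context.find(answer_text)      # char index, like SQuAD's answer_start
answer_end = answer_start + len(answer_text)  # char index just past the answer

# Tokenize question and context together; offset_mapping ties each token
# back to a character span in the original strings.
enc = tokenizer(question, context, return_offsets_mapping=True)

# Restrict the search to context tokens (sequence id 1), then find the
# tokens whose character spans contain the answer's char boundaries.
start_token = end_token = None
for i, (s, e) in enumerate(enc['offset_mapping']):
    if enc.sequence_ids()[i] != 1:
        continue
    if s <= answer_start < e:
        start_token = i
    if s < answer_end <= e:
        end_token = i

print(start_token, end_token)
# These token indices become the training labels: the model predicts start/end
# logits over all tokens, and the loss averages the cross-entropy of the start
# logits vs start_token and the end logits vs end_token.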

Any help or direction would be greatly appreciated!

Code example to show format:

import datasets

# Load SQuAD and inspect the answers field of the first training example.
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example:\n')
print(train['answers'][0])
# e.g. {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
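
On the custom-dataset side, if every answer appears verbatim in its context, it looks like answer_start can be computed with plain str.find. A minimal sketch, with made-up field names and records:

# Hypothetical raw records for a custom dataset (field names are mine).
raw = [
    {'id': '0',
     'context': 'The warranty covers parts and labor for two years.',
     'question': 'How long does the warranty last?',
     'answer_text': 'two years'},
]

squad_style = []
for r in raw:
    start = r['context'].find(r['answer_text'])  # char index, or -1 if absent
    if start == -1:
        continue  # answer must appear verbatim in the context; skip or fix otherwise
    squad_style.append({
        'id': r['id'],
        'context': r['context'],
        'question': r['question'],
        'answers': {'text': [r['answer_text']], 'answer_start': [start]},
    })

print(squad_style[0]['answers'])
# {'text': ['two years'], 'answer_start': [40]}

From what I can tell, the official SQuAD metric only compares answer text for exact match / F1, so answer_start matters mainly for producing the token-level training labels shown above; but keeping the field means the data drops into existing SQuAD-style preprocessing and evaluation scripts unchanged.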

I’m currently going through this as well. Were you able to find out how to work with a custom dataset?