Hi, I wanted to fine-tune ELECTRA on my own SQuAD 2.0-style dataset, so I looked at the following documentation to figure out what the data format should be.
In the walkthrough, it seems that only answerable questions actually make it into the training/validation datasets. In the JSON files, if a question cannot be answered, the “answers” array is empty. However, in the walkthrough, this is how a (context, question, answer) triplet gets added to the data:
    for answer in qa['answers']:
        contexts.append(context)
        questions.append(question)
        answers.append(answer)
Because it’s iterating through the “answers” array, if I’m not mistaken, unanswerable questions will never get added to the data.
I’m asking because I’m not sure how to feed the model unanswerable questions during training. The example in the doc just seems to ignore them, and they’re a big part of SQuAD 2.0.
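To make the concern concrete, here is a minimal sketch on a made-up SQuAD 2.0-style dict. The function name `read_squad`, the `keep_unanswerable` flag, and the placeholder answer (empty text, `answer_start` of -1) are assumptions for illustration and are not taken from the walkthrough.

```python
# Toy SQuAD 2.0-style data, invented for illustration only.
squad_dict = {
    "data": [{
        "paragraphs": [{
            "context": "ELECTRA was proposed in 2020.",
            "qas": [
                {"question": "When was ELECTRA proposed?",
                 "is_impossible": False,
                 "answers": [{"text": "2020", "answer_start": 24}]},
                {"question": "Who funded ELECTRA?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }]
    }]
}

# Hypothetical placeholder marking "no answer"; not a convention from the docs.
NO_ANSWER = {"text": "", "answer_start": -1}


def read_squad(squad_dict, keep_unanswerable=False):
    """Flatten the nested JSON into parallel lists of contexts, questions, answers."""
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            context = passage["context"]
            for qa in passage["qas"]:
                qa_answers = qa["answers"]
                # An empty "answers" list means the inner loop body never runs,
                # so the question is silently dropped unless we add a placeholder.
                if not qa_answers and keep_unanswerable:
                    qa_answers = [dict(NO_ANSWER)]
                for answer in qa_answers:
                    contexts.append(context)
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers


_, kept_default, _ = read_squad(squad_dict)
_, kept_all, _ = read_squad(squad_dict, keep_unanswerable=True)
print(len(kept_default), len(kept_all))
```

With the default behaviour the unanswerable question is dropped; with the flag set it is kept, and downstream code would still need to map the placeholder to a sensible training target.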
Conceptually, I think we need to set the targets (both start and end positions) to 0, i.e. the [CLS] token, for all unanswerable questions. We also need to set or tune a threshold on the predicted logits to decide whether a question is unanswerable (this means modifying the official example a bit).
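A rough sketch of both ideas in plain Python, with no particular library assumed. One common convention, assumed here, is to point both targets at index 0 (the [CLS] token) for unanswerable questions, and at inference time to compare the "null" score at index 0 with the best non-null span score. The names `target_positions`, `predict_span`, `threshold`, and `max_answer_len` are hypothetical.

```python
def target_positions(answer_token_span):
    """Return (start, end) training targets for one example.

    answer_token_span is a (start, end) token index pair, or None when the
    question is unanswerable. Assumption: index 0 is the [CLS] token.
    """
    if answer_token_span is None:
        return (0, 0)
    return answer_token_span


def predict_span(start_logits, end_logits, threshold=0.0, max_answer_len=30):
    """Return the best (start, end) span, or None if predicted unanswerable."""
    # Score of the "no answer" prediction: both pointers on [CLS].
    null_score = start_logits[0] + end_logits[0]

    # Best-scoring span that does not start at [CLS].
    best_score, best_span = float("-inf"), None
    for start in range(1, len(start_logits)):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)

    # Predict "no answer" when the null score beats the best span by more
    # than the threshold; the threshold would be tuned on a dev set.
    if best_span is None or null_score - best_score > threshold:
        return None
    return best_span


# Logits dominated by position 0: treated as unanswerable.
print(predict_span([5.0, 0.1, 0.2], [5.0, 0.1, 0.3]))
# Logits peaking on token 2: a span is returned.
print(predict_span([0.0, 0.1, 3.0], [0.0, 0.2, 4.0]))
```

Raising `threshold` makes the model more willing to return a span; lowering it makes "no answer" predictions more frequent.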
Note that this is not the “official” example, but a simplified version for a tutorial. The official example is in examples/question-answering (will be further simplified very soon as I’m working on a PR) and does take into account the unanswerable questions.