Based on HF documentation, unanswerable questions from SQuAD 2.0 don't make it into train/val data

Hi, I wanted to finetune Electra on my own Squad 2.0-style dataset so I looked at the following documentation to figure out what the data format should be.

It seems that in the walkthrough, only answerable questions actually make it into the training/validation datasets. In the JSON files, if a question cannot be answered, the “answers” array is empty. However, in the walkthrough, this is how a (context, question, answer) triplet gets added to the data:
    for answer in qa['answers']:

Because it’s iterating through the “answers” array, if I’m not mistaken, the loop body never runs for unanswerable questions, so they will never get added to the data.
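To make the issue concrete, here is a minimal sketch of a reader that keeps unanswerable questions instead of skipping them. It assumes the standard SQuAD 2.0 JSON layout (`data` → `paragraphs` → `qas`). The function name `read_squad_examples` and the `answer_start = -1` sentinel for "no answer" are my own placeholders, not something taken from the walkthrough.

```python
def read_squad_examples(squad_dict):
    """Collect (context, question, answer) triplets, keeping unanswerable questions.

    `squad_dict` is the parsed SQuAD 2.0 JSON. The empty-text / -1 sentinel
    used for unanswerable questions is an assumption made for this sketch.
    """
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            context = passage["context"]
            for qa in passage["qas"]:
                question = qa["question"]
                if qa["answers"]:
                    # Answerable: one triplet per annotated answer.
                    for answer in qa["answers"]:
                        contexts.append(context)
                        questions.append(question)
                        answers.append({
                            "text": answer["text"],
                            "answer_start": answer["answer_start"],
                        })
                else:
                    # Unanswerable: "answers" is empty, so add one triplet
                    # explicitly instead of silently dropping the question.
                    contexts.append(context)
                    questions.append(question)
                    answers.append({"text": "", "answer_start": -1})
    return contexts, questions, answers
```

Downstream code would then have to treat the sentinel as "no answer" when it converts character offsets to token positions.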


@valhalla @sgugger Not sure if you two are the right people to tag but thought I’d start somewhere!

Asking because I’m not sure how to feed the model unanswerable questions in training, since the example in the doc just seems to ignore them - and they’re a big part of SQuAD 2.0

Hi melody, I believe you are right.

Conceptually, I think we need to set the target logits (both start and end) to be all zeros for all unanswerable questions. We would also need to set or tune a threshold on the predicted logits to decide whether a question is unanswerable (this means modifying the official example a bit).
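As a rough sketch of the thresholding idea: one common convention, which is slightly different from the all-zeros targets described above, is to point both the start and end targets at the first token (index 0, usually `[CLS]`) for unanswerable questions. At prediction time you then compare the "null" score at that position with the best non-null span score. The function names, `null_threshold`, and `max_answer_len` below are hypothetical and would need tuning on a dev set. This is not the code from the official example.

```python
def token_targets(answer_token_span, cls_index=0):
    """Return (start, end) token targets for one training example.

    `answer_token_span` is a (start, end) tuple of token indices, or None
    for an unanswerable question. Unanswerable questions point at the
    [CLS] position (assumed here to be index 0).
    """
    if answer_token_span is None:
        return cls_index, cls_index
    return answer_token_span


def predict_span(start_logits, end_logits, null_threshold=0.0,
                 cls_index=0, max_answer_len=30):
    """Return the best (start, end) token span, or None if judged unanswerable."""
    # Score of the "no answer" prediction: both pointers on [CLS].
    null_score = start_logits[cls_index] + end_logits[cls_index]

    best_score, best_span = float("-inf"), None
    for start in range(len(start_logits)):
        if start == cls_index:
            continue
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            if end == cls_index:
                continue
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)

    # Predict "unanswerable" when the null score beats the best span
    # by more than the (tunable) threshold.
    if best_span is None or null_score - best_score > null_threshold:
        return None
    return best_span
```

Raising `null_threshold` makes the model answer more often, and lowering it makes it abstain more often, so it is typically picked by sweeping values on the validation set.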

Note that this is not the “official” example, but a simplified version for a tutorial. The official example is in examples/question-answering (will be further simplified very soon as I’m working on a PR) and does take into account the unanswerable questions.