OpenbookQA and CommonsenseQA data format issues

Hey there,

I work extensively with these two datasets, and I noticed that the schema of both differs from the original format. In fact, OpenbookQA contains the wrong data in the label field [1]. OpenbookQA also flattens the ["question"]["stem"] field and simply renames it to question_stem [2]. CommonsenseQA, in turn, discards the id field completely and flattens ["question"]["stem"] into a question field [3].
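To make the difference concrete, here is a minimal sketch of how a flattened record could be mapped back to the nested ["question"]["stem"] layout. The helper names and the sample values are hypothetical and only for illustration. Only the stem field described above is handled, and the id that CommonsenseQA discards cannot be recovered this way.

```python
def renest_openbookqa(flat):
    """Hypothetical helper: move question_stem back under question.stem."""
    # Copy every field except the flattened one, then rebuild the nesting.
    nested = {k: v for k, v in flat.items() if k != "question_stem"}
    nested["question"] = {"stem": flat["question_stem"]}
    return nested


def renest_commonsense_qa(flat):
    """Hypothetical helper: move the flat question string under question.stem."""
    nested = {k: v for k, v in flat.items() if k != "question"}
    nested["question"] = {"stem": flat["question"]}
    return nested


# Illustrative records (made-up values, not taken from either dataset).
flat_obqa = {"id": "example-1", "question_stem": "What do plants need?", "answerKey": "A"}
flat_csqa = {"question": "Where would you find a book?", "answerKey": "B"}

nested_obqa = renest_openbookqa(flat_obqa)
nested_csqa = renest_commonsense_qa(flat_csqa)
```

With the nesting restored, code written against the original files can read record["question"]["stem"] for both datasets.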

Deviating from the original format discourages HF Datasets adoption. I am guessing this was a simple mistake made while creating these datasets rather than a design decision, because I can't see a reason for it.

I can create a GitHub issue and PR if we agree on this one.

Best,
Vladimir

[1] datasets/openbookqa.py at master · huggingface/datasets · GitHub
[2] datasets/openbookqa.py at master · huggingface/datasets · GitHub
[3] datasets/commonsense_qa.py at master · huggingface/datasets · GitHub

Hi! I agree with all these points. Feel free to open a GH issue and a PR. There is an open issue to fix [1]. Besides [1] and [2], we should also add the missing fields to the additional config of OpenbookQA, which are present in the TFDS script but not in ours.

For the record, the created issues are: