OpenbookQA and CommonsenseQA data format issues

vblagoje · May 3, 2022, 8:08am

Hey there,

I work extensively with these two datasets, and I noticed that the data format/schema of both differs from the original formats. In fact, OpenBookQA contains the wrong data for the label field [1]. OpenBookQA also flattens the schema for the ["question"]["stem"] field and simply renames it to question_stem [2]. CommonsenseQA, in turn, discards the id field completely and flattens ["question"]["stem"] into the question field [3].

Deviating from the original format discourages HF Datasets adoption. I am guessing this was a simple mistake while creating these datasets and that it wasn’t a design decision. It just doesn’t make sense.

I can create a Github issue and PR if we agree on this one.

Best,
Vladimir

[1] datasets/openbookqa.py at master · huggingface/datasets · GitHub
[2] datasets/openbookqa.py at master · huggingface/datasets · GitHub
[3] datasets/commonsense_qa.py at master · huggingface/datasets · GitHub

mariosasko · May 3, 2022, 1:48pm

Hi! I agree with all these points. Feel free to open a GH issue and a PR. There is already an open issue to fix [1]. Besides [1] and [2], we should also add to the additional config of OpenBookQA the fields that are present in the TFDS script but missing from ours.

vblagoje · May 4, 2022, 6:00am

For the record, the issues I created are:

github.com/huggingface/datasets

CommonSenseQA has missing and inconsistent field names

opened 05:38AM - 04 May 22 UTC

vblagoje

dataset bug

## Describe the bug

In short, CommonSenseQA implementation is inconsistent with the original dataset. More precisely, we need to:

1. Add the dataset matching "id" field. The current dataset, instead, regenerates monotonically increasing id.
2. The ["question"]["stem"] field is flattened into "question". We should match the original dataset and unflatten it
3. Add the missing "question_concept" field in the question tree node
4. Anything else? Go over the data structure of the newly repaired CommonSenseQA and make sure it matches the original

## Expected results

Every data item of the CommonSenseQA should structurally and data-wise match the original CommonSenseQA dataset.

## Actual results

TBD

## Environment info

- `datasets` version: 2.1.0
- Platform: macOS-10.15.7-x86_64-i386-64bit
- Python version: 3.8.13
- PyArrow version: 7.0.0
- Pandas version: 1.4.2
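Items 1 to 3 above can be illustrated with a rough sketch. This is not the fix in the loading script; the function name `unflatten_commonsense_qa`, its parameters, and the sample values are hypothetical and made up for illustration. It only assumes what the issue states: the flattened record keeps the question text under "question", while the original nests it under ["question"]["stem"] and also carries "id" and "question_concept".

```python
def unflatten_commonsense_qa(flat, original_id, question_concept):
    """Rebuild the nested layout described in the issue (illustrative only).

    flat             -- a flattened record whose "question" key holds the stem text
    original_id      -- the id from the original dataset (hypothetical argument)
    question_concept -- the concept from the original dataset (hypothetical argument)
    """
    # Copy every field except the flattened question text unchanged.
    nested = {key: value for key, value in flat.items() if key != "question"}
    # Item 1: restore the original id instead of a regenerated counter.
    nested["id"] = original_id
    # Items 2 and 3: nest the stem and the concept under "question".
    nested["question"] = {
        "stem": flat["question"],
        "question_concept": question_concept,
    }
    return nested


# Made-up sample values, not real dataset content.
flat_record = {"question": "example stem", "answerKey": "A"}
nested_record = unflatten_commonsense_qa(flat_record, "example-id", "example concept")
```

The other fields are passed through untouched because the issue leaves their final layout open (item 4).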

github.com/huggingface/datasets

OpenBookQA has missing and inconsistent field names

opened 05:51AM - 04 May 22 UTC

vblagoje

dataset bug

## Describe the bug

OpenBookQA implementation is inconsistent with the original dataset. We need to:

1. The dataset field [question][stem] is flattened into question_stem. Unflatten it to match the original format.
2. Add missing additional fields:
   - 'fact1': row['fact1'],
   - 'humanScore': row['humanScore'],
   - 'clarity': row['clarity'],
   - 'turkIdAnonymized': row['turkIdAnonymized']
3. Ensure the structure and every data item in the original OpenBookQA matches our OpenBookQA version.

## Expected results

The structure and every data item in the original OpenBookQA matches our OpenBookQA version.

## Actual results

TBD

## Environment info

- `datasets` version: 2.1.0
- Platform: macOS-10.15.7-x86_64-i386-64bit
- Python version: 3.8.13
- PyArrow version: 7.0.0
- Pandas version: 1.4.2
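As a rough sketch of items 1 and 2 above, and not the actual change to the loading script: the helper name `restore_openbookqa`, the `ADDITIONAL_FIELDS` constant, and the sample values are hypothetical. The only assumptions taken from the issue are the flattened key question_stem, the target [question][stem] nesting, and the four additional field names read from `row`.

```python
# Field names are taken from the issue text; the constant itself is hypothetical.
ADDITIONAL_FIELDS = ("fact1", "humanScore", "clarity", "turkIdAnonymized")


def restore_openbookqa(flat, row):
    """Unflatten question_stem and copy the additional fields (illustrative only).

    flat -- a flattened record that stores the stem under "question_stem"
    row  -- a source row that carries the additional fields
    """
    # Copy everything except the flattened stem unchanged.
    nested = {key: value for key, value in flat.items() if key != "question_stem"}
    # Item 1: move the stem back under question -> stem.
    nested["question"] = {"stem": flat["question_stem"]}
    # Item 2: add the fields that the issue lists as missing.
    for name in ADDITIONAL_FIELDS:
        nested[name] = row[name]
    return nested


# Made-up sample values, not real dataset content.
flat_item = {"id": "example-id", "question_stem": "example stem", "answerKey": "B"}
source_row = {
    "fact1": "example fact",
    "humanScore": "1.0",
    "clarity": "2.0",
    "turkIdAnonymized": "example-worker",
}
restored_item = restore_openbookqa(flat_item, source_row)
```

A missing field in `row` raises a `KeyError` here, which makes an incomplete source row visible instead of silently dropping the field.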
