How to create a dataset from CSV for training question answering?

fate25 · July 6, 2023, 7:25am

I made a CSV modeled on the SQuAD data structure. I am now trying to use the script from here to train question answering on this custom CSV.

When converting the CSV to a Dataset, I understood that the feature for the answers column needs to be converted to Sequence (feature…, so I converted it as follows.

import pandas as pd
from datasets import Dataset, DatasetDict, Features, Sequence, Value

# Custom features
ans_feature = Features({'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)})
custom_features = Features(
    {
        "id": Value(dtype='string', id=None),
        "title": Value(dtype='string', id=None),
        "context": Value(dtype='string', id=None),
        "question": Value(dtype='string', id=None),
        "answers": Sequence(ans_feature, length=-1, id=None),
    }
)

#Create train datasets
df = pd.read_csv("train.csv", encoding="utf_8")
train_l = []
for index, row in df.iterrows():
    train_l.append({'id': row[0], 'title': row[1], 'context': row[2], 'question': row[3], 'answers': row[4]})

train_dataset = Dataset.from_pandas(pd.DataFrame(data=train_l), features=custom_features)

#Create validation datasets
df = pd.read_csv("validation.csv", encoding="utf_8")
train_v = []
for index, row in df.iterrows():
    train_v.append({'id': row[0], 'title': row[1], 'context': row[2], 'question': row[3], 'answers': row[4]})

validation_dataset = Dataset.from_pandas(pd.DataFrame(data=train_v), features=custom_features)

#Create datasets
raw_datasets = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
})

However, the conversion failed with the following error.

  File "/Users/tphan/.pyenv/versions/anaconda3-2023.03/envs/transformers/lib/python3.10/site-packages/datasets/table.py", line 2140, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
string
to
Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)

Can someone please show me the correct approach, with code?

mariosasko · July 6, 2023, 1:11pm

Can you paste the first few lines of your CSV? What is the structure of the answers column in your CSV?

You can use the converters param in read_csv to parse complex columns.

Also, JSON (Lines) is much better for representing such data (load_dataset would work out of the box).
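
As a minimal sketch of the JSON Lines suggestion, assuming SQuAD-style records in which answers holds parallel text and answer_start lists, each example can be written as one JSON object per line. The sample record, the helper names write_jsonl and read_jsonl, and the file name train.jsonl are hypothetical and only serve as illustration.

```python
import json
import os
import tempfile

# Hypothetical SQuAD-style record; "answers" holds parallel lists
# ("text" and "answer_start"), one entry per annotated answer.
records = [
    {
        "id": "0001",
        "title": "Example",
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": {"text": ["Paris"], "answer_start": [0]},
    },
]


def write_jsonl(rows, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory and read back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
```

A file written this way can be passed to load_dataset("json", data_files={"train": "train.jsonl"}), which should infer the nested answers structure without custom features.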

fate25 · July 7, 2023, 4:28am

@mariosasko Thank you for replying.

I saved the SQuAD data to CSV for testing. I then tried the two CSV patterns shown below, but neither could be converted successfully. I haven’t used the converters parameter yet, so I’ll try it.

mariosasko · July 7, 2023, 1:16pm

You should be able to parse the second pattern with:

pd.read_csv(..., converters={"answers": lambda x: json.loads(x)})
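
A self-contained sketch of the converters approach, assuming the answers cell holds valid JSON (double-quoted keys and strings). The CSV content below is hypothetical, since the actual file is not shown in the thread. Note that inside a quoted CSV field each inner double quote has to be doubled.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content: the answers cell is a JSON string, so its inner
# double quotes are doubled ("") following the CSV quoting rules.
csv_text = (
    "id,title,context,question,answers\n"
    "0001,Example,Paris is the capital of France.,"
    "What is the capital of France?,"
    '"{""text"": [""Paris""], ""answer_start"": [0]}"\n'
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str},                   # keep ids such as "0001" as strings
    converters={"answers": json.loads},  # parse the JSON cell into a dict
)

# The answers column now holds dicts rather than raw strings.
answers = df.loc[0, "answers"]
```

With the column parsed into dicts, casting to a Sequence feature no longer starts from a plain string column, which is what the TypeError above complains about.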

fate25 · July 10, 2023, 6:28am

Thank you for your reply. I tried it, but it still fails with the error “json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes”.

The data structure doesn’t seem correct, so I’ll try again. Thanks again.
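
One common cause of this particular JSONDecodeError is a cell that contains a Python dict literal with single quotes rather than JSON. This is an assumption here, because the CSV itself is not shown in the thread. Under that assumption, a fallback to ast.literal_eval can parse such cells. The helper name parse_answers and the sample strings are hypothetical.

```python
import ast
import json


def parse_answers(cell):
    """Parse an answers cell that is either JSON or a Python dict literal."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Single-quoted keys are not valid JSON but are a valid Python literal.
        return ast.literal_eval(cell)


# Hypothetical cell contents: the first is what str(dict) produces,
# the second is what json.dumps(dict) produces.
python_style = "{'text': ['Paris'], 'answer_start': [0]}"
json_style = '{"text": ["Paris"], "answer_start": [0]}'

parsed_python = parse_answers(python_style)
parsed_json = parse_answers(json_style)
```

A function like this could be passed as converters={"answers": parse_answers} to read_csv. It would not handle cells containing NumPy reprs such as array([...]), which are not plain Python literals.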

fate25 · July 13, 2023, 12:39am

Sharing this for anyone in the same situation. I was finally able to train with the following data structure. I hope it is useful as a reference. Many thanks again to @mariosasko.

Aurelie123 · January 20, 2025, 5:25pm

Hi, I am in the same situation, though my post may be a bit late. I created a CSV file in the SQuAD format in Google Sheets and realised that, on inspection, the answers feature is recognised as a string and is wrapped in “”, whereas the original SQuAD dataset has no “” wrapping the “{ …}” in its answers feature. So far, I have uploaded the file to GitHub, where I can see the " " and remove them manually (I only have 300 entries), and then downloaded the file again as CSV. I hope this will work.
