Hi @Neuroinformatica, from the datasets docs it seems that the ideal format is line-separated JSON (one JSON object per line), so what I usually do is convert the SQuAD format as follows:
import json

from datasets import load_dataset

input_filename = "dev-v2.0.json"
output_filename = "dev-v2.0.jsonl"

# Load the nested SQuAD file: data -> paragraphs -> qas
with open(input_filename) as f:
    dataset = json.load(f)

# Write one flat JSON object per question, one object per line
with open(output_filename, "w") as f:
    for article in dataset["data"]:
        title = article["title"]
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                idx = qa["id"]
                # Build a fresh dict per question so records never share state
                answers = {
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                }
                f.write(
                    json.dumps(
                        {
                            "id": idx,
                            "title": title,
                            "context": context,
                            "question": question,
                            "answers": answers,
                        }
                    )
                )
                f.write("\n")

# Load the converted file with the generic JSON loader
ds = load_dataset("json", data_files=output_filename)
This converts each question in the SQuAD dataset into a single JSON object of the form
{
    "id": "56ddde6b9a695914005b9628",
    "title": "Normans",
    "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
    "question": "In what country is Normandy located?",
    "answers": {
        "text": [
            "France",
            "France",
            "France",
            "France"
        ],
        "answer_start": [
            159,
            159,
            159,
            159
        ]
    }
}
which is then well suited to the Arrow columnar format of datasets. HTH!
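P.S. In case it is useful, here is a minimal sketch of the same flattening wrapped in a generator and run on a tiny SQuAD-style sample, so the output shape can be checked without downloading anything. The names `squad_to_records` and `sample` are my own (hypothetical, not part of datasets), and the sample data is invented for illustration.

```python
import json


def squad_to_records(squad):
    """Yield one flat record per question from a SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                }


# Invented sample for illustration only (not real SQuAD data)
sample = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "context": "Normandy is a region in France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "In what country is Normandy located?",
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                        {
                            # Unanswerable questions in SQuAD v2.0 carry an empty answers list
                            "id": "q2",
                            "question": "What is Normandy famous for?",
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

records = list(squad_to_records(sample))

# One JSON object per line, as in the .jsonl file above
lines = [json.dumps(record) for record in records]
```

Writing `"\n".join(lines)` to a `.jsonl` file gives the same layout as the script above, which can then be passed to `load_dataset("json", data_files=...)`. Note that a question with no answers still gets the `text` and `answer_start` keys, just with empty lists, so every record has the same set of fields.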