Question answering bot: fine-tuning with custom dataset

lewtun · March 15, 2021, 6:32pm

Hi @Neuroinformatica, from the datasets docs it seems that the ideal format is line-separated JSON, so what I usually do is convert the SQuAD format as follows:

import json
from datasets import load_dataset

input_filename = "dev-v2.0.json"
output_filename = "dev-v2.0.jsonl"

# Load the nested SQuAD-format file (articles -> paragraphs -> questions).
with open(input_filename, encoding="utf-8") as f:
    dataset = json.load(f)

# Flatten to one JSON object per question, written one object per line.
with open(output_filename, "w", encoding="utf-8") as f:
    for article in dataset["data"]:
        title = article["title"]
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                idx = qa["id"]
                # Parallel lists; both are empty when a question has no answers.
                answers = {
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                }
                f.write(
                    json.dumps(
                        {
                            "id": idx,
                            "title": title,
                            "context": context,
                            "question": question,
                            "answers": answers,
                        }
                    )
                )
                f.write("\n")

ds = load_dataset("json", data_files=output_filename)

This converts each question in the SQuAD dataset into a single JSON object (one per line of the output file) of the form

{
   "id":"56ddde6b9a695914005b9628",
   "title":"Normans",
   "context":"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
   "question":"In what country is Normandy located?",
   "answers":{
      "text":[
         "France",
         "France",
         "France",
         "France"
      ],
      "answer_start":[
         159,
         159,
         159,
         159
      ]
   }
}

which is then well suited for the Arrow columnar format of datasets. HTH!
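
One thing that may be worth checking with a custom dataset is that every answer_start offset really points at the answer text inside the context, since each offset is a character index into the context string. Below is a minimal sketch of such a check, assuming records in the flattened format shown above. The helper name find_misaligned_answers and the sample record are made up for illustration and are not part of datasets.

```python
def find_misaligned_answers(record):
    """Return (text, start) pairs whose offset does not match the context.

    Hypothetical helper: assumes `record` has a "context" string and an
    "answers" dict with parallel "text" and "answer_start" lists.
    """
    context = record["context"]
    answers = record["answers"]
    return [
        (text, start)
        for text, start in zip(answers["text"], answers["answer_start"])
        # The slice starting at `start` should reproduce the answer text exactly.
        if context[start:start + len(text)] != text
    ]


# Illustrative record (not taken from SQuAD).
sample = {
    "context": "Normandy is a region in France.",
    "answers": {"text": ["France"], "answer_start": [24]},
}

problems = find_misaligned_answers(sample)
print(problems)
```

An empty list means all offsets line up. A record with empty answer lists also yields an empty list, so unanswerable questions pass the check unchanged.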
