Question answering bot: fine-tuning with custom dataset

Hello everybody

I would like to fine-tune a custom QA bot that will work on Italian texts (I was thinking about using the model 'dbmdz/bert-base-italian-cased') in a very specific field (medical reports). I already followed this guide and fine-tuned an English model using the default train and dev files.

The problem is that now I’m trying to use my own files (formatted as SQuAD 2.0), but I’m not able to perform the same operations.

This is my code:
datasets = load_dataset('json', data_files='/content/SQuAD_it-train.json', field='data')

Instead of getting something like this…
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

…I get this:
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

I tried the same command with the train-v2.0.json file downloaded from the official SQuAD website…
datasets = load_dataset('json', data_files='/content/train-v2.0.json', field='data')

…and this is what I got:
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

So I’m assuming that this is not related to the file format, but maybe to some parameter of the load_dataset function?

Thanks a lot for your attention

Claudio


Hi @Neuroinformatica, from the datasets docs it seems that the ideal format is JSON Lines (one JSON object per line), so what I usually do is convert the SQuAD format as follows:

import json
from datasets import load_dataset

input_filename = "dev-v2.0.json"
output_filename = "dev-v2.0.jsonl"

# load the nested SQuAD-format file
with open(input_filename) as f:
    dataset = json.load(f)

# write one flat JSON object per question-answer pair
with open(output_filename, "w") as f:
    for article in dataset["data"]:
        title = article["title"]
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                idx = qa["id"]
                answers = {
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                }
                f.write(
                    json.dumps(
                        {
                            "id": idx,
                            "title": title,
                            "context": context,
                            "question": question,
                            "answers": answers,
                        }
                    )
                )
                f.write("\n")

ds = load_dataset("json", data_files=output_filename)

This converts each question-answer pair in the SQuAD dataset into a single JSON object of the form

{
   "id":"56ddde6b9a695914005b9628",
   "title":"Normans",
   "context":"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
   "question":"In what country is Normandy located?",
   "answers":{
      "text":[
         "France",
         "France",
         "France",
         "France"
      ],
      "answer_start":[
         159,
         159,
         159,
         159
      ]
   }
}

which is then well suited for the Arrow columnar format of datasets. HTH!
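If you convert the train and dev files separately, you can then load them as named splits by passing a dict to data_files (the filenames here are just an assumption, following the snippet above):

from datasets import load_dataset

# assumes both SQuAD files were converted to JSON Lines with the loop above
data_files = {"train": "train-v2.0.jsonl", "validation": "dev-v2.0.jsonl"}
ds = load_dataset("json", data_files=data_files)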


That worked! Thanks a lot!

I tried this but then got the error

JSONDecodeError: Extra data: line 1 column 31362 (char 31361)

Hi @shainaraza, are you using Windows by any chance? If so, you might want to try adding encoding="utf-8" to the open() calls.
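For example, a minimal sketch of the same conversion loop with explicit encodings (filenames as above):

import json

# platform-default codecs (e.g. cp1252 on Windows) can break on non-ASCII text
with open("dev-v2.0.json", encoding="utf-8") as f:
    dataset = json.load(f)

with open("dev-v2.0.jsonl", "w", encoding="utf-8") as f:
    ...  # same conversion loop as above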

Hi @lewtun,
So here is the data that I want to squadify (as per SQuAD 2.0): https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/question-answering/200423_covidQA.json
I ran the same program as you did above, but since there are no titles in this dataset, I just added a placeholder to match the SQuAD format.

import json
from datasets import load_dataset

input_filename = "/content/drive/COVID-QA.json"
output_filename = "dev.json"

with open(input_filename) as f:
    dataset = json.load(f)

with open(output_filename, "w") as f:
    for article in dataset["data"]:
        # title = article["title"]
        title = "title"  # this COVID JSON file has no titles, but SQuAD 2.0 requires one, so use a placeholder
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            answers = {}
            for qa in paragraph["qas"]:
                question = qa["question"]
                idx = qa["id"]
                answers["text"] = [a["text"] for a in qa["answers"]]
                answers["answer_start"] = [a["answer_start"] for a in qa["answers"]]
                f.write(
                    json.dumps(
                        {
                            "id": idx,
                            "title": title,
                            "context": context,
                            "question": question,
                            "answers": answers,
                        }
                    )
                )
                f.write("\n")

ds = load_dataset("json", data_files=output_filename)

But when I try to read the output file with a plain json.loads call, it shows me the above error. I also ran this code, with the same error:

from transformers import AutoTokenizer, squad_convert_examples_to_features
from transformers.data.processors.squad import SquadV2Processor

# the tokenizer was not shown in the original snippet; assuming the model used below
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2-covid")

processor = SquadV2Processor()
# get_dev_examples looks for a file named dev-v2.0.json in this directory by default
examples = processor.get_dev_examples("/content/")
features, dataset = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=128,
    max_query_length=256,
    is_training=False,
    return_dataset="pt",
    threads=4,  # number of CPU cores to use
)

I also ran this command, but got the same error:

python run_squad.py \
    --model_type 'roberta-base-squad2-covid' \
    --model_name_or_path 'deepset/roberta-base-squad2-covid' \
    --output_dir models/bert/ \
    --overwrite_output_dir \
    --overwrite_cache \
    --do_lower_case \
    --do_eval \
    --predict_file dev-v2.0.json \
    --per_gpu_train_batch_size 2 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --threads 10 \
    --save_steps 5000

The error is the same with all of these methods.

Is it a format issue?

@shainaraza
The issue you faced is one that I also faced and found a solution for. Please note, however, that I am using the QA training script from here.

The issue seems to be with how the dataset is being read (code). With field="data", each article (a title plus its nested paragraphs) becomes one record, which is why you get features ['title', 'paragraphs'] instead of flat examples. Specifically, try modifying it from this

        raw_datasets = load_dataset(
            extension,
            data_files=data_files,
            field="data",
            cache_dir=model_args.cache_dir,
            use_auth_token=True if model_args.use_auth_token else None,
        )

to

        raw_datasets = load_dataset(
            extension,
            data_files=data_files,
        )
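As an aside on the JSONDecodeError above: a JSON Lines file contains one JSON document per line, so calling json.load or json.loads on the whole file raises exactly that "Extra data" error. A minimal sketch for reading the converted file back line by line (reusing the dev.json filename from the conversion above):

import json

# a JSON Lines file holds one JSON object per line, so parse each line separately
with open("dev.json", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(examples[0]["question"])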

This solution worked for me, but note that I also made sure my data was formatted exactly the same as the reference dataset here:

from datasets import load_dataset
squad = load_dataset("squad_v2")
squad['train']
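For reference, inspecting squad['train'] should print something like this (the row count comes from the public squad_v2 train split):

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})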