Hello everybody
I would like to fine-tune a custom QA bot that will work on Italian texts (I was thinking about using the model 'dbmdz/bert-base-italian-cased') in a very specific field (medical reports). I already followed this guide and fine-tuned an English model using the default train and dev files.
The problem is that now I’m trying to use my own files (formatted as SQuAD 2.0), but I’m not able to perform the same operations.
This is my code:
datasets = load_dataset('json', data_files='/content/SQuAD_it-train.json', field='data')
Instead of getting something like this…
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})
…I get this:
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
I tried the same command with the train-v2.0.json file downloaded from the official SQuAD website…
datasets = load_dataset('json', data_files='/content/train-v2.0.json', field='data')
…and this is what I got:
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
So I’m assuming that this is not related to the file format, but maybe to some parameter of the load_dataset function?
Thanks a lot for your attention
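To make the difference concrete: as far as I can tell, the raw file is nested (each entry under 'data' is an article with a 'title' and a list of 'paragraphs', and each paragraph holds a 'context' and a list of 'qas'), which would explain why I get 442 rows with only 'title' and 'paragraphs'. A workaround I am considering is to flatten the file myself into one row per question. This is only a rough sketch under that assumption about the layout; the function name flatten_squad and the tiny example record are made up for illustration.

```python
def flatten_squad(squad):
    """Turn nested SQuAD-style JSON into one flat record per question.

    Assumes the layout data -> paragraphs -> qas -> answers.
    """
    rows = []
    for article in squad["data"]:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable SQuAD 2.0 questions have an empty answers list.
                answers = qa.get("answers", [])
                rows.append({
                    "id": qa["id"],
                    "title": title,
                    "context": context,
                    "question": qa["question"],
                    "answers": {
                        "text": [a["text"] for a in answers],
                        "answer_start": [a["answer_start"] for a in answers],
                    },
                })
    return rows


# Hypothetical in-memory example with one answerable and one unanswerable question.
example = {
    "version": "v2.0",
    "data": [{
        "title": "Referto",
        "paragraphs": [{
            "context": "Il paziente presenta febbre da tre giorni.",
            "qas": [
                {"id": "q1", "question": "Cosa presenta il paziente?",
                 "answers": [{"text": "febbre", "answer_start": 21}],
                 "is_impossible": False},
                {"id": "q2", "question": "Quale farmaco assume?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

rows = flatten_squad(example)
print(len(rows))           # 2
print(sorted(rows[0]))     # ['answers', 'context', 'id', 'question', 'title']
```

If this is the right idea, I suppose the real file could be read with json.load, flattened the same way, and the resulting list passed to Dataset.from_list (in recent versions of the datasets library) or written out as JSON lines and loaded with load_dataset('json', ...), but I have not verified this on my data.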
Claudio