I am trying to create a custom QA dataset to fine-tune DistilBERT, but I can't get past a type error. I've compared my dataset to the SQuAD dataset format, and the only difference I can see is that my answer text is wrapped in quotes that are not in the original CSV. They appear after I load the dataset. How do I create a QA dataset in CSV format so that I can use it for fine-tuning?
I load the data using `from datasets import load_dataset` followed by `ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')`.
This is the error I get when I run `tokenized_ds = ds.map(preprocess_function, batched=True, remove_columns=ds["train"].column_names)`:
Cell In[41], line 19, in preprocess_function(examples)
17 for i, offset in enumerate(offset_mapping):
18 answer = answers[i]
---> 19 start_char = answer["answer_start"][0]
20 end_char = answer["answer_start"][0] + len(answer["text"][0])
21 sequence_ids = inputs.sequence_ids(i)
TypeError: string indices must be integers
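
One possible reading of the traceback, sketched under an assumption: if the `answers` column in the CSV holds a stringified dict such as `{'text': [...], 'answer_start': [...]}`, the CSV loader would hand each cell back as a plain string, and `answer["answer_start"]` would then index a string rather than a dict. The snippet below reproduces that situation in plain Python and parses the cell back into a dict with `ast.literal_eval`. The sample cell and the helper name `parse_answers` are hypothetical and not taken from my actual dataset.

```python
import ast


def parse_answers(cell):
    """Turn a stringified SQuAD-style answers cell back into a dict.

    Assumption: the cell is a Python-literal dict string, which is what a
    dict typically looks like after being written to and read from a CSV.
    """
    if isinstance(cell, dict):
        # Already structured, nothing to parse.
        return cell
    return ast.literal_eval(cell)


# Hypothetical cell value as it might come back from the CSV loader.
raw = "{'text': ['Paris'], 'answer_start': [23]}"

# raw["answer_start"] would raise the same TypeError as in the traceback,
# because raw is a str and string indices must be integers.
answer = parse_answers(raw)
start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])
```

If this assumption holds, the same helper might be applied before tokenization, for example with `ds.map(lambda ex: {"answers": parse_answers(ex["answers"])})`, where the column name `answers` is itself an assumption. Storing the data as JSON instead of CSV may avoid the round trip through strings altogether.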