I am trying to create a custom QA dataset to fine-tune DistilBERT, but I can't get past a type error. I've compared my dataset to the SQuAD dataset format, and the only difference I can see is that my answer text is wrapped in quotes that are not in the original CSV; they appear only after I load the dataset. How do I create a QA dataset in CSV format so that I can use it for fine-tuning?
I load the data using:
```
from datasets import load_dataset

ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')
```
This is the error I get when I run:
```
tokenized_ds = ds.map(preprocess_function, batched=True, remove_columns=ds["train"].column_names)
```
```
Cell In, line 19, in preprocess_function(examples)
     17 for i, offset in enumerate(offset_mapping):
     18     answer = answers[i]
---> 19     start_char = answer["answer_start"]
     20     end_char = answer["answer_start"] + len(answer["text"])
     21     sequence_ids = inputs.sequence_ids(i)

TypeError: string indices must be integers
```
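One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: a CSV cell can only hold text, so if the `answers` column was written as a dict-like value, it may come back as a plain string, and indexing that string with `"answer_start"` would raise this `TypeError`. The example below uses only the standard library to show the effect. The one-row CSV, its column names, and the `{'text': ..., 'answer_start': ...}` layout are assumptions made for illustration; the real file may look different.

```python
import ast
import csv
import io

# Hypothetical one-row CSV; the real file's columns and contents may differ.
csv_text = '''context,question,answers
"Paris is the capital of France.","What is the capital of France?","{'text': 'Paris', 'answer_start': 0}"
'''

row = next(csv.DictReader(io.StringIO(csv_text)))
raw = row["answers"]
print(type(raw).__name__)  # the cell is read back as text, not as a dict

# Indexing the string with a key reproduces the kind of error in the traceback.
try:
    raw["answer_start"]
except TypeError as exc:
    print("TypeError:", exc)

# Parsing the string as a Python literal restores the dict (assumes the cell
# holds a valid Python literal; json.loads would fit a JSON-formatted cell).
answer = ast.literal_eval(raw)
start_char = answer["answer_start"]
end_char = start_char + len(answer["text"])
print(row["context"][start_char:end_char])
```

If this matches the situation, the same parsing step could be applied to the `answers` column before `preprocess_function` runs, or the data could be stored in a format that preserves nested fields, such as JSON, instead of CSV.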