Creating QA Data Set for Distilbert

carterrees · February 16, 2023, 9:59am

I am trying to create a custom QA dataset to fine-tune DistilBERT. However, I can’t get past a type error. I’ve compared my dataset to the SQuAD dataset format, and the only difference I can see is that my answer text is wrapped in quotes that are not in the original CSV. They appear after I load the dataset. How do I create a QA dataset in CSV format so I can use it for fine-tuning?

I load the data using:

    from datasets import load_dataset
    ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')

This is the error I get when I run:

```python
tokenized_ds = ds.map(preprocess_function, batched=True, remove_columns=ds["train"].column_names)
```

Cell In[41], line 19, in preprocess_function(examples)
     17 for i, offset in enumerate(offset_mapping):
     18     answer = answers[i]
---> 19     start_char = answer["answer_start"][0]
     20     end_char = answer["answer_start"][0] + len(answer["text"][0])
     21     sequence_ids = inputs.sequence_ids(i)

TypeError: string indices must be integers

carterrees · February 17, 2023, 8:35pm

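Figured this out. When you build a tabular dataset in Excel and export it to CSV, Excel wraps some text fields in double quotes so they parse correctly. That is a problem when importing the data: the TypeError means `answer` is a plain string rather than a dictionary, so `answer["answer_start"]` fails.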
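For anyone hitting the same error, here is a minimal sketch of one way to work around it. It assumes the CSV has an `answers` column holding a stringified SQuAD-style dict such as `{'text': ['Ada'], 'answer_start': [26]}`. The column names, the sample row, and the `parse_answers` helper are all hypothetical and not taken from the original dataset. The example uses only the standard library to show the round trip: the CSV writer quotes the field because it contains commas, the reader hands it back as a string, and `ast.literal_eval` turns it into a dict again.

```python
import ast
import csv
import io


def parse_answers(example):
    """Return a copy of `example` whose "answers" value is a dict of lists.

    Hypothetical helper: assumes the cell holds a Python-literal dict
    with "text" and "answer_start" keys, as in the SQuAD layout.
    """
    raw = example["answers"]
    if isinstance(raw, str):
        # literal_eval only evaluates Python literals, not arbitrary code.
        raw = ast.literal_eval(raw)
    return {
        **example,
        "answers": {
            "text": [str(t) for t in raw["text"]],
            "answer_start": [int(s) for s in raw["answer_start"]],
        },
    }


# Build a tiny in-memory CSV (made-up sample data).
answers = {"text": ["Ada"], "answer_start": [26]}
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "question", "context", "answers"])
writer.writerow(["1", "Who wrote it?", "The report was written by Ada.", str(answers)])

# Read it back: every cell, including "answers", is now a plain string.
buf.seek(0)
rows = list(csv.DictReader(buf))
print(type(rows[0]["answers"]))

# After parsing, the indexing used in preprocess_function works.
parsed = parse_answers(rows[0])
start_char = parsed["answers"]["answer_start"][0]
end_char = start_char + len(parsed["answers"]["text"][0])
print(parsed["context"][start_char:end_char])
```

Because `parse_answers` takes one example dict and returns one, a function like it could probably be applied with `ds.map(parse_answers)` before the tokenization step, though that part is untested here. Storing the data as JSON Lines instead of CSV may avoid the issue altogether, since nested fields are kept as structures rather than strings.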
