Convert dataframe to NER dataset format

Hi. I am trying to convert a dataframe to the NER format I have seen in an example notebook. See below for an example of the format:
image

I have a dataframe which looks like:
image

The ner_tags column has dtype object.

If I convert this dataframe to a datasets format by using Dataset.from_pandas(data) I get:
image

As you can see the format of the ‘ner_tags’ is not the same as in the NER dataset example (tokens are good). I am struggling to get this in the same format for the ner_tags. Can someone make a suggestion on this?

Hi! It looks like the ner_tags column in your dataframe contains a list with a single string instead of a list of integers. I guess you are loading your dataset from a CSV file? When you have nested data, I would recommend using JSON Lines instead of CSV.
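To illustrate why CSV tends to cause this, here is a minimal sketch using only the Python standard library. The row contents are made up for illustration and are not taken from the original dataframe. It shows that a list written to CSV comes back as a plain string, while the same row written as one JSON object per line keeps its nested list of integers.

```python
import csv
import io
import json

# Hypothetical example row; the tokens and tag ids are invented for illustration.
row = {"tokens": ["John", "lives", "in", "Paris"], "ner_tags": [1, 0, 0, 5]}

# CSV round trip: the list is written via str() and read back as a string.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["tokens", "ner_tags"])
writer.writeheader()
writer.writerow(row)
csv_buffer.seek(0)
csv_row = next(csv.DictReader(csv_buffer))

# JSON Lines round trip: one JSON object per line, nesting and types preserved.
jsonl_line = json.dumps(row)
jsonl_row = json.loads(jsonl_line)

print(type(csv_row["ner_tags"]))    # <class 'str'>
print(type(jsonl_row["ner_tags"]))  # <class 'list'>
```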

Anyway, to fix this you can process the ner_tags column to get lists of integers this way:

dataset = dataset.map(lambda x: {"ner_tags": [int(i) for i in x["ner_tags"].split(",")]})
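The one-liner above calls .split(",") directly, so it assumes each ner_tags value is a single comma-separated string such as "0,0,1". If the value is instead a list wrapping that string, .split would fail because lists have no such method. Since the actual column contents are only visible in the screenshots, the following is a hedged sketch of a more tolerant parser; the helper name parse_ner_tags is hypothetical and the accepted input shapes are assumptions.

```python
def parse_ner_tags(value):
    """Turn "0,0,1", "[0, 0, 1]" or ["0,0,1"] into [0, 0, 1]."""
    # Unwrap a list that holds a single comma-separated string.
    if isinstance(value, list) and len(value) == 1 and isinstance(value[0], str):
        value = value[0]
    if isinstance(value, str):
        # Drop optional surrounding whitespace and brackets, then split on commas.
        value = value.strip().strip("[]")
        return [int(tag) for tag in value.split(",") if tag.strip()]
    # Otherwise assume an iterable of numbers or numeric strings.
    return [int(tag) for tag in value]


print(parse_ner_tags("0,0,1"))
print(parse_ner_tags(["0, 3, 4"]))
```

With a datasets Dataset, this could be applied as dataset.map(lambda x: {"ner_tags": parse_ner_tags(x["ner_tags"])}).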