How do I set the feature type when loading a dataset (ClassLabel, etc.)?

I am loading my dataset from a local file, and I’m getting the error “TypeError: new(): invalid data type ‘numpy.str_’”, which I believe is due to the features not being defined.

It’s mentioned here, and one solution is to pass a features dictionary when loading, but I am having trouble with the format.

I’ve tried things like:

emotions = load_dataset("csv", data_files="train.txt", sep=";",
                        names=["text", "label"],
                        features={'text': datasets.Value(dtype='int32', id=None),
                                  'label': datasets.ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None)})


load_dataset("csv", data_files="train.txt", sep=";",
             names=["text", "label"],
             features={'text': 'str',
                       'label': ['not_equivalent', 'equivalent']})

Without success.

I’m trying to follow the documentation here but can’t seem to figure it out. Is there an example of how to do this somewhere? Thanks!

Hi! This should work:

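import datasets
# Declare the column types explicitly: a string column and a two-class label column.
features = datasets.Features({"text": datasets.Value("string"), "label": datasets.ClassLabel(names=['not_equivalent', 'equivalent'])})
# Pass the Features object (not a plain dict of strings) to load_dataset.
dset = datasets.load_dataset("csv", data_files="train.txt", sep=";", names=["text", "label"], features=features)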

Also note that the label column in your CSV file has to contain integer labels (0 and 1), not the strings (not_equivalent and equivalent); otherwise you’ll get an error.
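
If your file currently stores the label strings, one possible way to prepare it is to rewrite the label column as integers before loading. The snippet below is only a sketch using the standard library: the helper name encode_labels and the output file name are made up for illustration, and it assumes a two-column, semicolon-separated file with no header, as in the question. The integer ids follow the order of the names list passed to ClassLabel above.

```python
import csv

# Same order as the ClassLabel names in the answer: index 0, then index 1.
LABEL_NAMES = ["not_equivalent", "equivalent"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def encode_labels(src_path, dst_path, sep=";"):
    """Copy a (text, label) file, replacing each label string with its integer id.

    Hypothetical helper, not part of the datasets library.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=sep)
        writer = csv.writer(dst, delimiter=sep)
        for row in reader:
            if not row:
                continue  # skip blank lines
            text, label = row
            # Raises KeyError if a label is not one of LABEL_NAMES.
            writer.writerow([text, LABEL_TO_ID[label.strip()]])
```

After calling something like encode_labels("train.txt", "train_ids.txt"), you would point data_files at the converted file in the load_dataset call above.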

Thank you!