Numpy.str_ error during training phase

Hey @ivanlau, I think your idea to apply ClassLabel.str2int() is the simplest approach. For example:

from datasets import load_dataset, ClassLabel, Features

dset = load_dataset("amazon_reviews_multi", "all_languages", split="test")
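# Create ClassLabel feature (sort the names so the label ids are deterministic)
langs = sorted(dset.unique("language"))
lang_feature = ClassLabel(names=langs)
# Copy the default features so the original dataset's schema is not mutated in place
features = dset.features.copy()
features["language"] = lang_feature
# Update dataset: convert each language string to its integer class id
dset_with_classlabel = dset.map(lambda x: {"language": lang_feature.str2int(x["language"])}, features=features)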

dset_with_classlabel.features
# {'language': ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], names_file=None, id=None),
#  'product_category': Value(dtype='string', id=None),
#  'product_id': Value(dtype='string', id=None),
#  'review_body': Value(dtype='string', id=None),
#  'review_id': Value(dtype='string', id=None),
#  'review_title': Value(dtype='string', id=None),
#  'reviewer_id': Value(dtype='string', id=None),
#  'stars': Value(dtype='int32', id=None)}

Alternatively, you can define a Features object with the ClassLabel already in place and pass it through the features argument of load_dataset when you load the dataset. Hope that helps!
