Numpy.str_ error during training phase

Hi, I was using the Amazon reviews dataset to try to build a small language detector, but I bumped into a numpy.str_ error during the training phase. You can view my Colab notebook here: Google Colab.

I was using the review_body field as the text and the language field as the label, and dropped the other fields.

I found that the ‘language’ data field is of type datasets.Value, not datasets.ClassLabel. I guess this is what causes the numpy.str_ error during training.

Question: how do I convert a datasets.Value to a datasets.ClassLabel? One way I can think of is doing the string-to-int conversion inside the preprocess_function/tokenize method, but I'm curious whether there is an existing conversion method for this.
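Roughly what I had in mind (a rough sketch only; the tokenizer checkpoint and the hard-coded label mapping are just placeholders, not exactly what is in my notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# build the string-to-int mapping by hand from the label values
label2id = {"de": 0, "en": 1, "es": 2, "fr": 3, "ja": 4, "zh": 5}

def preprocess_function(examples):
    batch = tokenizer(examples["review_body"], truncation=True)
    # convert the string labels to integer ids manually
    batch["label"] = [label2id[lang] for lang in examples["language"]]
    return batch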

Thanks

Hey @ivanlau, I think your idea to apply ClassLabel.str2int() is the simplest approach, e.g.

from datasets import load_dataset, ClassLabel, Features

dset = load_dataset("amazon_reviews_multi", "all_languages", split="test")
# Create ClassLabel feature
langs = dset.unique("language")
lang_feature = ClassLabel(names=langs)
# Update default features
features = dset.features
features["language"] = lang_feature
# Update dataset
dset_with_classlabel = dset.map(lambda x : {"language": lang_feature.str2int(x["language"])}, features=features)

dset_with_classlabel.features
# {'language': ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], names_file=None, id=None),
#  'product_category': Value(dtype='string', id=None),
#  'product_id': Value(dtype='string', id=None),
#  'review_body': Value(dtype='string', id=None),
#  'review_id': Value(dtype='string', id=None),
#  'review_title': Value(dtype='string', id=None),
#  'reviewer_id': Value(dtype='string', id=None),
#  'stars': Value(dtype='int32', id=None)}

Alternatively, you can provide the features dictionary when you load the dataset with load_dataset. Hope that helps!
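In case it's useful, here is a minimal sketch of that second option. The feature schema below is assumed to match the fields printed above, and exact behaviour may depend on your datasets version:

from datasets import load_dataset, ClassLabel, Features, Value

# define the full schema, with "language" as a ClassLabel instead of a string
features = Features({
    "review_id": Value("string"),
    "product_id": Value("string"),
    "reviewer_id": Value("string"),
    "stars": Value("int32"),
    "review_body": Value("string"),
    "review_title": Value("string"),
    "language": ClassLabel(names=["de", "en", "es", "fr", "ja", "zh"]),
    "product_category": Value("string"),
})

# pass the schema at load time so the labels are encoded as integers directly
dset = load_dataset("amazon_reviews_multi", "all_languages", split="test", features=features)
dset.features["language"]
# ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], id=None)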


I see… I didn't know ClassLabel had such a method. I will try it then.
Thanks for the pointer. :slight_smile: