Numpy.str_ error during training phase

Hi, I was using the Amazon reviews dataset to try to build a small language detector, but I bumped into a numpy.str_ error during the training phase. You can view my Colab notebook here: Google Colab.

I was using the review_body field as the text and the language field as the label, and dropped the other fields.

I found that the ‘language’ data field is of type datasets.Value, not datasets.ClassLabel. I guess this is what causes the numpy.str_ error during training.

Question: how do I convert a datasets.Value to a datasets.ClassLabel? One way I can think of is doing the string-to-int conversion inside the preprocess_function/tokenize method, but I'm curious whether there is an existing conversion method for this.
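Roughly what I had in mind (a rough sketch only; the tokenizer checkpoint and the hard-coded label mapping are just placeholders, not exactly what is in my notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# build the string-to-int mapping by hand from the label values
label2id = {"de": 0, "en": 1, "es": 2, "fr": 3, "ja": 4, "zh": 5}

def preprocess_function(examples):
    batch = tokenizer(examples["review_body"], truncation=True)
    # convert the string labels to integer ids manually
    batch["label"] = [label2id[lang] for lang in examples["language"]]
    return batch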

Thanks

Hey @ivanlau, I think your idea to apply ClassLabel.str2int() is the simplest approach, e.g.

from datasets import load_dataset, ClassLabel, Features

dset = load_dataset("amazon_reviews_multi", "all_languages", split="test")
# Create ClassLabel feature
langs = dset.unique("language")
lang_feature = ClassLabel(names=langs)
# Update default features
features = dset.features
features["language"] = lang_feature
# Update dataset
dset_with_classlabel = dset.map(lambda x : {"language": lang_feature.str2int(x["language"])}, features=features)

dset_with_classlabel.features
# {'language': ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], names_file=None, id=None),
#  'product_category': Value(dtype='string', id=None),
#  'product_id': Value(dtype='string', id=None),
#  'review_body': Value(dtype='string', id=None),
#  'review_id': Value(dtype='string', id=None),
#  'review_title': Value(dtype='string', id=None),
#  'reviewer_id': Value(dtype='string', id=None),
#  'stars': Value(dtype='int32', id=None)}

Alternatively, you can provide the features dictionary when you load the dataset with load_dataset. Hope that helps!
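In case it's useful, here is a minimal sketch of that second option. The feature schema below is assumed to match the fields printed above, and exact behaviour may depend on your datasets version:

from datasets import load_dataset, ClassLabel, Features, Value

# define the full schema, with "language" as a ClassLabel instead of a string
features = Features({
    "review_id": Value("string"),
    "product_id": Value("string"),
    "reviewer_id": Value("string"),
    "stars": Value("int32"),
    "review_body": Value("string"),
    "review_title": Value("string"),
    "language": ClassLabel(names=["de", "en", "es", "fr", "ja", "zh"]),
    "product_category": Value("string"),
})

# pass the schema at load time so the labels are encoded as integers directly
dset = load_dataset("amazon_reviews_multi", "all_languages", split="test", features=features)
dset.features["language"]
# ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], id=None)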


I see… I didn't know ClassLabel had such a method. I will try it then.
Thanks for the pointer. :slight_smile: