Hey @ivanlau, I think your idea to apply ClassLabel.str2int() is the simplest approach, e.g.
from datasets import load_dataset, ClassLabel, Features
dset = load_dataset("amazon_reviews_multi", "all_languages", split="test")
# Create ClassLabel feature
langs = dset.unique("language")
lang_feature = ClassLabel(names=langs)
# Update default features (work on a copy so dset's own schema isn't mutated in place)
features = dset.features.copy()
features["language"] = lang_feature
# Update dataset
dset_with_classlabel = dset.map(lambda x: {"language": lang_feature.str2int(x["language"])}, features=features)
dset_with_classlabel.features
# {'language': ClassLabel(num_classes=6, names=['de', 'en', 'es', 'fr', 'ja', 'zh'], names_file=None, id=None),
# 'product_category': Value(dtype='string', id=None),
# 'product_id': Value(dtype='string', id=None),
# 'review_body': Value(dtype='string', id=None),
# 'review_id': Value(dtype='string', id=None),
# 'review_title': Value(dtype='string', id=None),
# 'reviewer_id': Value(dtype='string', id=None),
# 'stars': Value(dtype='int32', id=None)}
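In case it's useful, here is a rough pure-Python sketch of what the str2int mapping in the map call is doing conceptually. Note that SimpleClassLabel and the toy rows list are made up for illustration; this is not how datasets implements ClassLabel, just a stand-in with the same str2int / int2str idea. It also sorts the label names, on the assumption that you want the ids to be stable regardless of row order.

```python
# Hypothetical stand-in for ClassLabel (NOT the datasets implementation):
# it maps label names to integer ids and back.
class SimpleClassLabel:
    def __init__(self, names):
        self.names = list(names)
        # The position of a name in the list is its integer id
        self._str2int = {name: idx for idx, name in enumerate(self.names)}

    def str2int(self, value):
        return self._str2int[value]

    def int2str(self, idx):
        return self.names[idx]


# Toy rows standing in for the "language" column (made-up data)
rows = [{"language": "en"}, {"language": "de"}, {"language": "ja"}, {"language": "en"}]

# Sort the unique labels so the ids don't depend on the order of the rows
langs = sorted({row["language"] for row in rows})
lang_feature = SimpleClassLabel(langs)

# Same shape as the map call: replace each string label with its integer id
encoded = [{**row, "language": lang_feature.str2int(row["language"])} for row in rows]
print(encoded)
```

The real ClassLabel also gives you int2str() to go back from ids to names, which is handy when inspecting predictions.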
Alternatively, you can provide the features dictionary when you load the dataset with load_dataset, via its features argument. Hope that helps!