Class Labels for Custom Datasets

I'm currently trying to prepare my data from a .csv file for a multi-class classification task with 6 classes, whose labels are strings.

{'Coded_Text': Value(dtype='string', id=None),
 'Coded_Text_Length': Value(dtype='int64', id=None),
 'Label': Value(dtype='string', id=None)}

What is the most appropriate way to map these strings into ClassLabel objects?
I have attempted the solution in "How to convert string labels into ClassLabel classes for custom set in pandas", but I encountered the following error:

ArrowInvalid: ("Could not convert 'training' with type str: tried to convert to int64", 'Conversion failed for column Label with type object')

Hi! You are getting this error most likely because the label training is not specified as a label in the names list of the ClassLabel feature. To avoid this error, I suggest you use class_encode_column instead, which will automatically find all the unique string values in the column:

from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.class_encode_column("Label")
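
To make the idea behind class_encode_column concrete, here is a minimal plain-Python sketch of the mapping it builds: collect the unique strings in the column, fix an order, and replace each string with its index. The helper names and the sample labels are hypothetical, and the alphabetical ordering is an assumption about how the names are arranged, so check dataset.features["Label"].names on your own data.

```python
# Conceptual sketch only: this mimics the string -> integer id mapping,
# it is not the datasets implementation.

def build_label_mapping(labels):
    """Return (names, label2id) built from the unique strings in `labels`."""
    # Assumption: names are ordered alphabetically.
    names = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(names)}
    return names, label2id


def encode_labels(labels, label2id):
    """Replace every string label with its integer id."""
    return [label2id[label] for label in labels]


# Hypothetical label column; only 'training' appears in the original error.
raw_labels = ["training", "policy", "training", "budget"]

names, label2id = build_label_mapping(raw_labels)
encoded = encode_labels(raw_labels, label2id)
```

Because the names list is derived from the data itself, no value in the column can be missing from it, which is what avoids the conversion error above.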

Thanks! That worked and was a lot cleaner than my alternative solution.

from datasets import ClassLabel

# Creating a ClassLabel Object
df = dataset["train"].to_pandas()
labels = df['label'].unique().tolist()
ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

# Mapping Labels to IDs
def map_label2id(example):
    example['label'] = ClassLabels.str2int(example['label'])
    return example

dataset = dataset.map(map_label2id, batched=True)

# Casting label column to ClassLabel Object
dataset = dataset.cast_column('label', ClassLabels)

Hi,
I'm experimenting with the emotion dataset from manually downloaded files.

from datasets import ClassLabel, load_dataset
dataset = load_dataset('csv', data_files={'train': 'train.txt', 'validation': 'val.txt', 'test': 'test.txt'}, sep=";",
                       names=["text", "label"])
# cast_column returns a new dataset, so the result is assigned back
dataset = dataset.cast_column("label", ClassLabel(names=['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']))

I'm getting an "ArrowInvalid: Failed to parse string: 'anger' as a scalar of type int64" error, even after specifying anger as a label in the names list.

Thanks for suggesting class_encode_column; it worked.

Any idea why cast_column isn't working in this case?

Hi! Currently, only integer values can be cast to the ClassLabel type, hence the error. We've recently added support for casting from string values, which will be available in the next release of datasets (for now it is only available on master if you want to try it).