Class Labels for Custom Datasets

I’m preparing data from a .csv file for a multi-class classification task. There are 6 classes, and the labels are stored as strings.

{'Coded_Text': Value(dtype='string', id=None),
 'Coded_Text_Length': Value(dtype='int64', id=None),
 'Label': Value(dtype='string', id=None)}

What is the most appropriate way to map these strings into ClassLabel objects?
I tried the solution from the thread “How to convert string labels into ClassLabel classes for custom set in pandas”, but I ran into the following error:

ArrowInvalid: ("Could not convert 'training' with type str: tried to convert to int64", 'Conversion failed for column Label with type object')

Hi! You are getting this error most likely because the label training is not specified as a label in the names list of the ClassLabel feature. To avoid this error, I suggest you use class_encode_column instead, which will automatically find all the unique string values in the column:

from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.class_encode_column("Label")

Thanks! That worked and was a lot cleaner than my alternative solution.

from datasets import ClassLabel

# Creating a ClassLabel Object from the unique labels in the train split
df = dataset["train"].to_pandas()
labels = df['label'].unique().tolist()
ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

# Mapping Labels to IDs
def map_label2id(example):
    example['label'] = ClassLabels.str2int(example['label'])
    return example

dataset = dataset.map(map_label2id, batched=True)

# Casting label column to ClassLabel Object
dataset = dataset.cast_column('label', ClassLabels)

Hi,
I’m experimenting with the emotion dataset from manually downloaded files.

from datasets import load_dataset, ClassLabel
dataset = load_dataset('csv', data_files={'train': 'train.txt', 'validation': 'val.txt', 'test': 'test.txt'}, sep=";",
                       names=["text", "label"])
# cast_column returns a new dataset, so the result has to be assigned
dataset = dataset.cast_column("label", ClassLabel(names=['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']))

I’m getting an “ArrowInvalid: Failed to parse string: ‘anger’ as a scalar of type int64” error, even after specifying anger as a label in the names list.

Thanks for suggesting class_encode_column; it worked.

Any idea why cast_column isn’t working in this case?

Hi! Currently, only integer values can be cast to the ClassLabel type, hence the error. However, we’ve recently added support for casting from string values, which will be available in the next release of datasets (for now it is only available on master if you want to try it).