Creating a Sequence of ClassLabel for multi-label and multi-class problems

Given a df with the following columns:

text labels
This is a sentence [0, 5, 3]
I am unhappy [1, 9, 10]

OR

text label1 label2 label3
This is a sentence 0 5 3
I am unhappy 1 9 10

How would one create a Dataset with the following structure of dataset.features:

features = Features({
    'text': Value(dtype='string', id=None),
    'labels': Sequence({
        'label1': ClassLabel(num_classes=some_number, names=some_names),
        'label2': ClassLabel(num_classes=some_number, names=some_names),
        'label3': ClassLabel(num_classes=some_number, names=some_names)
    })
})

Using from_pandas() with the first df results in a Sequence, but of type int64. With the second df it results in three features of type int64, each of which can be encoded as a ClassLabel using class_encode_column(). However, that leaves multiple label features, whereas, as I understood from this tutorial, they are required to be in one feature called labels.

Therefore, how can I create a Dataset with a Sequence feature consisting of ClassLabels that can be used for a classifier in multi-class and multi-label scenarios?

I saw this post, but the implementation is unclear to me.

You can use the first format and then cast the column to a Sequence of ClassLabel:

from datasets import ClassLabel, Dataset, Sequence

ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=...)))

Amazing, thank you!

If we assume this to be the complete dataframe (same as above):

text labels
This is a sentence [0, 5, 3]
I am unhappy [1, 9, 10]

The names= argument to be added would be [0, 5, 3, 1, 9, 10]:

ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=[0, 5, 3, 1, 9, 10])))

However, for label1 (the first value in each list) only the labels 0 and 1 would be valid; for label2 it would be 5 and 9, and so on.
Will this lead to any problems later on? For example, could the model mistakenly predict a 5 for label1?

In this case it seems you have multiple classifications to make. Ideally, each one should have its own column, holding a single integer if only one label is possible, or a list if multiple labels are possible per column.
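
One way to prepare such per-column labels is to give every column its own class list and re-index its raw values against that list, so a value that is valid for one column can never be assigned to another. The following is a plain-Python sketch on the example rows above, not a `datasets` API: the helper name `build_column_encoders` is hypothetical, and the resulting `names` lists are what one would presumably pass to a separate `ClassLabel(names=...)` for each column.

```python
rows = [
    {"text": "This is a sentence", "label1": 0, "label2": 5, "label3": 3},
    {"text": "I am unhappy", "label1": 1, "label2": 9, "label3": 10},
]
label_columns = ["label1", "label2", "label3"]


def build_column_encoders(rows, columns):
    """Return, per column, the sorted class names and a raw-value -> index map."""
    encoders = {}
    for col in columns:
        values = sorted({row[col] for row in rows})
        encoders[col] = {
            "names": [str(v) for v in values],
            "to_index": {v: i for i, v in enumerate(values)},
        }
    return encoders


encoders = build_column_encoders(rows, label_columns)

# Replace every raw value with its index within its own column.
encoded_rows = [
    {
        "text": row["text"],
        **{col: encoders[col]["to_index"][row[col]] for col in label_columns},
    }
    for row in rows
]
```

With this layout, label2 only knows the classes "5" and "9", so a classifier head trained on that column has no output corresponding to a label1 value.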

Hi @lhoestq,

I have a dataset formatted for multi-class classification with 4 classes. My objective is to convert the labels to one-hot encodings and create a Hugging Face dataset. To do this, I used the sklearn OneHotEncoder for encoding. However, I encountered an error when converting the dataset to the Hugging Face format with the following line of code:

features = Features({"text": Value("string"), "label": Sequence(ClassLabel(names=['ang', 'hap', 'neu', 'sad']))})

The error message I received is as follows:

TypeError: Couldn't cast array of type
string
to
Sequence(feature=ClassLabel(names=['ang', 'hap', 'neu', 'sad'], id=None), length=-1, id=None)

Could you please provide guidance on how to resolve this issue?

Thank you!

Hi! ClassLabel is for columns of integers. To transform your strings to integers, you can use .map() (or the .class_encode_column() method, which infers the names automatically).
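
A sketch of the kind of per-example function that could be passed to `.map()`, assuming each row stores its label as one of the four strings from the question. `encode_label` is a hypothetical helper and the second example row is invented; the function is exercised here on plain dictionaries so that it runs without the `datasets` library:

```python
names = ["ang", "hap", "neu", "sad"]
str2int = {name: index for index, name in enumerate(names)}


def encode_label(example):
    """Replace the string label with its integer class index."""
    return {**example, "label": str2int[example["label"]]}


# Toy rows (hypothetical) standing in for dataset examples.
examples = [
    {"text": "I am unhappy", "label": "sad"},
    {"text": "What a day!", "label": "hap"},
]
encoded = [encode_label(example) for example in examples]
```

With a real dataset this would presumably be applied as `ds.map(encode_label)`, after which the integer column can be cast to `ClassLabel(names=names)`. Since each row here has exactly one class, a single ClassLabel rather than a Sequence seems the better fit.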