Creating a Sequence of ClassLabel for multi-label and multi-class problems

christofkaelin · March 3, 2024, 7:11pm

Given a df with the following columns:

text	labels
This is a sentence	[0, 5, 3]
I am unhappy	[1, 9, 10]

OR

text	label1	label2	label3
This is a sentence	0	5	3
I am unhappy	1	9	10

How would one create a Dataset with the following structure of dataset.features:

features = Features({
    'text': Value(dtype='string', id=None),
    'labels': Sequence({
        'label1': ClassLabel(num_classes=some_number, names=some_names),
        'label2': ClassLabel(num_classes=some_number, names=some_names),
        'label3': ClassLabel(num_classes=some_number, names=some_names)
    })
})

Using from_pandas() with the first df results in a Sequence, but of type int64. With the second df it results in three features of type int64, each of which can be encoded as a ClassLabel using class_encode_column(). But then one would end up with multiple label features, whereas, as I understood from this tutorial, they are required to be in one feature called labels.

Therefore, how can I create a Dataset with a Sequence feature that consists of ClassLabels and can be used for a classifier in multi-class and multi-label scenarios?

I saw this post, but the implementation is unclear to me.

lhoestq · March 4, 2024, 2:09pm

You can use the first format and then cast the column to a Sequence of ClassLabel:

from datasets import ClassLabel, Dataset, Sequence
ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=...)))

christofkaelin · March 4, 2024, 2:37pm

Amazing, thank you!

If we assume this to be the complete dataframe (same as above):

text	labels
This is a sentence	[0, 5, 3]
I am unhappy	[1, 9, 10]

The names= to be added would be [0, 5, 3, 1, 9, 10]:

ds = Dataset.from_pandas(df)
ds = ds.cast_column("labels", Sequence(ClassLabel(names=[0, 5, 3, 1, 9, 10])))

However, for label1 (the first value in each list) only the labels 0 and 1 would be valid (for label2 it'd be 5 and 9, and so on).
Will this lead to any problems further on? E.g., could it mistakenly predict a 5 for label1?

lhoestq · March 4, 2024, 4:50pm

In this case it seems you have multiple classifications to make. Ideally each one should have its own column (a single integer if there is only one possible label, or a list if there are multiple possible labels per column).

Zahra99 · March 23, 2024, 7:27pm

Hi @lhoestq ,

I have a dataset formatted for multi-class classification with 4 classes. My objective is to convert the labels to one-hot encodings and create a Hugging Face dataset. To achieve this, I used scikit-learn's OneHotEncoder for the encoding. However, I encountered an error when attempting to convert the dataset to the Hugging Face format using the following line of code:

features = Features({"text": Value("string"), "label": Sequence(ClassLabel(names=['ang', 'hap', 'neu', 'sad']))})

The error message I received is as follows:

TypeError: Couldn't cast array of type
string
to
Sequence(feature=ClassLabel(names=['ang', 'hap', 'neu', 'sad'], id=None), length=-1, id=None)

Could you please provide guidance on how to resolve this issue?

Thank you!

lhoestq · March 26, 2024, 8:56pm

Hi! ClassLabel is for columns of integers. To transform your strings to integers you can use .map() (or the .class_encode_column() method, which infers the names automatically).
