How to build a multi-label & multi-class dataset correctly?

christofkaelin · March 5, 2024, 3:46pm

I am unsure how to create a Dataset with multiple labels and classes when the classes differ from one label to the next.

A multi-label example is shared here, but the classes are always either 0 or 1. My task is slightly different. To illustrate what I mean, take this dataset:

| text (`X`) | region (`y₁`) | weather (`y₂`) | sentiment (`y₃`) |
|---|---|---|---|
| ‘The Taj Mahal was beautiful, even in monsoon season’ | ‘asia’ | ‘rain’ | ‘positive’ |
| ‘Surprisingly warm in London with blue skies…’ | ‘europe’ | ‘sun’ | ‘positive’ |
| ‘Snowstorms and icey roads in Illinios, will never go there again!!!’ | ‘america’ | ‘snow’ | ‘negative’ |
| ‘I sincerely enjoyed my Safari despite the hot temperatures’ | ‘africa’ | ‘sun’ | ‘positive’ |

text: The input to predict the three outputs.
region: First output to predict, four possible classes 'asia', 'europe', 'america', 'africa'
weather: Second output to predict, three possible classes 'rain', 'sun', 'snow'
sentiment: Third output to predict, two possible classes 'positive', 'negative'

The questions stemming from such a multi-label/multi-class task:

What are the right types of Features? The text should be a Value(dtype='string', id=None) feature, which will then be tokenized.
The outputs are supposed to be saved in a labels feature, illustrated in the tutorials with ClassLabel. So for the region label it should be something like this: ClassLabel(num_classes=4, names=['asia', 'europe', 'america', 'africa'], id=None). But how does it work with three outputs and different names?

I tried looking into this where the labels feature would be of type Sequence (see this post).
However, as @lhoestq pointed out, they should have their own columns. Otherwise, names ends up having all classes in one place, e.g., Sequence(ClassLabel(names=['asia', 'europe', 'america', 'africa', 'rain', 'sun', 'snow', 'positive', 'negative'])) which would be wrong (?).
So now I am unsure how to build my dataset so that all labels and classes can be predicted at once correctly.

More questions arising from this once we start training the model:

Given the above dataset, how would the id2label and label2id dicts be structured?
What would be the correct num_labels passed on to the model? Would it be 3 due to the three columns to predict? Or would it be 9 (4+3+2) due to the different outcomes?
The problem_type should be multi_label_classification, right? (asking because in previous trainings I ran into issues with tensor sizes not matching and some issues with Float and Long)
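
One possible reading of these questions, sketched in plain Python: treat each output as its own single-label classification task, so there is one `id2label`/`label2id` pair and one class count per task rather than a single shared one. The `TASKS` dictionary and the nested layout are assumptions made for illustration; a stock Transformers config holds only one `id2label` mapping, so this structure would have to be consumed by a custom model.

```python
# Hypothetical per-task label definitions, taken from the example dataset.
TASKS = {
    "region": ["asia", "europe", "america", "africa"],
    "weather": ["rain", "sun", "snow"],
    "sentiment": ["positive", "negative"],
}

# One id2label / label2id pair per task instead of one shared mapping.
id2label = {task: dict(enumerate(names)) for task, names in TASKS.items()}
label2id = {
    task: {name: idx for idx, name in enumerate(names)}
    for task, names in TASKS.items()
}

# Each task gets its own number of output classes (4, 3 and 2),
# rather than a single num_labels of 3 or 9.
num_labels = {task: len(names) for task, names in TASKS.items()}

print(id2label["weather"])
print(label2id["region"]["africa"])
print(num_labels)
```

On the Float/Long issue: in Transformers, `problem_type="multi_label_classification"` uses a binary cross-entropy loss that expects float multi-hot labels, while `single_label_classification` uses cross-entropy with integer (Long) class ids. Three mutually exclusive choices per row fit the second case, applied once per column.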

Thanks!

kevinmmckee · April 14, 2025, 6:46pm

Did you ever have any luck with this? I’m currently facing a similar problem.

John6666 · April 15, 2025, 5:52am

This method?

christofkaelin · April 15, 2025, 9:16am

@kevinmmckee No success, sorry!
I ended up predicting each class individually. I know there should be a more elegant way, but I never figured it out.
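
For readers looking for the single-pass variant, a common pattern is one shared encoder with a separate classification head per output, trained on the sum of the per-head cross-entropy losses. The sketch below is a minimal PyTorch illustration of that idea, not a method confirmed in this thread: `MultiHeadClassifier` is a hypothetical class, and the `nn.EmbeddingBag` encoder is only a stand-in for a pretrained transformer.

```python
import torch
from torch import nn

# Number of classes per output column in the example dataset.
TASK_SIZES = {"region": 4, "weather": 3, "sentiment": 2}


class MultiHeadClassifier(nn.Module):
    """Hypothetical model: shared encoder, one linear head per task."""

    def __init__(self, vocab_size, hidden_size, task_sizes):
        super().__init__()
        # Stand-in encoder: averages token embeddings per sequence.
        self.encoder = nn.EmbeddingBag(vocab_size, hidden_size)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_sizes.items()}
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, labels=None):
        pooled = self.encoder(input_ids)  # shape: (batch, hidden_size)
        logits = {task: head(pooled) for task, head in self.heads.items()}
        loss = None
        if labels is not None:
            # Sum of per-task cross-entropy losses; labels are Long class ids.
            loss = sum(self.loss_fn(logits[t], labels[t]) for t in logits)
        return loss, logits


torch.manual_seed(0)
model = MultiHeadClassifier(vocab_size=100, hidden_size=16, task_sizes=TASK_SIZES)

# Two fake tokenized sentences of length 8 and their integer labels.
input_ids = torch.randint(0, 100, (2, 8))
labels = {
    "region": torch.tensor([0, 3]),
    "weather": torch.tensor([0, 1]),
    "sentiment": torch.tensor([0, 1]),
}

loss, logits = model(input_ids, labels)
print({task: tuple(out.shape) for task, out in logits.items()})
```

Each head produces logits sized to its own task, so one forward pass yields all three predictions.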

kevinmmckee · April 18, 2025, 3:51pm

@christofkaelin Dang. I thought this might be the case and was hoping I could achieve this with a single pass across the dataset. Thanks for the reply!
