Dataset label format for multi-label text classification


I’ve been able to train a multi-label BERT classifier using a custom Dataset object and the Trainer API from Transformers. The Dataset contains two columns: text and label. After tokenizing, I have all the columns needed for training.

For multi-label classification I also set model.config.problem_type = "multi_label_classification", and define each label as a multi-hot vector (a list of 0/1 values, each corresponding to a different class).

So, for instance, if the task consists of assigning one or more topics (sports, politics, culture) to text articles, I end up with labels like [0, 1, 0] (politics) or [1, 1, 0] (sports & politics). This works as expected and I managed to train the model.
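To make the convention concrete, the encoding above can be sketched like this (a minimal sketch; TOPICS and to_multi_hot are illustrative names of my own, not part of Transformers or Datasets):

```python
TOPICS = ["sports", "politics", "culture"]

def to_multi_hot(indices, num_classes=len(TOPICS)):
    # Floats, not ints: BCEWithLogitsLoss (used for multi_label_classification)
    # expects float targets of the same shape as the logits.
    chosen = set(indices)
    return [1.0 if i in chosen else 0.0 for i in range(num_classes)]

print(to_multi_hot([1]))     # politics only
print(to_multi_hot([0, 1]))  # sports & politics
```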

Now I’d like to define the label column as a Sequence of ClassLabel items, and keep labels like [0, 2] (sports and culture), [0] (sports), [0, 1, 2] (sports, politics, culture)… If I structure the dataset like that, I get the following error when trying to train:

ValueError: Target size (torch.Size([16, 2])) must be the same as input size (torch.Size([16, 3]))

Is it necessary to define the label column as a multi-hot encoded vector in order to train a multi-label classifier, or can I use ClassLabel objects as in a single-label multi-class task?

Thank you in advance.


Defining the labels as one-hot vectors worked for me; the other thing I did was cast the column as a Sequence of ClassLabel!

Make sure your one-hot encoded labels are of float datatype.

If you are using pandas dataframes, here’s a short snippet!

# Map each category to an index, then one-hot encode it as floats
# (the multi-label loss expects float targets).
label_enum = {k: j for j, k in enumerate(df['label_categories'].unique())}
num_labels = len(label_enum)
df['labels'] = df['label_categories'].apply(
    lambda x: [1.0 if label_enum[x] == i else 0.0 for i in range(num_labels)]
)

Thanks @dhruvmetha.

Defining labels as “one”-hot vectors ([1, 1, 0, 0, 1]) also works for me, but I’d like to avoid it and define them as a Sequence(feature=ClassLabel(…)). I can’t manage to train the model that way.

When I train a basic single-label multi-class model with labels as ClassLabel it works, but when there are multiple possible labels per entry, I’m not sure whether I can use ClassLabel or whether I’m supposed to use raw “one”-hot vectors as labels.

For instance, take the GoEmotions-simplified dataset as a reference. Its labels are lists of ClassLabel elements, but then I can’t train the model as is; I have to convert those labels to multi-hot encoded vectors. Am I doing something wrong? Or is this the expected behaviour, and am I required to convert the dataset to multi-hot vectors?

Maybe this thread can help you.
A notebook from that thread that helped me figure this out


That notebook was helpful! It does basically the same as I do, encoding the labels as multi-hot vectors. So I guess that is the way to go.

Thank you!


Hello, do you know of any example notebooks for training on a custom dataset for multi-label classification?

Hi @zbeloki,

I read your posts in this thread with attention as I have the same issue:

How can I use a Dataset with the input text as a string and the labels as Sequence(feature=ClassLabel(…)), like GoEmotions-simplified, to train a multi-label classification model, and that WITHOUT converting my label vectors ([4, 23, 1999] for example) to “one”-hot vectors?

Why? Because I have 2,000 labels and 1,000,000 rows in my Dataset…

It’s a shame to do all these conversions to one-hot vectors.

Did you find another solution?

cc @lewtun @nielsr

Hi @pierreguillou, did you find a way to avoid converting the label vectors to one-hot encoded vectors?

No. I did not find another solution :frowning:

What about the solution provided here: Multi-class Using Dataset - #2 by lhoestq