I’ve been able to train a multi-label BERT classifier using a custom Dataset object and the Trainer API from Transformers. The Dataset contains two columns: text and label. After tokenizing, I have all the columns needed for training.
For multi-label classification I also set model.config.problem_type = "multi_label_classification", and define each label as a multi-hot vector (a list of 0/1 values, each corresponding to a different class).
For instance, if the task consists of assigning one or more topics (sports, politics, culture) to text articles, I end up with labels like [0, 1, 0] (politics) or [1, 1, 0] (sports & politics). This works as expected and I managed to train the model.
Now I’d like to define the label column as a Sequence of ClassLabel items, and keep labels like [0, 2] (sports and culture), [0] (sports), [0, 1, 2] (sports, politics, culture)… If I structure the dataset like that, I get the following error when trying to train:
ValueError: Target size (torch.Size([16, 2])) must be the same as input size (torch.Size([16, 3]))
Is it necessary to define the label column as a multi-hot encoded vector in order to train a multi-label classifier, or can I use ClassLabel objects as in a single-label multi-class task?
@zbeloki
Defining the labels as one-hot vectors worked for me. The other thing I did was cast the column as a Sequence of ClassLabel!
Make sure your one-hot encoded labels have a float datatype.
If you are using pandas dataframes, here’s a short snippet!
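# Map each distinct category to an integer index (assumes one category per row)
label_enum = {k: j for j, k in enumerate(df['label_categories'].unique())}
num_labels = len(label_enum)
# Build a float one-hot vector per row: 1.0 at the category's index, 0.0 elsewhere
df['labels'] = df['label_categories'].apply(lambda x: [1.0 if label_enum[x] == i else 0.0 for i in range(num_labels)])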
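The snippet above covers one category per row. For rows that hold several class ids, such as [0, 2], here is a minimal sketch of the same idea in plain Python. The background, as far as I understand it: in multi_label_classification mode the loss compares labels and logits element by element, so each label row needs exactly num_labels float entries, which is why a batch of index lists with shape [16, 2] is rejected against logits of shape [16, 3]. The helper name to_multi_hot is my own and not part of any library.

```python
from typing import List


def to_multi_hot(class_ids: List[int], num_labels: int) -> List[float]:
    """Turn a list of class ids, e.g. [0, 2], into a float multi-hot vector."""
    vector = [0.0] * num_labels
    for class_id in class_ids:
        if not 0 <= class_id < num_labels:
            raise ValueError(f"class id {class_id} is outside [0, {num_labels})")
        vector[class_id] = 1.0
    return vector


# Topics from the question: 0 = sports, 1 = politics, 2 = culture.
examples = [[0, 2], [0], [0, 1, 2]]
multi_hot = [to_multi_hot(ids, num_labels=3) for ids in examples]
print(multi_hot)
```

With a datasets Dataset, a function like this could presumably be applied through Dataset.map to produce the labels column before training.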
Defining labels as “one”-hot vectors ([1, 1, 0, 0, 1]) also works for me, but I’d like to avoid it and define them as a Sequence(feature=ClassLabel(…)). I can’t manage to train the model that way.
When I train a basic single-label multi-class model with labels as ClassLabel it works, but if I have multiple possible labels for each entry, I’m not sure if I can use ClassLabel or I’m supposed to use raw “one”-hot vectors as labels.
For instance, take the GoEmotions-simplified dataset as a reference. The labels are a list of ClassLabel elements, but I can’t train the model on them as is; I have to convert those labels to multi-hot encoded vectors. Am I doing something wrong? Or is this the expected behaviour, and I am required to convert the dataset to multi-hot vectors?
I read your posts in this thread with attention as I have the same issue:
How can I use a Dataset with the input text as a string and the labels as Sequence(feature=ClassLabel(…)), like GoEmotions-simplified, to train a multi-label classification model WITHOUT converting my label vectors ([4, 23, 1999] for example) to “one”-hot vectors?
Why? I have 2,000 labels and 1,000,000 rows in my Dataset…
It’s a shame to do all these conversions to one-hot vectors.
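One possible way around storing 1,000,000 vectors of width 2,000 is to keep the class ids in the dataset and expand them only when a batch is built. Below is a rough, dependency-free sketch of that idea. The function name collate_multi_label and the column name "labels" are my assumptions, and I have not verified this inside a full Trainer run.

```python
from typing import Any, Dict, List


def collate_multi_label(
    features: List[Dict[str, Any]], num_labels: int
) -> Dict[str, List[Any]]:
    """Batch features, expanding each row's class ids into a multi-hot float row."""
    batch: Dict[str, List[Any]] = {}
    for feature in features:
        for key, value in feature.items():
            if key == "labels":
                # Expand e.g. [0, 2] into [1.0, 0.0, 1.0] for this batch only.
                row = [0.0] * num_labels
                for class_id in value:
                    row[class_id] = 1.0
                value = row
            batch.setdefault(key, []).append(value)
    return batch


# Two already-tokenized rows whose labels are stored as class ids.
rows = [
    {"input_ids": [101, 7592, 102], "labels": [0, 2]},
    {"input_ids": [101, 2088, 102], "labels": [1]},
]
batch = collate_multi_label(rows, num_labels=3)
print(batch["labels"])
```

To use something like this as the data_collator of a Trainer, the lists would still have to be turned into tensors (labels as float) and the inputs padded to a common length; both steps are left out here to keep the sketch short.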