I've been able to train a multi-label BERT classifier using a custom Dataset object and the Trainer API from Transformers. The Dataset contains two columns: text and label. After tokenizing, I have all the columns needed for training.
For multi-label classification I also set model.config.problem_type = "multi_label_classification", and define each label as a multi-hot vector (a list of 0/1 values, each corresponding to a different class).
For instance, if the task is to assign one or more topics (sports, politics, culture) to text articles, I end up with labels like [0, 1, 0] (politics) or [1, 1, 0] (sports & politics). This works as expected and I was able to train the model.
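For context, the working setup looks roughly like this (a minimal sketch; the checkpoint name is just a placeholder for whatever BERT model is actually used):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # placeholder checkpoint
    num_labels=3,         # sports, politics, culture
)
model.config.problem_type = "multi_label_classification"

# each example carries a multi-hot float vector, e.g. sports & politics:
example = {"text": "Some article text...", "label": [1.0, 1.0, 0.0]}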
Now I'd like to define the label column as a Sequence of ClassLabel items, and keep labels like [0, 2] (sports and culture), [0] (sports), [0, 1, 2] (sports, politics, culture)… If I structure the dataset like that, I get the following error when trying to train:
ValueError: Target size (torch.Size([16, 2])) must be the same as input size (torch.Size([16, 3]))
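For reference, the label feature in that case is defined roughly like this (class names just match the toy example above):

from datasets import ClassLabel, Features, Sequence, Value

features = Features({
    "text": Value("string"),
    "label": Sequence(ClassLabel(names=["sports", "politics", "culture"])),
})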
Is it necessary to define the label column as a multi-hot encoded vector in order to train a multi-label classifier, or can I use ClassLabel objects as in a single-label multi-class task?
@zbeloki
Defining the labels as one-hot vectors worked for me; the other thing I did was cast the column as a Sequence of ClassLabel!
Make sure your one-hot encoded labels are of float dtype.
If you are using pandas dataframes, here's a short snippet!
# map each unique category to an integer index
label_enum = {k: j for j, k in enumerate(df['label_categories'].unique())}
num_labels = len(label_enum)
# encode each row's category as a list of floats (one-hot)
df['labels'] = df['label_categories'].apply(lambda x: [1.0 if label_enum[x] == i else 0.0 for i in range(num_labels)])
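If you then want to go from the dataframe to a datasets.Dataset, a minimal sketch could look like this (assuming df also has a text column; here I simply declare the labels as a sequence of floats rather than ClassLabel, which is enough for the Trainer to receive float targets):

from datasets import Dataset, Features, Sequence, Value

features = Features({
    "text": Value("string"),
    "labels": Sequence(Value("float32")),
})
ds = Dataset.from_pandas(df[["text", "labels"]], features=features, preserve_index=False)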
Defining labels as "one"-hot vectors ([1, 1, 0, 0, 1]) also works for me, but I'd like to avoid it and define them as a Sequence(feature=ClassLabel(…)). I can't manage to train the model that way.
When I train a basic single-label multi-class model with labels as ClassLabel it works, but if I have multiple possible labels for each entry, I'm not sure whether I can use ClassLabel or whether I'm supposed to use raw "one"-hot vectors as labels.
For instance, take the GoEmotions-simplified dataset as a reference. The labels are a list of ClassLabel elements, but then I can't train the model as is; I have to convert those labels to multi-hot encoded vectors. Am I doing something wrong? Or is this the expected behaviour, and am I required to convert the dataset to multi-hot vectors?
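For concreteness, the conversion I end up applying looks roughly like this (just a sketch, not necessarily the cleanest way):

from datasets import load_dataset

ds = load_dataset("go_emotions", "simplified")
num_labels = ds["train"].features["labels"].feature.num_classes

def to_multi_hot(example):
    # turn a list of class indices (e.g. [0, 27]) into a multi-hot float vector
    vec = [0.0] * num_labels
    for idx in example["labels"]:
        vec[idx] = 1.0
    return {"float_labels": vec}

# drop the original ClassLabel column and rename the float version back to "labels"
ds = ds.map(to_multi_hot, remove_columns=["labels"])
ds = ds.rename_column("float_labels", "labels")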