Hi,
I’ve been able to train a multi-label BERT classifier using a custom Dataset object and the Trainer API from Transformers. The Dataset contains two columns: text and label. After tokenizing, I have all the columns needed for training.
For multi-label classification I also set model.config.problem_type = "multi_label_classification" and define each label as a multi-hot vector (a list of 0/1 values, each corresponding to a different class).
For instance, if the task is to assign one or more topics (sports, politics, culture) to text articles, I end up with labels like [0, 1, 0] (politics) or [1, 1, 0] (sports & politics). This works as expected and I managed to train the model.
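To make the working setup concrete, here is a minimal sketch of how I build the multi-hot labels. The helper name encode_topics and the CLASS_NAMES list are only illustrative (they are not part of Transformers or Datasets), and I use floats because, as far as I understand, the loss used for multi_label_classification expects float targets.

```python
# Hypothetical class list, taken from the topics example above.
CLASS_NAMES = ["sports", "politics", "culture"]


def encode_topics(topics, class_names=CLASS_NAMES):
    """Return a multi-hot vector with 1.0 at the position of each given topic."""
    unknown = set(topics) - set(class_names)
    if unknown:
        raise ValueError(f"Unknown topics: {sorted(unknown)}")
    # One slot per class, in the fixed order of class_names.
    return [1.0 if name in topics else 0.0 for name in class_names]


politics_only = encode_topics(["politics"])
sports_and_politics = encode_topics(["sports", "politics"])

print(politics_only)        # [0.0, 1.0, 0.0]
print(sports_and_politics)  # [1.0, 1.0, 0.0]
```

Every label has the same length as the number of classes, regardless of how many topics apply to the article.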
Now I’d like to define the label column as a Sequence of ClassLabel items, and keep labels like [0, 2] (sports and culture), [0] (sports), [0, 1, 2] (sports, politics, culture)… If I structure the dataset like that, I get the following error when trying to train:
ValueError: Target size (torch.Size([16, 2])) must be the same as input size (torch.Size([16, 3]))
Is it necessary to define the label column as a multi-hot encoded vector in order to train a multi-label classifier, or can I use ClassLabel objects as in a single-label multi-class task?
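My reading of the error is that the loss compares labels and logits element by element, so each label needs one entry per class (3 here), while a list like [0, 2] only has 2 entries. If the conversion turns out to be unavoidable, the fallback I have in mind is sketched below. The function name indices_to_multi_hot is my own, and the sketch assumes the class indices are 0-based and the number of classes is known; something like it could presumably be applied per example before training.

```python
def indices_to_multi_hot(indices, num_classes):
    """Turn a list of class indices, e.g. [0, 2], into a multi-hot float vector."""
    vector = [0.0] * num_classes
    for index in indices:
        if not 0 <= index < num_classes:
            raise ValueError(f"Class index {index} is out of range for {num_classes} classes")
        vector[index] = 1.0
    return vector


# The three example labels from above: variable-length index lists.
index_labels = [[0, 2], [0], [0, 1, 2]]
multi_hot_labels = [indices_to_multi_hot(label, num_classes=3) for label in index_labels]

print(multi_hot_labels)
# [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
```

After the conversion every label has length 3, which matches the second dimension of the input size reported in the error.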
Thank you in advance.