Understanding multi-label classification training

I have looked at all kinds of examples for multi-label classification and I still don’t quite get how I’m supposed to be tokenizing the data. I have a number of questions.

The scenario: my input data has these columns:

        Text    LabelA    LabelB

LabelA and LabelB EACH have multiple possible values, and every row in the data is labeled; there are no missing labels.

Questions:

I saw examples that build a 2xN matrix. I did that, and I successfully created a tokenized dataset, but I'm not confident I did it correctly. I assume it's just one column per label, each holding an integer that indicates which value goes in that column.
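For reference, this is roughly what I mean by a matrix of integers. It is only a minimal sketch in plain Python: the rows are made up, and `build_label_map`, `label_a_to_id`, and `label_b_to_id` are names I invented for illustration, not anything from a library. It uses overlapping per-column IDs.

```python
# Hypothetical rows: (text, LabelA value, LabelB value).
rows = [
    ("first example", "A1", "B3"),
    ("second example", "A3", "B1"),
    ("third example", "A2", "B2"),
]


def build_label_map(values):
    """Assign each distinct label value an integer ID, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


# One independent mapping per label column, so the IDs overlap.
label_a_to_id = build_label_map(row[1] for row in rows)
label_b_to_id = build_label_map(row[2] for row in rows)

# One row per example, one integer column per label.
label_matrix = [[label_a_to_id[a], label_b_to_id[b]] for _, a, b in rows]
print(label_matrix)
```

Whether this is the shape the trainer actually expects is part of what I'm asking below.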

  1. Do I even need to pre-tokenize the data before I train, or will the trainer somehow do that for me? If so, then what the heck is the point of passing in label2id?

  2. label2id looks like a single flat mapping, so does that mean I need one continuous range of IDs across both labels?

    OPTION A: IDs overlap, but then what should I pass into label2id?

      LabelA = A1:0, A2:1, A3:2
      LabelB = B1:0, B2:1, B3:2
    

    OPTION B: IDs are linear, but then does that mean I need to concatenate my label2ids into one?

      LabelA = A1:1, A2:2, A3:3
      LabelB = B1:4, B2:5, B3:6
    
  3. Essentially, what I tried to do was create a matrix of integers and pre-tokenize my labels. But I see examples that make me think tokenizing the labels isn't necessary and that I can pass in a matrix of STRINGS. Is that right?

  4. Is there a difference in training outcomes if I go with the linear IDs versus somehow making the overlapping IDs work?

  5. If I can use overlapping IDs, then how in the world do I get the label2id/id2label parameters passed in correctly in .from_pretrained()?
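To make the two options concrete, here is a sketch of the mappings in plain Python. It does not call any library; `make_label2id`, `option_a`, and `option_b` are hypothetical names of my own, and whether either structure is what `.from_pretrained()` accepts is exactly what I'm unsure about.

```python
label_a_values = ["A1", "A2", "A3"]
label_b_values = ["B1", "B2", "B3"]


def make_label2id(values, start=0):
    """Map each label value to consecutive integer IDs beginning at `start`."""
    return {value: start + i for i, value in enumerate(values)}


# OPTION A: each label column has its own mapping, so IDs overlap.
option_a = {
    "LabelA": make_label2id(label_a_values),
    "LabelB": make_label2id(label_b_values),
}

# OPTION B: one concatenated mapping with non-overlapping IDs.
# IDs start at 1 here only to mirror the numbering in my question.
option_b = {
    **make_label2id(label_a_values, start=1),
    **make_label2id(label_b_values, start=1 + len(label_a_values)),
}

# The inverse mapping is only well defined for OPTION B, where IDs are unique.
id2label_b = {idx: label for label, idx in option_b.items()}

print(option_a)
print(option_b)
print(id2label_b)
```

With Option A, a single id2label dictionary cannot be built, because ID 0 would have to mean both A1 and B1. That is the part I don't know how to express through the label2id/id2label parameters.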
