I have looked at all kinds of examples for multi-label classification and I still don’t quite get how I’m supposed to be tokenizing the data. I have a number of questions.
My scenario: the input data has these columns:
Text, LabelA, LabelB
LabelA and LabelB EACH have multiple possible values, and every row in the data has a value for both…there are no missing labels.
Questions:
- I saw examples that build a 2xN label matrix. I did that and successfully created a tokenized dataset, but I’m not confident I did it correctly. I assume it’s just one column per label, with an integer in each row indicating which value applies for that label.
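To show what I mean by the label matrix, here is a minimal sketch in plain Python. The rows, the values A1..B3, and all variable names are placeholders I made up for illustration; this is only my assumption about how the matrix should be built, not something I have confirmed is the expected format.

```python
# Hypothetical rows with the three columns described above.
rows = [
    {"Text": "first example", "LabelA": "A1", "LabelB": "B3"},
    {"Text": "second example", "LabelA": "A3", "LabelB": "B1"},
]

# One independent string -> integer mapping per label column (assumed values).
label_a_to_id = {"A1": 0, "A2": 1, "A3": 2}
label_b_to_id = {"B1": 0, "B2": 1, "B3": 2}

# One [LabelA id, LabelB id] pair per row.
label_matrix = [
    [label_a_to_id[row["LabelA"]], label_b_to_id[row["LabelB"]]]
    for row in rows
]

print(label_matrix)  # [[0, 2], [2, 0]]
```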
- Do I even need to pre-tokenize the data before I train, or will the trainer somehow do that for me? If I do need to, then what the heck is the point of passing in label2id?
- label2id looks like a single flat mapping, so does that mean my IDs need to form one contiguous range across both labels?
OPTION A: IDs overlap, but then what should I pass into label2id?
LabelA = A1:0, A2:1, A3:2; LabelB = B1:0, B2:1, B3:2
OPTION B: IDs are linear (one contiguous range), but then that means I need to concatenate my label2id mappings into one?
LabelA = A1:1, A2:2, A3:3; LabelB = B1:4, B2:5, B3:6
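To make the two options concrete, here is a rough sketch of the mappings in plain Python. The names `option_a`, `option_b`, and `id2label_b` are hypothetical, and I am not claiming either shape is what `.from_pretrained()` expects; that is exactly what I am unsure about.

```python
label_a_values = ["A1", "A2", "A3"]
label_b_values = ["B1", "B2", "B3"]

# OPTION A: each label column gets its own mapping, so the IDs overlap.
option_a = {
    "LabelA": {name: i for i, name in enumerate(label_a_values)},
    "LabelB": {name: i for i, name in enumerate(label_b_values)},
}

# OPTION B: one mapping over both columns, with IDs starting at 1
# to match the numbering in my example above.
option_b = {
    name: i
    for i, name in enumerate(label_a_values + label_b_values, start=1)
}

# The reverse mapping for Option B (ID -> label name).
id2label_b = {i: name for name, i in option_b.items()}

print(option_a["LabelA"]["A2"], option_a["LabelB"]["B2"])  # 1 1
print(option_b["A2"], option_b["B2"])  # 2 5
```

With Option A, the same integer means different things depending on the column; with Option B, every integer identifies exactly one label value.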
- Essentially, what I tried to do was create a matrix of integers and then pre-tokenize my labels. But I see examples that make me think it’s not necessary to tokenize the labels at all, and that I can pass in a matrix of STRINGS instead.
- Is there a difference in training outcomes if I go with the linear IDs (Option B) versus somehow making the overlapping IDs (Option A) work?
- If I can use overlapping IDs, then how in the world do I pass the label2id/id2label parameters correctly to .from_pretrained()?