Understanding multi-label classification training

msafar · February 14, 2023, 6:18am

I have looked at all kinds of examples for multi-label classification and I still don’t quite get how I’m supposed to be tokenizing the data. I have a number of questions.

The scenario is my input data has these columns:

        Text    LabelA    LabelB

LabelA and LabelB each have multiple possible values, and every row has a value for both; there are no missing labels.

Questions:

I saw examples building a 2xN matrix. I did that, and I successfully created a tokenized dataset, but I’m not confident I did it correctly. I assume it’s just one column per label type, each holding an integer that indicates which value the row has for that label.
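A minimal sketch of that encoding, in plain Python. The rows and the helper name `encode_rows` are hypothetical, and this is not necessarily the format the trainer expects; it only shows one integer column per label type.

```python
# Hypothetical rows: (text, LabelA value, LabelB value)
rows = [
    ("first example text", "A1", "B3"),
    ("second example text", "A3", "B1"),
    ("third example text", "A2", "B2"),
]

# One string -> integer map per label column (IDs start at 0 in each).
label_a_to_id = {"A1": 0, "A2": 1, "A3": 2}
label_b_to_id = {"B1": 0, "B2": 1, "B3": 2}


def encode_rows(rows):
    """Return one [label_a_id, label_b_id] pair per row (an N x 2 matrix)."""
    return [[label_a_to_id[a], label_b_to_id[b]] for _, a, b in rows]


label_matrix = encode_rows(rows)
print(label_matrix)  # [[0, 2], [2, 0], [1, 1]]
```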

Do I even need to pre-tokenize the data before I train, or will the trainer somehow do that for me? If so, then what the heck is the point of passing in label2id?
label2id looks like a single flat mapping, so does that mean my IDs should somehow be contiguous across both label columns?

OPTION A: IDs overlap, but then what should I pass into label2id?
```
  LabelA = A1:0, A2:1, A3:2
  LabelB = B1:0, B2:1, B3:2
```
OPTION B: IDs are linear, but does that mean I need to concatenate my label2id maps into one?
```
  LabelA = A1:1, A2:2, A3:3
  LabelB = B1:4, B2:5, B3:6
```
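The two options can be sketched side by side in plain Python. Everything here is an assumption for illustration: the helpers `build_flat_label2id` and `multi_hot` are hypothetical, and the flat map starts at 0 rather than 1 because class IDs are conventionally zero-based. Option A keeps one map per column; Option B shifts the second column so that a single label2id/id2label pair covers both.

```python
# Option A: one map per label column, so IDs overlap across columns.
label_a_to_id = {"A1": 0, "A2": 1, "A3": 2}
label_b_to_id = {"B1": 0, "B2": 1, "B3": 2}


def build_flat_label2id(*column_maps):
    """Option B: merge per-column maps into one map with contiguous IDs.

    Hypothetical helper. Each column's IDs are shifted by the number of
    labels in the columns before it (assumes local IDs run 0..n-1).
    """
    flat = {}
    offset = 0
    for column_map in column_maps:
        for name, local_id in column_map.items():
            flat[name] = offset + local_id
        offset += len(column_map)
    return flat


label2id = build_flat_label2id(label_a_to_id, label_b_to_id)
id2label = {idx: name for name, idx in label2id.items()}


def multi_hot(a_value, b_value):
    """Encode one row as a 0.0/1.0 vector over the flat ID space."""
    vector = [0.0] * len(label2id)
    vector[label2id[a_value]] = 1.0
    vector[label2id[b_value]] = 1.0
    return vector


print(label2id)
print(multi_hot("A2", "B3"))
```

With the flat map, a row labelled A2 and B3 becomes a vector with ones at positions 1 and 5. Whether a model should consume that vector or two separate integer columns depends on how the classification head is set up, which this sketch does not settle.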
Essentially, what I tried to do was create a matrix of integers and then pre-tokenize my labels. But I see examples that make me think tokenizing isn’t necessary and that I can pass in a matrix of strings.
Is there a difference in training outcomes if I go with the linear IDs, or if I somehow manage to make the overlapping IDs work?
If I can use overlapping IDs, then how in the world do I get the label2id/id2label parameters passed in correctly in .from_pretrained()?
