How to structure labels for token classification?

The documentation for the labels parameter of BertForTokenClassification says that

Indices should be in [0, ..., config.num_labels - 1]

But BertConfig doesn’t have a num_labels parameter as far as I can tell, so what is this config.num_labels argument?

Also, this tutorial says that we can set the labels we want the model to ignore to -100. If that is correct, why doesn't the documentation for BertForTokenClassification mention it? Maybe it's not correct, because when I make my labels like this, I get the error

/opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed

which suggests to me that I cannot have labels with values outside the interval [0, n_classes).

What I have are labels indicating which class each wordpiece belongs to after tokenization. I then add the special tokens and padding, and set the labels for the special tokens to -100. So for example, if I want a sequence length of 10, and I want to classify wordpieces containing an 'o' as class 1 and wordpieces containing a 'p' as class 2, I would have, for the sentence "Oh, that school is pretty cool":

Tokens: [‘oh’, ‘,’, ‘that’, ‘school’, ‘is’, ‘pretty’, ‘cool’]
With special tokens: [‘[CLS]’, ‘oh’, ‘,’, ‘that’, ‘school’, ‘is’, ‘pretty’, ‘cool’, ‘[SEP]’, ‘[PAD]’]
Labels: [-100, 1, 0, 0, 1, 0, 2, 1, -100, -100]
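
For reference, here is roughly how I build these labels (a simplified sketch assuming a fast tokenizer, whose word_ids() returns None for special tokens and padding; the o/p rule is just a stand-in for my real classes):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def word_label(token):
    # stand-in rule: 'o' -> class 1, 'p' -> class 2, otherwise class 0
    if "o" in token:
        return 1
    if "p" in token:
        return 2
    return 0

encoding = tokenizer("Oh, that school is pretty cool",
                     padding="max_length", max_length=10, truncation=True)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
labels = [-100 if word_id is None else word_label(token)
          for token, word_id in zip(tokens, encoding.word_ids())]
# tokens: ['[CLS]', 'oh', ',', 'that', 'school', 'is', 'pretty', 'cool', '[SEP]', '[PAD]']
# labels: [-100, 1, 0, 0, 1, 0, 2, 1, -100, -100]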

You should be able to do something like this:

from transformers import AutoConfig, AutoModelForTokenClassification

config = AutoConfig.from_pretrained("bert-base-cased", num_labels=3)
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", config=config)

Note that in your example you have three possible labels: wordpieces with an 'o', wordpieces with a 'p', and wordpieces with neither. If you had set num_labels to 2, you would have gotten the error you described. Also note the use of AutoModelForTokenClassification rather than the bare AutoModel, so that the model has a token classification head that accepts the labels.

-100 is the default ignore index for NLLLoss (and for CrossEntropyLoss, which wraps it). When a target item has this index, it is excluded from the loss computation.
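
You can verify this with a small standalone PyTorch check (cross_entropy wraps log_softmax plus NLLLoss and defaults to ignore_index=-100; the label vector is the one from the example above):

import torch
import torch.nn.functional as F

logits = torch.randn(10, 3)  # one row of logits per token position, 3 classes
labels = torch.tensor([-100, 1, 0, 0, 1, 0, 2, 1, -100, -100])

# the -100 positions contribute nothing to the (mean) loss,
# so it equals the loss over the seven real wordpieces alone
loss = F.cross_entropy(logits, labels)
same = F.cross_entropy(logits[1:8], labels[1:8])
assert torch.isclose(loss, same)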


Thank you, this worked. Do you know if it is correct to have -100 as the label for the [CLS] and [SEP] tokens when doing token classification? As I understand it, these tokens are required even if I don't use them for anything, since that's how BERT was pretrained, but I'm not sure how to label them. Maybe they should have the label 0 instead, like the rest of the tokens that don't belong to any of the other classes, and only the [PAD] tokens should have the label -100?

I'd set all special tokens to -100. You are not interested in your model's performance on those tokens.


And how should padding be labeled? Should it also be -100?

Yes, padding is also typically ignored.
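
Putting it all together, a minimal end-to-end sketch (assuming a recent transformers version, and using the example sentence and labels from above):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=3)

enc = tokenizer("Oh, that school is pretty cool",
                padding="max_length", max_length=10, return_tensors="pt")
labels = torch.tensor([[-100, 1, 0, 0, 1, 0, 2, 1, -100, -100]])

# [CLS], [SEP] and [PAD] are all masked out by -100, so the loss
# averages over the seven real wordpieces only
outputs = model(**enc, labels=labels)
print(outputs.loss)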