Convert tokens and token-labels to string

Hi @Zack, the reasons why we might not want to assign NER tags to every subword are nicely summarised by facehugger2020 in the other thread you linked to: Converting Word-level labels to WordPiece-level for Token Classification

Besides efficiency, one is also technically breaking the IOB2 format by having consecutive tokens associated with, say, B-ORG. Personally, I find this confusing during debugging, so I prefer the convention of tagging only the first subword of an entity. Of course, this is just a convention, and facehugger2020 links to a tutorial showing an alternative along the lines you suggested.
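For reference, here's a minimal sketch of that convention using a fast tokenizer's `word_ids()` (the checkpoint, words, and label ids below are just placeholders I made up):

```python
from transformers import AutoTokenizer

# Label only the first subword of each word; mask the rest with -100.
# Assumes a fast tokenizer, which is needed for word_ids().
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Hugging", "Face", "is", "in", "NYC"]
word_labels = [3, 4, 0, 0, 5]  # e.g. B-ORG, I-ORG, O, O, B-LOC

encoding = tokenizer(words, is_split_into_words=True)

labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:                 # special tokens like [CLS] / [SEP]
        labels.append(-100)
    elif word_id != previous_word_id:   # first subword of a new word
        labels.append(word_labels[word_id])
    else:                               # continuation subwords
        labels.append(-100)
    previous_word_id = word_id
```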

Now, why the -100 value? I believe this comes from PyTorch's convention of ignoring targets equal to -100 in the cross-entropy loss (the default `ignore_index`): https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
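A tiny example of what that means in practice: positions labelled -100 simply don't contribute to the loss or its gradient:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()  # ignore_index=-100 by default

logits = torch.randn(4, 9)                  # 4 tokens, 9 NER classes
targets = torch.tensor([3, -100, 0, -100])  # two positions are masked out

loss = loss_fn(logits, targets)  # averaged over the 2 unmasked tokens only
```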

You can override this value if you want, but then you’ll need to tinker with the loss function of your Transformer model.
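For instance, here's a rough sketch (not the only way!) of swapping in a different ignore index with the `Trainer` API by overriding `compute_loss`. The value 999 is an arbitrary placeholder, and the exact `compute_loss` signature can vary across transformers versions:

```python
import torch.nn as nn
from transformers import Trainer

class CustomIgnoreIndexTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Compute the cross-entropy ourselves instead of relying on the
        # model's built-in loss, so we control the ignore index.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = nn.CrossEntropyLoss(ignore_index=999)  # placeholder value
        loss = loss_fn(
            outputs.logits.view(-1, model.config.num_labels),
            labels.view(-1),
        )
        return (loss, outputs) if return_outputs else loss
```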

You should be careful about picking a value like 0, because the tokenizer typically has a predefined set of low indices reserved for special tokens like <s>, and you might end up colliding with those (and with a genuine label id) this way.
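For example, with roberta-base (just an illustration; ids vary per checkpoint) the lowest ids in the vocabulary are already taken by special tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.convert_tokens_to_ids("<s>"))    # 0
print(tokenizer.convert_tokens_to_ids("<pad>"))  # 1
print(tokenizer.convert_tokens_to_ids("</s>"))   # 2
```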
