Convert tokens and token-labels to string

Zack · November 16, 2020, 11:24am

I am building a token classification model, and I am looking for a good way to convert the tokens and their labels (each token has its own label) back to a string. I know that tokenizer.convert_tokens_to_string converts tokens to a string, but I also need to take the labels into account.
Is there an existing way to do this, so that I don’t have to write my own implementation?

rgwatwormhill · November 17, 2020, 7:57pm

Hi Zack,

what form are the labels currently in?

I don’t understand what you are building. Are the labels the classifications?

Zack · November 19, 2020, 11:19am

Hi @rgwatwormhill, thank you for your reply,
let’s say that I have the predictions below on the sentence “this is a test sentence using transformers model, the model used is CamemBert”,

The problem is that I want to go back to word-level tokens instead of the tokens generated by the tokenizer, while preserving the labels. For example, the word “transformers” is split into “transformer” and “s”, and let’s say that “transformer” is tagged with the label “model”. I am struggling to preserve the labels: I tried implementing a simple function to decode tokens and labels at the same time, but I ran into problems because I don’t know very well how the tokenizer works.
Is there a way to decode the tokens and preserve their labels using the tokenizer class?

rgwatwormhill · November 20, 2020, 12:12pm

Have you seen this tutorial? https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities

For more about the tokenizer, see https://mccormickml.com/2019/07/22/BERT-fine-tuning/

Chris McCormick also has a tutorial specifically about NER, but you’d have to pay for that one and I haven’t seen it.

lewtun · November 21, 2020, 10:25pm

Hi @Zack, the key trick contained in @rgwatwormhill’s first link is the following paragraph:

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBert’s tokenizer would split the Twitter handle @huggingface into the tokens ['@', 'hugging', '##face'] . This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in Transformers by setting the labels we wish to ignore to -100 . In the example above, if the label for @HuggingFace is 3 (indexing B-corporation ), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100] .

So basically what you should do is create a mapping for each named entity tag to some integer (e.g. PER → 0, LOC → 1, ORG → 2 etc) and then use -100 to label all the entity’s subtokens beyond the first one.
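As a rough illustration of that labelling scheme, here is a minimal plain-Python sketch. It assumes you already have, for each subtoken, the index of the word it came from (None for special tokens), which is the shape of list that fast tokenizers expose through word_ids(). The tag2id mapping, the example sentence and the align_labels helper are all made up for this sketch and are not part of the tutorial.

```python
IGNORE_INDEX = -100  # value the quoted tutorial uses for labels to skip


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Give each word's label to its first subtoken and mask the rest."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]: never scored.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First subtoken of a new word: keep the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subtoken of the same word: masked out.
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical tag set and sentence, for illustration only.
tag2id = {"O": 0, "B-person": 1, "I-person": 2, "B-corporation": 3}
word_labels = [tag2id["B-corporation"], tag2id["O"]]  # "@huggingface", "rocks"

# Assumed tokenization: [CLS] @ hugging ##face rocks [SEP]
word_ids = [None, 0, 0, 0, 1, None]

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, -100, -100, 0, -100]
```

Driving the alignment from word indices rather than from the "##" marker means that pieces such as '@' and 'hugging', which carry no marker, are still recognised as belonging to the same word.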

Hope that helps!

Zack · November 22, 2020, 10:18pm

Hi @lewtun,
Thank you for your reply, it’s so helpful.
I am not sure that I get the trick: why not just set the tags to [3, 3, 3] for ['@', 'hugging', '##face'] if, for example, @HuggingFace has the label B-corporation with index 3?
Is it necessary to use -100 as the value for the labels to ignore, or can I use any other value? For now I am using 0 as the index that tells the model to ignore a label.
The problem described in this discussion is similar to the one in Converting Word-level labels to WordPiece-level for Token Classification,

but in reverse: from WordPiece-level to word-level labels.
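For the reverse direction, one possible sketch (not a built-in tokenizer feature, as far as this thread establishes) is to glue WordPiece continuation pieces back onto the preceding token and keep the label of the first piece. The merge_wordpieces helper, the example tokens and the "B-model" tag are hypothetical, and the sketch assumes BERT-style "##" continuation markers.

```python
def merge_wordpieces(tokens, labels, prefix="##"):
    """Rebuild words from WordPiece tokens, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith(prefix) and words:
            # Continuation piece: append its text, drop its label.
            words[-1] += token[len(prefix):]
        else:
            # Start of a new word: keep both text and label.
            words.append(token)
            word_labels.append(label)
    return words, word_labels


# Made-up predictions, one label per subtoken.
tokens = ["the", "transform", "##ers", "model", "is", "came", "##mbert"]
labels = ["O", "B-model", "B-model", "O", "O", "B-model", "B-model"]

words, word_labels = merge_wordpieces(tokens, labels)
print(list(zip(words, word_labels)))
```

Two caveats apply. Pieces split off without a marker (such as '@' before 'hugging') are not rejoined by this rule, so word indices or character offsets from the tokenizer are a safer basis when they are available. CamemBERT's tokenizer is, as far as I know, SentencePiece-based and marks the start of a word rather than its continuation, so the test on the prefix would need to be inverted for that model.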

lewtun · November 25, 2020, 11:05am

Hi @Zack, the reasons why we might not want to assign NER tags to every subword are nicely summarised by facehugger2020 in the other thread you linked to: Converting Word-level labels to WordPiece-level for Token Classification

Besides efficiency, one is also technically breaking the IOB2 format by having consecutive tokens be associated with say B-ORG. Personally, I find this confusing during debugging, so prefer the convention to only tag the first subword in an entity. Of course, this is just a convention and facehugger2020 links to a tutorial showing an alternative along the lines you suggested.

Now why the -100 value? I believe this comes from the PyTorch convention to ignore indices with -100 on the cross-entropy loss: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

You can override this value if you want, but then you’ll need to tinker with the loss function of your Transformer model.
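To make the effect of an ignored index concrete, here is a plain-Python sketch of a mean cross-entropy that skips positions whose target equals ignore_index. It is only meant to mirror the behaviour described in the linked PyTorch documentation, not to reproduce PyTorch's implementation; the logits are made up, and the cross_entropy function is hypothetical.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else float("nan")


# Made-up scores for three subtokens over three classes.
logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 4.0], [1.0, 1.0, 1.0]]

masked = cross_entropy(logits, [0, -100, -100])  # only the first subtoken counts
first_only = cross_entropy(logits[:1], [0])      # same value as above
zero_filled = cross_entropy(logits, [0, 0, 0])   # 0 is scored as a real class

print(masked, first_only, zero_filled)
```

The last call shows why 0 does not work as an "ignore" marker unless the loss is told about it: every position labelled 0 is scored as class 0, so the extra subtokens still pull on the loss.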

You should be careful about picking an index like 0 because the tokenizer probably has a predefined set of indices for natural numbers and you might be overriding special tokens like <s> this way.

lzhnicholas · March 12, 2022, 7:58am

Hi @lewtun, so when we decode the ids/labels for evaluation, to compare the ground truth with our predicted sequence, do we just ignore the subsequent word pieces labelled -100?

That is, does ['@', 'hugging', '##face'] with labels [3, -100, -100] become ['@'] with [3]? It seems that this may cause another alignment issue.
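One common way to handle this at evaluation time, sketched here under the first-subtoken convention described above, is to use the gold labels as a mask and drop the same positions from both the gold and the predicted sequences. The id2tag mapping, the predictions and the drop_ignored helper are hypothetical.

```python
IGNORE_INDEX = -100


def drop_ignored(predictions, labels, id2tag, ignore_index=IGNORE_INDEX):
    """Keep only positions with a real gold label, in both sequences."""
    true_tags, pred_tags = [], []
    for pred, gold in zip(predictions, labels):
        if gold == ignore_index:
            continue  # special token or non-first subtoken: not evaluated
        true_tags.append(id2tag[gold])
        pred_tags.append(id2tag[pred])
    return true_tags, pred_tags


# Hypothetical tag set; assumed tokens: [CLS] @ hugging ##face rocks [SEP]
id2tag = {0: "O", 3: "B-corporation"}
labels = [-100, 3, -100, -100, 0, -100]
predictions = [0, 3, 3, 0, 0, 0]  # the model predicts something at every position

true_tags, pred_tags = drop_ignored(predictions, labels, id2tag)
print(true_tags)  # ['B-corporation', 'O']
print(pred_tags)  # ['B-corporation', 'O']
```

Because the same mask is applied to both sequences, they stay aligned, with one tag per original word; whatever the model predicted on the masked subtokens is simply not scored.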
