Convert tokens and token-labels to string

I am building a token classification model, and I would like to know whether there is a good way to convert the tokens and their labels (each token has its own label) back to a string. I know that tokenizer.convert_tokens_to_string converts tokens to a string, but I also need to take the labels into account.
Any ideas, other than writing my own implementation?

Hi Zack,

What form are the labels currently in?

I don’t understand what you are building. Are the labels the classifications?

Hi @rgwatwormhill, thank you for your reply.
Let’s say that I have the predictions below for the sentence “this is a test sentence using transformers model, the model used is CamemBert”:

The problem is that I want to go back to word-level strings instead of the tokens generated by the tokenizer, while preserving the labels. For example, the word “transformers” is split into “transformer” and “s”, and let’s say that “transformer” is tagged with the label “model”. I am having trouble preserving the labels: I tried implementing a simple function that decodes tokens and labels at the same time, but I ran into problems because I don’t know very well how the tokenizer works.
Is there a way to decode the tokens and preserve their labels using the tokenizer class?

Have you seen this tutorial? https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities

For more about the tokenizer, see https://mccormickml.com/2019/07/22/BERT-fine-tuning/

Chris McCormick also has a tutorial specifically about NER, but you’d have to pay for that one and I haven’t seen it.

Hi @Zack, the key trick contained in @rgwatwormhill’s first link is the following paragraph:

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBert’s tokenizer would split the Twitter handle @huggingface into the tokens ['@', 'hugging', '##face'] . This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗 Transformers by setting the labels we wish to ignore to -100 . In the example above, if the label for @HuggingFace is 3 (indexing B-corporation ), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100] .

So basically what you should do is create a mapping for each named entity tag to some integer (e.g. PER → 0, LOC → 1, ORG → 2 etc) and then use -100 to label all the entity’s subtokens beyond the first one.
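To make the convention concrete, here is a minimal sketch of the alignment step. It assumes you already have, for each token, the index of the word it came from (fast tokenizers expose this through `word_ids()` on the encoding); the `word_ids` list, the `tag2id` mapping, and the `align_labels` helper below are hand-written for illustration and are not part of any library.

```python
IGNORE_INDEX = -100  # value skipped by the loss by default

def align_labels(word_labels, word_ids, ignore_index=IGNORE_INDEX):
    """Keep the word's label on its first subtoken, ignore everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] / [SEP] belong to no word.
            aligned.append(ignore_index)
        elif wid != previous:
            # First subtoken of a new word: keep the word's label.
            aligned.append(word_labels[wid])
        else:
            # Later subtokens of the same word are ignored.
            aligned.append(ignore_index)
        previous = wid
    return aligned

# Hypothetical tag mapping, with B-corporation at index 3 as in the tutorial.
tag2id = {"O": 0, "B-corporation": 3}

# Tokens:   [CLS]  @  hugging  ##face  rocks  [SEP]
word_ids = [None,  0,    0,      0,     1,    None]
word_labels = [tag2id["B-corporation"], tag2id["O"]]

token_labels = align_labels(word_labels, word_ids)
print(token_labels)  # [-100, 3, -100, -100, 0, -100]
```

Because the helper only looks at word indices, it should not depend on how a particular tokenizer marks continuation pieces.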

Hope that helps!

Hi @lewtun,
Thank you for your reply, it’s very helpful.
I am not sure that I get the trick: why not just set the tags to [3, 3, 3] for ['@', 'hugging', '##face'] if, for example, @HuggingFace has the label B-corporation with index 3?
Is it necessary to use -100 as the value for the labels to ignore, or can I use any other value? For now I am using 0 as the index that tells the model to ignore a label.
The problem described in this discussion is similar to the one in “Converting Word-level labels to WordPiece-level for Token Classification”,

but in reverse: from WordPiece-level to word-level labels.

Hi @Zack, the reasons why we might not want to assign NER tags to every subword are nicely summarised by facehugger2020 in the other thread you linked to: Converting Word-level labels to WordPiece-level for Token Classification

Besides efficiency, one is also technically breaking the IOB2 format by having consecutive tokens be associated with say B-ORG. Personally, I find this confusing during debugging, so prefer the convention to only tag the first subword in an entity. Of course, this is just a convention and facehugger2020 links to a tutorial showing an alternative along the lines you suggested.

Now why the -100 value? I believe this comes from the PyTorch convention to ignore indices with -100 on the cross-entropy loss: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

You can override this value if you want, but then you’ll need to tinker with the loss function of your Transformer model.
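As a rough illustration of what "ignoring" means here, the sketch below reimplements a mean cross-entropy in plain Python and skips every position whose target equals the ignore value. It is not PyTorch's code; it only mirrors the documented behaviour that ignored targets contribute nothing to the loss and are left out of the average. The `cross_entropy` function and the example logits are made up for this post.

```python
import math

def cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this position contributes nothing to the loss
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else float("nan")

# Three token positions, three classes (made-up numbers).
logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 3.0], [1.0, 1.0, 1.0]]

# Only the first position carries a real label.
loss_masked = cross_entropy(logits, [0, -100, -100])
# Same value as computing the loss on the first position alone.
loss_first_only = cross_entropy(logits[:1], [0])
print(abs(loss_masked - loss_first_only) < 1e-12)  # True
```

Note that a value such as 0 only works as the "ignore" marker if the loss is told to ignore 0 and no real tag is mapped to 0.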

You should be careful about picking an index like 0 because the tokenizer probably has a predefined set of indices for natural numbers and you might be overriding special tokens like <s> this way.

Hi @lewtun, so when we decode the ids/labels for evaluation and compare the ground truth to our predicted sequence, do we just ignore the subsequent word pieces that are labelled -100?

Does ['@', 'hugging', '##face'] with [3, -100, -100] become ['@'] with [3]? It seems that this could create another alignment issue.
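One possible way to handle this, sketched under assumptions: drop every position whose gold label is -100 from both the gold and the predicted sequence, so only the first subtoken of each word is compared, and rebuild the words separately from the token-to-word indices. The helpers `word_level` and `merge_wordpieces`, the example predictions, and the hand-written `word_ids` are hypothetical; stripping the `##` prefix is specific to WordPiece, and a SentencePiece tokenizer such as CamemBERT's marks pieces differently.

```python
def word_level(predictions, labels, ignore_index=-100):
    """Keep only positions with a real gold label (first subtokens)."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != ignore_index]
    preds = [p for p, _ in kept]
    golds = [l for _, l in kept]
    return preds, golds

def merge_wordpieces(tokens, word_ids):
    """Glue subtokens that share a word index back into one word."""
    words = {}
    for token, wid in zip(tokens, word_ids):
        if wid is None:
            continue  # skip special tokens
        piece = token[2:] if token.startswith("##") else token
        words[wid] = words.get(wid, "") + piece
    return [words[i] for i in sorted(words)]

tokens = ["[CLS]", "@", "hugging", "##face", "rocks", "[SEP]"]
word_ids = [None, 0, 0, 0, 1, None]
labels = [-100, 3, -100, -100, 0, -100]   # gold labels after alignment
predictions = [0, 3, 3, 0, 0, 0]          # made-up model output per token

preds, golds = word_level(predictions, labels)
words = merge_wordpieces(tokens, word_ids)

print(list(zip(words, preds)))  # [('@huggingface', 3), ('rocks', 0)]
print(golds)                    # [3, 0]
```

With this convention the label of the whole word is the one attached to its first subtoken, so the word list and the label list have the same length again.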