Tokenization in a NER context

Hello everyone, I am trying to understand how to use the tokenizers in a NER context.

Basically, I have a text corpus with entity annotations, usually in IOB format [1], which can be seen as a mapping f: word → tag (annotators work on non-tokenized text and we ask them to annotate entire words).

When I use any modern tokenizer, I will generally get several tokens for a single word (for instance “huggingface” might produce something like [“hugging#”, “face”]). I need to transfer the original annotations to each token in order to obtain a new labelling function g: token → tag.

E.g. what I have in input

text  = "Huggingface is amazing"
labels = [B_org, O, O]

what I need to produce, if the tokenizer output is ["Hugging#", "face", "is", "amazing"], is

labels_per_tokens = [B_org, I_org, O, O]

To do so, I need to backtrack, for every token produced by the tokenizer, to the original word / annotation I got in input, but that seems not so easy to do (especially with [UNK] tokens). Am I missing something obvious? Is there a good practice or an existing solution to my problem?
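In case it helps frame the problem: with a fast (Rust-backed) tokenizer, each token can be mapped back to the index of the word it came from (this is what HF fast tokenizers expose as `word_ids()`), and the alignment then becomes mechanical. Here is a minimal pure-Python sketch, assuming that token → word-index mapping is already available; the function name and the `"O"` fallback for special tokens are illustrative choices, not a fixed API:

```python
# Map word-level IOB tags to token-level tags, given for each token the
# index of the word it came from (None for special tokens like [CLS]).
def align_labels(word_labels, word_ids):
    token_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: no source word, labelled "O" in this sketch.
            token_labels.append("O")
        elif word_id != previous_word:
            # First sub-token of a word keeps the original tag.
            token_labels.append(word_labels[word_id])
        else:
            # Continuation sub-token: B_xxx becomes I_xxx, others stay.
            tag = word_labels[word_id]
            token_labels.append("I" + tag[1:] if tag.startswith("B") else tag)
        previous_word = word_id
    return token_labels

# Toy example mirroring the question: "Huggingface is amazing"
word_labels = ["B_org", "O", "O"]
word_ids = [0, 0, 1, 2]  # tokens: ["Hugging#", "face", "is", "amazing"]
print(align_labels(word_labels, word_ids))  # ['B_org', 'I_org', 'O', 'O']
```

This also sidesteps the `[UNK]` issue: an unknown token still carries the index of the word it replaced, so the tag transfers the same way.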

Thanks a lot for your help !


Hey @Thrix, you can see how to align the NER tags with the tokens in the tokenize_and_align_labels function in this tutorial:


Thanks a lot for the link! That sounds like exactly what I’m looking for; I will check the tutorial!


For a word that is divided into multiple tokens by a BPE- or SentencePiece-style model, you use the first token as the reference token that you want to predict. Since all the tokens are connected via self-attention, there is no problem with not predicting the rest of the sub-tokens of a word. In PyTorch, you can skip computing the loss for those tokens (see the ignore_index argument) by giving them the label -100 (life is so easy with PyTorch :wink: ).
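The trick above can be sketched in plain Python: keep the label id on the first sub-token of each word and mask everything else with -100, which is the default ignore_index of PyTorch's CrossEntropyLoss. The function name and the toy label ids here are illustrative; only the -100 convention comes from PyTorch:

```python
# Label only the first sub-token of each word; mask continuations and
# special tokens with -100 so CrossEntropyLoss ignores them by default.
IGNORE = -100

def mask_subword_labels(word_label_ids, word_ids):
    labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token or continuation sub-token: excluded from the loss.
            labels.append(IGNORE)
        else:
            labels.append(word_label_ids[word_id])
        previous_word = word_id
    return labels

# Label ids: B_org=1, O=0;
# tokens: ["[CLS]", "Hugging#", "face", "is", "amazing", "[SEP]"]
print(mask_subword_labels([1, 0, 0], [None, 0, 0, 1, 2, None]))
# -> [-100, 1, -100, 0, 0, -100]
```

Feeding such a label tensor to `torch.nn.CrossEntropyLoss()` (or to a Trainer that uses it) then computes the loss only over the first sub-token of each word.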

Thanks for the trick! Indeed, that’s also a very reasonable way to go.