Hello everyone, I am trying to understand how to use the tokenizers in a NER context.
Basically, I have a text corpus with entity annotations, usually in IOB format [1], which can be seen as a mapping f: word → tag (annotators work on non-tokenized text and we ask them to annotate entire words).
When I use any modern tokenizer, I will typically get several tokens for a single word (for instance, "huggingface" might produce something like ["hugging", "##face"]). I need to transfer the original annotations to each token in order to get a new labelling function g: token → tag.
E.g. what I have as input:

```
text = "Huggingface is amazing"
labels = ["B-ORG", "O", "O"]
```

What I need to produce, if the tokenizer output is `["Hugging", "##face", "is", "amazing"]`, is:

```
labels_per_tokens = ["B-ORG", "I-ORG", "O", "O"]
```
To do so, I need to backtrack, for every token produced by the tokenizer, which original word / annotation it came from, but this does not seem so easy to do (especially with [UNK] tokens). Am I missing something obvious? Is there a good practice or an existing solution for this problem?
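For reference, here is roughly the kind of alignment I am trying to implement. It is only a sketch and assumes a "fast" tokenizer whose encodings expose `word_ids()`; the checkpoint name and tag names are just examples:

```python
# A minimal sketch of the alignment I am after, assuming a "fast" tokenizer
# (backed by the tokenizers library) whose encodings expose word_ids().
# The checkpoint name is just an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Huggingface", "is", "amazing"]
labels = ["B-ORG", "O", "O"]  # one IOB tag per word, from the annotators

# is_split_into_words=True tells the tokenizer the input is already split into words
encoding = tokenizer(words, is_split_into_words=True)

labels_per_tokens = []
previous_word_idx = None
for word_idx in encoding.word_ids():
    if word_idx is None:
        # special tokens like [CLS] and [SEP] map to no word
        labels_per_tokens.append("O")
    elif word_idx != previous_word_idx:
        # first sub-token of a word keeps the word's original tag
        labels_per_tokens.append(labels[word_idx])
    else:
        # continuation sub-token: B-X becomes I-X, everything else is kept
        tag = labels[word_idx]
        labels_per_tokens.append(("I-" + tag[2:]) if tag.startswith("B-") else tag)
    previous_word_idx = word_idx

print(encoding.tokens())     # e.g. ['[CLS]', 'Hugging', '##face', 'is', 'amazing', '[SEP]']
print(labels_per_tokens)     # e.g. ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']
```

If something like this is the intended approach, I suppose it would also sidestep the [UNK] issue, since `word_ids()` would be tracked by the tokenizer itself rather than recovered from the token strings.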
Thanks a lot for your help!
[1] https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)