Hello everyone, I am trying to understand how to use the tokenizers in a NER context.
Basically, I have a text corpus with entity annotations, usually in IOB format [1], which can be seen as a mapping f: word → tag (annotators work on non-tokenized text and we ask them to annotate entire words).
When I use any modern tokenizer, I will typically get several tokens for a single word (for instance, "huggingface" might produce something like ["hugging", "##face"]). I need to transfer the original annotations to each token in order to get a new labelling function g: token → tag.
E.g. what I have as input:

```
text = "Huggingface is amazing"
labels = ["B-ORG", "O", "O"]
```

What I need to produce, if the tokenizer output is `["Hugging", "##face", "is", "amazing"]`, is:

```
labels_per_tokens = ["B-ORG", "I-ORG", "O", "O"]
```
To do so, I need to backtrack, for every token produced by the tokenizer, which original word / annotation it came from, but this does not seem so easy to do (especially with [UNK] tokens). Am I missing something obvious? Is there a good practice or an existing solution for this problem?
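For reference, here is roughly the kind of alignment I am trying to implement. It is only a sketch and assumes a "fast" tokenizer whose encodings expose `word_ids()`; the checkpoint name and tag names are just examples:

```python
# A minimal sketch of the alignment I am after, assuming a "fast" tokenizer
# (backed by the tokenizers library) whose encodings expose word_ids().
# The checkpoint name is just an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Huggingface", "is", "amazing"]
labels = ["B-ORG", "O", "O"]  # one IOB tag per word, from the annotators

# is_split_into_words=True tells the tokenizer the input is already split into words
encoding = tokenizer(words, is_split_into_words=True)

labels_per_tokens = []
previous_word_idx = None
for word_idx in encoding.word_ids():
    if word_idx is None:
        # special tokens like [CLS] and [SEP] map to no word
        labels_per_tokens.append("O")
    elif word_idx != previous_word_idx:
        # first sub-token of a word keeps the word's original tag
        labels_per_tokens.append(labels[word_idx])
    else:
        # continuation sub-token: B-X becomes I-X, everything else is kept
        tag = labels[word_idx]
        labels_per_tokens.append(("I-" + tag[2:]) if tag.startswith("B-") else tag)
    previous_word_idx = word_idx

print(encoding.tokens())     # e.g. ['[CLS]', 'Hugging', '##face', 'is', 'amazing', '[SEP]']
print(labels_per_tokens)     # e.g. ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']
```

If something like this is the intended approach, I suppose it would also sidestep the [UNK] issue, since `word_ids()` would be tracked by the tokenizer itself rather than recovered from the token strings.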
Thanks a lot for your help!
[1] https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)