NER model fine-tuning with labeled spans


I’m looking to fine-tune an NER model (dslim/bert-base-NER-uncased) with my own data.

My annotations look like this: for each example I have a piece of raw text (str) and a list of annotated spans, each of the form {start_index: int, end_index: int, tag: str}

However, to fine-tune the NER model, I need to prepare X (tokens) and Y (token tags) for each example. So those spans have to be translated into token-level tags that match the model’s tokenizer.

Hope that makes sense.

Is there a way to handle this? Or, what would you recommend?


Hi folks! Does this make sense?


Yes, that’s the right approach. Take a look at spaCy’s offsets_to_biluo_tags function. It converts character-level annotations into token-level annotations (in the BILOU format, which is a bit more exotic than IOB).

Thank you @nielsr!

The problem is that offsets_to_biluo_tags operates on a spaCy Doc, so it relies on spaCy’s tokenizer, which is different from the Hugging Face BERT tokenizer. So don’t I need something that uses HF’s BERT tokenizer to go from spans to token labels?
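One option is to skip spaCy entirely: fast Hugging Face tokenizers can return the character offsets of every token via `return_offsets_mapping=True`, so you can align your character-level spans directly against the model's own tokenization. Below is a minimal sketch; the function name `spans_to_iob` and the hard-coded offsets are mine, and it assumes `end_index` is exclusive. In practice the offsets would come from `tokenizer(text, return_offsets_mapping=True)["offset_mapping"]`.

```python
def spans_to_iob(offsets, spans):
    """Map character-level spans onto IOB tags, one per token.

    offsets: list of (char_start, char_end) per token, e.g. from a fast HF
             tokenizer called with return_offsets_mapping=True.
    spans:   list of dicts {"start_index", "end_index", "tag"},
             with end_index exclusive (an assumption on my part).
    """
    tags = ["O"] * len(offsets)
    for span in spans:
        s, e, label = span["start_index"], span["end_index"], span["tag"]
        first = True
        for i, (tok_s, tok_e) in enumerate(offsets):
            if tok_s == tok_e:
                continue  # special tokens like [CLS]/[SEP] get (0, 0)
            if tok_s < e and tok_e > s:  # token overlaps this span
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Hypothetical offsets for "tom lives in Los Angeles",
# with (0, 0) entries standing in for [CLS] and [SEP]:
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (13, 16), (17, 24), (0, 0)]
spans = [
    {"start_index": 0, "end_index": 3, "tag": "PER"},
    {"start_index": 13, "end_index": 24, "tag": "LOC"},
]
print(spans_to_iob(offsets, spans))
# ['O', 'B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O']
```

Because the alignment is done against the tokenizer's own offsets, any subword splitting the BERT tokenizer does is handled for free: each subword piece inside a span overlaps it and gets tagged.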

You can convert the spans to BIOES tags (other tag schemes are similar) like this:

def convert_ents_to_bio(text, ents):
    # text is a list of tokens; ent indices are token-level and inclusive
    tags = ["O"] * len(text)
    for ent in ents:
        start, end, label = ent["start"], ent["end"], ent["label"]
        if start == end:
            tags[start] = "S-" + label  # single-token entity
        else:
            tags[start] = "B-" + label
            tags[start + 1 : end + 1] = ["I-" + label] * (end - start)
    return tags

text = ["tom", "lives", "in", "Los", "Angeles"]
ents = [
    {"start": 0, "end": 0, "label": "PERSON"},
    {"start": 3, "end": 4, "label": "ADDRESS"},
]
convert_ents_to_bio(text, ents)
# ['S-PERSON', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS']
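Once you have word-level tags like these, you still need to spread them over the BERT tokenizer's subword pieces. Fast tokenizers expose `word_ids()` for this when the input is pre-split and passed with `is_split_into_words=True`. Here's a sketch of the usual alignment (the helper name and the hard-coded `word_ids` list are mine, standing in for the real tokenizer output): label only the first subword of each word and mask everything else with -100 so it's ignored by the loss.

```python
def align_tags_to_subwords(word_tags, word_ids, label2id):
    """Expand word-level tags to one label id per subword token.

    word_ids: one entry per subword token, e.g. from
              tokenizer(tokens, is_split_into_words=True).word_ids();
              None marks special tokens, repeats mark continuation pieces.
    """
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            labels.append(-100)  # -100 is ignored by the cross-entropy loss
        else:
            labels.append(label2id[word_tags[wid]])
        previous = wid
    return labels


word_tags = ["S-PERSON", "O", "O", "B-ADDRESS", "I-ADDRESS"]
# Made-up word_ids: say "tom" splits into two pieces, plus [CLS]/[SEP]:
word_ids = [None, 0, 0, 1, 2, 3, 4, None]
label2id = {"O": 0, "S-PERSON": 1, "B-ADDRESS": 2, "I-ADDRESS": 3}
print(align_tags_to_subwords(word_tags, word_ids, label2id))
# [-100, 1, -100, 0, 0, 2, 3, -100]
```

Note that the B-/I-/S- prefixes stay on the first subword only; an alternative convention labels continuation pieces with the I- tag instead of -100, which is a design choice rather than a requirement.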