Generating NER data samples for a RoBERTa model

I have a trained RoBERTa model that uses a byte-level BPE tokenizer, and I want to benchmark it on a custom NER dataset.
Each sample looks like this:

Text: John is playing football
Labels: B-PER O O O

The text can be run through the tokenizer to generate subword tokens. However, the number of subword tokens may differ from the number of word-level tokens, and I don’t know how to align the labels accordingly.
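
For example, with a fast tokenizer the mismatch looks like this (the checkpoint below is just a placeholder for my own model):

```python
from transformers import RobertaTokenizerFast

# Placeholder checkpoint; in practice this would be my own trained tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

words = ["John", "is", "playing", "football"]
labels = ["B-PER", "O", "O", "O"]

encoding = tokenizer(words, is_split_into_words=True)
subwords = tokenizer.convert_ids_to_tokens(encoding["input_ids"])

# 4 word-level labels, but more subword tokens (including <s> and </s>).
print(len(labels), len(subwords), subwords)
```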

cc @stefan-it

Hello, is this topic still relevant?

On Sunday, September 6, 2020 at 15:26, Suraj Patil via Hugging Face Forums <hellohellohello@discoursemail.com> wrote:

Maybe something in here can help you:

https://github.com/huggingface/transformers/tree/master/examples/token-classification
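
As a starting point, here is a minimal alignment sketch, assuming a fast tokenizer so that `word_ids()` is available; the checkpoint name is a placeholder, `-100` is the usual ignore index for the loss, and masking non-first subwords is just one common choice:

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

words = ["John", "is", "playing", "football"]
labels = ["B-PER", "O", "O", "O"]

encoding = tokenizer(words, is_split_into_words=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        # Special tokens (<s>, </s>) get the ignore index.
        aligned_labels.append(-100)
    elif word_id != previous_word_id:
        # The first subword of each word keeps that word's label.
        aligned_labels.append(labels[word_id])
    else:
        # Remaining subwords are masked out of the loss.
        aligned_labels.append(-100)
    previous_word_id = word_id

print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), aligned_labels)))
```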
