I have a trained RoBERTa model with a Byte Level BPE Encoding algorithm, which I want to benchmark on a custom NER dataset.
Each sample looks as follows:
Text: John is playing football
Labels: B-PER O O O
The text can be run through the tokenizer to generate subword tokens. However, the number of subword tokens may differ from the number of word-level tokens, and I don't know how to align the labels accordingly.
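A common approach (not the only one) is to use the `word_ids()` method that Hugging Face fast tokenizers expose on their output: it returns, for each subword token, the index of the word it came from (or `None` for special tokens like `<s>`/`</s>`). You can then give the word's label to its first subword and mark the remaining subwords with `-100`, which `CrossEntropyLoss` ignores by default. Below is a minimal sketch of that alignment step; the `word_ids` list is hand-written to simulate what a byte-level BPE tokenizer might produce for "John is playing football" (assuming "football" splits into two pieces), since the exact split depends on your tokenizer:

```python
def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level NER labels onto subword tokens.

    word_ids:     one entry per subword token, as returned by a fast
                  tokenizer's BatchEncoding.word_ids(); None marks
                  special tokens.
    word_labels:  one label per original whitespace-delimited word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens get -100 so the loss ignores them.
            aligned.append(-100)
        elif wid != previous:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[wid])
        elif label_all_subwords:
            # Optionally label continuation subwords too,
            # converting B- to I- so spans stay well-formed.
            lab = word_labels[wid]
            aligned.append("I-" + lab[2:] if lab.startswith("B-") else lab)
        else:
            # Otherwise ignore continuation subwords in the loss.
            aligned.append(-100)
        previous = wid
    return aligned

# Simulated word_ids: <s>, John, is, playing, foot+ball, </s>
word_ids = [None, 0, 1, 2, 3, 3, None]
labels = ["B-PER", "O", "O", "O"]
print(align_labels(word_ids, labels))
# → [-100, 'B-PER', 'O', 'O', 'O', -100, -100]
```

For benchmarking you would typically evaluate only at positions where the aligned label is not `-100` (i.e. one prediction per original word), so the metric is computed against the same word-level labels you started with.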