Punctuation and Spaces in RoBERTa Tokenizer for NER with Pre-tokenized Data

sapirw · January 16, 2022, 10:52pm

Hi,
I’m trying to use RoBERTa for NER on the CoNLL-2003 dataset.
My question is about how the tokenizer is defined and used in run_ner.py.
Here is an example of that exact tokenizer’s output:
Input: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
Output: ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.']
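
To make the pattern in this output easier to see, here is a minimal sketch in plain Python (no tokenizer is loaded; the token list is based on the example above). It assumes that 'Ġ' marks a token that starts after a space, which is how RoBERTa's byte-level BPE vocabulary encodes a leading space. The helper name group_subwords is hypothetical and is not part of run_ner.py or the transformers library.

```python
# Hypothetical helper, for illustration only: regroup subword tokens into
# words by treating the 'Ġ' marker as "this token starts a new word".
MARKER = "Ġ"

def group_subwords(tokens, marker=MARKER):
    """Return a list of subword groups, one group per original word."""
    groups = []
    for tok in tokens:
        if tok.startswith(marker) or not groups:
            groups.append([tok])       # marker present: a new word begins
        else:
            groups[-1].append(tok)     # no marker: continuation of the word
    return groups

def rebuild_word(group, marker=MARKER):
    """Join a group of subwords and drop the leading marker, if any."""
    text = "".join(group)
    return text[len(marker):] if text.startswith(marker) else text

words = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
         'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
tokens = ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK',
          'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE',
          'ĠDE', 'FE', 'AT', 'Ġ.']

groups = group_subwords(tokens)
rebuilt = [rebuild_word(g) for g in groups]

# Every pre-tokenized word, punctuation included, starts with a 'Ġ' token.
print(rebuilt == words)
print([g[0] for g in groups if g[0] in ("Ġ-", "Ġ,", "Ġ.")])
```

In this sketch each input word, including '-', ',' and '.', maps to a group whose first token carries the marker, which is consistent with every pre-tokenized word being treated as if a space preceded it.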

The RoBERTa tokenizer outputs different tokens for punctuation marks depending on whether there is a space before them. I think it would have made more sense if the token for “.” were ‘.’ instead of ‘Ġ.’ (with the output offset_mapping staying the same).

My question is - why is this ignored in run_ner.py?

Thanks
