Punctuation and Spaces in RoBERTa Tokenizer for NER with Pre-tokenized Data

Hi,
I’m trying to use RoBERTa for NER on the CoNLL-2003 dataset.
My question is about how the tokenizer is defined and used in run_ner.py.
Here is an example of the output of that exact tokenizer:
Input: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
Output: ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.']

The RoBERTa tokenizer outputs different tokens for punctuation marks depending on whether there is a space before them. I think it would have made more sense if the token for “.” were ‘.’ instead of ‘Ġ.’ (with the output offset_mapping staying the same).
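
To make the expected behaviour concrete, here is a minimal sketch in plain Python that works only on the token strings shown above. The helper name drop_marker_on_punct and the set ATTACHING_PUNCT are hypothetical, and the choice of which marks "attach" to the previous word is an assumption of mine, not something run_ner.py or the tokenizer defines. Since ‘.’ and ‘Ġ.’ are different tokens, a real fix would have to be applied before the tokens are converted to ids; this only shows the intended result.

```python
SPACE_MARKER = "Ġ"  # byte-level BPE marker meaning "a space precedes this token"

# Assumption: these marks are normally written directly after the previous word.
ATTACHING_PUNCT = set(".,;:!?")


def drop_marker_on_punct(tokens):
    """Return a copy of `tokens` with the space marker removed from
    tokens that consist only of attaching punctuation."""
    fixed = []
    for tok in tokens:
        # `body` is the token without its marker, or None if it has no marker.
        body = tok[len(SPACE_MARKER):] if tok.startswith(SPACE_MARKER) else None
        if body and all(ch in ATTACHING_PUNCT for ch in body):
            fixed.append(body)
        else:
            fixed.append(tok)
    return fixed


tokens = ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y',
          'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE',
          'AT', 'Ġ.']
fixed = drop_marker_on_punct(tokens)
print(fixed)
```

With this sketch ‘Ġ,’ and ‘Ġ.’ become ‘,’ and ‘.’, while ‘Ġ-’ is left alone because the hyphen in the example sentence really is preceded by a space. The number of tokens does not change, so the alignment between words and sub-tokens is unaffected.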

My question is: why is this ignored in run_ner.py?

Thanks