Punctuation and Spaces in RoBERTa Tokenizer for NER with Pre-tokenized Data

Hi,
I’m trying to use RoBERTa for NER on the CoNLL-2003 dataset.
My question is about how the tokenizer is defined and used in run_ner.py.
Here is an example output from that exact tokenizer:
Input: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
Output: ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.']
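
For reference, this is roughly how I’m calling it (a minimal sketch, assuming roberta-base with add_prefix_space=True, which as far as I can tell is what run_ner.py passes for RoBERTa when the input is already split into words):

```python
from transformers import AutoTokenizer

# Assumption: roberta-base, with add_prefix_space=True as run_ner.py sets for
# RoBERTa when the data is pre-tokenized.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

words = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
         'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']

# add_special_tokens=False only to match the token list above (no <s>/</s>).
encoding = tokenizer(words, is_split_into_words=True,
                     return_offsets_mapping=True, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```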

The RoBERTa tokenizer outputs different tokens for punctuation marks depending on whether there is a space before them. I think it would have made more sense if the token for “.” were ‘.’ instead of ‘Ġ.’ (though the output offset_mapping should stay the same).
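
A quick way to see what I mean (a small sketch, again assuming roberta-base): the same punctuation mark maps to a different token, and therefore a different vocabulary id, when a space precedes it.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Same character, two different vocabulary entries depending on the
# preceding space.
print(tokenizer.tokenize("."))   # ['.']
print(tokenizer.tokenize(" ."))  # ['Ġ.']
print(tokenizer.convert_tokens_to_ids(['.', 'Ġ.']))  # two different ids
```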

My question is: why is this difference ignored in run_ner.py?

Thanks