Hi,
I'm trying to use RoBERTa for NER on the CoNLL-2003 dataset.
My question is about how the tokenizer is defined and used in run_ner.py.
An example of an output of that exact tokenizer is:
Input: ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
Output: ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y', 'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE', 'ĠDE', 'FE', 'AT', 'Ġ.']
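To make the role of the "Ġ" marker concrete, here is a minimal sketch in plain Python that regroups the tokens above into the original words. It assumes "Ġ" marks a token that starts a new, space-preceded word; `group_into_words` is a hypothetical helper written for illustration and is not part of transformers or run_ner.py.

```python
# Tokens copied from the output above; "Ġ" marks a token preceded by a space.
tokens = ['ĠSO', 'CC', 'ER', 'Ġ-', 'ĠJ', 'AP', 'AN', 'ĠGET', 'ĠL', 'UCK', 'Y',
          'ĠWIN', 'Ġ,', 'ĠCH', 'INA', 'ĠIN', 'ĠSUR', 'PR', 'ISE',
          'ĠDE', 'FE', 'AT', 'Ġ.']


def group_into_words(tokens, marker="Ġ"):
    """Hypothetical helper: merge subword tokens back into whole words.

    A token starting with `marker` opens a new word; any other token is
    appended to the word currently being built.
    """
    words = []
    for tok in tokens:
        if tok.startswith(marker):
            words.append(tok[len(marker):])  # new word, marker stripped
        elif not words:
            words.append(tok)                # first token without a marker
        else:
            words[-1] += tok                 # continuation of the current word
    return words


words = group_into_words(tokens)
print(words)
```

Under this assumption the final token 'Ġ.' is treated as its own word, which is the behaviour I am asking about below.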
The RoBERTa tokenizer outputs different tokens for punctuation marks depending on whether there is a preceding space. I think it would have made more sense if the token for "." were "." instead of "Ġ." (with the output offset_mapping staying the same).
My question is - why is this ignored in run_ner.py?
Thanks