Hi,
I was wondering which characters are supported by the RoBERTa Tokenizer.
When I tokenize "À", for example, I get back a token that is 2 unicode characters long.
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
>>> tokens = tokenizer.tokenize("À")
>>> tokens
['ÃĢ']
>>> len(tokens[0])
2
>>>
I am currently building an NER application, so I need to reconstruct the original character spans of entities. I don't really mind that a character gets changed, as long as the number of characters stays the same. My idea was therefore to replace all unsupported characters with some supported character that does not otherwise occur in the data (sketched below). To do that, I would need the set of supported characters.
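Something like this is what I have in mind. Just a rough sketch: `mask_unsupported`, the placeholder character, and the "single character round-trips to a single mapped character" check are my own, not anything official from the tokenizer, and calling tokenize() per character would be slow on real data.

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

PLACEHOLDER = "?"  # any character known not to occur in the data otherwise

def mask_unsupported(text: str) -> str:
    """Replace every character whose byte-level encoding expands to more
    than one mapped character, so the text length (and entity spans)
    stay unchanged. The 1:1 round-trip check is my own heuristic."""
    out = []
    for ch in text:
        # tokenize() returns the byte-level BPE pieces; a plain ASCII
        # character comes back as a single 1-character piece, while a
        # multi-byte character like "À" expands to 2+ mapped characters.
        pieces = tokenizer.tokenize(ch)
        if len("".join(pieces)) == 1:
            out.append(ch)
        else:
            out.append(PLACEHOLDER)
    return "".join(out)

print(mask_unsupported("Àbc"))  # -> "?bc"
```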
Thanks in advance.
If I missed something, and there is an easier approach to the problem, feel free to correct me.