I was wondering which characters are supported by the RoBERTa Tokenizer.
When I tokenize “ä” for example, I get 2 unicode characters back.
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
>>> tokens = tokenizer.tokenize("ä")
>>> tokens
['Ã¤']
>>> len(tokens[0])
2
I am currently building an NER application and therefore need to reconstruct the original character spans of entities. I don't really mind that the character is changed, as long as the number of characters stays the same. My idea was therefore to replace all unsupported characters with some supported character that does not otherwise occur in the data. To do that, I would need the set of supported characters.
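For context, my understanding is that RoBERTa uses GPT-2's byte-level BPE: the input is first encoded as UTF-8 bytes, and each byte is mapped to a printable unicode character before BPE runs. The sketch below reimplements that byte-to-unicode map (adapted from the GPT-2 logic as I understand it, not copied from the library) to show why "ä", a two-byte character in UTF-8, comes back as the two characters "Ã¤" — and why, at the byte level, every character is representable:

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character.

    Printable bytes map to themselves; the rest are shifted into the
    256+ range so no byte is ever lost. This is the scheme GPT-2-style
    byte-level BPE tokenizers use, as far as I can tell.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


mapping = bytes_to_unicode()
# "ä" is b'\xc3\xa4' in UTF-8; byte 0xC3 maps to 'Ã' and 0xA4 to '¤'.
encoded = "".join(mapping[b] for b in "ä".encode("utf-8"))
print(encoded)  # Ã¤
```

If this is right, there is no set of "unsupported" characters to substitute for — every character survives, but multi-byte characters expand to several byte-characters, which is exactly what breaks the one-character-per-character assumption.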
If I missed something and there is an easier approach to the problem, feel free to correct me.

Thanks in advance.