RoBERTa Tokenizer supported characters


I was wondering which characters are supported by the RoBERTa Tokenizer.

When I tokenize “ä”, for example, I get two Unicode characters back.

>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
>>> tokens = tokenizer.tokenize("ä")
>>> tokens
['Ã¤']
>>> len(tokens[0])
2
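For what it's worth, my guess is that this is related to the UTF-8 byte encoding rather than to the character being unsupported, since RoBERTa's tokenizer works on bytes:

```python
# "ä" is one Unicode code point but two bytes in UTF-8; I assume this is
# why the byte-level tokenizer gives me back two characters for it.
text = "ä"
print(len(text))                  # 1 code point
print(len(text.encode("utf-8")))  # 2 bytes
```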

I am currently building an NER application and therefore need to reconstruct the original character spans of entities. I don't really mind that a character gets changed, as long as the number of characters stays the same. My idea was therefore to replace all unsupported characters with some supported character that does not otherwise occur in the data. To do so, I would need the set of supported characters.
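A minimal sketch of what I have in mind, assuming for illustration that printable ASCII is the supported set (with RoBERTa I would instead build the set from the tokenizer's vocabulary) and that "x" does not occur in my data:

```python
import string

# Assumed supported set for illustration only; the real set would come
# from the tokenizer's vocabulary.
SUPPORTED = set(string.printable)
PLACEHOLDER = "x"  # assumed not to occur in the data

def replace_unsupported(text: str) -> str:
    # Swap each unsupported character for a single placeholder character,
    # so the string length (and hence the entity spans) is preserved.
    return "".join(c if c in SUPPORTED else PLACEHOLDER for c in text)

print(replace_unsupported("Zürich"))  # "Zxrich", same length as the input
```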

Thanks in advance.
If I missed something and there is an easier approach to the problem, feel free to correct me.
