RoBERTa Tokenizer supported characters

ArsenieBoca · December 24, 2020, 3:50pm

Hi,

I was wondering which characters are supported by the RoBERTa Tokenizer.

When I tokenize “ä”, for example, I get back a single token that is 2 Unicode characters long.

>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
>>> tokens = tokenizer.tokenize("ä")
>>> tokens
['Ã¤']
>>> len(tokens[0])
2

I am currently building an NER application and therefore need to reconstruct the original character span of each entity. I don't really mind that the character is changed, as long as the number of characters stays the same. My idea was therefore to replace every unsupported character with some supported character that does not otherwise occur in the data. To do so, I would need the set of supported characters.

Thanks in advance.
If I missed something or there is an easier approach to the problem, feel free to correct me.
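
A note on the length of 2 in the example above: RoBERTa uses a byte-level BPE vocabulary (the same scheme as GPT-2), so the tokenizer works on UTF-8 bytes and displays each byte as one printable character. Under that scheme there is no set of unsupported characters; “ä” is two bytes in UTF-8 and so shows up as two characters. The sketch below is not the library's code. It rebuilds the byte-to-character table with the standard library only, and `to_byte_level` and `char_spans` are hypothetical helper names. `char_spans` assumes that the tokens are non-empty and that, concatenated, they cover the UTF-8 bytes of the text exactly (no special tokens, no added prefix space).

```python
def bytes_to_unicode():
    """Byte -> printable character table, following the GPT-2 byte-level BPE scheme."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~' (33-126), U+00A1..U+00AC (161-172), U+00AE..U+00FF (174-255).
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in keep}
    # Every other byte (controls, space, ...) is shifted to code points >= 256.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Show text the way a byte-level BPE vocabulary displays it (hypothetical helper)."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def char_spans(text, tokens):
    """Map byte-level tokens to (start, end) character offsets in text.

    Assumption: tokens are non-empty and, concatenated, cover the UTF-8
    bytes of `text` exactly. One displayed character equals one byte.
    """
    # For every byte position, the index of the character that byte belongs to.
    byte_to_char_idx = []
    for i, ch in enumerate(text):
        byte_to_char_idx.extend([i] * len(ch.encode("utf-8")))

    spans, pos = [], 0
    for tok in tokens:
        n_bytes = len(tok)
        start = byte_to_char_idx[pos]
        end = byte_to_char_idx[pos + n_bytes - 1] + 1  # exclusive end
        spans.append((start, end))
        pos += n_bytes
    return spans


if __name__ == "__main__":
    print(to_byte_level("ä"), len(to_byte_level("ä")))
    # The token split below is hand-written for illustration,
    # not actual roberta-base output.
    text = "Zürich ä"
    tokens = ["Z", to_byte_level("ü"), "rich", to_byte_level(" ä")]
    print(char_spans(text, tokens))
```

With `transformers` installed, a fast tokenizer such as `RobertaTokenizerFast` can return character offsets directly when called with `return_offsets_mapping=True`, which may make the replacement step unnecessary.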
