I am using the roberta-base tokenizer, `tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)`, which was trained on English data, to tokenize Bengali just to see how it behaves. When I try to encode a Bengali character with `tokenizer.encode('বা')`, I get `[0, 1437, 35861, 11582, 35861, 4726, 2]`, which means it finds some tokens in its vocabulary that match Bengali characters even though it was trained on English. On further exploration, I find these are all special characters: `['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']`.
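For reference, here is a minimal snippet that reproduces this (the printed values are what I get on my machine):

```python
from transformers import RobertaTokenizerFast

# Load the English-only roberta-base tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

# Encode a single Bengali character
ids = tokenizer.encode('বা')
print(ids)
# [0, 1437, 35861, 11582, 35861, 4726, 2]

# Map the ids back to the tokens they correspond to
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']
```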
My question is: why does this happen? Isn't the tokenizer supposed to output unknown tokens when applied to a new language? Any help is greatly appreciated.
Hey @Sam2021, did you find anything on why the tokenizer behaves like this? Would love to know if you have an update on this.