Why does a BPE tokenizer trained on English not return unknown tokens when applied to Bengali?

I use the roberta-base tokenizer, `tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)`, which was trained on English data, to tokenize Bengali just to see how it behaves. When I try to encode a Bengali character with `tokenizer.encode('বা')`, I get `[0, 1437, 35861, 11582, 35861, 4726, 2]`, which means it finds tokens in its vocabulary that match Bengali characters even though it was trained on English. On further exploration I found these are all special characters: `['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']`. My question is: why does this happen? Isn't it supposed to output unknown tokens when applied to a new language? Any help greatly appreciated.
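For anyone reading later: roberta-base uses a *byte-level* BPE (the GPT-2 scheme), so the base alphabet is the 256 possible byte values, not characters. Any string, in any script, decomposes into those bytes, so no input can ever fall outside the vocabulary and `<unk>` is essentially unreachable. Each byte is mapped to a printable Unicode character before BPE runs, which is exactly where oddities like `Ġ` (the space marker) and `à¦` come from. Below is a sketch of that byte-to-unicode mapping (the function mirrors the one in the GPT-2/RoBERTa tokenizer code, reproduced here standalone) applied to `' বা'`, the input after `add_prefix_space=True` prepends a space:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character,
    as done in the GPT-2/RoBERTa byte-level BPE tokenizer."""
    # Bytes that are already printable keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...and the rest (control chars, space, etc.) are shifted up past 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

# add_prefix_space=True means the tokenizer actually sees ' বা'.
text = " বা"
mapped = [byte_encoder[b] for b in text.encode("utf-8")]
print(mapped)  # ['Ġ', 'à', '¦', '¬', 'à', '¦', '¾']
```

The space byte (0x20) becomes `Ġ`, and the UTF-8 bytes of `ব` (E0 A6 AC) and `া` (E0 A6 BE) become `à ¦ ¬` and `à ¦ ¾`. BPE then merges adjacent symbols it has seen together during training; the roberta-base vocabulary happens to contain the merge `à¦` (those two bytes start every character in the main Bengali block, so the pair is frequent in any stray Bengali text), which yields exactly the `['Ġ', 'à¦', '¬', 'à¦', '¾']` you observed between `<s>` and `</s>`. So the tokens aren't "special characters" matching Bengali; they are the printable stand-ins for the raw UTF-8 bytes.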

Hey @Sam2021, did you find anything on why the tokenizer behaves like this? Would love to know if you have an update on this.