I am using the roberta-base tokenizer, `tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)`, which was trained on English data, to tokenize Bengali just to see how it behaves. When I try to encode a Bengali character with `tokenizer.encode('বা')`, I get `[0, 1437, 35861, 11582, 35861, 4726, 2]`, which means it finds some tokens in its vocabulary that match Bengali characters even though it was trained on English. On further exploration, I find these are all special characters: `['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']`.
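For reference, here is a minimal snippet that reproduces this (the printed values are what I get on my machine):

```python
from transformers import RobertaTokenizerFast

# Load the English-only roberta-base tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

# Encode a single Bengali character
ids = tokenizer.encode('বা')
print(ids)
# [0, 1437, 35861, 11582, 35861, 4726, 2]

# Map the ids back to the tokens they correspond to
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Ġ', 'à¦', '¬', 'à¦', '¾', '</s>']
```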
My question is: why does this happen? Isn't the tokenizer supposed to output unknown tokens when applied to a new language? Any help is greatly appreciated.
Hey @Sam2021, did you find anything on why the tokenizer behaves like this? Would love to know if you have an update on this.