Hi, I've encountered some odd behavior with ByteLevelBPETokenizer. This publicly available notebook is parameterized to run on two almost identical text files. The first is a transliteration of the Hebrew Bible; the second is the same transliteration with two modifications: the ‘ and ’ characters are replaced by the Hebrew letters ע and א, respectively (the transliteration uses these two apostrophe-like characters to denote those Hebrew consonants).

After training on the first file (tanach_translit_orig), the tokenizer encodes a test sentence into 19 tokens. But when I run the same process on the second file, with the same test sentence modified accordingly (apostrophes replaced by the Hebrew letters), the tokenizer produces only 9 tokens. I assumed ByteLevelBPETokenizer was agnostic to the meanings of the characters it sees, so I can't understand why the results differ between the two experiments. Can anyone shed some light, please?
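In case it helps, here is a minimal stand-alone sketch of the train-then-encode steps the notebook performs. The corpus strings, test words, and vocab_size below are placeholders for illustration, not the notebook's actual data or parameters:

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpora; the real notebook trains on the full
# transliteration files (tanach_translit_orig and the modified variant).
corpus_apostrophe = ["b’reshit bara’ elohim"]   # variant using ‘ / ’
corpus_hebrew = ["bעreshit baraא elohim"]       # variant using ע / א

def count_tokens(corpus, sentence):
    tokenizer = ByteLevelBPETokenizer()
    # train_from_iterator lets us train from in-memory strings
    # instead of files on disk.
    tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=1)
    return len(tokenizer.encode(sentence).tokens)

print(count_tokens(corpus_apostrophe, "b’reshit bara’"))
print(count_tokens(corpus_hebrew, "bעreshit baraא"))
```

With the real files, the first call corresponds to the 19-token result and the second to the 9-token result described above.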
P.S. To toggle between the two files, all you need to do is uncomment the appropriate line in the second step of the notebook.
Thank you in advance,