Hi, I've encountered some odd behavior with ByteLevelBPETokenizer. This publicly available notebook is parameterized to run on two almost identical text files. The first is a transliteration of the Hebrew Bible; the second is the same transliteration with two modifications: the ‘ and ’ characters are replaced by the Hebrew letters ע and א, respectively (the transliteration uses these two apostrophe-like characters to denote those Hebrew consonants).

After training on the first file (tanach_translit_orig), the tokenizer encodes a test sentence into 19 tokens. But when I run the same process on the second file, with the same test sentence modified accordingly (apostrophes replaced by the Hebrew letters), the tokenizer produces only 9 tokens. I assumed ByteLevelBPETokenizer was agnostic to the meanings of the characters it sees, so I can't understand why the results differ between the two experiments. Can anyone shed some light, please?
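In case it helps, here is a minimal stand-alone sketch of the train-then-encode steps the notebook performs. The corpus strings, test words, and vocab_size below are placeholders for illustration, not the notebook's actual data or parameters:

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpora; the real notebook trains on the full
# transliteration files (tanach_translit_orig and the modified variant).
corpus_apostrophe = ["b’reshit bara’ elohim"]   # variant using ‘ / ’
corpus_hebrew = ["bעreshit baraא elohim"]       # variant using ע / א

def count_tokens(corpus, sentence):
    tokenizer = ByteLevelBPETokenizer()
    # train_from_iterator lets us train from in-memory strings
    # instead of files on disk.
    tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=1)
    return len(tokenizer.encode(sentence).tokens)

print(count_tokens(corpus_apostrophe, "b’reshit bara’"))
print(count_tokens(corpus_hebrew, "bעreshit baraא"))
```

With the real files, the first call corresponds to the 19-token result and the second to the 9-token result described above.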
P.S. To toggle between the two files, all you need to do is uncomment the appropriate line in the second step of the notebook.
Thank you in advance,