Hi all. I'm trying to measure subword overlap between languages. To that end, I have a set of large corpora C in a set of languages L:
## Corpus example:
## {'en':
##    'And verily thy Lord is He, the Exalted in Might Most Merciful." They said, "Peace!" He is the Most G.',
##  'nl':
##    'En voorwaar, uw Heer is de Machtige, de Genadevolle. Zij zegden: "Heilig zijt Gij. Voorzeker, Hij is.',
##  etc.
## }
I would like to calculate subword overlap for various models M, specifically:
MODEL_NAMES = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Ministral-8B-Instruct-2410",
    "CohereForAI/aya-expanse-8b",
    "google/gemma-2-9b-it"
]
My list of languages L is a mixture of Latin-script and non-Latin-script languages:
selected_languages = ['en', 'nl', 'fr', 'it', 'de', 'es', 'pt', 'ru', 'zh', 'ja']
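For concreteness, this is roughly what I mean by subword overlap; a minimal sketch using pairwise Jaccard over unique tokens (the exact metric is just one possible choice and not the issue here):

def subword_overlap(tokens_a, tokens_b):
    # Jaccard overlap between two lists of subword tokens (one possible metric).
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not (set_a or set_b):
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)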
When I tokenize using:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized = tokenizer.tokenize(text.lower())
I get neatly tokenized corpora for all the Latin-script languages:
['and', 'Ġver', 'ily', 'Ġthy', 'Ġlord', 'Ġis', 'Ġhe', ',', 'Ġthe', 'Ġex', 'alted', 'Ġin', 'Ġmight', 'Ġmost', ...]
But for Chinese, Japanese and Russian, I get weird mojibake-like tokens. Example of the Chinese output:
['éĢĻ', 'æĺ¯', 'ä½ł', 'åĢij', 'çļĦ', '主', 'æīĢ', 'éĻį', '示', 'çļĦ', 'æ¸Ľ', 'è¼ķ', 'åĴĮ', 'æħ', ...]
The only model that doesn't suffer from this behaviour is gemma-2-9b-it:
['這是', '你們', '的主', '所', '降', '示', '的', '減', '輕', '和', '慈', '恩', '。', '▁以', ...]
I need to resolve this mojibake-like behaviour, but I'm not sure how. I suspect it's encoding-related: UTF-8 is what I want, yet even when I enforce UTF-8 while reading the corpora, I still get these garbled-looking tokens. The fact that Gemma tokenizes the same corpus text cleanly suggests the issue lies in the tokenization process rather than in my corpus' encoding.
Perhaps there is a preprocessing step I’m missing for these non-Gemma tokenizers?
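For reference, here is a minimal, self-contained version of what I'm running; the load_corpus helper and the corpora/{lang}.txt paths are placeholders for my actual setup, and UTF-8 is already enforced when reading the files:

from itertools import combinations

from transformers import AutoTokenizer

MODEL_NAMES = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Ministral-8B-Instruct-2410",
    "CohereForAI/aya-expanse-8b",
    "google/gemma-2-9b-it"
]
selected_languages = ['en', 'nl', 'fr', 'it', 'de', 'es', 'pt', 'ru', 'zh', 'ja']

def load_corpus(lang):
    # Placeholder path layout; UTF-8 is enforced explicitly here.
    with open(f"corpora/{lang}.txt", encoding="utf-8") as f:
        return f.read()

corpus = {lang: load_corpus(lang) for lang in selected_languages}

for model_name in MODEL_NAMES:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Unique subword tokens per language under this model's tokenizer.
    token_sets = {lang: set(tokenizer.tokenize(text.lower()))
                  for lang, text in corpus.items()}
    for lang_a, lang_b in combinations(selected_languages, 2):
        overlap = (len(token_sets[lang_a] & token_sets[lang_b])
                   / len(token_sets[lang_a] | token_sets[lang_b]))
        print(f"{model_name}  {lang_a}-{lang_b}: {overlap:.3f}")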
Any help would be much appreciated! Many thanks