Inconsistent, mojibaked tokenization in some but not all Huggingface tokenizers

Hi all. I’m trying to measure subword overlap between languages. To that end, I have a set of large corpora C in a set of languages L:

## Corpus example: 
## {'en':
##      "And verily thy Lord is He, the Exalted in Might Most Merciful." They said, \'Peace!\' He is the Most G.",
##  'nl':
##      "'En voorwaar, uw Heer is de Machtige, de Genadevolle. Zij zegden: "Heilig zijt Gij. Voorzeker, Hij is.'"
##   etc.
## }

I would like to calculate subword overlap for various models M, specifically:

MODEL_NAMES = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Ministral-8B-Instruct-2410",
    "CohereForAI/aya-expanse-8b",
    "google/gemma-2-9b-it"
]

My list of languages L is a mixture of Latin-script and non-Latin-script languages:

selected_languages = ['en', 'nl', 'fr', 'it', 'de', 'es', 'pt', 'ru', 'zh', 'ja']
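
For concreteness, the overlap I have in mind is roughly the following (a simplified sketch: Jaccard overlap over sets of token strings, with corpus being a dict like the example above; the real corpora are of course much larger):

from transformers import AutoTokenizer

def subword_overlap(tokenizer, text_a, text_b):
    # Jaccard overlap between the sets of subword tokens used by two texts
    tokens_a = set(tokenizer.tokenize(text_a.lower()))
    tokens_b = set(tokenizer.tokenize(text_b.lower()))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# e.g. English vs. Dutch overlap for the first model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
print(subword_overlap(tokenizer, corpus["en"], corpus["nl"]))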

When I tokenize using:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized = tokenizer.tokenize(text.lower())

I get neatly tokenized corpora for all the Latin-script languages:

['and', 'Ġver', 'ily', 'Ġthy', 'Ġlord', 'Ġis', 'Ġhe', ',', 'Ġthe', 'Ġex', 'alted', 'Ġin', 'Ġmight', 'Ġmost', ...]

But for Chinese, Japanese and Russian, I get weird, mojibaked tokens. Example of the Chinese output:

['éĢĻ', 'æĺ¯', 'ä½ł', 'åĢij', 'çļĦ', '主', 'æīĢ', 'éĻį', '示', 'çļĦ', 'æ¸Ľ', 'è¼ķ', 'åĴĮ', 'æħ', ...]

The only model that doesn’t suffer from this behaviour is gemma-2-9b-it:

['這是', '你們', '的主', '所', '降', '示', '的', '減', '輕', '和', '慈', '恩', '。', '▁以', ...]

I need to resolve this mojibake, but I'm not certain how. I suspect it has to do with encoding: UTF-8 is what I want, but even when I explicitly enforce UTF-8 I still get mojibaked tokens. The fact that Gemma tokenizes the same corpus text cleanly suggests that the issue lies in the tokenization process rather than in my corpus's encoding.
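
For context, "enforcing UTF-8" on my side just means reading the corpus files with an explicit encoding, roughly like this (the file names are placeholders):

corpus = {}
for lang in selected_languages:
    # read each corpus file explicitly as UTF-8
    with open(f"corpus_{lang}.txt", encoding="utf-8") as f:
        corpus[lang] = f.read()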

Perhaps there is a preprocessing step I’m missing for these non-Gemma tokenizers?

Any help would be much appreciated! Many thanks :hugs:

That’s interesting. It doesn’t seem to affect the final decoding, so it’s unlikely to cause any fatal problems. However, depending on how well the model knows a language, the tokenizer may segment that language into tokens more or less well, which can affect how the model learns it.

from transformers import AutoTokenizer

texts = ["日本語です", "這是你們的主", "This is English"]
models = ["Sakalti/SabaMath1.5-pro", "Qwen/Qwen2.5-3B-Instruct", "Sakalti/Llama3.2-3B-Uranus-1", "AXCXEPT/Borea-Phi-3.5-mini-Instruct-Common"]
for model in models:
    tokenizer = AutoTokenizer.from_pretrained(model)
    for text in texts:
        tokenized = tokenizer.tokenize(text)  # token strings as stored in the vocab
        encoded = tokenizer.encode(text)  # token ids
        decoded = tokenizer.decode(encoded, skip_special_tokens=True)  # round trip back to text
        print(f'"{model}" text:"{text}", tokenized:"{tokenized}", encoded:"{encoded}, decoded:"{decoded}".')

"Sakalti/SabaMath1.5-pro" text:"日本語です", tokenized:"['æĹ¥æľ¬', 'èªŀ', 'ãģ§ãģĻ']", encoded:"[101059, 102819, 37541], decoded:"日本語です".
"Sakalti/SabaMath1.5-pro" text:"這是你們的主", tokenized:"['éĢĻ', 'æĺ¯ä½ł', 'åĢij', 'çļĦ', '主']", encoded:"[99672, 106753, 100190, 9370, 35568], decoded:"這是你們的主".
"Sakalti/SabaMath1.5-pro" text:"This is English", tokenized:"['This', 'Ġis', 'ĠEnglish']", encoded:"[1986, 374, 6364], decoded:"This is English".
"Qwen/Qwen2.5-3B-Instruct" text:"日本語です", tokenized:"['æĹ¥æľ¬', 'èªŀ', 'ãģ§ãģĻ']", encoded:"[101059, 102819, 37541], decoded:"日本語です".
"Qwen/Qwen2.5-3B-Instruct" text:"這是你們的主", tokenized:"['éĢĻ', 'æĺ¯ä½ł', 'åĢij', 'çļĦ', '主']", encoded:"[99672, 106753, 100190, 9370, 35568], decoded:"這是你們的主".
"Qwen/Qwen2.5-3B-Instruct" text:"This is English", tokenized:"['This', 'Ġis', 'ĠEnglish']", encoded:"[1986, 374, 6364], decoded:"This is English".
"Sakalti/Llama3.2-3B-Uranus-1" text:"日本語です", tokenized:"['æĹ¥æľ¬', 'èªŀ', 'ãģ§ãģĻ']", encoded:"[128000, 102433, 102158, 38641], decoded:"日本語です".
"Sakalti/Llama3.2-3B-Uranus-1" text:"這是你們的主", tokenized:"['éĢĻ', 'æĺ¯', 'ä½ł', 'åĢij', 'çļĦ', '主']", encoded:"[128000, 103864, 21043, 57668, 106310, 9554, 36668], decoded:"這是你們的主".
"Sakalti/Llama3.2-3B-Uranus-1" text:"This is English", tokenized:"['This', 'Ġis', 'ĠEnglish']", encoded:"[128000, 2028, 374, 6498], decoded:"This is English".
"AXCXEPT/Borea-Phi-3.5-mini-Instruct-Common" text:"日本語です", tokenized:"['▁', '日', '本', '語', 'で', 'す']", encoded:"[29871, 30325, 30346, 30968, 30499, 30427], decoded:"日本語です".
"AXCXEPT/Borea-Phi-3.5-mini-Instruct-Common" text:"這是你們的主", tokenized:"['▁', '<0xE9>', '<0x80>', '<0x99>', '是', '你', '<0xE5>', '<0x80>', '<0x91>', '的', '主']", encoded:"[29871, 236, 131, 156, 30392, 30919, 232, 131, 148, 30210, 30888], decoded:"這是你們的主".
"AXCXEPT/Borea-Phi-3.5-mini-Instruct-Common" text:"This is English", tokenized:"['▁This', '▁is', '▁English']", encoded:"[910, 338, 4223], decoded:"This is English".
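
The "mojibake" is just how byte-level BPE tokenizers (Llama 3, Mistral, Qwen, Aya, etc.) store their tokens: every byte of the UTF-8 input is mapped to a printable character via the GPT-2-style byte-to-unicode table, so a character like 這 shows up as 'éĢĻ'. SentencePiece tokenizers such as Gemma's (or Phi-3.5's, with its <0x..> byte-fallback tokens) keep the characters themselves, which is why those look clean. If you want readable per-token strings for your overlap calculation, something along these lines should work (a sketch using convert_tokens_to_string; note that a token covering only part of a multi-byte character cannot be rendered on its own):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokens = tokenizer.tokenize("這是你們的主")
print(tokens)  # ['éĢĻ', 'æĺ¯ä½ł', 'åĢij', 'çļĦ', '主']  <- byte-level vocab strings

# convert_tokens_to_string() runs the tokenizer's decoder, which reverses the
# byte-to-unicode mapping, so each token becomes readable text again.
readable = [tokenizer.convert_tokens_to_string([t]) for t in tokens]
print(readable)  # roughly ['這', '是你', '們', '的', '主']

For the overlap statistics themselves you can also just compare the raw token strings (or token ids) per model, since the byte-to-unicode mapping is deterministic within a tokenizer; the readable form only matters when you want to inspect or report the shared subwords.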