Hi all. I'm trying to measure subword overlap between languages. To that end, I have a set of large corpora C in a set of languages L:
## Corpus example:
## {'en':
##    'And verily thy Lord is He, the Exalted in Might Most Merciful." They said, "Peace!" He is the Most G.',
##  'nl':
##    'En voorwaar, uw Heer is de Machtige, de Genadevolle. Zij zegden: "Heilig zijt Gij. Voorzeker, Hij is.',
##  etc.
## }
I would like to calculate subword overlap for various models M, specifically:
MODEL_NAMES = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Ministral-8B-Instruct-2410",
    "CohereForAI/aya-expanse-8b",
    "google/gemma-2-9b-it"
]
My list of languages L is a mixture of Latin-script and non-Latin-script languages:
selected_languages = ['en', 'nl', 'fr', 'it', 'de', 'es', 'pt', 'ru', 'zh', 'ja']
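For concreteness, this is roughly what I mean by subword overlap; a minimal sketch using pairwise Jaccard over unique tokens (the exact metric is just one possible choice and not the issue here):

def subword_overlap(tokens_a, tokens_b):
    # Jaccard overlap between two lists of subword tokens (one possible metric).
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not (set_a or set_b):
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)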
When I tokenize using:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized = tokenizer.tokenize(text.lower())
I get neatly tokenized corpora for all the Latin-script languages:
['and', 'Ġver', 'ily', 'Ġthy', 'Ġlord', 'Ġis', 'Ġhe', ',', 'Ġthe', 'Ġex', 'alted', 'Ġin', 'Ġmight', 'Ġmost', ...]
But for Chinese, Japanese and Russian, I get weird mojibake-like tokens. Example of the Chinese output:
['éĢĻ', 'æĺ¯', 'ä½ł', 'åĢij', 'çļĦ', '主', 'æīĢ', 'éĻį', '示', 'çļĦ', 'æ¸Ľ', 'è¼ķ', 'åĴĮ', 'æħ', ...]
The only model that doesn't suffer from this behaviour is gemma-2-9b-it:
['這是', '你們', '的主', '所', '降', '示', '的', '減', '輕', '和', '慈', '恩', '。', '▁以', ...]
I need to resolve this mojibake-like behaviour, but I'm not sure how. I suspect it's encoding-related: UTF-8 is what I want, yet even when I enforce UTF-8 while reading the corpora, I still get these garbled-looking tokens. The fact that Gemma tokenizes the same corpus text cleanly suggests the issue lies in the tokenization process rather than in my corpus' encoding.
Perhaps there is a preprocessing step I’m missing for these non-Gemma tokenizers?
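For reference, here is a minimal, self-contained version of what I'm running; the load_corpus helper and the corpora/{lang}.txt paths are placeholders for my actual setup, and UTF-8 is already enforced when reading the files:

from itertools import combinations

from transformers import AutoTokenizer

MODEL_NAMES = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Ministral-8B-Instruct-2410",
    "CohereForAI/aya-expanse-8b",
    "google/gemma-2-9b-it"
]
selected_languages = ['en', 'nl', 'fr', 'it', 'de', 'es', 'pt', 'ru', 'zh', 'ja']

def load_corpus(lang):
    # Placeholder path layout; UTF-8 is enforced explicitly here.
    with open(f"corpora/{lang}.txt", encoding="utf-8") as f:
        return f.read()

corpus = {lang: load_corpus(lang) for lang in selected_languages}

for model_name in MODEL_NAMES:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Unique subword tokens per language under this model's tokenizer.
    token_sets = {lang: set(tokenizer.tokenize(text.lower()))
                  for lang, text in corpus.items()}
    for lang_a, lang_b in combinations(selected_languages, 2):
        overlap = (len(token_sets[lang_a] & token_sets[lang_b])
                   / len(token_sets[lang_a] | token_sets[lang_b]))
        print(f"{model_name}  {lang_a}-{lang_b}: {overlap:.3f}")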
Any help would be much appreciated! Many thanks