How can I get a list of word segmentation results for a non-English string?

The definitions of terms aren’t very strict, ranging from conceptual definitions to specific practical implementations, so it’s all rather confusing…:sweat_smile:


You’re right: with Hugging Face fast tokenizers, the ByteLevel pre-tokenizer already inserts the visible space marker Ġ. pre_tokenize_str(...) shows those markers because ByteLevel replaces spaces and remaps bytes before the BPE model runs, so that output is expected. The BPE model then merges the remapped characters inside each pre-token into the final vocab tokens and ids, and decode() applies the ByteLevel decoder to get back normal text. (Hugging Face)
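
A minimal sketch of the pre-tokenizer stage, assuming the gpt2 checkpoint (any fast tokenizer with a ByteLevel pre-tokenizer behaves the same way; the exact pieces may differ by vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# ByteLevel has already replaced spaces with the visible marker Ġ at this point,
# before any BPE merges run.
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I wanna go home"))
# e.g. [('I', (0, 1)), ('Ġwanna', (1, 7)), ('Ġgo', (7, 10)), ('Ġhome', (10, 15))]
```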

Where Ġ can appear

  • Pre-tokenizer stage: spaces are turned into a visible marker (Ġ) so merges can learn “word-start” patterns. You will see Ġwanna, Ġgo, … even in pre_tokenize_str. (Hugging Face Forums)
  • Token stage: those same strings become actual vocab tokens such as Ġwanna, each with its own id. decode() reverses the byte→Unicode mapping and the space handling (see the snippet after this list). (Hugging Face)
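
To make the token stage concrete, a small sketch (again assuming gpt2; the exact split is vocab-dependent):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
enc = tok("I wanna go home")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['I', 'Ġwanna', 'Ġgo', 'Ġhome'] -> Ġwanna is a real vocab entry, not a display artifact

print(tok.decode(enc["input_ids"]))
# 'I wanna go home' -> the ByteLevel decoder reverses the byte mapping and the space marker
```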

Why your Chinese “errors” aren’t errors

['法国','的','首','都是','巴黎'] is not a grammatical analysis. It’s a sequence of subword/byte-level tokens chosen to compress frequent patterns. BPE is trained to shorten sequences and handle any UTF-8 text with no <unk>, not to output linguistically correct words. In scripts written without spaces, merges can cross human “word” boundaries: the token “都是” is very frequent, so it surfaces as one piece even when the intended segmentation is “首都 + 是”. This is known behavior for Chinese text (see the sketch below). (Hugging Face)
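
A minimal check of this, assuming a byte-level BPE vocab with Chinese coverage; Qwen/Qwen2-0.5B is used purely as an example checkpoint, and the exact split depends on the vocab you load:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

text = "法国的首都是巴黎"
ids = tok(text)["input_ids"]

# Decode each id separately to see the pieces as readable text.
print([tok.decode([i]) for i in ids])
# Possible output: ['法国', '的', '首', '都是', '巴黎']
# "都是" surfaces as one piece because it is frequent in the training data,
# even though the intended segmentation here is "首都" + "是".
```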

Why LLMs don’t use jieba for model tokenization

  • Coverage and robustness: byte-level schemes guarantee every byte sequence is representable, so there is no OOV (the short check after this list illustrates this). Word segmenters depend on lexicons and can fail on names, slang, or mixed-script text. (Hugging Face)
  • Multilingual consistency: one tokenizer for many scripts is simpler and more stable than per-language segmenters. (Hugging Face)
  • Compression vs. linguistics: BPE optimizes token length/frequency, not grammatical boundaries. That tradeoff improves throughput and training stability even if tokens don’t align with words. (Hugging Face)
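
A short check of the no-OOV point, assuming the gpt2 tokenizer (any byte-level BPE vocab behaves the same way):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Rare names, emoji, and mixed scripts never fall back to <unk>, and decode() round-trips.
weird = "Zürich 🤗 渋谷 naïve"
ids = tok(weird)["input_ids"]
print(tok.unk_token_id is None or tok.unk_token_id not in ids)  # True: nothing became <unk>
print(tok.decode(ids) == weird)                                 # True: every byte was representable
```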

Practical guidance

  • Need human-readable per-token text: pass return_offsets_mapping=True and slice the original string, or decode each id separately; both avoid mojibake (see the sketch after this list). (Hugging Face)
  • Need linguistic words: run a Chinese segmenter (e.g., jieba, pkuseg, THULAC) on the original text; do not expect the model tokenizer to give you words. (Segmenters are separate tools with different goals.)
  • Seeing Ġ in pre-tokens is normal for GPT-2/RoBERTa-style ByteLevel pipelines; the space marker is introduced before BPE and often survives into the final tokens. (Hugging Face Forums)
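
A sketch of the first two options plus a separate segmenter, assuming a fast tokenizer (gpt2 here) and that jieba is installed; the exact pieces are vocab- and dictionary-dependent:

```python
from transformers import AutoTokenizer
import jieba  # pip install jieba

tok = AutoTokenizer.from_pretrained("gpt2")
text = "I wanna go home"

# Option 1: offsets -> slice the original string for clean, human-readable pieces.
enc = tok(text, return_offsets_mapping=True)
print([text[s:e] for s, e in enc["offset_mapping"]])
# e.g. ['I', ' wanna', ' go', ' home']

# Option 2: decode each id separately -> the ByteLevel decoder runs per token.
print([tok.decode([i]) for i in enc["input_ids"]])

# Linguistic words are a different job: run a segmenter on the original text.
print(jieba.lcut("法国的首都是巴黎"))
# e.g. ['法国', '的', '首都', '是', '巴黎']
```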

Short references

  • HF Tokenizers pipeline and pre-tokenization overview. Spaces → markers happen pre-BPE. (Hugging Face)
  • ByteLevel pre-tokenizer description: remap bytes and split into words. (Hugging Face)
  • decode(...) behavior and relation to convert_* helpers. (Hugging Face)
  • Why BPE uses visible markers like Ġ, and examples. (Hugging Face Forums)
  • On Chinese and whitespace-free scripts, why merges can cross “word” boundaries. (The Digital Orientalist)

Summary: Ġ in your pre_tokenize_str is expected. Model tokenization ≠ word segmentation. Use offsets or per-id decode for readable token pieces; use dedicated Chinese segmenters if you need grammatical words.
