How can I get a list of word segmentation results for a non-English string?

The definitions of terms aren’t very strict, ranging from conceptual definitions to specific practical implementations, so it’s all rather confusing…:sweat_smile:


You’re right: with Hugging Face fast tokenizers, the ByteLevel pre-tokenizer already inserts the visible space marker Ġ. pre_tokenize_str(...) shows those markers because ByteLevel replaces spaces and remaps bytes before the BPE model runs, so that output is expected. The BPE model then merges the remapped characters inside each pre-token into the final vocab tokens and ids, and decode() applies the ByteLevel decoder to get back normal text. (Hugging Face)
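
A minimal sketch of the pre-tokenizer stage, assuming the gpt2 checkpoint (any fast tokenizer with a ByteLevel pre-tokenizer behaves the same way; the exact pieces may differ by vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# ByteLevel has already replaced spaces with the visible marker Ġ at this point,
# before any BPE merges run.
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I wanna go home"))
# e.g. [('I', (0, 1)), ('Ġwanna', (1, 7)), ('Ġgo', (7, 10)), ('Ġhome', (10, 15))]
```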

Where Ġ can appear

  • Pre-tokenizer stage: spaces are turned into a visible marker (Ġ) so merges can learn “word-start” patterns. You will see Ġwanna, Ġgo, … even in pre_tokenize_str. (Hugging Face Forums)
  • Token stage: those same strings become actual vocab tokens such as Ġwanna, each with its own id. decode() reverses the byte→Unicode mapping and the space handling (see the snippet after this list). (Hugging Face)
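
To make the token stage concrete, a small sketch (again assuming gpt2; the exact split is vocab-dependent):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
enc = tok("I wanna go home")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['I', 'Ġwanna', 'Ġgo', 'Ġhome'] -> Ġwanna is a real vocab entry, not a display artifact

print(tok.decode(enc["input_ids"]))
# 'I wanna go home' -> the ByteLevel decoder reverses the byte mapping and the space marker
```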

Why your Chinese “errors” aren’t errors

['法国','的','首','都是','巴黎'] is not a grammatical analysis. It’s a sequence of subword/byte-level tokens chosen to compress frequent patterns. BPE is trained to shorten sequences and handle any UTF-8 text with no <unk>, not to output linguistically correct words. In scripts written without spaces, merges can cross human “word” boundaries: the token “都是” is very frequent, so it surfaces as one piece even when the intended segmentation is “首都 + 是”. This is known behavior for Chinese text (see the sketch below). (Hugging Face)
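
A minimal check of this, assuming a byte-level BPE vocab with Chinese coverage; Qwen/Qwen2-0.5B is used purely as an example checkpoint, and the exact split depends on the vocab you load:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

text = "法国的首都是巴黎"
ids = tok(text)["input_ids"]

# Decode each id separately to see the pieces as readable text.
print([tok.decode([i]) for i in ids])
# Possible output: ['法国', '的', '首', '都是', '巴黎']
# "都是" surfaces as one piece because it is frequent in the training data,
# even though the intended segmentation here is "首都" + "是".
```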

Why LLMs don’t use jieba for model tokenization

  • Coverage and robustness: byte-level schemes guarantee every byte sequence is representable, so there is no OOV (the short check after this list illustrates this). Word segmenters depend on lexicons and can fail on names, slang, or mixed-script text. (Hugging Face)
  • Multilingual consistency: one tokenizer for many scripts is simpler and more stable than per-language segmenters. (Hugging Face)
  • Compression vs. linguistics: BPE optimizes token length/frequency, not grammatical boundaries. That tradeoff improves throughput and training stability even if tokens don’t align with words. (Hugging Face)
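
A short check of the no-OOV point, assuming the gpt2 tokenizer (any byte-level BPE vocab behaves the same way):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Rare names, emoji, and mixed scripts never fall back to <unk>, and decode() round-trips.
weird = "Zürich 🤗 渋谷 naïve"
ids = tok(weird)["input_ids"]
print(tok.unk_token_id is None or tok.unk_token_id not in ids)  # True: nothing became <unk>
print(tok.decode(ids) == weird)                                 # True: every byte was representable
```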

Practical guidance

  • Need human-readable per-token text: pass return_offsets_mapping=True and slice the original string, or decode each id separately; both avoid mojibake (see the sketch after this list). (Hugging Face)
  • Need linguistic words: run a Chinese segmenter (e.g., jieba, pkuseg, THULAC) on the original text; do not expect the model tokenizer to give you words. (Segmenters are separate tools with different goals.)
  • Seeing Ġ in pre-tokens is normal for GPT-2/RoBERTa-style ByteLevel pipelines; the space marker is introduced before BPE and often survives into the final tokens. (Hugging Face Forums)
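
A sketch of the first two options plus a separate segmenter, assuming a fast tokenizer (gpt2 here) and that jieba is installed; the exact pieces are vocab- and dictionary-dependent:

```python
from transformers import AutoTokenizer
import jieba  # pip install jieba

tok = AutoTokenizer.from_pretrained("gpt2")
text = "I wanna go home"

# Option 1: offsets -> slice the original string for clean, human-readable pieces.
enc = tok(text, return_offsets_mapping=True)
print([text[s:e] for s, e in enc["offset_mapping"]])
# e.g. ['I', ' wanna', ' go', ' home']

# Option 2: decode each id separately -> the ByteLevel decoder runs per token.
print([tok.decode([i]) for i in enc["input_ids"]])

# Linguistic words are a different job: run a segmenter on the original text.
print(jieba.lcut("法国的首都是巴黎"))
# e.g. ['法国', '的', '首都', '是', '巴黎']
```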

Short references

  • HF Tokenizers pipeline and pre-tokenization overview. Spaces → markers happen pre-BPE. (Hugging Face)
  • ByteLevel pre-tokenizer description: remap bytes and split into words. (Hugging Face)
  • decode(...) behavior and relation to convert_* helpers. (Hugging Face)
  • Why BPE uses visible markers like Ġ, and examples. (Hugging Face Forums)
  • On Chinese and whitespace-free scripts, why merges can cross “word” boundaries. (The Digital Orientalist)

Summary: Ġ in your pre_tokenize_str is expected. Model tokenization ≠ word segmentation. Use offsets or per-id decode for readable token pieces; use dedicated Chinese segmenters if you need grammatical words.
