How can I get a list of word segmentation results for a non-English string?

They seem to be intentional internal symbols. Many characters and symbols from the real world can be confusing to computers… :sweat_smile:

I think each tokenizer handles this differently. When I tried this before, Gemma 2’s tokenizer seemed to return more straightforward output than this.


The weird strings are deliberate internal symbols. Byte-level BPE first maps every byte 0–255 to “printable” Unicode so the BPE algorithm can run on any UTF-8 text with zero unknowns and perfect reversibility. The decoder later inverts that mapping. tokenize/convert_ids_to_tokens expose the raw internal symbols; decode/batch_decode run the decoder and give you normal text. Qwen uses a byte-level GPT-2–style tokenizer, so this is expected. (qwen.readthedocs.io)
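
You can see both views side by side. A minimal sketch, using the same model as the snippets further down (any byte-level BPE tokenizer behaves the same way):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

print(tok.tokenize("今天天气真好"))
# raw internal symbols: mojibake-looking byte placeholders (exact pieces depend on the vocab)

ids = tok("今天天气真好", add_special_tokens=False)["input_ids"]
print(tok.decode(ids))
# '今天天气真好' (the ByteLevel decoder has reversed the mapping)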

Why the bytes→Unicode mapping exists

  • Full coverage, no <unk>: Working at the byte level guarantees every UTF-8 sequence can be tokenized. No script or emoji breaks tokenization. (Hugging Face)
  • Reversible preprocessing: The pretokenizer replaces raw bytes and whitespace/control bytes with visible Unicode placeholders (e.g., the space marker Ġ). The ByteLevel decoder restores the original text on decode. (Hugging Face)
  • Make BPE work on bytes: Classical BPE implementations operate on character strings, not opaque bytes. Mapping bytes→printable Unicode lets the same BPE machinery merge byte sequences, then a decoder flips it back (see the short demo after this list). (Christian Mills)
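
To watch the mapping and its inverse in isolation, the ByteLevel components from the tokenizers library can be used on their own. A small sketch, independent of any model:

from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# bytes → printable-Unicode placeholders (space becomes Ġ)
pieces = ByteLevel(add_prefix_space=False).pre_tokenize_str("天气 good")
print([p for p, _ in pieces])                             # ['å¤©æ°Ķ', 'Ġgood']

# the ByteLevel decoder restores the original text
print(ByteLevelDecoder().decode([p for p, _ in pieces]))  # '天气 good'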

Why those APIs look “wrong” but aren’t

  • tokenizer.tokenize(text) → returns token strings from the vocab, which for byte-level BPE are the mapped bytes (mojibake-looking). No decoding is applied.
  • convert_ids_to_tokens(ids) → direct vocab lookup. Still internal symbols.
  • convert_tokens_to_ids(tokens) → inverse lookup; expects those internal symbols.
  • decode(ids) / batch_decode(seqs) → join the tokens and run the decoder (and optional cleanup), yielding human text. In HF this is effectively convert_tokens_to_string(convert_ids_to_tokens(...)) plus the decoder/cleanup steps (see the round-trip sketch after this list). (Hugging Face)
  • return_offsets_mapping=True (fast tokenizers) → gives (char_start, char_end) so you can slice the original string and get readable spans per token piece without touching the raw token strings. (Hugging Face)
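
A short round-trip makes each contract concrete (same model as in the snippets below):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

toks = tok.tokenize("今天天气真好")              # internal vocab symbols, not human-readable
ids = tok.convert_tokens_to_ids(toks)            # inverse lookup: expects those symbols
print(tok.convert_ids_to_tokens(ids) == toks)    # True: lossless, but still internal symbols
print(tok.convert_tokens_to_string(toks))        # decoder applied: '今天天气真好'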

Mental model of the pipeline

Normalizer → PreTokenizer → Model (BPE merges) → PostProcessor → Decoder.
Byte-level mapping happens in the PreTokenizer; ByteLevel Decoder undoes it on decode. The “garbled” symbols you saw are the pre-decoder view. (Hugging Face)
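
You can inspect these stages on the loaded tokenizer through its backend_tokenizer attribute. A quick sketch (component names vary by model):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
bt = tok.backend_tokenizer                # the underlying tokenizers.Tokenizer pipeline

print(type(bt.pre_tokenizer).__name__)    # where the byte-level mapping happens
print(type(bt.decoder).__name__)          # where it is undone on decode

pieces = [p for p, _ in bt.pre_tokenizer.pre_tokenize_str("今天天气真好")]
print(pieces)                             # the pre-decoder view (byte-mapped placeholders)
print(bt.decoder.decode(pieces))          # '今天天气真好'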

Practical rules

  • Need tokens for display → use return_offsets_mapping and slice the original text, or decode each id individually.
  • Need linguistic words → use a Chinese word segmenter (jieba, pkuseg, THULAC). Token pieces ≠ words (see the sketch after this list).
  • For batches, feed a list to the tokenizer and batch_decode the resulting list of id sequences.
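
For the word-level case, a segmenter such as jieba produces a genuinely different split from the model’s BPE pieces. A sketch; jieba is a separate pip package, not part of transformers:

# pip install jieba
import jieba
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

text = "今天天气真好"
print(jieba.lcut(text))                    # linguistic words
ids = tok(text, add_special_tokens=False)["input_ids"]
print([tok.decode([i]) for i in ids])      # BPE pieces; these need not align with words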

Canonical, version-safe snippets

# deps:
# pip install --upgrade "transformers>=4.44" "tokenizers>=0.15"
# docs: https://huggingface.co/docs/transformers/main_classes/tokenizer  # API
# docs: https://huggingface.co/docs/tokenizers/python/latest/components   # ByteLevel decoder

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

# 1) Per-token readable spans via offsets (fast tokenizers only)
def token_spans(text):
    enc = tok(text, add_special_tokens=False, return_offsets_mapping=True)
    return [text[a:b] for a, b in enc["offset_mapping"]]  # ['今天','天气','真','好']

# 2) Per-token readable text via per-id decode
def token_decodes(text):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return [tok.decode([i], skip_special_tokens=True) for i in ids]

# 3) Batch: decode full sequences back to normal text
def batch_texts(texts):
    enc = tok(texts, add_special_tokens=False, padding=False, truncation=False)
    return tok.batch_decode(enc["input_ids"], skip_special_tokens=True)

print(token_spans("今天天气真好"))
print(token_decodes("今天天气真好"))
print(batch_texts(["今天天气真好", "法国的首都是巴黎"]))

Key takeaways

  • The mojibake tokens are intentional placeholders from byte-level BPE.
  • tokenize/convert_* return internal token symbols; they do not try to be human-readable.
  • decode/batch_decode or offset mappings give you the right text.
  • Qwen uses byte-level BPE, so you will see this behavior across Qwen models. (qwen.readthedocs.io)

Short curated references

Docs and source

  • HF tokenizers components: ByteLevel pretokenizer + decoder. Why the mapping and how it is reversed. (Hugging Face)
  • HF Tokenizer API: return_offsets_mapping is fast-only and yields (char_start, char_end). (Hugging Face)
  • GPT-2 repo discussion on space/whitespace remapping (Ġ) and encoder.py. Useful for understanding the design. (GitHub)

Model-specific

  • Qwen tokenizer docs: Qwen models use a byte-level, GPT-2–style BPE tokenizer, so the internal symbols above are expected. (qwen.readthedocs.io)

Background

  • Practical explanation of GPT-2’s byte→Unicode mapping motivation. (Christian Mills)