It seems to be an intentional internal symbol. Many characters and symbols from the real world can be confusing to computers.
I think each tokenizer handles this differently. When I tried it before, Gemma 2’s tokenizer might have returned more straightforward output than this.
The weird strings are deliberate internal symbols. Byte-level BPE first maps every byte 0–255 to “printable” Unicode so the BPE algorithm can run on any UTF-8 text with zero unknowns and perfect reversibility. The decoder later inverts that mapping. tokenize/convert_ids_to_tokens expose the raw internal symbols; decode/batch_decode run the decoder and give you normal text. Qwen uses a byte-level GPT-2–style tokenizer, so this is expected. (qwen.readthedocs.io)
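A quick way to see both views side by side (same checkpoint as the snippets further down; the exact token strings will vary by model):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

text = "今天天气真好"
ids = tok(text, add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))             # raw internal symbols (mapped bytes, mojibake-looking)
print(tok.decode(ids, skip_special_tokens=True))  # the decoder restores '今天天气真好'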
Why the bytes→Unicode mapping exists
- Full coverage, no `<unk>`: Working at the byte level guarantees every UTF-8 sequence can be tokenized. No script or emoji breaks tokenization. (Hugging Face)
- Reversible preprocessing: The pretokenizer replaces raw bytes and whitespace/control bytes with visible Unicode placeholders (e.g., the space marker `Ġ`). The ByteLevel decoder restores the original text on `decode`. (Hugging Face)
- Make BPE work on bytes: Classical BPE implementations operate on character strings, not opaque bytes. Mapping bytes→printable Unicode lets the same BPE machinery merge byte sequences, then a decoder flips it back. (Christian Mills)
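For intuition, here is a small re-creation of the GPT-2-style bytes→Unicode table (the same idea Qwen's byte-level BPE relies on). It is an illustrative sketch adapted from the published encoder.py, not the exact code any given tokenizer ships:

def bytes_to_unicode():
    # Bytes that are already printable keep their own code point...
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    # ...everything else (space, newlines, control bytes, stray high bytes)
    # is shifted into unused printable code points starting at 256
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
print(table[ord(" ")])    # 'Ġ' — the familiar space marker
print(table[ord("\n")])   # 'Ċ' — newline marker
print("".join(table[b] for b in "今天".encode("utf-8")))  # mojibake-looking, but losslessly reversible

Every byte gets exactly one printable stand-in, which is why the decoder can always invert the mapping.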
Why those APIs look “wrong” but aren’t
- `tokenizer.tokenize(text)` → returns token strings from the vocab, which for byte-level BPE are the mapped bytes (mojibake-looking). No decoding is applied.
- `convert_ids_to_tokens(ids)` → direct vocab lookup. Still internal symbols.
- `convert_tokens_to_ids(tokens)` → inverse lookup; expects those internal symbols.
- `decode(ids)` / `batch_decode(seqs)` → joins tokens and runs the decoder (and optional cleanup), yielding human text. In HF this is effectively `convert_tokens_to_string(convert_ids_to_tokens(...))` plus the decoder/cleanup steps. (Hugging Face)
- `return_offsets_mapping=True` (fast tokenizers) → gives `(char_start, char_end)` so you can slice the original string and get readable spans per token piece without touching the raw token strings. (Hugging Face)
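The same contract in code (a sketch; outputs depend on the model's vocab):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

text = "今天天气真好"
toks = tok.tokenize(text)                         # internal (mapped-byte) token strings
ids = tok.convert_tokens_to_ids(toks)             # inverse lookup expects exactly those strings
print(toks)                                       # mojibake-looking, by design
print(tok.convert_tokens_to_string(toks))         # ByteLevel decoder → readable text
print(tok.decode(ids, skip_special_tokens=True))  # same text, with optional cleanup applied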
Mental model of the pipeline
Normalizer → PreTokenizer → Model (BPE merges) → PostProcessor → Decoder.
Byte-level mapping happens in the PreTokenizer; ByteLevel Decoder undoes it on decode. The “garbled” symbols you saw are the pre-decoder view. (Hugging Face)
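You can inspect these stages on a fast tokenizer through its `backend_tokenizer` (the underlying `tokenizers.Tokenizer`); a small inspection sketch, where the concrete component types will differ per model:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
backend = tok.backend_tokenizer

print(type(backend.normalizer))      # may be None for byte-level tokenizers
print(type(backend.pre_tokenizer))   # where the byte→Unicode mapping happens (ByteLevel, possibly inside a Sequence)
print(type(backend.post_processor))  # adds special tokens / chat template pieces
print(type(backend.decoder))         # ByteLevel decoder that undoes the mapping on decode

# Pre-decoder view produced by the pretokenizer: (token string, (char_start, char_end))
print(backend.pre_tokenizer.pre_tokenize_str("今天 天气"))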
Practical rules
- Need tokens for display → use
return_offsets_mappingand slice the original text, or decode each id individually. - Need linguistic words → use a Chinese word segmenter (jieba, pkuseg, THULAC). Token pieces ≠ words.
- For batches, feed a list to the tokenizer and
batch_decodethe resulting list of id sequences.
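For the word-level case, a minimal segmentation sketch with jieba (assumes `pip install jieba`; the exact split depends on jieba's dictionary):

import jieba

text = "今天天气真好"
print(list(jieba.cut(text)))  # word-level segments, independent of any BPE vocabulary

Compare this with the BPE token spans from the snippets below: the segmenter gives linguistic words, the tokenizer gives subword/byte pieces.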
Canonical, version-safe snippets
# deps:
# pip install --upgrade transformers>=4.44 tokenizers>=0.15
# docs: https://huggingface.co/docs/transformers/main_classes/tokenizer # API
# docs: https://huggingface.co/docs/tokenizers/python/latest/components # ByteLevel decoder
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
# 1) Per-token readable spans via offsets (fast tokenizers only)
def token_spans(text):
    enc = tok(text, add_special_tokens=False, return_offsets_mapping=True)
    return [text[a:b] for a, b in enc["offset_mapping"]]  # ['今天','天气','真','好']

# 2) Per-token readable text via per-id decode
def token_decodes(text):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return [tok.decode([i], skip_special_tokens=True) for i in ids]

# 3) Batch: decode full sequences back to normal text
def batch_texts(texts):
    enc = tok(texts, add_special_tokens=False, padding=False, truncation=False)
    return tok.batch_decode(enc["input_ids"], skip_special_tokens=True)
print(token_spans("今天天气真好"))
print(token_decodes("今天天气真好"))
print(batch_texts(["今天天气真好", "法国的首都是巴黎"]))
Key takeaways
- The mojibake tokens are intentional placeholders from byte-level BPE.
- `tokenize`/`convert_*` return internal token symbols; they do not try to be human-readable.
- `decode`/`batch_decode` or offset mappings give you the right text.
- Qwen uses byte-level BPE, so you will see this behavior across Qwen models. (qwen.readthedocs.io)
Short curated references
Docs and source
- HF tokenizers components: ByteLevel pretokenizer + decoder. Why the mapping and how it is reversed. (Hugging Face)
- HF `Tokenizer` API: `return_offsets_mapping` is fast-only and yields `(char_start, char_end)`. (Hugging Face)
- GPT-2 repo discussion on space/whitespace remapping (`Ġ`) and `encoder.py`. Useful for understanding the design. (GitHub)
Model-specific
- Qwen “Key Concepts”: byte-level BPE and no-unk design. (qwen.readthedocs.io)
Background
- Practical explanation of GPT-2’s byte→Unicode mapping motivation. (Christian Mills)