How can I get a list of word segmentation results for a non-English string?

For example, I've got a string and a tokenizer like:

from transformers import AutoTokenizer

string = "今天天气真好"
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B")

then I want to get something like: ['今天', '天气', '真', '好']

if I use:

input_ids = tokenizer(string)[“input_ids”]
print(tokenizer.decode(input_ids))

I just get the string itself
if I use print(tokenizer.tokenize(string)) or print(tokenizer.convert_ids_to_tokens(input_ids)), I’ll get something like:

['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']

how can I convert ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½'] to ['今天', '天气', '真', '好']?

1 Like

Hmm… Like this?

"""
Human-readable text per *model token* for Chinese using a byte-level BPE tokenizer.

Dependencies:
  pip install --upgrade transformers>=4.44 tokenizers>=0.15

Notes and docs:
  - Fast tokenizers + offset mapping:
    https://huggingface.co/docs/transformers/main/en/fast_tokenizers#returning-offsets-and-encodings
  - Decoding and relation to convert_* helpers:
    https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizer.decode
  - Why mojibake happens (byte-level BPE → bytes↔unicode mapping like GPT-2):
    https://huggingface.co/docs/transformers/main/en/tokenizer_summary#bytelevelbpe

Model tokenizer:
  Default: "unsloth/Qwen3-14B" (tokenizer only, no model load). Change MODEL_ID if needed.
  Other compatible examples: "Qwen/Qwen2-7B", "Qwen/Qwen2.5-7B"
"""

from __future__ import annotations
import os
from typing import List, Tuple
from transformers import AutoTokenizer

MODEL_ID = os.environ.get("MODEL_ID", "unsloth/Qwen3-14B")
TEXT = "今天天气真好"  # target string

def fmt_list(xs: List[str]) -> str:
    # Show Python-list formatting with repr to make mojibake obvious
    return "[" + ", ".join(repr(x) for x in xs) + "]"

def main() -> None:
    # Load fast tokenizer to enable offset mapping
    tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    if not getattr(tok, "is_fast", False):
        raise RuntimeError(
            "This demo requires a *fast* tokenizer (Rust-backed). "
            "Install `tokenizers` and pass use_fast=True, or pick a model with fast tokenizer."
        )

    print(f"Tokenizer: {MODEL_ID}")
    print(f"Input text: {TEXT}\n")

    # ---------- REPRO: shows why tokenize()/convert_ids_to_tokens look garbled ----------
    enc_plain = tok(TEXT, add_special_tokens=False)
    input_ids: List[int] = enc_plain["input_ids"]

    print("REPRO: what you saw")
    print("1) tokenizer.decode(input_ids) -> original text")
    print("   ", tok.decode(input_ids))
    print("2) tokenizer.tokenize(text) -> mojibake-like byte-level symbols")
    print("   ", fmt_list(tok.tokenize(TEXT)))
    print("3) tokenizer.convert_ids_to_tokens(input_ids) -> same mojibake symbols")
    print("   ", fmt_list(tok.convert_ids_to_tokens(input_ids)))
    print()

    # ---------- FIX A: use character offset mapping to slice original text ----------
    # This yields human-readable spans that correspond to each *model token piece*.
    enc_off = tok(TEXT, add_special_tokens=False, return_offsets_mapping=True)
    offsets: List[Tuple[int, int]] = enc_off["offset_mapping"]

    # Slice original TEXT by offsets to get readable per-token text
    pieces_offsets = [TEXT[a:b] for (a, b) in offsets]

    # Sanity: reconstructed string equals original (for add_special_tokens=False)
    reconstructed = "".join(pieces_offsets)
    ok = reconstructed == TEXT

    print("FIX A: return_offsets_mapping=True, then slice TEXT[a:b]")
    print("   human-readable per-token spans:", fmt_list(pieces_offsets))
    print("   reconstructed == original:", ok)
    print()

    # ---------- FIX B: decode per-id to readable text ----------
    # Decoding each token id individually also avoids mojibake.
    pieces_decode = [tok.decode([tid], skip_special_tokens=True) for tid in input_ids]
    print("FIX B: per-token decode([id])")
    print("   human-readable per-token decodes:", fmt_list(pieces_decode))
    print()

    # ---------- Optional: show a simple alignment table ----------
    print("Alignment table (id, token_str, offset_slice, per_id_decode):")
    id_tokens = tok.convert_ids_to_tokens(input_ids)
    for i, (tid, raw_tok, (a, b)) in enumerate(zip(input_ids, id_tokens, offsets)):
        slice_text = TEXT[a:b]
        per_id_text = pieces_decode[i]
        print(
            f"{i:2d}: id={tid:<8d} raw={raw_tok!r:>14}  "
            f"offset=({a:>2},{b:<2}) slice={slice_text!r:<6} "
            f"decode_one={per_id_text!r}"
        )

if __name__ == "__main__":
    main()

"""
...
REPRO: what you saw
1) tokenizer.decode(input_ids) -> original text
    今天天气真好
2) tokenizer.tokenize(text) -> mojibake-like byte-level symbols
    ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']
3) tokenizer.convert_ids_to_tokens(input_ids) -> same mojibake symbols
    ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']

FIX A: return_offsets_mapping=True, then slice TEXT[a:b]
   human-readable per-token spans: ['今天', '天气', '真', '好']
   reconstructed == original: True

FIX B: per-token decode([id])
   human-readable per-token decodes: ['今天', '天气', '真', '好']
...
"""

thank u, that’s helpful


For example, I've got:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B")
prompt = "今天天气真好"
prompts = ['今天天气真好', '法国的首都是巴黎']

I just need to use:

print(tokenizer.batch_decode(tokenizer(prompt)["input_ids"]))

or

input_ids = tokenizer(prompt)["input_ids"]
print([tokenizer.decode(input_id) for input_id in input_ids])

or

offsets_mapping = tokenizer(prompt, return_offsets_mapping=True)["offset_mapping"]
print([prompt[i: j] for i, j in offsets_mapping])

then I can get the right result: ['今天', '天气', '真', '好']


for batch prompts, I can process every prompt and then put the results into a list to get the right answer, like:

print([tokenizer.batch_decode(input_id) for input_id in tokenizer(prompts)["input_ids"]])

then I can get the right results easily: [['今天', '天气', '真', '好'], ['法国', '的', '首', '都是', '巴黎']]


and I searched for information about byte-level tokenization; it seems that the weird result like:

['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']

comes from the bytes directly: they made a mapping from each single byte to a char that is easy to print, like:

def bytes_to_unicode():
    # printable single-byte characters keep their own code point
    bs = (
        list(range(ord("!"), ord("~") + 1)) +
        list(range(ord("¡"), ord("¬") + 1)) +
        list(range(ord("®"), ord("ÿ") + 1))
    )

    cs = bs.copy()
    n = 0

    # every other byte (whitespace, control bytes, 0x80-0xA0, ...) is shifted up
    # to an unused printable code point: 256 + n
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1

    cs = [chr(code) for code in cs]

    # mapping: byte value (0-255) -> printable unicode character
    return dict(zip(bs, cs))

but why do they need that mapping? For what?
and why do tokenizer.tokenize, tokenizer.convert_ids_to_tokens, and tokenizer.convert_tokens_to_ids not return or use a “right” string like decode and batch_decode do?

1 Like

It seems to be an intentional internal symbol. Many characters and symbols from the real world can be confusing to computers…:sweat_smile:

I think each tokenizer handles this differently. When I tried it before, Gemma 2’s tokenizer might have returned more straightforward output than this.


The weird strings are deliberate internal symbols. Byte-level BPE first maps every byte 0–255 to “printable” Unicode so the BPE algorithm can run on any UTF-8 text with zero unknowns and perfect reversibility. The decoder later inverts that mapping. tokenize/convert_ids_to_tokens expose the raw internal symbols; decode/batch_decode run the decoder and give you normal text. Qwen uses a byte-level GPT-2–style tokenizer, so this is expected. (qwen.readthedocs.io)

Why the bytes→Unicode mapping exists

  • Full coverage, no <unk>: Working at the byte level guarantees every UTF-8 sequence can be tokenized. No script or emoji breaks tokenization. (Hugging Face)
  • Reversible preprocessing: The pretokenizer replaces raw bytes and whitespace/control bytes with visible Unicode placeholders (e.g., the space marker Ġ). The ByteLevel decoder restores the original text on decode. (Hugging Face)
  • Make BPE work on bytes: Classical BPE implementations operate on character strings, not opaque bytes. Mapping bytes→printable Unicode lets the same BPE machinery merge byte sequences, then a decoder flips it back. (Christian Mills)
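
To make the reversibility concrete, here is a minimal sketch of the byte→Unicode round trip, assuming the standard GPT-2-style bytes_to_unicode table (real tokenizers do this inside the Rust ByteLevel pre-tokenizer and decoder):

def bytes_to_unicode():
    # printable single-byte chars keep their code point; every other byte is
    # shifted to an unused printable code point (256 + n)
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = list(bs)
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

b2u = bytes_to_unicode()
u2b = {c: b for b, c in b2u.items()}

text = "今天天气真好"
mapped = "".join(b2u[b] for b in text.encode("utf-8"))
print(mapped)                                     # 'ä»Ĭå¤©å¤©æ°ĶçľŁå¥½'  (the pre-decoder view)
restored = bytes(u2b[c] for c in mapped).decode("utf-8")
print(restored == text)                           # True: the mapping is lossless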

Why those APIs look “wrong” but aren’t

  • tokenizer.tokenize(text) → returns token strings from the vocab, which for byte-level BPE are the mapped bytes (mojibake-looking). No decoding is applied.
  • convert_ids_to_tokens(ids) → direct vocab lookup. Still internal symbols.
  • convert_tokens_to_ids(tokens) → inverse lookup; expects those internal symbols.
  • decode(ids) / batch_decode(seqs) → joins tokens and runs the decoder (and optional cleanup), yielding human text. In HF this is effectively convert_tokens_to_string(convert_ids_to_tokens(...)) plus the decoder/cleanup steps. (Hugging Face)
  • return_offsets_mapping=True (fast tokenizers) → gives (char_start, char_end) so you can slice the original string and get readable spans per token piece without touching the raw token strings. (Hugging Face)
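
A quick sketch of how those calls relate on the same string (assuming the unsloth/Qwen3-14B fast tokenizer used in this thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好"

tokens = tok.tokenize(s)                      # internal vocab strings (byte-mapped)
ids = tok.convert_tokens_to_ids(tokens)       # inverse lookup over the same internal symbols
print(ids == tok(s, add_special_tokens=False)["input_ids"])  # expected: True

print(tokens)                                 # mojibake-looking symbols
print(tok.convert_tokens_to_string(tokens))   # runs the decoder -> readable text
print(tok.decode(ids))                        # same readable text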

Mental model of the pipeline

Normalizer → PreTokenizer → Model (BPE merges) → PostProcessor → Decoder.
Byte-level mapping happens in the PreTokenizer; ByteLevel Decoder undoes it on decode. The “garbled” symbols you saw are the pre-decoder view. (Hugging Face)
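
You can peek at those stages on a fast tokenizer via its backend object (a sketch; which components are present varies by model):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
backend = tok.backend_tokenizer        # underlying `tokenizers` (Rust) Tokenizer
print(backend.normalizer)              # may be None
print(backend.pre_tokenizer)           # byte-level pre-tokenization happens here
print(backend.model)                   # BPE merges
print(backend.post_processor)          # special-token handling
print(backend.decoder)                 # inverts the byte mapping on decode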

Practical rules

  • Need tokens for display → use return_offsets_mapping and slice the original text, or decode each id individually.
  • Need linguistic words → use a Chinese word segmenter (jieba, pkuseg, THULAC). Token pieces ≠ words.
  • For batches, feed a list to the tokenizer and batch_decode the resulting list of id sequences.

Canonical, version-safe snippets

# deps:
# pip install --upgrade transformers>=4.44 tokenizers>=0.15
# docs: https://huggingface.co/docs/transformers/main_classes/tokenizer  # API
# docs: https://huggingface.co/docs/tokenizers/python/latest/components   # ByteLevel decoder

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

# 1) Per-token readable spans via offsets (fast tokenizers only)
def token_spans(text):
    enc = tok(text, add_special_tokens=False, return_offsets_mapping=True)
    return [text[a:b] for a, b in enc["offset_mapping"]]  # ['今天','天气','真','好']

# 2) Per-token readable text via per-id decode
def token_decodes(text):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return [tok.decode([i], skip_special_tokens=True) for i in ids]

# 3) Batch: decode full sequences back to normal text
def batch_texts(texts):
    enc = tok(texts, add_special_tokens=False, padding=False, truncation=False)
    return tok.batch_decode(enc["input_ids"], skip_special_tokens=True)

print(token_spans("今天天气真好"))
print(token_decodes("今天天气真好"))
print(batch_texts(["今天天气真好", "法国的首都是巴黎"]))

Key takeaways

  • The mojibake tokens are intentional placeholders from byte-level BPE.
  • tokenize/convert_* return internal token symbols; they do not try to be human-readable.
  • decode/batch_decode or offset mappings give you the right text.
  • Qwen uses byte-level BPE, so you will see this behavior across Qwen models. (qwen.readthedocs.io)

Short curated references

Docs and source

  • HF tokenizers components: ByteLevel pretokenizer + decoder. Why the mapping and how it is reversed. (Hugging Face)
  • HF Tokenizer API: return_offsets_mapping is fast-only and yields (char_start, char_end). (Hugging Face)
  • GPT-2 repo discussion on space/whitespace remapping (Ġ) and encoder.py. Useful for understanding the design. (GitHub)

Model-specific

  • Qwen docs, “Key Concepts”: the byte-level BPE (UTF-8) tokenizer, its motivations, and limitations. (qwen.readthedocs.io)

Background

  • Practical explanation of GPT-2’s byte→Unicode mapping motivation. (Christian Mills)
1 Like

what's the input and output of the byte-level pre-tokenizer?
do u mean the pre-tokenizer will receive a string like “今天天气真好” and then:

  • transform the string into bytes
  • map every byte to a visible char, getting the string 'ä»Ĭå¤©å¤©æ°ĶçľŁå¥½'
  • output the char list like : ["ä", "»", "Ĭ", "å", "¤", "©", "å", "¤", "©", "æ", "°", "Ķ", "ç", "ľ", "Ł", "å", "¥", "½"]

then the model will process the list and get ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']
then map the token list to an id list: [100644, 104307, 88051, 52801] and output it


then posttokenizer will … I don’t know


then I got the id_list, this is the output of tokenizer(string)


is that right?

1 Like

Seems almost right?


Your mental model is close, but a few steps are off. The byte-level pre-tokenizer does not hand you a list of mapped single characters. It rewrites the string with a reversible byte→Unicode mapping and splits on whitespace; then the BPE model merges those mapped characters into the vocabulary tokens you see, and the decoder later reverses the byte mapping to give normal text.

What actually flows through the pipeline

Input to pre-tokenizer: a normalized Unicode string, e.g. "今天天气真好".
Output of pre-tokenizer: a sequence of “pre-tokens” plus alignment info. With ByteLevel, this stage:

  • Remaps each UTF-8 byte to a printable Unicode placeholder.
  • Splits on whitespace (Chinese has none, so you usually get one pre-token covering the whole sentence).
  • Carries character offsets so you can map back to the original text. (Hugging Face)

Model (BPE): runs merges over the byte-mapped characters inside each pre-token, producing the “mojibake-looking” token strings such as ['ä»Ĭå¤©','å¤©æ°Ķ','çľŁ','å¥½'], and then maps them to ids like [100644,104307,88051,52801]. Qwen uses byte-level BPE on UTF-8 specifically to avoid OOV and keep the base alphabet to 256 bytes. (qwen.readthedocs.io)

Post-processor: optionally inserts special tokens (BOS/EOS, chat template pieces, etc.). No text decoding happens here. (Hugging Face)

Decoder (ByteLevel): when you call decode/batch_decode, it inverts the byte→Unicode mapping and fixes spaces, yielding readable text. That is why tokenize()/convert_ids_to_tokens() look weird but decode() looks right. (Hugging Face)
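
You can run the decoder step in isolation to see it invert the byte mapping (a minimal sketch, assuming the fast unsloth/Qwen3-14B tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
ids = tok("今天天气真好", add_special_tokens=False)["input_ids"]
raw = tok.convert_ids_to_tokens(ids)              # pre-decoder view (byte-mapped symbols)
print(raw)
print(tok.backend_tokenizer.decoder.decode(raw))  # ByteLevel decoder -> readable '今天天气真好'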

Correcting your step-by-step sketch

Not quite:

  • The pre-tokenizer does apply the byte→Unicode mapping, but it does not output a list of single characters like ["ä","»","Ĭ",...] for you to see. It produces pre-tokens and offsets; then the BPE model merges those mapped characters into vocabulary tokens such as 'ä»Ĭ天'. (Hugging Face)
  • tokenizer.tokenize() and convert_ids_to_tokens() expose those internal vocabulary strings (post-BPE, pre-decoder). They are supposed to look unreadable for non-ASCII text because they are the mapped bytes. decode()/batch_decode() run the decoder to restore human text. (Hugging Face)

Minimal mental model

Unicode text
  └─ Normalizer (optional)
      └─ PreTokenizer: ByteLevel
          - map bytes → visible Unicode
          - split on whitespace
          - keep offsets
            ↓
      Model: BPE merges → token strings → ids
            ↓
      PostProcessor: add specials (optional)
            ↓
Output ids, masks, offsets
            ↓
Decoder on demand (decode/batch_decode) → readable text

Docs for each stage emphasize: ByteLevel remaps bytes and splits; the ByteLevel decoder exists purely to revert that remapping. (Hugging Face)

Why do this at all?

  • Full coverage, no <unk>: bytes cover any UTF-8 sequence, so tokenization never fails. (Hugging Face)
  • Uniform machinery: BPE expects strings, not raw bytes; mapping bytes to visible Unicode lets the same BPE algorithm operate and then be reversed losslessly. The classic GPT-2 design also marks spaces with visible characters (e.g., Ġ). (Hugging Face)
  • Offsets preserved with fast tokenizers, so you can slice original text per token span if you need human-readable pieces without decoding every id individually. (Hugging Face)

Practical API mapping

  • tokenizer.tokenize(text) → raw vocab tokens (mapped-byte strings).
  • tokenizer.convert_ids_to_tokens(ids) → same raw vocab tokens.
  • tokenizer.decode(ids) / batch_decode(list_of_ids) → joins tokens and runs the decoder → normal text.
  • return_offsets_mapping=True (fast tokenizers) → per-token (start,end) on the original string → slice to get human-readable spans per token piece. (Hugging Face)

For Qwen specifically

Qwen’s docs state they use byte-level BPE on UTF-8, starting from bytes and merging frequent pairs to build tokens. They note many tokens won’t look like valid Unicode to humans and should be viewed as a compression scheme. Your observations match this. (qwen.readthedocs.io)


Short, curated references

Core docs

  • Hugging Face Tokenizers “Components”: ByteLevel PreTokenizer and Decoder descriptions and examples. Good for input/output semantics. (Hugging Face)
  • Transformers “Tokenizer” page: explains fast tokenizers and alignment methods like offset mapping. (Hugging Face)
  • Tokenizers “The tokenization pipeline”: overview of Normalizer → PreTokenizer → Model → PostProcessor → Decoder. (Hugging Face)

Model-specific

  • Qwen Key Concepts: confirms byte-level BPE (UTF-8), motivations, and limitations. (qwen.readthedocs.io)

Background / source

  • GPT-2 tokenization style and space marker Ġ discussions and code lineage. Useful to understand why mapped characters appear. (Hugging Face)
1 Like

so u mean for a string like "今天天气真好,I wanna go swimming", the pre-tokenizer will process it as:

  • transform the string to bytes
  • map each byte to a printable char and split on whitespace, getting something like:
    ['ä»Ĭå¤©å¤©æ°ĶçľŁå¥½ï¼ĮI', 'Ġwanna', 'Ġgo', 'Ġswimming'] and [(0, 8), (8, 14), (14, 17), (17, 26)], which represent the “pre-tokens” and the “mapping offsets”

then the model will use the BPE algorithm to process the “pre-tokens” into tokens, getting:
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming'] and [(0, 2), (2, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 14), (14, 17), (17, 26)]
and then map the tokens to ids and return ids and mapping-offset

1 Like

Yeah.


Mostly right. Two fixes:

  1. The ByteLevel pre-tokenizer splits on whitespace and remaps bytes to printable code points, but it does not hand you a visible list of single characters. It outputs “pre-tokens” with offsets. Then the BPE model merges those mapped characters into vocab tokens. Decoding later inverts the byte→Unicode mapping. (Hugging Face)

  2. The English space is encoded into tokens with a visible space marker (e.g., Ġ). That’s why you see tokens like Ġwanna. Offsets can therefore include the leading space. (Hugging Face)

Walk-through on your example

Input:
"今天天气真好,I wanna go swimming"

Pre-tokenizer output (conceptual)

  • Operation: normalize → UTF-8 bytes → map bytes to printable Unicode → split on whitespace → keep offsets.
  • Pre-tokens (by whitespace):
    ["今天天气真好,", "I", "wanna", "go", "swimming"]
  • Character offsets (start, end) over the original string:
    [(0,7), (7,8), (9,14), (15,17), (18,26)]
    These are spans in the original text, not in the mapped “mojibake” string. (Hugging Face)
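
You can check the real pre-token splits for your tokenizer rather than relying on this conceptual list (a sketch; the exact splits depend on the model's pre-tokenizer configuration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好,I wanna go swimming"
# -> [(pre_token_string, (start, end)), ...]
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s))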

BPE model output

  • Runs merges inside each pre-token over the mapped characters.

  • Typical token strings and offsets (character indexes in original text):

    • Chinese chunk: ['ä»Ĭå¤©','å¤©æ°Ķ','çľŁ','å¥½','ï¼Į'] → [(0,2),(2,4),(4,5),(5,6),(6,7)]
      which correspond to ['今天','天气','真','好',',']
    • English: ['I','Ġwanna','Ġgo','Ġswimming'] → [(7,8),(8,14),(14,17),(17,26)]
      where Ġ indicates the preceding space is part of the token span.
  • Map token strings → ids, e.g. [100644,104307,88051,52801, ...].

  • No human decoding has happened yet; these are internal symbols. Qwen uses byte-level BPE on UTF-8, so this behavior is expected and guarantees no OOV. (Qwen)

Post-processor

  • Adds special tokens (BOS/EOS, chat template pieces) if configured. It does not “fix” readability. (Hugging Face)

Decoder (when you call decode / batch_decode)

  • Inverts the byte→Unicode mapping and restores spaces, yielding normal text.
  • Fast tokenizers also expose return_offsets_mapping=True so you can slice the original string per token without decoding each id. (Hugging Face)

Quick rules to remember

  • tokenize() / convert_ids_to_tokens() → raw vocab strings (mapped bytes). They will look garbled for non-ASCII. Correct by design.
  • decode() / batch_decode() → runs the decoder → human text.
  • return_offsets_mapping=True (fast tokenizers) → character spans over the original text for each final token.
  • English tokens may include a leading space (Ġ...), so their offsets can start at the space. This depends on tokenizer settings like add_prefix_space and post-processing; be mindful of offset edge cases. (Hugging Face)

Minimal verification snippet

# deps:
# pip install --upgrade transformers>=4.44 tokenizers>=0.15
from transformers import AutoTokenizer
s = "今天天气真好,I wanna go swimming"
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
spans  = [s[a:b] for a,b in enc["offset_mapping"]]

print(tokens)  # internal strings (byte-mapped), includes Ġ for spaces
print(spans)   # human-readable per-token text slices
print(tok.decode(enc["input_ids"]))  # original text

"""
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming']
['今天', '天气', '真', '好', ',', 'I', ' wanna', ' go', ' swimming']
今天天气真好,I wanna go swimming
"""
1 Like

so what do “token” and “pre-token” actually mean?
I mean, is a token a string or bytes? and how about a pre-token?
u said Byte-level mapping happens in the PreTokenizer and the pre-tokenizer outputs “pre-tokens” with offsets, so can I think of the “pre-tokens” as a list[str] like:
['ä»Ĭå¤©å¤©æ°ĶçľŁå¥½ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming']


and there is no whitespace before the letter I, so will the pre-tokenizer split it out?

1 Like

BTW, for general information on tokenization, this article should also be helpful.


Definitions first.

  • Pre-token: an intermediate span of the original text produced by the pre-tokenizer, plus its character offsets. With ByteLevel, the pre-tokenizer (a) remaps each UTF-8 byte to a visible Unicode placeholder and (b) splits on whitespace to yield “word-like” chunks; it also carries offsets so you can map back to the input. (Hugging Face)
  • Token: the result after the model step (BPE merges) runs inside each pre-token. Tokens are vocabulary strings (those mapped-byte symbols you saw) and their integer IDs. A decoder then inverts the byte mapping when you call decode/batch_decode. (Hugging Face)

Is it strings or bytes?

  • The pipeline runs on Unicode strings externally. Byte-level logic is handled by mapping bytes→printable Unicode during pre-tokenization, then reversing it during decoding. You interact with strings and IDs; no raw bytes are returned. Byte-level BPE is used so every UTF-8 sequence is representable without <unk>. (GitHub)

Your example, concretely

Input: 今天天气真好,I wanna go swimming

  1. Pre-tokenizer output (conceptual): splits on whitespace only. So you get pre-tokens like
    ["今天天气真好,I", "wanna", "go", "swimming"] with offsets over the original string. The “I” is attached to the first pre-token because there is no space before it. Punctuation does not force a split in ByteLevel; whitespace does. (Hugging Face)
  2. Model (BPE) output: inside each pre-token, BPE merges the byte-mapped characters into vocab tokens such as
    ['ä»Ĭå¤©','å¤©æ°Ķ','çľŁ','å¥½','ï¼Į','I','Ġwanna','Ġgo','Ġswimming']
    and maps them to IDs. The leading Ġ on English pieces indicates a preceding space in GPT-2–style tokenizers. (GitHub)
  3. Decoder: decode/batch_decode inverts the byte mapping and restores normal spacing. (Hugging Face)

Clarifications to common confusions

  • “Are pre-tokens a list[str] I can see?” Conceptually yes (word-like chunks), but what the library exposes by default are the final tokens and IDs. If you want to inspect pre-tokens, call the underlying Rust pre-tokenizer:

    # deps: pip install tokenizers>=0.15 transformers>=4.44
    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
    print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("今天天气真好,I wanna go swimming"))
    # -> [(pre_token_string, (start, end)), ...]
    

    This shows the whitespace splits and their offsets. (Vinsmoke Three)

  • “Where does Ġ come from?” It’s a visible space marker baked into GPT-2–style vocabularies so BPE can learn merges that depend on preceding whitespace. It appears in tokens, not pre-tokens. (GitHub)

  • “Why byte-level at all?” Two reasons: coverage with no <unk> and exact reversibility. BPE expects strings, so bytes are first mapped to visible Unicode, then merged; the decoder reverses that map. (GitHub)

Mental model you can trust

Unicode text
  → Normalizer
  → Pre-tokenizer (ByteLevel): bytes→visible Unicode; split on whitespace; keep offsets
  → Model (BPE): merge mapped chars into vocab tokens; get token strings + IDs
  → Post-processor: add special tokens if needed
  → (on decode) Decoder (ByteLevel): visible Unicode → original bytes → human text

This matches Hugging Face’s Tokenizers pipeline and API terminology. (Hugging Face)

Minimal checks you can run

  • Pre-tokens (whitespace splits + offsets): use pre_tokenize_str as above. (Google Colab)
  • Final tokens (mapped-byte strings): tokenizer.convert_ids_to_tokens(...).
  • Readable per-token spans: ask for return_offsets_mapping=True and slice the original string. (Hugging Face)
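
Putting the three checks together (a sketch assuming the unsloth/Qwen3-14B fast tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好,I wanna go swimming"

# 1) pre-tokens: splits + offsets before BPE runs
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s))

# 2) final tokens: mapped-byte vocab strings
enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)
print(tok.convert_ids_to_tokens(enc["input_ids"]))

# 3) readable per-token spans via offsets
print([s[a:b] for a, b in enc["offset_mapping"]])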

Short references

  • HF Tokenizers: Pre-tokenizers (ByteLevel description) and Decoders (ByteLevel decoder). (Hugging Face)
  • HF Tokenizers: Pipeline overview and offset mapping. (Hugging Face)
  • GPT-2 space marker Ġ background. (GitHub)
  • Byte-level BPE rationale and UTF-8 coverage. (GitHub)

Summary: a pre-token is a whitespace-delimited span with offsets; a token is a BPE-merged vocab string (plus its ID). No space before I means the pre-tokenizer does not split there.

1 Like

why did u say Ġ appears in tokens, not pre-tokens? I tried and got this:
(screenshot: pre_tokenize_str output showing Ġ markers in the pre-tokens)


by the way, I noticed that the tokenizer can make grammatical errors in Chinese: "法国的首都是巴黎" will give ['法国', '的', '首', '都是', '巴黎'], but the right result should be ['法国', '的', '首都', '是', '巴黎']. emmm, does that mean BPE is not good enough, or why not use something like jieba for such work?

1 Like

The definitions of terms aren’t very strict, ranging from conceptual definitions to specific practical implementations, so it’s all rather confusing…:sweat_smile:


You’re right: with Hugging Face fast tokenizers, the ByteLevel pre-tokenizer already inserts the visible space marker Ġ. pre_tokenize_str(...) shows those markers because ByteLevel replaces spaces and remaps bytes before the BPE model runs. That output is expected. The later BPE model then merges the mapped characters inside each pre-token into the final vocab tokens and ids; decode() applies the ByteLevel decoder to get back normal text. (Hugging Face)

Where Ġ can appear

  • Pre-tokenizer stage: spaces are turned into a visible marker (Ġ) so merges can learn “word-start” patterns. You will see Ġwanna, Ġgo, … even in pre_tokenize_str. (Hugging Face Forums)
  • Token stage: those same strings become actual vocab tokens like Ġwanna and ids. decode() reverses the byte→Unicode mapping and space handling. (Hugging Face)
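
A small check of both stages (a sketch; exact pre-token strings depend on the tokenizer's configuration, but with a ByteLevel pipeline the Ġ marker is already visible before BPE runs):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "I wanna go swimming"
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s))  # Ġ shows up in the pre-tokens
print(tok.tokenize(s))                                          # and in the final vocab tokens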

Why your Chinese “errors” aren’t errors

['法国','的','首','都是','巴黎'] is not a grammatical analysis. It’s a sequence of subword/byte-level tokens chosen to compress frequent patterns. BPE is trained to minimize sequence length and handle any UTF-8 text with no <unk>, not to output linguistically correct words. In languages without spaces, merges can cross human “word” boundaries, e.g., the token “都是” is very frequent, so it appears as one piece even when the intended segmentation is “首都 + 是”. This is a known behavior on Chinese. (Hugging Face)

Why LLMs don’t use jieba for model tokenization

  • Coverage and robustness: byte-level schemes guarantee every byte sequence is representable. No OOV. Word segmenters depend on lexicons and can fail on names, slang, or mixed-script text. (Hugging Face)
  • Multilingual consistency: one tokenizer for many scripts is simpler and more stable than per-language segmenters. (Hugging Face)
  • Compression vs. linguistics: BPE optimizes token length/frequency, not grammatical boundaries. That tradeoff improves throughput and training stability even if tokens don’t align with words. (Hugging Face)

Practical guidance

  • Need human-readable per-token text: request return_offsets_mapping=True and slice the original string; or decode each id separately. Both avoid mojibake. (Hugging Face)
  • Need linguistic words: run a Chinese segmenter (e.g., jieba, pkuseg, THULAC) on the original text; do not expect the model tokenizer to give you words. (Segmenters are separate tools with different goals.)
  • Seeing Ġ in pre-tokens is normal for GPT-2/RoBERTa-style ByteLevel pipelines; the space marker is introduced before BPE and often survives into the final tokens. (Hugging Face Forums)

Short references

  • HF Tokenizers pipeline and pre-tokenization overview. Spaces → markers happen pre-BPE. (Hugging Face)
  • ByteLevel pre-tokenizer description: remap bytes and split into words. (Hugging Face)
  • decode(...) behavior and relation to convert_* helpers. (Hugging Face)
  • Why BPE uses visible markers like Ġ, and examples. (Hugging Face Forums)
  • On Chinese and whitespace-free scripts, why merges can cross “word” boundaries. (The Digital Orientalist)

Summary: Ġ in your pre_tokenize_str is expected. Model tokenization ≠ word segmentation. Use offsets or per-id decode for readable token pieces; use dedicated Chinese segmenters if you need grammatical words.

1 Like

so BPE is not perfect; it can't even get a good-enough segmentation result for such a simple Chinese string
I read the article u pasted, it's meaningful
but ChatGPT can reverse a word correctly, doesn't it use a tokenizer?

1 Like

I personally think that just as there is no perfect natural language—let alone programming language—for all purposes, there is no perfect tokenizer for tokenizing it either… Well, there may be safer options or flawed ones.


Yes. ChatGPT (and almost every LLM you use) always runs text through a tokenizer first. OpenAI models use a byte-pair-encoding tokenizer (tiktoken). Tokens are subwords/bytes chosen for compression and full coverage, not grammatical “words.” So Chinese “都是” may be one token even when you want “都”“是.” That is expected and by design. (GitHub)

Why ChatGPT can still reverse “strawberry”

  • The model never manipulates raw characters. It predicts tokens whose bytes decode to characters. Reversing “strawberry” means emitting a token sequence whose decoded bytes are y r r e b w a r t s. That can be done even if the input tokenization groups “straw” and “berry” together. Tokenization granularity ≠ capability to produce character-level outputs. (GitHub)
  • But tokenization does make some character tasks brittle. This is the well-known “strawberry problem”: models often fail at fine-grained letter tasks because subword tokens hide character boundaries. Multiple studies document this and propose fixes. (arXiv)
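
A hedged sketch with tiktoken showing the same point (the encoding name below is an assumption; pick the one that matches your model):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids)                             # subword ids; the exact split depends on the encoding
print([enc.decode([i]) for i in ids])  # per-token text pieces, not individual characters
print(enc.decode(ids))                 # 'strawberry': encode/decode is lossless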

Why BPE is used instead of a word segmenter like jieba

  • Coverage with no <unk>: byte-level BPE can represent any UTF-8 text and is lossless and reversible. No per-language rules, no OOV. (GitHub)
  • Compression and efficiency: frequent substrings become single tokens, shortening sequences and speeding training/inference. Word segmenters don’t guarantee short sequences across all scripts. (GitHub)
  • Multilingual simplicity: one tokenizer works across languages. A Chinese-specific segmenter would not generalize to other scripts or mixed-script text. (jieba is great for word segmentation, but LLM tokenizers solve a different problem.) (GitHub)

What your screenshot shows

  • You called pre_tokenize_str. In ByteLevel tokenizers, the pre-tokenizer already (a) maps bytes to visible placeholders and (b) marks spaces with a visible symbol like Ġ. Seeing Ġ at this stage is normal; the BPE model then merges inside each pre-token; decode() later inverts the byte mapping to produce human text. (Hugging Face)

About the Chinese example

  • ['法国', '的', '首', '都是', '巴黎'] is not a grammatical parse. It is a subword sequence chosen by frequency statistics. On Chinese, high-frequency bigrams like “都是” often appear as a single token, even across your desired word boundary (“首都 + 是”). If you need linguistically correct words, run a segmenter such as jieba on the original string. Don’t expect the model tokenizer to do this job. (GitHub)
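
For comparison, a word-segmentation sketch with jieba (the exact output is an assumption; it depends on jieba's dictionary and version):

# pip install jieba
import jieba

print(jieba.lcut("法国的首都是巴黎"))
# typically ['法国', '的', '首都', '是', '巴黎']: linguistic words, unlike the BPE pieces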

If you need better character-level behavior

  • Keep the model but tweak the input: insert spaces or separators to force single-character tokens for tasks like counting or reversing letters. (Common trick for the strawberry problem.) (Simbian AI)
  • Use or fine-tune models that operate on bytes/characters (e.g., ByT5, CharacterBERT-style approaches). These keep explicit character access but cost more due to longer sequences. (ACL Anthology)

Quick takeaways

  • ChatGPT uses a tokenizer. Tokens ≠ words. (GitHub)
  • Reversal works because the model outputs tokens that decode to the reversed bytes, not because it “thinks in characters.”
  • BPE trades linguistic neatness for universality and efficiency. For word-level Chinese segmentation, use a dedicated segmenter.

Short, curated references

Tokenizers and BPE

  • OpenAI tiktoken README: properties of BPE, encodings for models. Clear and practical. (GitHub)
  • HF Tokenizers docs: ByteLevel PreTokenizer and Decoder behavior. Shows why you see Ġ and how decoding reverses it. (Hugging Face)

Chinese word segmentation

  • jieba project page. Modes, custom dictionaries, and usage. Good when you need real words. (GitHub)

Character-level limitations and fixes

  • “Strawberry problem” and character-level brittleness in tokenized LMs. Background and evidence. (arXiv)
  • EMNLP 2025 paper on adding character access while keeping tokens. Shows gains on character tasks. (ACL Anthology)
1 Like
