How can I get a list of word segmentation results for a non-English string?

Yeah, mostly right. Two fixes:

  1. The ByteLevel pre-tokenizer splits on whitespace and remaps every UTF-8 byte to a printable code point, but it does not hand you a readable list of single characters. It outputs “pre-tokens” with offsets. The BPE model then merges those mapped characters into vocab tokens, and decoding later inverts the byte→Unicode mapping (see the mapping sketch after this list). (Hugging Face)

  2. The English space is encoded into tokens with a visible space marker (e.g., Ġ). That’s why you see tokens like Ġwanna. Offsets can therefore include the leading space. (Hugging Face)
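
If it helps to see why the vocab strings look like mojibake, below is a minimal sketch of that byte→Unicode table. It mirrors the reference GPT-2-style bytes_to_unicode construction that byte-level BPE tokenizers use; the sample strings are only illustrations.

# Sketch of the GPT-2-style byte -> printable-Unicode table used by byte-level
# BPE tokenizers. Every one of the 256 byte values gets a visible character,
# which is why no input can ever be out-of-vocabulary.
def bytes_to_unicode():
    # Printable Latin-1 bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):  # ... everything else is shifted past U+00FF (0x20 -> Ġ, etc.)
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

B2U = bytes_to_unicode()
print("".join(B2U[b] for b in "今天".encode("utf-8")))    # -> ä»Ĭå¤©  (2 chars, 6 bytes)
print("".join(B2U[b] for b in " wanna".encode("utf-8")))  # -> Ġwanna (the space becomes Ġ)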

Walk-through on your example

Input:
"今天天气真好,I wanna go swimming"

Pre-tokenizer output (conceptual)

  • Operation: normalize → UTF-8 bytes → map bytes to printable Unicode → split on whitespace → keep offsets.
  • Pre-tokens (by whitespace):
    ["今天天气真好,", "I", "wanna", "go", "swimming"]
  • Character offsets (start, end) over the original string:
    [(0,7), (8,9), (10,15), (16,18), (19,27)]
    These are spans in the original text, not in the mapped “mojibake” string; the sketch right after this list reproduces the step. (Hugging Face)
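
You can poke this step directly through the backend tokenizer. A small sketch, assuming the same unsloth/Qwen3-14B checkpoint as the final snippet; the exact chunking depends on the split rule the checkpoint configures:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

# Run only the pre-tokenization step; each entry is (pre-token, (start, end)).
pre = tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s)
print([piece for piece, _ in pre])
# The pieces are already in the byte-mapped alphabet (e.g. '今' shows up as 'ä»Ĭ').
# Note: offsets from pre_tokenize_str can refer to the transformed text; for spans
# over the original string, rely on return_offsets_mapping instead.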

BPE model output

  • Runs merges inside each pre-token over the mapped characters.

  • Typical token strings and offsets (character indexes in original text):

    • Chinese chunk: ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į'] with offsets [(0,2), (2,4), (4,5), (5,6), (6,7)],
      which correspond to ['今天', '天气', '真', '好', '，']
    • English: ['I', 'Ġwanna', 'Ġgo', 'Ġswimming'] with offsets [(8,9), (9,15), (15,18), (18,27)],
      where Ġ indicates the preceding space is part of the token span.
  • Map token strings → ids, e.g. [100644,104307,88051,52801, ...].

  • No human-readable decoding has happened yet; these are internal symbols (a quick per-token check follows this list). Qwen uses byte-level BPE on UTF-8, so this behavior is expected and guarantees no OOV. (Qwen)
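
A quick way to see the internal-symbol vs. readable-text distinction per token: tokenize() returns the raw vocab strings and convert_tokens_to_string() runs the decoder over any subset of them. A sketch with the same assumed checkpoint; exact token boundaries and ids depend on that checkpoint's merges:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

raw = tok.tokenize(s)                 # byte-mapped vocab strings, e.g. 'ä»Ĭå¤©'
ids = tok.convert_tokens_to_ids(raw)  # integer ids; byte-level BPE means never OOV
per_token = [tok.convert_tokens_to_string([t]) for t in raw]  # decoder applied token by token

print(raw)        # internal symbols, garbled-looking for non-ASCII
print(ids)
print(per_token)  # readable pieces, e.g. '今天', ' wanna'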

Post-processor

  • Adds configured special tokens (e.g., BOS/EOS) when add_special_tokens is on; chat-template pieces are inserted as text by apply_chat_template before tokenization, not here. It does not “fix” readability (quick check below). (Hugging Face)
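
To check what, if anything, gets added for a given checkpoint, compare the two encodings. A small sketch; whether any ids differ depends entirely on the tokenizer's configuration:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

with_special    = tok(s)["input_ids"]
without_special = tok(s, add_special_tokens=False)["input_ids"]

# If the two lists differ, the extra ids are whatever special tokens this
# checkpoint's post-processor is configured to add; many chat models add none
# here and only gain control tokens through apply_chat_template.
print(len(with_special), len(without_special))
print(tok.convert_ids_to_tokens(with_special[:3]))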

Decoder (when you call decode / batch_decode)

  • Inverts the byte→Unicode mapping and restores spaces, yielding normal text (see the per-id example after this list).
  • Fast tokenizers also expose return_offsets_mapping=True so you can slice the original string per token without decoding each id. (Hugging Face)
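
Decoding also works id by id, which is often the simplest way to get readable per-token pieces without touching offsets. A sketch with the same assumed checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

ids = tok(s, add_special_tokens=False)["input_ids"]
pieces = [tok.decode([i]) for i in ids]  # decoder inverts the byte->Unicode map per id

print(pieces)           # readable chunks; English ones keep their leading space
print("".join(pieces))  # concatenation reproduces the original text
# Caveat: a token that ends in the middle of a multi-byte character decodes
# alone to U+FFFD; decoding the full id list at once is always safe.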

Quick rules to remember

  • tokenize() / convert_ids_to_tokens() → raw vocab strings (mapped bytes). They will look garbled for non-ASCII. Correct by design.
  • decode() / batch_decode() → runs the decoder → human text.
  • return_offsets_mapping=True (fast tokenizers) → character spans over the original text for each final token.
  • English tokens may include a leading space (Ġ...), so their offsets can start at the space. This depends on tokenizer settings like add_prefix_space and post-processing; be mindful of offset edge cases. (Hugging Face)

Minimal verification snippet

# deps:
# pip install --upgrade "transformers>=4.44" "tokenizers>=0.15"
from transformers import AutoTokenizer
s = "今天天气真好,I wanna go swimming"
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
spans  = [s[a:b] for a,b in enc["offset_mapping"]]

print(tokens)  # internal strings (byte-mapped), includes Ġ for spaces
print(spans)   # human-readable per-token text slices
print(tok.decode(enc["input_ids"]))  # original text

"""
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming']
['今天', '天气', '真', '好', '，', 'I', ' wanna', ' go', ' swimming']
今天天气真好，I wanna go swimming
"""