How can I get a list of word segmentation results for a non-English string?

Yeah, mostly right. Two fixes:

  1. The ByteLevel pre-tokenizer splits on whitespace and remaps every UTF-8 byte to a printable code point, but it does not hand you a readable list of single characters. It outputs “pre-tokens” with offsets. The BPE model then merges those mapped characters into vocab tokens, and decoding later inverts the byte→Unicode mapping (see the mapping sketch after this list). (Hugging Face)

  2. The English space is encoded into tokens with a visible space marker (e.g., Ġ). That’s why you see tokens like Ġwanna. Offsets can therefore include the leading space. (Hugging Face)
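
If it helps to see why the vocab strings look like mojibake, below is a minimal sketch of that byte→Unicode table. It mirrors the reference GPT-2-style bytes_to_unicode construction that byte-level BPE tokenizers use; the sample strings are only illustrations.

# Sketch of the GPT-2-style byte -> printable-Unicode table used by byte-level
# BPE tokenizers. Every one of the 256 byte values gets a visible character,
# which is why no input can ever be out-of-vocabulary.
def bytes_to_unicode():
    # Printable Latin-1 bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):  # ... everything else is shifted past U+00FF (0x20 -> Ġ, etc.)
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

B2U = bytes_to_unicode()
print("".join(B2U[b] for b in "今天".encode("utf-8")))    # -> ä»Ĭå¤©  (2 chars, 6 bytes)
print("".join(B2U[b] for b in " wanna".encode("utf-8")))  # -> Ġwanna (the space becomes Ġ)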

Walk-through on your example

Input:
"今天天气真好,I wanna go swimming"

Pre-tokenizer output (conceptual)

  • Operation: normalize → UTF-8 bytes → map bytes to printable Unicode → split on whitespace → keep offsets.
  • Pre-tokens (by whitespace):
    ["今天天气真好,", "I", "wanna", "go", "swimming"]
  • Character offsets (start, end) over the original string:
    [(0,7), (8,9), (10,15), (16,18), (19,27)]
    These are spans in the original text, not in the mapped “mojibake” string; the sketch right after this list reproduces the step. (Hugging Face)
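
You can poke this step directly through the backend tokenizer. A small sketch, assuming the same unsloth/Qwen3-14B checkpoint as the final snippet; the exact chunking depends on the split rule the checkpoint configures:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

# Run only the pre-tokenization step; each entry is (pre-token, (start, end)).
pre = tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s)
print([piece for piece, _ in pre])
# The pieces are already in the byte-mapped alphabet (e.g. '今' shows up as 'ä»Ĭ').
# Note: offsets from pre_tokenize_str can refer to the transformed text; for spans
# over the original string, rely on return_offsets_mapping instead.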

BPE model output

  • Runs merges inside each pre-token over the mapped characters.

  • Typical token strings and offsets (character indexes in original text):

    • Chinese chunk: ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į'] with offsets [(0,2), (2,4), (4,5), (5,6), (6,7)],
      which correspond to ['今天', '天气', '真', '好', '，']
    • English: ['I', 'Ġwanna', 'Ġgo', 'Ġswimming'] with offsets [(8,9), (9,15), (15,18), (18,27)],
      where Ġ indicates the preceding space is part of the token span.
  • Map token strings → ids, e.g. [100644,104307,88051,52801, ...].

  • No human-readable decoding has happened yet; these are internal symbols (a quick per-token check follows this list). Qwen uses byte-level BPE on UTF-8, so this behavior is expected and guarantees no OOV. (Qwen)
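
A quick way to see the internal-symbol vs. readable-text distinction per token: tokenize() returns the raw vocab strings and convert_tokens_to_string() runs the decoder over any subset of them. A sketch with the same assumed checkpoint; exact token boundaries and ids depend on that checkpoint's merges:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

raw = tok.tokenize(s)                 # byte-mapped vocab strings, e.g. 'ä»Ĭå¤©'
ids = tok.convert_tokens_to_ids(raw)  # integer ids; byte-level BPE means never OOV
per_token = [tok.convert_tokens_to_string([t]) for t in raw]  # decoder applied token by token

print(raw)        # internal symbols, garbled-looking for non-ASCII
print(ids)
print(per_token)  # readable pieces, e.g. '今天', ' wanna'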

Post-processor

  • Adds configured special tokens (e.g., BOS/EOS) when add_special_tokens is on; chat-template pieces are inserted as text by apply_chat_template before tokenization, not here. It does not “fix” readability (quick check below). (Hugging Face)
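
To check what, if anything, gets added for a given checkpoint, compare the two encodings. A small sketch; whether any ids differ depends entirely on the tokenizer's configuration:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

with_special    = tok(s)["input_ids"]
without_special = tok(s, add_special_tokens=False)["input_ids"]

# If the two lists differ, the extra ids are whatever special tokens this
# checkpoint's post-processor is configured to add; many chat models add none
# here and only gain control tokens through apply_chat_template.
print(len(with_special), len(without_special))
print(tok.convert_ids_to_tokens(with_special[:3]))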

Decoder (when you call decode / batch_decode)

  • Inverts the byte→Unicode mapping and restores spaces, yielding normal text (see the per-id example after this list).
  • Fast tokenizers also expose return_offsets_mapping=True so you can slice the original string per token without decoding each id. (Hugging Face)
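
Decoding also works id by id, which is often the simplest way to get readable per-token pieces without touching offsets. A sketch with the same assumed checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好，I wanna go swimming"

ids = tok(s, add_special_tokens=False)["input_ids"]
pieces = [tok.decode([i]) for i in ids]  # decoder inverts the byte->Unicode map per id

print(pieces)           # readable chunks; English ones keep their leading space
print("".join(pieces))  # concatenation reproduces the original text
# Caveat: a token that ends in the middle of a multi-byte character decodes
# alone to U+FFFD; decoding the full id list at once is always safe.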

Quick rules to remember

  • tokenize() / convert_ids_to_tokens() → raw vocab strings (mapped bytes). They will look garbled for non-ASCII. Correct by design.
  • decode() / batch_decode() → runs the decoder → human text.
  • return_offsets_mapping=True (fast tokenizers) → character spans over the original text for each final token.
  • English tokens may include a leading space (Ġ...), so their offsets can start at the space. This depends on tokenizer settings like add_prefix_space and post-processing; be mindful of offset edge cases. (Hugging Face)

Minimal verification snippet

# deps:
# pip install --upgrade "transformers>=4.44" "tokenizers>=0.15"
from transformers import AutoTokenizer
s = "今天天气真好,I wanna go swimming"
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
spans  = [s[a:b] for a,b in enc["offset_mapping"]]

print(tokens)  # internal strings (byte-mapped), includes Ġ for spaces
print(spans)   # human-readable per-token text slices
print(tok.decode(enc["input_ids"]))  # original text

"""
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming']
['今天', '天气', '真', '好', '，', 'I', ' wanna', ' go', ' swimming']
今天天气真好，I wanna go swimming
"""