Yeah.
Mostly right. Two fixes:
- The ByteLevel pre-tokenizer splits the text (with a GPT-2-style regex, roughly at whitespace and character-category boundaries) and remaps each UTF-8 byte to a printable code point, but it does not hand you a visible list of single characters. It outputs “pre-tokens” with offsets. Then the BPE model merges those mapped characters into vocab tokens. Decoding later inverts the byte→Unicode mapping; a sketch of that mapping is shown below. (Hugging Face)
- The English space is encoded into tokens with a visible space marker (e.g., Ġ). That’s why you see tokens like Ġwanna. Offsets can therefore include the leading space. (Hugging Face)
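To make the byte→Unicode remapping concrete, here is a minimal sketch of the GPT-2-style mapping table, reimplemented inline for illustration (so no particular helper name inside transformers is assumed). It shows how a space becomes Ġ and how the three UTF-8 bytes of each Chinese character become three printable stand-ins.

# Minimal sketch of the GPT-2-style byte -> printable-Unicode mapping
# used by ByteLevel BPE tokenizers (reimplemented here for illustration).
def bytes_to_unicode():
    # Printable byte values map to themselves; the rest are shifted up to 0x100+.
    bs = list(range(ord("!"), ord("~") + 1)) \
       + list(range(ord("¡"), ord("¬") + 1)) \
       + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

BYTE_MAP = bytes_to_unicode()

def to_byte_level(text: str) -> str:
    # Encode to UTF-8, then map every byte to its printable stand-in.
    return "".join(BYTE_MAP[b] for b in text.encode("utf-8"))

print(to_byte_level(" wanna"))  # Ġwanna   (0x20 -> Ġ)
print(to_byte_level("今天"))    # ä»Ĭå¤©  (two 3-byte characters -> six stand-ins)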
Walk-through on your example
Input:
"今天天气真好,I wanna go swimming"
Pre-tokenizer output (conceptual)
- Operation: normalize → UTF-8 bytes → map bytes to printable Unicode → split on whitespace → keep offsets.
- Pre-tokens (by whitespace; note there is no space after the full-width comma, so "I" stays attached to the Chinese chunk at this stage):
  ["今天天气真好,I", "wanna", "go", "swimming"]
- Character offsets (start, end) over the original string:
  [(0,8), (9,14), (15,17), (18,26)]
These are spans in the original text, not in the mapped “mojibake” string. (Hugging Face)
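If you want to inspect the real pre-tokenizer output (the fast tokenizer uses a regex, not a pure whitespace split), you can call it directly on the Rust backend. A small sketch, assuming the same Qwen checkpoint as the verification snippet below:

from transformers import AutoTokenizer

s = "今天天气真好,I wanna go swimming"
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)

# backend_tokenizer is the underlying `tokenizers` object; its pre_tokenizer
# returns (byte-mapped piece, offsets) pairs. The pieces already look
# "garbled" for CJK; that is the byte→Unicode remapping at work.
for piece, offsets in tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(s):
    print(repr(piece), offsets)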
BPE model output
- Runs merges inside each pre-token over the mapped characters.
- Typical token strings and offsets (character indexes in the original text):
  - Chinese chunk:
    ['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į'] → [(0,2), (2,4), (4,5), (5,6), (6,7)]
    which correspond to ['今天', '天气', '真', '好', ',']
  - English:
    ['I', 'Ġwanna', 'Ġgo', 'Ġswimming'] → [(7,8), (8,14), (14,17), (17,26)]
    where Ġ indicates the preceding space is part of the token span.
- Map token strings → ids, e.g. [100644, 104307, 88051, 52801, ...].
- No human decoding has happened yet; these are internal symbols. Qwen uses byte-level BPE on UTF-8, so this behavior is expected and guarantees no OOV. (Qwen)
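To convince yourself that these token strings are just an internal spelling rather than corrupted text, you can map them back without running the full decode path. A small sketch, again assuming the Qwen checkpoint used in the verification snippet:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好,I wanna go swimming"

raw = tok.tokenize(s)                     # byte-mapped vocab strings (look garbled for CJK)
ids = tok.convert_tokens_to_ids(raw)      # the integers the model actually sees
print(raw)
print(ids)
print(tok.convert_tokens_to_string(raw))  # inverts the byte map -> readable text again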
Post-processor
- Adds special tokens (BOS/EOS, chat template pieces) if configured. It does not “fix” readability. (Hugging Face)
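A quick way to see what, if anything, the post-processor adds is to compare ids with and without special tokens. A sketch under the same assumptions as above; for Qwen the two lists may well be identical for plain text, since chat-template tokens come from apply_chat_template rather than from ordinary encoding:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "I wanna go swimming"

# Any difference between these two id lists comes from the post-processor
# (BOS/EOS and similar); the token strings themselves are unchanged either way.
print(tok(s)["input_ids"])
print(tok(s, add_special_tokens=False)["input_ids"])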
Decoder (when you call decode / batch_decode)
- Inverts the byte→Unicode mapping and restores spaces, yielding normal text.
- Fast tokenizers also expose return_offsets_mapping=True so you can slice the original string per token without decoding each id. (Hugging Face)
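If you do decode id by id instead, any token whose bytes cover only part of a multi-byte character cannot be rendered on its own, which is exactly why offsets are the safer way to get per-token text. A short comparison sketch, same assumed checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
s = "今天天气真好,I wanna go swimming"
enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)

# Per-id decoding can yield '�' whenever a single token holds an incomplete
# UTF-8 sequence; slicing by offsets always returns clean pieces of the input.
print([tok.decode([i]) for i in enc["input_ids"]])
print([s[a:b] for a, b in enc["offset_mapping"]])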
Quick rules to remember
- tokenize() / convert_ids_to_tokens() → raw vocab strings (mapped bytes). They will look garbled for non-ASCII. Correct by design.
- decode() / batch_decode() → runs the decoder → human text.
- return_offsets_mapping=True (fast tokenizers) → character spans over the original text for each final token.
- English tokens may include a leading space (Ġ...), so their offsets can start at the space. This depends on tokenizer settings like add_prefix_space and post-processing; be mindful of offset edge cases. (Hugging Face)
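To see the add_prefix_space effect in isolation, here is a small sketch using the gpt2 tokenizer, chosen only because add_prefix_space is a documented option for it (support varies by checkpoint, so treat this as illustrative):

from transformers import AutoTokenizer

text = "wanna go swimming"
default_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
prefixed_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True, add_prefix_space=True)

# With add_prefix_space=True the text is treated as if it started with a space,
# so the first word also gets a leading Ġ.
print(default_tok.tokenize(text))
print(prefixed_tok.tokenize(text))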
Minimal verification snippet
# deps:
# pip install --upgrade "transformers>=4.44" "tokenizers>=0.15"
from transformers import AutoTokenizer
s = "今天天气真好,I wanna go swimming"
tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B", use_fast=True)
enc = tok(s, add_special_tokens=False, return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
spans = [s[a:b] for a,b in enc["offset_mapping"]]
print(tokens) # internal strings (byte-mapped), includes Ġ for spaces
print(spans) # human-readable per-token text slices
print(tok.decode(enc["input_ids"])) # original text
"""
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½', 'ï¼Į', 'I', 'Ġwanna', 'Ġgo', 'Ġswimming']
['今天', '天气', '真', '好', ',', 'I', ' wanna', ' go', ' swimming']
今天天气真好,I wanna go swimming
"""