Thank you, that's helpful.
For example, say I've got:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-14B")
prompt = "今天天气真好"
prompts = ['今天天气真好', '法国的首都是巴黎']
I just need to use:
print(tokenizer.batch_decode(tokenizer(prompt)["input_ids"]))
or
input_ids = tokenizer(prompt)["input_ids"]
print([tokenizer.decode(input_id) for input_id in input_ids])
or
offset_mapping = tokenizer(prompt, return_offsets_mapping=True)["offset_mapping"]
print([prompt[i:j] for i, j in offset_mapping])
then I can get the right result: ['今天', '天气', '真', '好']
for batch prompts, I can process each prompt and collect the results into a list, like:
print([tokenizer.batch_decode(input_id) for input_id in tokenizer(prompts)["input_ids"]])
then I can get the right results easily: [['今天', '天气', '真', '好'], ['法国', '的', '首', '都是', '巴黎']]
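The offsets trick also seems to work directly on the batch (a quick sketch; I assume a fast tokenizer here, since return_offsets_mapping needs one):
batch = tokenizer(prompts, return_offsets_mapping=True)
# slice each original prompt with its own (start, end) offset pairs
print([
    [p[i:j] for i, j in offsets]
    for p, offsets in zip(prompts, batch["offset_mapping"])
])
which should give the same nested lists as above.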
and I searched a bit about byte-level tokenization; it seems that a weird result like:
['ä»Ĭå¤©', 'å¤©æ°Ķ', 'çľŁ', 'å¥½']
comes from the bytes directly: they made a mapping from each single byte to a character that is easy to print, like:
def bytes_to_unicode():
    # start with the bytes that already are visible, printable characters
    # (basic Latin plus most of Latin-1, skipping the control/space ranges)
    bs = (
        list(range(ord("!"), ord("~") + 1)) +
        list(range(ord("¡"), ord("¬") + 1)) +
        list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs.copy()
    # every remaining byte (control chars, space, 0x7F-0xA0, 0xAD) gets a
    # stand-in codepoint above 255, so it also prints as one visible character
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(code) for code in cs]
    return dict(zip(bs, cs))
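To convince myself the mapping explains the weird tokens, I tried the round trip on the same prompt (just a sketch using the function above and the standard library, assuming Qwen's tokenizer uses this same GPT-2-style mapping):
byte_encoder = bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

# map each UTF-8 byte of the prompt to its printable stand-in character ...
printable = "".join(byte_encoder[b] for b in "今天天气真好".encode("utf-8"))
print(printable)  # 'ä»Ĭå¤©å¤©æ°ĶçľŁå¥½' -- the same kind of string the tokens are spelled in

# ... and invert the mapping to get the original text back
raw = bytes(byte_decoder[ch] for ch in printable)
print(raw.decode("utf-8"))  # '今天天气真好'
so a token like 'ä»Ĭå¤©' is just the printable-byte spelling of '今天'.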
but why do they need that mapping? What is it for?
and why do tokenizer.tokenize, tokenizer.convert_ids_to_tokens, and tokenizer.convert_tokens_to_ids not return or accept a "right" string the way decode and batch_decode do?