How can I get a list of word segmentation results for a non-English string?

Why did you say Ġ appears in tokens, not pre-tokens? I tried it and got this:
[screenshot: 微信图片_20251106121049_34_2]
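For reference, this is roughly what I ran (a minimal sketch assuming the `gpt2` tokenizer from `transformers`; `backend_tokenizer` exposes the underlying `tokenizers` pipeline, and the outputs may differ for other models):

```python
from transformers import AutoTokenizer

# Assumption: gpt2 is just an example of a byte-level BPE tokenizer
tok = AutoTokenizer.from_pretrained("gpt2")

text = "hello world"

# Pre-tokenization only, before any BPE merges: the byte-level step
# already maps the leading space to Ġ at this stage
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
# e.g. [('hello', (0, 5)), ('Ġworld', (5, 11))]

# Full tokenization (pre-tokenization + BPE merges)
print(tok.tokenize(text))
# e.g. ['hello', 'Ġworld']
```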


By the way, I noticed that the tokenizer can make segmentation errors in Chinese. For example, "法国的首都是巴黎" ("The capital of France is Paris") is segmented as ['法国', '的', '首', '都是', '巴黎'], but the correct result should be ['法国', '的', '首都', '是', '巴黎']. Does that mean BPE is not good enough here, or why not use something like jieba for this kind of work?
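For comparison, a dedicated Chinese segmenter gives the result I expected. A minimal sketch (assuming the `jieba` package; the exact split depends on jieba's dictionary and version):

```python
import jieba

text = "法国的首都是巴黎"  # "The capital of France is Paris"

# jieba segments along dictionary words rather than BPE merge frequency
print(list(jieba.cut(text)))
# expected: ['法国', '的', '首都', '是', '巴黎']
```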
