How can I get a list of word segmentation results for a non-English string?

Why did you say Ġ appears in tokens, not pre-tokens? I tried it and got this:
[screenshot: 微信图片_20251106121049_34_2]
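For reference, this is roughly what I ran (a minimal sketch assuming the `gpt2` tokenizer from `transformers`; `backend_tokenizer` exposes the underlying `tokenizers` pipeline, and the outputs may differ for other models):

```python
from transformers import AutoTokenizer

# Assumption: gpt2 is just an example of a byte-level BPE tokenizer
tok = AutoTokenizer.from_pretrained("gpt2")

text = "hello world"

# Pre-tokenization only, before any BPE merges: the byte-level step
# already maps the leading space to Ġ at this stage
print(tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
# e.g. [('hello', (0, 5)), ('Ġworld', (5, 11))]

# Full tokenization (pre-tokenization + BPE merges)
print(tok.tokenize(text))
# e.g. ['hello', 'Ġworld']
```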


By the way, I noticed that the tokenizer can make segmentation errors in Chinese. For example, "法国的首都是巴黎" ("The capital of France is Paris") is segmented as ['法国', '的', '首', '都是', '巴黎'], but the correct result should be ['法国', '的', '首都', '是', '巴黎']. Does that mean BPE is not good enough here, or why not use something like jieba for this kind of work?
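For comparison, a dedicated Chinese segmenter gives the result I expected. A minimal sketch (assuming the `jieba` package; the exact split depends on jieba's dictionary and version):

```python
import jieba

text = "法国的首都是巴黎"  # "The capital of France is Paris"

# jieba segments along dictionary words rather than BPE merge frequency
print(list(jieba.cut(text)))
# expected: ['法国', '的', '首都', '是', '巴黎']
```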
