Why did you say that Ġ appears in tokens but not in pre-tokens? I tried it and got this:

By the way, I noticed that the tokenizer can segment Chinese text incorrectly. For example, "法国的首都是巴黎" ("The capital of France is Paris") gets split into ['法国', '的', '首', '都是', '巴黎'], but the correct result should be ['法国', '的', '首都', '是', '巴黎']. Hmm, does that mean BPE is not good enough? Or why not use something like jieba for word segmentation?