How can I get a list of word segmentation results for a non-English string?

I personally think that, just as there is no perfect natural language (let alone a perfect programming language) for every purpose, there is no perfect tokenizer either. At best, some options are safer and others more flawed.


Yes. ChatGPT (and almost every LLM you use) always runs text through a tokenizer first. OpenAI models use a byte-pair-encoding (BPE) tokenizer (tiktoken). Tokens are subwords/bytes chosen for compression and full coverage, not grammatical “words.” So the Chinese “都是” may be a single token even when you want it split into “都” and “是.” That is expected and by design. (GitHub)
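
You can see this directly with tiktoken. A minimal sketch, assuming tiktoken is installed and using the cl100k_base encoding (exact token boundaries differ between encodings):

```python
import tiktoken

# Load a byte-level BPE encoding; cl100k_base is one of the published tiktoken encodings.
enc = tiktoken.get_encoding("cl100k_base")

text = "法国的首都是巴黎"
ids = enc.encode(text)

# Inspect the raw bytes behind each token. Some pieces cover several characters,
# and some cover only part of a character's UTF-8 bytes: tokens are not words.
for i in ids:
    print(i, enc.decode_single_token_bytes(i))
```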

Why ChatGPT can still reverse “strawberry”

  • The model never manipulates raw characters. It predicts tokens whose bytes decode to characters. Reversing “strawberry” means emitting a token sequence whose decoded bytes spell y r r e b w a r t s. That can be done even if the input tokenization groups “straw” and “berry” together. Tokenization granularity ≠ capability to produce character-level outputs; see the sketch after this list. (GitHub)
  • But tokenization does make some character tasks brittle. This is the well-known “strawberry problem”: models often fail at fine-grained letter tasks because subword tokens hide character boundaries. Multiple studies document this and propose fixes. (arXiv)
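
A rough illustration of the first point, again assuming tiktoken and the cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
reversed_word = word[::-1]  # "yrrebwarts"

# The two strings are split into different subword pieces, but both
# round-trip losslessly: the model can emit tokens that decode to either.
for s in (word, reversed_word):
    ids = enc.encode(s)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(s, "->", pieces)
    assert enc.decode(ids) == s
```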

Why BPE is used instead of a word segmenter like jieba

  • Coverage with no <unk>: byte-level BPE can represent any UTF-8 text and is lossless and reversible. No per-language rules, no OOV (see the snippet after this list). (GitHub)
  • Compression and efficiency: frequent substrings become single tokens, shortening sequences and speeding training/inference. Word segmenters don’t guarantee short sequences across all scripts. (GitHub)
  • Multilingual simplicity: one tokenizer works across languages. A Chinese-specific segmenter would not generalize to other scripts or mixed-script text. (jieba is great for word segmentation, but LLM tokenizers solve a different problem.) (GitHub)
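
A quick way to check the coverage and compression points above (same tiktoken assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Byte-level BPE covers arbitrary UTF-8 with no <unk>, and the round trip is lossless.
samples = ["Grüße, 世界!", "emoji 🙂 and code: print('hi')", "首都是巴黎"]
for s in samples:
    ids = enc.encode(s)
    assert enc.decode(ids) == s                      # lossless, no OOV
    print(f"{len(s):3d} chars -> {len(ids):3d} tokens: {s!r}")
```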

What your screenshot shows

  • You called pre_tokenize_str. In ByteLevel tokenizers, the pre-tokenizer already (a) maps raw bytes to visible placeholder characters and (b) marks spaces with a visible symbol such as Ġ. Seeing Ġ at this stage is normal; the BPE model then applies its merges inside each pre-token, and decode() later inverts the byte mapping to produce human-readable text. (Hugging Face)
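
To reproduce what the screenshot shows using just the pre-tokenizer and its matching decoder (assuming the Hugging Face tokenizers package is installed):

```python
from tokenizers import decoders, pre_tokenizers

# The ByteLevel pre-tokenizer maps raw bytes to visible placeholder characters
# and marks spaces with Ġ; this is display notation, not lost information.
pre = pre_tokenizers.ByteLevel(add_prefix_space=False)
print(pre.pre_tokenize_str("Hello world"))
# e.g. [('Hello', (0, 5)), ('Ġworld', (5, 11))]

# The matching ByteLevel decoder inverts the byte mapping back to readable text.
dec = decoders.ByteLevel()
print(dec.decode(["Hello", "Ġworld"]))  # -> "Hello world"
```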

About the Chinese example

  • [‘法国’, ‘的’, ‘首’, ‘都是’, ‘巴黎’] is not a grammatical parse. It is a subword sequence chosen by frequency statistics. On Chinese, high-frequency bigrams like “都是” often appear as a single token, even across your desired word boundary (“首都 + 是”). If you need linguistically correct words, run a segmenter such as jieba on the original string. Don’t expect the model tokenizer to do this job. (GitHub)
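
For word-level output, a minimal jieba sketch (assuming jieba is installed; the exact split depends on its dictionary and mode):

```python
import jieba

text = "法国的首都是巴黎"

# Dictionary-based word segmentation: this targets grammatical words,
# unlike the frequency-driven subwords of a BPE tokenizer.
print(jieba.lcut(text))                 # typically ['法国', '的', '首都', '是', '巴黎']
print(jieba.lcut(text, cut_all=True))   # "full mode": every dictionary word found
```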

If you need better character-level behavior

  • Keep the model but tweak the input: insert spaces or separators to force single-character tokens for tasks like counting or reversing letters. (A common trick for the strawberry problem; see the sketch after this list.) (Simbian AI)
  • Use or fine-tune models that operate on bytes/characters (e.g., ByT5, CharacterBERT-style approaches). These keep explicit character access but cost more due to longer sequences. (ACL Anthology)
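
A tiny sketch of the separator trick from the first bullet (tiktoken again as an assumption; spaced-out single letters usually map to one token each):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
spaced = " ".join(word)  # "s t r a w b e r r y"

# Spacing the letters out turns character boundaries into token boundaries,
# which makes counting or reversing letters easier for the model.
print(len(enc.encode(word)), "tokens for", repr(word))
print(len(enc.encode(spaced)), "tokens for", repr(spaced))
```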

Quick takeaways

  • ChatGPT uses a tokenizer. Tokens ≠ words. (GitHub)
  • Reversal works because the model outputs tokens that decode to the reversed bytes, not because it “thinks in characters.”
  • BPE trades linguistic neatness for universality and efficiency. For word-level Chinese segmentation, use a dedicated segmenter.

Short, curated references

Tokenizers and BPE

  • OpenAI tiktoken README: properties of BPE, encodings for models. Clear and practical. (GitHub)
  • HF Tokenizers docs: ByteLevel PreTokenizer and Decoder behavior. Shows why you see Ġ and how decoding reverses it. (Hugging Face)

Chinese word segmentation

  • jieba project page. Modes, custom dictionaries, and usage. Good when you need real words. (GitHub)

Character-level limitations and fixes

  • “Strawberry problem” and character-level brittleness in tokenized LMs. Background and evidence. (arXiv)
  • EMNLP 2025 paper on adding character access while keeping tokens. Shows gains on character tasks. (ACL Anthology)