Tokenization with overlapping tokens

Tokenization is the process by which words/sub-words are mapped to numerical indices that have corresponding embeddings. As I understand it, vocabularies were long built with byte-pair encoding (BPE), which greedily merges the most frequent symbol pairs in a training corpus until the vocabulary reaches a target size.
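Here's my rough mental model of classic BPE encoding, sketched below (the merge table and its ranks are made up purely for illustration): encoding starts from single characters or bytes and greedily applies the learned merge rules in priority order, rather than doing longest-prefix matching against the vocabulary. If that's right, prefixes like "Fo" and "Form" end up in the vocabulary precisely because they were intermediate merge products during training.

```python
# Minimal sketch of classic BPE encoding. The merge table and ranks
# below are MADE UP for illustration; a real tokenizer learns them
# from a training corpus. Encoding starts from single characters
# (or bytes) and repeatedly applies the lowest-ranked merge rule,
# rather than longest-prefix matching against the vocab.
merges = {
    ("F", "o"): 0,
    ("r", "m"): 1,
    ("Fo", "rm"): 2,
    ("u", "l"): 3,
    ("ul", "a"): 4,
    ("Form", "ula"): 5,
}

def bpe_encode(word: str) -> list[str]:
    parts = list(word)
    while True:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = None
        for i in range(len(parts) - 1):
            rank = merges.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            return parts  # no merge rule applies anymore
        _, i = best
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

print(bpe_encode("Formula"))  # ['Formula'] with these toy merges
print(bpe_encode("Forum"))    # ['Fo', 'r', 'u', 'm'] -- no rules for the rest
```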

Things seem to have changed since then. I’m curious whether anyone knows how it’s done now, and specifically how encoding works when the vocabulary contains overlapping tokens, e.g. “F”, “Fo”, “For”, “Form”, etc. (all unique, separate tokens) and the tokenizer is asked to encode a word like “Formula”. Here’s a real vocabulary in which this is the case: vocab.json · Qwen/Qwen2.5-14B-Instruct-1M at main
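One way to poke at this empirically, assuming the transformers library is installed (I haven’t verified what this prints for this particular model):

```python
# Check how the actual tokenizer splits "Formula" (assumes the
# transformers library and access to the Hugging Face Hub).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct-1M")
ids = tok.encode("Formula")
print(ids, tok.convert_ids_to_tokens(ids))
```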


Information theory is very interesting.
I’d love to read about this too.
The process of somehow intuiting associations that align with meaning is an aspect I’m interested in.
But first, welcome @jiosephlee, and congratulations on your first post!
