Tokenization with overlapping tokens

Tokenization is the process by which words/sub-words are mapped to numerical indices that have corresponding embeddings. As I understand it, vocabularies were long built with byte-pair encoding (BPE), which greedily merges the most frequent symbol pairs in a training corpus until the vocabulary reaches a target size.
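Here's my rough mental model of classic BPE encoding, sketched below (the merge table and its ranks are made up purely for illustration): encoding starts from single characters or bytes and greedily applies the learned merge rules in priority order, rather than doing longest-prefix matching against the vocabulary. If that's right, prefixes like "Fo" and "Form" end up in the vocabulary precisely because they were intermediate merge products during training.

```python
# Minimal sketch of classic BPE encoding. The merge table and ranks
# below are MADE UP for illustration; a real tokenizer learns them
# from a training corpus. Encoding starts from single characters
# (or bytes) and repeatedly applies the lowest-ranked merge rule,
# rather than longest-prefix matching against the vocab.
merges = {
    ("F", "o"): 0,
    ("r", "m"): 1,
    ("Fo", "rm"): 2,
    ("u", "l"): 3,
    ("ul", "a"): 4,
    ("Form", "ula"): 5,
}

def bpe_encode(word: str) -> list[str]:
    parts = list(word)
    while True:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = None
        for i in range(len(parts) - 1):
            rank = merges.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            return parts  # no merge rule applies anymore
        _, i = best
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

print(bpe_encode("Formula"))  # ['Formula'] with these toy merges
print(bpe_encode("Forum"))    # ['Fo', 'r', 'u', 'm'] -- no rules for the rest
```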

Things seem to have changed since then. I’m curious whether anyone knows how it’s done now, and specifically how encoding works when the vocabulary contains overlapping tokens, e.g. “F”, “Fo”, “For”, “Form”, etc. (all unique, separate tokens) and the tokenizer is asked to encode a word like “Formula”. Here’s a real vocabulary in which this is the case: vocab.json · Qwen/Qwen2.5-14B-Instruct-1M at main
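One way to poke at this empirically, assuming the transformers library is installed (I haven’t verified what this prints for this particular model):

```python
# Check how the actual tokenizer splits "Formula" (assumes the
# transformers library and access to the Hugging Face Hub).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct-1M")
ids = tok.encode("Formula")
print(ids, tok.convert_ids_to_tokens(ids))
```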


Information theory is very interesting.
I’d love to read about this too.
The process of somehow intuiting associations that align with meaning is an aspect I’m interested in.
But first, welcome @jiosephlee, and congratulations on your first post!
