Tokenization is the process by which words/sub-words are mapped to numerical indices that have corresponding embeddings. As I understand it, tokens used to be decided by byte pair encoding (BPE), which builds the vocabulary by repeatedly merging the most frequent symbol pairs in a training corpus, effectively carving English text into common chunks.
Things seem to have changed. I’m curious if anyone knows how it’s done now, and specifically how encoding works when the vocabulary contains overlapping tokens, e.g. “F”, “Fo”, “For”, “Form”, etc. (i.e. these are all unique, separate tokens) and the tokenizer is asked to encode a word like “Formula”. Here’s a real vocabulary in which this is the case: vocab.json · Qwen/Qwen2.5-14B-Instruct-1M at main
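To make the question concrete, here’s a small sketch (assuming the Hugging Face `transformers` library is installed and can load the tokenizer for the linked checkpoint) that encodes “Formula” and prints which of those overlapping pieces the tokenizer actually picks:

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the vocab.json linked above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct-1M")

text = "Formula"
token_ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)     # the sub-word pieces the tokenizer chose
print(token_ids)  # their numerical indices in vocab.json
```

What I’d like to understand is the rule the encoder follows here: given that “F”, “Fo”, “For”, “Form”, etc. are all valid vocabulary entries, how does it decide which segmentation of “Formula” to output?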