I’ve been thinking about how LLMs actually generate the text we see, and I realized something that feels like a small revelation.
When an AI like GPT or Gemini “generates text,” there are really two different processes happening:
- Inside the model (math side):
  - Input text gets split into token IDs.
  - The model runs a softmax over the whole vocabulary to get a probability for every possible next token, and the highest-probability (or sampled) ID is chosen (toy sketch below).
  - Example: ID 15496 = "hello".
- Outside the model (display side):
  - The AI's job is done once it outputs "hello".
  - The browser/app does the actual visual rendering: fonts, pixels, showing you the word on your screen.
So the AI doesn’t “draw” the text — it just sends back the chosen string, and your system makes it visible.
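To make the inside-the-model half concrete, here's a minimal sketch with a made-up five-token vocabulary and hand-written logits (everything here is hypothetical; a real model computes the logits with a neural network over a vocabulary of tens of thousands of tokens):

```python
# Toy sketch of the "inside the model" half: hypothetical vocabulary,
# made-up logits; a real model produces logits from a neural network.
import numpy as np

vocab = ["hello", "world", ",", "!", "<eos>"]          # token string per ID
token_to_id = {tok: i for i, tok in enumerate(vocab)}  # flat lookup table

logits = np.array([3.2, 1.1, 0.3, -0.5, -2.0])         # pretend model output

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits)            # a probability for every ID in the vocab
next_id = int(np.argmax(probs))    # greedy pick; sampling is also common

# The model's job ends here: it hands back a string.
print(next_id, vocab[next_id])     # e.g. 0 "hello"
# Rendering that string (fonts, pixels) is up to the browser/app.
```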
That got me thinking:
- Token IDs today are just flat numbers, basically arbitrary slots in a big lookup table.
- But in principle, you could structure IDs in a binary or hierarchical system (like Huffman codes, tries, or prefix trees); there's a toy sketch right after this list.
- This wouldn't change the fact that browsers handle rendering, but it could make the handoff between AI and display more efficient: through compression, faster decoding, or maybe even new ways of structuring reasoning.
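Here's what that could look like in miniature: a Huffman code built over made-up token frequencies. The tiny vocabulary, the frequencies, and the idea of using the resulting bit strings as IDs are all assumptions for illustration; real tokenizers just hand out flat integers.

```python
# Toy sketch: give tokens variable-length binary IDs via a Huffman code
# (hypothetical frequencies; frequent tokens end up with shorter codes).
import heapq

freqs = {"hello": 50, "world": 30, ",": 15, "!": 4, "<eos>": 1}

# Build the Huffman tree: repeatedly merge the two least-frequent nodes.
heap = [(f, i, tok) for i, (tok, f) in enumerate(freqs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, counter, (left, right)))
    counter += 1

# Walk the tree to assign a bit string to every token.
def assign_codes(node, prefix="", codes=None):
    codes = {} if codes is None else codes
    if isinstance(node, str):                  # leaf: an actual token
        codes[node] = prefix or "0"
    else:
        left, right = node
        assign_codes(left, prefix + "0", codes)
        assign_codes(right, prefix + "1", codes)
    return codes

codes = assign_codes(heap[0][2])
print(codes)   # e.g. "hello" gets a 1-bit code, "<eos>" a longer one
```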
Right now, flat IDs win on raw speed: looking up a token's embedding is a single array index, so a structured ID space hasn't been necessary (quick comparison below). But I wonder if revisiting binary systems for token ID organization could have benefits in scaling, transmission efficiency, or model design.
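Roughly why the flat scheme is so hard to beat (the sizes and the bit-string ID here are made up for illustration):

```python
# Flat IDs: one memory lookup. A Huffman/trie-style ID would instead need
# one step per bit before reaching the same row of the embedding table.
import numpy as np

vocab_size, dim = 50_000, 768
embeddings = np.random.randn(vocab_size, dim).astype(np.float32)

token_id = 15496
vector = embeddings[token_id]      # O(1) direct indexing

bit_id = "01101"                   # hypothetical structured ID
steps = len(bit_id)                # cost grows with code length, not constant
print(vector.shape, steps)
```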
Has anyone seen research that looks at token IDs in this way? Or thoughts on whether a binary/structured ID space could help in practice?