Transformer architecture and theory

Hello :slightly_smiling_face:

First, I’d recommend Jay Alammar’s wonderful “The Illustrated Transformer” (also in YouTube form, if you’re more of a video person).

Regarding your more specific question: the Transformer has a fixed vocabulary of words (or, more commonly nowadays, subwords). For example, BERT has around 30,000 subwords in its vocabulary. We also keep two mappings, word → index and index → word, so we can convert between them. The Transformer’s decoder ends with a softmax layer, which produces a probability distribution over the whole vocabulary, with each index corresponding to one word. Once we pick an index (e.g., greedily via argmax, or by sampling), we use the index → word mapping to recover the corresponding word. A small sketch of that last step is below.
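
Here’s a minimal sketch of that idea, with a toy vocabulary standing in for BERT’s ~30,000 subwords (the vocabulary and logit values are made up purely for illustration):

```python
import numpy as np

# Toy vocabulary (a real model like BERT has ~30,000 subwords).
vocab = ["[PAD]", "[UNK]", "the", "cat", "sat", "on", "mat", "."]

# The two mappings mentioned above: word -> index and index -> word.
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# Pretend these are the decoder's output scores (logits) for the next token,
# one score per vocabulary entry.
logits = np.array([0.1, 0.2, 1.5, 3.0, 0.3, 0.4, 0.2, 0.1])

# Softmax turns the logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Pick an index (greedily here, via argmax) and map it back to a word.
next_id = int(np.argmax(probs))
print(id_to_token[next_id], probs[next_id])  # -> "cat", the highest-probability entry
```

In practice you’d get the vocabulary and mappings from a tokenizer (e.g. `BertTokenizer.from_pretrained("bert-base-uncased")` in the `transformers` library, whose `convert_tokens_to_ids` / `convert_ids_to_tokens` methods play exactly this role), but the principle is the same.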