Transformer architecture and theory

Hi! This is a theoretical question. Does anyone know of material that explains the Transformer model "for dummies", in enough detail that a single forward pass could be worked out in Excel? Ideally with concrete examples: say we have 5 sentences, positional encoding happens at this step, and the result is such-and-such a matrix. I'm especially interested in where Q, K, V and all that magic come from. Is the encoder's output a matrix of sentences or of individual words? And how does the decoder turn a matrix back into text?

Hello :slightly_smiling_face:

First, I’d recommend Jay Alammar’s wonderful “The Illustrated Transformer” (also in YouTube form, if you’re more of a video person).

Regarding your more specific question, the Transformer has a set vocabulary of words (or subwords, as is more common nowadays). For example, BERT has around 30,000 subwords in its vocabulary. We also have mappings from word to index and index to word, so we can convert between them. The Transformer’s decoder has a softmax layer at the end, which gives a probability distribution across all words in the vocabulary, with every index corresponding to a word. Once we pick an index, we use our mapping to get the corresponding word.
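To make that last step concrete, here's a minimal sketch in NumPy (the tiny vocabulary and the logit values are made up purely for illustration; a real model like BERT would have ~30,000 entries and learned logits):

```python
import numpy as np

# A toy vocabulary; real models use ~30,000 subwords.
vocab = ["the", "cat", "sat", "mat", "[UNK]"]
word_to_index = {w: i for i, w in enumerate(vocab)}
index_to_word = {i: w for i, w in enumerate(vocab)}

# Pretend the decoder's final linear layer produced these scores
# (one per vocabulary entry) for the next output position.
logits = np.array([0.1, 2.5, 0.3, 1.2, -0.5])

# Softmax turns the scores into a probability distribution over the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum()

# Pick the most likely index and map it back to a word (greedy decoding).
predicted = index_to_word[int(np.argmax(probs))]
print(predicted)  # "cat"
```

In practice you don't always take the argmax; sampling or beam search are common, but the index-to-word lookup at the end is the same.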

Okay, good. But according to the article, a vector is built for each word — how are its dimension and values determined? Is it just a 1 among 0s marking the word's position? And at the encoder's output, do we get a probability distribution computed from these vectors via the self-attention mechanism? And how are K, V, and Q defined — randomly at first, and then trained? All the articles show the vectors as colored cubes, but nowhere is there a concrete example of the numeric transformations.

The dimension of the word vectors, d_{model}, is a hyperparameter. In the Transformer paper, d_{model} = 512. Word vectors are initialized randomly, and then learned during training.
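So it's not a one-hot vector: each word index selects a learned row from an embedding table. A sketch (sizes shrunk from d_{model} = 512 to 4 so it fits on screen; the values are random, exactly as they would be before training):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5  # toy vocabulary
d_model = 4     # 512 in the Transformer paper

# The embedding table: one learned row per vocabulary entry,
# initialized randomly and updated by gradient descent during training.
embedding_table = rng.normal(size=(vocab_size, d_model))

# "Vectorizing" a sentence is just a row lookup by word index.
sentence_indices = [0, 1, 2]           # e.g. "the cat sat"
X = embedding_table[sentence_indices]  # shape (3, d_model): one row per word
print(X.shape)  # (3, 4)
```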

The output of the encoder is often called the "context" representation. It isn't comprised of probabilities but of raw continuous values — a matrix with one vector per input token. This representation "remembers the important parts" of the input; probabilities only appear at the very end of the decoder, after the final linear layer and softmax.

Q, K, and V are matrices defined by packing the word embeddings x_{1}, \ldots, x_{n} into a matrix X (one embedding per row), and then multiplying X by learned weight matrices W_Q, W_K, and W_V: Q = XW_Q, K = XW_K, V = XW_V. The weight matrices start out random and are learned during training.
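Putting the pieces together, here's a minimal NumPy sketch of one self-attention pass (dimensions shrunk for readability; the embeddings and weights are random, as they would be at initialization):

```python
import numpy as np

rng = np.random.default_rng(42)

n, d_model, d_k = 3, 4, 4  # 3 words; d_model is 512 in the paper

# Packed word embeddings: one row per word in the sentence.
X = rng.normal(size=(n, d_model))

# Learned projection matrices (random here, trained in practice).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) word-to-word scores
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
output = weights @ V                            # (n, d_k): one new vector per word

print(output.shape)  # (3, 4)
```

Each row of `weights` sums to 1 and says how much each word attends to every other word; each output row is the corresponding weighted mix of the value vectors. That per-word matrix is what flows onward through the encoder.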