I am trying to train a Transformer-XL model from scratch, but I am struggling to understand the meaning of the d_head and d_inner parameters in the config.
I understand d_head as the dimension of each attention head's value vector after attention has been applied, but I have no clue what d_inner should be.
The doc only states that:
d_inner (int, optional, defaults to 4096) — Inner dimension in FF
FF stands for feed-forward. d_inner is the dimensionality of the hidden layer of the feed-forward network (FF, FFN, also called MLP since it is a multilayer perceptron) inside each layer of the Transformer-XL model.
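A minimal sketch of that feed-forward block, using plain NumPy rather than the actual Transformer-XL implementation (the sizes below mirror the defaults discussed here, and I'm assuming d_embed == d_model for simplicity):

```python
import numpy as np

d_model = 1024   # width of the token representations (assumed equal to d_embed)
d_inner = 4096   # hidden width of the position-wise feed-forward block

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_inner)) * 0.02  # first FF projection: expand
W2 = rng.standard_normal((d_inner, d_model)) * 0.02  # second FF projection: contract

def feed_forward(x):
    """Position-wise FFN: expand each token vector to d_inner, apply ReLU,
    then project back down to d_model."""
    h = np.maximum(x @ W1, 0.0)   # shape (seq_len, d_inner)
    return h @ W2                  # shape (seq_len, d_model)

x = rng.standard_normal((10, d_model))  # a sequence of 10 token vectors
y = feed_forward(x)
print(y.shape)  # (10, 1024)
```

The key point is that d_inner only exists inside this block; the output is projected back to d_model before the next layer sees it.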
I see. So just to be clear: does this mean that the very first attention layer of the transformer needs an input of dimension d_embed, and the second attention layer expects an input of dimension d_inner?
Edit: re-reading the original "Attention Is All You Need" paper, I realize that the feed-forward block actually contains two linear layers, and that d_inner is the output dimension of the first one. The second one projects the input back down to d_embed, so every attention layer still receives an input of dimension d_embed.
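That realization is easy to verify with a toy example: because the second linear layer restores the original width, each layer maps d_model to d_model, so layers stack cleanly and d_inner never appears at a layer boundary. A small sketch with hypothetical tiny sizes (not the real defaults):

```python
import numpy as np

d_model, d_inner = 8, 32  # tiny illustrative sizes, not real config values
rng = np.random.default_rng(1)

def make_ffn():
    """Build one two-layer feed-forward block: d_model -> d_inner -> d_model."""
    W1 = rng.standard_normal((d_model, d_inner))
    W2 = rng.standard_normal((d_inner, d_model))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

layer1, layer2 = make_ffn(), make_ffn()
x = rng.standard_normal((5, d_model))
out = layer2(layer1(x))   # stacks cleanly: each block maps d_model -> d_model
print(out.shape)  # (5, 8)
```

If d_inner were the output dimension of a whole layer, `layer2(layer1(x))` would fail with a shape mismatch; it works precisely because the second projection brings the width back to d_model.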