Precise meaning of `d_head` and `d_inner`

I am trying to train a Transformer-XL model from scratch, but I am struggling to understand the meaning of d_head and d_inner in the config.
I understand d_head as being the dimension of the value vector after attention has been applied, but I have no clue what d_inner should be.
The doc only states that:

d_inner (int, optional, defaults to 4096) — Inner dimension in FF

What does FF mean here?

It stands for FeedForward. d_inner is the dimensionality of the hidden layer of the feed-forward network (FF or FFN, also called an MLP since it is a multilayer perceptron) inside each layer of the Transformer-XL model.
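To make that concrete, here is a minimal pure-Python sketch of such a feed-forward block applied to a single token vector. The sizes are toy values and the helper names (`linear`, `relu`, `init_linear`) are made up for illustration; the real layer also applies dropout, a residual connection, and layer normalization, which are left out here.

```python
import random

def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, v) for v in x]

def init_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for a layer mapping in_dim -> out_dim."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

# Toy sizes for illustration only; real configs use much larger values.
d_model, d_inner = 8, 32
rng = random.Random(0)

w1, b1 = init_linear(d_model, d_inner, rng)  # first linear: d_model -> d_inner
w2, b2 = init_linear(d_inner, d_model, rng)  # second linear: d_inner -> d_model

token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
hidden = relu(linear(token, w1, b1))  # hidden layer, width d_inner
out = linear(hidden, w2, b2)          # back to width d_model

print(len(token), len(hidden), len(out))  # 8 32 8
```

So d_inner only sets how wide the block is in the middle; the vector that comes out has the same width as the one that went in.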

I see. So just to be clear then: does this mean that the very first attention layer of the transformer needs an input of dimension d_embed, and the second attention layer expects an input of dimension d_inner?

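Edit: re-reading the original “Attention Is All You Need” paper, I realize that the feed-forward block in each layer actually consists of two linear layers, and that d_inner is the output dimension of the first one. The second one brings the representation back to the model dimension (d_model in the Transformer-XL config, which equals d_embed when the two are set to the same value), so every attention layer receives an input of the same size.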
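For anyone else wondering the same thing, here is a rough sketch of the per-token vector width at each stage of one layer. The function `layer_shapes` is a hypothetical helper written for this post, not part of any library, and the numbers are illustrative values chosen to go with the d_inner default of 4096 quoted above; check them against your own config.

```python
def layer_shapes(d_model, n_head, d_head, d_inner):
    """Return (stage, width per token) pairs for one transformer layer."""
    return [
        ("layer input", d_model),
        # Each head works with vectors of size d_head; all heads together
        # give n_head * d_head.
        ("queries / keys / values, all heads", n_head * d_head),
        # The attention output is projected back to the model width.
        ("attention output projection", d_model),
        # First feed-forward linear layer expands to d_inner.
        ("feed-forward hidden", d_inner),
        # Second feed-forward linear layer shrinks back to d_model.
        ("layer output", d_model),
    ]

# Illustrative values (assumed, not read from any particular checkpoint).
shapes = layer_shapes(d_model=1024, n_head=16, d_head=64, d_inner=4096)
for stage, width in shapes:
    print(f"{stage:36s} {width}")
```

The first and last entries are both d_model, which is why layers can be stacked: d_head and d_inner only show up inside a layer, never at its input or output.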
