BERT: What is the shape of each Transformer Encoder block in the final hidden state?

Hi everyone,
I am studying the BERT paper after having studied the Transformer.

[image: bert]

The thing I can’t understand yet is the output of each Transformer Encoder in the last hidden state (Trm before T1, T2, etc… in the image).

In particular, as I understand it, thanks (somehow) to the Positional Encoding, the leftmost Trm represents the embedding of the first token, the second from the left represents the embedding of the second token, and so on.

Hence, the shape of each one of them should be simply the hidden_dim (for example, 768) if what I have said before is true.

However, I am not convinced of this logic; so is the answer correct?

Many thanks in advance

Your answer is correct, but your logic seems shaky. The input embedding has nothing to do with the final shape. In other words, you could still get an output for each token with a linear layer that maps 1*n_tokens to 768*n_tokens, and that’s it. In the case of the transformer, the embedding has the same size as the hidden states of the encoder, but it is conceivable to have transformations in between that deal with different shapes, e.g. a smaller embedding size to reduce dimensionality for computational efficiency.

So in your image, assuming the encoder block has a hidden size of 768, the output will be N*768, simply because 768 is the output size of the last layer in the encoder block.
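To make the shape argument concrete, here is a minimal PyTorch sketch. The sizes (5 tokens, a vocabulary of 1000, a 128-dimensional embedding, 12 heads) are arbitrary assumptions for illustration and are not BERT's actual configuration; the point is only that the output width follows the encoder's hidden size, not the embedding size.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not BERT's real configuration).
n_tokens, vocab_size = 5, 1000
embed_dim, hidden_dim = 128, 768  # embedding deliberately smaller than hidden size

embedding = nn.Embedding(vocab_size, embed_dim)
project = nn.Linear(embed_dim, hidden_dim)  # maps 128 -> 768 per token
encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_dim, nhead=12, batch_first=True
)

token_ids = torch.randint(0, vocab_size, (1, n_tokens))  # one sequence of N tokens
embedded = embedding(token_ids)             # shape: (1, N, 128)
hidden = encoder_layer(project(embedded))   # shape: (1, N, 768)

print(tuple(embedded.shape), tuple(hidden.shape))
```

PyTorch puts the batch dimension first here, so the N*768 from the text shows up as the last two dimensions.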

Thanks for the reply @BramVanroy.
However, when you say

does this mean that the output of each transformer in the final hidden layer will be N*768, or is that the “combined” output of all transformers in the final hidden layer?

I think that my problem concerns the fact that BERT, in every hidden layer, uses a number of transformer encoders equal to the number of input tokens.
In the image, if we have N tokens, then each hidden layer has N encoders.
But if each encoder outputs a value of shape N*768, then there is a problem.

P.S.: just to clarify, I use the term Hidden Layer to indicate the “Trm” horizontal blocks between the input and the output.
In the image, the hidden layer size is 2.

You seem to be confused about terminology. I often find that it helps to look at source code. For instance, have a look at how Transformer is implemented in PyTorch. The Transformer itself comprises both the whole encoder and the whole decoder, and each of them consists of N layers, where each layer is the typical multi-head attention sublayer followed by a feed-forward sublayer, with dropout and normalisation (it’s all there in the source code). In LMs we often only use the encoder.

The image that you posted is just an illustrative example, not a visualisation of the actual architecture. E1 is the input embedding of the first token, E2 the embedding of the second token, and so on. These are fed into your transformer encoder (here you have two encoder layers), but in reality BERT has more layers, depending on the model size.

You should not use the term hidden layer size like that. That’s not what it typically refers to. The blue cells are not individual encoders either: each row of blue circles is a single encoder layer, and every layer outputs hidden_dim*n_tokens to the next.
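As a rough sketch of that last point, the snippet below stacks two encoder layers with PyTorch's `nn.TransformerEncoder` and records the shape after each one. The random input standing in for E1..EN and the sizes (5 tokens, 12 heads, 2 layers) are assumptions chosen to mirror the picture, not BERT's real settings.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration; the figure shows 2 layers, real BERT has more.
n_tokens, hidden_dim, num_layers = 5, 768, 2

layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

x = torch.randn(1, n_tokens, hidden_dim)  # stands in for the embeddings E1..EN

shapes = []
for enc_layer in encoder.layers:  # each "row" of Trm circles is one such layer
    x = enc_layer(x)              # the whole sequence goes through the layer at once
    shapes.append(tuple(x.shape))

first_token = x[0, 0]  # final hidden state of the first token (T1 in the figure)
print(shapes, tuple(first_token.shape))
```

Every layer sees and returns the full (batch, n_tokens, hidden_dim) tensor, and slicing out one token from the final output gives a single vector of size hidden_dim.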

I suggest you re-read the transformer paper followed by the BERT paper.

This was the answer I was looking for, many thanks.
Initially, I was thinking along these lines; however, the BERT paper includes a comparison between BERT and the previous state-of-the-art models, in particular GPT (which I have not studied).

[image: gpt]

Looking at the image, I imagined that each blue circle was effectively a separate Transformer.
Then, following your logic, in the case of GPT, only the first token is given as input to the first multi-head attention layer, the first two tokens to the second layer, and so on.
Is that correct?

No, as I said before, those circles are not individual parts of the model. This is a very simplified visualization of BERT vs. others, simply to illustrate the directionality of the model. BERT is bidirectional, so it can predict a token in a masked position by taking into account both left and right context. GPT is autoregressive and can only do left-to-right prediction/generation. That’s what the illustration shows. If you want to understand the architecture of BERT, you should reread the Vaswani et al. paper and look at the illustration of the Transformer. BERT is basically the encoder-only part of that.
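A toy way to picture the difference in directionality: in both models every layer receives all tokens, and what changes is which positions each token is allowed to attend to. The helper below is a hypothetical illustration in plain Python (`visible_positions` is not part of any library), listing the visible positions for a 4-token sequence.

```python
def visible_positions(n_tokens, causal):
    """For each position i, list the positions j it may attend to.

    causal=False: every position sees the whole sequence (BERT-like).
    causal=True:  position i sees only positions 0..i (GPT-like).
    """
    return [
        [j for j in range(n_tokens) if not causal or j <= i]
        for i in range(n_tokens)
    ]

bert_like = visible_positions(4, causal=False)
gpt_like = visible_positions(4, causal=True)

print(bert_like)  # [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
print(gpt_like)   # [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
```

In practice this restriction is applied as a mask inside the attention computation, so the arrows in the figure describe attention visibility rather than separate inputs per layer.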

Thanks for the suggestions, I will certainly do so.