Hi everyone,

I am studying the BERT paper, having already studied the Transformer.

What I still can't understand is the output of each Transformer encoder block in the last hidden layer (the Trm blocks just before T1, T2, etc. in the image).

In particular, my understanding is that, thanks (somehow) to the positional encoding, the leftmost Trm represents the embedding of the first token, the second from the left represents the embedding of the second token, and so on.

Hence, if what I said above is true, the output of each of them should simply be a vector of size hidden_dim (for example, 768).
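
To make the shapes I have in mind concrete, here is a minimal NumPy sketch. It is only a toy: a single-head self-attention function (toy_self_attention, a name I made up) stands in for one Trm block, and the embeddings and weights are random, so this is not BERT's actual implementation. It only illustrates that the output has one row per token position:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden_dim = 5, 768  # 5 tokens; 768 as in the example above

# Hypothetical stand-ins for the token and position embeddings (random values).
token_emb = rng.normal(size=(seq_len, hidden_dim))
pos_emb = rng.normal(size=(seq_len, hidden_dim))
x = token_emb + pos_emb  # input to the first layer, one row per token


def toy_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: a simplified stand-in for one Trm block."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v  # same shape as x


# Random projection matrices (assumption: square, hidden_dim x hidden_dim).
w_q, w_k, w_v = (
    rng.normal(scale=hidden_dim ** -0.5, size=(hidden_dim, hidden_dim))
    for _ in range(3)
)
out = toy_self_attention(x, w_q, w_k, w_v)

print(out.shape)     # (5, 768): one row per token position
print(out[0].shape)  # (768,): the vector at the first position
```

If this picture is right, the last layer as a whole would output a (seq_len, hidden_dim) matrix, and each T1, T2, etc. would be one row of it.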

However, I am not convinced by this logic. Is my reasoning correct?

Many thanks in advance