In the Transformer paper (Vaswani et al.), the output dimension of the encoder is d_model = 512. Is the hidden size in BERT (denoted H in the BERT paper) actually the same quantity as d_model in the Transformer? And if so, why does it change from 512 to 768 in BERT-base?
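For reference, here is where I got the 768 from, assuming the Hugging Face `transformers` library (the config fields below come from that library, not from either paper):

```python
from transformers import BertConfig

# The default BertConfig corresponds to BERT-base; hidden_size is the
# per-token representation dimension, i.e. what I believe plays the
# role of d_model in the original Transformer.
config = BertConfig()
print(config.hidden_size)          # 768 (H in the BERT paper)
print(config.num_hidden_layers)    # 12  (L)
print(config.num_attention_heads)  # 12  (A)
```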