Why does BertForMaskedLM have a decoder layer?

Hi @ccfeidao

For your first question about the decoder and the hidden vs. output size: internally, the model first projects the input tokens, which live in the vocabulary space (20'000 dimensions in your case), down to the hidden size (768 in your case). The BERT layers then process these 768-dimensional embeddings. Finally, after the last BERT layer we need to get back from the hidden size to the vocabulary size so that the outputs correspond to actual tokens. That's what the decoder layer does: it takes embeddings of dim=768 and projects them to dim=20000.
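To make that concrete, here is a minimal sketch (it assumes the `transformers` and `torch` packages, and just plugs your 20'000 / 768 numbers into a fresh config) that prints the two projections and the output shape:

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Assumed sizes from your setup: vocabulary of 20'000 tokens, hidden size 768.
config = BertConfig(vocab_size=20000, hidden_size=768)
model = BertForMaskedLM(config)

# The input embedding maps token ids (vocab space) to 768-dimensional vectors.
print(model.bert.embeddings.word_embeddings)  # Embedding(20000, 768, ...)

# The decoder of the MLM head does the reverse projection: 768 -> 20000.
print(model.cls.predictions.decoder)          # Linear(in_features=768, out_features=20000, ...)

# A forward pass shows the shapes end to end: one logit per vocabulary entry.
input_ids = torch.randint(0, config.vocab_size, (1, 8))
logits = model(input_ids).logits
print(logits.shape)                           # torch.Size([1, 8, 20000])
```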

As for your second question about pretraining: there is a tutorial on Google Colab.
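In case the link gets lost, the rough shape of MLM pretraining with the Trainer API looks like this. This is only a sketch: the tokenizer path and the toy corpus are hypothetical placeholders, and the Colab tutorial covers the real setup in detail.

```python
from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical tokenizer trained on your own 20'000-token vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("path/to/your/tokenizer")

config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=768)
model = BertForMaskedLM(config)

# Toy corpus just to show the plumbing; replace with your own text dataset.
dataset = Dataset.from_dict({"text": ["your training sentences go here",
                                      "one example per line"]})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator randomly masks 15% of the tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```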
