Why does BertForMaskedLM have a decoder layer?

Hi @ccfeidao

For your first question about the decoder and the hidden vs. output size: internally, the model first projects the input tokens, which live in the vocabulary space (20'000 dimensions in your case), down to the hidden size (768 in your case). The BERT layers then process these 768-dimensional embeddings. Finally, after the last BERT layer we need to get back from the hidden size to the vocabulary size so that the outputs correspond to actual tokens. That's what the decoder layer does: it takes embeddings of dim=768 and projects them to dim=20000.
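To make that concrete, here is a minimal sketch (it assumes the `transformers` and `torch` packages, and just plugs your 20'000 / 768 numbers into a fresh config) that prints the two projections and the output shape:

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Assumed sizes from your setup: vocabulary of 20'000 tokens, hidden size 768.
config = BertConfig(vocab_size=20000, hidden_size=768)
model = BertForMaskedLM(config)

# The input embedding maps token ids (vocab space) to 768-dimensional vectors.
print(model.bert.embeddings.word_embeddings)  # Embedding(20000, 768, ...)

# The decoder of the MLM head does the reverse projection: 768 -> 20000.
print(model.cls.predictions.decoder)          # Linear(in_features=768, out_features=20000, ...)

# A forward pass shows the shapes end to end: one logit per vocabulary entry.
input_ids = torch.randint(0, config.vocab_size, (1, 8))
logits = model(input_ids).logits
print(logits.shape)                           # torch.Size([1, 8, 20000])
```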

As for your second question about pretraining: there is a tutorial on Google Colab.
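In case the link gets lost, the rough shape of MLM pretraining with the Trainer API looks like this. This is only a sketch: the tokenizer path and the toy corpus are hypothetical placeholders, and the Colab tutorial covers the real setup in detail.

```python
from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical tokenizer trained on your own 20'000-token vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("path/to/your/tokenizer")

config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=768)
model = BertForMaskedLM(config)

# Toy corpus just to show the plumbing; replace with your own text dataset.
dataset = Dataset.from_dict({"text": ["your training sentences go here",
                                      "one example per line"]})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator randomly masks 15% of the tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```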
