A question about the pretraining method in the Transformers package

I am a beginner and would really appreciate your help, thank you.

During pretraining, the classic BERT model has input and output dimensions of 768, and cross entropy is then computed as the loss, so why is the output dimension of BertForMaskedLM the vocabulary (token) size?

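To make my question concrete, this is roughly how I understand the masked-LM loss being computed; a minimal PyTorch sketch, assuming a hidden size of 768 and a 20000-token vocabulary (the numbers come from my own setup, and the `decoder` here is just a stand-in Linear layer mirroring the module printed below):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 20000
batch, seq_len = 2, 16

# Hidden states coming out of the BERT encoder: (batch, seq_len, 768)
hidden_states = torch.randn(batch, seq_len, hidden_size)

# A stand-in for the MLM decoder: projects each position from 768 to vocab_size
decoder = nn.Linear(hidden_size, vocab_size)
logits = decoder(hidden_states)  # (batch, seq_len, 20000)

# Cross entropy compares the vocab-sized logits against the masked token ids
labels = torch.randint(0, vocab_size, (batch, seq_len))
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```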
BertForMaskedLM's output layer:
(decoder): Linear(in_features=768, out_features=20000, bias=True)

BERT's output layer:
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
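For reference, this is roughly how I printed the two layers above; a minimal sketch, using "bert-base-uncased" as a placeholder checkpoint (my own model has a custom 20000-token vocabulary, so the decoder's out_features differ):

```python
from transformers import BertModel, BertForMaskedLM

# Placeholder checkpoint; my own model uses a custom 20000-token vocabulary.
model_name = "bert-base-uncased"

base = BertModel.from_pretrained(model_name)
mlm = BertForMaskedLM.from_pretrained(model_name)

# Plain BERT ends in the pooler: Linear(768 -> 768) followed by Tanh
print(base.pooler)

# BertForMaskedLM adds an MLM head whose decoder maps 768 -> vocab_size
print(mlm.cls.predictions.decoder)
```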