I am a beginner and would really appreciate your help, thank you.
During pretraining, the classic BERT model's input and output dimensions are both 768, and cross entropy is then computed as the loss. So why is BertForMaskedLM's output dimension the vocabulary size (the number of tokens)?
BertForMaskedLM's output layer:
(decoder): Linear(in_features=768, out_features=20000, bias=True)
BERT's output layer:
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
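For context, here is a minimal pure-PyTorch sketch (not the actual Hugging Face implementation) of why the MLM head has out_features equal to the vocabulary size: cross entropy needs one logit per vocabulary token at each masked position, while the base model's pooler only re-projects the 768-d [CLS] state. The sizes (hidden=768, vocab_size=20000) are taken from the printouts above; everything else is made up for illustration.

import torch
import torch.nn as nn

vocab_size, hidden, batch, seq_len = 20000, 768, 2, 8

# Base BERT: every position gets a 768-d hidden state (no token predictions yet).
hidden_states = torch.randn(batch, seq_len, hidden)

# MLM "decoder" head: Linear(768 -> 20000), one score per vocabulary token.
decoder = nn.Linear(hidden, vocab_size)
logits = decoder(hidden_states)                 # (batch, seq_len, vocab_size)

# Labels: the original token ids at masked positions, -100 elsewhere (ignored).
labels = torch.full((batch, seq_len), -100)
labels[:, 3] = torch.randint(0, vocab_size, (batch,))  # pretend position 3 was masked

# Cross entropy compares the 20000-way distribution at each masked position
# with the true token id, which is why the head must project to vocab_size.
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.view(-1, vocab_size), labels.view(-1)
)
print(logits.shape, loss.item())

# The base model's pooler, by contrast, only maps the [CLS] hidden state
# 768 -> 768 with a Tanh; it never produces per-token vocabulary scores.
pooler = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
pooled = pooler(hidden_states[:, 0])            # (batch, 768)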