I am a beginner and really need your help, thank you.
In pretraining, the classic BERT input and output dimensions are both 768, and cross entropy is then computed as the loss. So why is the output dimension of "BertForMaskedLM" the vocabulary (token) count?
BertForMaskedLM's output:
(decoder): Linear(in_features=768, out_features=20000, bias=True)
BERT's output:
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
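To illustrate where that vocabulary-sized layer is used, here is a minimal sketch assuming the Hugging Face transformers library and the "bert-base-uncased" checkpoint (your model apparently uses a custom vocabulary of 20000 tokens instead of 30522). The MLM head projects each 768-dim hidden state to one score per vocabulary token, and the cross-entropy loss is computed between those scores and the true token id at each masked position:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The MLM decoder maps 768-dim hidden states to vocabulary-size logits.
print(model.cls.predictions.decoder)
# Linear(in_features=768, out_features=30522, bias=True)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Labels: -100 everywhere (ignored by the loss), except at the masked
# position, where we put the id of the token we want predicted ("paris").
labels = torch.full_like(inputs["input_ids"], -100)
mask_pos = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_pos] = tokenizer.convert_tokens_to_ids("paris")

outputs = model(**inputs, labels=labels)
print(outputs.logits.shape)  # [1, seq_len, 30522]: one score per vocab token
print(outputs.loss)          # cross-entropy over the vocabulary dimension
```

The 768-to-768 Linear plus Tanh you pasted is the pooler of the base BERT model, which only summarizes the sequence; it is not the layer that the masked-LM loss is computed on.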