I’m sorry if this is a basic question, but I’m new to NLP and trying to understand BERT. According to the paper:
“the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM”
and
“we only predict the masked words rather than reconstructing the entire input”.
I can’t get my head around how BERT knows whether a token is a [MASK], so that it can feed only that embedding through the softmax and use it in the loss.
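To make my confusion concrete, here is roughly how I imagine the loss being computed (a NumPy sketch with made-up tensor sizes and positions, not actual BERT code): the data pipeline that replaces tokens with [MASK] would also have to record their positions, so the loss can gather just those hidden vectors. Is something like this what happens?

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab_size = 6, 4, 10

# Hypothetical final hidden states for one sequence (seq_len x hidden)
hidden_states = rng.normal(size=(seq_len, hidden))
# Hypothetical output projection onto the vocabulary
W = rng.normal(size=(hidden, vocab_size))

# The masking step itself records WHICH positions were masked
# and what the original token ids were (made-up values here).
masked_positions = np.array([1, 4])  # positions replaced by [MASK]
original_ids = np.array([3, 7])      # the tokens that were there before

# Softmax over the vocabulary, computed only at the masked positions
logits = hidden_states[masked_positions] @ W
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross-entropy averaged over the masked positions only,
# ignoring every unmasked position in the sequence
loss = -np.log(probs[np.arange(len(masked_positions)), original_ids]).mean()
print(loss)
```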