How does BERT know if a token is a mask for prediction and loss

PanKo · June 30, 2022, 8:23pm

I’m sorry if this is a basic question, but I’m new to NLP and trying to understand BERT. According to the paper:

“the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.”
…
“we only predict the masked words rather than reconstructing the entire input”.

I can’t get my head around how BERT knows whether a token is a mask, so that it can feed that token’s embedding through a softmax and use it for the loss?
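One way to picture the mechanism being asked about, as a minimal sketch and not the paper’s actual code: the model itself does not need to detect masks. The data pipeline records which positions were selected for prediction, and the loss is computed only there. The sketch below assumes the convention of marking every non-selected position with a label of -100 and skipping those positions in the cross-entropy; the toy vocabulary, `MASK_ID`, and the helper `masked_lm_loss` are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # assumed marker meaning "this position is not a prediction target"


def masked_lm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: one list of vocabulary scores per input position
    labels: original token id at selected positions, IGNORE_INDEX elsewhere
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # unselected position: contributes nothing to the loss
        # Numerically stable log-sum-exp, i.e. the log of the softmax denominator
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[label])  # negative log-probability of the true token
    return sum(losses) / len(losses) if losses else 0.0


# Toy setup: vocabulary of 4 ids, with id 3 standing in for [MASK] (hypothetical)
MASK_ID = 3
original_ids = [0, 2, 1]
input_ids = [0, MASK_ID, 1]               # position 1 was selected and replaced
labels = [IGNORE_INDEX, 2, IGNORE_INDEX]  # only position 1 carries a target

# Pretend these are the model's output scores for each position
logits = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 9.0, 0.0, 0.0],
]
loss = masked_lm_loss(logits, labels)
```

Note that the loss is driven by `labels`, not by looking for `MASK_ID` in `input_ids`. That matters because the paper does not always replace a selected token with the mask token (some selected tokens are swapped for a random token or left unchanged), so the selected positions have to be tracked separately from the input.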

