How does BERT only compute the softmax for the masked hidden vectors?

Hey all! I recently read the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” and I was fascinated by their masked language modeling method of pre-training. However, implementing the method in PyTorch for my own transformer model turned out to be difficult. The paper states:

“In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.”

How is it possible to consider only the masked positions, so that only the transformer encoder's outputs at those positions are fed into the output softmax?

I tried masking the output of the model so that only the masked positions went into the softmax, but the model learned this and output the mask token by default. I felt like this wasn’t a correct implementation of masked language modeling, so I discarded it. Does anyone have any insight? Thanks!
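
For anyone landing here with the same question, below is a minimal sketch of one common reading of that sentence, not the paper's actual code. The idea is to run the encoder over the whole sequence, select the hidden vectors at the masked positions with a boolean mask, and compute cross-entropy against the original token ids (not the [MASK] id). All sizes, the `mask_token_id` value, and the names `embed`, `encoder`, `lm_head`, and `is_masked` are assumptions made up for illustration, and the masked positions are hard-coded instead of sampled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model, batch, seq_len = 100, 32, 2, 8
mask_token_id = 99  # assumed id reserved for the [MASK] token

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
lm_head = nn.Linear(d_model, vocab_size)  # projects hidden vectors to vocab logits

# Original token ids (never equal to mask_token_id here).
tokens = torch.randint(0, vocab_size - 1, (batch, seq_len))

# Boolean mask marking which positions are hidden from the model.
is_masked = torch.zeros(batch, seq_len, dtype=torch.bool)
is_masked[0, 2] = True
is_masked[0, 5] = True
is_masked[1, 1] = True

# The model sees [MASK] at the selected positions, the original ids elsewhere.
inputs = tokens.masked_fill(is_masked, mask_token_id)
hidden = encoder(embed(inputs))          # (batch, seq_len, d_model)

# Keep only the hidden vectors at masked positions, then apply the softmax
# (inside cross_entropy) over the vocabulary for those positions alone.
masked_hidden = hidden[is_masked]        # (num_masked, d_model)
logits = lm_head(masked_hidden)          # (num_masked, vocab_size)
targets = tokens[is_masked]              # original ids, not mask_token_id
loss = F.cross_entropy(logits, targets)

# Alternative with the same loss value: compute logits everywhere and let
# cross_entropy skip the unmasked positions via ignore_index.
labels = tokens.masked_fill(~is_masked, -100)
loss_full = F.cross_entropy(lm_head(hidden).view(-1, vocab_size),
                            labels.view(-1), ignore_index=-100)
```

Both variants average the loss over the masked positions only. The first avoids projecting unmasked positions to the vocabulary, while the second is often simpler to batch. If a model collapses to predicting the mask token, one thing worth checking is whether the targets are the original ids and not the masked inputs.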