Is the attention mask and tokenization taken into account?

I’m using BERT for token classification.
The `outputs.logits` of my model has shape `(batch_size, padded_input_length, num_labels)`.
Moreover, when I take the argmax over the last dimension (`num_labels`), it turns out there are predictions even for the pad tokens (the positions whose labels I set to -100), and they are spread across the whole range of `num_labels`.
So, how are those positions taken into account? In the documentation it says the attention mask tells the model which tokens not to attend to. But the model still outputs predictions for those tokens. Is the forward pass happening as usual on all inputs, with the pad / -100 positions only ignored in the loss computation?
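For context, here is a minimal sketch of what I think is going on, assuming the loss is a standard `CrossEntropyLoss`, whose `ignore_index` defaults to -100: the logits exist for every position, but the -100 positions contribute nothing to the loss. (The toy logits and labels below are made up for illustration.)

```python
import torch
import torch.nn as nn

# Toy logits for batch_size=1, seq_len=4, num_labels=3.
# Positions 2 and 3 stand in for pad tokens; their labels are -100.
logits = torch.tensor([[[2.0, 0.5, 0.1],
                        [0.2, 1.5, 0.3],
                        [0.9, 0.9, 0.9],    # pad position: still gets logits
                        [0.1, 0.1, 0.1]]])  # pad position: still gets logits
labels = torch.tensor([[0, 1, -100, -100]])

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# Loss over all positions, relying on ignore_index:
full_loss = loss_fn(logits.view(-1, 3), labels.view(-1))

# Same loss computed by hand on only the real (non -100) tokens:
mask = labels.view(-1) != -100
manual_loss = loss_fn(logits.view(-1, 3)[mask], labels.view(-1)[mask])

print(torch.allclose(full_loss, manual_loss))  # the pad positions contribute nothing
```

If that's right, then at inference time I'd simply mask out the predictions at the -100 / pad positions myself before decoding.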