Is the attention mask and tokenization taken into account?

I’m using BERT for token classification.
The `outputs.logits` of my model has shape `(batch_size, padded_input_length, num_labels)`.
Moreover, when I take the argmax over the last dimension (`num_labels`), it turns out there are predictions even for the pad tokens (the positions whose labels I set to -100), and they are spread across the whole range of `num_labels`.
So, how are those positions taken into account? In the documentation it says the attention mask tells the model which tokens not to attend to. But the model still outputs predictions for those tokens. Is the forward pass happening as usual on all inputs, with the pad / -100 positions only ignored in the loss computation?
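For context, here is a minimal sketch of what I think is going on, assuming the loss is a standard `CrossEntropyLoss`, whose `ignore_index` defaults to -100: the logits exist for every position, but the -100 positions contribute nothing to the loss. (The toy logits and labels below are made up for illustration.)

```python
import torch
import torch.nn as nn

# Toy logits for batch_size=1, seq_len=4, num_labels=3.
# Positions 2 and 3 stand in for pad tokens; their labels are -100.
logits = torch.tensor([[[2.0, 0.5, 0.1],
                        [0.2, 1.5, 0.3],
                        [0.9, 0.9, 0.9],    # pad position: still gets logits
                        [0.1, 0.1, 0.1]]])  # pad position: still gets logits
labels = torch.tensor([[0, 1, -100, -100]])

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# Loss over all positions, relying on ignore_index:
full_loss = loss_fn(logits.view(-1, 3), labels.view(-1))

# Same loss computed by hand on only the real (non -100) tokens:
mask = labels.view(-1) != -100
manual_loss = loss_fn(logits.view(-1, 3)[mask], labels.view(-1)[mask])

print(torch.allclose(full_loss, manual_loss))  # the pad positions contribute nothing
```

If that's right, then at inference time I'd simply mask out the predictions at the -100 / pad positions myself before decoding.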