I am trying to build a domain-specific RoBERTa model and need clarification on how the attention_mask is used. The glossary (https://huggingface.co/transformers/glossary.html#attention-mask) says: "The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them."
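For reference, this is the standard padding usage as I understand it (a small sketch assuming roberta-base and a padded batch):

import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
batch = tokenizer(
    ["Parasequence boundaries may be distinguished", "Short sentence"],
    padding=True,             # pad the shorter sequence up to the longer one
    return_tensors="pt",
)
print(batch["attention_mask"])  # 1 for real tokens, 0 for the padded positions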
Looking at the implementation on GitHub (https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_roberta.py), I see:
if attention_mask is not None:
    # Apply the attention mask (precomputed for all layers in RobertaModel forward() function)
    attention_scores = attention_scores + attention_mask
I would have thought masking would be a multiplication of the attention scores by the attention_mask, not an addition. Could someone clarify this for me?
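To make my confusion concrete, here is a small numerical sketch of what I think get_extended_attention_mask() in modeling_utils.py does to the binary mask before the snippet above runs (the shapes and the -10000.0 constant come from my reading of the source; treat this as my understanding, not the exact library code):

import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])             # 1 = keep, 0 = pad
extended = attention_mask[:, None, None, :].float()          # shape (batch, 1, 1, seq_len)
extended = (1.0 - extended) * -10000.0                       # keep -> 0.0, pad -> -10000.0

attention_scores = torch.randn(1, 1, 5, 5)                   # dummy scores (batch, heads, query, key)
probs = torch.softmax(attention_scores + extended, dim=-1)   # the addition happens before softmax
print(probs[0, 0, 0])                                        # the two padded columns come out ~0.0

So adding a large negative number before the softmax seems to drive the padded positions to near-zero weight, which would be why it is an addition rather than a multiplication, but I would like confirmation that I am reading it correctly.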
My reason for looking into attention_mask is to use it as a quick way of assigning fixed weights to certain tokens. For example, given a domain-specific sentence like "Parasequence boundaries may be distinguished by differences in physical and chemical properties", I care more about the domain-specific words than the common-language words, and I assumed attention_mask could be used as a domain-specific weighting.
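Concretely, this is roughly what I had in mind (a purely hypothetical sketch; I don't know whether non-binary float masks are actually supported, which is part of my question, and the token index for "Parasequence" is just a guess):

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

sentence = ("Parasequence boundaries may be distinguished by differences "
            "in physical and chemical properties")
inputs = tokenizer(sentence, return_tensors="pt")

# Hypothetical: start from the usual all-ones mask and boost a domain term.
weights = inputs["attention_mask"].float()
weights[0, 1] = 2.0  # index 1 is my guess at the first subword of "Parasequence"

outputs = model(input_ids=inputs["input_ids"], attention_mask=weights)

Thanks!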