Clarification on the attention_mask

Hello everyone,

I am trying to build a domain-specific RoBERTa model and I need clarification on the attention_mask usage. The glossary (https://huggingface.co/transformers/glossary.html#attention-mask) says: “The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them.”

On the GitHub page (https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_roberta.py), lines 196-198:

if attention_mask is not None:
    # Apply the attention mask is (precomputed for all layers in RobertaModel forward() function)
    attention_scores = attention_scores + attention_mask

I would have thought masking would be a multiplication of the attention scores by the attention_mask, not an addition. Could someone clarify this for me?

My purpose in looking into attention_mask is to use it as a quick way of assigning fixed weights to certain tokens. For example, in a domain-specific sentence like “Parasequence boundaries may be distinguished by differences in physical and chemical properties”, I care more about the domain-specific words than about the common-language words. I assumed attention_mask could be used for domain-specific weighting. Thanks!


Hi! The attention mask accepts 0s and 1s: a 1 indicates that the token should be attended to, while a 0 indicates that the token should not be attended to.

In the models it is modified here (the inversion step), which means that tokens that had a value of 1 now have 0, and tokens that had a value of 0 now have -10000.

This value is then summed with the attention scores here: if a token had a 1 initially, nothing changes. If a token had a 0 initially, its attention score becomes a very negative value, nullifying its impact on the sequence.
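
To make the addition-versus-multiplication point concrete, here is a minimal numerical sketch of that mechanism in plain PyTorch (not the library’s actual code; the shapes and the -10000 constant simply mirror what it does):

import torch

# Toy attention scores for one head: rows are queries, columns are keys
attention_scores = torch.randn(4, 4)

# User-facing mask: 1 = attend, 0 = ignore (e.g. the last token is padding)
attention_mask = torch.tensor([1.0, 1.0, 1.0, 0.0])

# Inversion done inside the model: 1 -> 0, 0 -> -10000
extended_mask = (1.0 - attention_mask) * -10000.0

# Addition as in modeling_roberta.py: the masked key column gets a huge negative score
masked_scores = attention_scores + extended_mask  # broadcasts over the query dimension

# After softmax, the masked column's weight is ~0 for every query, which is why
# adding -10000 before the softmax behaves like multiplying by 0 after it
attention_probs = torch.softmax(masked_scores, dim=-1)
print(attention_probs)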

Unfortunately, you won’t be able to use this attention mask to assign fixed weights to certain tokens. The attention_mask can be used to nullify the impact of certain tokens on the rest of the sequence, but that’s about it.
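
For reference, this is the usual way the mask is passed in practice, a small sketch with roberta-base as a placeholder checkpoint; the tokenizer builds the mask for padding, and hand-setting an entry to 0 removes that token’s influence entirely rather than down-weighting it:

from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

inputs = tokenizer(
    "Parasequence boundaries may be distinguished by differences "
    "in physical and chemical properties",
    return_tensors="pt",
)

# attention_mask is all 1s here (no padding); zeroing an entry would drop that
# token's contribution to every other token's representation -- nothing in between
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)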


Thank you for your response, very clear. I had missed the inversion part. Now it makes perfect sense. Thanks again!


If you see this, I’m wondering whether you pursued this idea, @zb1, and whether you implemented something of your own or found another workaround. Thanks!

Suppose my sentence is:
[‘Mr’, ‘Obama’, ‘had’, ‘a’, ‘very’, ‘good’, ‘standing’, ‘among’, ‘world’, ‘leaders’, ‘.’]
and I give its attention_mask as:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Since attention, in the context of transformers, is essentially Q·K^T (queries multiplied by keys), my doubt is: what does “ignoring” mean here?
Does it mean the attention score is only computed between Q_Obama and K_Obama?
Or does it mean the following: while encoding ‘Obama’, self-attention will attend to all the other words, but while encoding the other words, self-attention will not look at the other words/tokens in the text?
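
In case it helps anyone reading later, here is a small numerical sketch of what that mask does in the additive-mask implementation quoted above (toy shapes and random values; the point is that the mask is added along the key dimension and broadcast over the query dimension):

import torch

torch.manual_seed(0)
seq_len, d = 11, 8  # the 11-token sentence above, toy head size

# Toy queries and keys for a single attention head
Q = torch.randn(seq_len, d)
K = torch.randn(seq_len, d)
scores = Q @ K.T / d ** 0.5  # (seq_len x seq_len): rows are queries, columns are keys

# The mask from the question: only 'Obama' (index 1) is kept
attention_mask = torch.zeros(seq_len)
attention_mask[1] = 1.0
extended_mask = (1.0 - attention_mask) * -10000.0  # 1 -> 0, 0 -> -10000

# The extended mask is added along the key (column) dimension, so every query row
# sees -10000 on all keys except 'Obama'
probs = torch.softmax(scores + extended_mask, dim=-1)

# Every row -- i.e. every query token, 'Obama' included -- now puts essentially all
# of its weight on the 'Obama' column: masked tokens are ignored as keys/values,
# they are not removed as queries
print(probs[:, 1])  # ~1.0 for every token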