I am trying to build a domain specific RoBERTa model and I need clarification on the attention_mask usage. The glossary (https://huggingface.co/transformers/glossary.html#attention-mask) says "The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them."
I would think masking would be a multiplication of attention scores with the attention_mask, not addition. Could someone clarify this for me?
My purpose in looking into attention_mask is to be able to use it as a quick way of assigning fixed weights to certain tokens. For example, if I have a domain specific sentence like this: "Parasequence boundaries may be distinguished by differences in physical and chemical properties", I care more about the domain specific words than the common language words. I assumed attention_mask could be used as a domain specific weighting. Thanks!
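For concreteness, this is roughly how I obtain the attention_mask right now (just a sketch assuming the standard roberta-base tokenizer; the batch is only an illustration):

```python
from transformers import AutoTokenizer

# Sketch: assumes the roberta-base checkpoint
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sentences = [
    "Parasequence boundaries may be distinguished by differences in physical and chemical properties",
    "A short sentence",
]

# Padding the batch makes the tokenizer emit an attention_mask:
# 1 for real tokens, 0 for padding tokens.
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["attention_mask"])
```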
Hi! The attention mask accepts 0s and 1s: a 1 indicates that the token should be attended to, while a 0 indicates a token that should not be attended to.
In the models the mask is modified here (inversion): tokens that had a value of 1 now have 0, and tokens that had a value of 0 now have -10000.
This value is then added to the attention scores here: if a token had a 1 initially, nothing changes; if it had a 0 initially, its attention scores become very negative values, nullifying its impact on the sequence.
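Roughly, the inversion and addition look like this (a minimal sketch of the idea, not the exact library code; shapes and values are made up):

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])    # 1 = real token, 0 = padding

# Inversion: 1 -> 0, 0 -> -10000
extended_mask = (1.0 - attention_mask) * -10000.0   # shape (batch, seq_len)

# Dummy attention scores of shape (batch, heads, seq_len, seq_len)
scores = torch.randn(1, 12, 5, 5)

# Addition: masked key positions get a score of roughly -10000,
# so the softmax over the key dimension sends their weight to ~0.
masked_scores = scores + extended_mask[:, None, None, :]
probs = torch.softmax(masked_scores, dim=-1)
print(probs[0, 0])   # columns 3 and 4 are ~0 for every query
```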
Unfortunately, you won't be able to use this attention mask to assign fixed weights to certain tokens. The attention_mask can be used to nullify the impact of certain tokens on the rest of the sequence, but that's about it.
Suppose my sentence is:
["Mr", "Obama", "had", "a", "very", "good", "standing", "among", "world", "leaders", "."]
and I give its attention_mask as:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Since attention, in the context of transformers, is basically Q·Kᵀ (queries dotted with keys), my doubt is: what does "ignoring" mean?
Will the attention score only involve Q_obama · K_obama?
Or does it mean the following:
while encoding "Obama", the self-attention will encode it by attending to all the other words,
and while encoding the other words, the self-attention will not look at the other words/tokens in the text?
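To make the question concrete, this is how I picture the mask being applied (only a sketch with made-up random Q and K matrices, not actual model code):

```python
import torch

tokens = ["Mr", "Obama", "had", "a", "very", "good", "standing",
          "among", "world", "leaders", "."]
attention_mask = torch.tensor([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.float)

d = 8
Q = torch.randn(len(tokens), d)   # one query vector per token (random, for illustration)
K = torch.randn(len(tokens), d)   # one key vector per token (random, for illustration)

scores = Q @ K.T / d ** 0.5                            # raw attention scores, (11, 11)
scores = scores + (1.0 - attention_mask) * -10000.0    # mask broadcasts over the key dimension
probs = torch.softmax(scores, dim=-1)

# Weight each query puts on the "Obama" key position
print(probs[:, 1])
```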