I am trying to build a domain specific RoBERTa model and I need clarification on the attention_mask usage. The glossary (https://huggingface.co/transformers/glossary.html#attention-mask) says "The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them."
I would think masking would be a multiplication of attention scores with the attention_mask, not addition. Could someone clarify this for me?
My purpose in looking into attention_mask is to be able to use it as a quick way of assigning fixed weights to certain tokens. For example, if I have a domain specific sentence like this: "Parasequence boundaries may be distinguished by differences in physical and chemical properties", I care more about the domain specific words than the common language words. I assumed attention_mask could be used as a domain specific weighting. Thanks!
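For concreteness, this is roughly how I obtain the attention_mask right now (just a sketch assuming the standard roberta-base tokenizer; the batch is only an illustration):

```python
from transformers import AutoTokenizer

# Sketch: assumes the roberta-base checkpoint
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sentences = [
    "Parasequence boundaries may be distinguished by differences in physical and chemical properties",
    "A short sentence",
]

# Padding the batch makes the tokenizer emit an attention_mask:
# 1 for real tokens, 0 for padding tokens.
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["attention_mask"])
```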
Hi! The attention mask accepts 0s and 1s: a 1 indicates that the token should be attended to, while a 0 indicates a token that should not be attended to.
In the models the mask is modified here (inversion): tokens that had a value of 1 now have 0, and tokens that had a value of 0 now have -10000.
This value is then added to the attention scores here: if a token had a 1 initially, nothing changes; if it had a 0 initially, its attention scores become very negative values, nullifying its impact on the sequence.
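Roughly, the inversion and addition look like this (a minimal sketch of the idea, not the exact library code; shapes and values are made up):

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])    # 1 = real token, 0 = padding

# Inversion: 1 -> 0, 0 -> -10000
extended_mask = (1.0 - attention_mask) * -10000.0   # shape (batch, seq_len)

# Dummy attention scores of shape (batch, heads, seq_len, seq_len)
scores = torch.randn(1, 12, 5, 5)

# Addition: masked key positions get a score of roughly -10000,
# so the softmax over the key dimension sends their weight to ~0.
masked_scores = scores + extended_mask[:, None, None, :]
probs = torch.softmax(masked_scores, dim=-1)
print(probs[0, 0])   # columns 3 and 4 are ~0 for every query
```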
Unfortunately, you won't be able to use this attention mask to assign fixed weights to certain tokens. The attention_mask can be used to nullify the impact of certain tokens on the rest of the sequence, but that's about it.
Suppose my sentence is:
["Mr", "Obama", "had", "a", "very", "good", "standing", "among", "world", "leaders", "."]
and I give its attention_mask as:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Since attention, in the context of transformers, is basically Q·Kᵀ (queries dotted with keys), my doubt is: what does "ignoring" mean?
Will the attention score only involve Q_obama · K_obama?
Or does it mean the following:
while encoding "Obama", the self-attention will encode it by attending to all the other words,
and while encoding the other words, the self-attention will not look at the other words/tokens in the text?
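To make the question concrete, this is how I picture the mask being applied (only a sketch with made-up random Q and K matrices, not actual model code):

```python
import torch

tokens = ["Mr", "Obama", "had", "a", "very", "good", "standing",
          "among", "world", "leaders", "."]
attention_mask = torch.tensor([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.float)

d = 8
Q = torch.randn(len(tokens), d)   # one query vector per token (random, for illustration)
K = torch.randn(len(tokens), d)   # one key vector per token (random, for illustration)

scores = Q @ K.T / d ** 0.5                            # raw attention scores, (11, 11)
scores = scores + (1.0 - attention_mask) * -10000.0    # mask broadcasts over the key dimension
probs = torch.softmax(scores, dim=-1)

# Weight each query puts on the "Obama" key position
print(probs[:, 1])
```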