Specify attention masks for some heads in multi-head attention

Suppose I have 16-head Transformer layers in a standard BERT model. I want to constrain the first head of all the transformer layers to attend to tokens only in the same sentence, while the other 15 heads can attend to all the (non-padding) tokens (which is the default).

I looked at head_mask, but that merely specifies which heads to deactivate (0/1).

I also looked at attention_mask, but that does not provide a way to specify different masks for different heads.
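For reference, this is roughly how those two arguments look in a forward call; neither lets me say "head 0 may only attend to these positions" (checkpoint name is just the usual bert-base-uncased, and the snippet is only a sketch):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Drums, drums in the deep. We cannot get out.", return_tensors="pt")

# attention_mask: (batch_size, seq_len), 1 = attend, 0 = padding.
# It is shared by every head in every layer.
attention_mask = inputs["attention_mask"]

# head_mask: (num_layers, num_heads), 1 = keep the head, 0 = switch it off.
# It scales each head's output as a whole; it says nothing about which
# positions that head may attend to.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
# head_mask[:, 0] = 0.0  # this would disable head 0 entirely, not restrict it

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=attention_mask,
    head_mask=head_mask,
)
```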

Any suggestions would be awesome!

EDIT - example

The input is a number of sentences, let’s say:

Drums, drums in the deep. We cannot get out. They are coming.

I want the first head of multi-head attention to attend only to tokens/words in the same sentence. So, when calculating the dot-product attention for “We”, the only words considered are “We cannot get out”. The other sentences are ignored. This can be specified by building a num_words x num_words mask for the first head and, in each row, placing 1s for words in the same sentence and 0s for words not in the same sentence.
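For example, assuming I have already computed a sentence index for every token (the sentence splitting itself is on my side), the mask could be built like this:

```python
import torch

def same_sentence_mask(sentence_ids):
    """Build a (num_tokens, num_tokens) mask where entry (i, j) is 1 if
    tokens i and j belong to the same sentence, else 0.

    sentence_ids: 1-D tensor of length num_tokens giving, for each token,
    the index of the sentence it belongs to (a hypothetical input derived
    from my own sentence splitting).
    """
    sentence_ids = sentence_ids.unsqueeze(0)          # (1, num_tokens)
    return (sentence_ids == sentence_ids.T).long()    # (num_tokens, num_tokens)

# "Drums, drums in the deep. | We cannot get out. | They are coming."
# Word-level sentence indices, just for illustration:
sentence_ids = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
mask = same_sentence_mask(sentence_ids)
# The row for "We" has 1s only over "We cannot get out."
```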

However, there doesn’t seem to be a clean way of specifying per-head attention masks. I want to make sure that I am not missing some obvious way of doing this using the huggingface methods.
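Just to make the idea concrete, this is what applying such a per-head mask would look like in a plain PyTorch attention computation; presumably I would have to subclass or patch BertSelfAttention to push something like this into the model, which is the part I was hoping to avoid:

```python
import torch
import torch.nn.functional as F

def attention_with_per_head_mask(q, k, v, head_mask_4d):
    """Scaled dot-product attention with an additive per-head mask.

    q, k, v:       (batch, num_heads, seq_len, head_dim)
    head_mask_4d:  (batch, num_heads, seq_len, seq_len), 0 where attention
                   is allowed and -inf where it is blocked.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + head_mask_4d
    weights = F.softmax(scores, dim=-1)
    return weights @ v

batch, num_heads, seq_len, head_dim = 1, 16, 12, 48
q = k = v = torch.randn(batch, num_heads, seq_len, head_dim)

# Start from "attend to everything" for all 16 heads ...
mask = torch.zeros(batch, num_heads, seq_len, seq_len)

# ... and block cross-sentence attention for head 0 only,
# using the same-sentence structure from the previous snippet.
sentence_ids = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
same_sent = sentence_ids.unsqueeze(0) == sentence_ids.unsqueeze(1)
mask[:, 0][:, ~same_sent] = float("-inf")

out = attention_with_per_head_mask(q, k, v, mask)  # (1, 16, 12, 48)
```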

Hi arunirc,

(I’m not sure I understand what you want to do. Are you planning to do a Next Sentence Prediction type of task, where you input pairs of texts having up to 512 tokens each? Or, when you say “same sentence” do you mean a set of words separated by full stops?)

Within each layer, after the attention heads, the outputs from the individual heads need to be concatenated before the feed-forward network. For that concatenation to work, I suspect the matrices need to be the same size. If so, then I don’t think you could have one head attending only to a few tokens, because its matrix would be smaller.

Have you considered using two completely separate BERT models, one of which has only 1 head per layer and takes as input only the first text, and the other of which has 15 heads per layer and takes as input the pairs of texts, and then combining the outputs later?
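Something along these lines, just as a sketch (the checkpoint, the config values and the way of combining the outputs are placeholders; note that hidden_size must be divisible by num_attention_heads, so I used the 12-head default for the full model rather than 15):

```python
import torch
from transformers import BertConfig, BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Model A: a BERT with a single head per layer, fed the first sentence only.
config_single = BertConfig(num_hidden_layers=12, num_attention_heads=1, hidden_size=768)
model_single = BertModel(config_single)

# Model B: a standard multi-head BERT fed the full text.
model_full = BertModel.from_pretrained("bert-base-uncased")

full_text = "Drums, drums in the deep. We cannot get out. They are coming."
first_sentence = "We cannot get out."

inputs_full = tokenizer(full_text, return_tensors="pt")
inputs_sent = tokenizer(first_sentence, return_tensors="pt")

with torch.no_grad():
    out_full = model_full(**inputs_full).pooler_output     # (1, 768)
    out_sent = model_single(**inputs_sent).pooler_output   # (1, 768)

# Combine the two representations however suits the downstream task,
# e.g. concatenation followed by a task-specific head.
combined = torch.cat([out_full, out_sent], dim=-1)          # (1, 1536)
```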

Hi rgwatwormhill,

thanks for taking the time to answer this. I put in an example in the main question which might help explain what I am trying to do.

Yes, using two completely separate BERT models seems to be the cleanest way to go! Of course, I would try to avoid that if the Hugging Face library does provide a way to specify per-head masks that I may be overlooking.

How about Longformer?

It can be used to do local attention for all words in a text plus global attention for specified words. Maybe you could specify all tokens except the first sentence to have global attention, and the first sentence to have only local attention.

(I’ve never used Longformer, so I don’t know whether the parameters are flexible enough for your scenario.)
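From the documentation, the per-token switch is global_attention_mask; something like the following, untested, where the sentence-boundary index is just a placeholder you would compute from the tokenized text:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Drums, drums in the deep. We cannot get out. They are coming."
inputs = tokenizer(text, return_tensors="pt")
seq_len = inputs["input_ids"].size(1)

# global_attention_mask: (batch_size, seq_len), 1 = global attention, 0 = local only.
# Per the suggestion above: everything outside the first sentence gets global
# attention, the first sentence stays local.
global_attention_mask = torch.ones(1, seq_len, dtype=torch.long)
first_sentence_end = 8  # hypothetical token index of the first sentence boundary
global_attention_mask[:, :first_sentence_end] = 0

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```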