Suppose I have a standard BERT model whose Transformer layers each have 16 attention heads. I want to constrain the first head of every layer to attend only to tokens in the same sentence, while the other 15 heads can attend to all (non-padding) tokens, which is the default behavior.
I looked at `head_mask`, but that merely specifies which heads to deactivate (0/1). I also looked at `attention_mask`, but it does not provide a way to specify different masks for different heads.
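For reference, this is roughly how those two masks look when passed to the model (a minimal sketch; `bert-base-uncased` is just a placeholder checkpoint):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Drums, drums in the deep.", return_tensors="pt")

# attention_mask: (batch_size, seq_len), 1 for real tokens and 0 for padding.
# It is applied identically to every layer and every head.
attention_mask = inputs["attention_mask"]

# head_mask: (num_layers, num_heads), 1.0 keeps a head and 0.0 zeroes out its
# output entirely. It says nothing about *which tokens* a head may attend to.
head_mask = torch.ones(
    model.config.num_hidden_layers, model.config.num_attention_heads
)

outputs = model(**inputs, head_mask=head_mask)
```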
Any suggestions would be awesome!
EDIT - example
The input is a number of sentences, let’s say:
Drums, drums in the deep. We cannot get out. They are coming.
I want the first head of multi-head attention to attend only to tokens/words in the same sentence. So, when calculating the dot-product attention for “We”, the only words considered are “We cannot get out”; the other sentences are ignored. This can be specified by building a `num_words x num_words` mask for the first head and, for each row, placing 1s for words in the same sentence and 0s for words not in the same sentence.
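To make that concrete, here is a minimal sketch of the mask I have in mind, assuming I already have a sentence id per word (`sentence_ids` below is hard-coded just for this example; in practice it would come from sentence splitting plus the tokenizer's word mapping):

```python
import torch

# "Drums, drums in the deep. We cannot get out. They are coming."
# One sentence id per word: 5 words in sentence 0, 4 in sentence 1, 3 in sentence 2.
sentence_ids = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

# same_sentence[i, j] == 1 iff word i and word j belong to the same sentence,
# i.e. a block-diagonal num_words x num_words mask.
same_sentence = (sentence_ids.unsqueeze(0) == sentence_ids.unsqueeze(1)).long()

# What I would like per layer: head 0 uses the block-diagonal sentence mask,
# while the remaining heads attend to all tokens as usual.
num_heads = 16
seq_len = sentence_ids.size(0)
per_head_mask = torch.ones(num_heads, seq_len, seq_len, dtype=torch.long)
per_head_mask[0] = same_sentence
```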
However, there doesn’t seem to be a clean way of specifying per-head attention masks. I want to make sure I am not missing some obvious way of doing this with the Hugging Face APIs.