Hi. My issue relates to this issue. I wanted to have a graph-like input and was trying to mimic the behaviour of graph neural networks by masking the tokens (nodes) outside of the neighbourhood. Specifically, if I had, for example, a graph like `I -> am -> hungry`, and the mapping from nodes to tokens was one to one, I would like to have an attention mask like `[[1, 1, 0], [0, 1, 1], [0, 0, 1]]`, meaning the token `I` would attend to itself and to the token `am`, but not to `hungry`, since it is outside its neighbourhood.
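For concreteness, here is a minimal sketch of how such a neighbourhood mask could be built from the graph's adjacency matrix; the toy 3-node graph and variable names are just for illustration:

```python
import torch

# Adjacency of the directed graph I -> am -> hungry
# (row i == 1 at column j means node i may attend to node j)
adjacency = torch.tensor([[0, 1, 0],
                          [0, 0, 1],
                          [0, 0, 0]])

# Each node also attends to itself, which gives the mask from the example above
neighbourhood_mask = adjacency + torch.eye(3, dtype=adjacency.dtype)
print(neighbourhood_mask)
# tensor([[1, 1, 0],
#         [0, 1, 1],
#         [0, 0, 1]])
```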
Is this behaviour possible? I have been searching and found the `get_extended_attention_mask` function. It does work to send a 3D attention mask to the model. However, this line says that the 3D attention mask should have shape (batch, from_seq_len, to_seq_len), which suggests that this mask is for cross-attention. Are `attention_mask`s meant for self-attention or cross-attention?
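For reference, here is a minimal sketch of how I am passing the 3D mask, assuming a BERT-style encoder (`bert-base-uncased` and the tokenisation are just for illustration, and the exact behaviour may depend on the transformers version and attention implementation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "I am hungry" tokenises to exactly three word pieces without special tokens,
# so the token-to-node mapping is one to one in this toy case
inputs = tokenizer("I am hungry", return_tensors="pt", add_special_tokens=False)

# Neighbourhood mask from the graph, with a leading batch dimension:
# shape (batch, from_seq_len, to_seq_len) = (1, 3, 3)
mask_3d = torch.tensor([[[1, 1, 0],
                         [0, 1, 1],
                         [0, 0, 1]]])

# get_extended_attention_mask broadcasts a 3D mask to
# (batch, 1, from_seq_len, to_seq_len) and turns the zeros into large
# negative values that get added to the attention scores
extended = model.get_extended_attention_mask(mask_3d, inputs["input_ids"].shape)
print(extended.shape)  # torch.Size([1, 1, 3, 3])

# Passing the 3D mask directly as attention_mask also runs
outputs = model(input_ids=inputs["input_ids"], attention_mask=mask_3d)
print(outputs.last_hidden_state.shape)  # (1, 3, hidden_size)
```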
HuggingFace is still working on it. There is a PR to support passing custom attention masks: Allow passing 2D attention mask · Issue #27640 · huggingface/transformers · GitHub