Can attention_mask hold float values in [0,1] in T5? How do these masks act in attention blocks?

Hello Everybody,

I was wondering whether the attention_mask input for T5 could be filled with floats in [0,1] instead of the integers described in the documentation:

attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked. 
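
For reference, this is the kind of integer mask I get from the tokenizer when padding a batch (a minimal sketch; "t5-small" and the example sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
batch = tokenizer(["hello how are you", "hi"], padding=True, return_tensors="pt")

# The usual 0/1 mask: 1 on real tokens (including </s>), 0 on the pads.
print(batch.attention_mask)  # e.g. tensor([[1, 1, 1, 1, 1], [1, 1, 0, 0, 0]])
```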

Do you think that passing something like the following (I write tokens for clarity, but they would be token ids)

tokens               = ['hello','how','are','you','pad','pad']
attention_mask       = [0.5,    0.9,  0.2, 1,     0,      0] 

in order to give particular emphasis to some tokens with respect to the others (while still keeping the components at 0 for pad tokens) would make sense?
Clearly the model would require fine-tuning, but I wonder whether this different usage of attention_mask could harm it in some way…
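
For concreteness, here is a minimal sketch of the kind of call I have in mind ("t5-small", the random weights, and the use of labels=input_ids are just placeholders to get a forward pass; only the shapes matter):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(["hello how are you"], padding="max_length",
                   max_length=8, return_tensors="pt")

# Start from the usual 0/1 mask and scale the real-token positions by
# arbitrary weights in (0, 1]; pad positions stay exactly 0.
weights = torch.rand_like(inputs.attention_mask, dtype=torch.float32).clamp(min=0.1)
soft_mask = inputs.attention_mask.float() * weights

# Forward pass with the float-valued mask (labels only so that a loss is
# returned; in practice this would be part of a fine-tuning loop).
out = model(input_ids=inputs.input_ids,
            attention_mask=soft_mask,
            labels=inputs.input_ids)
print(out.loss)
```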

Looking at transformers/modeling_t5.py at main · huggingface/transformers · GitHub, it is not really clear to me how the attention_mask acts inside the attention blocks (see here, for instance: transformers/modeling_t5.py at 8f46ac98498dd47701971064617e00d7e723a98e · huggingface/transformers · GitHub).
I was expecting some kind of “hard” attention, but as far as I can see it is a “soft” implementation that shifts the position_bias. How does this translate into removing the contribution of the pad tokens from the attention? (Is a shift derived from the 1s and 0s of the original attention_mask enough to ensure a reasonable suppression of the pads?)
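
To make the question concrete, this is how I paraphrase what the linked code seems to do with the mask (my own sketch, not the library code; the exact large negative constant seems to depend on the transformers version, e.g. -1e4 in older releases or torch.finfo(dtype).min in newer ones):

```python
import torch
import torch.nn.functional as F

def extend_mask(attention_mask, dtype=torch.float32):
    # (batch, seq_len) -> (batch, 1, 1, seq_len), broadcast over heads and query positions.
    extended = attention_mask[:, None, None, :].to(dtype)
    # 1 -> 0.0 (keep), 0 -> a very large negative bias (suppressed after softmax).
    return (1.0 - extended) * torch.finfo(dtype).min

batch, heads, seq = 1, 2, 6
scores = torch.randn(batch, heads, seq, seq)          # q @ k^T (T5 does not rescale by sqrt(d))
position_bias = torch.zeros(batch, heads, seq, seq)   # relative-position bias (zeros for simplicity)

attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])   # last two positions are pads
position_bias = position_bias + extend_mask(attention_mask)

attn_weights = F.softmax(scores + position_bias, dim=-1)
print(attn_weights[0, 0, 0])  # the columns for the pad positions come out ~0
```

If that reading is right, then a float value in the mask would multiply that huge constant through the (1 - value) factor, which is part of what I find confusing.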

Any answer is very welcome! Thank you!
