Hello Everybody,
I was wondering whether the attention_mask input for T5 could be passed as a float in [0, 1] instead of the 0/1 values described in the documentation:
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
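For reference, the tokenizer normally builds this 0/1 integer mask by itself (a minimal example, assuming a t5-small checkpoint just for illustration):

```python
from transformers import AutoTokenizer

# Standard usage: the tokenizer builds the 0/1 integer mask itself,
# putting 0 on the padding positions of the shorter sequence.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
batch = tokenizer(["hello how are you", "hi"], padding=True, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])  # rows of 1s, with trailing 0s where the second sequence was padded
```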
Do you think that passing something like this (I write tokens for clarity, but they would actually be token IDs)
tokens = ['hello','how','are','you','pad','pad']
attention_mask = [0.5, 0.9, 0.2, 1, 0, 0]
would work to give some tokens a particular emphasis with respect to the others (still keeping the components at 0 for the pad tokens)?
Clearly the model would require fine-tuning, but I wonder whether this different usage of attention_mask could harm it in some way…
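Concretely, what I have in mind is something like the following (a minimal sketch, again assuming t5-small; the specific weight values are arbitrary):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

batch = tokenizer(["hello how are you", "hi"], padding=True, return_tensors="pt")

# Start from the usual integer mask and turn it into float "emphasis" weights,
# keeping 0.0 on the padding positions; the specific values are arbitrary.
soft_mask = batch["attention_mask"].float()
soft_mask[0, :3] = torch.tensor([0.5, 0.9, 0.2])

labels = tokenizer(["fine thanks", "hello"], padding=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=soft_mask,  # float mask instead of the documented 0/1 integer mask
    labels=labels,
)
print(outputs.loss)
```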
Looking at transformers/modeling_t5.py at main · huggingface/transformers · GitHub, it is not really clear to me how the attention_mask acts inside the attention blocks (see for instance transformers/modeling_t5.py at 8f46ac98498dd47701971064617e00d7e723a98e · huggingface/transformers · GitHub).
I was expecting some kind of “hard” masking, but as far as I can see it is a “soft” implementation that shifts the position_bias. How does this translate into removing the ‘pad’ tokens’ contribution from the attention (is a shift of “1”, as in the original attention_mask, enough to ensure a reasonable suppression of the pads)?
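To make my reading explicit, the logic I am referring to boils down to roughly this (a hand-written paraphrase of what I understand T5Attention.forward to do, not the actual library code; the function name is mine):

```python
import torch
import torch.nn.functional as F

def simplified_t5_attention(query, key, value, position_bias, mask=None):
    # scores: (batch, n_heads, q_len, k_len); note there is no 1/sqrt(d) scaling in T5
    scores = torch.matmul(query, key.transpose(3, 2))
    if mask is not None:
        # the mask is simply added to the relative position bias ...
        position_bias = position_bias + mask
    # ... and the shifted bias is added to the scores before the softmax
    scores = scores + position_bias
    attn_weights = F.softmax(scores.float(), dim=-1).type_as(scores)
    return torch.matmul(attn_weights, value)
```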
Any answer is very welcome! Thank you!