Hello Everybody,
I was wondering whether the attention_mask input for T5 could be passed as a float in [0, 1] instead of the 0/1 values described in the documentation:
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
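For reference, the tokenizer normally builds this 0/1 integer mask by itself (a minimal example, assuming a t5-small checkpoint just for illustration):

```python
from transformers import AutoTokenizer

# Standard usage: the tokenizer builds the 0/1 integer mask itself,
# putting 0 on the padding positions of the shorter sequence.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
batch = tokenizer(["hello how are you", "hi"], padding=True, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])  # rows of 1s, with trailing 0s where the second sequence was padded
```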
Do you think that passing something like this (I write tokens for clarity, but they would actually be token IDs)
tokens = ['hello','how','are','you','pad','pad']
attention_mask = [0.5, 0.9, 0.2, 1, 0, 0]
would work to give some tokens a particular emphasis with respect to the others (still keeping the components at 0 for the pad tokens)?
Clearly the model would require fine-tuning, but I wonder whether this different usage of attention_mask could harm it in some way…
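Concretely, what I have in mind is something like the following (a minimal sketch, again assuming t5-small; the specific weight values are arbitrary):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

batch = tokenizer(["hello how are you", "hi"], padding=True, return_tensors="pt")

# Start from the usual integer mask and turn it into float "emphasis" weights,
# keeping 0.0 on the padding positions; the specific values are arbitrary.
soft_mask = batch["attention_mask"].float()
soft_mask[0, :3] = torch.tensor([0.5, 0.9, 0.2])

labels = tokenizer(["fine thanks", "hello"], padding=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=soft_mask,  # float mask instead of the documented 0/1 integer mask
    labels=labels,
)
print(outputs.loss)
```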
Looking at transformers/modeling_t5.py at main · huggingface/transformers · GitHub, it is not really clear to me how the attention_mask acts inside the attention blocks (see for instance transformers/modeling_t5.py at 8f46ac98498dd47701971064617e00d7e723a98e · huggingface/transformers · GitHub).
I was expecting some kind of “hard” masking, but as far as I can see it is a “soft” implementation that shifts the position_bias. How does this translate into removing the ‘pad’ tokens’ contribution from the attention (is a shift of “1”, as in the original attention_mask, enough to ensure a reasonable suppression of the pads)?
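To make my reading explicit, the logic I am referring to boils down to roughly this (a hand-written paraphrase of what I understand T5Attention.forward to do, not the actual library code; the function name is mine):

```python
import torch
import torch.nn.functional as F

def simplified_t5_attention(query, key, value, position_bias, mask=None):
    # scores: (batch, n_heads, q_len, k_len); note there is no 1/sqrt(d) scaling in T5
    scores = torch.matmul(query, key.transpose(3, 2))
    if mask is not None:
        # the mask is simply added to the relative position bias ...
        position_bias = position_bias + mask
    # ... and the shifted bias is added to the scores before the softmax
    scores = scores + position_bias
    attn_weights = F.softmax(scores.float(), dim=-1).type_as(scores)
    return torch.matmul(attn_weights, value)
```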
Any answer is very welcome! Thank you!