Sometimes we pack multiple short examples into the same input sequence to increase training efficiency (so we don’t waste computation on adding and attending over padding tokens).
For example, assume we have sample 1 = [‘a’, ‘b’, ‘c’, ‘<eos>’] and sample 2 = [‘d’, ‘e’, ‘f’, ‘<eos>’]; we can pack them into a new packed_sample = [‘a’, ‘b’, ‘c’, ‘<eos>’, ‘d’, ‘e’, ‘f’, ‘<eos>’].
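For illustration, here is a minimal sketch of the packing step on the toy token lists above (a real pipeline would of course operate on token ids from a tokenizer):

```python
# Minimal packing sketch on toy "tokens" (strings instead of token ids).
sample_1 = ["a", "b", "c", "<eos>"]
sample_2 = ["d", "e", "f", "<eos>"]

# Concatenate the samples into one training sequence; no padding is needed.
packed_sample = sample_1 + sample_2
print(packed_sample)  # ['a', 'b', 'c', '<eos>', 'd', 'e', 'f', '<eos>']
```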
This procedure is quite simple, and I think the ConstantLengthDataset class and the group_texts(examples) function in examples/pytorch/language-modeling/run_clm.py handle it well. However, I think during training we can’t just use the original causal (lower-triangular) mask, otherwise packed samples will attend to information from other samples, which is unwanted. To be more specific, for packed_sample = [‘a’, ‘b’, ‘c’, ‘<eos>’, ‘d’, ‘e’, ‘f’, ‘<eos>’], I think the correct mask should be block-diagonal and causal within each sample, as in the sketch below.
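To make the idea concrete, here is a rough sketch (not tied to any particular model; the segment ids are just my own bookkeeping) of the block-diagonal causal mask I have in mind, built with plain PyTorch:

```python
import torch

# Segment id for every position in the packed sequence:
# ['a', 'b', 'c', '<eos>'] -> segment 1, ['d', 'e', 'f', '<eos>'] -> segment 2
segment_ids = torch.tensor([1, 1, 1, 1, 2, 2, 2, 2])
seq_len = segment_ids.shape[0]

# Standard causal mask: position i may attend to positions j <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Block-diagonal restriction: i and j must belong to the same packed sample.
same_segment = segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)

# Final boolean mask: True = allowed to attend, False = blocked.
packed_causal_mask = causal & same_segment
print(packed_causal_mask.int())
```

This keeps the usual lower-triangular pattern inside each sample, but tokens of sample 2 can no longer see tokens of sample 1.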
In the case of packing, why can’t an attention mask like [1,1,1,1,2,2,2,2] be passed and interpreted for self-attention (consistent with the original post)?
The code for AttentionMaskConverter is a bit confusing, and from its 4D output it’s unclear to me whether this already happens correctly or not. Does anyone know?
In a shortened example (to make it easier to write out), [1,1,2,2] translates to
[[[[0, m, m, m],
[0, 0, m, m],
[0, 0, m, m],
[0, 0, m, m]]]]
Where m = the dtype minimum, i.e. -3.4028e38 for float32.
That seems to say that sample 2 attends to sample 1, which is not what we want here.
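For what it’s worth, here is a rough re-implementation of the 2D → 4D expansion as I understand it from modeling_attn_mask_utils (this is my own sketch, not the actual library code, and details may differ across transformers versions). It reproduces the matrix above and shows why [1,1,2,2] is not read as segment ids: any entry other than 1 is simply treated like padding and masked out.

```python
import torch

def expand_like_converter(attention_mask_2d: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Rough sketch of the causal 2D -> 4D mask expansion (my assumption, not library code)."""
    bsz, seq_len = attention_mask_2d.shape
    min_val = torch.finfo(dtype).min  # -3.4028e38 for float32, the "m" above

    # Causal part: 0 on and below the diagonal, min everywhere above it.
    causal = torch.full((seq_len, seq_len), min_val, dtype=dtype).triu(diagonal=1)
    causal = causal[None, None, :, :].expand(bsz, 1, seq_len, seq_len)

    # "Padding" part: only entries equal to 1 count as "attend to this position";
    # 0, 2, or anything else ends up masked out in every row.
    keep = (attention_mask_2d == 1)[:, None, None, :].expand(bsz, 1, seq_len, seq_len)
    return causal.masked_fill(~keep, min_val)

print(expand_like_converter(torch.tensor([[1, 1, 2, 2]])))
# Rows 3 and 4 come out as [0, 0, m, m]: the "2" positions are masked out
# for everyone, not treated as a second sample.
```

So as far as I can tell, the 2D attention_mask argument is interpreted as a keep/ignore padding mask rather than as segment ids, and to get the block-diagonal behaviour from the first post one would presumably need to build and pass the 4D mask explicitly (or use an attention implementation that is aware of packing).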