Sometimes we pack multiple short examples into the same input sequence to increase training efficiency (so we don't waste computation adding padding tokens and attending over them).
For example, given sample_1 = ['a', 'b', 'c', '<eos>'] and sample_2 = ['d', 'e', 'f', '<eos>'], we can pack them into packed_sample = ['a', 'b', 'c', '<eos>', 'd', 'e', 'f', '<eos>'].
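A minimal sketch of the packing step (toy string tokens for readability; a real pipeline would concatenate token ids):

```python
# Pack two samples, each already terminated by <eos>, into one sequence.
sample_1 = ["a", "b", "c", "<eos>"]
sample_2 = ["d", "e", "f", "<eos>"]
packed_sample = sample_1 + sample_2
# -> ['a', 'b', 'c', '<eos>', 'd', 'e', 'f', '<eos>']
```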
This procedure is quite simple, and I think ConstantLengthDataset and the group_texts(examples) function in examples/pytorch/language-modeling/run_clm.py handle it well. However, during training I don't think we can just use the original causal (lower-triangular) mask, otherwise tokens will attend to information from other packed samples, which is unwanted. To be more specific, for packed_sample = ['a', 'b', 'c', '<eos>', 'd', 'e', 'f', '<eos>'], I think the correct mask should be block-diagonal causal, like this:
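```
        a  b  c  <eos>  d  e  f  <eos>
a       1  0  0  0      0  0  0  0
b       1  1  0  0      0  0  0  0
c       1  1  1  0      0  0  0  0
<eos>   1  1  1  1      0  0  0  0
d       0  0  0  0      1  0  0  0
e       0  0  0  0      1  1  0  0
f       0  0  0  0      1  1  1  0
<eos>   0  0  0  0      1  1  1  1
```

Here is a minimal sketch of how I imagine building such a mask from the packed ids (the helper name block_diagonal_causal_mask is mine, and I'm assuming eos_token_id is what delimits samples):

```python
import torch

def block_diagonal_causal_mask(input_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Build a (seq_len, seq_len) boolean mask where True = may attend.

    A token may attend to earlier positions (causal) only within its own
    packed sample; samples are delimited by eos_token_id.
    """
    seq_len = input_ids.size(0)
    # Assign each position a segment id: 0 for the first sample, 1 for the
    # second, and so on. Each <eos> should stay in the sample it closes,
    # so we count only the <eos> tokens strictly before each position.
    is_eos = (input_ids == eos_token_id).long()
    segment_ids = torch.cumsum(is_eos, dim=0) - is_eos

    same_segment = segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return causal & same_segment

# Example with made-up token ids, where 2 = <eos>:
input_ids = torch.tensor([5, 6, 7, 2, 8, 9, 10, 2])
mask = block_diagonal_causal_mask(input_ids, eos_token_id=2)
```

If the attention layer expects an additive mask rather than a boolean one, the result could be converted with torch.where(mask, 0.0, float('-inf')).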
Am I correct? If yes, is there any existing implementation for this? Any idea will be really appreciated; I can implement it myself.