I am curious how HF implements BERT and RoBERTa. In theory, dropping more tokens in BertForMaskedLM or RobertaForMaskedLM through the attention mask should speed up training and use less memory, since fewer tokens are attended to. For example, if I mask 50% of the tokens, I'd expect much more than a 2x speedup, since attention cost grows quadratically with sequence length. Is this the case in HF's implementations?
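A quick way to check this empirically is to time a training step with a full mask versus a half mask. This is a rough sketch, not a rigorous benchmark; the model name, tensor sizes, and timing approach are just illustrative choices:

```python
# Rough timing sketch: compare one forward/backward pass with a full
# attention_mask vs. one where 50% of positions are masked out.
# (For a fairer comparison, run each step a few times and ignore the first call.)
import time
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.train()

batch_size, seq_len = 8, 512
input_ids = torch.randint(1000, 2000, (batch_size, seq_len))
labels = input_ids.clone()  # crude MLM labels, good enough for timing

def timed_step(attention_mask):
    start = time.perf_counter()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    model.zero_grad()
    return time.perf_counter() - start

full_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
half_mask = full_mask.clone()
half_mask[:, seq_len // 2:] = 0  # "drop" the second half via the mask

print("full mask:", timed_step(full_mask))
print("half mask:", timed_step(half_mask))
```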
Sentences are padded to the same length within a batch.
Tokens are masked, not dropped.
All tokens are attended to (<pad> included), but the mask adds a large negative value to the attention scores of masked positions, so they end up with ~0 weight after the softmax.
So there is no speedup when sentences are padded to a fixed length, even if most tokens are <pad>: the full attention matrix is still computed.
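To illustrate the mechanism, here is a minimal sketch of BERT-style self-attention with an additive mask (tensor names and shapes are illustrative, not HF's actual code): the mask only biases the scores, so the full score matrix is computed regardless of how many positions are masked.

```python
# Minimal sketch: the attention mask becomes an additive bias on the scores,
# so masked positions get ~0 weight after the softmax, but every position's
# scores are still computed. Same FLOPs as the unmasked case.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 12, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# attention_mask: 1 = attend, 0 = masked / <pad>
attention_mask = torch.ones(batch, seq_len)
attention_mask[:, seq_len // 2:] = 0  # mask out half the positions

# Convert to an additive bias: 0 where allowed, a very large negative value where masked
extended_mask = (1.0 - attention_mask)[:, None, None, :] * torch.finfo(q.dtype).min

# The full (seq_len x seq_len) score matrix is computed regardless of the mask
scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
scores = scores + extended_mask          # masked columns pushed toward -inf
weights = F.softmax(scores, dim=-1)      # ~0 probability on masked positions
context = weights @ v                    # same amount of compute as without a mask
```

Actually skipping the padded positions (and getting the speedup you describe) would require removing them from the sequence before the matrix multiplications, not just masking them.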