It appears that this metric also includes padding tokens. If one uses example packing, then it does track the "correct" number of tokens seen by the model.
However, I can think of two cases where this will not be accurate:
- When packing is not used, training examples are padded to the longest sequence in the batch, so padding tokens are counted as well
- For SFT training on completions only (where only the completion tokens contribute to the loss)
A more accurate calculation would be to sum the attention mask.
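For illustration, here is a minimal sketch of what I mean, assuming a PyTorch-style batch dict with an `attention_mask` tensor (all names below are just placeholders, not a reference to any specific API):

```python
import torch

def count_tokens_seen(batch: dict) -> int:
    """Count non-padding tokens in a batch by summing the attention mask.

    Assumes an "attention_mask" tensor of shape (batch_size, seq_len),
    where 1 marks real tokens and 0 marks padding.
    """
    return int(batch["attention_mask"].sum().item())

# Example: a batch of two sequences padded to length 5.
batch = {
    "input_ids": torch.tensor([[11, 42, 37, 2, 0],
                               [11, 58, 2,  0, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1, 1, 0],
                                    [1, 1, 1, 0, 0]]),
}

# Counting batch["input_ids"].numel() would report 10 tokens (padding included),
# while summing the attention mask reports the 7 tokens the model actually attends to.
print(count_tokens_seen(batch))  # 7
```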
Any thoughts?