Track include_num_input_tokens_seen in Trainer

It appears that this metric also counts padding tokens. When example packing is used there is no padding, so the metric does track the “correct” number of tokens seen by the model.

However, I can think of two cases where this will not be accurate:

  1. When packing is not used, training examples are padded to the longest sequence in the batch, so the padding tokens are included in the count (illustrated below)
  2. For SFT training on completions only, where the prompt tokens are masked out of the loss
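
As far as I can tell, the counter is incremented with the total element count of the main input tensor, padding included. A toy batch (the token IDs here are made up; only the counts matter) shows the gap:

```python
import torch

# Hypothetical batch: two examples padded to the longest sequence (length 6),
# with pad_token_id assumed to be 0.
input_ids = torch.tensor(
    [
        [101, 2023, 2003, 1037, 7099, 102],  # 6 real tokens
        [101, 7592, 102, 0, 0, 0],           # 3 real tokens, 3 padding tokens
    ]
)
attention_mask = (input_ids != 0).long()

print(input_ids.numel())          # 12 -- padding counted as "seen"
print(int(attention_mask.sum()))  # 9  -- only the real tokens
```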

A more accurate calculation would be to sum the attention mask (and, for the completions-only case, perhaps count only the tokens whose labels are not -100); a sketch is below.
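
Roughly what I have in mind (a sketch only, not the actual Trainer code; in the real loop the per-process count would still need to be gathered across processes, and the `main_input_name` lookup kept as it is today):

```python
import torch

def count_input_tokens_seen(inputs: dict) -> int:
    """Count the non-padding tokens in a batch; fall back to the raw element
    count when the batch carries no attention mask (e.g. packed examples)."""
    if "attention_mask" in inputs:
        return int(inputs["attention_mask"].sum())
    return int(inputs["input_ids"].numel())
```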
Any thoughts?