Track include_num_input_tokens_seen in Trainer

It appears that this metric also counts padding tokens. When example packing is used there is no padding, so the metric does track the “correct” number of tokens seen by the model.

However, I can think of two cases where this will not be accurate:

  1. When packing is not used, training examples are padded to the longest sequence in the batch, so the padding tokens are included in the count (illustrated below)
  2. For SFT training on completions only, where the prompt tokens are masked out of the loss
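
As far as I can tell, the counter is incremented with the total element count of the main input tensor, padding included. A toy batch (the token IDs here are made up; only the counts matter) shows the gap:

```python
import torch

# Hypothetical batch: two examples padded to the longest sequence (length 6),
# with pad_token_id assumed to be 0.
input_ids = torch.tensor(
    [
        [101, 2023, 2003, 1037, 7099, 102],  # 6 real tokens
        [101, 7592, 102, 0, 0, 0],           # 3 real tokens, 3 padding tokens
    ]
)
attention_mask = (input_ids != 0).long()

print(input_ids.numel())          # 12 -- padding counted as "seen"
print(int(attention_mask.sum()))  # 9  -- only the real tokens
```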

A more accurate calculation would be to sum the attention mask (and, for the completions-only case, perhaps count only the tokens whose labels are not -100); a sketch is below.
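
Roughly what I have in mind (a sketch only, not the actual Trainer code; in the real loop the per-process count would still need to be gathered across processes, and the `main_input_name` lookup kept as it is today):

```python
import torch

def count_input_tokens_seen(inputs: dict) -> int:
    """Count the non-padding tokens in a batch; fall back to the raw element
    count when the batch carries no attention mask (e.g. packed examples)."""
    if "attention_mask" in inputs:
        return int(inputs["attention_mask"].sum())
    return int(inputs["input_ids"].numel())
```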
Any thoughts?