Track number of tokens seen during training in wandb with Trainer API

jpgard · April 13, 2023, 5:43am

I have a training loop that uses the Trainer API and reports the default metrics to weights and biases.

It is common to report the number of tokens seen during model training (not the number of steps or examples seen) in order to study scaling behavior. However, I don’t see a straightforward way to do this.

Is there a simple way to track the number of tokens my Trainer has seen during training, and report this to wandb? I can see this might require a custom WandbCallback, but it isn’t clear where, if at all, the number of tokens is even tracked in the Trainer state.

Thanks!

g-ronimo · October 23, 2023, 3:33pm

Did you find a way?
Hard to believe there’s nobody who wants this feature.

jpgard · October 23, 2023, 6:40pm

No. I don’t see an easy way to implement this without refactoring the entire Trainer.__init__() and Trainer._inner_training_loop() methods, which seems like a total mess.

I think transformers should instead include this behavior by default. I created an issue on their GitHub page and am willing to take a stab at an implementation if they give it the green light and provide some guidance on design. Please upvote the issue and comment there!

github.com/huggingface/transformers

Count of tokens seen during training in Trainer

opened 06:38PM - 23 Oct 23 UTC

jpgard

### Feature request

The `Trainer` API should track and log the number of tokens… seen during training.

While it sometimes could (maybe?) be possible to back out the number of tokens seen from the FLOS, or by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track the number of tokens seen. It shouldn't be necessary to completely iterate over a model's training loop just to compute the count of tokens, which is the only current implementation of any token-related metric in Trainer, [`Trainer.num_tokens()`](https://github.com/huggingface/transformers/blob/acc394c4f5e1283c19783581790b3dc3105a3697/src/transformers/trainer.py#L1180).

This can't currently be implemented in a callback, because callbacks don't have access to the training data (only the trainer state).

### Motivation

Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality (tokens seen during training is reported for nearly every major LLM release). It seems that any language model developer using Hugging Face would like to know this metric for their training runs -- it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps. In any case, it's an extremely useful number to have, and it must be tracked during training as the model consumes examples.

### Your contribution

I'm willing to contribute this but would like some guidance on the overall design first. In particular, here's what I think a reasonable implementation would include:

- Add a `global_tokens_seen` or similar to the `TrainerState`. This would add only a single integer value to the `TrainerState`.
- Increment this during `Trainer._inner_training_loop()`
- Probably add this information to the logging outputs

What do the folks at HF think about that?
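For illustration, the three steps proposed above could look roughly like the following in plain Python. This is only a sketch of the idea, not the actual `Trainer` code: `TokenCounterState`, `count_tokens`, and `training_loop` are hypothetical names, and batches are assumed to be mappings with `input_ids` and an optional 0/1 `attention_mask` given as nested lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Sequence


@dataclass
class TokenCounterState:
    """Hypothetical stand-in for a `global_tokens_seen` field on `TrainerState`."""

    global_tokens_seen: int = 0


def count_tokens(batch: Mapping[str, Sequence[Sequence[int]]]) -> int:
    """Count the tokens in one batch.

    Assumption: if an `attention_mask` is present, 0 marks padding, so
    summing the mask gives the number of non-padding tokens. Otherwise
    every position in `input_ids` is counted.
    """
    if "attention_mask" in batch:
        return sum(sum(row) for row in batch["attention_mask"])
    return sum(len(row) for row in batch["input_ids"])


def training_loop(batches, state: TokenCounterState, log_every: int = 1) -> List[Dict[str, int]]:
    """Toy loop: increment the counter per batch and emit log records."""
    logs = []
    for step, batch in enumerate(batches, start=1):
        # The forward/backward pass would happen here in a real loop.
        state.global_tokens_seen += count_tokens(batch)
        if step % log_every == 0:
            # In a real run this dict would be handed to the logger (e.g. wandb).
            logs.append({"step": step, "tokens_seen": state.global_tokens_seen})
    return logs


if __name__ == "__main__":
    demo_batches = [
        # Two padded sequences: 3 + 2 real tokens.
        {"input_ids": [[5, 6, 7, 0], [8, 9, 0, 0]],
         "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]]},
        # No mask: all 6 positions are counted.
        {"input_ids": [[1, 2, 3], [4, 5, 6]]},
    ]
    print(training_loop(demo_batches, TokenCounterState()))
```

With tensors instead of lists, the same counts would presumably come from summing the mask or taking the element count of `input_ids`. In a multi-GPU run the per-process counts would also need to be summed across processes, which this sketch does not attempt.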
