I train a BERT model and use WandB for monitoring.
I can see the following graphs and I am trying to understand them.
Also have train/global_step graph.
I would like to know what these are and how they differ from each other.
Here is my understanding. All of these graphs relate to the time it takes to evaluate on the eval_dataset that you pass to the Trainer.
eval/runtime - total time taken to evaluate (in seconds)
eval/steps_per_second - average number of batches (steps) evaluated per second
eval/samples_per_second - average number of samples evaluated per second
You can confirm this by plugging in your values to this example below.
eval/runtime = 60
len(eval_dataset) = 200
per_device_eval_batch_size = 10
Then the values for the other graphs are as follows.
eval/samples_per_second = len(eval_dataset) / eval/runtime = 200 / 60 ≈ 3.33
eval/steps_per_second = (len(eval_dataset) / per_device_eval_batch_size) / eval/runtime = 20 / 60 ≈ 0.33
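A quick sanity check of these formulas in plain Python, using the hypothetical numbers above (note the ceiling on the step count, since a final partial batch still counts as a step):

```python
import math

# Hypothetical example values from above.
eval_runtime = 60.0             # eval/runtime, in seconds
eval_dataset_len = 200          # len(eval_dataset)
per_device_eval_batch_size = 10

# A partial final batch would still be one evaluation step.
num_steps = math.ceil(eval_dataset_len / per_device_eval_batch_size)  # 20

samples_per_second = eval_dataset_len / eval_runtime  # eval/samples_per_second
steps_per_second = num_steps / eval_runtime           # eval/steps_per_second

print(f"{samples_per_second:.3f}")  # 3.333
print(f"{steps_per_second:.3f}")    # 0.333
```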
Still looking for concrete evidence on this, but the calculation works for my use case. I will update this answer if I find the relevant code/documentation!
I think this is all correct, but I found something about the training speed metrics that seems counter-intuitive, which I thought would be worth sharing for anyone finding this discussion like I did. Someone can correct me if I misunderstood:
For eval, the metrics are based on the time of each evaluation phase (at the end of an epoch or at a step interval), so you get several values over the course of a training run, one for each time training stops to evaluate. For training, you get a single value that covers the total runtime of
trainer.train(), including any evaluation phases in between, since the metric is only computed once, at the end of training.
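A toy illustration of the consequence (all numbers are made up): because train/train_runtime spans the whole trainer.train() call, the logged training throughput comes out lower than a "training steps only" measurement would.

```python
# Hypothetical timing breakdown of one trainer.train() call:
# 3 epochs of training, with an evaluation phase after each epoch.
train_step_time = 100.0     # seconds spent on actual training steps, per epoch
eval_phase_time = 20.0      # seconds spent evaluating, per epoch
num_epochs = 3
samples_per_epoch = 10_000  # training samples processed per epoch

# train/train_runtime covers the whole call, evaluation phases included.
train_runtime = num_epochs * (train_step_time + eval_phase_time)  # 360 s

total_samples = num_epochs * samples_per_epoch
logged_sps = total_samples / train_runtime                  # what gets logged
steps_only_sps = total_samples / (num_epochs * train_step_time)  # what I want

print(f"{logged_sps:.1f} vs {steps_only_sps:.1f}")  # 83.3 vs 100.0
```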
Personally, I am interested in measuring training samples per second the same way evaluation does it: including only the phases where the trainer is actually taking training steps, perhaps getting many values per run as a result that could later be averaged, etc. That way, if I am trying to improve my training speed and compare with others, I can focus on what's going on in the training steps. I may be in the minority on this, though, and the idea that
train/train_runtime spans the time spent in
trainer.train() is sensible. The undesirable part is that you can't use
train/total_flos and
train/train_runtime together to compute a ballpark estimate of the TFLOPS you achieved during training steps, since
train/total_flos seems to include only the (estimated) operations in training steps, while train/train_runtime also includes non-training phases.
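For what it's worth, the naive ballpark calculation I had in mind looks like this (the numbers are made up; the point is that with train_runtime in the denominator, any eval or other non-training time it contains drags the estimate below the TFLOPS actually achieved during training steps):

```python
# Hypothetical values as they might appear in a run's final logged metrics.
total_flos = 1.2e18      # train/total_flos: estimated FLOPs in training steps
train_runtime = 7200.0   # train/train_runtime: whole trainer.train(), seconds

# FLOPs per second, scaled to tera-FLOPS.
achieved_tflops = total_flos / train_runtime / 1e12

print(f"~{achieved_tflops:.1f} TFLOPS")  # ~166.7, an underestimate whenever
                                         # train_runtime includes eval phases
```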
Anyway, I think it is worth documenting somewhere, and would love to know if anyone has a good solution for measuring training TFLOPs for comparison with what they could theoretically be getting from their GPU(s), or if my desired approach doesn’t actually work.