What are eval/runtime, eval/steps_per_second, eval/samples_per_second graphs?

I train a BERT model and use WandB for monitoring.

I can see the following graphs and I am trying to understand them.

  • eval/runtime
  • eval/steps_per_second
  • eval/samples_per_second

I also have a train/global_step graph.

I would like to know what these are and how they differ from each other.


Here is my understanding. All these graphs describe how long it takes, and how fast it runs, to evaluate on the eval_dataset that you pass to the Trainer.

eval/runtime - Total time the evaluation takes (in seconds)
eval/steps_per_second - Average number of batches (steps) evaluated per second
eval/samples_per_second - Average number of samples evaluated per second

You can confirm this by plugging in your values to this example below.

eval/runtime = 60
len(eval_dataset) = 200
per_device_eval_batch_size = 10

Then the values for the other graphs are as follows.

eval/samples_per_second = len(eval_dataset) / eval/runtime = 200 / 60 ≈ 3.33
eval/steps_per_second = (len(eval_dataset) / per_device_eval_batch_size) / eval/runtime = 20 / 60 ≈ 0.33
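
To make the arithmetic easy to check against your own numbers, here is a minimal sketch. It treats each per-second metric as a count divided by the runtime, which is what "per second" normally means. The function name `eval_speed_metrics` and the `num_devices` parameter are my own for illustration, not part of the Trainer API, and rounding the step count up for a partial last batch is an assumption.

```python
import math


def eval_speed_metrics(runtime_s, num_samples, batch_size, num_devices=1):
    """Hypothetical helper: throughput = count / runtime."""
    # Assumption: one step is one batch across all devices,
    # with a partial final batch counted as a full step.
    num_steps = math.ceil(num_samples / (batch_size * num_devices))
    return {
        "eval/runtime": runtime_s,
        "eval/samples_per_second": num_samples / runtime_s,
        "eval/steps_per_second": num_steps / runtime_s,
    }


# Values from the example above: 60 s, 200 samples, batch size 10.
metrics = eval_speed_metrics(runtime_s=60, num_samples=200, batch_size=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

If the numbers logged to WandB for your run match what this gives for your runtime, dataset size, and batch size, that is decent evidence the interpretation is right.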

Still looking for concrete evidence on this, but the calculation works for my use case. I will update this answer if I find the relevant code/documentation!

I think this is all correct, but I found something for training speed metrics that seems counter-intuitive that I thought would be worth sharing for anyone finding this discussion like myself. Someone can correct me if I misunderstood:

For eval, everything is based on the time of each evaluation phase (at the end of an epoch or at a step interval), so you get several values over the course of a training run, one each time you stop to evaluate. For training, you get a single value that covers the total runtime of trainer.train(), including the evaluation phases in between, since it is only computed at the end of training.

Personally, I am interested in measuring training samples per second the same way as in evaluation: counting only the phases where the trainer is actually taking training steps, and perhaps getting many values per run that could later be averaged. That way, if I am trying to improve my training speed and compare with others, I can focus on what's going on in the training steps. I may be in the minority on this, though, and the idea that train_runtime spans the time spent in trainer.train() is sensible. The undesirable part is that you can't use train/total_flos and train/train_runtime to compute a ballpark estimate of the TFLOPS you achieved in training steps, since train/total_flos seems to include only (estimated) operations in training steps, while train_runtime also includes non-training phases.
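
One way this could be approached is with a callback that only times the training steps. This is a rough sketch, not something I have validated on a real run: `TrainStepTimer` and `achieved_tflops` are hypothetical names of my own, and the only library pieces used are `TrainerCallback` and its `on_step_begin` / `on_step_end` hooks. It also assumes wall-clock time around those hooks is a fair proxy for step time. On a GPU, kernels run asynchronously, so individual step timings may be off without synchronization, and time spent fetching batches between steps may not be counted.

```python
import time

try:
    from transformers import TrainerCallback
except ImportError:  # lets the sketch run without transformers installed
    TrainerCallback = object


class TrainStepTimer(TrainerCallback):
    """Hypothetical helper: accumulates wall-clock time spent in training steps only."""

    def __init__(self):
        self.train_seconds = 0.0
        self.num_steps = 0
        self._step_start = None

    def on_step_begin(self, args, state, control, **kwargs):
        # Start the clock when a training step begins.
        self._step_start = time.perf_counter()

    def on_step_end(self, args, state, control, **kwargs):
        # Stop the clock when the step ends; evaluation phases are never timed.
        if self._step_start is not None:
            self.train_seconds += time.perf_counter() - self._step_start
            self.num_steps += 1
            self._step_start = None


def achieved_tflops(total_flos, train_seconds):
    """Ballpark TFLOPS: estimated floating-point operations / seconds / 1e12."""
    return total_flos / train_seconds / 1e12
```

The idea would be to pass `callbacks=[timer]` when constructing the Trainer, then after training call `achieved_tflops(total_flos, timer.train_seconds)` with the value reported as train/total_flos. Since total_flos is itself an estimate, the result is only good for rough comparisons.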

Anyway, I think it is worth documenting somewhere, and I would love to know if anyone has a good solution for measuring training TFLOPS for comparison with what they could theoretically be getting from their GPU(s), or if my desired approach doesn't actually work.
