I train a BERT model and use WandB for monitoring.
I can see the following graphs and I am trying to understand them.
Also have train/global_step graph.
I would like to know what these are and how they differ from each other.
Here is my understanding. All of these graphs relate to the time it takes to evaluate on the eval_dataset that you pass to the Trainer.
eval/runtime - total time taken to evaluate (in seconds)
eval/steps_per_second - average number of batches (steps) evaluated per second
eval/samples_per_second - average number of samples evaluated per second
You can confirm this by plugging in your values to this example below.
eval/runtime = 60
len(eval_dataset) = 200
per_device_eval_batch_size = 10
Then the values for the other graphs are as follows.
eval/samples_per_second = len(eval_dataset) / eval/runtime = 200 / 60 ≈ 3.33
eval/steps_per_second = (len(eval_dataset) / per_device_eval_batch_size) / eval/runtime = 20 / 60 ≈ 0.33
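A quick sanity check of these formulas in plain Python, using the hypothetical numbers above (note the ceiling on the step count, since a final partial batch still counts as a step):

```python
import math

# Hypothetical example values from above.
eval_runtime = 60.0             # eval/runtime, in seconds
eval_dataset_len = 200          # len(eval_dataset)
per_device_eval_batch_size = 10

# A partial final batch would still be one evaluation step.
num_steps = math.ceil(eval_dataset_len / per_device_eval_batch_size)  # 20

samples_per_second = eval_dataset_len / eval_runtime  # eval/samples_per_second
steps_per_second = num_steps / eval_runtime           # eval/steps_per_second

print(f"{samples_per_second:.3f}")  # 3.333
print(f"{steps_per_second:.3f}")    # 0.333
```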
Still looking for concrete evidence on this, but the calculation works for my use case. I will update this answer if I find the relevant code/documentation!
I think this is all correct, but I found something about the training speed metrics that seems counter-intuitive, which I thought would be worth sharing for anyone finding this discussion like I did. Someone can correct me if I misunderstood:
For eval, the metrics are based on the time of each evaluation phase (at the end of an epoch or at a step interval), so you get several values over the course of a training run, one for each time training stops to evaluate. For training, you get a single value that covers the total runtime of
trainer.train(), including any evaluation phases in between, since the metric is only computed once, at the end of training.
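A toy illustration of the consequence (all numbers are made up): because train/train_runtime spans the whole trainer.train() call, the logged training throughput comes out lower than a "training steps only" measurement would.

```python
# Hypothetical timing breakdown of one trainer.train() call:
# 3 epochs of training, with an evaluation phase after each epoch.
train_step_time = 100.0     # seconds spent on actual training steps, per epoch
eval_phase_time = 20.0      # seconds spent evaluating, per epoch
num_epochs = 3
samples_per_epoch = 10_000  # training samples processed per epoch

# train/train_runtime covers the whole call, evaluation phases included.
train_runtime = num_epochs * (train_step_time + eval_phase_time)  # 360 s

total_samples = num_epochs * samples_per_epoch
logged_sps = total_samples / train_runtime                  # what gets logged
steps_only_sps = total_samples / (num_epochs * train_step_time)  # what I want

print(f"{logged_sps:.1f} vs {steps_only_sps:.1f}")  # 83.3 vs 100.0
```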
Personally, I am interested in measuring training samples per second the same way evaluation does it: including only the phases where the trainer is actually taking training steps, perhaps getting many values per run as a result that could later be averaged, etc. That way, if I am trying to improve my training speed and compare with others, I can focus on what's going on in the training steps. I may be in the minority on this, though, and the idea that
train/train_runtime spans the time spent in
trainer.train() is sensible. The undesirable part is that you can't use
train/total_flos and
train/train_runtime together to compute a ballpark estimate of the TFLOPS you achieved during training steps, since
train/total_flos seems to include only the (estimated) operations in training steps, while train/train_runtime also includes non-training phases.
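For what it's worth, the naive ballpark calculation I had in mind looks like this (the numbers are made up; the point is that with train_runtime in the denominator, any eval or other non-training time it contains drags the estimate below the TFLOPS actually achieved during training steps):

```python
# Hypothetical values as they might appear in a run's final logged metrics.
total_flos = 1.2e18      # train/total_flos: estimated FLOPs in training steps
train_runtime = 7200.0   # train/train_runtime: whole trainer.train(), seconds

# FLOPs per second, scaled to tera-FLOPS.
achieved_tflops = total_flos / train_runtime / 1e12

print(f"~{achieved_tflops:.1f} TFLOPS")  # ~166.7, an underestimate whenever
                                         # train_runtime includes eval phases
```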
Anyway, I think it is worth documenting somewhere, and would love to know if anyone has a good solution for measuring training TFLOPs for comparison with what they could theoretically be getting from their GPU(s), or if my desired approach doesn’t actually work.