I train a BERT model and use WandB for monitoring.
I can see the following graphs and I am trying to understand them.
Also have train/global_step graph.
I would like to know what these are and how they differ from each other.
Here is my understanding. All of these graphs relate to the time it takes to evaluate on the eval_dataset that you pass to the Trainer.
eval/runtime - total time taken to evaluate (in seconds)
eval/steps_per_second - average number of batches (steps) evaluated per second
eval/samples_per_second - average number of samples evaluated per second
You can confirm this by plugging in your values to this example below.
eval/runtime = 60
len(eval_dataset) = 200
per_device_eval_batch_size = 10
Then the values for the other graphs are as follows.
eval/samples_per_second = len(eval_dataset) / eval/runtime = 200 / 60 ≈ 3.33
eval/steps_per_second = (len(eval_dataset) / per_device_eval_batch_size) / eval/runtime = 20 / 60 ≈ 0.33
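A quick sanity check of these formulas in plain Python, using the hypothetical numbers above (note the ceiling on the step count, since a final partial batch still counts as a step):

```python
import math

# Hypothetical example values from above.
eval_runtime = 60.0             # eval/runtime, in seconds
eval_dataset_len = 200          # len(eval_dataset)
per_device_eval_batch_size = 10

# A partial final batch would still be one evaluation step.
num_steps = math.ceil(eval_dataset_len / per_device_eval_batch_size)  # 20

samples_per_second = eval_dataset_len / eval_runtime  # eval/samples_per_second
steps_per_second = num_steps / eval_runtime           # eval/steps_per_second

print(f"{samples_per_second:.3f}")  # 3.333
print(f"{steps_per_second:.3f}")    # 0.333
```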
Still looking for concrete evidence on this, but the calculation works for my use case. I will update this answer if I find the relevant code/documentation!
I think this is all correct, but I found something about the training speed metrics that seems counter-intuitive, which I thought would be worth sharing for anyone finding this discussion like I did. Someone can correct me if I misunderstood:
For eval, the metrics are based on the time of each evaluation phase (at the end of an epoch or at a step interval), so you get several values over the course of a training run, one for each time training stops to evaluate. For training, you get a single value that covers the total runtime of
trainer.train(), including any evaluation phases in between, since the metric is only computed once, at the end of training.
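A toy illustration of the consequence (all numbers are made up): because train/train_runtime spans the whole trainer.train() call, the logged training throughput comes out lower than a "training steps only" measurement would.

```python
# Hypothetical timing breakdown of one trainer.train() call:
# 3 epochs of training, with an evaluation phase after each epoch.
train_step_time = 100.0     # seconds spent on actual training steps, per epoch
eval_phase_time = 20.0      # seconds spent evaluating, per epoch
num_epochs = 3
samples_per_epoch = 10_000  # training samples processed per epoch

# train/train_runtime covers the whole call, evaluation phases included.
train_runtime = num_epochs * (train_step_time + eval_phase_time)  # 360 s

total_samples = num_epochs * samples_per_epoch
logged_sps = total_samples / train_runtime                  # what gets logged
steps_only_sps = total_samples / (num_epochs * train_step_time)  # what I want

print(f"{logged_sps:.1f} vs {steps_only_sps:.1f}")  # 83.3 vs 100.0
```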
Personally, I am interested in measuring training samples per second the same way evaluation does it: including only the phases where the trainer is actually taking training steps, perhaps getting many values per run as a result that could later be averaged, etc. That way, if I am trying to improve my training speed and compare with others, I can focus on what's going on in the training steps. I may be in the minority on this, though, and the idea that
train/train_runtime spans the time spent in
trainer.train() is sensible. The undesirable part is that you can't use
train/total_flos and
train/train_runtime together to compute a ballpark estimate of the TFLOPS you achieved during training steps, since
train/total_flos seems to include only the (estimated) operations in training steps, while train/train_runtime also includes non-training phases.
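For what it's worth, the naive ballpark calculation I had in mind looks like this (the numbers are made up; the point is that with train_runtime in the denominator, any eval or other non-training time it contains drags the estimate below the TFLOPS actually achieved during training steps):

```python
# Hypothetical values as they might appear in a run's final logged metrics.
total_flos = 1.2e18      # train/total_flos: estimated FLOPs in training steps
train_runtime = 7200.0   # train/train_runtime: whole trainer.train(), seconds

# FLOPs per second, scaled to tera-FLOPS.
achieved_tflops = total_flos / train_runtime / 1e12

print(f"~{achieved_tflops:.1f} TFLOPS")  # ~166.7, an underestimate whenever
                                         # train_runtime includes eval phases
```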
Anyway, I think it is worth documenting somewhere, and would love to know if anyone has a good solution for measuring training TFLOPs for comparison with what they could theoretically be getting from their GPU(s), or if my desired approach doesn’t actually work.