Measuring training speed

Hey folks,

I am pre-training a RoBERTa model on the C4 corpus on a g5.xlarge EC2 instance, using PyTorch and an Adam8bit optimizer. The instance has an A10G GPU that is rated at 30 TFLOPS (NVIDIA A10G Specs | TechPowerUp GPU Database).

How do I estimate how well I am using the GPU? Here is a training run on wandb: Weights & Biases

Dividing total_flos by the total training time gives approximately 6 TFLOPS, or about 20% utilization, which feels low.
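
A minimal sketch of that arithmetic, for anyone who wants to check it. The total_flos and runtime values below are placeholders chosen for illustration, not the numbers from the run; the 30 TFLOPS peak is the spec figure quoted above, and the function names are hypothetical.

```python
def achieved_tflops(total_flos: float, runtime_seconds: float) -> float:
    """Average throughput in TFLOPS: total floating-point operations / wall time."""
    return total_flos / runtime_seconds / 1e12


def utilization(achieved: float, peak: float) -> float:
    """Fraction of the rated peak throughput that was actually achieved."""
    return achieved / peak


# Placeholder inputs (assumptions, not values from the actual run).
total_flos = 2.16e17          # e.g. the total_flos figure logged for the run
runtime_seconds = 36_000.0    # e.g. 10 hours of wall-clock training time
peak_tflops = 30.0            # rated peak quoted for the A10G

achieved = achieved_tflops(total_flos, runtime_seconds)
util = utilization(achieved, peak_tflops)
print(f"{achieved:.1f} TFLOPS achieved, {util:.0%} of rated peak")
```

Note that the rated peak depends on the precision (FP32 vs. tensor-core FP16/BF16), so the denominator should match the precision the training run actually uses.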

Using the 6ND approximation (6 × number of parameters N × number of training tokens D) also gives a number in the same ballpark.
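
The 6ND estimate can be turned into a throughput figure by using tokens per second in place of D. A rough sketch, assuming a parameter count of about 125M (roughly RoBERTa-base size, which is an assumption since the model size is not stated here) and a placeholder token rate; the function name is hypothetical. The approximation ignores the attention-score FLOPs, so it slightly underestimates for long sequences.

```python
def estimated_tflops_6nd(n_params: float, tokens_per_second: float) -> float:
    """Training throughput in TFLOPS using the 6 * N * D rule of thumb.

    Roughly 2*N FLOPs per token for the forward pass and 4*N for the
    backward pass, so 6*N FLOPs per token processed.
    """
    return 6.0 * n_params * tokens_per_second / 1e12


# Placeholder inputs (assumptions, not measured values).
n_params = 125e6             # assumed model size, about RoBERTa-base
tokens_per_second = 8_000.0  # assumed training throughput in tokens/s

estimate = estimated_tflops_6nd(n_params, tokens_per_second)
print(f"6ND estimate: {estimate:.1f} TFLOPS")
```

If this figure and the total_flos / time figure agree, the measurement itself is probably sound and the gap to the rated peak is real.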

Am I thinking about this right? Pointers much appreciated.

