Indeed, switching from g4dn.xlarge to g4dn.4xlarge will only speed up training if you have a CPU bottleneck (check the CPU and GPU utilization percentages in CloudWatch).
It's odd that p3.2xlarge doesn't speed things up; I recommend checking CloudWatch to verify that the GPU is actually busy.
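If this is a SageMaker training job, you can pull the GPU utilization programmatically instead of eyeballing the console. Here's a minimal sketch that builds a `get_metric_statistics` query; the namespace, metric name, and `Host` dimension follow the usual SageMaker training-job metrics, but verify them against what you actually see in your CloudWatch console, and the job name is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def gpu_util_query(job_name: str) -> dict:
    """Build get_metric_statistics kwargs for a SageMaker training job.

    Assumes the standard SageMaker training-job metrics namespace and the
    "<job-name>/algo-1" Host dimension; adjust to match your console.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/TrainingJobs",
        "MetricName": "GPUUtilization",
        "Dimensions": [{"Name": "Host", "Value": f"{job_name}/algo-1"}],
        "StartTime": end - timedelta(hours=1),  # last hour of the job
        "EndTime": end,
        "Period": 60,                           # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }

# With AWS credentials configured, you would then run:
# import boto3
# cw = boto3.client("cloudwatch")
# points = cw.get_metric_statistics(**gpu_util_query("my-training-job"))["Datapoints"]
```

If the average GPU utilization stays low on p3.2xlarge, the GPU is being starved and a bigger GPU won't help.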
I also see a couple of for loops in your code: are you sure it's worth running those on a GPU instance? How long do they take compared to the expected training time? If you have long CPU-bound steps, consider running them outside the GPU jobs.
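To answer the "how long is it taking" question concretely, you can time each stage of your script. This is a generic sketch (the stage names and workloads are placeholders for your own loops and training step):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, totals: dict):
    """Accumulate wall-clock time per pipeline stage into `totals`."""
    start = time.perf_counter()
    yield
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

totals = {}
for step in range(100):
    with timed("cpu_loops", totals):
        batch = [x * 2 for x in range(10_000)]   # stand-in for your for loops
    with timed("training_step", totals):
        _ = sum(batch)                           # stand-in for the real GPU step

print({k: round(v, 4) for k, v in totals.items()})
```

If `cpu_loops` dominates `training_step`, that work is better done once, offline, on a cheap CPU instance (e.g. a preprocessing job), with the GPU job reading the already-prepared data.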