Indeed, switching from g4dn.xlarge to g4dn.4xlarge will only speed up training if you have a CPU bottleneck (check the CPU and GPU utilization percentages in CloudWatch).
It's odd that p3.2xlarge doesn't speed things up; I recommend checking CloudWatch to verify that the GPU is actually busy.
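If this is a SageMaker training job, you can pull the GPU utilization programmatically instead of eyeballing the console. Here's a minimal sketch that builds a `get_metric_statistics` query; the namespace, metric name, and `Host` dimension follow the usual SageMaker training-job metrics, but verify them against what you actually see in your CloudWatch console, and the job name is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def gpu_util_query(job_name: str) -> dict:
    """Build get_metric_statistics kwargs for a SageMaker training job.

    Assumes the standard SageMaker training-job metrics namespace and the
    "<job-name>/algo-1" Host dimension; adjust to match your console.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/TrainingJobs",
        "MetricName": "GPUUtilization",
        "Dimensions": [{"Name": "Host", "Value": f"{job_name}/algo-1"}],
        "StartTime": end - timedelta(hours=1),  # last hour of the job
        "EndTime": end,
        "Period": 60,                           # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }

# With AWS credentials configured, you would then run:
# import boto3
# cw = boto3.client("cloudwatch")
# points = cw.get_metric_statistics(**gpu_util_query("my-training-job"))["Datapoints"]
```

If the average GPU utilization stays low on p3.2xlarge, the GPU is being starved and a bigger GPU won't help.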
I also see a couple of for loops in your code: are you sure it's worth running those on a GPU instance? How long do they take compared to the expected training time? If you have long CPU-bound steps, consider running them outside the GPU jobs.
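To answer the "how long is it taking" question concretely, you can time each stage of your script. This is a generic sketch (the stage names and workloads are placeholders for your own loops and training step):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, totals: dict):
    """Accumulate wall-clock time per pipeline stage into `totals`."""
    start = time.perf_counter()
    yield
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

totals = {}
for step in range(100):
    with timed("cpu_loops", totals):
        batch = [x * 2 for x in range(10_000)]   # stand-in for your for loops
    with timed("training_step", totals):
        _ = sum(batch)                           # stand-in for the real GPU step

print({k: round(v, 4) for k, v in totals.items()})
```

If `cpu_loops` dominates `training_step`, that work is better done once, offline, on a cheap CPU instance (e.g. a preprocessing job), with the GPU job reading the already-prepared data.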