It makes sense that your training is not faster on g4dn.2xlarge or g4dn.4xlarge, since those instance types also have only a single GPU each. But with p3.2xlarge there should be some difference. What do you see for GPUUtilization and GPUMemoryUtilization in your training job overview?
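If you'd rather pull those metrics programmatically than read them off the console, something like the sketch below should work. It's a minimal example assuming a single-instance training job; the job name is a placeholder, and the `algo-1` host suffix assumes the default naming SageMaker uses for the first instance of a job.

```python
import boto3
from datetime import datetime, timedelta

# Placeholder -- replace with your actual training job name.
job_name = "my-training-job"

cloudwatch = boto3.client("cloudwatch")

for metric in ("GPUUtilization", "GPUMemoryUtilization"):
    # SageMaker publishes per-instance hardware metrics under the
    # /aws/sagemaker/TrainingJobs namespace, keyed by the "Host"
    # dimension (job name plus an instance suffix such as "algo-1").
    resp = cloudwatch.get_metric_statistics(
        Namespace="/aws/sagemaker/TrainingJobs",
        MetricName=metric,
        Dimensions=[{"Name": "Host", "Value": f"{job_name}/algo-1"}],
        StartTime=datetime.utcnow() - timedelta(hours=3),
        EndTime=datetime.utcnow(),
        Period=300,  # 5-minute averages
        Statistics=["Average"],
    )
    # Datapoints come back unordered, so sort by timestamp before printing.
    for p in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(f"{metric} {p['Timestamp']:%H:%M} {p['Average']:.1f}%")
```

If GPUUtilization stays low on the p3.2xlarge, the GPU is likely starved (e.g. by data loading or preprocessing on the CPU), which would explain why the faster GPU doesn't speed up your training.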