CUDA OOM error when using data-distributed mode on AWS p4d.24xlarge instance

In data-distributed mode each GPU holds a full copy of the model, so the batch activations and the optimiser state also need to fit into that GPU's VRAM alongside the weights.

Try mixed-precision (FP16) training, gradient checkpointing, and/or gradient accumulation with a smaller per-GPU batch size, as in the sketch below.
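
A minimal PyTorch sketch combining the three, assuming a hypothetical model split into `encoder`/`head` and a `train_loader` defined elsewhere; under DDP this is the per-rank training step and stays the same:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = MyModel().cuda()                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # FP16 loss scaling
accumulation_steps = 4                          # effective batch = micro-batch * 4

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(train_loader):   # placeholder loader
    inputs, targets = inputs.cuda(), targets.cuda()

    with torch.cuda.amp.autocast():             # run the forward pass in FP16
        # Gradient checkpointing: recompute the encoder's activations during
        # the backward pass instead of keeping them in memory.
        hidden = checkpoint(model.encoder, inputs, use_reentrant=False)
        loss = nn.functional.cross_entropy(model.head(hidden), targets)
        loss = loss / accumulation_steps        # average over accumulated steps

    scaler.scale(loss).backward()               # accumulate scaled gradients

    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                  # unscale grads and update weights
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

The trade-off: FP16 roughly halves activation memory, checkpointing trades memory for extra compute in the backward pass, and accumulation lets you keep a large effective batch while shrinking the per-step footprint.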