Your batch and the optimiser state also need to fit into VRAM, not just the model weights.
Try using FP16 (mixed precision), gradient checkpointing and/or gradient accumulation.
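
To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain Python. It assumes an Adam-style optimiser that keeps two extra values per parameter and FP32 (4-byte) storage for everything; the function names and the 1-billion-parameter model are made up for illustration. It deliberately ignores activations, which is the part that batch size, FP16 and gradient checkpointing mostly affect.

```python
def training_memory_gib(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough VRAM for weights, gradients and optimiser state (activations NOT included)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    # Assumption: Adam-style optimiser, two moment estimates per parameter.
    optimizer_state = num_params * bytes_per_value * optimizer_states_per_param
    return (weights + gradients + optimizer_state) / 1024**3


def accumulation_steps(target_batch_size, micro_batch_size):
    """Micro-batches to accumulate before each optimiser step (ceiling division)."""
    return -(-target_batch_size // micro_batch_size)


# Hypothetical 1B-parameter model: roughly 14.9 GiB before a single activation is stored.
print(f"{training_memory_gib(1_000_000_000):.1f} GiB")

# If only 8 samples fit at once but you want an effective batch of 64:
print(accumulation_steps(64, 8))  # 8
```

So if the weights alone already fill most of the card, there is nothing left for the rest. Gradient accumulation lets you shrink the per-step batch without changing the effective batch size, at the cost of more forward/backward passes per update.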