CUDA out of memory only during validation, not training

Here are a few things to try:

  1. Make sure your model returns only the logits and no extra tensors, since every output is accumulated on the GPU during evaluation.
  2. Set eval_accumulation_steps so the accumulated predictions are moved from the GPU to the CPU at regular intervals (slower, but it avoids this OOM error).
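
A rough sketch of both suggestions, assuming you evaluate with the transformers Trainer. The helper names `keep_only_logits` and `build_trainer`, the output path, and the numeric values are placeholders, not recommendations. `preprocess_logits_for_metrics` is a Trainer argument in recent transformers releases; if your version lacks it, drop the extra outputs in the model's forward pass instead.

```python
def keep_only_logits(logits, labels):
    """Drop extra model outputs before the Trainer accumulates them."""
    # Some models return a tuple such as (logits, hidden_states, ...);
    # keep only the first element so the rest is not stored.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits


def build_trainer(model, eval_dataset):
    # Imported here so the helper above works without transformers installed.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="eval_out",          # hypothetical path
        per_device_eval_batch_size=8,   # assumed value, lower it if memory is tight
        eval_accumulation_steps=16,     # assumed value: move predictions to CPU every 16 eval steps
    )
    return Trainer(
        model=model,
        args=args,
        eval_dataset=eval_dataset,
        preprocess_logits_for_metrics=keep_only_logits,
    )
```

Calling `build_trainer(model, eval_dataset).evaluate()` would then run evaluation with both settings applied. A smaller eval_accumulation_steps lowers peak GPU memory at the cost of more frequent transfers to the CPU.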