Evaluation error: CUDA out of memory

I ran evaluation after resuming from a checkpoint, but I got a CUDA out-of-memory error.
Here is my command line:

  --output_dir=./pretrained-bert-c4-bpe-fp32-workers8 \
  --resume_checkpoint="./pretrained-bert-c4-bpe-fp32-workers8/checkpoint-1000000" \
  --overwrite_output_dir=True \
  --max_steps=1000000 \
  --warmup_steps=10000 \
  --learning_rate=1e-04 \
  --per_device_train_batch_size=8 \
  --per_device_eval_batch_size=8 \
  --gradient_accumulation_steps=1 \
  --logging_steps=100 \
  --save_steps=10000 \
  --save_total_limit=1 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --ignore_data_skip
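
For reference, these flags map to a `TrainingArguments` setup roughly like the following (a sketch of my config in Python; `--resume_checkpoint` is a custom argument of my own script, passed to `trainer.train(resume_from_checkpoint=...)` rather than to `TrainingArguments`):

```python
from transformers import TrainingArguments

# Rough Python equivalent of the CLI flags above (sketch only).
training_args = TrainingArguments(
    output_dir="./pretrained-bert-c4-bpe-fp32-workers8",
    overwrite_output_dir=True,
    max_steps=1_000_000,
    warmup_steps=10_000,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=100,
    save_steps=10_000,
    save_total_limit=1,
    evaluation_strategy="steps",
    eval_steps=1000,
    ignore_data_skip=True,
)
```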

Here is the error log:

  4%|▍         | 3826/86776 [40:46<14:39:15,  1.57it/s]
Traceback (most recent call last):
  File "pretrain_c4_megatron_bpe_moe.py", line 178, in <module>
    train_model()
  File "pretrain_c4_megatron_bpe_moe.py", line 174, in train_model
    trainer.train(resume_from_checkpoint=model_args.resume_checkpoint)
  File "/workspace/transformers/src/transformers/trainer.py", line 1539, in train
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/workspace/transformers/src/transformers/trainer.py", line 2323, in evaluate
    output = eval_loop(
  File "/workspace/transformers/src/transformers/trainer.py", line 2528, in evaluation_loop
    logits = self._nested_gather(logits)
  File "/workspace/transformers/src/transformers/trainer.py", line 2661, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 171, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 171, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 174, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2083, in all_gather
    work = default_pg.allgather([tensor_list], [tensor], opts)
RuntimeError: CUDA out of memory. Tried to allocate 3.73 GiB (GPU 1; 15.78 GiB total capacity; 8.16 GiB already allocated; 797.75 MiB free; 13.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
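
If I read the trace correctly, the OOM happens inside `distributed_concat`, where each rank allocates one receive buffer per process and `all_gather`s the full logit tensor. As far as I can tell, without `--eval_accumulation_steps` the gathered logits also stay on the GPU for the whole eval loop, which would explain the 8.16 GiB already allocated. A minimal sketch of the gather step (shapes are my assumptions from the config, not values from the trace):

```python
import torch
import torch.distributed as dist

# Sketch of trainer_pt_utils.distributed_concat for one eval step.
# Assumes a process group is already initialized (as it is under the
# Trainer's DDP launch). Shapes are assumptions: per_device_eval_batch_size=8,
# seq_len=512, and a ~50k BPE vocab -> roughly 0.8 GiB of fp32 logits per rank.
logits = torch.randn(8, 512, 50_000, device="cuda")

# One receive buffer per rank, so the gather alone needs
# world_size x ~0.8 GiB on top of whatever eval has already accumulated.
output_tensors = [torch.empty_like(logits) for _ in range(dist.get_world_size())]
dist.all_gather(output_tensors, logits)
gathered = torch.cat(output_tensors, dim=0)
```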

I also tried adding --eval_accumulation_steps=1, but then evaluation becomes extremely slow (18.31 s/it vs. 1.57 it/s):

  9%|▉         | 7937/86776 [8:26:04<400:55:34, 18.31s/it]
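
Rather than paying that slowdown, one thing I am considering (assuming my transformers version already ships the `preprocess_logits_for_metrics` hook on `Trainer`) is shrinking the logits right after they are gathered, before the Trainer accumulates them on the GPU across eval steps:

```python
import torch
from transformers import Trainer

# Sketch: collapse [batch, seq_len, vocab] logits to [batch, seq_len] token
# ids, so only ids are accumulated over the ~87k eval steps. Assumes I only
# need argmax predictions for my metrics, not full distributions.
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):  # some models return (logits, ...)
        logits = logits[0]
    return logits.argmax(dim=-1)

# model, training_args, datasets, and compute_metrics stand in for my
# real objects; only the last kwarg is the point of this sketch.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```

That should cut the accumulated eval memory by roughly a factor of vocab_size, at the cost of compute_metrics no longer seeing the full distributions. Is this the recommended way to handle it, or is there a better option?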